CUBRID Post-Processing — Aggregation, Window/Analytic Functions, and Sort vs Hash GROUP BY
Contents:
- Theoretical Background
- Common DBMS Design
- CUBRID’s Approach
- Source Walkthrough
- Cross-check Notes
- Open Questions
- Sources
Theoretical Background
Post-processing is the half of query execution that runs after all input tuples have been produced — by table scans, index scans, joins, or subquery list files. Three textbook problems share the same machinery and the same listfile-and-sort substrate:
- Grouped aggregation — given a stream of input tuples and a set of grouping keys, produce one output tuple per distinct key together with the values of SUM, AVG, MIN, MAX, COUNT, etc. Two strategies exist:
  - Sort-based GROUP BY: sort the input by the grouping key, then make a single linear pass; tuples with the same key are contiguous, so an accumulator can be reset on every key change. Cost is dominated by the sort.
  - Hash-based GROUP BY: maintain an in-memory hash table keyed by the grouping key; each input tuple probes the table and either creates a new accumulator or folds into an existing one. Cost is dominated by hash probes; no sort is needed if the table fits in memory. When it does not, the hash table either spills to disk or degrades to sort-based.

  Database Internals (Petrov, 2019), Chapter 14 (“Query Execution”) frames the trade-off as a function of grouping selectivity (groups / tuples) and memory budget; low selectivity (few distinct keys) favours hash, high selectivity (many distinct keys) favours sort.
- Window / analytic functions — RANK(), ROW_NUMBER(), LAG(), LEAD(), MEDIAN(), PERCENTILE_CONT(), etc., as defined in SQL:2003. Each window function has three sub-clauses: PARTITION BY (groups the rows), ORDER BY (orders within a partition), and an optional frame (ROWS BETWEEN ... AND ...). Unlike aggregates, every input tuple yields an output tuple; the window function value for tuple $t$ depends on $t$‘s frame inside its partition. The textbook execution model is:
  1. Sort the input by (PARTITION BY, ORDER BY).
  2. Walk the sorted stream tracking partition boundaries and sort-key boundaries.
  3. For each tuple, compute the function value over the frame.

  For navigation functions like LAG and offset-into-group functions like NTH_VALUE, this requires random access to other tuples in the same partition — implemented by writing intermediate group/value list files and re-scanning them.
- External sort — when an input list file does not fit in the sort buffer (PRM_ID_SR_NBUFFERS), the sort module degrades to a classic two-phase merge sort: an in-phase that produces sorted runs from buffer-sized chunks, then an ex-phase that does k-way merging of the runs. This is the substrate for both sort-based GROUP BY and the analytic-function sort. Petrov Ch. 14 (sort) and Garcia-Molina/Ullman/Widom (Database Systems: The Complete Book, Ch. 15.4) describe the model; CUBRID’s implementation lives in src/storage/external_sort.c.
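The sort-based strategy can be sketched in a few lines of C. This is an illustrative toy, not CUBRID code — row, cmp_key, and sort_group_by_sum are invented names; it computes SUM(val) GROUP BY key the sort-based way, resetting the accumulator on every key change:

```c
#include <assert.h>
#include <stdlib.h>

/* Toy sort-based GROUP BY: sort by key, then one linear pass in which
 * tuples with equal keys are contiguous.  All names are invented. */
typedef struct { int key; int val; } row;

static int cmp_key (const void *a, const void *b)
{
  return ((const row *) a)->key - ((const row *) b)->key;
}

/* Computes SUM(val) GROUP BY key; writes (key, sum) pairs into out
 * and returns the number of groups. */
static int sort_group_by_sum (row *in, int n, row *out)
{
  qsort (in, n, sizeof (row), cmp_key);
  int groups = 0;
  for (int i = 0; i < n; i++)
    {
      if (groups == 0 || out[groups - 1].key != in[i].key)
        {
          /* key change: start a new accumulator */
          out[groups].key = in[i].key;
          out[groups].val = 0;
          groups++;
        }
      out[groups - 1].val += in[i].val;   /* fold tuple into accumulator */
    }
  return groups;
}
```

The hash-based strategy reaches the same (key, sum) pairs with one hash probe per row and no sort, at the price of keeping every open group's accumulator in memory at once.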
The three problems share three properties: (a) the input is a
list file (QFILE_LIST_ID) — a virtual sequential file backed
by temporary pages; (b) sorting is invoked through a callback
sandwich — sort_listfile (..., get_func, put_func, ...) reads
sort records via get_func and emits sorted records to
put_func; (c) the per-key state (accumulator for aggregation,
window-function state for analytics) is kept in a small in-memory
struct and folded into the put callback so that the sort
itself does not know it is doing aggregation.
This callback shape is what makes CUBRID’s sort-based GROUP BY a
two-line caller (qexec_groupby → sort_listfile) and lets the
hash-based path graft onto the same skeleton: when the hash table
spills, it dumps its key/value pairs into a list file and the
final pass is again a sort_listfile call.
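The callback sandwich can be mimicked in miniature. Everything below is invented for illustration (tiny_sort_listfile stands in for sort_listfile, count_runs for a put callback in the role of qexec_gby_put_next); the point is that the per-key state lives entirely in the put callback's arg, so the sorter never learns it is driving an aggregation:

```c
#include <assert.h>

/* Invented callback types mirroring the get/put sandwich shape. */
typedef int (*get_fn_t) (int *out, void *arg);   /* returns 0 when drained */
typedef void (*put_fn_t) (int rec, void *arg);

static void tiny_sort_listfile (get_fn_t get, void *garg,
                                put_fn_t put, void *parg)
{
  int buf[64], n = 0;
  while (n < 64 && get (&buf[n], garg))   /* "in-phase": read records */
    n++;
  for (int i = 1; i < n; i++)             /* insertion sort stands in for
                                           * the external merge sort */
    {
      int v = buf[i], j = i - 1;
      for (; j >= 0 && buf[j] > v; j--)
        buf[j + 1] = buf[j];
      buf[j + 1] = v;
    }
  for (int i = 0; i < n; i++)
    put (buf[i], parg);                   /* put callback sees sorted order */
}

/* Put callback that counts key changes — GROUP-BY-style state hidden
 * from the sorter inside its arg. */
typedef struct { int last_key, runs; } run_state;
static void count_runs (int rec, void *arg)
{
  run_state *s = (run_state *) arg;
  if (s->runs == 0 || rec != s->last_key)
    s->runs++;                            /* new key: finalize/start group */
  s->last_key = rec;
}

/* Get callback: walks an int array terminated by a negative sentinel. */
static int cursor_get (int *out, void *arg)
{
  int **p = (int **) arg;
  if (**p < 0)
    return 0;
  *out = *(*p)++;
  return 1;
}
```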
Common DBMS Design
Engines diverge most visibly in where the strategy choice is made (planner vs. executor), and in how the spill is handled. Three reference designs frame CUBRID’s choices.
PostgreSQL — nodeAgg.c, nodeWindowAgg.c
PostgreSQL’s executor has two distinct nodes:
- Agg — implemented in src/backend/executor/nodeAgg.c, driven by the AGG_HASHED / AGG_SORTED / AGG_PLAIN / AGG_MIXED strategies chosen by the planner (create_agg_plan). The strategy is a plan-time decision based on enable_hashagg, work_mem, and the estimated number of groups. When a hash agg overflows work_mem, PostgreSQL since v13 spills via grace hashing: it partitions overflow tuples to per-partition tape files and re-aggregates each partition independently. The accumulator (AggStatePerTrans) and the per-group state (AggStatePerGroup) are kept in aggcontext memory.
- WindowAgg — implemented in nodeWindowAgg.c. PostgreSQL sorts input by (PARTITION BY, ORDER BY) upstream (a separate Sort node feeds the WindowAgg) and then walks partitions, computing each function’s value within the frame. Frame logic is generic over RANGE, ROWS, and GROUPS modes.
CUBRID’s executor uses a single BUILDLIST_PROC node for
GROUP BY and a separate analytic loop, but the strategy choice is
runtime — the executor decides at end_one_iteration time
whether to keep aggregating into the hash table or fall back to
sort, based on observed selectivity (more on this below).
MySQL — Item_sum, temporary tables, Aggregator_distinct
MySQL uses Item_sum subclasses (Item_sum_avg, Item_sum_min,
…) for each aggregate, and historically materialised the
GROUP BY through a temporary table keyed on the grouping
expression. Modern MySQL (8.0+) executes GROUP BY with
AggregateIterator streaming over sorted input, or
TemptableAggregateIterator when it materialises groups into a
temporary table; the in-memory temp table still falls back to
on-disk storage when it would exceed
tmp_table_size / max_heap_table_size. Window functions are
handled by WindowIterator over a sorted stream.
The MySQL contract — one accumulator object per aggregate per
group, mutated as tuples flow through — is the same shape as
CUBRID’s AGGREGATE_TYPE chain (g_agg_list).
Oracle / SQL Server — sort/hash GROUP BY with cost-based fallback
Oracle’s optimizer chooses between HASH GROUP BY and SORT GROUP BY based on estimated cardinalities; SQL Server’s Stream Aggregate versus Hash Match (Aggregate) is the analogous
choice. Both vendors spill hash overflow to TempDB-equivalent
storage and continue with a partition-based recombination,
similar to PostgreSQL’s modern HashAgg.
Theory ↔ CUBRID mapping
| Theoretical concept | CUBRID name |
|---|---|
| Per-group accumulator | AGGREGATE_TYPE::accumulator (cubxasl::aggregate_accumulator) |
| Aggregate chain in execution order | BUILDLIST_PROC_NODE::g_agg_list (xasl.h) |
| Output-side aggregate chain (SELECT-list shape) | GROUPBY_STATE::g_output_agg_list |
| Per-group state during sort-based execution | GROUPBY_STATE (query_executor.c:213) |
| Rollup level array | GROUPBY_STATE::g_dim of GROUPBY_DIMENSION |
| Hash GROUP BY context | AGGREGATE_HASH_CONTEXT referenced via GROUPBY_STATE::agg_hash_context |
| Hash table spill list | AGGREGATE_HASH_CONTEXT::part_list_id |
| Hash table memory ceiling | PRM_ID_MAX_AGG_HASH_SIZE (default 2 MB) |
| Hash strategy degrade threshold | HASH_AGGREGATE_VH_SELECTIVITY_TUPLE_THRESHOLD (2000) + HASH_AGGREGATE_VH_SELECTIVITY_THRESHOLD (0.5) |
| Sort callback contract | sort_listfile (get_fn, put_fn, cmp_fn, key_info, ...) (external_sort.c:1347) |
| Sort-based GROUP BY put callback | qexec_gby_put_next |
| Hash-based GROUP BY put callback | qexec_hash_gby_put_next |
| Window-function per-function state | ANALYTIC_FUNCTION_STATE (query_executor.c:259) |
| Window-function execution state | ANALYTIC_STATE (query_executor.c:288) |
| Per-group tuple count (for windows) | ANALYTIC_FUNCTION_STATE::curr_group_tuple_count |
| Within-sort-key tuple count (for windows) | ANALYTIC_FUNCTION_STATE::curr_sort_key_tuple_count |
| Group-header sidecar list | ANALYTIC_FUNCTION_STATE::group_list_id |
| Sort-key sidecar list | ANALYTIC_FUNCTION_STATE::value_list_id |
| Window-function output reassembly pass | qexec_analytic_update_group_result |
CUBRID’s Approach
The qexec_execute_mainblock_internal driver, after producing
the unsorted input list file, dispatches into post-processing
through three entry points:
- qexec_groupby — for BUILDLIST_PROC XASL nodes that carry a groupby_list. This is the GROUP BY (with optional WITH ROLLUP) lane.
- qexec_execute_analytic — for XASL nodes carrying an ANALYTIC_EVAL_TYPE chain. Called once per distinct sort list required by the chain; that is, multiple analytic functions sharing the same (PARTITION BY, ORDER BY) are grouped at parse time (pt_optimize_analytic_list) so they ride one sort.
- qexec_orderby_distinct — for ORDER BY and DISTINCT, not covered in this document; it is the same sort_listfile substrate without the per-key accumulator.
```mermaid
flowchart LR
  subgraph PROC["Processing (input feed)"]
    SCAN["Scan / join / subquery"]
    LIST["xasl->list_id\n(input list file)"]
    HASH["AGGREGATE_HASH_CONTEXT\n(in-memory hash table)"]
    PART["part_list_id\n(spilled hash entries)"]
  end
  subgraph POST["Post-Processing"]
    GBY["qexec_groupby\n(qexec_execute_mainblock_internal)"]
    ANA["qexec_execute_analytic"]
    SLF1["sort_listfile\n(GBY: get=qexec_gby_get_next,\n put=qexec_gby_put_next)"]
    SLF2["sort_listfile\n(HGBY: put=qexec_hash_gby_put_next)"]
    SLF3["sort_listfile\n(ANA: put=qexec_analytic_put_next)"]
    OUT["output_file\n(QFILE_LIST_ID)"]
  end
  SCAN --> LIST
  SCAN -. hash agg eligible .-> HASH
  HASH -. LRU spill .-> PART
  HASH -. very-high selectivity .-> PART
  LIST --> GBY
  PART --> GBY
  HASH --> GBY
  GBY -- sort path --> SLF1 --> OUT
  GBY -- hash path --> SLF2 --> OUT
  LIST --> ANA --> SLF3 --> OUT
```
The three sub-topics — aggregation, analytic, and the sort/hash
choice — share the GROUPBY state struct, the
AGGREGATE_HASH_CONTEXT, and the sort_listfile substrate, but
each has its own put-callback and its own state machine. We walk
them in turn.
Aggregation / GROUP BY
The GROUP BY lane has two state objects: GROUPBY_STATE
(per-query, holds the sort/hash strategy choice and the rollup
dimension array) and GROUPBY_DIMENSION (one slot per rollup
level, each carrying its own d_agg_list for accumulators).
```c
// GROUPBY_STATE — src/query/query_executor.c (excerpt)
typedef struct groupby_state GROUPBY_STATE;
struct groupby_state
{
  int state;
  SORTKEY_INFO key_info;
  QFILE_LIST_SCAN_ID *input_scan;
  QFILE_LIST_ID *output_file;

  PRED_EXPR *having_pred;
  AGGREGATE_TYPE *g_output_agg_list;    /* SELECT-list shape */
  REGU_VARIABLE_LIST g_regu_list;       /* fetch into db_value */
  REGU_VARIABLE_LIST g_hk_regu_list;    /* hash-key reguvars */
  VAL_LIST *g_val_list;
  OUTPTR_LIST *g_outptr_list;
  XASL_NODE *xasl;
  XASL_STATE *xasl_state;

  RECDES current_key;                   /* last seen group key */
  RECDES gby_rec;
  QFILE_TUPLE_RECORD input_tpl;
  int input_recs;

  bool with_rollup;
  GROUPBY_DIMENSION *g_dim;             /* rollup levels */
  int g_dim_levels;

  int hash_eligible;
  AGGREGATE_HASH_CONTEXT *agg_hash_context;
  // ... condensed ...
};
```

The XASL generator pre-builds five lists on the
BUILDLIST_PROC_NODE that this state binds to: g_outptr_list
(SELECT-list shape, what the user sees), g_val_list (non-aggregate
columns and aggregate arguments, in GROUP BY order with
SELECT-only columns appended at the tail), g_regu_list (reguvars
that fetch tuple positions into those db_values), g_agg_list
(one AGGREGATE_TYPE node per aggregate function, holding both
accumulator.value and the function’s argument), and
g_hk_sort_regu_list (hash-key reguvars, only present when
hash_eligible == true). The deck (Sort, Hash Group By.pptx,
slides 2-7) emphasises the GROUP-BY-order vs. SELECT-list-order
distinction; this matters because the input tuples written by
processing follow g_outptr_list’s GROUP BY order, so the sort
key reads cleanly off the head of each tuple.
The ordering of the sub-lists is deliberate. The deck’s example
SELECT c1, c2, avg(c1) FROM t1 GROUP BY c2, c1 produces an
input list file shaped (c2, c1, c1_arg) even though the
SELECT-list reads (c1, c2, avg(c1)). The g_outptr_list
preserves SELECT-list shape for output composition, while the
input list file is in GROUP BY order so that the sort key prefix
of every tuple is exactly the grouping vector.
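The split can be mimicked with a toy output pointer list for the deck's query. All names below (tuple, outptr, compose_output) are invented for illustration, not CUBRID's: the input tuple is in GROUP BY order (c2, c1, c1_arg) and the pointer list reassembles SELECT order (c1, c2, avg):

```c
#include <assert.h>

/* Input tuple in GROUP BY order for
 * SELECT c1, c2, avg(c1) FROM t1 GROUP BY c2, c1:
 * vals[0] = c2, vals[1] = c1, vals[2] = c1_arg.  Invented for the
 * example. */
typedef struct { double vals[3]; } tuple;

/* Toy output pointer list: for each SELECT column, the input tuple
 * position, or -1 meaning "take the finalized aggregate value". */
static const int outptr[3] = { 1 /* c1 */, 0 /* c2 */, -1 /* avg(c1) */ };

static void compose_output (const tuple *in, double agg_value, double *out)
{
  for (int i = 0; i < 3; i++)
    out[i] = (outptr[i] < 0) ? agg_value : in->vals[outptr[i]];
}
```

The sort key is simply the prefix (vals[0], vals[1]) of every input tuple, which is the point of the GROUP-BY-order layout.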
Sort-based pipeline
qexec_groupby is the entry; it allocates a GROUPBY_STATE,
opens an output list file, and hands the input list file to
sort_listfile with two callbacks:
```c
// qexec_groupby — src/query/query_executor.c (condensed)
qexec_groupby (THREAD_ENTRY * thread_p, XASL_NODE * xasl, ...)
{
  // ... condensed: late binding, grbynum init ...
  if (qexec_initialize_groupby_state (&gbstate, buildlist->groupby_list,
                                      ...) == NULL)
    GOTO_EXIT_ON_ERROR;

  /* open output list file */
  gbstate.output_file = qfile_open_list (thread_p, &output_type_list, ...);

  /* quick-finalize escapes for empty input or pure hash-agg ... */

  /* sort + aggregate: callback sandwich */
  if (sort_listfile (thread_p, NULL_VOLID, estimated_pages,
                     &qexec_gby_get_next, &gbstate, &qexec_gby_put_next,
                     &gbstate, gbstate.cmp_fn, &gbstate.key_info, SORT_DUP,
                     NO_SORT_LIMIT, ...) != NO_ERROR)
    GOTO_EXIT_ON_ERROR;
  // ... condensed ...
}
```

qexec_gby_get_next is trivial: it pulls the next tuple from
the input list scan and turns it into a SORT_REC:
```c
// qexec_gby_get_next — src/query/query_executor.c
static SORT_STATUS
qexec_gby_get_next (THREAD_ENTRY * thread_p, RECDES * recdes, void *arg)
{
  GROUPBY_STATE *gbstate = (GROUPBY_STATE *) arg;
  return qfile_make_sort_key (thread_p, &gbstate->key_info, recdes,
                              gbstate->input_scan, &gbstate->input_tpl);
}
```

qexec_gby_put_next is where aggregation happens. The sort
module calls it once per sorted record (or once per chain of
duplicates when SORT_DUP is set); the callback compares the
record’s key against gbstate->current_key, finalises the prior
group when they differ, starts a new group, and folds the new
tuple into the accumulator chain. The state-machine sketch:
```mermaid
stateDiagram-v2
  [*] --> Initial : sort begins
  Initial --> StartGroup : first sort_record\n(qexec_gby_start_group_dim)
  StartGroup --> Accumulating : qexec_gby_agg_tuple
  Accumulating --> Accumulating : same key\n(qexec_gby_agg_tuple)
  Accumulating --> Finalize : new key\n(qexec_gby_finalize_group_dim)
  Finalize --> StartGroup : qexec_gby_start_group(0)
  Accumulating --> EndOfInput : sort drains
  EndOfInput --> Finalize : final group
  Finalize --> [*]
```
The actual key comparison and dimension-walking lives inside
qexec_gby_finalize_group_dim, which is the rollup-aware
finalize path:
```c
// qexec_gby_finalize_group_dim — src/query/query_executor.c (condensed)
static int
qexec_gby_finalize_group_dim (THREAD_ENTRY * thread_p,
                              GROUPBY_STATE * gbstate, const RECDES * recdes)
{
  int i, j, nkeys, level = 0;

  qexec_gby_finalize_group (thread_p, gbstate, 0, gbstate->with_rollup);
  if (gbstate->with_rollup && recdes)
    {
      SORT_REC *key = (SORT_REC *) recdes->data;
      level = gbstate->g_dim_levels;
      nkeys = gbstate->key_info.nkeys;

      /* find the first key prefix that differs and finalize all
       * dimensions above that level */
      for (i = 1; i < nkeys; i++)
        {
          gbstate->key_info.nkeys = i;
          if ((*gbstate->cmp_fn) (&gbstate->current_key.data, &key,
                                  &gbstate->key_info) != 0)
            {
              for (j = gbstate->g_dim_levels - 1; j > i; j--)
                {
                  qexec_gby_finalize_group (thread_p, gbstate, j, true);
                  qexec_gby_start_group (thread_p, gbstate, NULL, j);
                }
              level = i + 1;
              break;
            }
        }
      gbstate->key_info.nkeys = nkeys;
    }

  qexec_gby_start_group (thread_p, gbstate, recdes, 0);
  return level;
}
```

Two non-obvious facts. First, the rollup loop finalizes the
deepest dimension first and walks up: if the prefix c1 is
unchanged but (c1,c2) differs, the dimensions tracking
(c1,c2,c3) and (c1,c2) are finalized while the c1-only and
total-aggregate dimensions stay open. The deck’s slide on
GROUPBY_DIMENSION (Sort, Hash Group By.pptx, slide 24) calls
this out as a distinctive point — g_dim[0] is the most-detailed level
(all GROUP BY columns), g_dim[g_dim_levels-1] is the total
aggregate, and the intermediate slots are the rollup prefixes.
Second, the sort module’s nkeys field is mutated during the
comparison loop to compare prefix-by-prefix; the original value
is saved and restored to keep the outer sort consistent.
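The prefix walk reduces to a small piece of arithmetic. The helper below is invented for illustration (CUBRID does this inline with the sort comparator): compare old and new keys column by column; the index of the first differing column determines how many of the deepest dimensions must be finalized:

```c
#include <assert.h>

/* Toy rollup bookkeeping.  With nkeys grouping columns there are
 * nkeys + 1 dimensions: g_dim[0] = all columns, ...,
 * g_dim[nkeys] = total aggregate.  Returns how many of the deepest
 * dimensions are finalized on this key transition.  Invented name. */
static int rollup_dims_to_finalize (const int *old_key, const int *new_key,
                                    int nkeys)
{
  int first_diff = nkeys;          /* index of first differing column */
  for (int i = 0; i < nkeys; i++)
    if (old_key[i] != new_key[i])
      {
        first_diff = i;
        break;
      }
  /* every dimension whose key prefix includes column first_diff
   * changes; shallower prefixes and the total stay open */
  return nkeys - first_diff;
}
```

For GROUP BY c1, c2, c3 moving from (1,1,3) to (1,2,1), the first difference is at column index 1, so two dimensions — (c1,c2,c3) and (c1,c2) — are finalized while (c1) and the total stay open.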
Each qexec_gby_finalize_group call dumps the dimension’s
accumulators into gbstate->output_file:
```c
// qexec_gby_finalize_group — src/query/query_executor.c (condensed)
static void
qexec_gby_finalize_group (THREAD_ENTRY * thread_p, GROUPBY_STATE * gbstate,
                          int N, bool keep_list_file)
{
  // ... condensed ...
  qdata_finalize_aggregate_list (thread_p, gbstate->g_dim[N].d_agg_list,
                                 keep_list_file, NULL);

  /* move dimension's accumulators into g_output_agg_list */
  AGGREGATE_TYPE *g_outp = gbstate->g_output_agg_list;
  AGGREGATE_TYPE *d_aggp = gbstate->g_dim[N].d_agg_list;
  while (g_outp != NULL && d_aggp != NULL)
    {
      if (g_outp->function != PT_GROUPBY_NUM
          && d_aggp->accumulator.value != NULL
          && g_outp->accumulator.value != NULL)
        {
          pr_clear_value (g_outp->accumulator.value);
          *g_outp->accumulator.value = *d_aggp->accumulator.value;
          PRIM_SET_NULL (d_aggp->accumulator.value);
        }
      g_outp = g_outp->next;
      d_aggp = d_aggp->next;
    }

  /* HAVING + GROUP_BY_NUM evaluation */
  if (gbstate->having_pred
      && eval_pred (thread_p, gbstate->having_pred, ...) != V_TRUE)
    goto wrapup;

  /* compose output tuple via output pointer list and append */
  qexec_generate_tuple_descriptor (thread_p, gbstate->output_file,
                                   gbstate->g_outptr_list, &xasl_state->vd);
  qfile_generate_tuple_into_list (thread_p, gbstate->output_file, T_NORMAL);
  // ... condensed ...
}
```

The accumulator-to-output value copy uses an aliasing shortcut
— *g_outp->accumulator.value = *d_aggp->accumulator.value — and
then immediately blanks the dimension accumulator with
PRIM_SET_NULL. This is intentional: the SELECT-list-shaped
g_output_agg_list shares its accumulator.value db_values with
g_outptr_list’s reguvars, so the tuple-generation pass picks up
the just-finalized aggregate by virtue of the shared pointer.
The deck (slide 16, “g_agg_list->accumulator.value is shared with
g_output_list, so …”) makes this sharing explicit.
The accumulator-update step itself runs inside qexec_gby_put_next’s
non-finalize branch via qexec_gby_agg_tuple:
```c
// qexec_gby_agg_tuple — src/query/query_executor.c
static void
qexec_gby_agg_tuple (THREAD_ENTRY * thread_p, GROUPBY_STATE * gbstate,
                     QFILE_TUPLE tpl, int peek)
{
  if (fetch_val_list (thread_p, gbstate->g_regu_list,
                      &gbstate->xasl_state->vd, NULL, NULL, tpl,
                      peek) != NO_ERROR)
    GOTO_EXIT_ON_ERROR;

  for (int i = 0; i < gbstate->g_dim_levels; i++)
    {
      if (qdata_evaluate_aggregate_list (thread_p,
                                         gbstate->g_dim[i].d_agg_list,
                                         &gbstate->xasl_state->vd, NULL,
                                         false) != NO_ERROR)
        GOTO_EXIT_ON_ERROR;
    }
}
```

The inner loop walks every rollup dimension and folds the
same input tuple into each — the most-detailed dimension and
each rollup prefix. qdata_evaluate_aggregate_list (in
query_aggregate.cpp) is the per-aggregate-function dispatch,
with a separate inner switch per PT_MIN, PT_MAX, PT_SUM,
PT_AVG, PT_COUNT, etc. The deck (page 9) summarises the
behaviours: PT_MIN/PT_MAX overwrite when the new value
beats the running extremum; PT_AVG/PT_SUM initialise on the
first non-NULL value and add on subsequent ones; PT_COUNT
increments unconditionally for COUNT(*) and only on non-NULL
for COUNT(col).
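A reduced sketch of that per-function dispatch, with an invented accumulator struct covering just MIN, SUM, and COUNT over nullable values — enough to show the three behaviours the deck lists:

```c
#include <assert.h>

/* Invented accumulator for the example; CUBRID's real dispatch is the
 * switch inside qdata_evaluate_aggregate_list. */
typedef enum { F_MIN, F_SUM, F_COUNT } agg_fn;
typedef struct { agg_fn fn; int seen; long value; } accumulator;

static void acc_fold (accumulator *a, int is_null, long v)
{
  switch (a->fn)
    {
    case F_COUNT:
      if (!is_null)                 /* COUNT(col): non-NULL rows only */
        a->value++;
      break;
    case F_MIN:
      if (!is_null && (!a->seen || v < a->value))
        a->value = v;               /* overwrite when v beats the extremum */
      break;
    case F_SUM:
      if (!is_null)
        a->value = a->seen ? a->value + v : v;  /* init on first non-NULL */
      break;
    }
  if (!is_null)
    a->seen = 1;
}
```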
Group key transition: qexec_gby_start_group
When a new key arrives, qexec_gby_start_group records it as
gbstate->current_key and re-initialises the dimension’s
aggregate list:
```c
// qexec_gby_start_group — src/query/query_executor.c (condensed)
static void
qexec_gby_start_group (THREAD_ENTRY * thread_p, GROUPBY_STATE * gbstate,
                       const RECDES * recdes, int N)
{
  if (N == 0 && recdes)
    {
      /* save key for later equality tests */
      memcpy (gbstate->current_key.data, recdes->data, recdes->length);
      gbstate->current_key.length = recdes->length;
    }
  /* zero the accumulators of dimension N */
  QEXEC_CLEAR_AGG_LIST_VALUE (gbstate->g_dim[N].d_agg_list);
  qdata_initialize_aggregate_list (thread_p, gbstate->g_dim[N].d_agg_list,
                                   gbstate->xasl_state->query_id);
}
```

Only N == 0 (the most-detailed dimension) saves the key; the
rollup dimensions piggyback on the same key but with truncated
prefixes computed by the comparator.
Hash-based pipeline
When hash_eligible == true, the executor builds an
AGGREGATE_HASH_CONTEXT during processing (i.e., as input
tuples arrive at end_one_iteration) and routes them through
qexec_hash_gby_agg_tuple before they reach xasl->list_id.
The post-processing is what cleans up the hash table:
```c
// qexec_hash_gby_agg_tuple — src/query/query_executor.c (condensed)
static int
qexec_hash_gby_agg_tuple (THREAD_ENTRY * thread_p, XASL_NODE * xasl,
                          XASL_STATE * xasl_state, BUILDLIST_PROC_NODE * proc,
                          QFILE_TUPLE_RECORD * tplrec,
                          QFILE_TUPLE_DESCRIPTOR * tpldesc,
                          QFILE_LIST_ID * groupby_list, bool * output_tuple)
{
  AGGREGATE_HASH_CONTEXT *context = proc->agg_hash_context;
  AGGREGATE_HASH_KEY *key = context->temp_key;
  AGGREGATE_HASH_VALUE *value;

  if (context->state == HS_REJECT_ALL)
    return NO_ERROR;            /* sort fallback engaged */

  qexec_build_agg_hkey (thread_p, xasl_state, proc->g_hk_scan_regu_list,
                        NULL, key);
  value = (AGGREGATE_HASH_VALUE *) mht_get (context->hash_table, (void *) key);

  if (value == NULL)
    {
      /* new group: copy key, allocate accumulators, save first tuple */
      AGGREGATE_HASH_KEY *new_key = qdata_copy_agg_hkey (thread_p, key);
      AGGREGATE_HASH_VALUE *new_value =
        qdata_alloc_agg_hvalue (thread_p, proc->g_func_count,
                                proc->g_agg_list);
      // ... condensed: stash first tuple into new_value->first_tuple ...
      mht_put (context->hash_table, new_key, new_value);
      context->tuple_count++;
      context->group_count++;
      context->hash_size += qdata_get_agg_hkey_size (new_key);
      context->hash_size += qdata_get_agg_hvalue_size (new_value, false);
    }
  else
    {
      /* existing group: fold accumulators */
      *output_tuple = false;
      value->tuple_count++;
      context->tuple_count++;
      qdata_evaluate_aggregate_list (thread_p, proc->g_agg_list,
                                     &xasl_state->vd, value->accumulators,
                                     false);
      context->hash_size += qdata_get_agg_hvalue_size (value, true);
    }

  /* keep hash table within memory limit (LRU spill) */
  while (context->hash_size > (int) mem_limit)
    {
      HENTRY_PTR hentry = context->hash_table->lru_head;
      key = (AGGREGATE_HASH_KEY *) hentry->key;
      value = (AGGREGATE_HASH_VALUE *) hentry->data;
      qdata_save_agg_hentry_to_list (thread_p, key, value,
                                     context->temp_dbval_array,
                                     context->part_list_id);
      if (value->first_tuple.tpl != NULL)
        qfile_add_tuple_to_list (thread_p, groupby_list,
                                 value->first_tuple.tpl);
      context->hash_size -= qdata_get_agg_hkey_size (key);
      context->hash_size -= qdata_get_agg_hvalue_size (value, false);
      mht_rem (context->hash_table, key, qdata_free_agg_hentry, NULL);
    }

  /* very-high selectivity → bail to sort path */
  if (context->tuple_count > HASH_AGGREGATE_VH_SELECTIVITY_TUPLE_THRESHOLD)
    {
      float selectivity = (float) context->group_count / context->tuple_count;
      if (selectivity > HASH_AGGREGATE_VH_SELECTIVITY_THRESHOLD)
        {
          context->state = HS_REJECT_ALL;
          qdata_save_agg_htable_to_list (thread_p, context->hash_table,
                                         groupby_list, context->part_list_id,
                                         context->temp_dbval_array);
        }
    }
  // ... condensed ...
}
```

This is the function that materialises the deck’s two
“key-duplication” cases (slides 18-20). When the hash table
exceeds prm_get_bigint_value (PRM_ID_MAX_AGG_HASH_SIZE) (default
2 MB), an LRU entry is spilled to part_list_id and removed;
nothing prevents the same key from being re-inserted later, so
part_list_id may end up with multiple entries per key, each
already partially aggregated. When the selectivity (groups
divided by tuples) exceeds 0.5 over more than 2000 input tuples,
the executor concludes the hash strategy is wasting memory on
near-unique keys, sets state = HS_REJECT_ALL, dumps the entire
table to part_list_id, and from that point forward writes new
tuples directly to xasl->list_id for sort-based aggregation.
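The bail-out check is just a ratio against two constants. should_reject_hash is an invented helper that mirrors the thresholds named above:

```c
#include <assert.h>

/* Constants mirror HASH_AGGREGATE_VH_SELECTIVITY_TUPLE_THRESHOLD and
 * HASH_AGGREGATE_VH_SELECTIVITY_THRESHOLD; the helper itself is
 * invented for illustration. */
#define VH_TUPLE_THRESHOLD 2000
#define VH_SELECTIVITY_THRESHOLD 0.5f

static int should_reject_hash (int tuple_count, int group_count)
{
  if (tuple_count <= VH_TUPLE_THRESHOLD)
    return 0;                   /* not enough evidence yet */
  float selectivity = (float) group_count / (float) tuple_count;
  return selectivity > VH_SELECTIVITY_THRESHOLD;
}
```

So 2,000 groups over 3,000 tuples (selectivity 0.67) rejects the hash path, while 1,200 groups over the same input (0.4) keeps it.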
Post-processing for the hash path then runs sort_listfile
twice — once over part_list_id (the hash overflow), then once
over xasl->list_id (the sort fallback). The first sort uses
qexec_hash_gby_put_next as the put callback to merge
accumulators within sorted runs:
```c
// qexec_hash_gby_put_next — src/query/query_executor.c (condensed)
static int
qexec_hash_gby_put_next (THREAD_ENTRY * thread_p, const RECDES * recdes,
                         void *arg)
{
  GROUPBY_STATE *state = (GROUPBY_STATE *) arg;
  AGGREGATE_HASH_CONTEXT *context = state->agg_hash_context;

  for (SORT_REC *key = (SORT_REC *) recdes->data; key; key = key->next)
    {
      // ... condensed: load (key, value) from sorted record ...
      qdata_load_agg_hentry_from_tuple (thread_p, data,
                                        context->temp_part_key,
                                        context->temp_part_value, ...);

      if (qdata_agg_hkey_eq (context->curr_part_key, context->temp_part_key)
          && context->sorted_count > 0)
        {
          /* same key: fuse accumulators */
          AGGREGATE_TYPE *agg_list = state->g_output_agg_list;
          int i = 0;
          while (agg_list != NULL)
            {
              qdata_aggregate_accumulator_to_accumulator (
                thread_p, &context->curr_part_value->accumulators[i],
                &agg_list->accumulator_domain, agg_list->function,
                agg_list->domain, &context->temp_part_value->accumulators[i]);
              agg_list = agg_list->next;
              i++;
            }
        }
      else
        {
          /* different key: dump curr to sorted_part_list_id, swap */
          if (context->sorted_count > 0)
            qdata_save_agg_hentry_to_list (thread_p, context->curr_part_key,
                                           context->curr_part_value,
                                           context->temp_dbval_array,
                                           context->sorted_part_list_id);
          /* swap (curr, temp) */
          // ... condensed ...
        }
      context->sorted_count++;
    }
  return NO_ERROR;
}
```

The fusion here is qdata_aggregate_accumulator_to_accumulator
— two pre-aggregated values (e.g., two partial sums for the same
key) being combined. This is the hash-spill counterpart of
qdata_evaluate_aggregate_list, which combines an accumulator
with a raw tuple value.
After this first pass, sorted_part_list_id holds one tuple per
distinct key with the merged accumulators. The second pass over
xasl->list_id reuses the sort-based qexec_gby_put_next
machinery — the only addition is a cross-reference into
sorted_part_list_id via qexec_locate_agg_hentry_in_list to
pick up any residual partial aggregates that a key already has
on the hash side.
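The distinction between the two combine primitives is clearest for AVG, whose partial state is a (sum, count) pair: merging two partials adds component-wise, and the division is deferred to finalize. The structs below are illustrative, not CUBRID's:

```c
#include <assert.h>

/* Invented partial-aggregate state for AVG. */
typedef struct { double sum; long count; } avg_acc;

/* accumulator-to-accumulator: fuse two pre-aggregated partials for
 * the same key, without ever seeing raw tuples */
static void avg_merge (avg_acc *dst, const avg_acc *src)
{
  dst->sum += src->sum;         /* partial sums add */
  dst->count += src->count;     /* partial counts add */
}

/* finalize: the division happens once, at output time */
static double avg_finalize (const avg_acc *a)
{
  return a->count ? a->sum / (double) a->count : 0.0;
}
```

Folding a raw tuple into an accumulator (the qdata_evaluate_aggregate_list path) is just the degenerate merge of a partial with count 1.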
WITH ROLLUP — multi-dimension finalize
GROUP BY ... WITH ROLLUP adds g_dim_levels = nkeys + 1
slots. Each input tuple, in qexec_gby_agg_tuple, is folded
into every dimension. At a key boundary,
qexec_gby_finalize_group_dim walks dimensions deepest-first and
finalizes only those whose key prefix changed. The deck’s worked
example for GROUP BY c1, c2, c3 WITH ROLLUP (page 19-21) shows
how moving from (1,1,3) to (1,2,1) finalizes the deepest two
dimensions but keeps the (c1) and total dimensions running.
A subtle correctness issue surfaces in CBRD-25611 (deck slide
29): when the SELECT-list contains an expression ceil(c1) and
that expression lands in a position before c1 itself in the
GROUP BY order, the rollup finalize that NULLs out the c1
column also propagates that NULL through the expression. The
fix has to thread the original (non-NULL) value through to the
expression evaluator independently of the dimension flag.
Analytic / Window Functions
Window functions are structurally similar to aggregation — both
sort by a key and walk the sorted stream — but differ in two
ways: every input tuple is also an output tuple, and several
window functions need random access to other tuples in the
same partition (for LAG, LEAD, NTH_VALUE, MEDIAN,
PERCENTILE_CONT). CUBRID handles both by writing two sidecar
list files per analytic function during the first pass and
re-scanning them during a second qexec_analytic_update_group_result
pass.
State objects
```c
// ANALYTIC_FUNCTION_STATE — src/query/query_executor.c (excerpt)
struct analytic_function_state
{
  ANALYTIC_TYPE *func_p;
  RECDES current_key;

  QFILE_LIST_ID *group_list_id;   /* file containing group headers */
  QFILE_LIST_ID *value_list_id;   /* file containing group values */
  QFILE_LIST_SCAN_ID group_scan_id;
  QFILE_LIST_SCAN_ID value_scan_id;
  QFILE_TUPLE_RECORD group_tplrec;
  QFILE_TUPLE_RECORD value_tplrec;

  DB_VALUE cgtc_dbval, cgtc_nn_dbval, csktc_dbval;
  int curr_group_tuple_count;     /* tuples in current group */
  int curr_group_tuple_count_nn;  /* tuples in current group, non-NULL */
  int curr_sort_key_tuple_count;  /* tuples sharing current sort key */

  int group_tuple_position;       /* scan position in current group */
  int group_tuple_position_nn;    /* same, ignoring NULL values */
  int sort_key_tuple_position;    /* scan position in current sort key */
  int group_consumed_tuples;      /* tuples already emitted */
};
```

The two sidecar list files are the heart of the design.
group_list_id carries one (count, count_nn) tuple per partition;
value_list_id carries one (sort_key_count, value) tuple per
distinct sort key within a partition. The first pass produces
these two files and a third — analytic_state->interm_file,
which is the input list file with hidden columns added so the
re-scan can identify partition and sort-key boundaries by
position alone. The second pass re-reads interm_file linearly,
positions cursors in group_list_id and value_list_id, and
emits one output tuple per input tuple.
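What those per-partition and per-sort-key boundaries enable can be sketched with a toy RANK() over rows already sorted by (partition, sort_key). rank_over is invented for the example; it tracks the boundaries positionally, the way the second pass walks interm_file:

```c
#include <assert.h>

/* Row already sorted by (part, key) — invented for the example. */
typedef struct { int part; int key; } srow;

/* Computes RANK() OVER (PARTITION BY part ORDER BY key) for each row
 * in one linear pass, writing the result into rank_out. */
static void rank_over (const srow *r, int n, int *rank_out)
{
  int part_start = 0;
  for (int i = 0; i < n; i++)
    {
      if (i > 0 && r[i].part != r[i - 1].part)
        part_start = i;                     /* partition boundary */
      if (i == part_start || r[i].key != r[i - 1].key)
        rank_out[i] = i - part_start + 1;   /* new sort key: rank jumps */
      else
        rank_out[i] = rank_out[i - 1];      /* tie: keep the rank */
    }
}
```

Functions such as NTH_VALUE or MEDIAN additionally need the total partition size before the partition has been fully scanned — which is exactly why CUBRID parks those counts in the sidecar files and resolves them in the second pass.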
qexec_execute_analytic — outer loop
```c
// qexec_execute_analytic — src/query/query_executor.c (condensed)
qexec_execute_analytic (THREAD_ENTRY * thread_p, XASL_NODE * xasl,
                        XASL_STATE * xasl_state,
                        ANALYTIC_EVAL_TYPE * analytic_eval, ...)
{
  // ... condensed: setup state, regulist domains, intermediate file ...
  qexec_initialize_analytic_state (thread_p, &analytic_state,
                                   analytic_eval->head, ...);

  /* open intermediate and output files; intermediate carries
   * the input tuple shape plus hidden cols for partition/order tracking */
  if (is_skip_sort)
    analytic_state.interm_file = list_id;
  else
    analytic_state.interm_file =
      qfile_open_list (thread_p, &interm_type_list, NULL,
                       xasl_state->query_id, ls_flag, NULL);

  /* open input scan + intermediate scan */
  qfile_open_list_scan (list_id, &input_scan_id);
  qfile_open_list_scan (analytic_state.interm_file, &interm_scan_id);

  if (is_skip_sort)
    {
      /* sort can be skipped if input arrived already sorted (planner) */
      qexec_analytic_update_group_result (thread_p, &analytic_state);
      goto wrapup;
    }

  /* sort + first-pass: writes interm_file, group_list_id, value_list_id */
  estimated_pages =
    qfile_get_estimated_pages_for_sorting (list_id,
                                           &analytic_state.key_info);
  analytic_state.key_info.use_original = 1;
  analytic_state.cmp_fn = &qfile_compare_partial_sort_record;

  if (sort_listfile (thread_p, NULL_VOLID, estimated_pages,
                     &qexec_analytic_get_next, &analytic_state,
                     &qexec_analytic_put_next, &analytic_state,
                     analytic_state.cmp_fn, &analytic_state.key_info,
                     SORT_DUP, NO_SORT_LIMIT, ..., SORT_ANALYTIC) != NO_ERROR)
    GOTO_EXIT_ON_ERROR;

  /* finalize trailing groups (sort_listfile has no terminate hook) */
  if (analytic_state.input_recs != 0)
    {
      for (i = 0; i < analytic_state.func_count; i++)
        qexec_analytic_finalize_group (thread_p, analytic_state.xasl_state,
                                       &analytic_state.func_state_list[i],
                                       false);

      /* second pass: re-scan interm_file, position sidecars, emit output */
      qexec_analytic_update_group_result (thread_p, &analytic_state);
    }
  // ... condensed ...
}
```

The driver is called once per distinct sort list. The query
parser groups window functions that share the same
(PARTITION BY, ORDER BY) into one ANALYTIC_EVAL_TYPE, so the
upstream parse-tree pass pt_optimize_analytic_list minimises
the number of sort passes. The deck’s example —
max(c3) over (partition by c1 order by c2),
rank() over (order by c1), min(c3) over (order by c2) — is
two sort passes: one keyed on (c1,c2) for the first two
functions, one keyed on c2 for the third.
First pass: qexec_analytic_put_next
Section titled “First pass: qexec_analytic_put_next”flowchart TD
A["sort_record arrives in qexec_analytic_put_next"]
B{"current_key is NULL?\n(first record)"}
C["start_group(reinit=true)\nfor every function"]
D{"key matches current_key\n(full sort_list_size)?"}
E["KEEP_RANK; continue"]
F{"sort_prefix_size == 0?\n(no PARTITION BY)"}
G{"key matches current_key\n(prefix only)?"}
H["finalize_group(is_same_group=true)\nstart_group(reinit=false)"]
I["finalize_group(is_same_group=false)\nstart_group(reinit=true)"]
J["qexec_analytic_add_tuple\n(for every function)"]
A --> B
B -- yes --> C --> J
B -- no --> D
D -- yes --> E --> J
D -- no --> F
F -- yes (no partition) --> H
F -- no --> G
G -- yes (same partition) --> H
G -- no --> I
H --> J
I --> J
The two-tier comparison — full sort key first, then prefix-only
on mismatch — is the partition vs. ordering distinction. The
inner loop in qexec_analytic_put_next (page 4 of the deck)
mutates analytic_state->key_info.nkeys to switch between
func_p->sort_list_size (full key, partition + order) and
func_p->sort_prefix_size (PARTITION BY columns only):
```c
// qexec_analytic_put_next — src/query/query_executor.c (condensed)
static int
qexec_analytic_put_next (THREAD_ENTRY * thread_p, const RECDES * recdes, void *arg)
{
  ANALYTIC_STATE *analytic_state = (ANALYTIC_STATE *) arg;
  // ... condensed: load original tuple from sort key page ...

  if (analytic_state->input_recs == 0)
    {
      /* first record: start every function */
      for (i = 0; i < analytic_state->func_count; i++)
        qexec_analytic_start_group (thread_p, analytic_state->xasl_state,
                                    &analytic_state->func_state_list[i], recdes,
                                    /*reinit=*/ true);
    }
  else
    {
      for (i = 0; i < analytic_state->func_count; i++)
        {
          ANALYTIC_FUNCTION_STATE *func_state = &analytic_state->func_state_list[i];
          bool is_same_group = false;

          /* compare full sort key */
          int nkeys = analytic_state->key_info.nkeys;
          analytic_state->key_info.nkeys = func_state->func_p->sort_list_size;
          if (((*analytic_state->cmp_fn) (&func_state->current_key.data, &key,
                                          &analytic_state->key_info) == 0)
              || analytic_state->key_info.nkeys == 0)
            {
              analytic_state->key_info.nkeys = nkeys;
              ANALYTIC_FUNC_SET_FLAG (func_state->func_p, ANALYTIC_KEEP_RANK);
              if (QPROC_ANALYTIC_IS_OFFSET_FUNCTION (func_state->func_p))
                is_same_group = true;   /* offset funcs need per-tuple distinct */
              else
                continue;               /* same sort key, no work */
            }
          analytic_state->key_info.nkeys = nkeys;

          /* compare prefix only */
          if (!is_same_group)
            {
              if (func_state->func_p->sort_prefix_size == 0)
                is_same_group = true;
              else
                {
                  nkeys = analytic_state->key_info.nkeys;
                  analytic_state->key_info.nkeys = func_state->func_p->sort_prefix_size;
                  if ((*analytic_state->cmp_fn) (&func_state->current_key.data, &key,
                                                 &analytic_state->key_info) == 0)
                    is_same_group = true;
                  analytic_state->key_info.nkeys = nkeys;
                }
            }

          if (!is_same_group)
            {
              /* partition boundary */
              qexec_analytic_finalize_group (thread_p, analytic_state->xasl_state,
                                             func_state, /*is_same_group=*/ false);
              pr_clear_value (func_state->func_p->value);
              qexec_analytic_start_group (thread_p, analytic_state->xasl_state,
                                          func_state, recdes, /*reinit=*/ true);
            }
          else if (...)
            {
              /* sort-key boundary (within partition) */
              qexec_analytic_finalize_group (thread_p, analytic_state->xasl_state,
                                             func_state, /*is_same_group=*/ true);
              qexec_analytic_start_group (thread_p, analytic_state->xasl_state,
                                          func_state, recdes, /*reinit=*/ false);
            }
        }
    }

  qexec_analytic_add_tuple (thread_p, analytic_state, data, peek);
  analytic_state->input_recs++;
}
```

The two `start_group` arguments matter: `reinit=true` means
“new partition, reset accumulator and counters from scratch”;
reinit=false means “still in same partition, but new sort key
— save the partial value and reset only the sort-key counter”.
The deck’s slide on qexec_analytic_start_group (page 4)
contrasts the two cases.
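The whole decision can be condensed into a small pure function. This is a model, not the CUBRID code: Python tuple slicing stands in for the saved-and-restored `key_info.nkeys` trick, and the offset-function special case is omitted.

```python
def classify_transition(prev_key, key, sort_list_size, sort_prefix_size):
    """Mirror of the put_next decision: compare the full sort key first,
    then only the PARTITION BY prefix on mismatch."""
    if prev_key is None:
        return "first_record"      # start_group(reinit=true)
    if key[:sort_list_size] == prev_key[:sort_list_size]:
        return "same_sort_key"     # no boundary; just add_tuple
    if sort_prefix_size == 0 or key[:sort_prefix_size] == prev_key[:sort_prefix_size]:
        return "new_sort_key"      # finalize(is_same_group=true), start_group(reinit=false)
    return "new_partition"         # finalize(is_same_group=false), start_group(reinit=true)

# max(c3) over (partition by c1 order by c2): sort_list_size=2, sort_prefix_size=1
keys = [None, (1, 10), (1, 10), (1, 20), (2, 5)]
transitions = [classify_transition(keys[i], keys[i + 1], 2, 1) for i in range(4)]
```

Note the `sort_prefix_size == 0` branch: with no PARTITION BY, a sort-key change is never a partition boundary, matching the F-to-H edge in the flowchart above.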
Per-function state machine
stateDiagram-v2
    [*] --> Initial : sort begins
    Initial --> NewPartition : first record\nstart_group(reinit=true)
    NewPartition --> SameSortKey : add_tuple
    SameSortKey --> SameSortKey : same sort key\nadd_tuple
    SameSortKey --> NewSortKey : sort key changes\n(within partition)
    NewSortKey --> NewPartitionFinalize : partition prefix changes
    NewPartitionFinalize --> NewPartition : finalize(is_same_group=false)\nstart_group(reinit=true)
    NewSortKey --> SameSortKey : finalize(is_same_group=true)\nstart_group(reinit=false)\nadd_tuple
    SameSortKey --> EOI : sort drains
    EOI --> FinalFinalize : finalize trailing group
    FinalFinalize --> SecondPass : update_group_result
    SecondPass --> [*]
qexec_analytic_finalize_group writes one tuple to each of the
two sidecar list files:
```c
// qexec_analytic_finalize_group — src/query/query_executor.c (condensed)
static int
qexec_analytic_finalize_group (THREAD_ENTRY * thread_p, XASL_STATE * xasl_state,
                               ANALYTIC_FUNCTION_STATE * func_state, bool is_same_group)
{
  /* compute the function's final value for the just-closed sort-key range */
  qdata_finalize_analytic_func (thread_p, func_state->func_p, is_same_group);

  if (!DB_IS_NULL (func_state->func_p->value))
    func_state->curr_group_tuple_count_nn += func_state->curr_sort_key_tuple_count;

  /* group_list_id gets one (count, count_nn) tuple per partition */
  if (!is_same_group)
    qfile_fast_intint_tuple_to_list (thread_p, func_state->group_list_id,
                                     func_state->curr_group_tuple_count,
                                     func_state->curr_group_tuple_count_nn);

  /* value_list_id gets one (sort_key_count, value) tuple per distinct sort key */
  qfile_fast_intval_tuple_to_list (thread_p, func_state->value_list_id,
                                   func_state->curr_sort_key_tuple_count,
                                   func_state->func_p->value);
  // ... condensed ...
}
```

The first-pass output (`interm_file` + `group_list_id` + `value_list_id`) is enough information for the second pass to reconstruct each function's output column for every input tuple.
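To make the sidecar layout concrete, here is a toy first and second pass in Python for a cumulative `MAX(...) OVER (PARTITION BY ... ORDER BY ...)`. The helpers and dict-shaped rows are illustrative only (CUBRID's `count_nn` bookkeeping is also simplified here), but the invariant is the one stated above: `(count, value)` pairs per sort key plus `(count, count_nn)` pairs per partition are enough to rebuild a per-tuple output column.

```python
def first_pass(rows, part_col, order_col, arg_col):
    """Toy analytic first pass for cumulative MAX over (partition, order).
    rows must arrive sorted by (part_col, order_col).  Returns:
      group_list: one (tuple_count, non_null_count) entry per partition;
      value_list: one (tuple_count, running_max) entry per distinct sort key."""
    group_list, value_list = [], []
    cur_part = cur_key = object()          # sentinel: matches nothing
    g_cnt = g_nn = k_cnt = 0
    running = None
    for r in rows:
        p, k, v = r[part_col], (r[part_col], r[order_col]), r[arg_col]
        if p != cur_part:                  # partition boundary
            if g_cnt:
                value_list.append((k_cnt, running))
                group_list.append((g_cnt, g_nn))
            cur_part, cur_key = p, k
            g_cnt = g_nn = k_cnt = 0
            running = None                 # reinit=true: reset from scratch
        elif k != cur_key:                 # sort-key boundary within partition
            value_list.append((k_cnt, running))   # finalize(is_same_group=true)
            cur_key, k_cnt = k, 0          # reinit=false: keep running value
        g_cnt += 1
        k_cnt += 1
        if v is not None:
            g_nn += 1
            running = v if running is None else max(running, v)
    if g_cnt:                              # trailing group flush
        value_list.append((k_cnt, running))
        group_list.append((g_cnt, g_nn))
    return group_list, value_list

def second_pass(value_list):
    """Expand (count, value) sort-key entries back into one value per input tuple."""
    return [v for cnt, v in value_list for _ in range(cnt)]

rows = [
    {"c1": 1, "c2": 1, "c3": 5},
    {"c1": 1, "c2": 1, "c3": 7},
    {"c1": 1, "c2": 2, "c3": 3},
    {"c1": 2, "c2": 1, "c3": 9},
]
group_list, value_list = first_pass(rows, "c1", "c2", "c3")
outputs = second_pass(value_list)
```

Here `value_list` comes out as `[(2, 7), (1, 7), (1, 9)]` and `outputs` as `[7, 7, 7, 9]`: every tuple sharing a sort key shares one computed value.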
Second pass: qexec_analytic_update_group_result
This pass re-reads the intermediate file linearly. For each
tuple, for each window function, it advances cursors in the
sidecar files and computes the per-tuple output value. The
cursor advancement is implemented in qexec_analytic_value_advance
(forward and backward navigation over the sort-key list while
tracking partition headers via the group list) and in
qexec_analytic_value_lookup (random-access seek to a position
within the current partition).
```c
// qexec_analytic_update_group_result — src/query/query_executor.c (condensed)
static int
qexec_analytic_update_group_result (THREAD_ENTRY * thread_p, ANALYTIC_STATE * analytic_state)
{
  // ... condensed: open scans on group_list_id, value_list_id, interm_file ...

  while ((sc = qfile_scan_list_next (thread_p, &interm_scan_id, &tplrec_scan, PEEK)) == S_SUCCESS)
    {
      fetch_val_list (thread_p, analytic_state->a_regu_list, &xasl_state->vd, ...);
      qexec_analytic_eval_instnum_pred (thread_p, analytic_state, stage);

      for (i = 0; i < analytic_state->func_count; i++)
        {
          ANALYTIC_FUNCTION_STATE *func_state = &analytic_state->func_state_list[i];
          ANALYTIC_TYPE *func_p = func_state->func_p;

          if (func_p->function == PT_ROW_NUMBER)
            ;                   /* already computed in first pass; nothing to do */
          else if (QPROC_ANALYTIC_IS_OFFSET_FUNCTION (func_p))
            {
              if (func_state->group_tuple_position == -1)
                qexec_analytic_value_advance (thread_p, func_state, 1, 1, ignore_nulls);
              else if (func_state->group_consumed_tuples >= func_state->curr_group_tuple_count)
                {
                  qexec_analytic_group_header_next (thread_p, func_state);
                  func_state->group_consumed_tuples = 0;
                }
            }
          else if (QPROC_IS_INTERPOLATION_FUNC (func_p))
            {
              if (func_state->group_consumed_tuples >= func_state->curr_group_tuple_count)
                {
                  qexec_analytic_group_header_next (thread_p, func_state);
                  func_state->group_consumed_tuples = 0;
                }
            }
          else
            qexec_analytic_value_advance (thread_p, func_state, 1, 1, ignore_nulls);

          /* function-specific evaluation */
          switch (func_p->function)
            {
            case PT_LEAD:
            case PT_LAG:
            case PT_NTH_VALUE:
              qexec_analytic_evaluate_offset_function (thread_p, func_state, analytic_state);
              break;
            case PT_NTILE:
              qexec_analytic_evaluate_ntile_function (thread_p, func_state);
              break;
            case PT_CUME_DIST:
            case PT_PERCENT_RANK:
              qexec_analytic_evaluate_cume_dist_percent_rank_function (thread_p, func_state);
              break;
            case PT_MEDIAN:
            case PT_PERCENTILE_CONT:
            case PT_PERCENTILE_DISC:
              qexec_analytic_evaluate_interpolation_function (thread_p, func_state);
              break;
            }

          func_state->group_consumed_tuples++;
        }

      if (analytic_state->is_output_rec)
        qexec_insert_tuple_into_list (thread_p, analytic_state->output_file,
                                      analytic_state->a_outptr_list, &xasl_state->vd,
                                      &tplrec_write);
    }
  // ... condensed: cleanup ...
}
```

Offset functions: LEAD, LAG, NTH_VALUE
The offset evaluator translates the offset argument into a
target position within the partition and uses
qexec_analytic_value_lookup to read the value from the
sort-key sidecar:
```c
// qexec_analytic_evaluate_offset_function — src/query/query_executor.c (condensed)
static int
qexec_analytic_evaluate_offset_function (THREAD_ENTRY * thread_p,
                                         ANALYTIC_FUNCTION_STATE * func_state,
                                         ANALYTIC_STATE * analytic_state)
{
  // ... condensed: locate offset reguvar, read its int value ...

  switch (func_p->function)
    {
    case PT_LEAD:
      target_idx = func_state->group_consumed_tuples + offset;
      break;
    case PT_LAG:
      target_idx = func_state->group_consumed_tuples - offset;
      break;
    case PT_NTH_VALUE:
      // ... condensed: cast to double, validate range ...
      target_idx = (int) floor (nth_idx) - 1;
      if (func_p->from_last)
        target_idx = group_tuple_count - target_idx;
      break;
    }

  if (target_idx >= 0 && target_idx < group_tuple_count)
    qexec_analytic_value_lookup (thread_p, func_state, target_idx, func_p->ignore_nulls);
  else
    put_default = true;         /* off the end → default value */

  if (put_default)
    {
      pr_clear_value (func_p->value);
      pr_clone_value (default_val_coerced, func_p->value);
    }
  return error;
}
```

The deck (page 8-13) walks LAG(c2, 1) over four tuples in
detail, showing how group_consumed_tuples advances per output
tuple, how group_list_id provides the partition cardinality
that bounds the lookup, and how the default value falls back
when the target index is out of range.
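The target-index arithmetic is small enough to model directly. A hypothetical helper, not the CUBRID function: the partition is a plain list, `pos` plays the role of `group_consumed_tuples`, and an out-of-range target falls back to the default, exactly as `put_default` does.

```python
def lag(partition, pos, offset=1, default=None):
    """LAG(value, offset, default) at index pos inside its partition,
    mirroring target_idx = group_consumed_tuples - offset."""
    target_idx = pos - offset
    if 0 <= target_idx < len(partition):
        return partition[target_idx]   # the qexec_analytic_value_lookup step
    return default                     # off the partition edge: default value

col = [10, 20, 30, 40]
lagged = [lag(col, i) for i in range(len(col))]   # [None, 10, 20, 30]
```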
Interpolation functions: MEDIAN, PERCENTILE_CONT, PERCENTILE_DISC
These functions need to read the value at a fractional position within the partition. The evaluator computes the floor and ceiling positions of the target and either picks the middle row (odd count) or interpolates between the two middle rows:
```c
// qexec_analytic_evaluate_interpolation_function — src/query/query_executor.c (condensed)
static int
qexec_analytic_evaluate_interpolation_function (THREAD_ENTRY * thread_p,
                                                ANALYTIC_FUNCTION_STATE * func_state)
{
  ANALYTIC_TYPE *func_p = func_state->func_p;

  /* MEDIAN/PERCENTILE evaluated only at start of group; later tuples reuse */
  if (func_state->group_tuple_position_nn > 0)
    return NO_ERROR;
  if (func_p->is_const_operand)
    return NO_ERROR;
  if (func_state->curr_group_tuple_count_nn == 0)
    return NO_ERROR;

  if (func_p->function == PT_MEDIAN)
    percentile_d = 0.5;
  else
    {
      // ... condensed: fetch percentile reguvar, validate ...
      percentile_d = db_get_double (peek_value_p);
      if (func_p->function == PT_PERCENTILE_DISC)
        percentile_d =
          ceil (percentile_d * func_state->curr_group_tuple_count_nn)
          / func_state->curr_group_tuple_count_nn;
    }

  row_num_d = ((double) (func_state->curr_group_tuple_count_nn - 1)) * percentile_d;
  f_row_num_d = floor (row_num_d);
  c_row_num_d = (func_p->function == PT_PERCENTILE_DISC) ? f_row_num_d : ceil (row_num_d);

  if (f_row_num_d == c_row_num_d)
    {
      qexec_analytic_value_lookup (thread_p, func_state, (int) f_row_num_d, true);
      qdata_apply_interpolation_function_coercion (func_p->value, &func_p->domain,
                                                   func_p->value, func_p->function);
    }
  else
    {
      DB_VALUE c_value, f_value;

      qexec_analytic_value_lookup (thread_p, func_state, (int) f_row_num_d, true);
      pr_clone_value (func_p->value, &f_value);
      qexec_analytic_value_lookup (thread_p, func_state, (int) c_row_num_d, true);
      pr_clone_value (func_p->value, &c_value);
      qdata_interpolation_function_values (&f_value, &c_value, row_num_d,
                                           f_row_num_d, c_row_num_d, &func_p->domain,
                                           func_p->value, func_p->function);
    }
  return NO_ERROR;
}
```

The "group_tuple_position_nn > 0 → return" guard means the
median is computed once per partition, on the first non-NULL
tuple, and reused for every subsequent tuple in that partition.
Hence the deck's note that all rows of partition c1=1 get the
same median(c2) output (page 18: "when processing other tuples
within the same group, the already-computed median value is reused").
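The position math reduces to a few lines for a sorted, NULL-free partition. The helper below is illustrative; it mirrors the `row_num_d` / floor / ceil logic, including the `PERCENTILE_DISC` snap-to-row adjustment.

```python
import math

def percentile(sorted_vals, p, discrete=False):
    """Position math for MEDIAN / PERCENTILE_CONT / PERCENTILE_DISC over a
    sorted, NULL-free partition: row = (n - 1) * p, then pick or interpolate."""
    n = len(sorted_vals)
    if discrete:
        p = math.ceil(p * n) / n                  # snap to an existing row
    row = (n - 1) * p
    f = math.floor(row)
    c = f if discrete else math.ceil(row)
    if f == c:
        return sorted_vals[int(f)]                # lands exactly on a row
    f_val, c_val = sorted_vals[int(f)], sorted_vals[int(c)]
    return f_val + (c_val - f_val) * (row - f)    # linear interpolation

vals = [10, 20, 30, 40]
median_cont = percentile(vals, 0.5)                  # 25.0: between rows 1 and 2
median_disc = percentile(vals, 0.5, discrete=True)   # 20: snapped to row 1
```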
Position navigation: qexec_analytic_value_advance
This is the routine that walks back and forth across both sidecar list files coherently:
```c
// qexec_analytic_value_advance — src/query/query_executor.c (condensed)
static int
qexec_analytic_value_advance (THREAD_ENTRY * thread_p, ANALYTIC_FUNCTION_STATE * func_state,
                              int amount, int max_group_changes, bool ignore_nulls)
{
  while (sc == S_SUCCESS && amount != 0)
    {
      /* clamp to group end if max_group_changes <= 0 */
      if (amount > 0)
        {
          if (max_group_changes <= 0
              && func_state->group_tuple_position >= func_state->curr_group_tuple_count - 1)
            break;
          func_state->sort_key_tuple_position++;
          func_state->group_tuple_position++;
        }
      else
        {
          /* symmetric for backward */
        }

      /* sort-key boundary: load adjacent sort-key header */
      if (func_state->sort_key_tuple_position < 0)
        {
          qfile_scan_list_prev (thread_p, &func_state->value_scan_id, ...);
          qexec_analytic_sort_key_header_load (func_state, ...);
          func_state->sort_key_tuple_position = func_state->curr_sort_key_tuple_count - 1;
        }
      else if (func_state->sort_key_tuple_position >= func_state->curr_sort_key_tuple_count)
        {
          qfile_scan_list_next (thread_p, &func_state->value_scan_id, ...);
          qexec_analytic_sort_key_header_load (func_state, ...);
          func_state->sort_key_tuple_position = 0;
        }

      /* group boundary: load adjacent group header */
      if (func_state->group_tuple_position < 0)
        {
          /* prev group */
          max_group_changes--;
        }
      else if (func_state->group_tuple_position >= func_state->curr_group_tuple_count)
        {
          /* next group */
          max_group_changes--;
        }

      /* adjust amount: if ignore_nulls, only non-NULL counts */
      if (amount > 0)
        amount -= (DB_IS_NULL (func_p->value) && ignore_nulls) ? 0 : 1;
      else
        amount += (DB_IS_NULL (func_p->value) && ignore_nulls) ? 0 : 1;
    }

  if (amount != 0)
    {
      if (func_state->curr_group_tuple_count_nn == 0 && ignore_nulls)
        return NO_ERROR;        /* all-NULL group → result is NULL */
      return ER_FAILED;         /* genuine miss */
    }
  return NO_ERROR;
}
```

Two state machines run in lockstep — sort-key cursor and group
cursor — each with its own backing list file and its own
position counter. The max_group_changes argument is the budget
for crossing partition boundaries (0 = stay within current
group, 1 = at most one boundary crossing, used by
qexec_analytic_group_header_next).
qexec_analytic_value_lookup — random access
```c
// qexec_analytic_value_lookup — src/query/query_executor.c (condensed)
static int
qexec_analytic_value_lookup (THREAD_ENTRY * thread_p, ANALYTIC_FUNCTION_STATE * func_state,
                             int position, bool ignore_nulls)
{
  int offset = position - (ignore_nulls ? func_state->group_tuple_position_nn
                                        : func_state->group_tuple_position);
  if (offset != 0)
    return qexec_analytic_value_advance (thread_p, func_state, offset, 0, ignore_nulls);

  return qexec_analytic_sort_key_header_load (func_state, true);
}
```

This is the relative-positioning wrapper used by the offset and
interpolation evaluators. It computes a delta from the current
position and calls qexec_analytic_value_advance with
max_group_changes = 0 (no partition crossing).
Sort vs Hash GROUP BY — Strategy Choice and Spill
The sort/hash decision is made in three layers, only the first of which is plan-time:
1. Plan-time eligibility: the XASL generator sets `BUILDLIST_PROC_NODE::g_hash_eligible` when the GROUP BY is amenable to hashing — primary criteria are the absence of non-deterministic aggregates and the presence of equality-based key extraction. This is propagated to `GROUPBY_STATE::hash_eligible` in `qexec_initialize_groupby_state`.
2. Runtime memory ceiling: while processing, `qexec_hash_gby_agg_tuple` tracks `context->hash_size` (sum of key sizes + accumulator sizes). When it crosses `prm_get_bigint_value (PRM_ID_MAX_AGG_HASH_SIZE)` (default 2 MB), the LRU head of the hash table is evicted to `part_list_id`. This does not change strategy; it just trims the working set.
3. Runtime selectivity bailout: after `HASH_AGGREGATE_VH_SELECTIVITY_TUPLE_THRESHOLD = 2000` tuples have been processed, the executor checks `group_count / tuple_count > HASH_AGGREGATE_VH_SELECTIVITY_THRESHOLD = 0.5f`. If the ratio is too high (the hash table is mostly distinct keys), `state` flips to `HS_REJECT_ALL`, the entire table is dumped to `part_list_id` via `qdata_save_agg_htable_to_list`, and from that point all new input tuples write directly to `xasl->list_id` for sort-based aggregation.
flowchart TD
START["end_one_iteration\nreceives a tuple"]
ELIG{"hash_eligible &&\nstate != HS_REJECT_ALL?"}
HKEY["build hash key\n(qexec_build_agg_hkey)"]
PROBE{"key in hash_table?"}
NEW["allocate hash entry\nstash first_tuple"]
FOLD["fold accumulator\n(qdata_evaluate_aggregate_list)"]
MEM{"hash_size > MAX_AGG_HASH_SIZE?"}
EVICT["spill LRU head to part_list_id\nshrink hash_size"]
SEL{"tuple_count > 2000?"}
RATIO{"group_count/tuple_count > 0.5?"}
REJECT["state = HS_REJECT_ALL\nqdata_save_agg_htable_to_list\n→ part_list_id"]
SORT["write tuple to xasl->list_id\n(sort path)"]
POST["Post-processing"]
START --> ELIG
ELIG -- no --> SORT
ELIG -- yes --> HKEY --> PROBE
PROBE -- no --> NEW --> MEM
PROBE -- yes --> FOLD --> MEM
MEM -- yes --> EVICT --> SEL
MEM -- no --> SEL
SEL -- no --> POST
SEL -- yes --> RATIO
RATIO -- no --> POST
RATIO -- yes --> REJECT --> SORT
SORT --> POST
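The control flow above can be modelled with a toy `COUNT(*)` aggregator. The constants are stand-ins (CUBRID's ceiling is measured in bytes via `PRM_ID_MAX_AGG_HASH_SIZE`, not entry counts, and the real thresholds are 2000 tuples and 0.5), but the shape is the same: LRU eviction trims the working set, while the selectivity check is a one-way flip to reject-all.

```python
from collections import OrderedDict

MAX_HASH_ENTRIES = 4      # stand-in for the PRM_ID_MAX_AGG_HASH_SIZE byte ceiling
CHECK_AFTER = 8           # stand-in for the 2000-tuple selectivity threshold
MAX_SELECTIVITY = 0.5     # HASH_AGGREGATE_VH_SELECTIVITY_THRESHOLD

def hash_group_by(keys):
    """Toy COUNT(*) GROUP BY with LRU spill and a one-way selectivity bailout.
    Returns (in-memory table, spilled partial results, tuples sent to the sort path)."""
    table = OrderedDict()            # key -> count, maintained in LRU order
    part_list, sort_list = [], []
    seen = groups = 0
    reject_all = False
    for key in keys:
        seen += 1
        if reject_all:               # HS_REJECT_ALL: straight to the sort path
            sort_list.append(key)
            continue
        if key in table:
            table[key] += 1
            table.move_to_end(key)   # refresh LRU position
        else:
            table[key] = 1
            groups += 1
        if len(table) > MAX_HASH_ENTRIES:        # memory ceiling: evict LRU head
            part_list.append(table.popitem(last=False))
        if seen > CHECK_AFTER and groups / seen > MAX_SELECTIVITY:
            part_list.extend(table.items())      # dump table, flip to reject-all
            table.clear()
            reject_all = True
    return table, part_list, sort_list
```

Low-selectivity input (`["a", "b"]` repeated) stays entirely in the table; a stream of all-distinct keys first evicts, then bails out, and every later tuple goes to the sort path with its spilled partials awaiting the merge sort.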
Post-processing for the hash path
Even on the hash path, post-processing almost always sorts.
qexec_groupby examines the state and chooses one of three
shapes:
1. Pure hash, sort-skip: `xasl->list_id->tuple_cnt == 0`, `agg_hash_context->part_list_id->tuple_cnt == 0`, and `PRM_ID_AGG_HASH_RESPECT_ORDER == false`. The early-exit branch in `qexec_groupby` walks the in-memory hash table and emits one output tuple per entry directly. No sort. This is the only path that avoids `sort_listfile` entirely.
2. Hash with spill: `part_list_id` is non-empty (LRU evictions happened, or the very-high-selectivity bailout fired). The executor sorts `part_list_id` to merge per-key partial accumulators (using `qexec_hash_gby_put_next`), then sorts `xasl->list_id` (using `qexec_gby_put_next`) with cross-references into the merged partial list. Two `sort_listfile` calls, the same callback substrate as pure sort-based GROUP BY.
3. Sort fallback: `state == HS_REJECT_ALL`. Same as case 2; `xasl->list_id` carries the post-bailout tuples, `part_list_id` carries the dumped hash table.
The deck (page 23) lists the exact conditions for the
sort-skip optimisation: hash group by, no WITH ROLLUP,
agg_hash_respect_order == no, and no spill during processing.
External sort substrate
When qexec_groupby or qexec_execute_analytic calls
sort_listfile, the actual sort lives in
src/storage/external_sort.c:
```c
// sort_listfile — src/storage/external_sort.c (condensed)
sort_listfile (THREAD_ENTRY * thread_p, INT16 volid, int est_inp_pg_cnt,
               SORT_GET_FUNC * get_fn, void *get_arg,
               SORT_PUT_FUNC * put_fn, void *put_arg,
               SORT_CMP_FUNC * cmp_fn, void *cmp_arg,
               SORT_DUP_OPTION option, int limit, ..., SORT_PARALLEL_TYPE parallel_type)
{
  SORT_PARAM ori_sort_param;
  SORT_PARAM *sort_param = &ori_sort_param;

  // ... condensed: copy callbacks into sort_param ...
  sort_param->limit = limit;

  /* sort buffer sized by PRM_ID_SR_NBUFFERS (default), capped at input estimate */
  sort_param->tot_buffers = MIN (prm_get_integer_value (PRM_ID_SR_NBUFFERS), input_pages);
  sort_param->tot_buffers = MAX (4, sort_param->tot_buffers);
  sort_param->internal_memory =
    (char *) malloc ((size_t) sort_param->tot_buffers * (size_t) DB_PAGESIZE);

  /* temp file count derived from buffer/input ratio; runs go into half_files */
  sort_param->half_files = sort_get_num_half_tmpfiles (sort_param->tot_buffers, input_pages);
  sort_param->tot_tempfiles = sort_param->half_files << 1;

  /* main loop: in-phase (run generation) → ex-phase (k-way merge) */
  // ... condensed ...
}
```

The sort buffer size is `PRM_ID_SR_NBUFFERS` pages
(default at the time of writing varies; check the source). When
input fits in tot_buffers * DB_PAGESIZE of memory, the sort is
purely in-memory; otherwise it writes sorted runs to temp files
and merges them. The parallel_type argument routes between
serial and parallel sort variants — SORT_ANALYTIC and the
default carry hints to the parallel sort scheduler.
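The in-phase/ex-phase split is classic external merge sort. A compact sketch, with runs kept in memory for brevity where `sort_run_flush` would write them to temp files:

```python
import heapq

def sort_listfile_py(items, buffer_size=4):
    """External sort sketch: the in-phase cuts the input into sorted runs of at
    most buffer_size items; the ex-phase k-way merges the runs with a heap."""
    runs, buf = [], []
    for x in items:                  # in-phase: run generation
        buf.append(x)
        if len(buf) == buffer_size:
            runs.append(sorted(buf))
            buf = []
    if buf:
        runs.append(sorted(buf))
    if len(runs) <= 1:               # input fit in the buffer: purely in-memory
        return runs[0] if runs else []
    return list(heapq.merge(*runs))  # ex-phase: k-way merge
```

The `len(runs) <= 1` branch is the in-memory fast path described above; everything else pays for run generation plus one merge pass.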
A non-obvious interaction: sort_listfile does not have a
“sort finished, please drain” hook, so both
qexec_gby_finalize_group_dim and qexec_execute_analytic have
to finalize the trailing group manually after sort_listfile
returns — the loop in qexec_execute_analytic after
sort_listfile is exactly this clean-up:
```c
// trailing finalize after sort_listfile (qexec_execute_analytic excerpt)
if (analytic_state.input_recs != 0)
  {
    for (i = 0; i < analytic_state.func_count; i++)
      qexec_analytic_finalize_group (thread_p, analytic_state.xasl_state,
                                     &analytic_state.func_state_list[i], false);
    qexec_analytic_update_group_result (thread_p, &analytic_state);
  }
```

The deck reflects the same invariant for GROUP BY — the
finalize-before-start ordering inside qexec_gby_put_next only
fires on key transitions, leaving the very last group unflushed
until the post-sort_listfile cleanup.
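The invariant is easy to see in a generic sort-based grouping loop, sketched here in Python: emission happens only on a key change, so the trailing group must be flushed after the loop ends.

```python
def sort_based_group_by(sorted_rows, key, agg_init, agg_step, agg_final):
    """Generic sort-based grouping: one linear pass over key-sorted input.
    Groups are emitted on key change only, so the trailing group is
    flushed after the loop, mirroring the post-sort_listfile clean-up."""
    out, cur_key, acc = [], object(), None
    for row in sorted_rows:
        k = key(row)
        if k != cur_key:
            if acc is not None:
                out.append((cur_key, agg_final(acc)))   # finalize on key change
            cur_key, acc = k, agg_init()
        acc = agg_step(acc, row)
    if acc is not None:
        out.append((cur_key, agg_final(acc)))           # trailing-group flush
    return out

rows = [("a", 1), ("a", 2), ("b", 5)]
result = sort_based_group_by(rows, key=lambda r: r[0],
                             agg_init=lambda: 0,
                             agg_step=lambda acc, r: acc + r[1],
                             agg_final=lambda acc: acc)
```

Delete the final flush and the last group silently disappears, which is exactly the bug the manual clean-up after `sort_listfile` prevents.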
Source Walkthrough
Anchor on symbol names, not line numbers. Line numbers below are scoped to the doc's `updated:` date.
Aggregation / GROUP BY
Driver and state:
- `qexec_groupby` — top-level GROUP BY entry; allocates `GROUPBY_STATE`, opens the output list, calls `sort_listfile` with the sort callbacks.
- `GROUPBY_STATE` (query_executor.c:213) — per-query state.
- `GROUPBY_DIMENSION` (query_executor.c:206) — per-rollup-level slot with its own aggregate list.
- `GROUPBY_DIMENSION_FLAG` — `GROUP_BY`, `ROLLUP`, `CUBE`, `SET`.
- `qexec_initialize_groupby_state` — initialises the state and binds the hash context if eligible.
- `qexec_clear_groupby_state` — cleanup.
- `qexec_gby_init_group_dim` — allocates the dimension array (`g_dim_levels = with_rollup ? nkeys + 1 : 1`).
Sort-based path:
- `qexec_gby_get_next` — sort `get_fn`; turns input list tuples into `SORT_REC`s.
- `qexec_gby_put_next` — sort `put_fn`; iterates the chained `SORT_REC` list, dispatching `start_group` / `agg_tuple` / `finalize_group_dim` based on key transitions.
- `qexec_gby_start_group_dim` — start every dimension on the first record.
- `qexec_gby_start_group` — record the new key, zero one dimension's accumulators.
- `qexec_gby_agg_tuple` — fold the current input tuple into every dimension's accumulator chain.
- `qexec_gby_finalize_group_dim` — rollup-aware key-prefix comparison; finalize all dimensions whose prefix changed.
- `qexec_gby_finalize_group` — emit one output tuple from the given dimension; evaluate `HAVING` and `GROUP_BY_NUM`.
- `qexec_gby_finalize_group_val_list` — NULL out columns that were rolled up away.
Hash path:
- `qexec_hash_gby_agg_tuple` — processing-side hash insert/probe; LRU spill; very-high-selectivity bailout.
- `qexec_hash_gby_get_next` — sort `get_fn` for `part_list_id`.
- `qexec_hash_gby_put_next` — sort `put_fn`; merges accumulators for runs of identical keys.
- `qexec_alloc_agg_hash_context` / `qexec_alloc_agg_hash_context_buildlist_xasl` — allocate and initialise the hash context.
- `qexec_free_agg_hash_context` — cleanup.
- `qexec_build_agg_hkey` — populate an `AGGREGATE_HASH_KEY` from the current reguvars.
- `qexec_locate_agg_hentry_in_list` — scan `sorted_part_list_id` for a key during the second hash-path sort.
Aggregate evaluation helpers (query_aggregate.cpp):
- `qdata_evaluate_aggregate_list` — fold one tuple into an `AGGREGATE_TYPE` chain (per-function dispatch).
- `qdata_initialize_aggregate_list` — zero the chain at start of group.
- `qdata_finalize_aggregate_list` — close the chain (e.g., divide for `AVG`).
- `qdata_aggregate_accumulator_to_accumulator` — merge two accumulators (used in hash-spill consolidation).
- `qdata_alloc_agg_hvalue` / `qdata_free_agg_hvalue` — accumulator storage in the hash context.
- `qdata_save_agg_hentry_to_list` — write one (key, value) entry to `part_list_id`.
- `qdata_save_agg_htable_to_list` — bulk dump the hash table on selectivity bailout.
- `qdata_load_agg_hentry_from_tuple` — re-hydrate (key, value) from a sorted partial-list tuple.
Analytic / Window Functions
Driver and state:
- `qexec_execute_analytic` — top-level analytic entry; runs once per shared sort list.
- `ANALYTIC_STATE` (query_executor.c:288) — per-query state.
- `ANALYTIC_FUNCTION_STATE` (query_executor.c:259) — per-function state; carries the two sidecar list files.
- `qexec_initialize_analytic_state` — initialises the state and per-function sub-states.
- `qexec_initialize_analytic_function_state` — the inner builder for one `ANALYTIC_FUNCTION_STATE`.
- `qexec_clear_analytic_state` — cleanup.
First pass (sort + state-update):
- `qexec_analytic_get_next` — sort `get_fn`.
- `qexec_analytic_put_next` — sort `put_fn`; the partition/sort-key state machine.
- `qexec_analytic_start_group` — record key, optionally re-init the accumulator (new partition) or just the sort-key counter (same partition).
- `qexec_analytic_add_tuple` — push the current tuple into every function's accumulator and into `interm_file`.
- `qexec_analytic_finalize_group` — emit one (group_count, group_count_nn) tuple to `group_list_id` and one (sort_key_count, value) tuple to `value_list_id`.
- `qexec_analytic_eval_instnum_pred` — evaluates `INST_NUM()` at the appropriate stage.
Second pass (output reassembly):
- `qexec_analytic_update_group_result` — re-scan `interm_file`, position cursors, dispatch per-function evaluators, write output.
- `qexec_analytic_value_advance` — bidirectional walk over the sort-key sidecar with group-boundary tracking.
- `qexec_analytic_value_lookup` — random-access seek to a position within the current partition.
- `qexec_analytic_group_header_next` — walk to the next partition.
- `qexec_analytic_group_header_load` / `qexec_analytic_sort_key_header_load` — deserialize the sidecar headers into `func_state` counters.
Per-function evaluators:
- `qexec_analytic_evaluate_offset_function` — `LEAD`, `LAG`, `NTH_VALUE`.
- `qexec_analytic_evaluate_ntile_function` — `NTILE`.
- `qexec_analytic_evaluate_cume_dist_percent_rank_function` — `CUME_DIST`, `PERCENT_RANK`.
- `qexec_analytic_evaluate_interpolation_function` — `MEDIAN`, `PERCENTILE_CONT`, `PERCENTILE_DISC`.
- `qexec_analytic_eval_in_processing` — for analytic functions whose value can be computed during processing (e.g., `ROW_NUMBER`).
Helpers (query_analytic.cpp):
- `qdata_initialize_analytic_func` — zero per-group counters, open distinct list files when needed.
- `qdata_evaluate_analytic_func` — fold one tuple into the function-specific running state.
- `qdata_finalize_analytic_func` — compute the per-sort-key final value.
Sort vs Hash GROUP BY — Strategy Choice and Spill
Selectivity / memory constants:
- `HASH_AGGREGATE_VH_SELECTIVITY_TUPLE_THRESHOLD` = 2000.
- `HASH_AGGREGATE_VH_SELECTIVITY_THRESHOLD` = 0.5f.
- `PRM_ID_MAX_AGG_HASH_SIZE` (default 2 MB) — memory ceiling.
- `PRM_ID_AGG_HASH_RESPECT_ORDER` — disables the sort-skip short-circuit even when otherwise eligible.
- `PRM_ID_SR_NBUFFERS` — sort buffer page count.
External sort substrate (storage/external_sort.c):
- `sort_listfile` — public entry; copies callbacks into `SORT_PARAM` and runs in-phase + ex-phase.
- `sort_listfile_internal` — the actual driver.
- `sort_inphase_sort` — produces sorted runs from buffer-sized chunks.
- `sort_run_sort` — quicksort over the in-buffer key array.
- `sort_run_flush` — flushes a sorted run to a temp file.
- `sort_exphase_merge` — k-way merge of runs.
- `sort_exphase_merge_elim_dup` — merge with duplicate elimination (for `DISTINCT`-style sorts).
- `SORT_PARAM` — central state struct (file table, callback bundle, comparator).
- `SORT_PARALLEL_TYPE` — selects between serial, `SORT_ANALYTIC`, and other parallel scheduler modes.
Position hints as of 2026-04-30
| Symbol | File | Line |
|---|---|---|
HASH_AGGREGATE_VH_SELECTIVITY_TUPLE_THRESHOLD | query_executor.c | 127 |
HASH_AGGREGATE_VH_SELECTIVITY_THRESHOLD | query_executor.c | 130 |
GROUPBY_DIMENSION_FLAG | query_executor.c | 200 |
GROUPBY_DIMENSION struct | query_executor.c | 206 |
GROUPBY_STATE struct | query_executor.c | 213 |
ANALYTIC_FUNCTION_STATE struct | query_executor.c | 259 |
ANALYTIC_STATE struct | query_executor.c | 288 |
qexec_initialize_groupby_state | query_executor.c | 4289 |
qexec_gby_agg_tuple | query_executor.c | 4478 |
qexec_hash_gby_agg_tuple_public | query_executor.c | 4516 |
qexec_hash_gby_agg_tuple | query_executor.c | 4537 |
qexec_hash_gby_get_next | query_executor.c | 4778 |
qexec_hash_gby_put_next | query_executor.c | 4793 |
qexec_gby_get_next | query_executor.c | 4940 |
qexec_gby_put_next | query_executor.c | 4956 |
qexec_groupby | query_executor.c | 5247 |
qexec_execute_mainblock_internal | query_executor.c | 15058 |
qexec_gby_finalize_group_dim | query_executor.c | 19092 |
qexec_gby_finalize_group | query_executor.c | 19184 |
qexec_gby_start_group_dim | query_executor.c | 19408 |
qexec_gby_start_group | query_executor.c | 19430 |
qexec_gby_init_group_dim | query_executor.c | 19501 |
qexec_execute_analytic | query_executor.c | 21007 |
qexec_initialize_analytic_state | query_executor.c | 21593 |
qexec_analytic_get_next | query_executor.c | 21743 |
qexec_analytic_put_next | query_executor.c | 21761 |
qexec_analytic_start_group | query_executor.c | 22045 |
qexec_analytic_finalize_group | query_executor.c | 22118 |
qexec_analytic_add_tuple | query_executor.c | 22195 |
qexec_analytic_evaluate_offset_function | query_executor.c | 22416 |
qexec_analytic_evaluate_cume_dist_percent_rank_function | query_executor.c | 22634 |
qexec_analytic_evaluate_interpolation_function | query_executor.c | 22676 |
qexec_analytic_group_header_load | query_executor.c | 22807 |
qexec_analytic_sort_key_header_load | query_executor.c | 22847 |
qexec_analytic_value_advance | query_executor.c | 22910 |
qexec_analytic_value_lookup | query_executor.c | 23080 |
qexec_analytic_group_header_next | query_executor.c | 23112 |
qexec_analytic_update_group_result | query_executor.c | 23135 |
qexec_alloc_agg_hash_context_buildlist_xasl | query_executor.c | 26906 |
qexec_alloc_agg_hash_context | query_executor.c | 26922 |
qexec_free_agg_hash_context | query_executor.c | 27157 |
qexec_build_agg_hkey | query_executor.c | 27282 |
qexec_locate_agg_hentry_in_list | query_executor.c | 27327 |
sort_run_sort | storage/external_sort.c | 1171 |
sort_listfile | storage/external_sort.c | 1347 |
sort_listfile_internal | storage/external_sort.c | 1749 |
sort_inphase_sort | storage/external_sort.c | 1849 |
sort_run_flush | storage/external_sort.c | 2294 |
sort_exphase_merge_elim_dup | storage/external_sort.c | 2457 |
sort_exphase_merge | storage/external_sort.c | 3381 |
sort_get_avg_numpages_of_nonempty_tmpfile | storage/external_sort.c | 4137 |
sort_get_numpages_of_active_infiles | storage/external_sort.c | 5685 |
sort_run_add_new | storage/external_sort.c | 5747 |
Cross-check Notes
The raw analyses (Post-processing(aggregation).pdf,
Post-Processing(analytic).pdf, Sort, Hash Group By.pptx) were
captured as project-internal teaching material; the structural
description matches the present source faithfully, but a few
points have drifted or were imprecisely stated.
- `HASH_AGGREGATE_VH_SELECTIVITY_TUPLE_THRESHOLD` value. The deck (slides 18, 20) says 200 tuples and notes that the value "may change". The current source defines it as 2000 (query_executor.c:127). The threshold has indeed moved by an order of magnitude since the deck was authored.
- "The `qexec_gby_get_next` and `qexec_gby_put_next` functions are passed to `sort_listfile` as arguments." Verified — the call signature in `qexec_groupby` matches the deck's narration. The deck's flowchart for the put callback (page 7) shows `qexec_analytic_put_next` as the entry node, which is a copy-paste from the analytic deck; the actual entry for GROUP BY is `qexec_gby_put_next`.
- `groupby_dimension` ordering. The deck (slide 24) describes `g_dim[0]` as the full-detail aggregation (the result grouped by all columns) and the higher indices as the rollup prefixes, with the total aggregate (no grouping columns) at `g_dim[1]` and the prefix levels in `g_dim[2..g_dim_levels-1]`. Verified at `qexec_gby_init_group_dim`: index 0 carries the `GROUPBY_DIM_FLAG_GROUP_BY` flag, and when `with_rollup` all indices additionally carry `GROUPBY_DIM_FLAG_ROLLUP`. The finalize order in `qexec_gby_finalize_group_dim` walks from `g_dim_levels - 1` down to `i + 1`, which matches the deck's rule of finalizing from the end of the `g_dim` array down to `nkey + 1`.
- CBRD-25611 — rollup expression NULL bug. Deck slide 29 describes the fix at a high level. Whether the patch lives in the current `qexec_gby_finalize_group_val_list` or in the expression evaluator was not traced; the investigation path is the CBRD-25611 commit log on jira.cubrid.org.
- Analytic deck flowchart on slide page 4. The decision tree describes the partition-prefix re-comparison in the correct order, but the label "number of keys to compare = sort_prefix_size" elides the fact that `nkeys` is also set back to its original value after each comparison; the source code's save-and-restore pattern in `qexec_analytic_put_next` is the actual contract.
- `qexec_initialize_groupby_state` calls `qfile_initialize_sort_key_info` only when `groupby_list != NULL`. Verified at the head of the function: `qexec_groupby` itself early-exits if `buildlist->groupby_list == NULL`, so the state-init never sees a NULL list.
- `qexec_execute_analytic` runs once per sort list, not once per analytic function. Verified at the call site in `qexec_execute_mainblock_internal` (the loop iterates `xasl->proc.buildlist.a_eval_list` and calls `qexec_execute_analytic` per entry). The deck describes the per-sort-list iteration but elides the planner-side merge in `pt_optimize_analytic_list` that produces those entries.
- `is_skip_sort` path in `qexec_execute_analytic`. Triggered when the planner has already produced a sorted input; the function jumps directly to `qexec_analytic_update_group_result` without calling `sort_listfile`. Verified at the `if (is_skip_sort) { ... goto wrapup; }` branch. The deck does not discuss this optimisation.
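The rollup finalize-order rule verified above reduces to a short loop. The sketch below is an illustrative miniature, not CUBRID code: `rollup_state`, `rollup_init`, `rollup_accumulate`, and `rollup_key_change` are hypothetical names, the accumulator is a bare COUNT(*), and only the walk from the last dimension down to `matched + 1` mirrors the `qexec_gby_finalize_group_dim` behaviour described above (the detail dimension at index 0 is finalized by a separate key-change path not modelled here).

```c
#include <assert.h>

enum { MAX_LEVELS = 8 };

/* Hypothetical miniature of the GROUPBY_DIMENSION layout: with k grouping
 * columns there are k + 1 dimensions — detail at [0], grand total at [1],
 * and the rollup prefixes of increasing length at [2..levels-1]. */
struct rollup_state {
    int levels;                 /* analogue of g_dim_levels */
    long count[MAX_LEVELS];     /* one COUNT(*) accumulator per dimension */
    long finalized[MAX_LEVELS]; /* result rows emitted per dimension */
};

static void rollup_init(struct rollup_state *st, int group_cols)
{
    st->levels = group_cols + 1;
    for (int i = 0; i < MAX_LEVELS; i++) {
        st->count[i] = 0;
        st->finalized[i] = 0;
    }
}

/* Every input tuple folds into every dimension's accumulator. */
static void rollup_accumulate(struct rollup_state *st)
{
    for (int i = 0; i < st->levels; i++)
        st->count[i]++;
}

/* On a key change where `matched` leading dimensions still match, finalize
 * from the end of the array down to matched + 1 — the deck's "from the end
 * of g_dim down to nkey + 1" rule. Returns how many dimensions finalized. */
static int rollup_key_change(struct rollup_state *st, int matched)
{
    int n = 0;
    for (int i = st->levels - 1; i >= matched + 1; i--) {
        st->finalized[i]++;
        st->count[i] = 0;   /* reset the accumulator for the next group */
        n++;
    }
    return n;
}
```

The point of the loop direction is that longer prefixes must close before shorter ones, so each rollup row is emitted exactly once per prefix boundary.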
Open Questions
- Why the `is_same_group=true` case for offset functions keeps `start_group` (`reinit=false`) even within the same sort key. The comment in `qexec_analytic_put_next` ("offset functions will treat all tuples in a group as having a different sort key regardless if this is true or not") points to a per-tuple distinct value requirement; the resulting `value_list_id` then has more entries than the strict sort-key boundary count. Investigation path: trace the read-side cursor in `qexec_analytic_evaluate_offset_function` to confirm how the extra entries are consumed.
- `qexec_analytic_value_advance` failure semantics on all-NULL groups. When the function targets a position that would skip past every tuple because all values are NULL and `ignore_nulls == true`, it returns `NO_ERROR` with `func_p->value` left at its previous content. Whether the caller resets the value to NULL before relying on it (e.g., in `qexec_analytic_value_lookup` callers) needs auditing.
- `SORT_ANALYTIC` parallel sort routing. `sort_listfile` takes a `SORT_PARALLEL_TYPE` argument; the analytic call site uses `SORT_ANALYTIC` and the GROUP BY call site uses the default. The downstream parallel-sort scheduler in `external_sort.c` needs to be traced to confirm the analytic variant has a distinct scheduling policy. Investigation path: read `sort_listfile_internal` and the parallel hooks under `sort_run_sort`.
- What happens to `agg_hash_context` after a `HS_REJECT_ALL` bailout if a new iteration enters `qexec_hash_gby_agg_tuple`? The `state == HS_REJECT_ALL` guard at function entry returns immediately, so the tuple bypasses the hash and is written directly to `xasl->list_id` by the caller. But the hash table itself was just dumped via `qdata_save_agg_htable_to_list` — is the `mht_table` freed at that point, or does it linger empty until `qexec_free_agg_hash_context`? Memory accounting depends on the answer.
- `PRM_ID_AGG_HASH_RESPECT_ORDER` semantics. When set to `yes` (the default), it forces post-processing through `sort_listfile` even when the sort-skip optimisation would otherwise apply. Why is the default `yes`? Hypothesis: some downstream consumer (e.g., DISTINCT, or a top-level ORDER BY that is not present here but is composed elsewhere) requires the output ordering to be deterministic; without sorting, two runs of the same hash aggregation could produce different row orders. Investigation path: `git log` on the parameter's introduction.
- Trailing-group finalize after `sort_listfile`. Both `qexec_groupby` and `qexec_execute_analytic` rely on the knowledge that `sort_listfile` does not call any "drain" hook, so they finalize the trailing group after the sort returns. If a future refactor adds such a hook (parallel sort already has a more complex completion path), the trailing finalize could double-fire. The contract is implicit; making it explicit (e.g., a documented invariant in `external_sort.h`) would be defensive.
- Memory ownership of `interm_file`. When `is_skip_sort` is true, `analytic_state.interm_file = list_id` (the input list file). When false, `interm_file` is a freshly opened list file. The cleanup branch at the end of `qexec_execute_analytic` destroys `list_id` after copying the output; whether `interm_file` is destroyed independently in the `is_skip_sort` case (where they alias) or whether the destroy is correctly skipped needs audit. Investigation path: trace `qexec_clear_analytic_state`.
- Hash key extraction from non-position reguvars. The deck asserts that `g_hk_regu_list` is built from `proc->g_hk_scan_regu_list`, which is `TYPE_POSITION`-based (slide 7: "the regu_list entries are of type TYPE_POSITION"). When the GROUP BY key is an expression rather than a column, the `TYPE_POSITION` reguvars must be backed by an upstream evaluation that materialises the expression value into a tuple position. Whether this materialisation lives in `xasl_generation.c` or in the optimizer's `pt_optimize_*_for_groupby` pass needs confirmation.
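The `HS_REJECT_ALL` question is easier to reason about against a minimal model of the accept/reject state machine. Everything below is a hypothetical sketch: the struct, the function, and the parameterised threshold are illustrative, and only the state names and the idea of a tuple-count threshold combined with a selectivity ratio come from the source (the real constants and transitions live in `qexec_hash_gby_agg_tuple` in query_executor.c).

```c
#include <assert.h>
#include <stdbool.h>

/* Two of the hash-aggregation states: accept tuples into the hash table,
 * or reject everything so the caller falls back to the sort-based path. */
enum hash_state { HS_ACCEPT_ALL, HS_REJECT_ALL };

struct agg_hash_ctx {
    enum hash_state state;
    long tuples;    /* input tuples seen so far */
    long groups;    /* distinct grouping keys seen so far */
};

/* Illustrative stand-in for the selectivity check; the real tuple threshold
 * (HASH_AGGREGATE_VH_SELECTIVITY_TUPLE_THRESHOLD, currently 2000) is a
 * compile-time constant, parameterised here for demonstration. Returns true
 * if the tuple was absorbed by the hash table, false if the caller must
 * write it straight to the output list file. */
static bool hash_tuple(struct agg_hash_ctx *ctx, bool is_new_group,
                       long tuple_threshold, double max_selectivity)
{
    if (ctx->state == HS_REJECT_ALL)
        return false;               /* bypass: tuple goes to xasl->list_id */

    ctx->tuples++;
    if (is_new_group)
        ctx->groups++;

    /* Too many distinct keys relative to input: hashing stops paying off,
     * so flip to reject-all; this tuple was still absorbed. */
    if (ctx->tuples >= tuple_threshold
        && (double) ctx->groups / (double) ctx->tuples > max_selectivity)
        ctx->state = HS_REJECT_ALL;
    return true;
}
```

The sketch makes the open question concrete: once the state flips, nothing in this path touches the table again, so whether `mht_table` is freed at dump time or at context teardown is invisible from here.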
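The trailing-group contract can likewise be sketched in isolation. The consumer below is hypothetical (`gby_state`, `gby_put`, `gby_finish` are invented names); it only demonstrates the invariant that a sort substrate invoking nothing but a per-tuple put callback offers no end-of-input "drain" hook, so the caller must finalize the last group itself after the sort returns — the pattern `qexec_groupby` follows around `sort_listfile`.

```c
#include <assert.h>
#include <string.h>

/* Minimal sort-based GROUP BY consumer with a single COUNT(*) accumulator.
 * gby_put plays the role of the put callback handed to the sort; keys are
 * assumed to arrive in sorted order. */
struct gby_state {
    int have_group;   /* a group is currently open */
    int cur_key;      /* grouping key of the open group */
    long cur_count;   /* COUNT(*) accumulator for the open group */
    long emitted;     /* groups finalized so far */
};

static void gby_put(struct gby_state *st, int key)
{
    if (st->have_group && key != st->cur_key) {
        st->emitted++;        /* key change: finalize the open group */
        st->cur_count = 0;
    }
    st->have_group = 1;
    st->cur_key = key;
    st->cur_count++;
}

/* Must be called once after the sort returns; the sort never signals
 * end-of-input, so without this the last group is silently dropped. */
static void gby_finish(struct gby_state *st)
{
    if (st->have_group)
        st->emitted++;
}
```

If a future drain hook finalized the open group inside the sort, calling `gby_finish` afterwards would double-count it — exactly the double-fire risk the open question flags.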
Sources
Raw analyses (raw/code-analysis/cubrid/query-processing/)
- `Post-processing(aggregation).pdf` — sort-based and hash-based GROUP BY execution, GROUPBY_DIMENSION rollup mechanics, deck excerpts of `qexec_groupby` flow.
- `Post-Processing(analytic).pdf` — window/analytic function execution: `qexec_analytic_put_next` flow, sidecar list files, worked examples for `LAG` and `MEDIAN`.
- `Sort, Hash Group By.pptx` — the Aug 2025 internal slide deck on the strategy choice; XASL list shape (`g_outptr_list`, `g_val_list`, `g_regu_list`, `g_agg_list`), processing-side hash insert, key-duplication scenarios, sort-skip conditions, CBRD-25611.
- `_converted/post-processingaggregation.pdf.md`, `_converted/post-processinganalytic.pdf.md`, `_converted/sort-hash-group-by.pptx.md` — markitdown extracts.
Sibling docs
Section titled “Sibling docs”knowledge/research/dbms-general/database-internals.md— Petrov, Ch. 14 (“Query Execution”); the textbook framing for sort vs. hash aggregation and the external-sort substrate.knowledge/code-analysis/cubrid/cubrid-heartbeat.md— style exemplar for this document’s structure.
Textbook chapters / papers
Section titled “Textbook chapters / papers”- Database Internals (Petrov, 2019), Ch. 14 — query execution models, sort/hash trade-offs, external sort.
- Database Systems: The Complete Book (Garcia-Molina, Ullman, Widom), Ch. 15 — physical query plans, two-pass algorithms based on sorting and hashing.
- ISO/IEC 9075-2:2003 (SQL:2003) —
WINDOWclause definition; the specification CUBRID’s analytic execution implements. - PostgreSQL
src/backend/executor/nodeAgg.candnodeWindowAgg.csource (BSD license) — comparative reference for the strategy-choice and window-execution alternatives.
CUBRID source (/data/hgryoo/references/cubrid/)
Section titled “CUBRID source (/data/hgryoo/references/cubrid/)”src/query/query_executor.c— bulk of post-processing:qexec_groupby,qexec_execute_analytic, the put callbacks, the per-function evaluators, the hash-context driver.src/query/query_aggregate.cpp— aggregate evaluation helpers:qdata_evaluate_aggregate_list,qdata_finalize_aggregate_list, hash-key/value lifecycle.src/query/query_analytic.cpp— analytic evaluation helpers:qdata_initialize_analytic_func,qdata_evaluate_analytic_func,qdata_finalize_analytic_func.src/storage/external_sort.c/external_sort.h—sort_listfileand the in-phase / ex-phase merge sort.src/query/list_file.c—qfile_*list-file primitives used for both input and the analytic sidecars (group_list_id,value_list_id,interm_file).src/xasl/buildlist_proc_node.hpp(or wherever the node is defined) —BUILDLIST_PROC_NODEcarryingg_outptr_list,g_agg_list,g_hash_eligible,agg_hash_context,g_with_rollup.src/parser/parse_tree.h,xasl_generation.c— parser and XASL-builder side that constructs the lists and decidesg_hash_eligible;pt_optimize_analytic_listfor the analytic function shared-sort merge.