CUBRID Post-Processing — Aggregation, Window/Analytic Functions, and Sort vs Hash GROUP BY
Contents:
- Theoretical Background
- Common DBMS Design
- CUBRID’s Approach
- Source Walkthrough
- Cross-check Notes
- Open Questions
- Sources
Theoretical Background
Post-processing is the half of query execution that runs after all input tuples have been produced — by table scans, index scans, joins, or subquery list files. Three textbook problems share the same machinery and the same listfile-and-sort substrate:
- Grouped aggregation — given a stream of input tuples and a set of grouping keys, produce one output tuple per distinct key together with the values of SUM, AVG, MIN, MAX, COUNT, etc. Two strategies exist:
  - Sort-based GROUP BY: sort the input by the grouping key, then make a single linear pass; tuples with the same key are contiguous, so an accumulator can be reset on every key change. Cost is dominated by the sort.
  - Hash-based GROUP BY: maintain an in-memory hash table keyed by the grouping key; each input tuple probes the table and either creates a new accumulator or folds into an existing one. Cost is dominated by hash probes; no sort is needed if the table fits in memory. When it does not, the hash table either spills to disk or degrades to sort-based.

  Database Internals (Petrov, 2019), Chapter 14 (“Query Execution”) frames the trade-off as a function of grouping selectivity (groups / tuples) and memory budget; low selectivity (few distinct keys) favours hash, high selectivity (many distinct keys) favours sort.
- Window / analytic functions — RANK(), ROW_NUMBER(), LAG(), LEAD(), MEDIAN(), PERCENTILE_CONT(), etc., as defined in SQL:2003. Each window function has three sub-clauses: PARTITION BY (groups the rows), ORDER BY (orders within a partition), and an optional frame (ROWS BETWEEN ... AND ...). Unlike aggregates, every input tuple yields an output tuple; the window function value for tuple $t$ depends on $t$‘s frame inside its partition. The textbook execution model is:
  1. Sort the input by (PARTITION BY, ORDER BY).
  2. Walk the sorted stream tracking partition boundaries and sort-key boundaries.
  3. For each tuple, compute the function value over the frame.

  For navigation functions like LAG and offset-into-group functions like NTH_VALUE, this requires random access to other tuples in the same partition — implemented by writing intermediate group/value list files and re-scanning them.
- External sort — when an input list file does not fit in the sort buffer (PRM_ID_SR_NBUFFERS), the sort module degrades to a classic two-phase merge sort: an in-phase that produces sorted runs from buffer-sized chunks, then an ex-phase that does k-way merging of the runs. This is the substrate for both sort-based GROUP BY and the analytic-function sort. Petrov Ch. 14 (sort) and Garcia-Molina/Ullman/Widom (Database Systems: The Complete Book, Ch. 15.4) describe the model; CUBRID’s implementation lives in src/storage/external_sort.c.
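The sort-based strategy can be sketched in a few lines of C. This is an illustrative toy, not CUBRID code — row, cmp_key, and sort_group_by_sum are invented names; it computes SUM(val) GROUP BY key the sort-based way, resetting the accumulator on every key change:

```c
#include <assert.h>
#include <stdlib.h>

/* Toy sort-based GROUP BY: sort by key, then one linear pass in which
 * tuples with equal keys are contiguous.  All names are invented. */
typedef struct { int key; int val; } row;

static int cmp_key (const void *a, const void *b)
{
  return ((const row *) a)->key - ((const row *) b)->key;
}

/* Computes SUM(val) GROUP BY key; writes (key, sum) pairs into out
 * and returns the number of groups. */
static int sort_group_by_sum (row *in, int n, row *out)
{
  qsort (in, n, sizeof (row), cmp_key);
  int groups = 0;
  for (int i = 0; i < n; i++)
    {
      if (groups == 0 || out[groups - 1].key != in[i].key)
        {
          /* key change: start a new accumulator */
          out[groups].key = in[i].key;
          out[groups].val = 0;
          groups++;
        }
      out[groups - 1].val += in[i].val;   /* fold tuple into accumulator */
    }
  return groups;
}
```

The hash-based strategy reaches the same (key, sum) pairs with one hash probe per row and no sort, at the price of keeping every open group's accumulator in memory at once.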
The three problems share three properties: (a) the input is a
list file (QFILE_LIST_ID) — a virtual sequential file backed
by temporary pages; (b) sorting is invoked through a callback
sandwich — sort_listfile (..., get_func, put_func, ...) reads
sort records via get_func and emits sorted records to
put_func; (c) the per-key state (accumulator for aggregation,
window-function state for analytics) is kept in a small in-memory
struct and folded into the put callback so that the sort
itself does not know it is doing aggregation.
This callback shape is what makes CUBRID’s sort-based GROUP BY a
two-line caller (qexec_groupby → sort_listfile) and lets the
hash-based path graft onto the same skeleton: when the hash table
spills, it dumps its key/value pairs into a list file and the
final pass is again a sort_listfile call.
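The callback sandwich can be mimicked in miniature. Everything below is invented for illustration (tiny_sort_listfile stands in for sort_listfile, count_runs for a put callback in the role of qexec_gby_put_next); the point is that the per-key state lives entirely in the put callback's arg, so the sorter never learns it is driving an aggregation:

```c
#include <assert.h>

/* Invented callback types mirroring the get/put sandwich shape. */
typedef int (*get_fn_t) (int *out, void *arg);   /* returns 0 when drained */
typedef void (*put_fn_t) (int rec, void *arg);

static void tiny_sort_listfile (get_fn_t get, void *garg,
                                put_fn_t put, void *parg)
{
  int buf[64], n = 0;
  while (n < 64 && get (&buf[n], garg))   /* "in-phase": read records */
    n++;
  for (int i = 1; i < n; i++)             /* insertion sort stands in for
                                           * the external merge sort */
    {
      int v = buf[i], j = i - 1;
      for (; j >= 0 && buf[j] > v; j--)
        buf[j + 1] = buf[j];
      buf[j + 1] = v;
    }
  for (int i = 0; i < n; i++)
    put (buf[i], parg);                   /* put callback sees sorted order */
}

/* Put callback that counts key changes — GROUP-BY-style state hidden
 * from the sorter inside its arg. */
typedef struct { int last_key, runs; } run_state;
static void count_runs (int rec, void *arg)
{
  run_state *s = (run_state *) arg;
  if (s->runs == 0 || rec != s->last_key)
    s->runs++;                            /* new key: finalize/start group */
  s->last_key = rec;
}

/* Get callback: walks an int array terminated by a negative sentinel. */
static int cursor_get (int *out, void *arg)
{
  int **p = (int **) arg;
  if (**p < 0)
    return 0;
  *out = *(*p)++;
  return 1;
}
```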
Common DBMS Design
Engines diverge most visibly in where the strategy choice is made (planner vs. executor), and in how the spill is handled. Three reference designs frame CUBRID’s choices.
PostgreSQL — nodeAgg.c, nodeWindowAgg.c
PostgreSQL’s executor has two distinct nodes:
- Agg — implemented in src/backend/executor/nodeAgg.c, driven by the AGG_HASHED / AGG_SORTED / AGG_PLAIN / AGG_MIXED strategies chosen by the planner (create_agg_plan). The strategy is a plan-time decision based on enable_hashagg, work_mem, and the estimated number of groups. When a hash agg overflows work_mem, PostgreSQL since v13 spills via grace hashing: it partitions overflow tuples to per-partition tape files and re-aggregates each partition independently. The accumulator (AggStatePerTrans) and the per-group state (AggStatePerGroup) are kept in aggcontext memory.
- WindowAgg — implemented in nodeWindowAgg.c. PostgreSQL sorts input by (PARTITION BY, ORDER BY) upstream (a separate Sort node feeds the WindowAgg) and then walks partitions, computing each function’s value within the frame. Frame logic is generic over RANGE, ROWS, and GROUPS modes.
CUBRID’s executor uses a single BUILDLIST_PROC node for
GROUP BY and a separate analytic loop, but the strategy choice is
runtime — the executor decides at end_one_iteration time
whether to keep aggregating into the hash table or fall back to
sort, based on observed selectivity (more on this below).
MySQL — Item_sum, temporary tables, Aggregator_distinct
MySQL uses Item_sum subclasses (Item_sum_avg, Item_sum_min,
…) for each aggregate, and historically materialised the
GROUP BY through a temporary table keyed on the grouping
expression. Modern MySQL (8.0+) executes GROUP BY with
AggregateIterator streaming over sorted input, or
TemptableAggregateIterator when it materialises groups into a
temporary table; the in-memory temp table still falls back to
on-disk storage when it would exceed
tmp_table_size / max_heap_table_size. Window functions are
handled by WindowIterator over a sorted stream.
The MySQL contract — one accumulator object per aggregate per
group, mutated as tuples flow through — is the same shape as
CUBRID’s AGGREGATE_TYPE chain (g_agg_list).
Oracle / SQL Server — sort/hash GROUP BY with cost-based fallback
Oracle’s optimizer chooses between HASH GROUP BY and SORT GROUP BY based on estimated cardinalities; SQL Server’s Stream Aggregate versus Hash Match (Aggregate) is the analogous
choice. Both vendors spill hash overflow to TempDB-equivalent
storage and continue with a partition-based recombination,
similar to PostgreSQL’s modern HashAgg.
Theory ↔ CUBRID mapping
| Theoretical concept | CUBRID name |
|---|---|
| Per-group accumulator | AGGREGATE_TYPE::accumulator (cubxasl::aggregate_accumulator) |
| Aggregate chain in execution order | BUILDLIST_PROC_NODE::g_agg_list (xasl.h) |
| Output-side aggregate chain (SELECT-list shape) | GROUPBY_STATE::g_output_agg_list |
| Per-group state during sort-based execution | GROUPBY_STATE (query_executor.c:213) |
| Rollup level array | GROUPBY_STATE::g_dim of GROUPBY_DIMENSION |
| Hash GROUP BY context | AGGREGATE_HASH_CONTEXT referenced via GROUPBY_STATE::agg_hash_context |
| Hash table spill list | AGGREGATE_HASH_CONTEXT::part_list_id |
| Hash table memory ceiling | PRM_ID_MAX_AGG_HASH_SIZE (default 2 MB) |
| Hash strategy degrade threshold | HASH_AGGREGATE_VH_SELECTIVITY_TUPLE_THRESHOLD (2000) + HASH_AGGREGATE_VH_SELECTIVITY_THRESHOLD (0.5) |
| Sort callback contract | sort_listfile (get_fn, put_fn, cmp_fn, key_info, ...) (external_sort.c:1347) |
| Sort-based GROUP BY put callback | qexec_gby_put_next |
| Hash-based GROUP BY put callback | qexec_hash_gby_put_next |
| Window-function per-function state | ANALYTIC_FUNCTION_STATE (query_executor.c:259) |
| Window-function execution state | ANALYTIC_STATE (query_executor.c:288) |
| Per-group tuple count (for windows) | ANALYTIC_FUNCTION_STATE::curr_group_tuple_count |
| Within-sort-key tuple count (for windows) | ANALYTIC_FUNCTION_STATE::curr_sort_key_tuple_count |
| Group-header sidecar list | ANALYTIC_FUNCTION_STATE::group_list_id |
| Sort-key sidecar list | ANALYTIC_FUNCTION_STATE::value_list_id |
| Window-function output reassembly pass | qexec_analytic_update_group_result |
CUBRID’s Approach
The qexec_execute_mainblock_internal driver, after producing
the unsorted input list file, dispatches into post-processing
through three entry points:
- qexec_groupby — for BUILDLIST_PROC XASL nodes that carry a groupby_list. This is the GROUP BY (with optional WITH ROLLUP) lane.
- qexec_execute_analytic — for XASL nodes carrying an ANALYTIC_EVAL_TYPE chain. Called once per distinct sort list required by the chain; that is, multiple analytic functions sharing the same (PARTITION BY, ORDER BY) are grouped at parse time (pt_optimize_analytic_list) so they ride one sort.
- qexec_orderby_distinct — for ORDER BY and DISTINCT, not covered in this document; it is the same sort_listfile substrate without the per-key accumulator.
```mermaid
flowchart LR
  subgraph PROC["Processing (input feed)"]
    SCAN["Scan / join / subquery"]
    LIST["xasl->list_id\n(input list file)"]
    HASH["AGGREGATE_HASH_CONTEXT\n(in-memory hash table)"]
    PART["part_list_id\n(spilled hash entries)"]
  end
  subgraph POST["Post-Processing"]
    GBY["qexec_groupby\n(qexec_execute_mainblock_internal)"]
    ANA["qexec_execute_analytic"]
    SLF1["sort_listfile\n(GBY: get=qexec_gby_get_next,\n put=qexec_gby_put_next)"]
    SLF2["sort_listfile\n(HGBY: put=qexec_hash_gby_put_next)"]
    SLF3["sort_listfile\n(ANA: put=qexec_analytic_put_next)"]
    OUT["output_file\n(QFILE_LIST_ID)"]
  end
  SCAN --> LIST
  SCAN -. hash agg eligible .-> HASH
  HASH -. LRU spill .-> PART
  HASH -. very-high selectivity .-> PART
  LIST --> GBY
  PART --> GBY
  HASH --> GBY
  GBY -- sort path --> SLF1 --> OUT
  GBY -- hash path --> SLF2 --> OUT
  LIST --> ANA --> SLF3 --> OUT
```
The three sub-topics — aggregation, analytic, and the sort/hash
choice — share the GROUPBY state struct, the
AGGREGATE_HASH_CONTEXT, and the sort_listfile substrate, but
each has its own put-callback and its own state machine. We walk
them in turn.
Aggregation / GROUP BY
The GROUP BY lane has two state objects: GROUPBY_STATE
(per-query, holds the sort/hash strategy choice and the rollup
dimension array) and GROUPBY_DIMENSION (one slot per rollup
level, each carrying its own d_agg_list for accumulators).
```c
// GROUPBY_STATE — src/query/query_executor.c (excerpt)
typedef struct groupby_state GROUPBY_STATE;
struct groupby_state
{
  int state;
  SORTKEY_INFO key_info;
  QFILE_LIST_SCAN_ID *input_scan;
  QFILE_LIST_ID *output_file;

  PRED_EXPR *having_pred;
  AGGREGATE_TYPE *g_output_agg_list;    /* SELECT-list shape */
  REGU_VARIABLE_LIST g_regu_list;       /* fetch into db_value */
  REGU_VARIABLE_LIST g_hk_regu_list;    /* hash-key reguvars */
  VAL_LIST *g_val_list;
  OUTPTR_LIST *g_outptr_list;
  XASL_NODE *xasl;
  XASL_STATE *xasl_state;

  RECDES current_key;                   /* last seen group key */
  RECDES gby_rec;
  QFILE_TUPLE_RECORD input_tpl;
  int input_recs;

  bool with_rollup;
  GROUPBY_DIMENSION *g_dim;             /* rollup levels */
  int g_dim_levels;

  int hash_eligible;
  AGGREGATE_HASH_CONTEXT *agg_hash_context;
  // ... condensed ...
};
```

The XASL generator pre-builds five lists on the
BUILDLIST_PROC_NODE that this state binds to: g_outptr_list
(SELECT-list shape, what the user sees), g_val_list (non-aggregate
columns and aggregate arguments, in GROUP BY order with
SELECT-only columns appended at the tail), g_regu_list (reguvars
that fetch tuple positions into those db_values), g_agg_list
(one AGGREGATE_TYPE node per aggregate function, holding both
accumulator.value and the function’s argument), and
g_hk_sort_regu_list (hash-key reguvars, only present when
hash_eligible == true). The deck (Sort, Hash Group By.pptx,
slides 2-7) emphasises the GROUP-BY-order vs. SELECT-list-order
distinction; this matters because the input tuples written by
processing follow g_outptr_list’s GROUP BY order, so the sort
key reads cleanly off the head of each tuple.
The ordering of the sub-lists is deliberate. The deck’s example
SELECT c1, c2, avg(c1) FROM t1 GROUP BY c2, c1 produces an
input list file shaped (c2, c1, c1_arg) even though the
SELECT-list reads (c1, c2, avg(c1)). The g_outptr_list
preserves SELECT-list shape for output composition, while the
input list file is in GROUP BY order so that the sort key prefix
of every tuple is exactly the grouping vector.
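The split can be mimicked with a toy output pointer list for the deck's query. All names below (tuple, outptr, compose_output) are invented for illustration, not CUBRID's: the input tuple is in GROUP BY order (c2, c1, c1_arg) and the pointer list reassembles SELECT order (c1, c2, avg):

```c
#include <assert.h>

/* Input tuple in GROUP BY order for
 * SELECT c1, c2, avg(c1) FROM t1 GROUP BY c2, c1:
 * vals[0] = c2, vals[1] = c1, vals[2] = c1_arg.  Invented for the
 * example. */
typedef struct { double vals[3]; } tuple;

/* Toy output pointer list: for each SELECT column, the input tuple
 * position, or -1 meaning "take the finalized aggregate value". */
static const int outptr[3] = { 1 /* c1 */, 0 /* c2 */, -1 /* avg(c1) */ };

static void compose_output (const tuple *in, double agg_value, double *out)
{
  for (int i = 0; i < 3; i++)
    out[i] = (outptr[i] < 0) ? agg_value : in->vals[outptr[i]];
}
```

The sort key is simply the prefix (vals[0], vals[1]) of every input tuple, which is the point of the GROUP-BY-order layout.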
Sort-based pipeline
qexec_groupby is the entry; it allocates a GROUPBY_STATE,
opens an output list file, and hands the input list file to
sort_listfile with two callbacks:
```c
// qexec_groupby — src/query/query_executor.c (condensed)
qexec_groupby (THREAD_ENTRY * thread_p, XASL_NODE * xasl, ...)
{
  // ... condensed: late binding, grbynum init ...
  if (qexec_initialize_groupby_state (&gbstate, buildlist->groupby_list,
                                      ...) == NULL)
    GOTO_EXIT_ON_ERROR;

  /* open output list file */
  gbstate.output_file = qfile_open_list (thread_p, &output_type_list, ...);

  /* quick-finalize escapes for empty input or pure hash-agg ... */

  /* sort + aggregate: callback sandwich */
  if (sort_listfile (thread_p, NULL_VOLID, estimated_pages,
                     &qexec_gby_get_next, &gbstate, &qexec_gby_put_next,
                     &gbstate, gbstate.cmp_fn, &gbstate.key_info, SORT_DUP,
                     NO_SORT_LIMIT, ...) != NO_ERROR)
    GOTO_EXIT_ON_ERROR;
  // ... condensed ...
}
```

qexec_gby_get_next is trivial: it pulls the next tuple from
the input list scan and turns it into a SORT_REC:
```c
// qexec_gby_get_next — src/query/query_executor.c
static SORT_STATUS
qexec_gby_get_next (THREAD_ENTRY * thread_p, RECDES * recdes, void *arg)
{
  GROUPBY_STATE *gbstate = (GROUPBY_STATE *) arg;
  return qfile_make_sort_key (thread_p, &gbstate->key_info, recdes,
                              gbstate->input_scan, &gbstate->input_tpl);
}
```

qexec_gby_put_next is where aggregation happens. The sort
module calls it once per sorted record (or once per chain of
duplicates when SORT_DUP is set); the callback compares the
record’s key against gbstate->current_key, finalises the prior
group when they differ, starts a new group, and folds the new
tuple into the accumulator chain. The state-machine sketch:
```mermaid
stateDiagram-v2
  [*] --> Initial : sort begins
  Initial --> StartGroup : first sort_record\n(qexec_gby_start_group_dim)
  StartGroup --> Accumulating : qexec_gby_agg_tuple
  Accumulating --> Accumulating : same key\n(qexec_gby_agg_tuple)
  Accumulating --> Finalize : new key\n(qexec_gby_finalize_group_dim)
  Finalize --> StartGroup : qexec_gby_start_group(0)
  Accumulating --> EndOfInput : sort drains
  EndOfInput --> Finalize : final group
  Finalize --> [*]
```
The actual key comparison and dimension-walking lives inside
qexec_gby_finalize_group_dim, which is the rollup-aware
finalize path:
```c
// qexec_gby_finalize_group_dim — src/query/query_executor.c (condensed)
static int
qexec_gby_finalize_group_dim (THREAD_ENTRY * thread_p,
                              GROUPBY_STATE * gbstate, const RECDES * recdes)
{
  int i, j, nkeys, level = 0;

  qexec_gby_finalize_group (thread_p, gbstate, 0, gbstate->with_rollup);
  if (gbstate->with_rollup && recdes)
    {
      SORT_REC *key = (SORT_REC *) recdes->data;
      level = gbstate->g_dim_levels;
      nkeys = gbstate->key_info.nkeys;

      /* find the first key prefix that differs and finalize all
       * dimensions above that level */
      for (i = 1; i < nkeys; i++)
        {
          gbstate->key_info.nkeys = i;
          if ((*gbstate->cmp_fn) (&gbstate->current_key.data, &key,
                                  &gbstate->key_info) != 0)
            {
              for (j = gbstate->g_dim_levels - 1; j > i; j--)
                {
                  qexec_gby_finalize_group (thread_p, gbstate, j, true);
                  qexec_gby_start_group (thread_p, gbstate, NULL, j);
                }
              level = i + 1;
              break;
            }
        }
      gbstate->key_info.nkeys = nkeys;
    }

  qexec_gby_start_group (thread_p, gbstate, recdes, 0);
  return level;
}
```

Two non-obvious facts. First, the rollup loop finalizes the
deepest dimension first and walks up: if the prefix c1 is
unchanged but (c1,c2) differs, the dimensions tracking
(c1,c2,c3) and (c1,c2) are finalized while the c1-only and
total-aggregate dimensions stay open. The deck’s slide on
GROUPBY_DIMENSION (Sort, Hash Group By.pptx, slide 24) calls
this out as a distinctive point — g_dim[0] is the most-detailed level
(all GROUP BY columns), g_dim[g_dim_levels-1] is the total
aggregate, and the intermediate slots are the rollup prefixes.
Second, the sort module’s nkeys field is mutated during the
comparison loop to compare prefix-by-prefix; the original value
is saved and restored to keep the outer sort consistent.
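The prefix walk reduces to a small piece of arithmetic. The helper below is invented for illustration (CUBRID does this inline with the sort comparator): compare old and new keys column by column; the index of the first differing column determines how many of the deepest dimensions must be finalized:

```c
#include <assert.h>

/* Toy rollup bookkeeping.  With nkeys grouping columns there are
 * nkeys + 1 dimensions: g_dim[0] = all columns, ...,
 * g_dim[nkeys] = total aggregate.  Returns how many of the deepest
 * dimensions are finalized on this key transition.  Invented name. */
static int rollup_dims_to_finalize (const int *old_key, const int *new_key,
                                    int nkeys)
{
  int first_diff = nkeys;          /* index of first differing column */
  for (int i = 0; i < nkeys; i++)
    if (old_key[i] != new_key[i])
      {
        first_diff = i;
        break;
      }
  /* every dimension whose key prefix includes column first_diff
   * changes; shallower prefixes and the total stay open */
  return nkeys - first_diff;
}
```

For GROUP BY c1, c2, c3 moving from (1,1,3) to (1,2,1), the first difference is at column index 1, so two dimensions — (c1,c2,c3) and (c1,c2) — are finalized while (c1) and the total stay open.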
Each qexec_gby_finalize_group call dumps the dimension’s
accumulators into gbstate->output_file:
```c
// qexec_gby_finalize_group — src/query/query_executor.c (condensed)
static void
qexec_gby_finalize_group (THREAD_ENTRY * thread_p, GROUPBY_STATE * gbstate,
                          int N, bool keep_list_file)
{
  // ... condensed ...
  qdata_finalize_aggregate_list (thread_p, gbstate->g_dim[N].d_agg_list,
                                 keep_list_file, NULL);

  /* move dimension's accumulators into g_output_agg_list */
  AGGREGATE_TYPE *g_outp = gbstate->g_output_agg_list;
  AGGREGATE_TYPE *d_aggp = gbstate->g_dim[N].d_agg_list;
  while (g_outp != NULL && d_aggp != NULL)
    {
      if (g_outp->function != PT_GROUPBY_NUM
          && d_aggp->accumulator.value != NULL
          && g_outp->accumulator.value != NULL)
        {
          pr_clear_value (g_outp->accumulator.value);
          *g_outp->accumulator.value = *d_aggp->accumulator.value;
          PRIM_SET_NULL (d_aggp->accumulator.value);
        }
      g_outp = g_outp->next;
      d_aggp = d_aggp->next;
    }

  /* HAVING + GROUP_BY_NUM evaluation */
  if (gbstate->having_pred
      && eval_pred (thread_p, gbstate->having_pred, ...) != V_TRUE)
    goto wrapup;

  /* compose output tuple via output pointer list and append */
  qexec_generate_tuple_descriptor (thread_p, gbstate->output_file,
                                   gbstate->g_outptr_list, &xasl_state->vd);
  qfile_generate_tuple_into_list (thread_p, gbstate->output_file, T_NORMAL);
  // ... condensed ...
}
```

The accumulator-to-output value copy uses an aliasing shortcut
— *g_outp->accumulator.value = *d_aggp->accumulator.value — and
then immediately blanks the dimension accumulator with
PRIM_SET_NULL. This is intentional: the SELECT-list-shaped
g_output_agg_list shares its accumulator.value db_values with
g_outptr_list’s reguvars, so the tuple-generation pass picks up
the just-finalized aggregate by virtue of the shared pointer.
The deck (slide 16, “g_agg_list->accumulator.value is shared with
g_output_list, so …”) makes this sharing explicit.
The accumulator-update step itself runs inside qexec_gby_put_next’s
non-finalize branch via qexec_gby_agg_tuple:
```c
// qexec_gby_agg_tuple — src/query/query_executor.c
static void
qexec_gby_agg_tuple (THREAD_ENTRY * thread_p, GROUPBY_STATE * gbstate,
                     QFILE_TUPLE tpl, int peek)
{
  if (fetch_val_list (thread_p, gbstate->g_regu_list,
                      &gbstate->xasl_state->vd, NULL, NULL, tpl,
                      peek) != NO_ERROR)
    GOTO_EXIT_ON_ERROR;

  for (int i = 0; i < gbstate->g_dim_levels; i++)
    {
      if (qdata_evaluate_aggregate_list (thread_p,
                                         gbstate->g_dim[i].d_agg_list,
                                         &gbstate->xasl_state->vd, NULL,
                                         false) != NO_ERROR)
        GOTO_EXIT_ON_ERROR;
    }
}
```

The inner loop walks every rollup dimension and folds the
same input tuple into each — the most-detailed dimension and
each rollup prefix. qdata_evaluate_aggregate_list (in
query_aggregate.cpp) is the per-aggregate-function dispatch,
with a separate inner switch per PT_MIN, PT_MAX, PT_SUM,
PT_AVG, PT_COUNT, etc. The deck (page 9) summarises the
behaviours: PT_MIN/PT_MAX overwrite when the new value
beats the running extremum; PT_AVG/PT_SUM initialise on the
first non-NULL value and add on subsequent ones; PT_COUNT
increments unconditionally for COUNT(*) and only on non-NULL
for COUNT(col).
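A reduced sketch of that per-function dispatch, with an invented accumulator struct covering just MIN, SUM, and COUNT over nullable values — enough to show the three behaviours the deck lists:

```c
#include <assert.h>

/* Invented accumulator for the example; CUBRID's real dispatch is the
 * switch inside qdata_evaluate_aggregate_list. */
typedef enum { F_MIN, F_SUM, F_COUNT } agg_fn;
typedef struct { agg_fn fn; int seen; long value; } accumulator;

static void acc_fold (accumulator *a, int is_null, long v)
{
  switch (a->fn)
    {
    case F_COUNT:
      if (!is_null)                 /* COUNT(col): non-NULL rows only */
        a->value++;
      break;
    case F_MIN:
      if (!is_null && (!a->seen || v < a->value))
        a->value = v;               /* overwrite when v beats the extremum */
      break;
    case F_SUM:
      if (!is_null)
        a->value = a->seen ? a->value + v : v;  /* init on first non-NULL */
      break;
    }
  if (!is_null)
    a->seen = 1;
}
```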
Group key transition: qexec_gby_start_group
When a new key arrives, qexec_gby_start_group records it as
gbstate->current_key and re-initialises the dimension’s
aggregate list:
```c
// qexec_gby_start_group — src/query/query_executor.c (condensed)
static void
qexec_gby_start_group (THREAD_ENTRY * thread_p, GROUPBY_STATE * gbstate,
                       const RECDES * recdes, int N)
{
  if (N == 0 && recdes)
    {
      /* save key for later equality tests */
      memcpy (gbstate->current_key.data, recdes->data, recdes->length);
      gbstate->current_key.length = recdes->length;
    }
  /* zero the accumulators of dimension N */
  QEXEC_CLEAR_AGG_LIST_VALUE (gbstate->g_dim[N].d_agg_list);
  qdata_initialize_aggregate_list (thread_p, gbstate->g_dim[N].d_agg_list,
                                   gbstate->xasl_state->query_id);
}
```

Only N == 0 (the most-detailed dimension) saves the key; the
rollup dimensions piggyback on the same key but with truncated
prefixes computed by the comparator.
Hash-based pipeline
When hash_eligible == true, the executor builds an
AGGREGATE_HASH_CONTEXT during processing (i.e., as input
tuples arrive at end_one_iteration) and routes them through
qexec_hash_gby_agg_tuple before they reach xasl->list_id.
The post-processing is what cleans up the hash table:
```c
// qexec_hash_gby_agg_tuple — src/query/query_executor.c (condensed)
static int
qexec_hash_gby_agg_tuple (THREAD_ENTRY * thread_p, XASL_NODE * xasl,
                          XASL_STATE * xasl_state, BUILDLIST_PROC_NODE * proc,
                          QFILE_TUPLE_RECORD * tplrec,
                          QFILE_TUPLE_DESCRIPTOR * tpldesc,
                          QFILE_LIST_ID * groupby_list, bool * output_tuple)
{
  AGGREGATE_HASH_CONTEXT *context = proc->agg_hash_context;
  AGGREGATE_HASH_KEY *key = context->temp_key;
  AGGREGATE_HASH_VALUE *value;

  if (context->state == HS_REJECT_ALL)
    return NO_ERROR;            /* sort fallback engaged */

  qexec_build_agg_hkey (thread_p, xasl_state, proc->g_hk_scan_regu_list,
                        NULL, key);
  value = (AGGREGATE_HASH_VALUE *) mht_get (context->hash_table, (void *) key);

  if (value == NULL)
    {
      /* new group: copy key, allocate accumulators, save first tuple */
      AGGREGATE_HASH_KEY *new_key = qdata_copy_agg_hkey (thread_p, key);
      AGGREGATE_HASH_VALUE *new_value =
        qdata_alloc_agg_hvalue (thread_p, proc->g_func_count,
                                proc->g_agg_list);
      // ... condensed: stash first tuple into new_value->first_tuple ...
      mht_put (context->hash_table, new_key, new_value);
      context->tuple_count++;
      context->group_count++;
      context->hash_size += qdata_get_agg_hkey_size (new_key);
      context->hash_size += qdata_get_agg_hvalue_size (new_value, false);
    }
  else
    {
      /* existing group: fold accumulators */
      *output_tuple = false;
      value->tuple_count++;
      context->tuple_count++;
      qdata_evaluate_aggregate_list (thread_p, proc->g_agg_list,
                                     &xasl_state->vd, value->accumulators,
                                     false);
      context->hash_size += qdata_get_agg_hvalue_size (value, true);
    }

  /* keep hash table within memory limit (LRU spill) */
  while (context->hash_size > (int) mem_limit)
    {
      HENTRY_PTR hentry = context->hash_table->lru_head;
      key = (AGGREGATE_HASH_KEY *) hentry->key;
      value = (AGGREGATE_HASH_VALUE *) hentry->data;
      qdata_save_agg_hentry_to_list (thread_p, key, value,
                                     context->temp_dbval_array,
                                     context->part_list_id);
      if (value->first_tuple.tpl != NULL)
        qfile_add_tuple_to_list (thread_p, groupby_list,
                                 value->first_tuple.tpl);
      context->hash_size -= qdata_get_agg_hkey_size (key);
      context->hash_size -= qdata_get_agg_hvalue_size (value, false);
      mht_rem (context->hash_table, key, qdata_free_agg_hentry, NULL);
    }

  /* very-high selectivity → bail to sort path */
  if (context->tuple_count > HASH_AGGREGATE_VH_SELECTIVITY_TUPLE_THRESHOLD)
    {
      float selectivity = (float) context->group_count / context->tuple_count;
      if (selectivity > HASH_AGGREGATE_VH_SELECTIVITY_THRESHOLD)
        {
          context->state = HS_REJECT_ALL;
          qdata_save_agg_htable_to_list (thread_p, context->hash_table,
                                         groupby_list, context->part_list_id,
                                         context->temp_dbval_array);
        }
    }
  // ... condensed ...
}
```

This is the function that materialises the deck’s two
“key-duplication” cases (slides 18-20). When the hash table
exceeds prm_get_bigint_value (PRM_ID_MAX_AGG_HASH_SIZE) (default
2 MB), an LRU entry is spilled to part_list_id and removed;
nothing prevents the same key from being re-inserted later, so
part_list_id may end up with multiple entries per key, each
already partially aggregated. When the selectivity (groups
divided by tuples) exceeds 0.5 over more than 2000 input tuples,
the executor concludes the hash strategy is wasting memory on
near-unique keys, sets state = HS_REJECT_ALL, dumps the entire
table to part_list_id, and from that point forward writes new
tuples directly to xasl->list_id for sort-based aggregation.
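The bail-out check is just a ratio against two constants. should_reject_hash is an invented helper that mirrors the thresholds named above:

```c
#include <assert.h>

/* Constants mirror HASH_AGGREGATE_VH_SELECTIVITY_TUPLE_THRESHOLD and
 * HASH_AGGREGATE_VH_SELECTIVITY_THRESHOLD; the helper itself is
 * invented for illustration. */
#define VH_TUPLE_THRESHOLD 2000
#define VH_SELECTIVITY_THRESHOLD 0.5f

static int should_reject_hash (int tuple_count, int group_count)
{
  if (tuple_count <= VH_TUPLE_THRESHOLD)
    return 0;                   /* not enough evidence yet */
  float selectivity = (float) group_count / (float) tuple_count;
  return selectivity > VH_SELECTIVITY_THRESHOLD;
}
```

So 2,000 groups over 3,000 tuples (selectivity 0.67) rejects the hash path, while 1,200 groups over the same input (0.4) keeps it.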
Post-processing for the hash path then runs sort_listfile
twice — once over part_list_id (the hash overflow), then once
over xasl->list_id (the sort fallback). The first sort uses
qexec_hash_gby_put_next as the put callback to merge
accumulators within sorted runs:
```c
// qexec_hash_gby_put_next — src/query/query_executor.c (condensed)
static int
qexec_hash_gby_put_next (THREAD_ENTRY * thread_p, const RECDES * recdes,
                         void *arg)
{
  GROUPBY_STATE *state = (GROUPBY_STATE *) arg;
  AGGREGATE_HASH_CONTEXT *context = state->agg_hash_context;

  for (SORT_REC *key = (SORT_REC *) recdes->data; key; key = key->next)
    {
      // ... condensed: load (key, value) from sorted record ...
      qdata_load_agg_hentry_from_tuple (thread_p, data,
                                        context->temp_part_key,
                                        context->temp_part_value, ...);

      if (qdata_agg_hkey_eq (context->curr_part_key, context->temp_part_key)
          && context->sorted_count > 0)
        {
          /* same key: fuse accumulators */
          AGGREGATE_TYPE *agg_list = state->g_output_agg_list;
          int i = 0;
          while (agg_list != NULL)
            {
              qdata_aggregate_accumulator_to_accumulator (
                thread_p, &context->curr_part_value->accumulators[i],
                &agg_list->accumulator_domain, agg_list->function,
                agg_list->domain, &context->temp_part_value->accumulators[i]);
              agg_list = agg_list->next;
              i++;
            }
        }
      else
        {
          /* different key: dump curr to sorted_part_list_id, swap */
          if (context->sorted_count > 0)
            qdata_save_agg_hentry_to_list (thread_p, context->curr_part_key,
                                           context->curr_part_value,
                                           context->temp_dbval_array,
                                           context->sorted_part_list_id);
          /* swap (curr, temp) */
          // ... condensed ...
        }
      context->sorted_count++;
    }
  return NO_ERROR;
}
```

The fusion here is qdata_aggregate_accumulator_to_accumulator
— two pre-aggregated values (e.g., two partial sums for the same
key) being combined. This is the hash-spill counterpart of
qdata_evaluate_aggregate_list, which combines an accumulator
with a raw tuple value.
After this first pass, sorted_part_list_id holds one tuple per
distinct key with the merged accumulators. The second pass over
xasl->list_id reuses the sort-based qexec_gby_put_next
machinery — the only addition is a cross-reference into
sorted_part_list_id via qexec_locate_agg_hentry_in_list to
pick up any residual partial aggregates that a key already has
on the hash side.
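The distinction between the two combine primitives is clearest for AVG, whose partial state is a (sum, count) pair: merging two partials adds component-wise, and the division is deferred to finalize. The structs below are illustrative, not CUBRID's:

```c
#include <assert.h>

/* Invented partial-aggregate state for AVG. */
typedef struct { double sum; long count; } avg_acc;

/* accumulator-to-accumulator: fuse two pre-aggregated partials for
 * the same key, without ever seeing raw tuples */
static void avg_merge (avg_acc *dst, const avg_acc *src)
{
  dst->sum += src->sum;         /* partial sums add */
  dst->count += src->count;     /* partial counts add */
}

/* finalize: the division happens once, at output time */
static double avg_finalize (const avg_acc *a)
{
  return a->count ? a->sum / (double) a->count : 0.0;
}
```

Folding a raw tuple into an accumulator (the qdata_evaluate_aggregate_list path) is just the degenerate merge of a partial with count 1.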
WITH ROLLUP — multi-dimension finalize
GROUP BY ... WITH ROLLUP adds g_dim_levels = nkeys + 1
slots. Each input tuple, in qexec_gby_agg_tuple, is folded
into every dimension. At a key boundary,
qexec_gby_finalize_group_dim walks dimensions deepest-first and
finalizes only those whose key prefix changed. The deck’s worked
example for GROUP BY c1, c2, c3 WITH ROLLUP (page 19-21) shows
how moving from (1,1,3) to (1,2,1) finalizes the deepest two
dimensions but keeps the (c1) and total dimensions running.
A subtle correctness issue surfaces in CBRD-25611 (deck slide
29): when the SELECT-list contains an expression ceil(c1) and
that expression lands in a position before c1 itself in the
GROUP BY order, the rollup finalize that NULLs out the c1
column also propagates that NULL through the expression. The
fix has to thread the original (non-NULL) value through to the
expression evaluator independently of the dimension flag.
Analytic / Window Functions
Window functions are structurally similar to aggregation — both
sort by a key and walk the sorted stream — but differ in two
ways: every input tuple is also an output tuple, and several
window functions need random access to other tuples in the
same partition (for LAG, LEAD, NTH_VALUE, MEDIAN,
PERCENTILE_CONT). CUBRID handles both by writing two sidecar
list files per analytic function during the first pass and
re-scanning them during a second qexec_analytic_update_group_result
pass.
State objects
```c
// ANALYTIC_FUNCTION_STATE — src/query/query_executor.c (excerpt)
struct analytic_function_state
{
  ANALYTIC_TYPE *func_p;
  RECDES current_key;

  QFILE_LIST_ID *group_list_id;   /* file containing group headers */
  QFILE_LIST_ID *value_list_id;   /* file containing group values */
  QFILE_LIST_SCAN_ID group_scan_id;
  QFILE_LIST_SCAN_ID value_scan_id;
  QFILE_TUPLE_RECORD group_tplrec;
  QFILE_TUPLE_RECORD value_tplrec;

  DB_VALUE cgtc_dbval, cgtc_nn_dbval, csktc_dbval;
  int curr_group_tuple_count;     /* tuples in current group */
  int curr_group_tuple_count_nn;  /* tuples in current group, non-NULL */
  int curr_sort_key_tuple_count;  /* tuples sharing current sort key */

  int group_tuple_position;       /* scan position in current group */
  int group_tuple_position_nn;    /* same, ignoring NULL values */
  int sort_key_tuple_position;    /* scan position in current sort key */
  int group_consumed_tuples;      /* tuples already emitted */
};
```

The two sidecar list files are the heart of the design.
group_list_id carries one (count, count_nn) tuple per partition;
value_list_id carries one (sort_key_count, value) tuple per
distinct sort key within a partition. The first pass produces
these two files and a third — analytic_state->interm_file,
which is the input list file with hidden columns added so the
re-scan can identify partition and sort-key boundaries by
position alone. The second pass re-reads interm_file linearly,
positions cursors in group_list_id and value_list_id, and
emits one output tuple per input tuple.
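What those per-partition and per-sort-key boundaries enable can be sketched with a toy RANK() over rows already sorted by (partition, sort_key). rank_over is invented for the example; it tracks the boundaries positionally, the way the second pass walks interm_file:

```c
#include <assert.h>

/* Row already sorted by (part, key) — invented for the example. */
typedef struct { int part; int key; } srow;

/* Computes RANK() OVER (PARTITION BY part ORDER BY key) for each row
 * in one linear pass, writing the result into rank_out. */
static void rank_over (const srow *r, int n, int *rank_out)
{
  int part_start = 0;
  for (int i = 0; i < n; i++)
    {
      if (i > 0 && r[i].part != r[i - 1].part)
        part_start = i;                     /* partition boundary */
      if (i == part_start || r[i].key != r[i - 1].key)
        rank_out[i] = i - part_start + 1;   /* new sort key: rank jumps */
      else
        rank_out[i] = rank_out[i - 1];      /* tie: keep the rank */
    }
}
```

Functions such as NTH_VALUE or MEDIAN additionally need the total partition size before the partition has been fully scanned — which is exactly why CUBRID parks those counts in the sidecar files and resolves them in the second pass.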
qexec_execute_analytic — outer loop
```c
// qexec_execute_analytic — src/query/query_executor.c (condensed)
qexec_execute_analytic (THREAD_ENTRY * thread_p, XASL_NODE * xasl,
                        XASL_STATE * xasl_state,
                        ANALYTIC_EVAL_TYPE * analytic_eval, ...)
{
  // ... condensed: setup state, regulist domains, intermediate file ...
  qexec_initialize_analytic_state (thread_p, &analytic_state,
                                   analytic_eval->head, ...);

  /* open intermediate and output files; intermediate carries
   * the input tuple shape plus hidden cols for partition/order tracking */
  if (is_skip_sort)
    analytic_state.interm_file = list_id;
  else
    analytic_state.interm_file =
      qfile_open_list (thread_p, &interm_type_list, NULL,
                       xasl_state->query_id, ls_flag, NULL);

  /* open input scan + intermediate scan */
  qfile_open_list_scan (list_id, &input_scan_id);
  qfile_open_list_scan (analytic_state.interm_file, &interm_scan_id);

  if (is_skip_sort)
    {
      /* sort can be skipped if input arrived already sorted (planner) */
      qexec_analytic_update_group_result (thread_p, &analytic_state);
      goto wrapup;
    }

  /* sort + first-pass: writes interm_file, group_list_id, value_list_id */
  estimated_pages =
    qfile_get_estimated_pages_for_sorting (list_id,
                                           &analytic_state.key_info);
  analytic_state.key_info.use_original = 1;
  analytic_state.cmp_fn = &qfile_compare_partial_sort_record;

  if (sort_listfile (thread_p, NULL_VOLID, estimated_pages,
                     &qexec_analytic_get_next, &analytic_state,
                     &qexec_analytic_put_next, &analytic_state,
                     analytic_state.cmp_fn, &analytic_state.key_info,
                     SORT_DUP, NO_SORT_LIMIT, ..., SORT_ANALYTIC) != NO_ERROR)
    GOTO_EXIT_ON_ERROR;

  /* finalize trailing groups (sort_listfile has no terminate hook) */
  if (analytic_state.input_recs != 0)
    {
      for (i = 0; i < analytic_state.func_count; i++)
        qexec_analytic_finalize_group (thread_p, analytic_state.xasl_state,
                                       &analytic_state.func_state_list[i],
                                       false);

      /* second pass: re-scan interm_file, position sidecars, emit output */
      qexec_analytic_update_group_result (thread_p, &analytic_state);
    }
  // ... condensed ...
}
```

The driver is called once per distinct sort list. The query
parser groups window functions that share the same
(PARTITION BY, ORDER BY) into one ANALYTIC_EVAL_TYPE, so the
upstream parse-tree pass pt_optimize_analytic_list minimises
the number of sort passes. The deck’s example —
max(c3) over (partition by c1 order by c2),
rank() over (order by c1), min(c3) over (order by c2) — is
two sort passes: one keyed on (c1,c2) for the first two
functions, one keyed on c2 for the third.
First pass: qexec_analytic_put_next
Section titled “First pass: qexec_analytic_put_next”flowchart TD
A["sort_record arrives in qexec_analytic_put_next"]
B{"current_key is NULL?\n(first record)"}
C["start_group(reinit=true)\nfor every function"]
D{"key matches current_key\n(full sort_list_size)?"}
E["KEEP_RANK; continue"]
F{"sort_prefix_size == 0?\n(no PARTITION BY)"}
G{"key matches current_key\n(prefix only)?"}
H["finalize_group(is_same_group=true)\nstart_group(reinit=false)"]
I["finalize_group(is_same_group=false)\nstart_group(reinit=true)"]
J["qexec_analytic_add_tuple\n(for every function)"]
A --> B
B -- yes --> C --> J
B -- no --> D
D -- yes --> E --> J
D -- no --> F
F -- yes (no partition) --> H
F -- no --> G
G -- yes (same partition) --> H
G -- no --> I
H --> J
I --> J
The two-tier comparison — full sort key first, then prefix-only
on mismatch — is the partition vs. ordering distinction. The
inner loop in qexec_analytic_put_next (page 4 of the deck)
mutates analytic_state->key_info.nkeys to switch between
func_p->sort_list_size (full key, partition + order) and
func_p->sort_prefix_size (PARTITION BY columns only):
```c
// qexec_analytic_put_next — src/query/query_executor.c (condensed)
static int
qexec_analytic_put_next (THREAD_ENTRY * thread_p, const RECDES * recdes, void *arg)
{
  ANALYTIC_STATE *analytic_state = (ANALYTIC_STATE *) arg;
  // ... condensed: load original tuple from sort key page ...

  if (analytic_state->input_recs == 0)
    {
      /* first record: start every function */
      for (i = 0; i < analytic_state->func_count; i++)
        qexec_analytic_start_group (thread_p, analytic_state->xasl_state,
                                    &analytic_state->func_state_list[i], recdes,
                                    /*reinit=*/ true);
    }
  else
    {
      for (i = 0; i < analytic_state->func_count; i++)
        {
          ANALYTIC_FUNCTION_STATE *func_state = &analytic_state->func_state_list[i];
          bool is_same_group = false;

          /* compare full sort key */
          int nkeys = analytic_state->key_info.nkeys;
          analytic_state->key_info.nkeys = func_state->func_p->sort_list_size;
          if (((*analytic_state->cmp_fn) (&func_state->current_key.data, &key,
                                          &analytic_state->key_info) == 0)
              || analytic_state->key_info.nkeys == 0)
            {
              analytic_state->key_info.nkeys = nkeys;
              ANALYTIC_FUNC_SET_FLAG (func_state->func_p, ANALYTIC_KEEP_RANK);
              if (QPROC_ANALYTIC_IS_OFFSET_FUNCTION (func_state->func_p))
                is_same_group = true;   /* offset funcs need per-tuple distinct */
              else
                continue;               /* same sort key, no work */
            }
          analytic_state->key_info.nkeys = nkeys;

          /* compare prefix only */
          if (!is_same_group)
            {
              if (func_state->func_p->sort_prefix_size == 0)
                is_same_group = true;
              else
                {
                  nkeys = analytic_state->key_info.nkeys;
                  analytic_state->key_info.nkeys = func_state->func_p->sort_prefix_size;
                  if ((*analytic_state->cmp_fn) (&func_state->current_key.data, &key,
                                                 &analytic_state->key_info) == 0)
                    is_same_group = true;
                  analytic_state->key_info.nkeys = nkeys;
                }
            }

          if (!is_same_group)
            {
              /* partition boundary */
              qexec_analytic_finalize_group (thread_p, analytic_state->xasl_state,
                                             func_state, /*is_same_group=*/ false);
              pr_clear_value (func_state->func_p->value);
              qexec_analytic_start_group (thread_p, analytic_state->xasl_state,
                                          func_state, recdes, /*reinit=*/ true);
            }
          else if (...)
            {
              /* sort-key boundary (within partition) */
              qexec_analytic_finalize_group (thread_p, analytic_state->xasl_state,
                                             func_state, /*is_same_group=*/ true);
              qexec_analytic_start_group (thread_p, analytic_state->xasl_state,
                                          func_state, recdes, /*reinit=*/ false);
            }
        }
    }

  qexec_analytic_add_tuple (thread_p, analytic_state, data, peek);
  analytic_state->input_recs++;
}
```

The two `start_group` arguments matter: `reinit=true` means
“new partition, reset accumulator and counters from scratch”;
reinit=false means “still in same partition, but new sort key
— save the partial value and reset only the sort-key counter”.
The deck’s slide on qexec_analytic_start_group (page 4)
contrasts the two cases.
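The whole decision can be condensed into a small pure function. This is a model, not the CUBRID code: Python tuple slicing stands in for the saved-and-restored `key_info.nkeys` trick, and the offset-function special case is omitted.

```python
def classify_transition(prev_key, key, sort_list_size, sort_prefix_size):
    """Mirror of the put_next decision: compare the full sort key first,
    then only the PARTITION BY prefix on mismatch."""
    if prev_key is None:
        return "first_record"      # start_group(reinit=true)
    if key[:sort_list_size] == prev_key[:sort_list_size]:
        return "same_sort_key"     # no boundary; just add_tuple
    if sort_prefix_size == 0 or key[:sort_prefix_size] == prev_key[:sort_prefix_size]:
        return "new_sort_key"      # finalize(is_same_group=true), start_group(reinit=false)
    return "new_partition"         # finalize(is_same_group=false), start_group(reinit=true)

# max(c3) over (partition by c1 order by c2): sort_list_size=2, sort_prefix_size=1
keys = [None, (1, 10), (1, 10), (1, 20), (2, 5)]
transitions = [classify_transition(keys[i], keys[i + 1], 2, 1) for i in range(4)]
```

Note the `sort_prefix_size == 0` branch: with no PARTITION BY, a sort-key change is never a partition boundary, matching the F-to-H edge in the flowchart above.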
Per-function state machine
stateDiagram-v2
    [*] --> Initial : sort begins
    Initial --> NewPartition : first record\nstart_group(reinit=true)
    NewPartition --> SameSortKey : add_tuple
    SameSortKey --> SameSortKey : same sort key\nadd_tuple
    SameSortKey --> NewSortKey : sort key changes\n(within partition)
    NewSortKey --> NewPartitionFinalize : partition prefix changes
    NewPartitionFinalize --> NewPartition : finalize(is_same_group=false)\nstart_group(reinit=true)
    NewSortKey --> SameSortKey : finalize(is_same_group=true)\nstart_group(reinit=false)\nadd_tuple
    SameSortKey --> EOI : sort drains
    EOI --> FinalFinalize : finalize trailing group
    FinalFinalize --> SecondPass : update_group_result
    SecondPass --> [*]
qexec_analytic_finalize_group writes one tuple to each of the
two sidecar list files:
```c
// qexec_analytic_finalize_group — src/query/query_executor.c (condensed)
static int
qexec_analytic_finalize_group (THREAD_ENTRY * thread_p, XASL_STATE * xasl_state,
                               ANALYTIC_FUNCTION_STATE * func_state, bool is_same_group)
{
  /* compute the function's final value for the just-closed sort-key range */
  qdata_finalize_analytic_func (thread_p, func_state->func_p, is_same_group);

  if (!DB_IS_NULL (func_state->func_p->value))
    func_state->curr_group_tuple_count_nn += func_state->curr_sort_key_tuple_count;

  /* group_list_id gets one (count, count_nn) tuple per partition */
  if (!is_same_group)
    qfile_fast_intint_tuple_to_list (thread_p, func_state->group_list_id,
                                     func_state->curr_group_tuple_count,
                                     func_state->curr_group_tuple_count_nn);

  /* value_list_id gets one (sort_key_count, value) tuple per distinct sort key */
  qfile_fast_intval_tuple_to_list (thread_p, func_state->value_list_id,
                                   func_state->curr_sort_key_tuple_count,
                                   func_state->func_p->value);
  // ... condensed ...
}
```

The first-pass output (`interm_file` + `group_list_id` + `value_list_id`) is enough information for the second pass to reconstruct each function's output column for every input tuple.
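To make the sidecar layout concrete, here is a toy first and second pass in Python for a cumulative `MAX(...) OVER (PARTITION BY ... ORDER BY ...)`. The helpers and dict-shaped rows are illustrative only (CUBRID's `count_nn` bookkeeping is also simplified here), but the invariant is the one stated above: `(count, value)` pairs per sort key plus `(count, count_nn)` pairs per partition are enough to rebuild a per-tuple output column.

```python
def first_pass(rows, part_col, order_col, arg_col):
    """Toy analytic first pass for cumulative MAX over (partition, order).
    rows must arrive sorted by (part_col, order_col).  Returns:
      group_list: one (tuple_count, non_null_count) entry per partition;
      value_list: one (tuple_count, running_max) entry per distinct sort key."""
    group_list, value_list = [], []
    cur_part = cur_key = object()          # sentinel: matches nothing
    g_cnt = g_nn = k_cnt = 0
    running = None
    for r in rows:
        p, k, v = r[part_col], (r[part_col], r[order_col]), r[arg_col]
        if p != cur_part:                  # partition boundary
            if g_cnt:
                value_list.append((k_cnt, running))
                group_list.append((g_cnt, g_nn))
            cur_part, cur_key = p, k
            g_cnt = g_nn = k_cnt = 0
            running = None                 # reinit=true: reset from scratch
        elif k != cur_key:                 # sort-key boundary within partition
            value_list.append((k_cnt, running))   # finalize(is_same_group=true)
            cur_key, k_cnt = k, 0          # reinit=false: keep running value
        g_cnt += 1
        k_cnt += 1
        if v is not None:
            g_nn += 1
            running = v if running is None else max(running, v)
    if g_cnt:                              # trailing group flush
        value_list.append((k_cnt, running))
        group_list.append((g_cnt, g_nn))
    return group_list, value_list

def second_pass(value_list):
    """Expand (count, value) sort-key entries back into one value per input tuple."""
    return [v for cnt, v in value_list for _ in range(cnt)]

rows = [
    {"c1": 1, "c2": 1, "c3": 5},
    {"c1": 1, "c2": 1, "c3": 7},
    {"c1": 1, "c2": 2, "c3": 3},
    {"c1": 2, "c2": 1, "c3": 9},
]
group_list, value_list = first_pass(rows, "c1", "c2", "c3")
outputs = second_pass(value_list)
```

Here `value_list` comes out as `[(2, 7), (1, 7), (1, 9)]` and `outputs` as `[7, 7, 7, 9]`: every tuple sharing a sort key shares one computed value.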
Second pass: qexec_analytic_update_group_result
This pass re-reads the intermediate file linearly. For each
tuple, for each window function, it advances cursors in the
sidecar files and computes the per-tuple output value. The
cursor advancement is implemented in qexec_analytic_value_advance
(forward and backward navigation over the sort-key list while
tracking partition headers via the group list) and in
qexec_analytic_value_lookup (random-access seek to a position
within the current partition).
```c
// qexec_analytic_update_group_result — src/query/query_executor.c (condensed)
static int
qexec_analytic_update_group_result (THREAD_ENTRY * thread_p, ANALYTIC_STATE * analytic_state)
{
  // ... condensed: open scans on group_list_id, value_list_id, interm_file ...

  while ((sc = qfile_scan_list_next (thread_p, &interm_scan_id, &tplrec_scan, PEEK)) == S_SUCCESS)
    {
      fetch_val_list (thread_p, analytic_state->a_regu_list, &xasl_state->vd, ...);
      qexec_analytic_eval_instnum_pred (thread_p, analytic_state, stage);

      for (i = 0; i < analytic_state->func_count; i++)
        {
          ANALYTIC_FUNCTION_STATE *func_state = &analytic_state->func_state_list[i];
          ANALYTIC_TYPE *func_p = func_state->func_p;

          if (func_p->function == PT_ROW_NUMBER)
            ;                   /* already computed in first pass; nothing to do */
          else if (QPROC_ANALYTIC_IS_OFFSET_FUNCTION (func_p))
            {
              if (func_state->group_tuple_position == -1)
                qexec_analytic_value_advance (thread_p, func_state, 1, 1, ignore_nulls);
              else if (func_state->group_consumed_tuples >= func_state->curr_group_tuple_count)
                {
                  qexec_analytic_group_header_next (thread_p, func_state);
                  func_state->group_consumed_tuples = 0;
                }
            }
          else if (QPROC_IS_INTERPOLATION_FUNC (func_p))
            {
              if (func_state->group_consumed_tuples >= func_state->curr_group_tuple_count)
                {
                  qexec_analytic_group_header_next (thread_p, func_state);
                  func_state->group_consumed_tuples = 0;
                }
            }
          else
            qexec_analytic_value_advance (thread_p, func_state, 1, 1, ignore_nulls);

          /* function-specific evaluation */
          switch (func_p->function)
            {
            case PT_LEAD:
            case PT_LAG:
            case PT_NTH_VALUE:
              qexec_analytic_evaluate_offset_function (thread_p, func_state, analytic_state);
              break;
            case PT_NTILE:
              qexec_analytic_evaluate_ntile_function (thread_p, func_state);
              break;
            case PT_CUME_DIST:
            case PT_PERCENT_RANK:
              qexec_analytic_evaluate_cume_dist_percent_rank_function (thread_p, func_state);
              break;
            case PT_MEDIAN:
            case PT_PERCENTILE_CONT:
            case PT_PERCENTILE_DISC:
              qexec_analytic_evaluate_interpolation_function (thread_p, func_state);
              break;
            }

          func_state->group_consumed_tuples++;
        }

      if (analytic_state->is_output_rec)
        qexec_insert_tuple_into_list (thread_p, analytic_state->output_file,
                                      analytic_state->a_outptr_list, &xasl_state->vd,
                                      &tplrec_write);
    }
  // ... condensed: cleanup ...
}
```

Offset functions: LEAD, LAG, NTH_VALUE
The offset evaluator translates the offset argument into a
target position within the partition and uses
qexec_analytic_value_lookup to read the value from the
sort-key sidecar:
```c
// qexec_analytic_evaluate_offset_function — src/query/query_executor.c (condensed)
static int
qexec_analytic_evaluate_offset_function (THREAD_ENTRY * thread_p,
                                         ANALYTIC_FUNCTION_STATE * func_state,
                                         ANALYTIC_STATE * analytic_state)
{
  // ... condensed: locate offset reguvar, read its int value ...

  switch (func_p->function)
    {
    case PT_LEAD:
      target_idx = func_state->group_consumed_tuples + offset;
      break;
    case PT_LAG:
      target_idx = func_state->group_consumed_tuples - offset;
      break;
    case PT_NTH_VALUE:
      // ... condensed: cast to double, validate range ...
      target_idx = (int) floor (nth_idx) - 1;
      if (func_p->from_last)
        target_idx = group_tuple_count - target_idx;
      break;
    }

  if (target_idx >= 0 && target_idx < group_tuple_count)
    qexec_analytic_value_lookup (thread_p, func_state, target_idx, func_p->ignore_nulls);
  else
    put_default = true;         /* off the end → default value */

  if (put_default)
    {
      pr_clear_value (func_p->value);
      pr_clone_value (default_val_coerced, func_p->value);
    }
  return error;
}
```

The deck (page 8-13) walks LAG(c2, 1) over four tuples in
detail, showing how group_consumed_tuples advances per output
tuple, how group_list_id provides the partition cardinality
that bounds the lookup, and how the default value falls back
when the target index is out of range.
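The target-index arithmetic is small enough to model directly. A hypothetical helper, not the CUBRID function: the partition is a plain list, `pos` plays the role of `group_consumed_tuples`, and an out-of-range target falls back to the default, exactly as `put_default` does.

```python
def lag(partition, pos, offset=1, default=None):
    """LAG(value, offset, default) at index pos inside its partition,
    mirroring target_idx = group_consumed_tuples - offset."""
    target_idx = pos - offset
    if 0 <= target_idx < len(partition):
        return partition[target_idx]   # the qexec_analytic_value_lookup step
    return default                     # off the partition edge: default value

col = [10, 20, 30, 40]
lagged = [lag(col, i) for i in range(len(col))]   # [None, 10, 20, 30]
```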
Interpolation functions: MEDIAN, PERCENTILE_CONT, PERCENTILE_DISC
These functions need to read the value at a fractional position within the partition. The evaluator computes the floor and ceiling positions of the target and either picks the middle row (odd count) or interpolates between the two middle rows:
```c
// qexec_analytic_evaluate_interpolation_function — src/query/query_executor.c (condensed)
static int
qexec_analytic_evaluate_interpolation_function (THREAD_ENTRY * thread_p,
                                                ANALYTIC_FUNCTION_STATE * func_state)
{
  ANALYTIC_TYPE *func_p = func_state->func_p;

  /* MEDIAN/PERCENTILE evaluated only at start of group; later tuples reuse */
  if (func_state->group_tuple_position_nn > 0)
    return NO_ERROR;
  if (func_p->is_const_operand)
    return NO_ERROR;
  if (func_state->curr_group_tuple_count_nn == 0)
    return NO_ERROR;

  if (func_p->function == PT_MEDIAN)
    percentile_d = 0.5;
  else
    {
      // ... condensed: fetch percentile reguvar, validate ...
      percentile_d = db_get_double (peek_value_p);
      if (func_p->function == PT_PERCENTILE_DISC)
        percentile_d =
          ceil (percentile_d * func_state->curr_group_tuple_count_nn)
          / func_state->curr_group_tuple_count_nn;
    }

  row_num_d = ((double) (func_state->curr_group_tuple_count_nn - 1)) * percentile_d;
  f_row_num_d = floor (row_num_d);
  c_row_num_d = (func_p->function == PT_PERCENTILE_DISC) ? f_row_num_d : ceil (row_num_d);

  if (f_row_num_d == c_row_num_d)
    {
      qexec_analytic_value_lookup (thread_p, func_state, (int) f_row_num_d, true);
      qdata_apply_interpolation_function_coercion (func_p->value, &func_p->domain,
                                                   func_p->value, func_p->function);
    }
  else
    {
      DB_VALUE c_value, f_value;

      qexec_analytic_value_lookup (thread_p, func_state, (int) f_row_num_d, true);
      pr_clone_value (func_p->value, &f_value);
      qexec_analytic_value_lookup (thread_p, func_state, (int) c_row_num_d, true);
      pr_clone_value (func_p->value, &c_value);
      qdata_interpolation_function_values (&f_value, &c_value, row_num_d,
                                           f_row_num_d, c_row_num_d, &func_p->domain,
                                           func_p->value, func_p->function);
    }
  return NO_ERROR;
}
```

The "group_tuple_position_nn > 0 → return" guard means the
median is computed once per partition, on the first non-NULL
tuple, and reused for every subsequent tuple in that partition.
Hence the deck's note that all rows of partition c1=1 get the
same median(c2) output (page 18: "when processing other tuples
within the same group, the already-computed median value is reused").
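The position math reduces to a few lines for a sorted, NULL-free partition. The helper below is illustrative; it mirrors the `row_num_d` / floor / ceil logic, including the `PERCENTILE_DISC` snap-to-row adjustment.

```python
import math

def percentile(sorted_vals, p, discrete=False):
    """Position math for MEDIAN / PERCENTILE_CONT / PERCENTILE_DISC over a
    sorted, NULL-free partition: row = (n - 1) * p, then pick or interpolate."""
    n = len(sorted_vals)
    if discrete:
        p = math.ceil(p * n) / n                  # snap to an existing row
    row = (n - 1) * p
    f = math.floor(row)
    c = f if discrete else math.ceil(row)
    if f == c:
        return sorted_vals[int(f)]                # lands exactly on a row
    f_val, c_val = sorted_vals[int(f)], sorted_vals[int(c)]
    return f_val + (c_val - f_val) * (row - f)    # linear interpolation

vals = [10, 20, 30, 40]
median_cont = percentile(vals, 0.5)                  # 25.0: between rows 1 and 2
median_disc = percentile(vals, 0.5, discrete=True)   # 20: snapped to row 1
```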
Position navigation: qexec_analytic_value_advance
This is the routine that walks back and forth across both sidecar list files coherently:
```c
// qexec_analytic_value_advance — src/query/query_executor.c (condensed)
static int
qexec_analytic_value_advance (THREAD_ENTRY * thread_p, ANALYTIC_FUNCTION_STATE * func_state,
                              int amount, int max_group_changes, bool ignore_nulls)
{
  while (sc == S_SUCCESS && amount != 0)
    {
      /* clamp to group end if max_group_changes <= 0 */
      if (amount > 0)
        {
          if (max_group_changes <= 0
              && func_state->group_tuple_position >= func_state->curr_group_tuple_count - 1)
            break;
          func_state->sort_key_tuple_position++;
          func_state->group_tuple_position++;
        }
      else
        {
          /* symmetric for backward */
        }

      /* sort-key boundary: load adjacent sort-key header */
      if (func_state->sort_key_tuple_position < 0)
        {
          qfile_scan_list_prev (thread_p, &func_state->value_scan_id, ...);
          qexec_analytic_sort_key_header_load (func_state, ...);
          func_state->sort_key_tuple_position = func_state->curr_sort_key_tuple_count - 1;
        }
      else if (func_state->sort_key_tuple_position >= func_state->curr_sort_key_tuple_count)
        {
          qfile_scan_list_next (thread_p, &func_state->value_scan_id, ...);
          qexec_analytic_sort_key_header_load (func_state, ...);
          func_state->sort_key_tuple_position = 0;
        }

      /* group boundary: load adjacent group header */
      if (func_state->group_tuple_position < 0)
        {
          /* prev group */
          max_group_changes--;
        }
      else if (func_state->group_tuple_position >= func_state->curr_group_tuple_count)
        {
          /* next group */
          max_group_changes--;
        }

      /* adjust amount: if ignore_nulls, only non-NULL counts */
      if (amount > 0)
        amount -= (DB_IS_NULL (func_p->value) && ignore_nulls) ? 0 : 1;
      else
        amount += (DB_IS_NULL (func_p->value) && ignore_nulls) ? 0 : 1;
    }

  if (amount != 0)
    {
      if (func_state->curr_group_tuple_count_nn == 0 && ignore_nulls)
        return NO_ERROR;        /* all-NULL group → result is NULL */
      return ER_FAILED;         /* genuine miss */
    }
  return NO_ERROR;
}
```

Two state machines run in lockstep — sort-key cursor and group
cursor — each with its own backing list file and its own
position counter. The max_group_changes argument is the budget
for crossing partition boundaries (0 = stay within current
group, 1 = at most one boundary crossing, used by
qexec_analytic_group_header_next).
qexec_analytic_value_lookup — random access
```c
// qexec_analytic_value_lookup — src/query/query_executor.c (condensed)
static int
qexec_analytic_value_lookup (THREAD_ENTRY * thread_p, ANALYTIC_FUNCTION_STATE * func_state,
                             int position, bool ignore_nulls)
{
  int offset = position - (ignore_nulls ? func_state->group_tuple_position_nn
                                        : func_state->group_tuple_position);
  if (offset != 0)
    return qexec_analytic_value_advance (thread_p, func_state, offset, 0, ignore_nulls);

  return qexec_analytic_sort_key_header_load (func_state, true);
}
```

This is the relative-positioning wrapper used by the offset and
interpolation evaluators. It computes a delta from the current
position and calls qexec_analytic_value_advance with
max_group_changes = 0 (no partition crossing).
Sort vs Hash GROUP BY — Strategy Choice and Spill
The sort/hash decision is made in three layers, only the first of which is plan-time:
1. Plan-time eligibility: the XASL generator sets `BUILDLIST_PROC_NODE::g_hash_eligible` when the GROUP BY is amenable to hashing — primary criteria are the absence of non-deterministic aggregates and the presence of equality-based key extraction. This is propagated to `GROUPBY_STATE::hash_eligible` in `qexec_initialize_groupby_state`.
2. Runtime memory ceiling: while processing, `qexec_hash_gby_agg_tuple` tracks `context->hash_size` (sum of key sizes + accumulator sizes). When it crosses `prm_get_bigint_value (PRM_ID_MAX_AGG_HASH_SIZE)` (default 2 MB), the LRU head of the hash table is evicted to `part_list_id`. This does not change strategy; it just trims the working set.
3. Runtime selectivity bailout: after `HASH_AGGREGATE_VH_SELECTIVITY_TUPLE_THRESHOLD = 2000` tuples have been processed, the executor checks `group_count / tuple_count > HASH_AGGREGATE_VH_SELECTIVITY_THRESHOLD = 0.5f`. If the ratio is too high (the hash table is mostly distinct keys), `state` flips to `HS_REJECT_ALL`, the entire table is dumped to `part_list_id` via `qdata_save_agg_htable_to_list`, and from that point all new input tuples write directly to `xasl->list_id` for sort-based aggregation.
flowchart TD
START["end_one_iteration\nreceives a tuple"]
ELIG{"hash_eligible &&\nstate != HS_REJECT_ALL?"}
HKEY["build hash key\n(qexec_build_agg_hkey)"]
PROBE{"key in hash_table?"}
NEW["allocate hash entry\nstash first_tuple"]
FOLD["fold accumulator\n(qdata_evaluate_aggregate_list)"]
MEM{"hash_size > MAX_AGG_HASH_SIZE?"}
EVICT["spill LRU head to part_list_id\nshrink hash_size"]
SEL{"tuple_count > 2000?"}
RATIO{"group_count/tuple_count > 0.5?"}
REJECT["state = HS_REJECT_ALL\nqdata_save_agg_htable_to_list\n→ part_list_id"]
SORT["write tuple to xasl->list_id\n(sort path)"]
POST["Post-processing"]
START --> ELIG
ELIG -- no --> SORT
ELIG -- yes --> HKEY --> PROBE
PROBE -- no --> NEW --> MEM
PROBE -- yes --> FOLD --> MEM
MEM -- yes --> EVICT --> SEL
MEM -- no --> SEL
SEL -- no --> POST
SEL -- yes --> RATIO
RATIO -- no --> POST
RATIO -- yes --> REJECT --> SORT
SORT --> POST
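The control flow above can be modelled with a toy `COUNT(*)` aggregator. The constants are stand-ins (CUBRID's ceiling is measured in bytes via `PRM_ID_MAX_AGG_HASH_SIZE`, not entry counts, and the real thresholds are 2000 tuples and 0.5), but the shape is the same: LRU eviction trims the working set, while the selectivity check is a one-way flip to reject-all.

```python
from collections import OrderedDict

MAX_HASH_ENTRIES = 4      # stand-in for the PRM_ID_MAX_AGG_HASH_SIZE byte ceiling
CHECK_AFTER = 8           # stand-in for the 2000-tuple selectivity threshold
MAX_SELECTIVITY = 0.5     # HASH_AGGREGATE_VH_SELECTIVITY_THRESHOLD

def hash_group_by(keys):
    """Toy COUNT(*) GROUP BY with LRU spill and a one-way selectivity bailout.
    Returns (in-memory table, spilled partial results, tuples sent to the sort path)."""
    table = OrderedDict()            # key -> count, maintained in LRU order
    part_list, sort_list = [], []
    seen = groups = 0
    reject_all = False
    for key in keys:
        seen += 1
        if reject_all:               # HS_REJECT_ALL: straight to the sort path
            sort_list.append(key)
            continue
        if key in table:
            table[key] += 1
            table.move_to_end(key)   # refresh LRU position
        else:
            table[key] = 1
            groups += 1
        if len(table) > MAX_HASH_ENTRIES:        # memory ceiling: evict LRU head
            part_list.append(table.popitem(last=False))
        if seen > CHECK_AFTER and groups / seen > MAX_SELECTIVITY:
            part_list.extend(table.items())      # dump table, flip to reject-all
            table.clear()
            reject_all = True
    return table, part_list, sort_list
```

Low-selectivity input (`["a", "b"]` repeated) stays entirely in the table; a stream of all-distinct keys first evicts, then bails out, and every later tuple goes to the sort path with its spilled partials awaiting the merge sort.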
Post-processing for the hash path
Even on the hash path, post-processing almost always sorts.
qexec_groupby examines the state and chooses one of three
shapes:
1. Pure hash, sort-skip: `xasl->list_id->tuple_cnt == 0`, `agg_hash_context->part_list_id->tuple_cnt == 0`, and `PRM_ID_AGG_HASH_RESPECT_ORDER == false`. The early-exit branch in `qexec_groupby` walks the in-memory hash table and emits one output tuple per entry directly. No sort. This is the only path that avoids `sort_listfile` entirely.
2. Hash with spill: `part_list_id` is non-empty (LRU evictions happened, or the very-high-selectivity bailout fired). The executor sorts `part_list_id` to merge per-key partial accumulators (using `qexec_hash_gby_put_next`), then sorts `xasl->list_id` (using `qexec_gby_put_next`) with cross-references into the merged partial list. Two `sort_listfile` calls, the same callback substrate as pure sort-based GROUP BY.
3. Sort fallback: `state == HS_REJECT_ALL`. Same as case 2; `xasl->list_id` carries the post-bailout tuples, `part_list_id` carries the dumped hash table.
The deck (page 23) lists the exact conditions for the
sort-skip optimisation: hash group by, no WITH ROLLUP,
agg_hash_respect_order == no, and no spill during processing.
External sort substrate
When qexec_groupby or qexec_execute_analytic calls
sort_listfile, the actual sort lives in
src/storage/external_sort.c:
```c
// sort_listfile — src/storage/external_sort.c (condensed)
sort_listfile (THREAD_ENTRY * thread_p, INT16 volid, int est_inp_pg_cnt,
               SORT_GET_FUNC * get_fn, void *get_arg,
               SORT_PUT_FUNC * put_fn, void *put_arg,
               SORT_CMP_FUNC * cmp_fn, void *cmp_arg,
               SORT_DUP_OPTION option, int limit, ..., SORT_PARALLEL_TYPE parallel_type)
{
  SORT_PARAM ori_sort_param;
  SORT_PARAM *sort_param = &ori_sort_param;

  // ... condensed: copy callbacks into sort_param ...
  sort_param->limit = limit;

  /* sort buffer sized by PRM_ID_SR_NBUFFERS (default), capped at input estimate */
  sort_param->tot_buffers = MIN (prm_get_integer_value (PRM_ID_SR_NBUFFERS), input_pages);
  sort_param->tot_buffers = MAX (4, sort_param->tot_buffers);
  sort_param->internal_memory =
    (char *) malloc ((size_t) sort_param->tot_buffers * (size_t) DB_PAGESIZE);

  /* temp file count derived from buffer/input ratio; runs go into half_files */
  sort_param->half_files = sort_get_num_half_tmpfiles (sort_param->tot_buffers, input_pages);
  sort_param->tot_tempfiles = sort_param->half_files << 1;

  /* main loop: in-phase (run generation) → ex-phase (k-way merge) */
  // ... condensed ...
}
```

The sort buffer size is `PRM_ID_SR_NBUFFERS` pages
(default at the time of writing varies; check the source). When
input fits in tot_buffers * DB_PAGESIZE of memory, the sort is
purely in-memory; otherwise it writes sorted runs to temp files
and merges them. The parallel_type argument routes between
serial and parallel sort variants — SORT_ANALYTIC and the
default carry hints to the parallel sort scheduler.
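The in-phase/ex-phase split is classic external merge sort. A compact sketch, with runs kept in memory for brevity where `sort_run_flush` would write them to temp files:

```python
import heapq

def sort_listfile_py(items, buffer_size=4):
    """External sort sketch: the in-phase cuts the input into sorted runs of at
    most buffer_size items; the ex-phase k-way merges the runs with a heap."""
    runs, buf = [], []
    for x in items:                  # in-phase: run generation
        buf.append(x)
        if len(buf) == buffer_size:
            runs.append(sorted(buf))
            buf = []
    if buf:
        runs.append(sorted(buf))
    if len(runs) <= 1:               # input fit in the buffer: purely in-memory
        return runs[0] if runs else []
    return list(heapq.merge(*runs))  # ex-phase: k-way merge
```

The `len(runs) <= 1` branch is the in-memory fast path described above; everything else pays for run generation plus one merge pass.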
A non-obvious interaction: sort_listfile does not have a
“sort finished, please drain” hook, so both
qexec_gby_finalize_group_dim and qexec_execute_analytic have
to finalize the trailing group manually after sort_listfile
returns — the loop in qexec_execute_analytic after
sort_listfile is exactly this clean-up:
```c
// trailing finalize after sort_listfile (qexec_execute_analytic excerpt)
if (analytic_state.input_recs != 0)
  {
    for (i = 0; i < analytic_state.func_count; i++)
      qexec_analytic_finalize_group (thread_p, analytic_state.xasl_state,
                                     &analytic_state.func_state_list[i], false);
    qexec_analytic_update_group_result (thread_p, &analytic_state);
  }
```

The deck reflects the same invariant for GROUP BY — the
finalize-before-start ordering inside qexec_gby_put_next only
fires on key transitions, leaving the very last group unflushed
until the post-sort_listfile cleanup.
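The invariant is easy to see in a generic sort-based grouping loop, sketched here in Python: emission happens only on a key change, so the trailing group must be flushed after the loop ends.

```python
def sort_based_group_by(sorted_rows, key, agg_init, agg_step, agg_final):
    """Generic sort-based grouping: one linear pass over key-sorted input.
    Groups are emitted on key change only, so the trailing group is
    flushed after the loop, mirroring the post-sort_listfile clean-up."""
    out, cur_key, acc = [], object(), None
    for row in sorted_rows:
        k = key(row)
        if k != cur_key:
            if acc is not None:
                out.append((cur_key, agg_final(acc)))   # finalize on key change
            cur_key, acc = k, agg_init()
        acc = agg_step(acc, row)
    if acc is not None:
        out.append((cur_key, agg_final(acc)))           # trailing-group flush
    return out

rows = [("a", 1), ("a", 2), ("b", 5)]
result = sort_based_group_by(rows, key=lambda r: r[0],
                             agg_init=lambda: 0,
                             agg_step=lambda acc, r: acc + r[1],
                             agg_final=lambda acc: acc)
```

Delete the final flush and the last group silently disappears, which is exactly the bug the manual clean-up after `sort_listfile` prevents.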
Source Walkthrough
Anchor on symbol names, not line numbers. Line numbers below are scoped to the doc's `updated:` date.
Aggregation / GROUP BY
Driver and state:
- `qexec_groupby` — top-level GROUP BY entry; allocates `GROUPBY_STATE`, opens the output list, calls `sort_listfile` with the sort callbacks.
- `GROUPBY_STATE` (query_executor.c:213) — per-query state.
- `GROUPBY_DIMENSION` (query_executor.c:206) — per-rollup-level slot with its own aggregate list.
- `GROUPBY_DIMENSION_FLAG` — `GROUP_BY`, `ROLLUP`, `CUBE`, `SET`.
- `qexec_initialize_groupby_state` — initialises the state and binds the hash context if eligible.
- `qexec_clear_groupby_state` — cleanup.
- `qexec_gby_init_group_dim` — allocates the dimension array (`g_dim_levels = with_rollup ? nkeys + 1 : 1`).
Sort-based path:
- `qexec_gby_get_next` — sort `get_fn`; turns input list tuples into `SORT_REC`s.
- `qexec_gby_put_next` — sort `put_fn`; iterates the chained `SORT_REC` list, dispatching `start_group` / `agg_tuple` / `finalize_group_dim` based on key transitions.
- `qexec_gby_start_group_dim` — start every dimension on the first record.
- `qexec_gby_start_group` — record the new key, zero one dimension's accumulators.
- `qexec_gby_agg_tuple` — fold the current input tuple into every dimension's accumulator chain.
- `qexec_gby_finalize_group_dim` — rollup-aware key-prefix comparison; finalize all dimensions whose prefix changed.
- `qexec_gby_finalize_group` — emit one output tuple from the given dimension; evaluate `HAVING` and `GROUP_BY_NUM`.
- `qexec_gby_finalize_group_val_list` — NULL out columns that were rolled up away.
Hash path:
- `qexec_hash_gby_agg_tuple` — processing-side hash insert/probe; LRU spill; very-high-selectivity bailout.
- `qexec_hash_gby_get_next` — sort `get_fn` for `part_list_id`.
- `qexec_hash_gby_put_next` — sort `put_fn`; merges accumulators for runs of identical keys.
- `qexec_alloc_agg_hash_context` / `qexec_alloc_agg_hash_context_buildlist_xasl` — allocate and initialise the hash context.
- `qexec_free_agg_hash_context` — cleanup.
- `qexec_build_agg_hkey` — populate an `AGGREGATE_HASH_KEY` from the current reguvars.
- `qexec_locate_agg_hentry_in_list` — scan `sorted_part_list_id` for a key during the second hash-path sort.
Aggregate evaluation helpers (query_aggregate.cpp):
- `qdata_evaluate_aggregate_list` — fold one tuple into an `AGGREGATE_TYPE` chain (per-function dispatch).
- `qdata_initialize_aggregate_list` — zero the chain at start of group.
- `qdata_finalize_aggregate_list` — close the chain (e.g., divide for `AVG`).
- `qdata_aggregate_accumulator_to_accumulator` — merge two accumulators (used in hash-spill consolidation).
- `qdata_alloc_agg_hvalue` / `qdata_free_agg_hvalue` — accumulator storage in the hash context.
- `qdata_save_agg_hentry_to_list` — write one (key, value) entry to `part_list_id`.
- `qdata_save_agg_htable_to_list` — bulk dump the hash table on selectivity bailout.
- `qdata_load_agg_hentry_from_tuple` — re-hydrate (key, value) from a sorted partial-list tuple.
Analytic / Window Functions
Driver and state:
- `qexec_execute_analytic` — top-level analytic entry; runs once per shared sort list.
- `ANALYTIC_STATE` (query_executor.c:288) — per-query state.
- `ANALYTIC_FUNCTION_STATE` (query_executor.c:259) — per-function state; carries the two sidecar list files.
- `qexec_initialize_analytic_state` — initialises the state and per-function sub-states.
- `qexec_initialize_analytic_function_state` — the inner builder for one `ANALYTIC_FUNCTION_STATE`.
- `qexec_clear_analytic_state` — cleanup.
First pass (sort + state-update):
- `qexec_analytic_get_next` — sort `get_fn`.
- `qexec_analytic_put_next` — sort `put_fn`; the partition/sort-key state machine.
- `qexec_analytic_start_group` — record key, optionally re-init the accumulator (new partition) or just the sort-key counter (same partition).
- `qexec_analytic_add_tuple` — push the current tuple into every function's accumulator and into `interm_file`.
- `qexec_analytic_finalize_group` — emit one (group_count, group_count_nn) tuple to `group_list_id` and one (sort_key_count, value) tuple to `value_list_id`.
- `qexec_analytic_eval_instnum_pred` — evaluates `INST_NUM()` at the appropriate stage.
Second pass (output reassembly):
- `qexec_analytic_update_group_result` — re-scan `interm_file`, position cursors, dispatch per-function evaluators, write output.
- `qexec_analytic_value_advance` — bidirectional walk over the sort-key sidecar with group-boundary tracking.
- `qexec_analytic_value_lookup` — random-access seek to a position within the current partition.
- `qexec_analytic_group_header_next` — walk to the next partition.
- `qexec_analytic_group_header_load` / `qexec_analytic_sort_key_header_load` — deserialize the sidecar headers into `func_state` counters.
Per-function evaluators:
- `qexec_analytic_evaluate_offset_function` — `LEAD`, `LAG`, `NTH_VALUE`.
- `qexec_analytic_evaluate_ntile_function` — `NTILE`.
- `qexec_analytic_evaluate_cume_dist_percent_rank_function` — `CUME_DIST`, `PERCENT_RANK`.
- `qexec_analytic_evaluate_interpolation_function` — `MEDIAN`, `PERCENTILE_CONT`, `PERCENTILE_DISC`.
- `qexec_analytic_eval_in_processing` — for analytic functions whose value can be computed during processing (e.g., `ROW_NUMBER`).
Helpers (query_analytic.cpp):
- `qdata_initialize_analytic_func` — zero per-group counters, open distinct list files when needed.
- `qdata_evaluate_analytic_func` — fold one tuple into the function-specific running state.
- `qdata_finalize_analytic_func` — compute the per-sort-key final value.
Sort vs Hash GROUP BY — Strategy Choice and Spill
Selectivity / memory constants:
- `HASH_AGGREGATE_VH_SELECTIVITY_TUPLE_THRESHOLD` = 2000.
- `HASH_AGGREGATE_VH_SELECTIVITY_THRESHOLD` = 0.5f.
- `PRM_ID_MAX_AGG_HASH_SIZE` (default 2 MB) — memory ceiling.
- `PRM_ID_AGG_HASH_RESPECT_ORDER` — disables the sort-skip short-circuit even when otherwise eligible.
- `PRM_ID_SR_NBUFFERS` — sort buffer page count.
External sort substrate (storage/external_sort.c):
- `sort_listfile` — public entry; copies callbacks into `SORT_PARAM` and runs in-phase + ex-phase.
- `sort_listfile_internal` — the actual driver.
- `sort_inphase_sort` — produces sorted runs from buffer-sized chunks.
- `sort_run_sort` — quicksort over the in-buffer key array.
- `sort_run_flush` — flushes a sorted run to a temp file.
- `sort_exphase_merge` — k-way merge of runs.
- `sort_exphase_merge_elim_dup` — merge with duplicate elimination (for `DISTINCT`-style sorts).
- `SORT_PARAM` — central state struct (file table, callback bundle, comparator).
- `SORT_PARALLEL_TYPE` — selects between serial, `SORT_ANALYTIC`, and other parallel scheduler modes.
Position hints as of 2026-04-30
| Symbol | File | Line |
|---|---|---|
HASH_AGGREGATE_VH_SELECTIVITY_TUPLE_THRESHOLD | query_executor.c | 127 |
HASH_AGGREGATE_VH_SELECTIVITY_THRESHOLD | query_executor.c | 130 |
GROUPBY_DIMENSION_FLAG | query_executor.c | 200 |
GROUPBY_DIMENSION struct | query_executor.c | 206 |
GROUPBY_STATE struct | query_executor.c | 213 |
ANALYTIC_FUNCTION_STATE struct | query_executor.c | 259 |
ANALYTIC_STATE struct | query_executor.c | 288 |
qexec_initialize_groupby_state | query_executor.c | 4289 |
qexec_gby_agg_tuple | query_executor.c | 4478 |
qexec_hash_gby_agg_tuple_public | query_executor.c | 4516 |
qexec_hash_gby_agg_tuple | query_executor.c | 4537 |
qexec_hash_gby_get_next | query_executor.c | 4778 |
qexec_hash_gby_put_next | query_executor.c | 4793 |
qexec_gby_get_next | query_executor.c | 4940 |
qexec_gby_put_next | query_executor.c | 4956 |
qexec_groupby | query_executor.c | 5247 |
qexec_execute_mainblock_internal | query_executor.c | 15058 |
qexec_gby_finalize_group_dim | query_executor.c | 19092 |
qexec_gby_finalize_group | query_executor.c | 19184 |
qexec_gby_start_group_dim | query_executor.c | 19408 |
qexec_gby_start_group | query_executor.c | 19430 |
qexec_gby_init_group_dim | query_executor.c | 19501 |
qexec_execute_analytic | query_executor.c | 21007 |
qexec_initialize_analytic_state | query_executor.c | 21593 |
qexec_analytic_get_next | query_executor.c | 21743 |
qexec_analytic_put_next | query_executor.c | 21761 |
qexec_analytic_start_group | query_executor.c | 22045 |
qexec_analytic_finalize_group | query_executor.c | 22118 |
qexec_analytic_add_tuple | query_executor.c | 22195 |
qexec_analytic_evaluate_offset_function | query_executor.c | 22416 |
qexec_analytic_evaluate_cume_dist_percent_rank_function | query_executor.c | 22634 |
qexec_analytic_evaluate_interpolation_function | query_executor.c | 22676 |
qexec_analytic_group_header_load | query_executor.c | 22807 |
qexec_analytic_sort_key_header_load | query_executor.c | 22847 |
qexec_analytic_value_advance | query_executor.c | 22910 |
qexec_analytic_value_lookup | query_executor.c | 23080 |
qexec_analytic_group_header_next | query_executor.c | 23112 |
qexec_analytic_update_group_result | query_executor.c | 23135 |
qexec_alloc_agg_hash_context_buildlist_xasl | query_executor.c | 26906 |
qexec_alloc_agg_hash_context | query_executor.c | 26922 |
qexec_free_agg_hash_context | query_executor.c | 27157 |
qexec_build_agg_hkey | query_executor.c | 27282 |
qexec_locate_agg_hentry_in_list | query_executor.c | 27327 |
sort_run_sort | storage/external_sort.c | 1171 |
sort_listfile | storage/external_sort.c | 1347 |
sort_listfile_internal | storage/external_sort.c | 1749 |
sort_inphase_sort | storage/external_sort.c | 1849 |
sort_run_flush | storage/external_sort.c | 2294 |
sort_exphase_merge_elim_dup | storage/external_sort.c | 2457 |
sort_exphase_merge | storage/external_sort.c | 3381 |
sort_get_avg_numpages_of_nonempty_tmpfile | storage/external_sort.c | 4137 |
sort_get_numpages_of_active_infiles | storage/external_sort.c | 5685 |
sort_run_add_new | storage/external_sort.c | 5747 |
Cross-check Notes
The raw analyses (Post-processing(aggregation).pdf,
Post-Processing(analytic).pdf, Sort, Hash Group By.pptx) were
captured as project-internal teaching material; the structural
description matches the present source faithfully, but a few
points have drifted or were imprecisely stated.
- `HASH_AGGREGATE_VH_SELECTIVITY_TUPLE_THRESHOLD` value. The deck (slides 18, 20) says 200 tuples and notes that the value "may change". The current source defines it as 2000 (query_executor.c:127). The threshold has indeed moved by an order of magnitude since the deck was authored.
- "The `qexec_gby_get_next` and `qexec_gby_put_next` functions are passed to `sort_listfile` as arguments." Verified — the call signature in `qexec_groupby` matches the deck's narration. The deck's flowchart for the put callback (page 7) shows `qexec_analytic_put_next` as the entry node, which is a copy-paste from the analytic deck; the actual entry for GROUP BY is `qexec_gby_put_next`.
- `groupby_dimension` ordering. The deck (slide 24) describes `g_dim[0]` as the full-detail aggregation (the result grouped by all columns) and the higher indices as the rollup prefixes, with the total aggregate (no grouping columns) at `g_dim[1]` and the prefix levels in `g_dim[2..g_dim_levels-1]`. Verified at `qexec_gby_init_group_dim`: index 0 carries the `GROUPBY_DIM_FLAG_GROUP_BY` flag, and when `with_rollup` all indices additionally carry `GROUPBY_DIM_FLAG_ROLLUP`. The finalize order in `qexec_gby_finalize_group_dim` walks from `g_dim_levels - 1` down to `i + 1`, which matches the deck's rule of finalizing from the end of the `g_dim` array down to `nkey + 1`.
- CBRD-25611 — rollup expression NULL bug. Deck slide 29 describes the fix at a high level. Whether the patch lives in the current `qexec_gby_finalize_group_val_list` or in the expression evaluator was not traced; the investigation path is the CBRD-25611 commit log on jira.cubrid.org.
- Analytic deck flowchart on slide page 4. The decision tree describes the partition-prefix re-comparison in the correct order, but the label "number of keys to compare = sort_prefix_size" elides the fact that `nkeys` is also set back to its original value after each comparison; the source code's save-and-restore pattern in `qexec_analytic_put_next` is the actual contract.
- `qexec_initialize_groupby_state` calls `qfile_initialize_sort_key_info` only when `groupby_list != NULL`. Verified at the head of the function: `qexec_groupby` itself early-exits if `buildlist->groupby_list == NULL`, so the state-init never sees a NULL list.
- `qexec_execute_analytic` runs once per sort list, not once per analytic function. Verified at the call site in `qexec_execute_mainblock_internal` (the loop iterates `xasl->proc.buildlist.a_eval_list` and calls `qexec_execute_analytic` per entry). The deck describes the per-sort-list iteration but elides the planner-side merge in `pt_optimize_analytic_list` that produces those entries.
- `is_skip_sort` path in `qexec_execute_analytic`. Triggered when the planner has already produced a sorted input; the function jumps directly to `qexec_analytic_update_group_result` without calling `sort_listfile`. Verified at the `if (is_skip_sort) { ... goto wrapup; }` branch. The deck does not discuss this optimisation.
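The rollup finalize-order rule verified above reduces to a short loop. The sketch below is an illustrative miniature, not CUBRID code: `rollup_state`, `rollup_init`, `rollup_accumulate`, and `rollup_key_change` are hypothetical names, the accumulator is a bare COUNT(*), and only the walk from the last dimension down to `matched + 1` mirrors the `qexec_gby_finalize_group_dim` behaviour described above (the detail dimension at index 0 is finalized by a separate key-change path not modelled here).

```c
#include <assert.h>

enum { MAX_LEVELS = 8 };

/* Hypothetical miniature of the GROUPBY_DIMENSION layout: with k grouping
 * columns there are k + 1 dimensions — detail at [0], grand total at [1],
 * and the rollup prefixes of increasing length at [2..levels-1]. */
struct rollup_state {
    int levels;                 /* analogue of g_dim_levels */
    long count[MAX_LEVELS];     /* one COUNT(*) accumulator per dimension */
    long finalized[MAX_LEVELS]; /* result rows emitted per dimension */
};

static void rollup_init(struct rollup_state *st, int group_cols)
{
    st->levels = group_cols + 1;
    for (int i = 0; i < MAX_LEVELS; i++) {
        st->count[i] = 0;
        st->finalized[i] = 0;
    }
}

/* Every input tuple folds into every dimension's accumulator. */
static void rollup_accumulate(struct rollup_state *st)
{
    for (int i = 0; i < st->levels; i++)
        st->count[i]++;
}

/* On a key change where `matched` leading dimensions still match, finalize
 * from the end of the array down to matched + 1 — the deck's "from the end
 * of g_dim down to nkey + 1" rule. Returns how many dimensions finalized. */
static int rollup_key_change(struct rollup_state *st, int matched)
{
    int n = 0;
    for (int i = st->levels - 1; i >= matched + 1; i--) {
        st->finalized[i]++;
        st->count[i] = 0;   /* reset the accumulator for the next group */
        n++;
    }
    return n;
}
```

The point of the loop direction is that longer prefixes must close before shorter ones, so each rollup row is emitted exactly once per prefix boundary.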
Open Questions
- Why the `is_same_group=true` case for offset functions keeps `start_group` (`reinit=false`) even within the same sort key. The comment in `qexec_analytic_put_next` ("offset functions will treat all tuples in a group as having a different sort key regardless if this is true or not") points to a per-tuple distinct value requirement; the resulting `value_list_id` then has more entries than the strict sort-key boundary count. Investigation path: trace the read-side cursor in `qexec_analytic_evaluate_offset_function` to confirm how the extra entries are consumed.
- `qexec_analytic_value_advance` failure semantics on all-NULL groups. When the function targets a position that would skip past every tuple because all values are NULL and `ignore_nulls == true`, it returns `NO_ERROR` with `func_p->value` left at its previous content. Whether the caller resets the value to NULL before relying on it (e.g., in `qexec_analytic_value_lookup` callers) needs auditing.
- `SORT_ANALYTIC` parallel sort routing. `sort_listfile` takes a `SORT_PARALLEL_TYPE` argument; the analytic call site uses `SORT_ANALYTIC` and the GROUP BY call site uses the default. The downstream parallel-sort scheduler in `external_sort.c` needs to be traced to confirm the analytic variant has a distinct scheduling policy. Investigation path: read `sort_listfile_internal` and the parallel hooks under `sort_run_sort`.
- What happens to `agg_hash_context` after a `HS_REJECT_ALL` bailout if a new iteration enters `qexec_hash_gby_agg_tuple`? The `state == HS_REJECT_ALL` guard at function entry returns immediately, so the tuple bypasses the hash and is written directly to `xasl->list_id` by the caller. But the hash table itself was just dumped via `qdata_save_agg_htable_to_list` — is the `mht_table` freed at that point, or does it linger empty until `qexec_free_agg_hash_context`? Memory accounting depends on the answer.
- `PRM_ID_AGG_HASH_RESPECT_ORDER` semantics. When set to `yes` (the default), it forces post-processing through `sort_listfile` even when the sort-skip optimisation would otherwise apply. Why is the default `yes`? Hypothesis: some downstream consumer (e.g., DISTINCT, or a top-level ORDER BY that is not present here but is composed elsewhere) requires the output ordering to be deterministic; without sorting, two runs of the same hash aggregation could produce different row orders. Investigation path: `git log` on the parameter's introduction.
- Trailing-group finalize after `sort_listfile`. Both `qexec_groupby` and `qexec_execute_analytic` rely on the knowledge that `sort_listfile` does not call any "drain" hook, so they finalize the trailing group after the sort returns. If a future refactor adds such a hook (parallel sort already has a more complex completion path), the trailing finalize could double-fire. The contract is implicit; making it explicit (e.g., a documented invariant in `external_sort.h`) would be defensive.
- Memory ownership of `interm_file`. When `is_skip_sort` is true, `analytic_state.interm_file = list_id` (the input list file). When false, `interm_file` is a freshly opened list file. The cleanup branch at the end of `qexec_execute_analytic` destroys `list_id` after copying the output; whether `interm_file` is destroyed independently in the `is_skip_sort` case (where they alias) or whether the destroy is correctly skipped needs audit. Investigation path: trace `qexec_clear_analytic_state`.
- Hash key extraction from non-position reguvars. The deck asserts that `g_hk_regu_list` is built from `proc->g_hk_scan_regu_list`, which is `TYPE_POSITION`-based (slide 7: "the regu_list entries are of type TYPE_POSITION"). When the GROUP BY key is an expression rather than a column, the `TYPE_POSITION` reguvars must be backed by an upstream evaluation that materialises the expression value into a tuple position. Whether this materialisation lives in `xasl_generation.c` or in the optimizer's `pt_optimize_*_for_groupby` pass needs confirmation.
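The `HS_REJECT_ALL` question is easier to reason about against a minimal model of the accept/reject state machine. Everything below is a hypothetical sketch: the struct, the function, and the parameterised threshold are illustrative, and only the state names and the idea of a tuple-count threshold combined with a selectivity ratio come from the source (the real constants and transitions live in `qexec_hash_gby_agg_tuple` in query_executor.c).

```c
#include <assert.h>
#include <stdbool.h>

/* Two of the hash-aggregation states: accept tuples into the hash table,
 * or reject everything so the caller falls back to the sort-based path. */
enum hash_state { HS_ACCEPT_ALL, HS_REJECT_ALL };

struct agg_hash_ctx {
    enum hash_state state;
    long tuples;    /* input tuples seen so far */
    long groups;    /* distinct grouping keys seen so far */
};

/* Illustrative stand-in for the selectivity check; the real tuple threshold
 * (HASH_AGGREGATE_VH_SELECTIVITY_TUPLE_THRESHOLD, currently 2000) is a
 * compile-time constant, parameterised here for demonstration. Returns true
 * if the tuple was absorbed by the hash table, false if the caller must
 * write it straight to the output list file. */
static bool hash_tuple(struct agg_hash_ctx *ctx, bool is_new_group,
                       long tuple_threshold, double max_selectivity)
{
    if (ctx->state == HS_REJECT_ALL)
        return false;               /* bypass: tuple goes to xasl->list_id */

    ctx->tuples++;
    if (is_new_group)
        ctx->groups++;

    /* Too many distinct keys relative to input: hashing stops paying off,
     * so flip to reject-all; this tuple was still absorbed. */
    if (ctx->tuples >= tuple_threshold
        && (double) ctx->groups / (double) ctx->tuples > max_selectivity)
        ctx->state = HS_REJECT_ALL;
    return true;
}
```

The sketch makes the open question concrete: once the state flips, nothing in this path touches the table again, so whether `mht_table` is freed at dump time or at context teardown is invisible from here.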
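The trailing-group contract can likewise be sketched in isolation. The consumer below is hypothetical (`gby_state`, `gby_put`, `gby_finish` are invented names); it only demonstrates the invariant that a sort substrate invoking nothing but a per-tuple put callback offers no end-of-input "drain" hook, so the caller must finalize the last group itself after the sort returns — the pattern `qexec_groupby` follows around `sort_listfile`.

```c
#include <assert.h>
#include <string.h>

/* Minimal sort-based GROUP BY consumer with a single COUNT(*) accumulator.
 * gby_put plays the role of the put callback handed to the sort; keys are
 * assumed to arrive in sorted order. */
struct gby_state {
    int have_group;   /* a group is currently open */
    int cur_key;      /* grouping key of the open group */
    long cur_count;   /* COUNT(*) accumulator for the open group */
    long emitted;     /* groups finalized so far */
};

static void gby_put(struct gby_state *st, int key)
{
    if (st->have_group && key != st->cur_key) {
        st->emitted++;        /* key change: finalize the open group */
        st->cur_count = 0;
    }
    st->have_group = 1;
    st->cur_key = key;
    st->cur_count++;
}

/* Must be called once after the sort returns; the sort never signals
 * end-of-input, so without this the last group is silently dropped. */
static void gby_finish(struct gby_state *st)
{
    if (st->have_group)
        st->emitted++;
}
```

If a future drain hook finalized the open group inside the sort, calling `gby_finish` afterwards would double-count it — exactly the double-fire risk the open question flags.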
Sources
Raw analyses (raw/code-analysis/cubrid/query-processing/)
- `Post-processing(aggregation).pdf` — sort-based and hash-based GROUP BY execution, GROUPBY_DIMENSION rollup mechanics, deck excerpts of `qexec_groupby` flow.
- `Post-Processing(analytic).pdf` — window/analytic function execution: `qexec_analytic_put_next` flow, sidecar list files, worked examples for `LAG` and `MEDIAN`.
- `Sort, Hash Group By.pptx` — the Aug 2025 internal slide deck on the strategy choice; XASL list shape (`g_outptr_list`, `g_val_list`, `g_regu_list`, `g_agg_list`), processing-side hash insert, key-duplication scenarios, sort-skip conditions, CBRD-25611.
- `_converted/post-processingaggregation.pdf.md`, `_converted/post-processinganalytic.pdf.md`, `_converted/sort-hash-group-by.pptx.md` — markitdown extracts.
Sibling docs
Section titled “Sibling docs”knowledge/research/dbms-general/database-internals.md— Petrov, Ch. 14 (“Query Execution”); the textbook framing for sort vs. hash aggregation and the external-sort substrate.knowledge/code-analysis/cubrid/cubrid-heartbeat.md— style exemplar for this document’s structure.
Textbook chapters / papers
Section titled “Textbook chapters / papers”- Database Internals (Petrov, 2019), Ch. 14 — query execution models, sort/hash trade-offs, external sort.
- Database Systems: The Complete Book (Garcia-Molina, Ullman, Widom), Ch. 15 — physical query plans, two-pass algorithms based on sorting and hashing.
- ISO/IEC 9075-2:2003 (SQL:2003) —
WINDOWclause definition; the specification CUBRID’s analytic execution implements. - PostgreSQL
src/backend/executor/nodeAgg.candnodeWindowAgg.csource (BSD license) — comparative reference for the strategy-choice and window-execution alternatives.
CUBRID source (/data/hgryoo/references/cubrid/)
Section titled “CUBRID source (/data/hgryoo/references/cubrid/)”src/query/query_executor.c— bulk of post-processing:qexec_groupby,qexec_execute_analytic, the put callbacks, the per-function evaluators, the hash-context driver.src/query/query_aggregate.cpp— aggregate evaluation helpers:qdata_evaluate_aggregate_list,qdata_finalize_aggregate_list, hash-key/value lifecycle.src/query/query_analytic.cpp— analytic evaluation helpers:qdata_initialize_analytic_func,qdata_evaluate_analytic_func,qdata_finalize_analytic_func.src/storage/external_sort.c/external_sort.h—sort_listfileand the in-phase / ex-phase merge sort.src/query/list_file.c—qfile_*list-file primitives used for both input and the analytic sidecars (group_list_id,value_list_id,interm_file).src/xasl/buildlist_proc_node.hpp(or wherever the node is defined) —BUILDLIST_PROC_NODEcarryingg_outptr_list,g_agg_list,g_hash_eligible,agg_hash_context,g_with_rollup.src/parser/parse_tree.h,xasl_generation.c— parser and XASL-builder side that constructs the lists and decidesg_hash_eligible;pt_optimize_analytic_listfor the analytic function shared-sort merge.