Skip to content

PostgreSQL Aggregation, Sort, and Window Executors

Contents:

The demand-driven iterator model that postgres-executor.md describes treats every operator as an open / next / close triple that returns one tuple per next. But not every relational operator can honor that contract symmetrically. Database System Concepts (Silberschatz, 7e, ch. 15 “Query Processing”) draws the line that organizes this whole document: the distinction between pipelineable and blocking operators.

A pipelineable operator can emit an output tuple after consuming only a bounded prefix of its input — a selection, a projection, a nested-loop join. A blocking operator (DSC §15.7.2.1 calls these the operations that “cannot generate any output until they have examined all their input”) must consume all of its input before it can produce its first output. The textbook names sorting and aggregation as the canonical blockers: “the sort operation … produces its entire output only after it has consumed all of its input,” and aggregation likewise must see every row of a group before it can finalize that group’s result. A blocking operator therefore breaks a pipeline into two — everything below it runs to completion before anything above it starts — which is exactly why a query’s first row can be slow to appear when a Sort or hashed Agg sits near the root.

Three theory anchors from DSC frame the three families this document covers.

1. Aggregation — grouped reduction. DSC §15.5 (“Other Operations → Aggregation”) presents grouping as: partition the input into groups that agree on the grouping columns, then apply each aggregate’s reduction (sum, count, avg, …) within a group. The textbook gives two physical realizations, exactly mirroring the two for grouped join:

  • Sorted aggregation. Sort the input on the grouping columns; then a single linear pass detects group boundaries (a tuple whose grouping columns differ from its predecessor starts a new group) and finalizes each group as the boundary is crossed. Cost is dominated by the sort; memory is O(1) in the number of groups because only one group’s running state is live at a time.
  • Hash aggregation. Build a hash table keyed by the grouping columns; each input tuple probes the table and updates the matching group’s running state in place. No global ordering is needed, and output appears only after the whole input is consumed. Memory is O(number of distinct groups) — which is the operator’s Achilles heel: if the groups don’t fit in memory, a naive hash aggregate either thrashes or fails.

2. Sorting — external merge sort. DSC §15.4 (“Sorting”) treats the case where the relation does not fit in memory as the default and presents external sort-merge: read the input in memory-sized chunks, sort each chunk in memory to produce a run, write runs to disk, then merge the runs in one or more passes. The number of merge passes is logarithmic in the run count, and the in-memory sort of each run is the inner workhorse. A query engine factors all of this into a reusable sort utility so that the Sort plan node, the sorted-aggregate’s input preparation, and index builds can all share one tuned implementation.

3. Window functions — frames over an ordered partition. Window functions (OVER (PARTITION BY … ORDER BY … ROWS/RANGE BETWEEN …)) are the SQL:2003 generalization of aggregation: instead of collapsing a group to one row, they compute, for every input row, an aggregate or ranking over a window frame — a contiguous (or, with exclusion, near-contiguous) sub-range of the row’s partition defined relative to the current row. DSC §5.5.1 (“Windowing”) frames the model: rows are partitioned and ordered; each row’s frame is a sliding window; the result preserves the input cardinality. The naive evaluation — recompute the aggregate over the whole frame for every row — is O(partition × frame); the interesting engineering is in making the frame slide incrementally.

The through-line: all three are blocking or partially-blocking, all three need a place to put data that may not fit in memory, and all three are sensitive to whether the input arrives already ordered. PostgreSQL’s implementations are best read as three points on a how much must I buffer before I can answer spectrum.

One more textbook idea ties them together: the memory hierarchy is the real cost model. DSC’s whole treatment of external sorting and hashing is predicated on the relation not fitting in RAM, so the algorithm’s job is to minimize passes over disk. A blocking operator that fits in memory is cheap and invisible; the same operator that overflows pays the external-algorithm tax. Every PostgreSQL knob in this document — work_mem, hash_mem, bounded sort, the incremental-sort group size, the spill partition count — exists to manage exactly where an operator sits relative to that memory cliff. A reader who keeps the fits / does-not-fit dichotomy in mind will recognize each spill, each bounded mode, and each prefix-group bound as a move to stay on the cheap side of it.

The textbook gives the algorithms. This section names the engineering conventions that production engines layer on top, the patterns PostgreSQL’s specific symbols in the next section instantiate.

Push the choice of strategy into the optimizer

Section titled “Push the choice of strategy into the optimizer”

Sorted vs. hashed aggregation, full vs. incremental sort, the order of window-function stacking — these are plan-time decisions. The executor node does not re-derive them; it reads a strategy tag off its Plan node and switches on it. The cost model weighs the sort that sorted aggregation needs against the memory risk that hash aggregation runs, and emits one tag. Keeping the decision in the planner means the executor’s job is mechanical dispatch, and the same executor code serves every strategy.

A shared, tunable sort utility — not per-node sort code

Section titled “A shared, tunable sort utility — not per-node sort code”

Rather than open-code external merge sort inside the Sort node, the sorted-aggregate input path, and the index builder, engines centralize it in one module (PostgreSQL: tuplesort.c). That module owns the work-memory budget, the in-memory quicksort, the run generation, the tape merge, the optional bounded (top-N heap) mode for LIMIT, and the datum fast path for single-column sorts. Every consumer hands it tuples and later pulls them back sorted. This document treats the utility as a black box (owned by postgres-tuplesort.md) and focuses on how the nodes drive it.

Spill, don’t fail, when the hash table overflows

Section titled “Spill, don’t fail, when the hash table overflows”

The fatal weakness of hash aggregation — unbounded memory in the number of groups — is tamed by spilling. When the live hash table exceeds its memory budget, the engine stops creating new groups; tuples that would start a new group are partitioned by a hash of their grouping key and written to disk, to be re-aggregated in a later batch. Because a batch can itself overflow, spilling is recursive — each batch carries a count of hash bits already consumed, and a too-big batch re-partitions on the next bits. This converts a hard out-of-memory failure into graceful disk use, at the cost of extra I/O.

Exploit pre-sortedness: incremental and bounded sorts

Section titled “Exploit pre-sortedness: incremental and bounded sorts”

If the input is already sorted on a prefix of the requested keys (because a child index or a lower sort provided it), a full sort is wasteful. The incremental sort convention sorts one prefix-group at a time: read rows until the prefix key changes, sort just that group on the remaining keys, emit it, repeat. This bounds memory to one group and — crucially — returns the first rows before the whole input is read, which is the entire point under LIMIT. Independently, a bounded (top-N) sort discards all but the smallest N tuples seen so far when only LIMIT N rows are ever needed.

Buffer the partition; slide the frame incrementally

Section titled “Buffer the partition; slide the frame incrementally”

Window functions need random access to other rows of the same partition, so the engine buffers each partition in a spillable tuple store and keeps several read cursors into it (one per window function plus frame-boundary probes). The performance trick is the inverse transition function: for a moving frame whose head only advances, instead of recomputing the aggregate over the whole frame at each row, the engine adds newly-entered rows with the forward transition function and removes newly-exited rows with an inverse transition function, sliding the running state in amortized O(1) per row.

Two-tier memory: per-group state vs. per-tuple scratch

Section titled “Two-tier memory: per-group state vs. per-tuple scratch”

Aggregation accumulates running state across many input rows, so the transition value must outlive the per-tuple scratch used to evaluate the aggregate’s arguments. Engines therefore keep at least two memory arenas: a longer-lived one for transition values (reset only at group boundaries) and a per-input-tuple one (reset every row). Grouping sets multiply this: a separate transition arena per concurrently-tracked grouping set, so that crossing a fine-grained group’s boundary can reset that set’s state while the coarser rollup levels keep accumulating. The reset is a bulk context-clear, never a retail free() per transition value — which is why even an aggregate over billions of rows holds bounded transient memory.

Recompile expressions when the tuple format changes

Section titled “Recompile expressions when the tuple format changes”

A subtle but load-bearing convention: the fused transition expression is compiled for a specific input tuple shape. Spilled tuples come back from disk as minimal tuples, not in the outer plan’s native format, so an engine that JIT-compiles or pre-builds its transition expression must recompile it when switching to read spilled data. Skipping this would either mis-deform the spilled tuple or silently read garbage. PostgreSQL makes this explicit at every format boundary rather than forcing one universal format and paying a copy on the common in-memory path.

Theory / conventionPostgreSQL name
Aggregation strategy tagAggStrategy enum: AGG_PLAIN / AGG_SORTED / AGG_HASHED / AGG_MIXED
Sorted-aggregation group boundaryExecQualAndReset(phase->eqfunctions[…]) in agg_retrieve_direct
Hash-aggregation tableper-grouping-set TupleHashTable in AggStatePerHash
Hash spill trigger / modehash_agg_check_limitshash_agg_enter_spill_mode
Partitioned spill filesHashAggSpill over LogicalTapeSet; recursive HashAggBatch
Grouping-sets single-pass planphase chain (AggStatePerPhase), initialize_phase
Shared sort utilitytuplesort.c (tuplesort_begin_heap / _begin_datum / _gettupleslot)
Datum (single-column) sort fast pathSortState.datumSorttuplesort_begin_datum / tuplesort_getdatum
Bounded (top-N) sortSortState.boundedtuplesort_set_bound
Incremental sort state machineINCSORT_LOADFULLSORT / READFULLSORT / LOADPREFIXSORT / READPREFIXSORT
Prefix-group boundary testisCurrentGroup over PresortedKeyData
Window partition bufferWindowAggState.buffer (a Tuplestorestate)
Window per-function cursorsWindowObject.readptr / markptr into the tuplestore
Moving-frame incremental aggregateforward transfn + invtransfn in eval_windowaggregates
Per-group transition arena vs. per-tupleaggcontexts[] / hashcontext vs. tmpcontext

The planner side — why a given strategy was chosen, how GROUPING SETS is decomposed into a phase chain, where IncrementalSort paths are generated — is owned by postgres-planner-overview.md and postgres-path-generation.md. The tuplesort.c internals (run generation, tape merge, top-N heap) are owned by postgres-tuplesort.md. The ExecBuildAggTrans expression that fuses argument evaluation with transition-function calls is owned by postgres-expression-eval.md. This document covers the nodes: how each drives the executor’s pull protocol, what it buffers, and how it spills.

PostgreSQL renders the three families as four executor nodes — Agg, Sort, IncrementalSort, WindowAgg — each a PlanState subtype with an ExecProcNode function pointer, exactly like every other node in postgres-executor.md. What distinguishes them is internal state and buffering, not the external interface.

The planner tags the Agg plan node with an AggStrategy:

// AggStrategy — src/include/nodes/nodes.h
typedef enum AggStrategy
{
AGG_PLAIN, /* simple agg across all input rows */
AGG_SORTED, /* grouped agg, input must be sorted */
AGG_HASHED, /* grouped agg, use internal hashtable */
AGG_MIXED, /* grouped agg, hash and sort both used */
} AggStrategy;

ExecAgg is a thin dispatcher: AGG_HASHED/AGG_MIXED build then drain a hash table; AGG_PLAIN/AGG_SORTED stream groups directly off the (planner-ordered) input.

// ExecAgg — src/backend/executor/nodeAgg.c
static TupleTableSlot *
ExecAgg(PlanState *pstate)
{
AggState *node = castNode(AggState, pstate);
TupleTableSlot *result = NULL;
if (!node->agg_done)
{
switch (node->phase->aggstrategy)
{
case AGG_HASHED:
if (!node->table_filled)
agg_fill_hash_table(node);
/* FALLTHROUGH */
case AGG_MIXED:
result = agg_retrieve_hash_table(node);
break;
case AGG_PLAIN:
case AGG_SORTED:
result = agg_retrieve_direct(node);
break;
}
if (!TupIsNull(result))
return result;
}
return NULL;
}

Sorted/plain path. agg_retrieve_direct exploits the invariant that a sorted (or plain, single-group) input lets it process one group with O(1) state. It fetches the first tuple of a group, runs advance_aggregates over every subsequent tuple until a group boundary — detected by running the grouping-column equality expression against the previous row — then finalizes and projects the group. The per-group transition state lives in aggcontexts[], which is rescanned (so registered shutdown callbacks fire) at each boundary; the per-input-tuple scratch lives in tmpcontext, reset every row:

// agg_retrieve_direct — src/backend/executor/nodeAgg.c (condensed)
for (;;)
{
/* ... advance over the current group, fetch next tuple ... */
advance_aggregates(aggstate);
ResetExprContext(tmpcontext); /* per-input-tuple scratch */
outerslot = fetch_input_tuple(aggstate);
if (TupIsNull(outerslot))
break; /* group runs to end of input */
}

The group boundary itself is detected by running the grouping-column equality expression for the next grouping set against the previous row; ExecQualAndReset both evaluates the comparison and resets the per-tuple context. For grouping sets, nextSetSize selects which set’s equality function to use, so a single ordered pass can carry several rollup levels at once:

// agg_retrieve_direct — src/backend/executor/nodeAgg.c (grouping-set boundary)
tmpcontext->ecxt_innertuple = econtext->ecxt_outertuple;
if (aggstate->input_done ||
(node->aggstrategy != AGG_PLAIN &&
aggstate->projected_set != -1 &&
aggstate->projected_set < (numGroupingSets - 1) &&
nextSetSize > 0 &&
!ExecQualAndReset(aggstate->phase->eqfunctions[nextSetSize - 1],
tmpcontext)))
{
aggstate->projected_set += 1; /* advance to next grouping set */
}
else
{
aggstate->projected_set = 0; /* start the first/only set fresh */
/* ... fetch grp_firstTuple, initialize_aggregates, then advance loop ... */
}

Hashed path. agg_fill_hash_table drains the entire outer plan into per-grouping-set hash tables (it is a blocking step), then agg_retrieve_hash_table iterates the tables emitting one row per group. For each input tuple lookup_hash_entries probes every hash table; a hit advances that group’s state in place, a miss creates a new group (unless spill mode is active):

// lookup_hash_entries — src/backend/executor/nodeAgg.c (condensed)
for (setno = 0; setno < aggstate->num_hashes; setno++)
{
/* if hash table already spilled, don't create new entries */
p_isnew = aggstate->hash_spill_mode ? NULL : &isnew;
select_current_set(aggstate, setno, true);
prepare_hash_slot(perhash, outerslot, hashslot);
entry = LookupTupleHashEntry(hashtable, hashslot, p_isnew, &hash);
if (entry != NULL)
{
if (isnew)
initialize_hash_entry(aggstate, hashtable, entry);
pergroup[setno] = TupleHashEntryGetAdditional(hashtable, entry);
}
else
{
/* no room for a new group: spill this tuple to disk */
hashagg_spill_tuple(aggstate, &aggstate->hash_spills[setno], slot, hash);
pergroup[setno] = NULL;
}
}

Spilling. After inserting a group, hash_agg_check_limits sums the metadata, entry, and transition-value memory; if it exceeds hash_mem_limit or the group count exceeds hash_ngroups_limit, the node enters spill mode. Spilled tuples are partitioned by hash bits onto LogicalTapes; once the in-memory groups are drained, agg_refill_hash_table pops one spilled HashAggBatch, resets the tables, and re-aggregates it — recursing (re-partitioning on more hash bits) if a batch is itself too large.

flowchart TB
  START["ExecAgg first call<br/>AGG_HASHED"]
  FILL["agg_fill_hash_table:<br/>drain outer plan"]
  LOOK["lookup_hash_entries<br/>probe each hashtable"]
  HIT{"group<br/>found?"}
  ADV["advance_aggregates<br/>(update state in place)"]
  CHK["hash_agg_check_limits"]
  OVER{"over<br/>hash_mem?"}
  SPILLMODE["hash_agg_enter_spill_mode<br/>(stop new groups)"]
  SPILL["hashagg_spill_tuple<br/>partition to LogicalTape"]
  DRAIN["agg_retrieve_hash_table_in_memory"]
  REFILL{"spilled<br/>batches left?"}
  POP["agg_refill_hash_table:<br/>pop HashAggBatch, reset,<br/>re-aggregate (recursive)"]
  DONE["agg_done = true"]

  START --> FILL --> LOOK --> HIT
  HIT -- yes --> ADV --> CHK --> OVER
  HIT -- "no, spill mode" --> SPILL
  OVER -- yes --> SPILLMODE
  OVER -- no --> LOOK
  SPILLMODE --> LOOK
  SPILL --> LOOK
  FILL --> DRAIN --> REFILL
  REFILL -- yes --> POP --> DRAIN
  REFILL -- no --> DONE

Grouping sets. A ROLLUP/CUBE/GROUPING SETS query is one “real” Agg node with a chain of additional Agg nodes. The executor collapses these into phases: phase 0 holds all hashed grouping sets; phases 1..n each hold one sorted rollup with its own sort order. Within a single ordered pass, agg_retrieve_direct tracks one transition state per grouping set (in aggcontexts[0..k]) and resets only the finer-grained sets at each boundary. AGG_MIXED populates the hash tables during the first sorted phase, then switches to phase 0 to read them out — initialize_phase is the state transition.

Sort: a pure blocking operator over tuplesort

Section titled “Sort: a pure blocking operator over tuplesort”

Sort is the textbook blocker. The first ExecSort call drains the child entirely into tuplesort.c, calls tuplesort_performsort, and flips sort_Done; every later call just pulls the next sorted tuple back. The node picks a datum sort for a single output column (cheaper, especially for pass-by-value types) and a heap/tuple sort otherwise, and forwards the planner’s bounded/randomAccess hints to the utility:

// ExecSort — src/backend/executor/nodeSort.c (condensed)
if (!node->sort_Done)
{
if (node->datumSort)
tuplesortstate = tuplesort_begin_datum(/* type, op, coll, … */);
else
tuplesortstate = tuplesort_begin_heap(tupDesc, plannode->numCols,
plannode->sortColIdx, /* … */);
if (node->bounded)
tuplesort_set_bound(tuplesortstate, node->bound);
for (;;) /* blocking: read ALL input */
{
slot = ExecProcNode(outerNode);
if (TupIsNull(slot))
break;
tuplesort_puttupleslot(tuplesortstate, slot);
}
tuplesort_performsort(tuplesortstate);
node->sort_Done = true;
}
/* subsequent calls: stream sorted tuples out */
(void) tuplesort_gettupleslot(tuplesortstate, ScanDirectionIsForward(dir),
false, slot, NULL);

The node deliberately leaves its ExprContext uninitialized — Sort never evaluates a qual or projection; it only reorders its child’s slots.

IncrementalSort: a four-state machine over a presorted prefix

Section titled “IncrementalSort: a four-state machine over a presorted prefix”

When the input is already ordered on a prefix of the sort keys, IncrementalSort avoids one giant sort by sorting one prefix-group at a time. It runs a four-state machine (INCSORT_LOADFULLSORT, INCSORT_READFULLSORT, INCSORT_LOADPREFIXSORT, INCSORT_READPREFIXSORT) with a hybrid heuristic: it first accumulates a minimum group of DEFAULT_MIN_GROUP_SIZE (32) tuples sorting on all keys (cheap for tiny groups), and only when a group proves large does it switch to sorting just the unsorted suffix keys. Prefix-group membership is tested by isCurrentGroup, comparing presorted columns from the last key backward (the tail key is likeliest to change first):

// isCurrentGroup — src/backend/executor/nodeIncrementalSort.c (condensed)
for (int i = nPresortedCols - 1; i >= 0; i--) /* last key first */
{
datumA = slot_getattr(pivot, attno, &isnullA);
datumB = slot_getattr(tuple, attno, &isnullB);
/* NULL-vs-NULL handled specially, else call cached equality fn */
key->fcinfo->args[0].value = datumA;
key->fcinfo->args[1].value = datumB;
result = FunctionCallInvoke(key->fcinfo);
if (!DatumGetBool(result))
return false; /* prefix changed: new group */
}
return true;

The payoff is twofold: memory is bounded to one prefix group (less spilling), and the node returns its first rows before reading the whole input — a large win under LIMIT, the case the planner most often picks it for.

flowchart TB
  LF["INCSORT_LOADFULLSORT:<br/>accumulate up to 32 tuples,<br/>sort on ALL keys"]
  BIG{"group still<br/>growing past<br/>min size?"}
  SW["switchToPresortedPrefixMode:<br/>re-bucket accumulated tuples"]
  LP["INCSORT_LOADPREFIXSORT:<br/>load one prefix group,<br/>sort on SUFFIX keys only"]
  RF["INCSORT_READFULLSORT:<br/>emit sorted full batch"]
  RP["INCSORT_READPREFIXSORT:<br/>emit sorted prefix group"]
  MORE{"more input?"}
  END["return empty slot"]

  LF --> BIG
  BIG -- "no (small/end)" --> RF
  BIG -- "yes" --> SW --> LP --> RP
  RF --> MORE
  RP --> MORE
  MORE -- yes --> LF
  MORE -- no --> END

WindowAgg: a buffered partition with a sliding frame

Section titled “WindowAgg: a buffered partition with a sliding frame”

WindowAgg evaluates window functions for a single window spec; the planner stacks several with intervening Sort nodes for multiple specs. Input must arrive sorted by PARTITION BY then ORDER BY. The node buffers each partition in a Tuplestorestate (begin_partition / spool_tuples) so window functions can reach any row of the partition through the WindowObject API, which keeps per-function read/mark cursors into the store.

The performance core is eval_windowaggregates. For a frame whose head is UNBOUNDED PRECEDING with no exclusion, rows only enter the frame, so it runs the forward transition function incrementally. For a frame whose head advances, it removes exited rows with the aggregate’s inverse transition function rather than recomputing from scratch; if no inverse function exists (or it declines), it restarts the aggregation over the new frame. Peer-group sharing lets one computed value serve all rows with an identical frame.

// ExecWindowAgg — src/backend/executor/nodeWindowAgg.c (condensed)
if (winstate->status == WINDOWAGG_DONE)
return NULL;
for (;;)
{
if (winstate->next_partition)
begin_partition(winstate); /* buffer a fresh partition */
else
winstate->currentpos++; /* advance current row; frame moves */
spool_tuples(winstate, winstate->currentpos);
if (winstate->partition_spooled &&
winstate->currentpos >= winstate->spooled_rows)
{
release_partition(winstate);
if (winstate->more_partitions)
begin_partition(winstate);
else { winstate->status = WINDOWAGG_DONE; return NULL; }
}
/* eval frame-head/tail, window funcs, project; loop if qual rejects */
}

The decision tree at the heart of eval_windowaggregates is the restart-vs-slide test. For each aggregate, the node restarts the running transition value from scratch only when it has no choice: the very first row, an advancing frame head with no inverse function, any EXCLUDE clause, or a new frame that does not overlap the old one. In every other case it slides — removing the rows that fell off the top with the inverse transition function and adding the rows that entered at the bottom with the forward one:

// eval_windowaggregates — src/backend/executor/nodeWindowAgg.c (condensed)
numaggs_restart = 0;
for (i = 0; i < numaggs; i++)
{
peraggstate = &winstate->peragg[i];
if (winstate->currentpos == 0 || /* first row */
(winstate->aggregatedbase != winstate->frameheadpos && /* head moved */
!OidIsValid(peraggstate->invtransfn_oid)) || /* no inverse fn */
(winstate->frameOptions & FRAMEOPTION_EXCLUSION) || /* EXCLUDE clause */
winstate->aggregatedupto <= winstate->frameheadpos) /* no overlap */
{
peraggstate->restart = true;
numaggs_restart++;
}
else
peraggstate->restart = false;
}
/* slide: remove rows that fell off the frame head via inverse transition */
while (numaggs_restart < numaggs &&
winstate->aggregatedbase < winstate->frameheadpos)
{
window_gettupleslot(agg_winobj, winstate->aggregatedbase, temp_slot);
/* ... advance_windowaggregate_base() per agg, or mark restart on failure ... */
}

This is the difference between O(partition) and O(partition × frame) work for a moving-average-style query: sum/count/avg have inverse transition functions and slide; max/min over a shrinking frame do not, so they restart.

Four files, one per node, all under src/backend/executor/. The symbols below are the stable anchors; line numbers live only in the position-hint table at the end of this section.

The dispatch and retrieval spine:

  • ExecAgg — the ExecProcNode entry; switches on node->phase->aggstrategy and routes to the direct or hash retriever.
  • agg_retrieve_direct — the AGG_PLAIN/AGG_SORTED engine. Walks groups off ordered input, detecting boundaries with the phase’s grouping equality expressions (phase->eqfunctions[…] via ExecQualAndReset), finalizing and projecting one group per call. Drives the grouping-sets phase machine: when a phase’s input is exhausted it calls initialize_phase to advance to the next sort order, and an AGG_MIXED node switches to phase 0 to drain hash tables.
  • agg_fill_hash_table — blocking pre-pass for AGG_HASHED: drains the outer plan, calling lookup_hash_entries + advance_aggregates per tuple, then hashagg_finish_initial_spills.
  • lookup_hash_entries — probes each grouping set’s TupleHashTable; hit ⇒ advance in place; miss ⇒ create a group, or (in spill mode) hashagg_spill_tuple.
  • agg_retrieve_hash_table / agg_retrieve_hash_table_in_memory — iterate the filled tables, finalizing one group per call; when in-memory groups are exhausted, call agg_refill_hash_table.
  • agg_refill_hash_table — pops one spilled HashAggBatch, resets the tables, recompiles the transition expression for MinimalTuple input, and re-aggregates the batch (recursively spilling on overflow).

The memory/limit machinery:

  • build_hash_tables / build_hash_table — create one TupleHashTable per grouping set with a bucket count from hash_choose_num_buckets.
  • hash_agg_set_limits — derives mem_limit and ngroups_limit from get_hash_memory_limit(), reserving tape-buffer memory if a spill is expected.
  • hash_agg_check_limits — after each new group, sums meta + entry + transition-value memory and trips spilling if over the memory or group limit.
  • hash_agg_enter_spill_mode — flips hash_spill_mode, recompiles expressions, and lazily creates the LogicalTapeSet and per-set HashAggSpill partitions.
  • ExecInitAgg — builds the AggState, the phase array, the per-agg / per-trans tables, and (for hashed/mixed) the hash tables and contexts; reads use_hashing off the strategy tag.
  • ExecReScanAgg — resets for re-execution: a hashed agg can simply re-iterate its existing table if it has no params and never spilled.

Key structs/enums: AggStrategy (nodes.h), AggStatePerHashData, AggStatePerGroupData, HashAggSpill, HashAggBatch, AggStatePerPhase.

  • ExecSort — blocking on first call: builds the tuplesort (tuplesort_begin_datum for a single column when datumSort, else tuplesort_begin_heap), feeds every child tuple (tuplesort_puttupleslot / tuplesort_putdatum), tuplesort_performsort, sets sort_Done; later calls stream out via tuplesort_gettupleslot / tuplesort_getdatum.
  • ExecInitSort — sets randomAccess from EXEC_FLAG_REWIND | BACKWARD | MARK, decides datumSort from output column count, and shields the child from rewind/backward/mark requirements. Sort nodes do not initialize an ExprContext (no qual, no projection).
  • ExecReScanSort — reuses the sorted output if random access and no param change; otherwise tears the tuplesort down for a fresh sort.
  • ExecIncrementalSort — the four-state machine: load a min-size group full-sorting on all keys (INCSORT_LOADFULLSORT), emit it (INCSORT_READFULLSORT), and on large groups transition to suffix-only sorting (INCSORT_LOADPREFIXSORT / READPREFIXSORT).
  • preparePresortedCols — caches the equality FmgrInfo/FunctionCallInfo for each presorted key into PresortedKeyData.
  • isCurrentGroup — tests prefix-group membership against the group pivot, comparing the last presorted key first.
  • switchToPresortedPrefixMode — re-buckets already-accumulated tuples into the prefix-sort state when a large group is detected, carrying over any straggler prefix group via n_fullsort_remaining.
  • ExecInitIncrementalSort — builds the state; DEFAULT_MIN_GROUP_SIZE (32) is the hybrid threshold.
  • ExecWindowAgg — partition loop; per row spools input, advances the current position, updates frame validity, evaluates window functions, and projects. status is one of WINDOWAGG_RUN / WINDOWAGG_PASSTHROUGH / WINDOWAGG_PASSTHROUGH_STRICT / WINDOWAGG_DONE.
  • begin_partition — resets frame/group counters, allocates the tuplestore (prepare_tuplestore), resets per-function mark/seek cursors, and stores the first row.
  • spool_tuples — reads outer rows into the tuplestore up to a target position (or the whole partition), detecting the partition boundary.
  • eval_windowaggregates — the moving-aggregate engine: incremental forward transition for fixed frame heads, inverse transition (advance_windowaggregate_base) for advancing heads, full restart when no inverse exists, and peer-group result caching.
  • advance_windowaggregate — forward transition for one aggregate over one row (FILTER, strictness, expanded-object handling).
  • update_frameheadpos / update_frametailpos / row_is_in_frame — resolve ROWS/RANGE/GROUPS frame boundaries and exclusion against the buffered partition.
  • ExecInitWindowAgg — builds WindowAggState, the per-function and per-agg tables, the WindowObjects, and frame-offset expressions.

Frame semantics flags (FRAMEOPTION_RANGE / _ROWS / _GROUPS, FRAMEOPTION_EXCLUDE_*) live in parsenodes.h.

Position hints (as of 2026-06-05, REL_18 273fe94)

Section titled “Position hints (as of 2026-06-05, REL_18 273fe94)”
SymbolFileLine
AggStrategy (enum)src/include/nodes/nodes.h358
initialize_phasesrc/backend/executor/nodeAgg.c479
build_hash_tablessrc/backend/executor/nodeAgg.c1466
hash_agg_set_limitssrc/backend/executor/nodeAgg.c1809
hash_agg_check_limitssrc/backend/executor/nodeAgg.c1867
hash_agg_enter_spill_modesrc/backend/executor/nodeAgg.c1911
lookup_hash_entriessrc/backend/executor/nodeAgg.c2181
ExecAggsrc/backend/executor/nodeAgg.c2244
agg_retrieve_directsrc/backend/executor/nodeAgg.c2280
agg_fill_hash_tablesrc/backend/executor/nodeAgg.c2626
agg_refill_hash_tablesrc/backend/executor/nodeAgg.c2680
agg_retrieve_hash_tablesrc/backend/executor/nodeAgg.c2835
ExecInitAggsrc/backend/executor/nodeAgg.c3279
ExecReScanAggsrc/backend/executor/nodeAgg.c4466
ExecSortsrc/backend/executor/nodeSort.c50
ExecInitSortsrc/backend/executor/nodeSort.c221
ExecReScanSortsrc/backend/executor/nodeSort.c362
preparePresortedColssrc/backend/executor/nodeIncrementalSort.c164
isCurrentGroupsrc/backend/executor/nodeIncrementalSort.c212
switchToPresortedPrefixModesrc/backend/executor/nodeIncrementalSort.c286
DEFAULT_MIN_GROUP_SIZE (macro)src/backend/executor/nodeIncrementalSort.c467
ExecIncrementalSortsrc/backend/executor/nodeIncrementalSort.c495
ExecInitIncrementalSortsrc/backend/executor/nodeIncrementalSort.c976
advance_windowaggregatesrc/backend/executor/nodeWindowAgg.c243
advance_windowaggregate_basesrc/backend/executor/nodeWindowAgg.c420
eval_windowaggregatessrc/backend/executor/nodeWindowAgg.c664
begin_partitionsrc/backend/executor/nodeWindowAgg.c1194
spool_tuplessrc/backend/executor/nodeWindowAgg.c1283
row_is_in_framesrc/backend/executor/nodeWindowAgg.c1427
update_frameheadpossrc/backend/executor/nodeWindowAgg.c1536
ExecWindowAggsrc/backend/executor/nodeWindowAgg.c2211
ExecInitWindowAggsrc/backend/executor/nodeWindowAgg.c2479

Each entry leads with a fact about the current source at commit 273fe94 (REL_18_STABLE), readable without other materials. The trailing sentence shows how it was checked.

  • ExecAgg dispatches on exactly four strategies. The switch on node->phase->aggstrategy has cases AGG_HASHED (falling through to AGG_MIXED), AGG_MIXED, and AGG_PLAIN/AGG_SORTED; the enum in nodes.h defines precisely AGG_PLAIN, AGG_SORTED, AGG_HASHED, AGG_MIXED. Verified by reading ExecAgg and the AggStrategy enum on 2026-06-05.

  • Hash aggregation enters spill mode on either a memory or a group-count overrun. hash_agg_check_limits computes total_mem = meta_mem + entry_mem + tval_mem and trips hash_agg_enter_spill_mode when, with at least one group present, total_mem > hash_mem_limit || ngroups > hash_ngroups_limit. The ngroups_limit exists specifically for transition states (e.g. array_agg) that grow far past their initial size. Verified by reading hash_agg_check_limits and hash_agg_set_limits.

  • Spilled aggregation is recursive over partitioned logical tapes. hash_agg_enter_spill_mode creates a LogicalTapeSet and one HashAggSpill per grouping set; agg_refill_hash_table pops one HashAggBatch (a stack via llast / list_delete_last), carries used_bits and input_card, and may itself spill again. Verified by reading hash_agg_enter_spill_mode and agg_refill_hash_table.

  • Spilled tuples are read back as MinimalTuple, forcing an expression recompile. agg_refill_hash_table calls hashagg_recompile_expressions(aggstate, true, true) and stores tuples with ExecStoreMinimalTuple before probing. Verified by reading agg_refill_hash_table.

  • AGG_MIXED populates hash tables during sorted phase 1 and reads them out in phase 0. In agg_retrieve_direct, during phase 1 of a mixed agg (current_phase == 1) each tuple also calls lookup_hash_entries; when input ends the node initialize_phase(…, 0) and switches to agg_retrieve_hash_table. agg_refill_hash_table mirrors this, toggling current_phase to 1 to process a batch then back to 0. Verified by reading both functions.

  • Sort is fully blocking and never evaluates quals or projection. ExecSort loads the entire outer plan into the tuplesort before returning the first tuple (sort_Done guards this), and ExecInitSort’s comment states “Sort nodes don’t initialize their ExprContexts because they never call ExecQual or ExecProject.” Verified by reading ExecSort and ExecInitSort.

  • Sort chooses a datum sort only for a single output column. ExecInitSort sets datumSort = true when the result tuple descriptor has one attribute, and ExecSort then uses tuplesort_begin_datum / tuplesort_getdatum instead of the heap variants. Verified by reading both.

  • Incremental sort’s hybrid threshold is 32 tuples. #define DEFAULT_MIN_GROUP_SIZE 32; ExecIncrementalSort accumulates up to minGroupSize tuples in full-sort mode before considering a switch, and clamps minGroupSize to the bound when bounded. Verified by reading the macro and the INCSORT_LOADFULLSORT block.

  • Incremental sort compares presorted keys from the last key backward. isCurrentGroup loops for (i = nPresortedCols - 1; i >= 0; i--), with a comment that the tail key is likeliest to change first, minimizing comparator calls. Verified by reading isCurrentGroup.

  • WindowAgg buffers each partition in a tuplestore and slides moving-frame aggregates with an inverse transition function. begin_partition allocates winstate->buffer and eval_windowaggregates documents and implements the three regimes — incremental forward transition for fixed UNBOUNDED PRECEDING heads, inverse transition for advancing heads, full restart when no inverse function exists or it returns NULL. Verified by reading begin_partition, eval_windowaggregates, and advance_windowaggregate_base.

  • A WindowAgg handles exactly one window spec; multiple specs are stacked by the planner. The file header states each WindowAgg “works for just a single window specification” and the planner “generates a stack of WindowAggs with intervening Sort nodes.” Verified by reading the nodeWindowAgg.c header comment.

  1. What exactly determines the partition count and re-partition bit allotment for hash-agg spills? hash_choose_num_partitions and the used_bits accounting in HashAggBatch were read at call sites but the bit-budget math (HASHAGG_MAX_PARTITIONS, the cost balance between partition fan-out and recursion depth) was not traced. Investigation path: read hash_choose_num_partitions and hashagg_spill_init in nodeAgg.c.

  2. How does WINDOWAGG_PASSTHROUGH_STRICT interact with a non-top-level WindowAgg that must still feed rows upward? spool_tuples keeps rows for an upper WindowAgg even in pass-through; the precise condition that selects PASSTHROUGH vs. PASSTHROUGH_STRICT (tied to runCondition pushdown) was only partially traced. Investigation path: read the runCondition handling around ExecWindowAgg’s status assignment.

  3. Where is the boundary between the Sort node and tuplesort.c’s bounded / parallel modes? This doc treats run generation, tape merge, and the top-N heap as a black box. The parallel-worker Sort path (shared_info, tuplesort_get_stats) was seen but not analyzed. Investigation path: postgres-tuplesort.md plus nodeSort.c’s am_worker branch.

Beyond PostgreSQL — Comparative Designs & Research Frontiers

Section titled “Beyond PostgreSQL — Comparative Designs & Research Frontiers”

Pointers, not analysis. Each bullet is a handle for a follow-up doc; depth here is intentionally shallow.

  • Spilling hash aggregation is recent in PostgreSQL. Before PG 13, hash aggregation could exceed work_mem without bound (the planner simply avoided it when it expected too many groups, sometimes mis-estimating). The partitioned-spill machinery (hash_agg_enter_spill_mode, HashAggBatch) landed in PG 13 (Jeff Davis), bringing PostgreSQL in line with the “grace hash” / partition-and-recurse pattern that DSC §15.5 and the hash-join literature (hashjoin in postgres-scan-nodes.md territory) had long assumed. A side-by-side with hash join spilling would show how much machinery the two share.

  • CUBRID’s aggregation and sort paths. CUBRID evaluates aggregation inside its XASL tree with a GROUPBY / aggregate_list mechanism and a separate external-sort utility, but its hash-aggregation and grouping-sets support, and whether it spills comparably, differ from PostgreSQL’s phase chain. A comparison of how each engine handles ROLLUP/CUBE in a single ordered pass would sharpen what PostgreSQL’s AGG_MIXED phase machine buys.

  • Vectorized aggregation and sorting. The tuple-at-a-time transition call is exactly the per-tuple overhead the vectorized engines attack. MonetDB/X100 (Boncz, Zukowski & Nes, CIDR 2005) and the column-store lineage (dbms-papers/cstore.md) aggregate over vectors of grouping keys and use vectorized partitioning sorts. PostgreSQL stays tuple-at-a-time; ExecBuildAggTrans fusing argument evaluation and the transition call into one JIT-able expression (postgres-expression-eval.md, postgres-jit.md) is its answer to the interpreter overhead instead.

  • Adaptive aggregation: hash vs. sort decided at run time. PostgreSQL fixes the strategy at plan time. Systems such as Hellerstein’s eddies line of work, and more recently run-time-adaptive operators in HyPer/Umbra, switch between hashing and sorting based on observed cardinality. The research frontier is choosing — and re-choosing — the aggregation physical operator after seeing real group counts, which would make PostgreSQL’s plan-time AggStrategy tag a run-time variable.

  • Incremental sort as a near-sorted-input optimization. Incremental sort (PG 13, Tomáš Vondra / James Coleman) is PostgreSQL’s embodiment of the broad “exploit pre-sortedness” idea also seen in adaptive sorts (Timsort’s run detection, natural merge sort). The frontier question is whether the planner picks it often enough: it shines under LIMIT with a partially-ordered child, but the cost crossover vs. a full sort is estimate-sensitive. A study of IncrementalSort path selection in postgres-path-generation.md would close that loop.

  • Window-function frame algorithms. The inverse-transition trick (Leis et al., “Efficient Processing of Window Functions in Analytical SQL Queries”, VLDB 2015, from the HyPer group) formalizes segment-tree and aggregation-tree approaches that beat PostgreSQL’s linear forward/inverse slide for RANGE/GROUPS frames with non-invertible aggregates (e.g. max over a shrinking frame). PostgreSQL implements the forward+inverse+restart trichotomy but not the tree-based O(log n) frame aggregation; that gap is a concrete optimization frontier.

  • Code (REL_18_STABLE, commit 273fe94, PG 18.x) — source of truth:

    • src/backend/executor/nodeAgg.cAgg strategies, hash spill, grouping-set phases.
    • src/backend/executor/nodeSort.c — blocking Sort over tuplesort.c.
    • src/backend/executor/nodeIncrementalSort.c — four-state incremental sort.
    • src/backend/executor/nodeWindowAgg.c — partition buffering, frame sliding, inverse transition.
    • src/include/nodes/nodes.hAggStrategy enum.
    • src/include/nodes/execnodes.hAggState, SortState, IncrementalSortState (INCSORT_*), WindowAggState (WINDOWAGG_*).
    • src/include/nodes/parsenodes.hFRAMEOPTION_* window-frame flags.
  • Theoryknowledge/research/dbms-general/database-system-concepts.md (Silberschatz 7e): §15.4 Sorting (external sort-merge), §15.5 Aggregation (sorted vs. hashed), §15.7.2 pipelined vs. blocking operators, §5.5.1 Windowing.

  • Sibling code-analysis docs (cross-references, not duplicated here):

    • postgres-executor.md — the PlanState tree, ExecProcNode pull protocol, TupleTableSlot, memory contexts.
    • postgres-tuplesort.md — the tuplesort.c engine (run generation, tape merge, top-N heap, datum vs. heap sort, parallel sort).
    • postgres-expression-eval.mdExprState / ExecBuildAggTrans (the fused transition expression), qual and projection evaluation.
    • postgres-planner-overview.md, postgres-path-generation.md — why a strategy/path is chosen.
  • Comparative pointersdbms-papers/cstore.md (column store / vectorization contrast); MonetDB/X100 (Boncz et al., CIDR 2005); Leis et al., “Efficient Processing of Window Functions in Analytical SQL Queries” (VLDB 2015).