PostgreSQL Aggregation, Sort, and Window Executors
Contents:
- Theoretical Background
- Common DBMS Design
- PostgreSQL’s Approach
- Source Walkthrough
- Source verification (as of 2026-06-05)
- Beyond PostgreSQL — Comparative Designs & Research Frontiers
- Sources
Theoretical Background
Section titled “Theoretical Background”The demand-driven iterator model that postgres-executor.md describes
treats every operator as an open / next / close triple that returns
one tuple per next. But not every relational operator can honor that
contract symmetrically. Database System Concepts (Silberschatz, 7e,
ch. 15 “Query Processing”) draws the line that organizes this whole
document: the distinction between pipelineable and blocking
operators.
A pipelineable operator can emit an output tuple after consuming only
a bounded prefix of its input — a selection, a projection, a nested-loop
join. A blocking operator (DSC §15.7.2.1 calls these the operations
that “cannot generate any output until they have examined all their
input”) must consume all of its input before it can produce its first
output. The textbook names sorting and aggregation as the canonical
blockers: “the sort operation … produces its entire output only after it
has consumed all of its input,” and aggregation likewise must see every
row of a group before it can finalize that group’s result. A blocking
operator therefore breaks a pipeline into two — everything below it runs
to completion before anything above it starts — which is exactly why a
query’s first row can be slow to appear when a Sort or hashed Agg
sits near the root.
Three theory anchors from DSC frame the three families this document covers.
1. Aggregation — grouped reduction. DSC §15.5 (“Other Operations →
Aggregation”) presents grouping as: partition the input into groups that
agree on the grouping columns, then apply each aggregate’s reduction
(sum, count, avg, …) within a group. The textbook gives two
physical realizations, exactly mirroring the two for grouped join:
- Sorted aggregation. Sort the input on the grouping columns; then a single linear pass detects group boundaries (a tuple whose grouping columns differ from its predecessor starts a new group) and finalizes each group as the boundary is crossed. Cost is dominated by the sort; memory is O(1) in the number of groups because only one group’s running state is live at a time.
- Hash aggregation. Build a hash table keyed by the grouping columns; each input tuple probes the table and updates the matching group’s running state in place. No global ordering is needed, and output appears only after the whole input is consumed. Memory is O(number of distinct groups) — which is the operator’s Achilles heel: if the groups don’t fit in memory, a naive hash aggregate either thrashes or fails.
2. Sorting — external merge sort. DSC §15.4 (“Sorting”) treats the
case where the relation does not fit in memory as the default and presents
external sort-merge: read the input in memory-sized chunks, sort each
chunk in memory to produce a run, write runs to disk, then merge the
runs in one or more passes. The number of merge passes is logarithmic in
the run count, and the in-memory sort of each run is the inner workhorse.
A query engine factors all of this into a reusable sort utility so that
the Sort plan node, the sorted-aggregate’s input preparation, and index
builds can all share one tuned implementation.
3. Window functions — frames over an ordered partition. Window
functions (OVER (PARTITION BY … ORDER BY … ROWS/RANGE BETWEEN …)) are
the SQL:2003 generalization of aggregation: instead of collapsing a group
to one row, they compute, for every input row, an aggregate or ranking
over a window frame — a contiguous (or, with exclusion,
near-contiguous) sub-range of the row’s partition defined relative to the
current row. DSC §5.5.1 (“Windowing”) frames the model: rows are
partitioned and ordered; each row’s frame is a sliding window; the result
preserves the input cardinality. The naive evaluation — recompute the
aggregate over the whole frame for every row — is O(partition × frame);
the interesting engineering is in making the frame slide incrementally.
The through-line: all three are blocking or partially-blocking, all three need a place to put data that may not fit in memory, and all three are sensitive to whether the input arrives already ordered. PostgreSQL’s implementations are best read as three points on a how much must I buffer before I can answer spectrum.
One more textbook idea ties them together: the memory hierarchy is the
real cost model. DSC’s whole treatment of external sorting and hashing is
predicated on the relation not fitting in RAM, so the algorithm’s job is to
minimize passes over disk. A blocking operator that fits in memory is cheap
and invisible; the same operator that overflows pays the external-algorithm
tax. Every PostgreSQL knob in this document — work_mem, hash_mem,
bounded sort, the incremental-sort group size, the spill partition count —
exists to manage exactly where an operator sits relative to that memory
cliff. A reader who keeps the fits / does-not-fit dichotomy in mind will
recognize each spill, each bounded mode, and each prefix-group bound as a
move to stay on the cheap side of it.
Common DBMS Design
Section titled “Common DBMS Design”The textbook gives the algorithms. This section names the engineering conventions that production engines layer on top, the patterns PostgreSQL’s specific symbols in the next section instantiate.
Push the choice of strategy into the optimizer
Section titled “Push the choice of strategy into the optimizer”Sorted vs. hashed aggregation, full vs. incremental sort, the order of
window-function stacking — these are plan-time decisions. The executor
node does not re-derive them; it reads a strategy tag off its Plan node
and switches on it. The cost model weighs the sort that sorted
aggregation needs against the memory risk that hash aggregation runs, and
emits one tag. Keeping the decision in the planner means the executor’s
job is mechanical dispatch, and the same executor code serves every
strategy.
A shared, tunable sort utility — not per-node sort code
Section titled “A shared, tunable sort utility — not per-node sort code”Rather than open-code external merge sort inside the Sort node, the
sorted-aggregate input path, and the index builder, engines centralize it
in one module (PostgreSQL: tuplesort.c). That module owns the
work-memory budget, the in-memory quicksort, the run generation, the tape
merge, the optional bounded (top-N heap) mode for LIMIT, and the
datum fast path for single-column sorts. Every consumer hands it tuples
and later pulls them back sorted. This document treats the utility as a
black box (owned by postgres-tuplesort.md) and focuses on how the nodes
drive it.
Spill, don’t fail, when the hash table overflows
Section titled “Spill, don’t fail, when the hash table overflows”The fatal weakness of hash aggregation — unbounded memory in the number of groups — is tamed by spilling. When the live hash table exceeds its memory budget, the engine stops creating new groups; tuples that would start a new group are partitioned by a hash of their grouping key and written to disk, to be re-aggregated in a later batch. Because a batch can itself overflow, spilling is recursive — each batch carries a count of hash bits already consumed, and a too-big batch re-partitions on the next bits. This converts a hard out-of-memory failure into graceful disk use, at the cost of extra I/O.
Exploit pre-sortedness: incremental and bounded sorts
Section titled “Exploit pre-sortedness: incremental and bounded sorts”If the input is already sorted on a prefix of the requested keys (because
a child index or a lower sort provided it), a full sort is wasteful. The
incremental sort convention sorts one prefix-group at a time: read
rows until the prefix key changes, sort just that group on the remaining
keys, emit it, repeat. This bounds memory to one group and — crucially —
returns the first rows before the whole input is read, which is the entire
point under LIMIT. Independently, a bounded (top-N) sort discards
all but the smallest N tuples seen so far when only LIMIT N rows are
ever needed.
Buffer the partition; slide the frame incrementally
Section titled “Buffer the partition; slide the frame incrementally”Window functions need random access to other rows of the same partition, so the engine buffers each partition in a spillable tuple store and keeps several read cursors into it (one per window function plus frame-boundary probes). The performance trick is the inverse transition function: for a moving frame whose head only advances, instead of recomputing the aggregate over the whole frame at each row, the engine adds newly-entered rows with the forward transition function and removes newly-exited rows with an inverse transition function, sliding the running state in amortized O(1) per row.
Two-tier memory: per-group state vs. per-tuple scratch
Section titled “Two-tier memory: per-group state vs. per-tuple scratch”Aggregation accumulates running state across many input rows, so the
transition value must outlive the per-tuple scratch used to evaluate the
aggregate’s arguments. Engines therefore keep at least two memory arenas:
a longer-lived one for transition values (reset only at group boundaries)
and a per-input-tuple one (reset every row). Grouping sets multiply this:
a separate transition arena per concurrently-tracked grouping set, so that
crossing a fine-grained group’s boundary can reset that set’s state while
the coarser rollup levels keep accumulating. The reset is a bulk
context-clear, never a retail free() per transition value — which is why
even an aggregate over billions of rows holds bounded transient memory.
Recompile expressions when the tuple format changes
Section titled “Recompile expressions when the tuple format changes”A subtle but load-bearing convention: the fused transition expression is compiled for a specific input tuple shape. Spilled tuples come back from disk as minimal tuples, not in the outer plan’s native format, so an engine that JIT-compiles or pre-builds its transition expression must recompile it when switching to read spilled data. Skipping this would either mis-deform the spilled tuple or silently read garbage. PostgreSQL makes this explicit at every format boundary rather than forcing one universal format and paying a copy on the common in-memory path.
Theory ↔ PostgreSQL mapping
Section titled “Theory ↔ PostgreSQL mapping”| Theory / convention | PostgreSQL name |
|---|---|
| Aggregation strategy tag | AggStrategy enum: AGG_PLAIN / AGG_SORTED / AGG_HASHED / AGG_MIXED |
| Sorted-aggregation group boundary | ExecQualAndReset(phase->eqfunctions[…]) in agg_retrieve_direct |
| Hash-aggregation table | per-grouping-set TupleHashTable in AggStatePerHash |
| Hash spill trigger / mode | hash_agg_check_limits → hash_agg_enter_spill_mode |
| Partitioned spill files | HashAggSpill over LogicalTapeSet; recursive HashAggBatch |
| Grouping-sets single-pass plan | phase chain (AggStatePerPhase), initialize_phase |
| Shared sort utility | tuplesort.c (tuplesort_begin_heap / _begin_datum / _gettupleslot) |
| Datum (single-column) sort fast path | SortState.datumSort → tuplesort_begin_datum / tuplesort_getdatum |
| Bounded (top-N) sort | SortState.bounded → tuplesort_set_bound |
| Incremental sort state machine | INCSORT_LOADFULLSORT / READFULLSORT / LOADPREFIXSORT / READPREFIXSORT |
| Prefix-group boundary test | isCurrentGroup over PresortedKeyData |
| Window partition buffer | WindowAggState.buffer (a Tuplestorestate) |
| Window per-function cursors | WindowObject.readptr / markptr into the tuplestore |
| Moving-frame incremental aggregate | forward transfn + invtransfn in eval_windowaggregates |
| Per-group transition arena vs. per-tuple | aggcontexts[] / hashcontext vs. tmpcontext |
The planner side — why a given strategy was chosen, how GROUPING SETS
is decomposed into a phase chain, where IncrementalSort paths are
generated — is owned by postgres-planner-overview.md and
postgres-path-generation.md. The tuplesort.c internals (run
generation, tape merge, top-N heap) are owned by postgres-tuplesort.md.
The ExecBuildAggTrans expression that fuses argument evaluation with
transition-function calls is owned by postgres-expression-eval.md. This
document covers the nodes: how each drives the executor’s pull
protocol, what it buffers, and how it spills.
PostgreSQL’s Approach
Section titled “PostgreSQL’s Approach”PostgreSQL renders the three families as four executor nodes — Agg,
Sort, IncrementalSort, WindowAgg — each a PlanState subtype with an
ExecProcNode function pointer, exactly like every other node in
postgres-executor.md. What distinguishes them is internal state and
buffering, not the external interface.
Agg: four strategies behind one ExecAgg
Section titled “Agg: four strategies behind one ExecAgg”The planner tags the Agg plan node with an AggStrategy:
// AggStrategy — src/include/nodes/nodes.htypedef enum AggStrategy{ AGG_PLAIN, /* simple agg across all input rows */ AGG_SORTED, /* grouped agg, input must be sorted */ AGG_HASHED, /* grouped agg, use internal hashtable */ AGG_MIXED, /* grouped agg, hash and sort both used */} AggStrategy;ExecAgg is a thin dispatcher: AGG_HASHED/AGG_MIXED build then drain a
hash table; AGG_PLAIN/AGG_SORTED stream groups directly off the
(planner-ordered) input.
// ExecAgg — src/backend/executor/nodeAgg.cstatic TupleTableSlot *ExecAgg(PlanState *pstate){ AggState *node = castNode(AggState, pstate); TupleTableSlot *result = NULL;
if (!node->agg_done) { switch (node->phase->aggstrategy) { case AGG_HASHED: if (!node->table_filled) agg_fill_hash_table(node); /* FALLTHROUGH */ case AGG_MIXED: result = agg_retrieve_hash_table(node); break; case AGG_PLAIN: case AGG_SORTED: result = agg_retrieve_direct(node); break; } if (!TupIsNull(result)) return result; } return NULL;}Sorted/plain path. agg_retrieve_direct exploits the invariant that a
sorted (or plain, single-group) input lets it process one group with O(1)
state. It fetches the first tuple of a group, runs advance_aggregates
over every subsequent tuple until a group boundary — detected by
running the grouping-column equality expression against the previous row —
then finalizes and projects the group. The per-group transition state lives
in aggcontexts[], which is rescanned (so registered shutdown callbacks
fire) at each boundary; the per-input-tuple scratch lives in tmpcontext,
reset every row:
// agg_retrieve_direct — src/backend/executor/nodeAgg.c (condensed)for (;;){ /* ... advance over the current group, fetch next tuple ... */ advance_aggregates(aggstate); ResetExprContext(tmpcontext); /* per-input-tuple scratch */ outerslot = fetch_input_tuple(aggstate); if (TupIsNull(outerslot)) break; /* group runs to end of input */}The group boundary itself is detected by running the grouping-column
equality expression for the next grouping set against the previous row;
ExecQualAndReset both evaluates the comparison and resets the per-tuple
context. For grouping sets, nextSetSize selects which set’s equality
function to use, so a single ordered pass can carry several rollup levels
at once:
// agg_retrieve_direct — src/backend/executor/nodeAgg.c (grouping-set boundary)tmpcontext->ecxt_innertuple = econtext->ecxt_outertuple;if (aggstate->input_done || (node->aggstrategy != AGG_PLAIN && aggstate->projected_set != -1 && aggstate->projected_set < (numGroupingSets - 1) && nextSetSize > 0 && !ExecQualAndReset(aggstate->phase->eqfunctions[nextSetSize - 1], tmpcontext))){ aggstate->projected_set += 1; /* advance to next grouping set */}else{ aggstate->projected_set = 0; /* start the first/only set fresh */ /* ... fetch grp_firstTuple, initialize_aggregates, then advance loop ... */}Hashed path. agg_fill_hash_table drains the entire outer plan into
per-grouping-set hash tables (it is a blocking step), then
agg_retrieve_hash_table iterates the tables emitting one row per group.
For each input tuple lookup_hash_entries probes every hash table; a hit
advances that group’s state in place, a miss creates a new group (unless
spill mode is active):
// lookup_hash_entries — src/backend/executor/nodeAgg.c (condensed)for (setno = 0; setno < aggstate->num_hashes; setno++){ /* if hash table already spilled, don't create new entries */ p_isnew = aggstate->hash_spill_mode ? NULL : &isnew; select_current_set(aggstate, setno, true); prepare_hash_slot(perhash, outerslot, hashslot); entry = LookupTupleHashEntry(hashtable, hashslot, p_isnew, &hash); if (entry != NULL) { if (isnew) initialize_hash_entry(aggstate, hashtable, entry); pergroup[setno] = TupleHashEntryGetAdditional(hashtable, entry); } else { /* no room for a new group: spill this tuple to disk */ hashagg_spill_tuple(aggstate, &aggstate->hash_spills[setno], slot, hash); pergroup[setno] = NULL; }}Spilling. After inserting a group, hash_agg_check_limits sums the
metadata, entry, and transition-value memory; if it exceeds
hash_mem_limit or the group count exceeds hash_ngroups_limit, the
node enters spill mode. Spilled tuples are partitioned by hash bits onto
LogicalTapes; once the in-memory groups are drained,
agg_refill_hash_table pops one spilled HashAggBatch, resets the tables,
and re-aggregates it — recursing (re-partitioning on more hash bits) if a
batch is itself too large.
flowchart TB
START["ExecAgg first call<br/>AGG_HASHED"]
FILL["agg_fill_hash_table:<br/>drain outer plan"]
LOOK["lookup_hash_entries<br/>probe each hashtable"]
HIT{"group<br/>found?"}
ADV["advance_aggregates<br/>(update state in place)"]
CHK["hash_agg_check_limits"]
OVER{"over<br/>hash_mem?"}
SPILLMODE["hash_agg_enter_spill_mode<br/>(stop new groups)"]
SPILL["hashagg_spill_tuple<br/>partition to LogicalTape"]
DRAIN["agg_retrieve_hash_table_in_memory"]
REFILL{"spilled<br/>batches left?"}
POP["agg_refill_hash_table:<br/>pop HashAggBatch, reset,<br/>re-aggregate (recursive)"]
DONE["agg_done = true"]
START --> FILL --> LOOK --> HIT
HIT -- yes --> ADV --> CHK --> OVER
HIT -- "no, spill mode" --> SPILL
OVER -- yes --> SPILLMODE
OVER -- no --> LOOK
SPILLMODE --> LOOK
SPILL --> LOOK
FILL --> DRAIN --> REFILL
REFILL -- yes --> POP --> DRAIN
REFILL -- no --> DONE
Grouping sets. A ROLLUP/CUBE/GROUPING SETS query is one “real”
Agg node with a chain of additional Agg nodes. The executor collapses
these into phases: phase 0 holds all hashed grouping sets; phases
1..n each hold one sorted rollup with its own sort order. Within a single
ordered pass, agg_retrieve_direct tracks one transition state per
grouping set (in aggcontexts[0..k]) and resets only the finer-grained
sets at each boundary. AGG_MIXED populates the hash tables during the
first sorted phase, then switches to phase 0 to read them out —
initialize_phase is the state transition.
Sort: a pure blocking operator over tuplesort
Section titled “Sort: a pure blocking operator over tuplesort”Sort is the textbook blocker. The first ExecSort call drains the child
entirely into tuplesort.c, calls tuplesort_performsort, and flips
sort_Done; every later call just pulls the next sorted tuple back. The
node picks a datum sort for a single output column (cheaper, especially
for pass-by-value types) and a heap/tuple sort otherwise, and forwards
the planner’s bounded/randomAccess hints to the utility:
// ExecSort — src/backend/executor/nodeSort.c (condensed)if (!node->sort_Done){ if (node->datumSort) tuplesortstate = tuplesort_begin_datum(/* type, op, coll, … */); else tuplesortstate = tuplesort_begin_heap(tupDesc, plannode->numCols, plannode->sortColIdx, /* … */); if (node->bounded) tuplesort_set_bound(tuplesortstate, node->bound);
for (;;) /* blocking: read ALL input */ { slot = ExecProcNode(outerNode); if (TupIsNull(slot)) break; tuplesort_puttupleslot(tuplesortstate, slot); } tuplesort_performsort(tuplesortstate); node->sort_Done = true;}/* subsequent calls: stream sorted tuples out */(void) tuplesort_gettupleslot(tuplesortstate, ScanDirectionIsForward(dir), false, slot, NULL);The node deliberately leaves its ExprContext uninitialized — Sort
never evaluates a qual or projection; it only reorders its child’s slots.
IncrementalSort: a four-state machine over a presorted prefix
Section titled “IncrementalSort: a four-state machine over a presorted prefix”When the input is already ordered on a prefix of the sort keys,
IncrementalSort avoids one giant sort by sorting one prefix-group at a
time. It runs a four-state machine (INCSORT_LOADFULLSORT,
INCSORT_READFULLSORT, INCSORT_LOADPREFIXSORT, INCSORT_READPREFIXSORT)
with a hybrid heuristic: it first accumulates a minimum group of
DEFAULT_MIN_GROUP_SIZE (32) tuples sorting on all keys (cheap for tiny
groups), and only when a group proves large does it switch to sorting just
the unsorted suffix keys. Prefix-group membership is tested by
isCurrentGroup, comparing presorted columns from the last key backward
(the tail key is likeliest to change first):
// isCurrentGroup — src/backend/executor/nodeIncrementalSort.c (condensed)for (int i = nPresortedCols - 1; i >= 0; i--) /* last key first */{ datumA = slot_getattr(pivot, attno, &isnullA); datumB = slot_getattr(tuple, attno, &isnullB); /* NULL-vs-NULL handled specially, else call cached equality fn */ key->fcinfo->args[0].value = datumA; key->fcinfo->args[1].value = datumB; result = FunctionCallInvoke(key->fcinfo); if (!DatumGetBool(result)) return false; /* prefix changed: new group */}return true;The payoff is twofold: memory is bounded to one prefix group (less
spilling), and the node returns its first rows before reading the whole
input — a large win under LIMIT, the case the planner most often picks it
for.
flowchart TB
LF["INCSORT_LOADFULLSORT:<br/>accumulate up to 32 tuples,<br/>sort on ALL keys"]
BIG{"group still<br/>growing past<br/>min size?"}
SW["switchToPresortedPrefixMode:<br/>re-bucket accumulated tuples"]
LP["INCSORT_LOADPREFIXSORT:<br/>load one prefix group,<br/>sort on SUFFIX keys only"]
RF["INCSORT_READFULLSORT:<br/>emit sorted full batch"]
RP["INCSORT_READPREFIXSORT:<br/>emit sorted prefix group"]
MORE{"more input?"}
END["return empty slot"]
LF --> BIG
BIG -- "no (small/end)" --> RF
BIG -- "yes" --> SW --> LP --> RP
RF --> MORE
RP --> MORE
MORE -- yes --> LF
MORE -- no --> END
WindowAgg: a buffered partition with a sliding frame
Section titled “WindowAgg: a buffered partition with a sliding frame”WindowAgg evaluates window functions for a single window spec; the
planner stacks several with intervening Sort nodes for multiple specs.
Input must arrive sorted by PARTITION BY then ORDER BY. The node
buffers each partition in a Tuplestorestate (begin_partition /
spool_tuples) so window functions can reach any row of the partition
through the WindowObject API, which keeps per-function read/mark cursors
into the store.
The performance core is eval_windowaggregates. For a frame whose head is
UNBOUNDED PRECEDING with no exclusion, rows only enter the frame, so it
runs the forward transition function incrementally. For a frame whose head
advances, it removes exited rows with the aggregate’s inverse
transition function rather than recomputing from scratch; if no inverse
function exists (or it declines), it restarts the aggregation over the new
frame. Peer-group sharing lets one computed value serve all rows with an
identical frame.
// ExecWindowAgg — src/backend/executor/nodeWindowAgg.c (condensed)if (winstate->status == WINDOWAGG_DONE) return NULL;for (;;){ if (winstate->next_partition) begin_partition(winstate); /* buffer a fresh partition */ else winstate->currentpos++; /* advance current row; frame moves */
spool_tuples(winstate, winstate->currentpos); if (winstate->partition_spooled && winstate->currentpos >= winstate->spooled_rows) { release_partition(winstate); if (winstate->more_partitions) begin_partition(winstate); else { winstate->status = WINDOWAGG_DONE; return NULL; } } /* eval frame-head/tail, window funcs, project; loop if qual rejects */}The decision tree at the heart of eval_windowaggregates is the
restart-vs-slide test. For each aggregate, the node restarts the
running transition value from scratch only when it has no choice: the very
first row, an advancing frame head with no inverse function, any
EXCLUDE clause, or a new frame that does not overlap the old one. In
every other case it slides — removing the rows that fell off the top with
the inverse transition function and adding the rows that entered at the
bottom with the forward one:
// eval_windowaggregates — src/backend/executor/nodeWindowAgg.c (condensed)numaggs_restart = 0;for (i = 0; i < numaggs; i++){ peraggstate = &winstate->peragg[i]; if (winstate->currentpos == 0 || /* first row */ (winstate->aggregatedbase != winstate->frameheadpos && /* head moved */ !OidIsValid(peraggstate->invtransfn_oid)) || /* no inverse fn */ (winstate->frameOptions & FRAMEOPTION_EXCLUSION) || /* EXCLUDE clause */ winstate->aggregatedupto <= winstate->frameheadpos) /* no overlap */ { peraggstate->restart = true; numaggs_restart++; } else peraggstate->restart = false;}
/* slide: remove rows that fell off the frame head via inverse transition */while (numaggs_restart < numaggs && winstate->aggregatedbase < winstate->frameheadpos){ window_gettupleslot(agg_winobj, winstate->aggregatedbase, temp_slot); /* ... advance_windowaggregate_base() per agg, or mark restart on failure ... */}This is the difference between O(partition) and O(partition × frame)
work for a moving-average-style query: sum/count/avg have inverse
transition functions and slide; max/min over a shrinking frame do not,
so they restart.
Source Walkthrough
Section titled “Source Walkthrough”Four files, one per node, all under src/backend/executor/. The symbols
below are the stable anchors; line numbers live only in the position-hint
table at the end of this section.
Agg — nodeAgg.c
Section titled “Agg — nodeAgg.c”The dispatch and retrieval spine:
ExecAgg— theExecProcNodeentry; switches onnode->phase->aggstrategyand routes to the direct or hash retriever.agg_retrieve_direct— theAGG_PLAIN/AGG_SORTEDengine. Walks groups off ordered input, detecting boundaries with the phase’s grouping equality expressions (phase->eqfunctions[…]viaExecQualAndReset), finalizing and projecting one group per call. Drives the grouping-sets phase machine: when a phase’s input is exhausted it callsinitialize_phaseto advance to the next sort order, and anAGG_MIXEDnode switches to phase 0 to drain hash tables.agg_fill_hash_table— blocking pre-pass forAGG_HASHED: drains the outer plan, callinglookup_hash_entries+advance_aggregatesper tuple, thenhashagg_finish_initial_spills.lookup_hash_entries— probes each grouping set’sTupleHashTable; hit ⇒ advance in place; miss ⇒ create a group, or (in spill mode)hashagg_spill_tuple.agg_retrieve_hash_table/agg_retrieve_hash_table_in_memory— iterate the filled tables, finalizing one group per call; when in-memory groups are exhausted, callagg_refill_hash_table.agg_refill_hash_table— pops one spilledHashAggBatch, resets the tables, recompiles the transition expression forMinimalTupleinput, and re-aggregates the batch (recursively spilling on overflow).
The memory/limit machinery:
build_hash_tables/build_hash_table— create oneTupleHashTableper grouping set with a bucket count fromhash_choose_num_buckets.hash_agg_set_limits— derivesmem_limitandngroups_limitfromget_hash_memory_limit(), reserving tape-buffer memory if a spill is expected.hash_agg_check_limits— after each new group, sums meta + entry + transition-value memory and trips spilling if over the memory or group limit.hash_agg_enter_spill_mode— flipshash_spill_mode, recompiles expressions, and lazily creates theLogicalTapeSetand per-setHashAggSpillpartitions.ExecInitAgg— builds theAggState, the phase array, the per-agg / per-trans tables, and (for hashed/mixed) the hash tables and contexts; readsuse_hashingoff the strategy tag.ExecReScanAgg— resets for re-execution: a hashed agg can simply re-iterate its existing table if it has no params and never spilled.
Key structs/enums: AggStrategy (nodes.h), AggStatePerHashData,
AggStatePerGroupData, HashAggSpill, HashAggBatch, AggStatePerPhase.
Sort — nodeSort.c
Section titled “Sort — nodeSort.c”ExecSort— blocking on first call: builds the tuplesort (tuplesort_begin_datumfor a single column whendatumSort, elsetuplesort_begin_heap), feeds every child tuple (tuplesort_puttupleslot/tuplesort_putdatum),tuplesort_performsort, setssort_Done; later calls stream out viatuplesort_gettupleslot/tuplesort_getdatum.ExecInitSort— setsrandomAccessfromEXEC_FLAG_REWIND | BACKWARD | MARK, decidesdatumSortfrom output column count, and shields the child from rewind/backward/mark requirements. Sort nodes do not initialize anExprContext(no qual, no projection).ExecReScanSort— reuses the sorted output if random access and no param change; otherwise tears the tuplesort down for a fresh sort.
IncrementalSort — nodeIncrementalSort.c
Section titled “IncrementalSort — nodeIncrementalSort.c”ExecIncrementalSort— the four-state machine: load a min-size group full-sorting on all keys (INCSORT_LOADFULLSORT), emit it (INCSORT_READFULLSORT), and on large groups transition to suffix-only sorting (INCSORT_LOADPREFIXSORT/READPREFIXSORT).preparePresortedCols— caches the equalityFmgrInfo/FunctionCallInfofor each presorted key intoPresortedKeyData.isCurrentGroup— tests prefix-group membership against the group pivot, comparing the last presorted key first.switchToPresortedPrefixMode— re-buckets already-accumulated tuples into the prefix-sort state when a large group is detected, carrying over any straggler prefix group vian_fullsort_remaining.ExecInitIncrementalSort— builds the state;DEFAULT_MIN_GROUP_SIZE(32) is the hybrid threshold.
WindowAgg — nodeWindowAgg.c
Section titled “WindowAgg — nodeWindowAgg.c”ExecWindowAgg— partition loop; per row spools input, advances the current position, updates frame validity, evaluates window functions, and projects.statusis one ofWINDOWAGG_RUN/WINDOWAGG_PASSTHROUGH/WINDOWAGG_PASSTHROUGH_STRICT/WINDOWAGG_DONE.begin_partition— resets frame/group counters, allocates the tuplestore (prepare_tuplestore), resets per-function mark/seek cursors, and stores the first row.spool_tuples— reads outer rows into the tuplestore up to a target position (or the whole partition), detecting the partition boundary.eval_windowaggregates— the moving-aggregate engine: incremental forward transition for fixed frame heads, inverse transition (advance_windowaggregate_base) for advancing heads, full restart when no inverse exists, and peer-group result caching.advance_windowaggregate— forward transition for one aggregate over one row (FILTER, strictness, expanded-object handling).update_frameheadpos/update_frametailpos/row_is_in_frame— resolveROWS/RANGE/GROUPSframe boundaries and exclusion against the buffered partition.ExecInitWindowAgg— buildsWindowAggState, the per-function and per-agg tables, theWindowObjects, and frame-offset expressions.
Frame semantics flags (FRAMEOPTION_RANGE / _ROWS / _GROUPS,
FRAMEOPTION_EXCLUDE_*) live in parsenodes.h.
Position hints (as of 2026-06-05, REL_18 273fe94)
Section titled “Position hints (as of 2026-06-05, REL_18 273fe94)”| Symbol | File | Line |
|---|---|---|
AggStrategy (enum) | src/include/nodes/nodes.h | 358 |
initialize_phase | src/backend/executor/nodeAgg.c | 479 |
build_hash_tables | src/backend/executor/nodeAgg.c | 1466 |
hash_agg_set_limits | src/backend/executor/nodeAgg.c | 1809 |
hash_agg_check_limits | src/backend/executor/nodeAgg.c | 1867 |
hash_agg_enter_spill_mode | src/backend/executor/nodeAgg.c | 1911 |
lookup_hash_entries | src/backend/executor/nodeAgg.c | 2181 |
ExecAgg | src/backend/executor/nodeAgg.c | 2244 |
agg_retrieve_direct | src/backend/executor/nodeAgg.c | 2280 |
agg_fill_hash_table | src/backend/executor/nodeAgg.c | 2626 |
agg_refill_hash_table | src/backend/executor/nodeAgg.c | 2680 |
agg_retrieve_hash_table | src/backend/executor/nodeAgg.c | 2835 |
ExecInitAgg | src/backend/executor/nodeAgg.c | 3279 |
ExecReScanAgg | src/backend/executor/nodeAgg.c | 4466 |
ExecSort | src/backend/executor/nodeSort.c | 50 |
ExecInitSort | src/backend/executor/nodeSort.c | 221 |
ExecReScanSort | src/backend/executor/nodeSort.c | 362 |
preparePresortedCols | src/backend/executor/nodeIncrementalSort.c | 164 |
isCurrentGroup | src/backend/executor/nodeIncrementalSort.c | 212 |
switchToPresortedPrefixMode | src/backend/executor/nodeIncrementalSort.c | 286 |
DEFAULT_MIN_GROUP_SIZE (macro) | src/backend/executor/nodeIncrementalSort.c | 467 |
ExecIncrementalSort | src/backend/executor/nodeIncrementalSort.c | 495 |
ExecInitIncrementalSort | src/backend/executor/nodeIncrementalSort.c | 976 |
advance_windowaggregate | src/backend/executor/nodeWindowAgg.c | 243 |
advance_windowaggregate_base | src/backend/executor/nodeWindowAgg.c | 420 |
eval_windowaggregates | src/backend/executor/nodeWindowAgg.c | 664 |
begin_partition | src/backend/executor/nodeWindowAgg.c | 1194 |
spool_tuples | src/backend/executor/nodeWindowAgg.c | 1283 |
row_is_in_frame | src/backend/executor/nodeWindowAgg.c | 1427 |
update_frameheadpos | src/backend/executor/nodeWindowAgg.c | 1536 |
ExecWindowAgg | src/backend/executor/nodeWindowAgg.c | 2211 |
ExecInitWindowAgg | src/backend/executor/nodeWindowAgg.c | 2479 |
Source verification (as of 2026-06-05)
Section titled “Source verification (as of 2026-06-05)”Each entry leads with a fact about the current source at commit
273fe94(REL_18_STABLE), readable without other materials. The trailing sentence shows how it was checked.
Verified facts
Section titled “Verified facts”-
ExecAggdispatches on exactly four strategies. Theswitchonnode->phase->aggstrategyhas casesAGG_HASHED(falling through toAGG_MIXED),AGG_MIXED, andAGG_PLAIN/AGG_SORTED; the enum innodes.hdefines preciselyAGG_PLAIN,AGG_SORTED,AGG_HASHED,AGG_MIXED. Verified by readingExecAggand theAggStrategyenum on 2026-06-05. -
Hash aggregation enters spill mode on either a memory or a group-count overrun.
hash_agg_check_limitscomputestotal_mem = meta_mem + entry_mem + tval_memand tripshash_agg_enter_spill_modewhen, with at least one group present,total_mem > hash_mem_limit || ngroups > hash_ngroups_limit. Thengroups_limitexists specifically for transition states (e.g.array_agg) that grow far past their initial size. Verified by readinghash_agg_check_limitsandhash_agg_set_limits. -
Spilled aggregation is recursive over partitioned logical tapes.
hash_agg_enter_spill_modecreates aLogicalTapeSetand oneHashAggSpillper grouping set;agg_refill_hash_tablepops oneHashAggBatch(a stack viallast/list_delete_last), carriesused_bitsandinput_card, and may itself spill again. Verified by readinghash_agg_enter_spill_modeandagg_refill_hash_table. -
Spilled tuples are read back as
MinimalTuple, forcing an expression recompile.agg_refill_hash_tablecallshashagg_recompile_expressions(aggstate, true, true)and stores tuples withExecStoreMinimalTuplebefore probing. Verified by readingagg_refill_hash_table. -
AGG_MIXEDpopulates hash tables during sorted phase 1 and reads them out in phase 0. Inagg_retrieve_direct, during phase 1 of a mixed agg (current_phase == 1) each tuple also callslookup_hash_entries; when input ends the nodeinitialize_phase(…, 0)and switches toagg_retrieve_hash_table.agg_refill_hash_tablemirrors this, togglingcurrent_phaseto 1 to process a batch then back to 0. Verified by reading both functions. -
Sortis fully blocking and never evaluates quals or projection.ExecSortloads the entire outer plan into the tuplesort before returning the first tuple (sort_Doneguards this), andExecInitSort’s comment states “Sort nodes don’t initialize their ExprContexts because they never call ExecQual or ExecProject.” Verified by readingExecSortandExecInitSort. -
Sortchooses a datum sort only for a single output column.ExecInitSortsetsdatumSort = truewhen the result tuple descriptor has one attribute, andExecSortthen usestuplesort_begin_datum/tuplesort_getdatuminstead of the heap variants. Verified by reading both. -
Incremental sort’s hybrid threshold is 32 tuples.
#define DEFAULT_MIN_GROUP_SIZE 32;ExecIncrementalSortaccumulates up tominGroupSizetuples in full-sort mode before considering a switch, and clampsminGroupSizeto the bound when bounded. Verified by reading the macro and theINCSORT_LOADFULLSORTblock. -
Incremental sort compares presorted keys from the last key backward.
isCurrentGrouploopsfor (i = nPresortedCols - 1; i >= 0; i--), with a comment that the tail key is likeliest to change first, minimizing comparator calls. Verified by readingisCurrentGroup. -
WindowAgg buffers each partition in a tuplestore and slides moving-frame aggregates with an inverse transition function.
begin_partitionallocateswinstate->bufferandeval_windowaggregatesdocuments and implements the three regimes — incremental forward transition for fixedUNBOUNDED PRECEDINGheads, inverse transition for advancing heads, full restart when no inverse function exists or it returns NULL. Verified by readingbegin_partition,eval_windowaggregates, andadvance_windowaggregate_base. -
A WindowAgg handles exactly one window spec; multiple specs are stacked by the planner. The file header states each WindowAgg “works for just a single window specification” and the planner “generates a stack of WindowAggs with intervening Sort nodes.” Verified by reading the
nodeWindowAgg.cheader comment.
Open questions
Section titled “Open questions”-
What exactly determines the partition count and re-partition bit allotment for hash-agg spills?
hash_choose_num_partitionsand theused_bitsaccounting inHashAggBatchwere read at call sites but the bit-budget math (HASHAGG_MAX_PARTITIONS, the cost balance between partition fan-out and recursion depth) was not traced. Investigation path: readhash_choose_num_partitionsandhashagg_spill_initinnodeAgg.c. -
How does
WINDOWAGG_PASSTHROUGH_STRICTinteract with a non-top-level WindowAgg that must still feed rows upward?spool_tupleskeeps rows for an upper WindowAgg even in pass-through; the precise condition that selectsPASSTHROUGHvs.PASSTHROUGH_STRICT(tied torunConditionpushdown) was only partially traced. Investigation path: read therunConditionhandling aroundExecWindowAgg’s status assignment. -
Where is the boundary between the
Sortnode andtuplesort.c’s bounded / parallel modes? This doc treats run generation, tape merge, and the top-N heap as a black box. The parallel-workerSortpath (shared_info,tuplesort_get_stats) was seen but not analyzed. Investigation path:postgres-tuplesort.mdplusnodeSort.c’sam_workerbranch.
Beyond PostgreSQL — Comparative Designs & Research Frontiers
Section titled “Beyond PostgreSQL — Comparative Designs & Research Frontiers”Pointers, not analysis. Each bullet is a handle for a follow-up doc; depth here is intentionally shallow.
-
Spilling hash aggregation is recent in PostgreSQL. Before PG 13, hash aggregation could exceed
work_memwithout bound (the planner simply avoided it when it expected too many groups, sometimes mis-estimating). The partitioned-spill machinery (hash_agg_enter_spill_mode,HashAggBatch) landed in PG 13 (Jeff Davis), bringing PostgreSQL in line with the “grace hash” / partition-and-recurse pattern that DSC §15.5 and the hash-join literature (hashjoininpostgres-scan-nodes.mdterritory) had long assumed. A side-by-side with hash join spilling would show how much machinery the two share. -
CUBRID’s aggregation and sort paths. CUBRID evaluates aggregation inside its
XASLtree with aGROUPBY/aggregate_listmechanism and a separate external-sort utility, but its hash-aggregation and grouping-sets support, and whether it spills comparably, differ from PostgreSQL’s phase chain. A comparison of how each engine handlesROLLUP/CUBEin a single ordered pass would sharpen what PostgreSQL’sAGG_MIXEDphase machine buys. -
Vectorized aggregation and sorting. The tuple-at-a-time transition call is exactly the per-tuple overhead the vectorized engines attack. MonetDB/X100 (Boncz, Zukowski & Nes, CIDR 2005) and the column-store lineage (
dbms-papers/cstore.md) aggregate over vectors of grouping keys and use vectorized partitioning sorts. PostgreSQL stays tuple-at-a-time;ExecBuildAggTransfusing argument evaluation and the transition call into one JIT-able expression (postgres-expression-eval.md,postgres-jit.md) is its answer to the interpreter overhead instead. -
Adaptive aggregation: hash vs. sort decided at run time. PostgreSQL fixes the strategy at plan time. Systems such as Hellerstein’s eddies line of work, and more recently run-time-adaptive operators in HyPer/Umbra, switch between hashing and sorting based on observed cardinality. The research frontier is choosing — and re-choosing — the aggregation physical operator after seeing real group counts, which would make PostgreSQL’s plan-time
AggStrategytag a run-time variable. -
Incremental sort as a near-sorted-input optimization. Incremental sort (PG 13, Tomáš Vondra / James Coleman) is PostgreSQL’s embodiment of the broad “exploit pre-sortedness” idea also seen in adaptive sorts (Timsort’s run detection, natural merge sort). The frontier question is whether the planner picks it often enough: it shines under
LIMITwith a partially-ordered child, but the cost crossover vs. a full sort is estimate-sensitive. A study ofIncrementalSortpath selection inpostgres-path-generation.mdwould close that loop. -
Window-function frame algorithms. The inverse-transition trick (Leis et al., “Efficient Processing of Window Functions in Analytical SQL Queries”, VLDB 2015, from the HyPer group) formalizes segment-tree and aggregation-tree approaches that beat PostgreSQL’s linear forward/inverse slide for
RANGE/GROUPSframes with non-invertible aggregates (e.g.maxover a shrinking frame). PostgreSQL implements the forward+inverse+restart trichotomy but not the tree-based O(log n) frame aggregation; that gap is a concrete optimization frontier.
Sources
Section titled “Sources”-
Code (REL_18_STABLE, commit
273fe94, PG 18.x) — source of truth:src/backend/executor/nodeAgg.c—Aggstrategies, hash spill, grouping-set phases.src/backend/executor/nodeSort.c— blockingSortovertuplesort.c.src/backend/executor/nodeIncrementalSort.c— four-state incremental sort.src/backend/executor/nodeWindowAgg.c— partition buffering, frame sliding, inverse transition.src/include/nodes/nodes.h—AggStrategyenum.src/include/nodes/execnodes.h—AggState,SortState,IncrementalSortState(INCSORT_*),WindowAggState(WINDOWAGG_*).src/include/nodes/parsenodes.h—FRAMEOPTION_*window-frame flags.
-
Theory —
knowledge/research/dbms-general/database-system-concepts.md(Silberschatz 7e): §15.4 Sorting (external sort-merge), §15.5 Aggregation (sorted vs. hashed), §15.7.2 pipelined vs. blocking operators, §5.5.1 Windowing. -
Sibling code-analysis docs (cross-references, not duplicated here):
postgres-executor.md— thePlanStatetree,ExecProcNodepull protocol,TupleTableSlot, memory contexts.postgres-tuplesort.md— thetuplesort.cengine (run generation, tape merge, top-N heap, datum vs. heap sort, parallel sort).postgres-expression-eval.md—ExprState/ExecBuildAggTrans(the fused transition expression), qual and projection evaluation.postgres-planner-overview.md,postgres-path-generation.md— why a strategy/path is chosen.
-
Comparative pointers —
dbms-papers/cstore.md(column store / vectorization contrast); MonetDB/X100 (Boncz et al., CIDR 2005); Leis et al., “Efficient Processing of Window Functions in Analytical SQL Queries” (VLDB 2015).