PostgreSQL Executor — The Demand-Pull Plan-Node Tree and Tuple Flow
Contents:
- Theoretical Background
- Common DBMS Design
- PostgreSQL’s Approach
- Source Walkthrough
- Source verification (as of 2026-06-05)
- Beyond PostgreSQL — Comparative Designs & Research Frontiers
- Sources
Theoretical Background
A relational query, once optimized, is a tree of physical operators: scans at the leaves, joins and aggregates and sorts in the interior, a projection or modification at the root. The execution engine’s job is to turn that operator tree into a stream of result tuples. Database System Concepts (Silberschatz, 7e, ch. 15 “Query Processing”) frames the choice of how tuples move between operators as the central design axis, and names two families:
- Materialized evaluation. Each operator runs to completion, writes its entire output to a temporary relation, and the next operator reads that relation. Simple, but every intermediate result is paid for in full — I/O and latency to first row both scale with the largest intermediate.
- Pipelined evaluation. Adjacent operators are combined into a pipeline so that one operator’s output tuples flow directly into the next without an intermediate relation (§15.7.2). Two benefits follow: intermediate results are never written to disk, and “the root operator of a query-evaluation plan … can start generating query results quickly” (§15.7.2) — the engine produces the first row before the last input row has even been read.
A pipeline can run in one of two directions, and this is the knob every engine turns (§15.7.2.1):
- Demand-driven (pull / lazy). “The system makes repeated requests for tuples from the operation at the top of the pipeline.” Each operator, when asked, computes one output tuple — recursively requesting input tuples from its children as needed — and returns it. Tuples are “pulled up an operation tree from the top,” generated lazily, on demand.
- Producer-driven (push / eager). Each operator runs as its own process or thread, eagerly generating output into a buffer until the buffer fills; tuples are “pushed up an operation tree from below.”
The textbook’s verdict is unambiguous: “Demand-driven pipelining is used more commonly than producer-driven pipelining because it is easier to implement” (§15.7.2.1). Its canonical realization is the iterator model, also called the Volcano model after Graefe’s Volcano system, whose exchange operator (DSC §22.5, ch. 22 parallel processing) later generalized it to parallelism. Each operator exposes three functions:
“Each operation in a demand-driven pipeline can be implemented as an iterator that provides the following functions: open(), next(), and close(). After a call to open(), each call to next() returns the next output tuple of the operation. The implementation of the operation in turn calls open() and next() on its inputs, to get its input tuples when required. … The iterator maintains the state of its execution in between calls so that successive next() requests receive successive result tuples.” (DSC §15.7.2.1)
Three properties of the iterator model shape everything downstream and are worth naming before reading any engine’s source:
- Uniform interface. Every operator answers the same next() call and returns the same kind of thing (one tuple, or “no more”). A join does not know whether its child is a sequential scan or another join. The tree composes because the interface is uniform.
- State lives in the iterator, not the caller. Between next() calls, each operator remembers where it was — the file offset of a scan, the build/probe phase of a hash join. This per-operator execution state is the bulk of what an executor allocates.
- Control flow is recursion down, tuples flow up. A single next() on the root drives a depth-first cascade of next() calls to the leaves; the leaves return tuples that bubble back up, each interior operator transforming the stream as it passes.
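The three properties above can be made concrete with a small sketch. This is not PostgreSQL code: the names Iter, ArrayScan, EvenFilter, and iterator_demo are hypothetical, and a "tuple" is reduced to a single int so that the open/next/close shape stays visible.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical iterator interface: every operator exposes the same three calls. */
typedef struct Iter Iter;
struct Iter {
    void (*open)(Iter *self);
    bool (*next)(Iter *self, int *out);    /* false means "no more tuples" */
    void (*close)(Iter *self);
};

/* Leaf operator: scans an in-memory array. The cursor is the iterator's state. */
typedef struct {
    Iter base;                             /* must stay the first member */
    const int *rows;
    size_t nrows;
    size_t pos;
} ArrayScan;

static void scan_open(Iter *self)  { ((ArrayScan *) self)->pos = 0; }
static void scan_close(Iter *self) { (void) self; }
static bool scan_next(Iter *self, int *out)
{
    ArrayScan *s = (ArrayScan *) self;
    assert(out != NULL);
    if (s->pos >= s->nrows)
        return false;
    *out = s->rows[s->pos++];
    return true;
}

/* Interior operator: keeps even values. It pulls from any child Iter. */
typedef struct {
    Iter base;
    Iter *child;
} EvenFilter;

static void filter_open(Iter *self)  { Iter *c = ((EvenFilter *) self)->child; c->open(c); }
static void filter_close(Iter *self) { Iter *c = ((EvenFilter *) self)->child; c->close(c); }
static bool filter_next(Iter *self, int *out)
{
    Iter *c = ((EvenFilter *) self)->child;
    int v;
    while (c->next(c, &v)) {               /* control recurses down ... */
        if (v % 2 == 0) {
            *out = v;                      /* ... and tuples flow back up */
            return true;
        }
    }
    return false;
}

/* Driver: pull from the root until it reports end-of-data. */
int iterator_demo(void)
{
    static const int rows[] = { 1, 2, 3, 4, 6 };
    ArrayScan scan = { { scan_open, scan_next, scan_close }, rows, 5, 0 };
    EvenFilter root = { { filter_open, filter_next, filter_close }, &scan.base };
    int v, sum = 0;

    root.base.open(&root.base);
    while (root.base.next(&root.base, &v))
        sum += v;
    root.base.close(&root.base);
    return sum;                            /* 2 + 4 + 6 */
}
```

The filter never inspects what kind of operator its child is, which is the uniform-interface property; swapping the scan for another filter would require no change to filter_next.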
The Architecture of a Database System survey (Hellerstein, Stonebraker
& Hamilton 2007, §1.1, §4) restates the same picture for production
engines: the relational query processor is “a suite of operators … for
executing any query,” and SQL is served in a “pull model” where the
client repeatedly pulls rows. PostgreSQL is a textbook-faithful
implementation of the demand-driven iterator model — with the wrinkle
that its next() is ExecProcNode, its open()/close() are
ExecInitNode/ExecEndNode, and the “tuple” it passes is not a bare row
but a TupleTableSlot abstraction. The rest of this document traces those
pieces in the REL_18 source.
Common DBMS Design
The textbook gives the model: demand-driven iterators over an operator
tree. This section names the engineering conventions that almost every
production iterator engine adopts to make that model fast and safe, the
patterns the textbook leaves implicit. PostgreSQL’s specific choices in
the “PostgreSQL’s Approach” section are best read as one set of dials
within this shared space.
Split the plan from the execution state
The optimizer’s output — the operator tree with its cost estimates, join conditions, and target lists — is logically read-only at run time: nothing about “where the scan cursor currently sits” belongs in it. Engines therefore keep two parallel trees: an immutable plan tree (the recipe) and a mutable state tree (the running instance), with each state node pointing back at its plan node. The payoff is plan caching and reuse: one cached plan can be executed many times, even concurrently in different sessions or parallel workers, because each execution gets its own fresh state tree and the plan is never mutated.
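A minimal sketch of that split, under the assumption of a single filtered scan: ScanPlan, ScanState, and interleaved_demo are hypothetical names, not any engine's API. The plan is reached only through a const pointer, so two executions can interleave over it without interfering.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical read-only "plan": what to scan and which rows qualify. */
typedef struct {
    const int *rows;
    size_t nrows;
    int min_value;              /* filter: keep rows >= min_value */
} ScanPlan;

/* Hypothetical mutable "state": one per execution, pointing back at the plan. */
typedef struct {
    const ScanPlan *plan;       /* const: an execution never writes to the plan */
    size_t cursor;              /* where this execution currently sits */
} ScanState;

static ScanState scan_init(const ScanPlan *plan)
{
    ScanState st = { plan, 0 };
    return st;
}

/* Returns 1 and stores the next qualifying row, or 0 at end-of-data. */
static int scan_next(ScanState *st, int *out)
{
    assert(st->plan != NULL);
    while (st->cursor < st->plan->nrows) {
        int v = st->plan->rows[st->cursor++];
        if (v >= st->plan->min_value) {
            *out = v;
            return 1;
        }
    }
    return 0;
}

/* Two executions of one plan interleave without disturbing each other. */
int interleaved_demo(void)
{
    static const int rows[] = { 5, 1, 7, 3, 9 };
    const ScanPlan plan = { rows, 5, 4 };
    ScanState a = scan_init(&plan);
    ScanState b = scan_init(&plan);
    int va = 0, vb = 0;

    scan_next(&a, &va);         /* a yields 5 */
    scan_next(&a, &va);         /* a skips 1, yields 7 */
    scan_next(&b, &vb);         /* b starts from the top, yields 5 */
    return va * 10 + vb;
}
```

Execution b starts at the first row even though a has already advanced, because the cursor lives in the state and not in the plan.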
A uniform tuple handle, not a uniform tuple format
Operators must compose, but the tuples flowing between them come from
wildly different sources: a heap page (with MVCC visibility bits and a
buffer pin), an index, a sort’s spill file, an in-memory VALUES list, a
join’s freshly-projected combination of two inputs. Forcing all of these
into one physical layout would mean copying every tuple into that layout
at every boundary. The convention instead is a uniform tuple handle
that can wrap any of these backings — exposing a common “give me column
i” interface while deferring the physical decode until a column is
actually touched. This is the single most important abstraction in a
pipelined engine: it lets an operator consume its child’s output without
knowing or caring how that output is stored.
Lazy column decode
A heap tuple on disk is a packed byte string; turning it into an array of
typed Datum values (“deforming”) costs CPU proportional to the number of
columns. But a query that touches column 2 of a 40-column table should not
pay to decode columns 3–40. The convention is lazy, left-prefix
deforming: decode columns 1..n only when something asks for column
n, cache how far you have decoded, and never redo it. A “virtual” tuple
— one that is already an array of Datums, e.g. a join’s projected
output — skips deforming entirely.
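The watermark idea can be sketched as follows. The row format here is an assumption made for illustration (each column is a 1-byte length followed by that many little-endian bytes, at most four), and LazySlot, slot_get, and lazy_demo are hypothetical names. What matters is that column n can only be located by walking columns 1..n-1, so remembering how far decoding got pays off.

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

#define MAX_COLS 8

/* Hypothetical slot over a packed row with variable-width columns. */
typedef struct {
    const uint8_t *packed;      /* the undecoded row */
    int natts;                  /* number of columns in the row */
    int nvalid;                 /* watermark: columns [0, nvalid) are decoded */
    size_t off;                 /* byte offset where decoding stopped */
    uint32_t values[MAX_COLS];  /* cache of decoded columns */
    int decode_calls;           /* instrumentation for the demo only */
} LazySlot;

static void slot_store(LazySlot *s, const uint8_t *packed, int natts)
{
    assert(natts <= MAX_COLS);
    s->packed = packed;
    s->natts = natts;
    s->nvalid = 0;
    s->off = 0;
    s->decode_calls = 0;
}

/* Decode up to and including attnum (1-based), resuming at the watermark. */
static void slot_deform_upto(LazySlot *s, int attnum)
{
    while (s->nvalid < attnum) {
        uint8_t len = s->packed[s->off++];      /* assumed <= 4 in this sketch */
        uint32_t v = 0;
        for (uint8_t i = 0; i < len; i++)
            v |= (uint32_t) s->packed[s->off++] << (8 * i);
        s->values[s->nvalid++] = v;
        s->decode_calls++;
    }
}

static uint32_t slot_get(LazySlot *s, int attnum)
{
    assert(attnum >= 1 && attnum <= s->natts);
    if (s->nvalid < attnum)                     /* decode only the missing gap */
        slot_deform_upto(s, attnum);
    return s->values[attnum - 1];
}

/* A 4-column row: touching column 2 decodes two columns, not four. */
int lazy_demo(void)
{
    static const uint8_t row[] = { 1, 7,  2, 0x34, 0x12,  1, 9,  1, 11 };
    LazySlot s;

    slot_store(&s, row, 4);
    uint32_t c2 = slot_get(&s, 2);              /* decodes columns 1 and 2 */
    uint32_t c1 = slot_get(&s, 1);              /* served from the cache */
    return (c2 == 0x1234 && c1 == 7) ? s.decode_calls : -1;
}
```

The demo returns the number of decode steps performed, which is two: columns 3 and 4 are never touched, and the second lookup costs only an array read.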
Bounded per-tuple memory via reset-able contexts
A pipeline can process billions of tuples in one query. If each tuple’s
transient allocations (the scratch space for evaluating a WHERE clause,
a function call’s working storage) accumulated for the life of the query,
memory would blow up. The convention is a per-tuple memory arena that
is reset (bulk-freed) once per tuple, separate from the per-query
arena that holds the state tree and survives until the query ends. Two
allocation lifetimes, one cheap bulk-free at each boundary, no retail
free() of individual tuples.
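The two lifetimes can be sketched with a fixed-size bump allocator. This is an illustration under simplifying assumptions (no growth, no per-chunk free), and Arena, arena_reset, and process_rows are hypothetical names rather than PostgreSQL's memory-context API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical bump arena: allocation advances an offset, reset rewinds it. */
typedef struct {
    unsigned char *base;
    size_t size;
    size_t used;
} Arena;

static int arena_init(Arena *a, size_t size)
{
    a->base = malloc(size);
    a->size = size;
    a->used = 0;
    return a->base != NULL;
}

static void *arena_alloc(Arena *a, size_t n)
{
    size_t aligned = (n + 15) & ~(size_t) 15;   /* keep 16-byte alignment */
    if (a->used + aligned > a->size)
        return NULL;                            /* sketch: arenas do not grow */
    void *p = a->base + a->used;
    a->used += aligned;
    return p;
}

static void arena_reset(Arena *a)   { a->used = 0; }            /* bulk-free */
static void arena_destroy(Arena *a) { free(a->base); a->base = NULL; }

/*
 * Process ntuples rows, allocating scratch space per row. Returns the peak
 * per-tuple arena usage, which stays flat however many rows are processed.
 */
size_t process_rows(size_t ntuples)
{
    Arena per_query, per_tuple;
    size_t peak = 0;
    int ok = arena_init(&per_query, 1024) & arena_init(&per_tuple, 1024);

    if (!ok) {
        arena_destroy(&per_query);
        arena_destroy(&per_tuple);
        return 0;
    }
    long *total = arena_alloc(&per_query, sizeof *total);   /* lives all query */
    assert(total != NULL);
    *total = 0;

    for (size_t i = 0; i < ntuples; i++) {
        arena_reset(&per_tuple);                /* drop the previous row's scratch */
        int *scratch = arena_alloc(&per_tuple, 8 * sizeof *scratch);
        assert(scratch != NULL);
        scratch[0] = (int) i;
        *total += scratch[0];
        if (per_tuple.used > peak)
            peak = per_tuple.used;
    }
    arena_destroy(&per_tuple);
    arena_destroy(&per_query);                  /* one teardown frees everything */
    return peak;
}
```

Ten rows and a hundred thousand rows report the same peak, because the reset at the top of each iteration is the only "free" the per-tuple arena ever needs.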
One snapshot, registered for the query’s life
A pipelined read must see a consistent set of rows from start to finish.
The engine acquires an MVCC snapshot before execution and keeps it
registered (pinned, so vacuum cannot reclaim versions it can still
see) for the entire ExecutorStart→ExecutorEnd span, releasing it only
at teardown.
Theory ↔ PostgreSQL mapping
By the time you reach a named PostgreSQL symbol in the next section, you should already know what kind of thing it is:
| Theory / convention | PostgreSQL name |
|---|---|
| Operator open() | ExecInitNode (dispatch) → per-node ExecInit* |
| Operator next() | ExecProcNode (inline) → node->ExecProcNode → per-node Exec* |
| Operator close() | ExecEndNode (dispatch) → per-node ExecEnd* |
| Read-only plan tree | Plan and subtypes (from the planner) |
| Mutable execution-state tree | PlanState and subtypes; lefttree / righttree links |
| Per-execution global state | EState (one per executor invocation) |
| Uniform tuple handle | TupleTableSlot + TupleTableSlotOps vtable |
| Tuple-handle backings | TTSOpsVirtual / TTSOpsHeapTuple / TTSOpsMinimalTuple / TTSOpsBufferHeapTuple |
| Lazy column decode | slot_getsomeattrs + tts_nvalid watermark |
| Per-query memory arena | EState.es_query_cxt |
| Per-tuple memory arena | ExprContext.ecxt_per_tuple_memory, reset by ResetExprContext |
| Registered query snapshot | EState.es_snapshot via RegisterSnapshot |
| Scan boilerplate (qual + project loop) | ExecScan / ExecScanExtended in execScan.[ch] |
| “No more tuples” sentinel | an empty TupleTableSlot or a C NULL; TupIsNull accepts both |
The planner side of this boundary — how the Plan tree is built and why
it is read-only — is owned by postgres-planner-overview.md and
postgres-node-trees.md. Expression evaluation (the ExprState /
ExprEvalStep machinery that quals and projections compile to) is owned
by postgres-expression-eval.md. The individual leaf scan and join node
implementations are owned by postgres-scan-nodes.md. This document covers
the framework: the four lifecycle entry points, the dispatch layer,
the slot abstraction, and the scan boilerplate every leaf shares.
PostgreSQL’s Approach
PostgreSQL’s executor is the demand-driven iterator model rendered almost exactly as the textbook describes it, with four design decisions that give it its shape:
- Four lifecycle entry points, not one. ExecutorStart / ExecutorRun / ExecutorFinish / ExecutorEnd separate “build the machine,” “pull tuples,” “drain side-effects and fire AFTER triggers,” and “tear down.” The split exists so that EXPLAIN ANALYZE can time the trigger-firing phase, and so a portal can call ExecutorRun repeatedly for cursor FETCH.
- A PlanState tree mirroring the Plan tree one node for one node, built top-down by ExecInitNode recursing the plan. The plan stays read-only; all run-time mutation lands in the state tree.
- ExecProcNode is the next() call, dispatched through a function pointer stored on each PlanState (node->ExecProcNode), not a giant switch per tuple. The node-type lookup is paid once at init; the per-tuple cost is a single indirect call.
- TupleTableSlot is the unit of dataflow, with a TupleTableSlotOps vtable selecting the physical backing. Every Exec* function returns a slot; an empty slot means end-of-data.
The four lifecycle entry points
Each entry point is a thin hook-dispatcher wrapping a standard_*
implementation, so extensions (pg_stat_statements, auditing plugins) can
intercept the whole executor by setting ExecutorStart_hook and friends —
this is the executor’s slice of the engine’s hook-based extensibility
surface.
    // ExecutorStart / ExecutorRun — src/backend/executor/execMain.c
    void
    ExecutorStart(QueryDesc *queryDesc, int eflags)
    {
        pgstat_report_query_id(queryDesc->plannedstmt->queryId, false);
        if (ExecutorStart_hook)
            (*ExecutorStart_hook) (queryDesc, eflags);
        else
            standard_ExecutorStart(queryDesc, eflags);
    }

    void
    ExecutorRun(QueryDesc *queryDesc, ScanDirection direction, uint64 count)
    {
        if (ExecutorRun_hook)
            (*ExecutorRun_hook) (queryDesc, direction, count);
        else
            standard_ExecutorRun(queryDesc, direction, count);
    }

The README’s control-flow sketch is the canonical map of how the four phases nest:
flowchart TB
subgraph START["ExecutorStart"]
CES["CreateExecutorState<br/>(per-query context)"]
SW1["switch into es_query_cxt"]
INIT["InitPlan -> ExecInitNode<br/>(recursively builds PlanState tree)"]
CES --> SW1 --> INIT
end
subgraph RUN["ExecutorRun"]
EP["ExecutePlan loop:<br/>ExecProcNode(root) until empty slot"]
DEST["dest->receiveSlot per tuple"]
EP --> DEST
end
subgraph FIN["ExecutorFinish"]
PPP["ExecPostprocessPlan<br/>(run unfinished ModifyTable nodes)"]
AT["AfterTriggerEndQuery<br/>(fire AFTER triggers)"]
PPP --> AT
end
subgraph END["ExecutorEnd"]
EEP["ExecEndNode<br/>(recursively release resources)"]
UNR["UnregisterSnapshot"]
FREE["FreeExecutorState<br/>(destroys per-query context)"]
EEP --> UNR --> FREE
end
START --> RUN --> FIN --> END
Figure 1 — The four executor lifecycle phases and what each does.
ExecutorStart builds the state tree in the per-query context;
ExecutorRun pulls tuples in a loop; ExecutorFinish drains side-effects
and fires AFTER triggers (separated so EXPLAIN ANALYZE can time it);
ExecutorEnd releases resources and frees the whole per-query context in
one operation. (Source: executor/README “Query Processing Control
Flow”.)
standard_ExecutorStart is where the per-query EState is born, the
query snapshot is registered, and the state tree is built. The key middle
section:
    // standard_ExecutorStart — src/backend/executor/execMain.c (condensed)
    estate = CreateExecutorState();
    queryDesc->estate = estate;

    oldcontext = MemoryContextSwitchTo(estate->es_query_cxt);

    estate->es_param_list_info = queryDesc->params;
    /* ... allocate es_param_exec_vals for internal params ... */
    estate->es_sourceText = queryDesc->sourceText;
    estate->es_queryEnv = queryDesc->queryEnv;

    /* set es_output_cid for non-read-only ops (or SELECT FOR UPDATE) */

    estate->es_snapshot = RegisterSnapshot(queryDesc->snapshot);
    estate->es_crosscheck_snapshot = RegisterSnapshot(queryDesc->crosscheck_snapshot);
    estate->es_top_eflags = eflags;
    estate->es_instrument = queryDesc->instrument_options;
    estate->es_jit_flags = queryDesc->plannedstmt->jitFlags;

    if (!(eflags & (EXEC_FLAG_SKIP_TRIGGERS | EXEC_FLAG_EXPLAIN_ONLY)))
        AfterTriggerBeginQuery();

    InitPlan(queryDesc, eflags);    /* builds the PlanState tree */

    MemoryContextSwitchTo(oldcontext);

Two things to notice. First, the snapshot passed by the caller (the portal
/ pquery.c layer, owned by postgres-portals-prepared.md) is
registered here and unregistered only in standard_ExecutorEnd — the
read sees one consistent set of versions for the whole run. Second,
everything from CreateExecutorState onward happens inside
es_query_cxt, so the entire state tree and all the slots and expression
states it allocates land in that one context.
InitPlan does the permissions check, sets up the range table and row
marks, runs initial partition pruning, then calls ExecInitNode on the
plan root to build the state tree (covered in the next subsection) and
finally derives the result tuple descriptor and junk filter:
    // InitPlan — src/backend/executor/execMain.c (condensed)
    ExecCheckPermissions(rangeTable, plannedstmt->permInfos, true);
    ExecInitRangeTable(estate, rangeTable, plannedstmt->permInfos, ...);
    estate->es_plannedstmt = plannedstmt;
    ExecDoInitialPruning(estate);
    /* ... build es_rowmarks from plannedstmt->rowMarks ... */
    estate->es_tupleTable = NIL;
    estate->es_epq_active = NULL;
    /* init each SubPlan's state first ... */
    planstate = ExecInitNode(plan, estate, eflags);
    tupType = ExecGetResultType(planstate);
    /* ... build junk filter if the top tlist has junk attrs ... */

standard_ExecutorRun sets up the destination receiver, then calls
ExecutePlan (the pull loop) unless the fetch direction is “no movement”:

    // standard_ExecutorRun — src/backend/executor/execMain.c (condensed)
    oldcontext = MemoryContextSwitchTo(estate->es_query_cxt);
    operation = queryDesc->operation;
    dest = queryDesc->dest;
    estate->es_processed = 0;
    sendTuples = (operation == CMD_SELECT || queryDesc->plannedstmt->hasReturning);
    if (sendTuples)
        dest->rStartup(dest, operation, queryDesc->tupDesc);
    if (!ScanDirectionIsNoMovement(direction))
        ExecutePlan(queryDesc, operation, sendTuples, count, direction, dest);
    estate->es_total_processed += estate->es_processed;
    if (sendTuples)
        dest->rShutdown(dest);
    MemoryContextSwitchTo(oldcontext);

ExecutorFinish runs any ModifyTable nodes that have not yet run to
completion (ExecPostprocessPlan) and fires queued AFTER triggers; it is
deliberately separate from ExecutorEnd so that EXPLAIN ANALYZE can
include trigger time in the reported total. ExecutorEnd calls
ExecEndNode to recurse the state tree releasing relations and buffer
pins, unregisters the two snapshots, and calls FreeExecutorState — which
destroys es_query_cxt and with it everything the executor allocated.
The PlanState tree mirrors the Plan tree
The README states the core invariant directly: “During executor startup we build a parallel tree of identical structure containing executor state nodes — generally, every plan node type has a corresponding executor state node type. Each node in the state tree has a pointer to its corresponding node in the plan tree … This arrangement allows the plan tree to be completely read-only so far as the executor is concerned.”
PlanState is the abstract base every state node embeds as its first
field. Its run-time-relevant members:
    // PlanState (abridged) — src/include/nodes/execnodes.h
    typedef struct PlanState
    {
        pg_node_attr(abstract)

        NodeTag     type;
        Plan       *plan;                   /* the read-only plan node */
        EState     *state;                  /* shared per-query state */
        ExecProcNodeMtd ExecProcNode;       /* the next() function (maybe wrapped) */
        ExecProcNodeMtd ExecProcNodeReal;   /* the real next() if above is a wrapper */
        Instrumentation *instrument;        /* optional EXPLAIN ANALYZE stats */
        ExprState  *qual;                   /* compiled WHERE / filter, or NULL */
        struct PlanState *lefttree;         /* input subplan(s) — the "children" */
        struct PlanState *righttree;
        List       *initPlan;               /* uncorrelated SubPlanState nodes */
        List       *subPlan;                /* correlated SubPlanState nodes */
        Bitmapset  *chgParam;               /* set of changed Param IDs -> triggers rescan */
        TupleDesc   ps_ResultTupleDesc;     /* this node's output row type */
        TupleTableSlot *ps_ResultTupleSlot; /* slot this node returns */
        ExprContext *ps_ExprContext;        /* per-tuple memory + expr scratch */
        ProjectionInfo *ps_ProjInfo;        /* compiled target list, or NULL */
        /* ... slot-type hint fields (scanops/outerops/innerops/resultops) ... */
    } PlanState;

lefttree and righttree are the iterator-tree edges: a join’s two
inputs are its left and right subtrees; a single-input node (sort,
aggregate, materialize) uses lefttree as its one child. The
convenience macros outerPlanState(node) / innerPlanState(node) read
these. qual and ps_ProjInfo are the compiled WHERE filter and target
list — pre-compiled at init by the expression machinery so the per-tuple
path is a tight interpreter loop rather than a tree walk.
flowchart TB
subgraph PLAN["Plan tree (read-only, from planner)"]
P0["NestLoop"]
P1["SeqScan dept"]
P2["SeqScan emp"]
P0 --> P1
P0 --> P2
end
subgraph STATE["PlanState tree (mutable, built by ExecInitNode)"]
S0["NestLoopState<br/>ExecProcNode=ExecNestLoop"]
S1["SeqScanState<br/>ExecProcNode=ExecSeqScan<br/>ss_currentScanDesc (cursor)"]
S2["SeqScanState<br/>cursor + ss_ScanTupleSlot"]
S0 -. "lefttree" .-> S1
S0 -. "righttree" .-> S2
end
S0 -. "->plan" .-> P0
S1 -. "->plan" .-> P1
S2 -. "->plan" .-> P2
Figure 2 — The two parallel trees. The Plan tree is the planner’s
read-only recipe; ExecInitNode builds an isomorphic PlanState tree
whose nodes carry the run-time cursor (ss_currentScanDesc), the per-node
ExecProcNode function pointer, and a back-pointer to the plan node.
Read-only plans are what make plan caching and parallel re-execution safe.
The README notes one wrinkle: a state node may be omitted when run-time
partition pruning determines a subplan can produce no rows — currently
only under Append / MergeAppend — so the state subnode array can fall
out of lockstep with the plan’s subplan list. The one-for-one mapping is
the rule; pruning is the documented exception.
ExecInitNode — the open() dispatch
ExecInitNode is the recursive tree builder. It is a switch on the plan
node’s tag that calls the right ExecInit* constructor, which in turn
recurses into its own children. The structure:
    // ExecInitNode — src/backend/executor/execProcnode.c (condensed)
    PlanState *
    ExecInitNode(Plan *node, EState *estate, int eflags)
    {
        PlanState  *result;

        if (node == NULL)               /* leaf of recursion */
            return NULL;
        check_stack_depth();

        switch (nodeTag(node))
        {
            case T_SeqScan:
                result = (PlanState *) ExecInitSeqScan((SeqScan *) node, estate, eflags);
                break;
            case T_NestLoop:
                result = (PlanState *) ExecInitNestLoop((NestLoop *) node, estate, eflags);
                break;
            case T_Agg:
                result = (PlanState *) ExecInitAgg((Agg *) node, estate, eflags);
                break;
            /* ... ~40 more node types: control, scan, join, materialization ... */
            default:
                elog(ERROR, "unrecognized node type: %d", (int) nodeTag(node));
                result = NULL;
                break;
        }

        ExecSetExecProcNode(result, result->ExecProcNode);  /* install wrapper */
        /* init this node's initPlans; set up instrumentation if requested */
        return result;
    }

Each ExecInit* constructor calls ExecInitNode on its own children, so
one call on the root depth-first-builds the whole tree. The nodeTag
switch is grouped by family in the source — control nodes (Result,
ProjectSet, ModifyTable, Append, …), scan nodes (SeqScan,
IndexScan, BitmapHeapScan, ForeignScan, …), join nodes (NestLoop,
MergeJoin, HashJoin), and materialization nodes (Material, Sort,
Agg, Gather, …) — the same grouping mirrored in ExecEndNode.
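The recursion can be sketched in isolation. The types below (PlanNode, StateNode) and the two node kinds are hypothetical stand-ins for the real Plan and PlanState hierarchies; the point is only the shape: a tag switch that runs once per node at init, a constructor that recurses into children, and a per-node function pointer chosen at that moment.

```c
#include <assert.h>
#include <stdlib.h>

typedef enum { T_SCAN, T_JOIN } Tag;

/* Hypothetical read-only plan node. */
typedef struct PlanNode {
    Tag tag;
    const struct PlanNode *left, *right;
} PlanNode;

/* Hypothetical state node mirroring exactly one plan node. */
typedef struct StateNode {
    const PlanNode *plan;               /* back-pointer to the recipe */
    struct StateNode *left, *right;
    int (*proc)(struct StateNode *);    /* per-node routine, chosen at init */
} StateNode;

static int scan_proc(StateNode *n) { (void) n; return 1; }
static int join_proc(StateNode *n)
{
    return n->left->proc(n->left) + n->right->proc(n->right);
}

/* Build the state tree depth-first; the switch runs once per node. */
static StateNode *init_node(const PlanNode *plan)
{
    if (plan == NULL)
        return NULL;

    StateNode *st = calloc(1, sizeof *st);
    assert(st != NULL);
    st->plan = plan;

    switch (plan->tag) {
    case T_SCAN:
        st->proc = scan_proc;
        break;
    case T_JOIN:
        st->left = init_node(plan->left);       /* constructors recurse */
        st->right = init_node(plan->right);
        st->proc = join_proc;
        break;
    }
    return st;
}

/* Teardown mirrors init: release children first, then the node. */
static void end_node(StateNode *st)
{
    if (st == NULL)
        return;
    end_node(st->left);
    end_node(st->right);
    free(st);
}

static int count_nodes(const StateNode *st)
{
    return st ? 1 + count_nodes(st->left) + count_nodes(st->right) : 0;
}

int build_demo(void)
{
    static const PlanNode dept = { T_SCAN, NULL, NULL };
    static const PlanNode emp  = { T_SCAN, NULL, NULL };
    static const PlanNode join = { T_JOIN, &dept, &emp };

    StateNode *root = init_node(&join);
    int n = count_nodes(root);
    int ok = root->plan == &join && root->left->plan == &dept && root->proc(root) == 2;

    end_node(root);
    return ok ? n : -1;
}
```

One call on the root yields a three-node state tree whose shape matches the plan, and the plan nodes themselves are declared const and never written.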
ExecProcNode — the next() call and its wrappers
ExecProcNode is the demand-pull next(). It is a tiny inline function in
executor.h, not a switch: each PlanState already carries a function
pointer to its own per-node routine, set at init.
    // ExecProcNode (inline) — src/include/executor/executor.h
    static inline TupleTableSlot *
    ExecProcNode(PlanState *node)
    {
        if (node->chgParam != NULL)     /* a Param changed? */
            ExecReScan(node);           /* reset this subtree before pulling */

        return node->ExecProcNode(node);
    }

The indirection through node->ExecProcNode is what makes the tree
polymorphic: a join calls ExecProcNode(outerPlanState(node)) without
knowing or branching on what the child is. But the pointer is not set
directly to the per-node function — ExecSetExecProcNode installs a
wrapper, ExecProcNodeFirst, that runs one-time checks on the first
call and then rewires the pointer to either the bare function or an
instrumentation wrapper:
    // ExecSetExecProcNode + ExecProcNodeFirst — src/backend/executor/execProcnode.c
    void
    ExecSetExecProcNode(PlanState *node, ExecProcNodeMtd function)
    {
        node->ExecProcNodeReal = function;
        node->ExecProcNode = ExecProcNodeFirst;     /* wrapper for first call */
    }

    static TupleTableSlot *
    ExecProcNodeFirst(PlanState *node)
    {
        check_stack_depth();            /* once, on first execution */

        if (node->instrument)
            node->ExecProcNode = ExecProcNodeInstr;     /* EXPLAIN ANALYZE path */
        else
            node->ExecProcNode = node->ExecProcNodeReal;    /* fast path henceforth */

        return node->ExecProcNode(node);
    }

This is a deliberate optimization: the expensive stack-depth check and the
instrumentation branch are paid once per node per query, not once per
tuple. After the first call, a non-instrumented node’s ExecProcNode
points straight at ExecSeqScan (or whatever), so the hot path is a
single indirect call with no wrapper overhead.
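The self-rewiring pointer is easy to reproduce outside PostgreSQL. In this sketch the Node type, its counters, and rewire_demo are hypothetical; first_checks stands in for the one-time stack-depth check and ncalls for instrumentation.

```c
#include <assert.h>
#include <stddef.h>

typedef struct Node Node;
typedef int (*ProcFn)(Node *);

struct Node {
    ProcFn proc;            /* what callers invoke; may be a wrapper */
    ProcFn proc_real;       /* the real per-node routine */
    int instrument;         /* nonzero: count calls */
    int first_checks;       /* how many times the one-time check ran */
    int ncalls;             /* filled in only on the instrumented path */
    int next_value;
};

static int real_proc(Node *n) { return n->next_value++; }

static int instr_proc(Node *n)
{
    n->ncalls++;
    return n->proc_real(n);
}

/* Runs on the first call only, then rewires n->proc so it is never hit again. */
static int first_proc(Node *n)
{
    n->first_checks++;      /* stand-in for a one-time check */
    n->proc = n->instrument ? instr_proc : n->proc_real;
    return n->proc(n);
}

static void set_proc(Node *n, ProcFn fn)
{
    n->proc_real = fn;
    n->proc = first_proc;
}

/* Pull five times and report how often each path ran. */
int rewire_demo(int instrument)
{
    Node n = { NULL, NULL, instrument, 0, 0, 0 };

    set_proc(&n, real_proc);
    for (int i = 0; i < 5; i++)
        n.proc(&n);
    assert(n.proc != first_proc);       /* the wrapper removed itself */
    return n.first_checks * 100 + n.ncalls;
}
```

After five pulls the one-time check has run exactly once in either mode; only the instrumented node pays for call counting, and the plain node's pointer ends up aimed directly at real_proc.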
sequenceDiagram
participant EP as ExecutePlan
participant NL as NestLoopState
participant OUT as outer SeqScanState
participant IN as inner SeqScanState
EP->>NL: ExecProcNode(root)
NL->>OUT: ExecProcNode(outer)
OUT-->>NL: slot (dept row) or empty
NL->>IN: ExecProcNode(inner)
IN-->>NL: slot (emp row) or empty
note over NL: join match? project combined row
NL-->>EP: result slot (one joined row)
note over EP: send to dest, loop again
EP->>NL: ExecProcNode(root)
note over NL,IN: inner exhausted -> advance outer, rescan inner
EP->>NL: ExecProcNode(root)
NL-->>EP: empty slot (TupIsNull) -> done
Figure 3 — Demand-pull recursion for a two-table nested-loop join. One
ExecProcNode on the root cascades down to the leaf scans; tuples bubble
back up; the join transforms the stream. ExecutePlan keeps calling the
root until it returns an empty slot. This is the iterator model’s “control
flows down, tuples flow up” property in PostgreSQL form.
ExecutePlan — the pull loop
ExecutePlan is the literal realization of “repeatedly call next() on
the top of the pipeline.” It loops on ExecProcNode(planstate), stops when
the slot is empty, optionally strips junk columns, ships the tuple to the
destination, and honors a tuple-count limit (for cursor FETCH n):
    // ExecutePlan — src/backend/executor/execMain.c (condensed)
    for (;;)
    {
        ResetPerTupleExprContext(estate);   /* free last tuple's scratch memory */

        slot = ExecProcNode(planstate);     /* the demand-pull next() */

        if (TupIsNull(slot))                /* empty slot == end of data */
            break;

        if (estate->es_junkFilter != NULL)
            slot = ExecFilterJunk(estate->es_junkFilter, slot);

        if (sendTuples)
        {
            if (!dest->receiveSlot(slot, dest))     /* client closed? */
                break;
        }

        if (operation == CMD_SELECT)
            (estate->es_processed)++;

        current_tuple_count++;
        if (numberTuples && numberTuples == current_tuple_count)
            break;                          /* honored FETCH count limit */
    }

    if (!(estate->es_top_eflags & EXEC_FLAG_BACKWARD))
        ExecShutdownNode(planstate);        /* early resource release */

Three details earn their place. (1) ResetPerTupleExprContext at the top
of every iteration is the per-tuple memory-bounding discipline from
§“Common DBMS Design” — last tuple’s transient allocations are bulk-freed
before this tuple starts. (2) The empty-slot test TupIsNull(slot) is the
“no more tuples” sentinel: a node may signal end-of-data with an empty
slot or with a C NULL, and TupIsNull accepts both. (3) The loop counts tuples for SELECT only;
for INSERT/UPDATE/DELETE the ModifyTable node counts its own
modified rows, which is why numberTuples (the cursor limit) is documented
to apply only to retrieved tuples.
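The count limit and the "state survives between calls" behaviour that cursor FETCH relies on can be sketched with a toy pull loop. Root, run_plan, and fetch_demo are hypothetical names, and a NULL pointer plays the role of the empty-slot sentinel.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical root node: its cursor survives between run_plan() calls. */
typedef struct {
    const int *rows;
    size_t nrows;
    size_t pos;
} Root;

/* Returns a pointer to the next row, or NULL as the end-of-data sentinel. */
static const int *root_next(Root *r)
{
    return r->pos < r->nrows ? &r->rows[r->pos++] : NULL;
}

/*
 * Pull up to `count` rows (0 means no limit), handing each to `receive`.
 * Returns the number of rows delivered by this call.
 */
static size_t run_plan(Root *r, size_t count, void (*receive)(int, void *), void *arg)
{
    size_t delivered = 0;

    for (;;) {
        const int *row = root_next(r);
        if (row == NULL)                /* sentinel: stop pulling */
            break;
        receive(*row, arg);
        delivered++;
        if (count != 0 && delivered == count)
            break;                      /* honor the fetch limit */
    }
    return delivered;
}

static void add_to_sum(int v, void *arg) { *(int *) arg += v; }

int fetch_demo(void)
{
    static const int rows[] = { 10, 20, 30, 40, 50 };
    Root r = { rows, 5, 0 };
    int sum = 0;

    size_t a = run_plan(&r, 2, add_to_sum, &sum);   /* delivers 10, 20 */
    size_t b = run_plan(&r, 2, add_to_sum, &sum);   /* resumes: 30, 40 */
    size_t c = run_plan(&r, 0, add_to_sum, &sum);   /* the rest: 50 */
    assert(a == 2 && b == 2 && c == 1);
    return sum;
}
```

The second call resumes at the third row without any bookkeeping in the driver, which is why a portal can call the run phase repeatedly against one state tree.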
TupleTableSlot — the unit of dataflow
Every Exec* returns a TupleTableSlot *. The slot is a handle, not a
tuple: it holds a pointer to a TupleTableSlotOps vtable plus a cache of
deformed column values, and the vtable decides what physical tuple (if
any) sits underneath.
    // TupleTableSlot — src/include/executor/tuptable.h
    typedef struct TupleTableSlot
    {
        NodeTag     type;
        uint16      tts_flags;          /* TTS_FLAG_EMPTY, _SHOULDFREE, _FIXED, _SLOW */
        AttrNumber  tts_nvalid;         /* # of leading columns already deformed */
        const TupleTableSlotOps *const tts_ops;     /* the vtable (== the slot "type") */
        TupleDesc   tts_tupleDescriptor;    /* row shape */
        Datum      *tts_values;         /* deformed column values (cache) */
        bool       *tts_isnull;         /* per-column null flags */
        MemoryContext tts_mcxt;
        ItemPointerData tts_tid;        /* TID of the stored tuple, if any */
        Oid         tts_tableOid;
    } TupleTableSlot;

The four built-in slot types are four const TupleTableSlotOps vtables,
each filling in init / clear / getsomeattrs / materialize /
copyslot / get_heap_tuple / etc. for one backing:
    // the four slot vtables — src/include/executor/tuptable.h
    extern PGDLLIMPORT const TupleTableSlotOps TTSOpsVirtual;
    extern PGDLLIMPORT const TupleTableSlotOps TTSOpsHeapTuple;
    extern PGDLLIMPORT const TupleTableSlotOps TTSOpsMinimalTuple;
    extern PGDLLIMPORT const TupleTableSlotOps TTSOpsBufferHeapTuple;

| Slot type | Backing | Typical producer |
|---|---|---|
| TTSOpsVirtual | none — tts_values/tts_isnull are the tuple | projections, VALUES, computed join output |
| TTSOpsHeapTuple | a palloc’d HeapTuple the slot owns | tuples built in memory (for example with heap_form_tuple) |
| TTSOpsMinimalTuple | a MinimalTuple (no header/visibility) the slot owns | sort output, hash tables, tuplestores |
| TTSOpsBufferHeapTuple | a HeapTuple pointing into a pinned shared buffer | SeqScan / IndexScan reading heap pages |
The BufferHeapTuple slot is the one that ties the executor to the storage
layer: it does not copy the row out of the buffer pool, it pins the buffer
and points into it, deferring both the copy and the pin-release. The
MinimalTuple slot is the one that crosses the parallel-query shm_mq
boundary, because a minimal tuple has no transaction header to carry. The
Virtual slot is the cheapest and the one a projection produces — it never
“deforms” because its columns are its tts_values array.
flowchart TB
CONS["any consumer node<br/>(join, agg, sort, dest)"]
CONS -->|"slot_getattr / slot_getsomeattrs"| SLOT
subgraph SLOT["TupleTableSlot (uniform handle)"]
VT["tts_ops -> vtable<br/>(getsomeattrs / materialize / copyslot)"]
CACHE["tts_values[] + tts_isnull[]<br/>tts_nvalid = deformed watermark"]
end
VT --> V["TTSOpsVirtual<br/>(values ARE the tuple)"]
VT --> H["TTSOpsHeapTuple<br/>(owned HeapTuple)"]
VT --> M["TTSOpsMinimalTuple<br/>(owned MinimalTuple)"]
VT --> B["TTSOpsBufferHeapTuple<br/>(HeapTuple in pinned buffer)"]
B -.-> BUF[("shared buffer pool")]
Figure 4 — The slot abstraction. A consumer asks the slot for column
values through one interface; the tts_ops vtable dispatches to the
backing’s deform/materialize routines. The BufferHeapTuple backing points
into a pinned shared buffer rather than copying — the seam between the
executor and the buffer manager.
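The handle-plus-vtable arrangement can be sketched with two toy backings. Slot, SlotOps, and the "virtual" and "packed" backings below are hypothetical simplifications (fixed-width long columns, a single getattr operation) and do not reproduce the real TupleTableSlotOps interface.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef struct Slot Slot;

/* Hypothetical vtable: one entry per backing-specific operation. */
typedef struct {
    long (*getattr)(const Slot *slot, int attnum);  /* attnum is 1-based */
} SlotOps;

struct Slot {
    const SlotOps *ops;             /* selects the physical backing */
    int natts;
    const long *values;             /* "virtual" backing: the values are the tuple */
    const unsigned char *packed;    /* "packed" backing: raw fixed-width bytes */
};

static long virtual_getattr(const Slot *s, int attnum)
{
    return s->values[attnum - 1];
}

static long packed_getattr(const Slot *s, int attnum)
{
    long v;
    memcpy(&v, s->packed + (size_t) (attnum - 1) * sizeof v, sizeof v);
    return v;
}

static const SlotOps VirtualOps = { virtual_getattr };
static const SlotOps PackedOps  = { packed_getattr };

/* A consumer sees only the handle; it never branches on the backing. */
static long sum_columns(const Slot *s)
{
    long sum = 0;
    for (int att = 1; att <= s->natts; att++)
        sum += s->ops->getattr(s, att);
    return sum;
}

int handle_demo(void)
{
    static const long vals[] = { 3, 4, 5 };
    unsigned char bytes[sizeof vals];

    memcpy(bytes, vals, sizeof vals);       /* same row, different storage */

    Slot v = { &VirtualOps, 3, vals, NULL };
    Slot p = { &PackedOps, 3, NULL, bytes };
    assert(sum_columns(&v) == 12);
    return sum_columns(&v) == sum_columns(&p);
}
```

sum_columns is written once and works for both backings; adding a third backing means adding a vtable, not touching any consumer.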
A slot is created by MakeTupleTableSlot, which co-allocates the
tts_values/tts_isnull arrays with the slot when the descriptor is known
and calls the vtable’s init:
    // MakeTupleTableSlot — src/backend/executor/execTuples.c (condensed)
    basesz = tts_ops->base_slot_size;
    if (tupleDesc)
        allocsz = MAXALIGN(basesz) +
            MAXALIGN(tupleDesc->natts * sizeof(Datum)) +
            MAXALIGN(tupleDesc->natts * sizeof(bool));
    else
        allocsz = basesz;
    slot = palloc0(allocsz);
    *((const TupleTableSlotOps **) &slot->tts_ops) = tts_ops;   /* set the vtable */
    slot->type = T_TupleTableSlot;
    slot->tts_flags |= TTS_FLAG_EMPTY;      /* born empty */
    slot->tts_tupleDescriptor = tupleDesc;
    /* ... point tts_values/tts_isnull into the co-allocation ... */
    slot->tts_ops->init(slot);              /* backing-specific init */

A freshly made slot is empty (TTS_FLAG_EMPTY), and TupIsNull
treats both a C NULL pointer and an empty slot as “no tuple”:

    // TupIsNull — src/include/executor/tuptable.h
    #define TupIsNull(slot) \
        ((slot) == NULL || TTS_EMPTY(slot))

To put a tuple into a slot, the executor uses one of the ExecStore*
routines matched to the slot type — ExecStoreHeapTuple,
ExecStoreMinimalTuple, ExecStoreBufferHeapTuple, or, for a virtual
slot whose tts_values were filled directly, ExecStoreVirtualTuple,
which simply clears the empty flag and declares all columns valid:
    // ExecStoreVirtualTuple — src/backend/executor/execTuples.c
    TupleTableSlot *
    ExecStoreVirtualTuple(TupleTableSlot *slot)
    {
        Assert(TTS_EMPTY(slot));
        slot->tts_flags &= ~TTS_FLAG_EMPTY;
        slot->tts_nvalid = slot->tts_tupleDescriptor->natts;    /* all valid */
        return slot;
    }

Lazy column decode
tts_nvalid is the deform watermark: it records how many leading columns
of the current tuple have been decoded into tts_values. When something
asks for column n, slot_getsomeattrs deforms only up to n (if not
already done) by calling the vtable’s getsomeattrs:
    // slot_getsomeattrs (inline) — src/include/executor/tuptable.h, calls execTuples.c
    static inline void
    slot_getsomeattrs(TupleTableSlot *slot, int attnum)
    {
        if (slot->tts_nvalid < attnum)
            slot_getsomeattrs_int(slot, attnum);    /* deform the gap, advance nvalid */
    }

A query selecting dept (column 1) from a 12-column row never decodes
columns 2–12. The expression machinery exploits this: the README explains
that an ExprState’s step array “begins with EEOP_*_FETCHSOME steps that
ensure that the relevant tuples have been deconstructed to make the
required columns directly available (cf. slot_getsomeattrs()),” so each
Var-fetch is “little more than an array lookup.” The mechanics of those
steps live in postgres-expression-eval.md.
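The watermark idea can be shown in isolation. The following is a minimal sketch, not PostgreSQL code: ToySlot, toy_store, toy_getsomeattrs, and toy_getattr are hypothetical names, and the "tuple" is assumed to be a packed array of ints rather than a real heap tuple. It only illustrates how a left-prefix decode plus a watermark avoids touching columns nobody asked for.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_MAX_ATTS 16

/* Hypothetical stand-in for a slot: raw bytes plus a decoded-values cache. */
typedef struct ToySlot
{
    const unsigned char *raw;   /* packed tuple: natts ints, back to back */
    int         natts;          /* number of columns in the tuple */
    int         nvalid;         /* watermark: values[0..nvalid-1] are decoded */
    int         values[TOY_MAX_ATTS];
    int         decode_calls;   /* instrumentation: columns actually decoded */
} ToySlot;

/* Store a new raw tuple; nothing is decoded yet. */
static void
toy_store(ToySlot *slot, const unsigned char *raw, int natts)
{
    assert(natts >= 0 && natts <= TOY_MAX_ATTS);
    slot->raw = raw;
    slot->natts = natts;
    slot->nvalid = 0;
    slot->decode_calls = 0;
}

/* Decode columns nvalid+1 .. attnum (1-based) and advance the watermark. */
static void
toy_getsomeattrs(ToySlot *slot, int attnum)
{
    assert(attnum >= 1 && attnum <= slot->natts);
    while (slot->nvalid < attnum)
    {
        memcpy(&slot->values[slot->nvalid],
               slot->raw + (size_t) slot->nvalid * sizeof(int),
               sizeof(int));
        slot->nvalid++;
        slot->decode_calls++;
    }
}

/* Fetch one column (1-based); decodes only if the watermark is too low. */
static int
toy_getattr(ToySlot *slot, int attnum)
{
    if (slot->nvalid < attnum)
        toy_getsomeattrs(slot, attnum);
    return slot->values[attnum - 1];
}
```

Asking for column 1 of a 12-column toy tuple decodes one column; a later request for column 3 decodes only columns 2 and 3, and re-reading column 2 afterwards costs nothing beyond the array lookup.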
EState and the two memory contexts
EState is the per-execution global — one per executor invocation, shared
by every node in the tree (each PlanState.state points at it). Its
run-time-relevant fields:
```c
// EState (abridged) — src/include/nodes/execnodes.h
typedef struct EState
{
    NodeTag         type;
    ScanDirection   es_direction;           /* forward / backward / no-movement */
    Snapshot        es_snapshot;            /* the registered query snapshot */
    Snapshot        es_crosscheck_snapshot;
    List           *es_range_table;         /* RangeTblEntry list */
    PlannedStmt    *es_plannedstmt;
    JunkFilter     *es_junkFilter;
    CommandId       es_output_cid;          /* CID to stamp inserted/deleted tuples */
    ResultRelInfo **es_result_relations;    /* DML target tables */
    ParamListInfo   es_param_list_info;     /* external params */
    ParamExecData  *es_param_exec_vals;     /* internal params (subplans) */
    MemoryContext   es_query_cxt;           /* PER-QUERY arena: holds everything */
    List           *es_tupleTable;          /* all the TupleTableSlots */
    uint64          es_processed;           /* tuples processed this ExecutorRun */
    int             es_top_eflags;
    List           *es_subplanstates;
    ExprContext    *es_per_tuple_exprcontext;  /* PER-OUTPUT-TUPLE arena */
    struct EPQState *es_epq_active;         /* non-NULL inside an EvalPlanQual recheck */
    bool            es_use_parallel_mode;
    /* ... parallel-worker counters, DSA area, etc. ... */
} EState;
```
The two memory lifetimes the README spells out:
- Per-query context (es_query_cxt): created by CreateExecutorState, it holds the PlanState tree, the ExprState trees, the slots, everything. Teardown is a single MemoryContextDelete inside FreeExecutorState — no retail pfree, “rather than messing with retail pfree’s and probable storage leaks, we just destroy the memory context.”
- Per-tuple context (each ExprContext.ecxt_per_tuple_memory, plus the top-level es_per_tuple_exprcontext): scratch space for evaluating one tuple’s quals and projections, reset once per tuple. ResetExprContext / ResetPerTupleExprContext are the bulk-free.
```mermaid
flowchart TB
    subgraph QCTX["es_query_cxt — per-query context (lives whole query)"]
        PST["PlanState tree"]
        EXS["ExprState trees<br/>(compiled quals + tlists)"]
        SLOTS["TupleTableSlots<br/>(es_tupleTable)"]
    end
    subgraph TCTX["per-tuple ExprContext memory (reset every tuple)"]
        SCR["qual / projection scratch<br/>function-call working storage"]
    end
    RESET["ResetExprContext / ResetPerTupleExprContext<br/>(bulk-free, once per tuple)"]
    RESET -.->|empties| TCTX
    FREE["FreeExecutorState -> MemoryContextDelete"]
    FREE -.->|destroys| QCTX
```
Figure 5 — Two memory lifetimes. The per-query context holds the entire
state machine and is freed in one operation at ExecutorEnd. The per-tuple
context holds only the current tuple’s transient scratch and is reset once
per tuple, bounding intra-query memory regardless of how many tuples flow.
(Source: executor/README “Memory Management”.)
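The two lifetimes can be imitated with a pair of bump arenas. This is a minimal sketch under stated assumptions: ToyArena and its functions are hypothetical, a single fixed-size malloc block stands in for a memory context, and PostgreSQL's real MemoryContext machinery (multiple blocks, child contexts, callbacks) is not modeled. The point is only that resetting the per-tuple arena each iteration keeps its high-water mark flat no matter how many tuples flow.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical bump arena: one block, one cursor. */
typedef struct ToyArena
{
    char   *base;
    size_t  cap;
    size_t  used;
} ToyArena;

static int
toy_arena_init(ToyArena *a, size_t cap)
{
    a->base = malloc(cap);
    a->cap = cap;
    a->used = 0;
    return a->base != NULL;
}

static void *
toy_arena_alloc(ToyArena *a, size_t n)
{
    void   *p;

    n = (n + 7) & ~(size_t) 7;      /* keep allocations 8-byte aligned */
    if (n > a->cap - a->used)
        return NULL;                /* arena exhausted */
    p = a->base + a->used;
    a->used += n;
    return p;
}

/* Bulk-free: forget every allocation, keep the block (per-tuple reset). */
static void
toy_arena_reset(ToyArena *a)
{
    a->used = 0;
}

/* Destroy the arena and everything in it (per-query teardown). */
static void
toy_arena_delete(ToyArena *a)
{
    free(a->base);
    a->base = NULL;
    a->cap = a->used = 0;
}

/* Process ntuples; return the per-tuple arena's high-water mark in bytes. */
static size_t
toy_run_query(ToyArena *per_query, ToyArena *per_tuple, int ntuples)
{
    size_t  peak = 0;
    long   *total = toy_arena_alloc(per_query, sizeof(long));  /* lives all query */

    if (total == NULL)
        return 0;
    *total = 0;
    for (int i = 0; i < ntuples; i++)
    {
        toy_arena_reset(per_tuple);             /* free last tuple's scratch */
        int *scratch = toy_arena_alloc(per_tuple, 64);

        if (scratch == NULL)
            break;
        scratch[0] = i;                         /* stand-in for qual scratch */
        *total += scratch[0];
        if (per_tuple->used > peak)
            peak = per_tuple->used;
    }
    return peak;
}
```

With a 128-byte per-tuple arena, a run over 100000 toy tuples never needs more than the 64 bytes one tuple uses; without the reset the same loop would exhaust the arena on the third tuple.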
The es_epq_active field is the executor’s hook for EvalPlanQual — the
READ COMMITTED update-recheck machinery where, on a concurrent update,
the query is re-run for the single modified row to see whether it still
passes the quals. The executor framework only carries the flag and the
substitute-tuple plumbing (see the scan entry point next); the recheck
policy belongs to the transaction/MVCC layer and is covered in postgres-mvcc-snapshots.md.
The scan entry point — ExecScan and ExecScanExtended
Every scan node (SeqScan, IndexScan, SampleScan, SubqueryScan, …)
shares one piece of boilerplate: fetch a candidate tuple from the access
method, check it against the node’s qual, project the surviving columns,
loop until a tuple passes or the source is exhausted. That boilerplate is
ExecScan / ExecScanExtended in execScan.[ch]. ExecScan just pulls
the node’s qual / projection / EPQ state and forwards to the always-inlined
core:
```c
// ExecScan — src/backend/executor/execScan.c
TupleTableSlot *
ExecScan(ScanState *node,
         ExecScanAccessMtd accessMtd,    /* per-node "get next raw tuple" */
         ExecScanRecheckMtd recheckMtd)  /* per-node EPQ recheck */
{
    EPQState       *epqstate = node->ps.state->es_epq_active;
    ExprState      *qual = node->ps.qual;
    ProjectionInfo *projInfo = node->ps.ps_ProjInfo;

    return ExecScanExtended(node, accessMtd, recheckMtd, epqstate, qual, projInfo);
}
```
ExecScanExtended is marked pg_attribute_always_inline so the compiler
can specialize it per call site: where a caller passes a constant NULL for
qual or projInfo, the corresponding branches are eliminated entirely. Its loop:
```c
// ExecScanExtended — src/include/executor/execScan.h (condensed)
ExprContext *econtext = node->ps.ps_ExprContext;

if (!qual && !projInfo)                 /* no filter, no projection */
{
    ResetExprContext(econtext);
    return ExecScanFetch(node, epqstate, accessMtd, recheckMtd);  /* raw tuple */
}

ResetExprContext(econtext);             /* free last tuple's scratch */
for (;;)
{
    TupleTableSlot *slot = ExecScanFetch(node, epqstate, accessMtd, recheckMtd);

    if (TupIsNull(slot))                /* source exhausted */
    {
        if (projInfo)
            return ExecClearTuple(projInfo->pi_state.resultslot);
        else
            return slot;
    }

    econtext->ecxt_scantuple = slot;    /* expose tuple to Var refs in qual/tlist */

    if (qual == NULL || ExecQual(qual, econtext))
    {
        if (projInfo)
            return ExecProject(projInfo);   /* project surviving columns */
        else
            return slot;                    /* return raw scan tuple */
    }
    else
        InstrCountFiltered1(node, 1);   /* count a qual-rejected row */

    ResetExprContext(econtext);         /* tuple failed qual -> free + retry */
}
```
This is where the “fast path” the README hints at lives: a SELECT * with
no WHERE (no qual, and a tlist that exactly matches the scan descriptor,
so ps_ProjInfo == NULL) takes the if (!qual && !projInfo) branch and
returns the access method’s slot untouched — zero projection, zero qual
call. ExecAssignScanProjectionInfo is what decides projection can be
skipped, leaving ps_ProjInfo NULL when “the requested tlist exactly
matches the underlying tuple type.”
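The shape of that loop, including the fast path, can be reproduced over a plain array. The sketch below is illustrative only: ToyScan, toy_next, toy_exec_scan, and the two callbacks are hypothetical names, rows are bare ints, and NULL stands in for the empty-slot sentinel. None of the expression machinery or memory-context resets is modeled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical scan state over an in-memory array of int "rows". */
typedef struct ToyScan
{
    const int  *rows;       /* the "relation" */
    int         nrows;
    int         pos;        /* scan cursor */
    int         filtered;   /* rows rejected by the qual */
    int         out;        /* result "slot" used by projection */
} ToyScan;

typedef bool (*ToyQual) (int row);
typedef int (*ToyProject) (int row);

/* Access method: next raw row, or NULL at end of data. */
static const int *
toy_next(ToyScan *scan)
{
    if (scan->pos >= scan->nrows)
        return NULL;
    return &scan->rows[scan->pos++];
}

/* Fetch / qual / project loop; qual and project may each be NULL. */
static const int *
toy_exec_scan(ToyScan *scan, ToyQual qual, ToyProject project)
{
    if (qual == NULL && project == NULL)
        return toy_next(scan);          /* fast path: raw row, untouched */

    for (;;)
    {
        const int  *row = toy_next(scan);

        if (row == NULL)
            return NULL;                /* source exhausted */
        if (qual == NULL || qual(*row))
        {
            if (project == NULL)
                return row;             /* raw row passed the qual */
            scan->out = project(*row);  /* project the surviving row */
            return &scan->out;
        }
        scan->filtered++;               /* row failed the qual: retry */
    }
}

/* Example callbacks. */
static bool
toy_is_even(int row)
{
    return row % 2 == 0;
}

static int
toy_times_ten(int row)
{
    return row * 10;
}
```

Called repeatedly over the rows 1..5 with toy_is_even and toy_times_ten, the scan yields 20, then 40, then NULL, having rejected three rows. Called with both callbacks NULL it returns a pointer straight into the source array, which is the analogue of handing back the access method's slot unmodified.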
ExecScanFetch (also always-inlined) is the indirection to the per-node
access method, with the EvalPlanQual substitution layered in front. Outside
an EPQ recheck (epqstate == NULL, the common case) the EPQ block is skipped;
at call sites that pass a constant NULL for epqstate the compiler can drop it
entirely, whereas through the generic ExecScan wrapper, which loads epqstate
at run time, it remains an ordinary branch. Either way the common path
reduces to return (*accessMtd)(node):
```c
// ExecScanFetch — src/include/executor/execScan.h (common-case skeleton)
static pg_attribute_always_inline TupleTableSlot *
ExecScanFetch(ScanState *node,
              EPQState *epqstate,
              ExecScanAccessMtd accessMtd,
              ExecScanRecheckMtd recheckMtd)
{
    CHECK_FOR_INTERRUPTS();

    if (epqstate != NULL)
    {
        /* inside an EvalPlanQual recheck: return the substitute test tuple,
         * or an empty slot if this rel's EPQ tuple was already consumed */
        /* ... relsubs_slot / relsubs_rowmark handling ... */
    }

    return (*accessMtd) (node);     /* the node's real next-tuple routine */
}
```
accessMtd is the per-node function that actually talks to storage — e.g.
SeqNext for a sequential scan, which calls the table AM’s
table_scan_getnextslot. Those per-node access methods, and the table/index
AM layer they call, are the subject of postgres-scan-nodes.md and the
storage-engine docs; this document stops at the ExecScan boilerplate they
all plug into.
ModifyTable — the write entry point
For INSERT / UPDATE / DELETE / MERGE, the actual table mutation is
not in ExecutePlan; it happens inside a top-level ModifyTable plan
node (ExecInitModifyTable / ExecModifyTable). ExecutePlan still pulls
tuples from the root, but for these commands the root is the
ModifyTable node: the plan tree below it produces the new column values
plus a “junk” row-identity column (a CTID for a heap table), and
ModifyTable performs the insert/update/delete via the table AM. If there
is a RETURNING clause, ModifyTable emits the computed rows as its output
(which is why hasReturning makes sendTuples true); otherwise it returns
nothing and the row counting is done by ModifyTable itself, not the
CMD_SELECT counter in ExecutePlan. The per-node mechanics of
ModifyTable (tuple routing for partitions, trigger firing, ON CONFLICT,
MERGE action dispatch) are deep enough for their own treatment and are
deferred to a dedicated node doc; the framework fact to carry is that
writes are a node, pulled like any other.
ExecEndNode — symmetric teardown
ExecEndNode mirrors ExecInitNode: a nodeTag switch to the per-node
ExecEnd* routine, each of which recurses into its children and releases
that node’s resources (closing relations, dropping buffer pins, freeing
tuplestores). The README is explicit that this is not about freeing
memory — “it’s not really critical for ExecEndNode to free any memory;
it’ll all go away in FreeExecutorState anyway” — but about releasing
non-memory resources (relation locks, buffer pins) that the memory-context
delete would otherwise leak. ExecShutdownNode is a lighter, earlier pass
(called from ExecutePlan when no backward scan is possible) that stops
asynchronous resource consumption — notably shutting down parallel workers
and propagating their buffer-usage stats into the Gather node.
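The teardown recursion can be sketched in a few lines. This is a toy model, not the executor's code: EndNode and end_node are hypothetical, and an integer pin count stands in for the buffer pins and open relations a real node would hold. It shows only that the walk exists to hand back non-memory resources; node memory is deliberately left alone, since in the real executor the context delete reclaims it.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical state node holding some non-memory resource. */
typedef struct EndNode
{
    struct EndNode *left;
    struct EndNode *right;
    int         pins;       /* stand-in for buffer pins / open relations */
    int         ended;      /* set once teardown has visited the node */
} EndNode;

/*
 * Release this node's resources, then recurse into its children.
 * Returns the total number of pins released in the subtree.
 */
static int
end_node(EndNode *node)
{
    int         released;

    if (node == NULL)
        return 0;           /* absent child: nothing to do */
    released = node->pins;
    node->pins = 0;         /* drop the pins; memory is not freed here */
    node->ended = 1;
    return released + end_node(node->left) + end_node(node->right);
}
```

For a join node over two scans holding 2 and 1 pins respectively, a single end_node call on the join releases all 3 and marks every node ended.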
Source Walkthrough
Anchor on symbol names, not line numbers. The PostgreSQL source moves between releases; a function or struct name is the stable handle. Use git grep -n '<symbol>' src/backend/executor/ to locate the current position. The line numbers in the position-hint table were observed at commit 273fe94 (REL_18_STABLE) and are quick hints only.
Lifecycle entry points (execMain.c)
- ExecutorStart / standard_ExecutorStart — create EState, register the snapshot, set es_output_cid, begin AFTER-trigger scope, call InitPlan.
- ExecutorRun / standard_ExecutorRun — start the dest receiver, call ExecutePlan unless the direction is no-movement, accumulate es_total_processed.
- ExecutorFinish / standard_ExecutorFinish — ExecPostprocessPlan (run unfinished ModifyTable nodes) then AfterTriggerEndQuery.
- ExecutorEnd / standard_ExecutorEnd — ExecEndNode, unregister both snapshots, FreeExecutorState.
- ExecutorRewind — ExecReScan the root for cursor rewind (SELECT only).
- InitPlan — permissions, range table, row marks, initial pruning, subplan init, then ExecInitNode(root); derives result TupleDesc and junk filter.
- ExecutePlan — the demand-pull loop: ResetPerTupleExprContext → ExecProcNode → junk filter → dest->receiveSlot → count → limit check.
- ExecCheckPermissions / ExecCheckOneRelPerms — per-relation and per-column ACL checks driven from InitPlan.
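The ExecutePlan loop in the list above can be reduced to a toy driver. This is a sketch under assumptions, not the real function: PullSlot, PullNode, pull_next, and pull_run are hypothetical, the junk filter and per-tuple context reset are omitted, and summing values stands in for dest->receiveSlot. It keeps the two features that matter for the pull model: the NULL-or-empty sentinel test and the optional row limit, where 0 means no limit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical slot: an "empty" flag plus one int column. */
typedef struct PullSlot
{
    bool        empty;
    int         value;
} PullSlot;

/* NULL pointer or empty slot both mean "no tuple". */
#define PULL_IS_NULL(slot) ((slot) == NULL || (slot)->empty)

/* Hypothetical leaf node emitting the integers in [next, end). */
typedef struct PullNode
{
    int         next;
    int         end;
    PullSlot    slot;       /* node-owned result slot */
} PullNode;

/* The node's next(): fill the slot, or clear it at end of data. */
static PullSlot *
pull_next(PullNode *node)
{
    if (node->next >= node->end)
    {
        node->slot.empty = true;
        return &node->slot;
    }
    node->slot.empty = false;
    node->slot.value = node->next++;
    return &node->slot;
}

/*
 * Driver loop: pull from the root until the sentinel or the limit.
 * limit == 0 means "no limit". Returns the number of rows delivered;
 * *sum receives the total of the delivered values.
 */
static long
pull_run(PullNode *root, long limit, long *sum)
{
    long        processed = 0;

    *sum = 0;
    for (;;)
    {
        PullSlot   *slot = pull_next(root);

        if (PULL_IS_NULL(slot))
            break;                      /* end of data */
        *sum += slot->value;            /* stand-in for the dest receiver */
        processed++;
        if (limit != 0 && processed >= limit)
            break;                      /* row limit reached */
    }
    return processed;
}
```

A node emitting 0..4 delivers five rows summing to 10 with no limit, and two rows summing to 1 with a limit of 2; in the limited case the node is simply never asked for the remaining rows.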
Dispatch layer (execProcnode.c)
- ExecInitNode — the open() dispatch; nodeTag switch to ExecInit*, recurses children, installs the ExecProcNode wrapper.
- ExecSetExecProcNode — install ExecProcNodeFirst as the first-call wrapper, stash the real function in ExecProcNodeReal.
- ExecProcNodeFirst — one-time stack check, then rewire ExecProcNode to the real function or ExecProcNodeInstr.
- ExecProcNodeInstr — instrumentation wrapper used under EXPLAIN ANALYZE (InstrStartNode / InstrStopNode around the real call).
- MultiExecProcNode — the bulk variant for nodes that return a whole structure (a hash table, a bitmap) rather than one tuple at a time: Hash, BitmapIndexScan, BitmapAnd, BitmapOr.
- ExecEndNode — the close() dispatch; symmetric nodeTag switch to ExecEnd*.
- ExecShutdownNode / ExecShutdownNode_walker — early async-resource release; shuts down Gather / GatherMerge workers, Hash, foreign and custom scans.
- ExecSetTupleBound — push a FETCH n limit down through bound-aware nodes (Sort, IncrementalSort, Append, Gather, …) for bounded sort.
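The first-call wrapper trick is easy to isolate. The sketch below is a hypothetical miniature: ToyNode, toy_set_proc, toy_first, toy_instr, and toy_real are invented names that mirror the roles of ExecSetExecProcNode, ExecProcNodeFirst, ExecProcNodeInstr, and a per-node routine, and an int return value replaces the slot. The one-time stack check is represented only by a counter.

```c
#include <assert.h>
#include <stddef.h>

typedef struct ToyNode ToyNode;
typedef int (*ToyProcMtd) (ToyNode *node);

struct ToyNode
{
    ToyProcMtd  proc;           /* what callers invoke */
    ToyProcMtd  proc_real;      /* the node's real routine */
    int         instrument;     /* nonzero: route through the counting wrapper */
    int         first_calls;    /* times the first-call wrapper ran */
    int         instr_calls;    /* calls seen by the instrumentation wrapper */
    int         next_value;     /* toy payload: next value to return */
};

/* The node's real next-tuple routine. */
static int
toy_real(ToyNode *node)
{
    return node->next_value++;
}

/* Instrumentation wrapper: count the call, then run the real routine. */
static int
toy_instr(ToyNode *node)
{
    node->instr_calls++;
    return node->proc_real(node);
}

/* First-call wrapper: do one-time work, rewire the pointer, then dispatch. */
static int
toy_first(ToyNode *node)
{
    node->first_calls++;        /* one-time checks would go here */
    node->proc = node->instrument ? toy_instr : node->proc_real;
    return node->proc(node);
}

/* Install the first-call wrapper and stash the real routine. */
static void
toy_set_proc(ToyNode *node, ToyProcMtd real, int instrument)
{
    node->proc_real = real;
    node->proc = toy_first;
    node->instrument = instrument;
    node->first_calls = 0;
    node->instr_calls = 0;
    node->next_value = 0;
}
```

After the first call through node->proc, the pointer no longer refers to toy_first, so every later call pays one indirect call and no extra test. Choosing the instrumented path at that moment rather than per call is what keeps the uninstrumented case free of a branch.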
The next() call and rescan (executor.h)
- ExecProcNode (inline) — the public next(): rescan-if-chgParam, then node->ExecProcNode(node).
- ExecProcNodeMtd (typedef, execnodes.h) — the per-node function-pointer type returning a TupleTableSlot *.
- ExecReScan — reset a subtree to re-emit its output sequence (driven by changed Params); the README’s “rescan command.”
State-tree structures (execnodes.h)
- struct PlanState — abstract base; the ExecProcNode pointer, lefttree / righttree edges, qual, ps_ProjInfo, ps_ResultTupleSlot, ps_ExprContext.
- struct EState — per-execution global; es_query_cxt, es_snapshot, es_direction, es_tupleTable, es_per_tuple_exprcontext, es_epq_active.
- struct ExprContext — per-tuple memory (ecxt_per_tuple_memory) plus the ecxt_scantuple / ecxt_innertuple / ecxt_outertuple slots that Var references resolve against.
- struct ScanState — the scan-node base: ss_currentRelation, ss_currentScanDesc (the cursor), ss_ScanTupleSlot.
Slot abstraction (tuptable.h, execTuples.c)
- struct TupleTableSlot — the uniform handle; tts_ops, tts_values / tts_isnull, tts_nvalid, tts_flags.
- struct TupleTableSlotOps — the vtable (init / clear / getsomeattrs / materialize / copyslot / get_heap_tuple / …).
- TTSOpsVirtual / TTSOpsHeapTuple / TTSOpsMinimalTuple / TTSOpsBufferHeapTuple — the four backing implementations.
- TupIsNull (macro) — NULL-or-empty test, the end-of-data sentinel.
- MakeTupleTableSlot / ExecAllocTableSlot / ExecInitScanTupleSlot — slot construction.
- ExecStoreHeapTuple / ExecStoreMinimalTuple / ExecStoreVirtualTuple — put a tuple into a slot of the matching type.
- ExecClearTuple (inline) — return a slot to the empty state.
- slot_getsomeattrs / slot_getsomeattrs_int — lazy left-prefix deform up to attnum, advancing tts_nvalid.
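One consequence of the vtable design is that the ops pointer can double as a type tag, since each backing has exactly one ops object. A minimal sketch of that idea follows; TagSlot, TagSlotOps, the two ops objects, and the TAG_IS_* macros are hypothetical names chosen for illustration, and only a clear callback is modeled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct TagSlot TagSlot;

/* Hypothetical vtable: one const instance per backing implementation. */
typedef struct TagSlotOps
{
    size_t      base_slot_size;
    void        (*clear) (TagSlot *slot);
} TagSlotOps;

struct TagSlot
{
    const TagSlotOps *ops;  /* behavior AND identity of the slot's backing */
    bool        empty;
};

/* Shared callback: mark the slot empty. */
static void
tag_clear(TagSlot *slot)
{
    slot->empty = true;
}

/* Two distinct ops objects; their addresses differ even if contents match. */
static const TagSlotOps TagOpsVirtual = {sizeof(TagSlot), tag_clear};
static const TagSlotOps TagOpsHeap = {sizeof(TagSlot), tag_clear};

/* Type tests compare the vtable address, so no separate tag field is needed. */
#define TAG_IS_VIRTUAL(slot) ((slot)->ops == &TagOpsVirtual)
#define TAG_IS_HEAP(slot)    ((slot)->ops == &TagOpsHeap)
```

A slot initialized with &TagOpsVirtual answers true to TAG_IS_VIRTUAL and false to TAG_IS_HEAP, and callers that do not care about the backing simply go through slot->ops->clear(slot).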
Scan boilerplate (execScan.c, execScan.h)
- ExecScan — the public scan driver; reads qual / projInfo / EPQ and forwards to ExecScanExtended.
- ExecScanExtended (always-inline) — fetch / qual / project loop with the no-qual-no-project fast path.
- ExecScanFetch (always-inline) — CHECK_FOR_INTERRUPTS + EPQ substitution + (*accessMtd)(node).
- ExecAssignScanProjectionInfo — decide whether projection can be skipped (leaves ps_ProjInfo NULL when the tlist matches the scan descriptor).
- ExecScanReScan — clear the scan slot and reset EPQ done-flags on rescan.
Position hints (as of 2026-06-05, REL_18 273fe94)
| Symbol | File | Line |
|---|---|---|
| ExecutorStart | execMain.c | 122 |
| standard_ExecutorStart | execMain.c | 141 |
| standard_ExecutorRun | execMain.c | 307 |
| standard_ExecutorFinish | execMain.c | 415 |
| standard_ExecutorEnd | execMain.c | 475 |
| ExecCheckPermissions | execMain.c | 582 |
| InitPlan | execMain.c | 836 |
| ExecutePlan | execMain.c | 1660 |
| ExecInitNode | execProcnode.c | 142 |
| ExecSetExecProcNode | execProcnode.c | 430 |
| ExecProcNodeFirst | execProcnode.c | 448 |
| ExecProcNodeInstr | execProcnode.c | 479 |
| MultiExecProcNode | execProcnode.c | 507 |
| ExecEndNode | execProcnode.c | 562 |
| ExecShutdownNode | execProcnode.c | 772 |
| ExecSetTupleBound | execProcnode.c | 848 |
| ExecProcNode (inline) | executor.h | 310 |
| ExecProcNodeMtd (typedef) | execnodes.h | 1140 |
| struct PlanState | execnodes.h | 1149 |
| struct EState | execnodes.h | 649 |
| struct ExprContext | execnodes.h | 262 |
| struct ScanState | execnodes.h | 1609 |
| struct TupleTableSlot | tuptable.h | 114 |
| struct TupleTableSlotOps | tuptable.h | 134 |
| TupIsNull (macro) | tuptable.h | 310 |
| MakeTupleTableSlot | execTuples.c | 1301 |
| ExecStoreVirtualTuple | execTuples.c | 1741 |
| ExecScan | execScan.c | 47 |
| ExecScanReScan | execScan.c | 108 |
| ExecScanFetch (inline) | execScan.h | 32 |
| ExecScanExtended (inline) | execScan.h | 160 |
Source verification (as of 2026-06-05)
Each entry leads with a fact about the current source at commit 273fe94 (REL_18_STABLE), readable without any other materials. The trailing sentence shows how it was checked. Open questions follow as recorded gaps.
Verified facts
- All four executor entry points are hook-dispatchers over a standard_* body. ExecutorStart / Run / Finish / End each test a *_hook global and otherwise call standard_*. The hook globals (ExecutorStart_hook …) are defined at the top of execMain.c and initialized to NULL. Verified by reading the four functions on 2026-06-05.
- ExecProcNode is an inline function, not a switch; per-node dispatch is a function pointer set at init. The inline lives in executor.h and calls node->ExecProcNode(node). The pointer is installed by ExecSetExecProcNode (in execProcnode.c) as the first-call wrapper ExecProcNodeFirst, which then rewires itself to the bare per-node function or ExecProcNodeInstr. Verified by reading all three on 2026-06-05. The per-tuple cost is one indirect call after the first invocation.
- End-of-data is signaled by an empty TupleTableSlot, tested with TupIsNull, not by a universal C NULL. ExecutePlan breaks on TupIsNull(slot); TupIsNull (in tuptable.h) is ((slot) == NULL || TTS_EMPTY(slot)). Verified by reading both. Individual nodes may return a NULL pointer or a cleared slot; both satisfy TupIsNull.
- There are exactly four built-in slot backings. TTSOpsVirtual, TTSOpsHeapTuple, TTSOpsMinimalTuple, TTSOpsBufferHeapTuple are declared extern PGDLLIMPORT const in tuptable.h and defined in execTuples.c. Verified by reading both. The tts_ops pointer doubles as the slot’s type tag (the TTS_IS_* macros compare against the vtable address).
- The state tree and the plan tree are isomorphic one-for-one, with the documented Append/MergeAppend pruning exception. ExecInitNode’s nodeTag switch has one ExecInit* per plan node type, and the README states the exception explicitly (“the executor state’s subnode array will become out of sequence to the plan’s subplan list” under run-time pruning). Verified by reading ExecInitNode and the README on 2026-06-05.
- The whole executor invocation lives in one per-query memory context freed in a single operation. CreateExecutorState creates es_query_cxt; standard_ExecutorStart / Run / Finish switch into it; FreeExecutorState (called by standard_ExecutorEnd) destroys it. The README’s Memory Management section states the intent (“we just destroy the memory context”). Verified by reading the entry points; the MemoryContextDelete itself is inside FreeExecutorState in execUtils.c, which was not separately opened — taken on the README’s word plus the switch/free pairing.
- The query snapshot is registered for the run and unregistered at end. standard_ExecutorStart does estate->es_snapshot = RegisterSnapshot(queryDesc->snapshot) and asserts GetActiveSnapshot() == queryDesc->snapshot; standard_ExecutorEnd calls UnregisterSnapshot(estate->es_snapshot). Verified by reading both.
- ExecScanExtended has a branch-eliminated fast path for no-qual-no-projection scans. The if (!qual && !projInfo) early return in execScan.h, combined with the pg_attribute_always_inline attribute, lets the compiler drop the qual/projection branches at each call site. ExecAssignScanProjectionInfo is what leaves ps_ProjInfo NULL when the tlist matches the scan tuple type. Verified by reading execScan.h and ExecAssignScanProjectionInfo in execScan.c.
- ExecScanFetch can compile the EvalPlanQual path away when the caller passes a constant NULL epqstate. The whole EPQ block is under if (epqstate != NULL); with pg_attribute_always_inline the dead branch is eliminated at call sites where epqstate is a compile-time NULL. Through the generic ExecScan wrapper, epqstate is node->ps.state->es_epq_active, loaded at run time and NULL outside an EPQ recheck, so there the block is skipped by an ordinary branch rather than removed. Verified by reading ExecScan and ExecScanFetch.
Open questions
- Where exactly is es_per_tuple_exprcontext reset relative to per-node ExprContexts? ExecutePlan calls ResetPerTupleExprContext(estate) (the top-level context), while ExecScanExtended resets the node’s own ps_ExprContext. The division of labor between the EState-level per-tuple context and each node’s per-tuple context is not fully traced here. Investigation path: read CreateExprContext / GetPerTupleExprContext in execUtils.c and the comment block above ExprContext in execnodes.h.
- What is the exact set of nodes that support MultiExecProcNode, and is it closed? The REL_18 switch lists Hash, BitmapIndexScan, BitmapAnd, BitmapOr. Whether custom-scan or foreign-scan providers can extend this set, or it is hard-closed to those four, was not verified. Investigation path: grep for MultiExec* definitions and check the custom-scan API in nodeCustom.c.
- How does asynchronous execution (ExecAsyncRequest / ExecAppendAsyncEventWait) interleave with the synchronous pull loop? The README documents an async path for Append over async-capable ForeignScan children, but this document covers only the synchronous demand-pull spine. Investigation path: read nodeAppend.c’s async event loop and execAsync.c; likely its own follow-up note.
Beyond PostgreSQL — Comparative Designs & Research Frontiers
Pointers, not analysis. Each bullet is a starting handle for a follow-up doc; depth here is intentionally shallow.
- CUBRID’s XASL tree vs. PostgreSQL’s PlanState tree. CUBRID executes an XASL (eXtended Access Specification Language) tree, also a pull-style operator tree, but the plan and execution state are less cleanly separated than PostgreSQL’s read-only-Plan / mutable-PlanState split. A side-by-side of how each engine achieves plan reuse and parallel re-execution would sharpen what PostgreSQL buys with the strict immutability rule.
- Vectorized / batch-at-a-time execution. The classic critique of the tuple-at-a-time Volcano model is per-tuple interpreter overhead — one next() call and one slot per row. MonetDB/X100 (Boncz, Zukowski & Nes, CIDR 2005, “MonetDB/X100: Hyper-Pipelining Query Execution”) and the column stores process vectors of values per call instead. The C-Store / Vertica lineage (dbms-papers/cstore.md, vertica-7-years.md) is the production embodiment. PostgreSQL stays tuple-at-a-time in core; JIT compilation (postgres-jit.md) is its answer to the interpreter overhead instead of vectorization.
- Push-based / data-centric compiled execution. Neumann’s “Efficiently Compiling Efficient Query Plans for Modern Hardware” (VLDB 2011, the HyPer engine) inverts the iterator: instead of pulling tuples up, it compiles the plan into tight push-based loops that keep a tuple in CPU registers across operator boundaries. This is the producer-driven direction DSC §15.7.2.1 names. A comparison would quantify what PostgreSQL’s pull model costs in L1/branch-prediction terms versus its simplicity and composability.
- The Volcano exchange operator and PostgreSQL parallelism. Graefe’s “Volcano — An Extensible and Parallel Query Evaluation System” (IEEE TKDE 1994) introduced the exchange operator that turns a serial iterator tree into a parallel one without changing the operators. PostgreSQL’s Gather / GatherMerge nodes plus shm_mq are a variant of the same idea — a parallel-aware node forks the sub-plan into background workers and re-merges their streams. The cross-reference is postgres-parallel-query.md; mapping Gather onto the exchange-operator abstraction would tie the implementation back to the theory.
- Morsel-driven parallelism. Leis et al., “Morsel-Driven Parallelism” (SIGMOD 2014) replaces the exchange operator with a work-stealing scheduler over small tuple “morsels,” better at load balancing on many-core machines than PostgreSQL’s per-worker sub-plan split. Relevant if PostgreSQL parallel query is ever measured against modern NUMA hardware.
Sources
In-tree README
- src/backend/executor/README — the authoritative design doc: the demand-pull pipeline model, the Plan/State two-tree split, expression vs. state trees, memory management (per-query and per-tuple contexts), the Query Processing Control Flow sketch, EvalPlanQual, and asynchronous execution.
Textbook chapters (under knowledge/research/dbms-general/)
- Database System Concepts (Silberschatz, Korth & Sudarshan, 7e), Ch. 15 “Query Processing”, §15.7 “Evaluation of Expressions” — §15.7.2 “Pipelining” (materialized vs. pipelined; benefits) and §15.7.2.1 “Implementation of Pipelining” (demand-driven vs. producer-driven; the open() / next() / close() iterator). Ch. 22 §22.5 for the Volcano exchange-operator model in parallel processing.
Papers (under knowledge/research/dbms-papers/)
- Architecture of a Database System (Hellerstein, Stonebraker & Hamilton, FnT 2007) — fntdb07-architecture.md, §1.1 / §4: the relational query processor as a suite of operators, SQL served in a “pull model.”
- (Comparative, not yet captured) Graefe 1994 “Volcano”; Boncz et al. 2005 “MonetDB/X100”; Neumann 2011 “Compiling Efficient Query Plans”; Leis et al. 2014 “Morsel-Driven Parallelism” — see §“Beyond PostgreSQL”.
PostgreSQL source (under /data/hgryoo/references/postgres/, REL_18 273fe94)
- src/backend/executor/execMain.c — lifecycle entry points, InitPlan, ExecutePlan, permissions.
- src/backend/executor/execProcnode.c — ExecInitNode / ExecProcNode wrappers / ExecEndNode dispatch, MultiExecProcNode, ExecShutdownNode, ExecSetTupleBound.
- src/backend/executor/execTuples.c — slot construction and ExecStore* / slot_getsomeattrs; the four TTSOps* vtable definitions.
- src/backend/executor/execScan.c + src/include/executor/execScan.h — the scan fetch/qual/project boilerplate.
- src/include/executor/executor.h — the inline ExecProcNode, expression-eval prototypes.
- src/include/executor/tuptable.h — TupleTableSlot, TupleTableSlotOps, TupIsNull, the TTSOps* declarations.
- src/include/nodes/execnodes.h — PlanState, EState, ExprContext, ScanState, ExecProcNodeMtd.
Cross-references (sibling docs that own adjacent mechanism)
- postgres-planner-overview.md — how the read-only Plan tree is built.
- postgres-node-trees.md — the Node / Plan / PlanState type system.
- postgres-expression-eval.md — ExprState / ExprEvalStep quals and projections, slot_getsomeattrs fetch steps.
- postgres-scan-nodes.md — per-node access methods (SeqNext, etc.) that plug into ExecScan.
- postgres-portals-prepared.md — the portal layer that calls ExecutorStart / Run / End and supplies the snapshot.