PostgreSQL Executor — The Demand-Pull Plan-Node Tree and Tuple Flow

A relational query, once optimized, is a tree of physical operators: scans at the leaves, joins and aggregates and sorts in the interior, a projection or modification at the root. The execution engine’s job is to turn that operator tree into a stream of result tuples. Database System Concepts (Silberschatz, 7e, ch. 15 “Query Processing”) frames the choice of how tuples move between operators as the central design axis, and names two families:

  • Materialized evaluation. Each operator runs to completion, writes its entire output to a temporary relation, and the next operator reads that relation. Simple, but every intermediate result is paid for in full — I/O and latency to first row both scale with the largest intermediate.
  • Pipelined evaluation. Adjacent operators are combined into a pipeline so that one operator’s output tuples flow directly into the next without an intermediate relation (§15.7.2). Two benefits follow: intermediate results are never written to disk, and “the root operator of a query-evaluation plan … can start generating query results quickly” (§15.7.2) — the engine produces the first row before the last input row has even been read.

A pipeline can run in one of two directions, and this is the knob every engine turns (§15.7.2.1):

  1. Demand-driven (pull / lazy). “The system makes repeated requests for tuples from the operation at the top of the pipeline.” Each operator, when asked, computes one output tuple — recursively requesting input tuples from its children as needed — and returns it. Tuples are “pulled up an operation tree from the top,” generated lazily, on demand.
  2. Producer-driven (push / eager). Each operator runs as its own process or thread, eagerly generating output into a buffer until the buffer fills; tuples are “pushed up an operation tree from below.”

The textbook’s verdict is unambiguous: “Demand-driven pipelining is used more commonly than producer-driven pipelining because it is easier to implement” (§15.7.2.1). Its canonical realization is the iterator model, also called the Volcano model after Graefe’s Volcano system, whose exchange operator (DSC §22.5, ch. 22 parallel processing) later generalized it to parallelism. Each operator exposes three functions:

“Each operation in a demand-driven pipeline can be implemented as an iterator that provides the following functions: open(), next(), and close(). After a call to open(), each call to next() returns the next output tuple of the operation. The implementation of the operation in turn calls open() and next() on its inputs, to get its input tuples when required. … The iterator maintains the state of its execution in between calls so that successive next() requests receive successive result tuples.” (DSC §15.7.2.1)

Three properties of the iterator model shape everything downstream and are worth naming before reading any engine’s source:

  1. Uniform interface. Every operator answers the same next() call and returns the same kind of thing (one tuple, or “no more”). A join does not know whether its child is a sequential scan or another join. The tree composes because the interface is uniform.
  2. State lives in the iterator, not the caller. Between next() calls, each operator remembers where it was — the file offset of a scan, the build/probe phase of a hash join. This per-operator execution state is the bulk of what an executor allocates.
  3. Control flow is recursion down, tuples flow up. A single next() on the root drives a depth-first cascade of next() calls to the leaves; the leaves return tuples that bubble back up, each interior operator transforming the stream as it passes.
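The three properties above can be seen in a minimal sketch of the iterator model. This is not PostgreSQL code: the Operator struct, the scan and filter functions, and sum_even_rows are hypothetical names invented for illustration, and a "tuple" is reduced to a single int.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical toy types for illustration: a "tuple" is a single int. */
typedef struct Operator Operator;
struct Operator
{
    void (*open)(Operator *self);
    bool (*next)(Operator *self, int *out);   /* false means "no more" */
    void (*close)(Operator *self);
    Operator *child;     /* input operator, NULL for a leaf */
    const int *rows;     /* scan state: array being read */
    size_t nrows;        /* scan state: number of rows */
    size_t pos;          /* scan state: cursor, kept between next() calls */
};

/* Leaf operator: scan an in-memory array. */
static void scan_open(Operator *self) { self->pos = 0; }
static bool scan_next(Operator *self, int *out)
{
    if (self->pos >= self->nrows)
        return false;
    *out = self->rows[self->pos++];
    return true;
}
static void scan_close(Operator *self) { (void) self; }

/* Interior operator: pass through only even values from its child. */
static void filter_open(Operator *self) { self->child->open(self->child); }
static bool filter_next(Operator *self, int *out)
{
    int v;

    while (self->child->next(self->child, &v))
    {
        if (v % 2 == 0)
        {
            *out = v;
            return true;
        }
    }
    return false;
}
static void filter_close(Operator *self) { self->child->close(self->child); }

/* The driver: repeatedly call next() on the root and sum what comes back. */
int sum_even_rows(const int *rows, size_t nrows)
{
    Operator scan = { scan_open, scan_next, scan_close, NULL, rows, nrows, 0 };
    Operator filter = { filter_open, filter_next, filter_close, &scan, NULL, 0, 0 };
    int v;
    int sum = 0;

    assert(rows != NULL || nrows == 0);
    filter.open(&filter);
    while (filter.next(&filter, &v))
        sum += v;
    filter.close(&filter);
    return sum;
}
```

The filter never inspects what kind of operator its child is (uniform interface), the scan's cursor lives in the operator rather than in the driver (state in the iterator), and each next() on the root recurses down before a value comes back up.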

The Architecture of a Database System survey (Hellerstein, Stonebraker & Hamilton 2007, §1.1, §4) restates the same picture for production engines: the relational query processor is “a suite of operators … for executing any query,” and SQL is served in a “pull model” where the client repeatedly pulls rows. PostgreSQL is a textbook-faithful implementation of the demand-driven iterator model — with the wrinkle that its next() is ExecProcNode, its open()/close() are ExecInitNode/ExecEndNode, and the “tuple” it passes is not a bare row but a TupleTableSlot abstraction. The rest of this document traces those pieces in the REL_18 source.

The textbook gives the model: demand-driven iterators over an operator tree. This section names the engineering conventions that almost every production iterator engine adopts to make that model fast and safe, the patterns the textbook leaves implicit. PostgreSQL’s specific choices, described under “PostgreSQL’s Approach”, are best read as one set of dials within this shared space.

The optimizer’s output — the operator tree with its cost estimates, join conditions, and target lists — is logically read-only at run time: nothing about “where the scan cursor currently sits” belongs in it. Engines therefore keep two parallel trees: an immutable plan tree (the recipe) and a mutable state tree (the running instance), with each state node pointing back at its plan node. The payoff is plan caching and reuse: one cached plan can be executed many times, even concurrently in different sessions or parallel workers, because each execution gets its own fresh state tree and the plan is never mutated.
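A small sketch of the plan/state split, under the assumption of a single filtered scan. ToyScanPlan, ToyScanState, toy_scan_init and toy_scan_next are hypothetical names, not PostgreSQL symbols. The plan is only ever reached through a const pointer, so any number of state instances can share it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical "plan": the read-only recipe, shared by all executions. */
typedef struct ToyScanPlan
{
    const int *rows;
    size_t nrows;
    int min_value;              /* filter: keep rows >= min_value */
} ToyScanPlan;

/* Hypothetical "state": one per execution, pointing back at its plan. */
typedef struct ToyScanState
{
    const ToyScanPlan *plan;    /* never written through */
    size_t cursor;              /* the only thing that changes at run time */
} ToyScanState;

/* Build a fresh state instance for one execution of the plan. */
ToyScanState toy_scan_init(const ToyScanPlan *plan)
{
    ToyScanState state = { plan, 0 };

    assert(plan != NULL);
    return state;
}

/* Return the next qualifying row, advancing only this state's cursor. */
bool toy_scan_next(ToyScanState *state, int *out)
{
    const ToyScanPlan *plan = state->plan;

    while (state->cursor < plan->nrows)
    {
        int v = plan->rows[state->cursor++];

        if (v >= plan->min_value)
        {
            *out = v;
            return true;
        }
    }
    return false;
}
```

Two states initialized from the same plan can be advanced in any interleaving without disturbing each other, which is the property that makes a cached plan reusable.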

A uniform tuple handle, not a uniform tuple format

Operators must compose, but the tuples flowing between them come from wildly different sources: a heap page (with MVCC visibility bits and a buffer pin), an index, a sort’s spill file, an in-memory VALUES list, a join’s freshly-projected combination of two inputs. Forcing all of these into one physical layout would mean copying every tuple into that layout at every boundary. The convention instead is a uniform tuple handle that can wrap any of these backings — exposing a common “give me column i” interface while deferring the physical decode until a column is actually touched. This is the single most important abstraction in a pipelined engine: it lets an operator consume its child’s output without knowing or caring how that output is stored.

A heap tuple on disk is a packed byte string; turning it into an array of typed Datum values (“deforming”) costs CPU proportional to the number of columns. But a query that touches column 2 of a 40-column table should not pay to decode columns 3–40. The convention is lazy, left-prefix deforming: decode columns 1..n only when something asks for column n, cache how far you have decoded, and never redo it. A “virtual” tuple — one that is already an array of Datums, e.g. a join’s projected output — skips deforming entirely.
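Lazy, left-prefix deforming can be sketched with a toy packed format. Everything here is an assumption made for illustration: LazySlot and its functions are hypothetical, and each column is stored as one length byte followed by that many little-endian bytes, which is not PostgreSQL's on-disk layout. Variable-width columns are what force the left-to-right walk.

```c
#include <assert.h>
#include <stddef.h>

#define LAZY_MAX_COLS 8

typedef struct LazySlot
{
    const unsigned char *packed;    /* the stored, undecoded tuple */
    int natts;                      /* number of columns in the tuple */
    int nvalid;                     /* watermark: columns 1..nvalid are decoded */
    size_t off;                     /* byte offset where decoding resumes */
    unsigned values[LAZY_MAX_COLS]; /* decoded column cache */
    int ndecoded;                   /* demo counter: column decodes performed */
} LazySlot;

/* Point the slot at a packed tuple; nothing is decoded yet. */
void lazy_store(LazySlot *slot, const unsigned char *packed, int natts)
{
    assert(natts >= 0 && natts <= LAZY_MAX_COLS);
    slot->packed = packed;
    slot->natts = natts;
    slot->nvalid = 0;
    slot->off = 0;
    slot->ndecoded = 0;
}

/* Decode columns up to attnum (1-based), resuming from the watermark. */
void lazy_getsomeattrs(LazySlot *slot, int attnum)
{
    assert(attnum >= 1 && attnum <= slot->natts);
    while (slot->nvalid < attnum)
    {
        unsigned len = slot->packed[slot->off++];   /* 0..4 bytes */
        unsigned v = 0;

        for (unsigned i = 0; i < len; i++)
            v |= (unsigned) slot->packed[slot->off + i] << (8 * i);
        slot->off += len;
        slot->values[slot->nvalid++] = v;
        slot->ndecoded++;
    }
}

/* Fetch one column, decoding only the missing prefix. */
unsigned lazy_getattr(LazySlot *slot, int attnum)
{
    lazy_getsomeattrs(slot, attnum);
    return slot->values[attnum - 1];
}
```

Asking for column 2 decodes columns 1 and 2; a later request for column 1 is a plain array lookup, and a request for column 4 resumes from the saved byte offset instead of starting over.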

Bounded per-tuple memory via reset-able contexts

A pipeline can process billions of tuples in one query. If each tuple’s transient allocations (the scratch space for evaluating a WHERE clause, a function call’s working storage) accumulated for the life of the query, memory would blow up. The convention is a per-tuple memory arena that is reset (bulk-freed) once per tuple, separate from the per-query arena that holds the state tree and survives until the query ends. Two allocation lifetimes, one cheap bulk-free at each boundary, no retail free() of individual tuples.
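The reset-per-tuple discipline can be sketched with a bump allocator. ToyArena and its functions are hypothetical and much simpler than a real memory-context implementation; the point is only that peak usage depends on one tuple's scratch, not on the number of tuples.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

typedef struct ToyArena
{
    unsigned char *base;
    size_t cap;
    size_t used;
} ToyArena;

bool toy_arena_init(ToyArena *a, size_t cap)
{
    a->base = malloc(cap);
    a->cap = cap;
    a->used = 0;
    return a->base != NULL;
}

/* Bump allocation; returns NULL when the arena is exhausted. */
void *toy_arena_alloc(ToyArena *a, size_t n)
{
    size_t rounded = (n + 7) & ~(size_t) 7;    /* keep 8-byte alignment */
    void *p;

    assert(a->base != NULL);
    if (rounded < n || rounded > a->cap - a->used)
        return NULL;
    p = a->base + a->used;
    a->used += rounded;
    return p;
}

/* Bulk-free: O(1), no per-allocation bookkeeping. */
void toy_arena_reset(ToyArena *a) { a->used = 0; }

void toy_arena_destroy(ToyArena *a)
{
    free(a->base);
    a->base = NULL;
    a->cap = a->used = 0;
}

/*
 * Process ntuples, using `scratch` bytes of transient memory for each.
 * Returns the peak number of bytes in use, or 0 if an allocation failed.
 */
size_t toy_process(ToyArena *per_tuple, size_t ntuples, size_t scratch)
{
    size_t peak = 0;

    for (size_t i = 0; i < ntuples; i++)
    {
        void *p;

        toy_arena_reset(per_tuple);             /* free last tuple's scratch */
        p = toy_arena_alloc(per_tuple, scratch);
        if (p == NULL)
            return 0;
        memset(p, 0, scratch);                  /* stand-in for qual evaluation */
        if (per_tuple->used > peak)
            peak = per_tuple->used;
    }
    return peak;
}
```

With a 1024-byte arena and 100 bytes of scratch per tuple, the peak stays at one rounded allocation no matter how many tuples pass through; without the reset, the same arena would be exhausted after a handful of tuples.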

One snapshot, registered for the query’s life

A pipelined read must see a consistent set of rows from start to finish. The engine acquires an MVCC snapshot before execution and keeps it registered (pinned, so vacuum cannot reclaim versions it can still see) for the entire ExecutorStart → ExecutorEnd span, releasing it only at teardown.

By the time you reach a named PostgreSQL symbol in the next section, you should already know what kind of thing it is:

| Theory / convention | PostgreSQL name |
| --- | --- |
| Operator open() | ExecInitNode (dispatch) → per-node ExecInit* |
| Operator next() | ExecProcNode (inline) → node->ExecProcNode → per-node Exec* |
| Operator close() | ExecEndNode (dispatch) → per-node ExecEnd* |
| Read-only plan tree | Plan and subtypes (from the planner) |
| Mutable execution-state tree | PlanState and subtypes; lefttree / righttree links |
| Per-execution global state | EState (one per executor invocation) |
| Uniform tuple handle | TupleTableSlot + TupleTableSlotOps vtable |
| Tuple-handle backings | TTSOpsVirtual / TTSOpsHeapTuple / TTSOpsMinimalTuple / TTSOpsBufferHeapTuple |
| Lazy column decode | slot_getsomeattrs + tts_nvalid watermark |
| Per-query memory arena | EState.es_query_cxt |
| Per-tuple memory arena | ExprContext.ecxt_per_tuple_memory, reset by ResetExprContext |
| Registered query snapshot | EState.es_snapshot via RegisterSnapshot |
| Scan boilerplate (qual + project loop) | ExecScan / ExecScanExtended in execScan.[ch] |
| “No more tuples” sentinel | empty TupleTableSlot (TupIsNull), not a C NULL everywhere |

The planner side of this boundary — how the Plan tree is built and why it is read-only — is owned by postgres-planner-overview.md and postgres-node-trees.md. Expression evaluation (the ExprState / ExprEvalStep machinery that quals and projections compile to) is owned by postgres-expression-eval.md. The individual leaf scan and join node implementations are owned by postgres-scan-nodes.md. This document covers the framework: the four lifecycle entry points, the dispatch layer, the slot abstraction, and the scan boilerplate every leaf shares.

PostgreSQL’s executor is the demand-driven iterator model rendered almost exactly as the textbook describes it, with four design decisions that give it its shape:

  1. Four lifecycle entry points, not one. ExecutorStart / ExecutorRun / ExecutorFinish / ExecutorEnd separate “build the machine,” “pull tuples,” “drain side-effects and fire AFTER triggers,” and “tear down.” The split exists so that EXPLAIN ANALYZE can time the trigger-firing phase, and so a portal can call ExecutorRun repeatedly for cursor FETCH.
  2. A PlanState tree mirroring the Plan tree one node for one node, built top-down by ExecInitNode recursing the plan. The plan stays read-only; all run-time mutation lands in the state tree.
  3. ExecProcNode is the next() call, dispatched through a function pointer stored on each PlanState (node->ExecProcNode), not a giant switch per tuple. The choice of per-node routine is resolved once at init; each tuple then costs a single indirect call.
  4. TupleTableSlot is the unit of dataflow, with a TupleTableSlotOps vtable selecting the physical backing. Every Exec* function returns a slot; an empty slot means end-of-data.

Each entry point is a thin hook-dispatcher wrapping a standard_* implementation, so extensions (pg_stat_statements, auditing plugins) can intercept the whole executor by setting ExecutorStart_hook and friends — this is the executor’s slice of the engine’s hook-based extensibility surface.

// ExecutorStart / ExecutorRun — src/backend/executor/execMain.c
void
ExecutorStart(QueryDesc *queryDesc, int eflags)
{
    pgstat_report_query_id(queryDesc->plannedstmt->queryId, false);
    if (ExecutorStart_hook)
        (*ExecutorStart_hook) (queryDesc, eflags);
    else
        standard_ExecutorStart(queryDesc, eflags);
}

void
ExecutorRun(QueryDesc *queryDesc, ScanDirection direction, uint64 count)
{
    if (ExecutorRun_hook)
        (*ExecutorRun_hook) (queryDesc, direction, count);
    else
        standard_ExecutorRun(queryDesc, direction, count);
}
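
The hook-dispatcher shape can be modeled outside PostgreSQL in a few lines. This is a toy sketch: toy_run, toy_run_hook, toy_standard_run and the toy_ext_* names are hypothetical stand-ins, and the "work" is just incrementing a counter. The extension side follows the common convention of saving the previous hook value and chaining to it.

```c
#include <assert.h>
#include <stddef.h>

/* Core side: a hook pointer, a standard implementation, a thin dispatcher. */
typedef void (*toy_run_hook_type)(int *rows_out);
toy_run_hook_type toy_run_hook = NULL;

void toy_standard_run(int *rows_out)
{
    *rows_out += 1;             /* stand-in for the real work */
}

void toy_run(int *rows_out)
{
    assert(rows_out != NULL);
    if (toy_run_hook)
        (*toy_run_hook)(rows_out);
    else
        toy_standard_run(rows_out);
}

/* Extension side: wrap the call, then chain to whoever was there before. */
static toy_run_hook_type prev_toy_run_hook = NULL;
int toy_ext_calls = 0;

static void toy_ext_run(int *rows_out)
{
    toy_ext_calls++;            /* e.g. collect statistics */
    if (prev_toy_run_hook)
        prev_toy_run_hook(rows_out);
    else
        toy_standard_run(rows_out);
}

/* Called once when the hypothetical extension is loaded. */
void toy_ext_load(void)
{
    prev_toy_run_hook = toy_run_hook;
    toy_run_hook = toy_ext_run;
}
```

Before toy_ext_load() the dispatcher goes straight to the standard implementation; afterwards every call passes through the extension first, and the standard work still happens because the extension chains onward.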

The README’s control-flow sketch is the canonical map of how the four phases nest:

flowchart TB
  subgraph START["ExecutorStart"]
    CES["CreateExecutorState<br/>(per-query context)"]
    SW1["switch into es_query_cxt"]
    INIT["InitPlan -> ExecInitNode<br/>(recursively builds PlanState tree)"]
    CES --> SW1 --> INIT
  end

  subgraph RUN["ExecutorRun"]
    EP["ExecutePlan loop:<br/>ExecProcNode(root) until empty slot"]
    DEST["dest->receiveSlot per tuple"]
    EP --> DEST
  end

  subgraph FIN["ExecutorFinish"]
    PPP["ExecPostprocessPlan<br/>(run unfinished ModifyTable nodes)"]
    AT["AfterTriggerEndQuery<br/>(fire AFTER triggers)"]
    PPP --> AT
  end

  subgraph END["ExecutorEnd"]
    EEP["ExecEndNode<br/>(recursively release resources)"]
    UNR["UnregisterSnapshot"]
    FREE["FreeExecutorState<br/>(destroys per-query context)"]
    EEP --> UNR --> FREE
  end

  START --> RUN --> FIN --> END

Figure 1 — The four executor lifecycle phases and what each does. ExecutorStart builds the state tree in the per-query context; ExecutorRun pulls tuples in a loop; ExecutorFinish drains side-effects and fires AFTER triggers (separated so EXPLAIN ANALYZE can time it); ExecutorEnd releases resources and frees the whole per-query context in one operation. (Source: executor/README “Query Processing Control Flow”.)

standard_ExecutorStart is where the per-query EState is born, the query snapshot is registered, and the state tree is built. The key middle section:

// standard_ExecutorStart — src/backend/executor/execMain.c (condensed)
estate = CreateExecutorState();
queryDesc->estate = estate;
oldcontext = MemoryContextSwitchTo(estate->es_query_cxt);
estate->es_param_list_info = queryDesc->params;
/* ... allocate es_param_exec_vals for internal params ... */
estate->es_sourceText = queryDesc->sourceText;
estate->es_queryEnv = queryDesc->queryEnv;
/* set es_output_cid for non-read-only ops (or SELECT FOR UPDATE) */
estate->es_snapshot = RegisterSnapshot(queryDesc->snapshot);
estate->es_crosscheck_snapshot = RegisterSnapshot(queryDesc->crosscheck_snapshot);
estate->es_top_eflags = eflags;
estate->es_instrument = queryDesc->instrument_options;
estate->es_jit_flags = queryDesc->plannedstmt->jitFlags;
if (!(eflags & (EXEC_FLAG_SKIP_TRIGGERS | EXEC_FLAG_EXPLAIN_ONLY)))
    AfterTriggerBeginQuery();
InitPlan(queryDesc, eflags); /* builds the PlanState tree */
MemoryContextSwitchTo(oldcontext);

Two things to notice. First, the snapshot passed by the caller (the portal / pquery.c layer, owned by postgres-portals-prepared.md) is registered here and unregistered only in standard_ExecutorEnd, so the read sees one consistent set of versions for the whole run. Second, everything after the MemoryContextSwitchTo call runs with es_query_cxt as the current memory context, so the entire state tree and all the slots and expression states it allocates land in that one context.

InitPlan does the permissions check, sets up the range table, runs initial partition pruning, builds the row marks, then calls ExecInitNode on the plan root to build the state tree (covered in the next subsection) and finally derives the result tuple descriptor and junk filter:

// InitPlan — src/backend/executor/execMain.c (condensed)
ExecCheckPermissions(rangeTable, plannedstmt->permInfos, true);
ExecInitRangeTable(estate, rangeTable, plannedstmt->permInfos, ...);
estate->es_plannedstmt = plannedstmt;
ExecDoInitialPruning(estate);
/* ... build es_rowmarks from plannedstmt->rowMarks ... */
estate->es_tupleTable = NIL;
estate->es_epq_active = NULL;
/* init each SubPlan's state first ... */
planstate = ExecInitNode(plan, estate, eflags);
tupType = ExecGetResultType(planstate);
/* ... build junk filter if the top tlist has junk attrs ... */

standard_ExecutorRun sets up the destination receiver, then calls ExecutePlan (the pull loop) unless the fetch direction is “no movement”:

// standard_ExecutorRun — src/backend/executor/execMain.c (condensed)
oldcontext = MemoryContextSwitchTo(estate->es_query_cxt);
operation = queryDesc->operation;
dest = queryDesc->dest;
estate->es_processed = 0;
sendTuples = (operation == CMD_SELECT || queryDesc->plannedstmt->hasReturning);
if (sendTuples)
    dest->rStartup(dest, operation, queryDesc->tupDesc);
if (!ScanDirectionIsNoMovement(direction))
    ExecutePlan(queryDesc, operation, sendTuples, count, direction, dest);
estate->es_total_processed += estate->es_processed;
if (sendTuples)
    dest->rShutdown(dest);
MemoryContextSwitchTo(oldcontext);

ExecutorFinish runs any ModifyTable nodes that have not yet run to completion (ExecPostprocessPlan) and fires queued AFTER triggers; it is deliberately separate from ExecutorEnd so that EXPLAIN ANALYZE can include trigger time in the reported total. ExecutorEnd calls ExecEndNode to recurse the state tree releasing relations and buffer pins, unregisters the two snapshots, and calls FreeExecutorState — which destroys es_query_cxt and with it everything the executor allocated.

The README states the core invariant directly: “During executor startup we build a parallel tree of identical structure containing executor state nodes — generally, every plan node type has a corresponding executor state node type. Each node in the state tree has a pointer to its corresponding node in the plan tree … This arrangement allows the plan tree to be completely read-only so far as the executor is concerned.”

PlanState is the abstract base every state node embeds as its first field. Its run-time-relevant members:

// PlanState (abridged) — src/include/nodes/execnodes.h
typedef struct PlanState
{
    pg_node_attr(abstract)
    NodeTag type;
    Plan *plan; /* the read-only plan node */
    EState *state; /* shared per-query state */
    ExecProcNodeMtd ExecProcNode; /* the next() function (maybe wrapped) */
    ExecProcNodeMtd ExecProcNodeReal; /* the real next() if above is a wrapper */
    Instrumentation *instrument; /* optional EXPLAIN ANALYZE stats */
    ExprState *qual; /* compiled WHERE / filter, or NULL */
    struct PlanState *lefttree; /* input subplan(s): the "children" */
    struct PlanState *righttree;
    List *initPlan; /* init SubPlanState nodes (uncorrelated subselects) */
    List *subPlan; /* SubPlanState nodes in this node's expressions */
    Bitmapset *chgParam; /* set of changed Param IDs -> triggers rescan */
    TupleDesc ps_ResultTupleDesc; /* this node's output row type */
    TupleTableSlot *ps_ResultTupleSlot; /* slot this node returns */
    ExprContext *ps_ExprContext; /* per-tuple memory + expr scratch */
    ProjectionInfo *ps_ProjInfo; /* compiled target list, or NULL */
    /* ... slot-type hint fields (scanops/outerops/innerops/resultops) ... */
} PlanState;

lefttree and righttree are the iterator-tree edges: a join’s two inputs are its left and right subtrees; a single-input node (sort, aggregate, limit) uses lefttree as its one child. Some nodes, such as Append and SubqueryScan, keep their children in node-specific fields instead. The convenience macros outerPlanState(node) / innerPlanState(node) read lefttree and righttree. qual and ps_ProjInfo are the compiled WHERE filter and target list, pre-compiled at init by the expression machinery so the per-tuple path is a tight interpreter loop rather than a tree walk.

flowchart TB
  subgraph PLAN["Plan tree (read-only, from planner)"]
    P0["NestLoop"]
    P1["SeqScan dept"]
    P2["SeqScan emp"]
    P0 --> P1
    P0 --> P2
  end

  subgraph STATE["PlanState tree (mutable, built by ExecInitNode)"]
    S0["NestLoopState<br/>ExecProcNode=ExecNestLoop"]
    S1["SeqScanState<br/>ExecProcNode=ExecSeqScan<br/>ss_currentScanDesc (cursor)"]
    S2["SeqScanState<br/>cursor + ss_ScanTupleSlot"]
    S0 -. "lefttree" .-> S1
    S0 -. "righttree" .-> S2
  end

  S0 -. "->plan" .-> P0
  S1 -. "->plan" .-> P1
  S2 -. "->plan" .-> P2

Figure 2 — The two parallel trees. The Plan tree is the planner’s read-only recipe; ExecInitNode builds an isomorphic PlanState tree whose nodes carry the run-time cursor (ss_currentScanDesc), the per-node ExecProcNode function pointer, and a back-pointer to the plan node. Read-only plans are what make plan caching and parallel re-execution safe.

The README notes one wrinkle: a state node may be omitted when run-time partition pruning determines a subplan can produce no rows — currently only under Append / MergeAppend — so the state subnode array can fall out of lockstep with the plan’s subplan list. The one-for-one mapping is the rule; pruning is the documented exception.

ExecInitNode is the recursive tree builder. It is a switch on the plan node’s tag that calls the right ExecInit* constructor, which in turn recurses into its own children. The structure:

// ExecInitNode — src/backend/executor/execProcnode.c (condensed)
PlanState *
ExecInitNode(Plan *node, EState *estate, int eflags)
{
    PlanState *result;

    if (node == NULL) /* leaf of recursion */
        return NULL;
    check_stack_depth();
    switch (nodeTag(node))
    {
        case T_SeqScan:
            result = (PlanState *) ExecInitSeqScan((SeqScan *) node, estate, eflags);
            break;
        case T_NestLoop:
            result = (PlanState *) ExecInitNestLoop((NestLoop *) node, estate, eflags);
            break;
        case T_Agg:
            result = (PlanState *) ExecInitAgg((Agg *) node, estate, eflags);
            break;
        /* ... ~40 more node types: control, scan, join, materialization ... */
        default:
            elog(ERROR, "unrecognized node type: %d", (int) nodeTag(node));
            result = NULL;
            break;
    }
    ExecSetExecProcNode(result, result->ExecProcNode); /* install wrapper */
    /* init this node's initPlans; set up instrumentation if requested */
    return result;
}

Each ExecInit* constructor calls ExecInitNode on its own children, so one call on the root depth-first-builds the whole tree. The nodeTag switch is grouped by family in the source — control nodes (Result, ProjectSet, ModifyTable, Append, …), scan nodes (SeqScan, IndexScan, BitmapHeapScan, ForeignScan, …), join nodes (NestLoop, MergeJoin, HashJoin), and materialization nodes (Material, Sort, Agg, Gather, …) — the same grouping mirrored in ExecEndNode.

ExecProcNode — the next() call and its wrappers

ExecProcNode is the demand-pull next(). It is a tiny inline function in executor.h, not a switch: each PlanState already carries a function pointer to its own per-node routine, set at init.

// ExecProcNode (inline) — src/include/executor/executor.h
static inline TupleTableSlot *
ExecProcNode(PlanState *node)
{
    if (node->chgParam != NULL) /* a Param changed? */
        ExecReScan(node); /* reset this subtree before pulling */
    return node->ExecProcNode(node);
}

The indirection through node->ExecProcNode is what makes the tree polymorphic: a join calls ExecProcNode(outerPlanState(node)) without knowing or branching on what the child is. But the pointer is not set directly to the per-node function — ExecSetExecProcNode installs a wrapper, ExecProcNodeFirst, that runs one-time checks on the first call and then rewires the pointer to either the bare function or an instrumentation wrapper:

// ExecSetExecProcNode + ExecProcNodeFirst — src/backend/executor/execProcnode.c
void
ExecSetExecProcNode(PlanState *node, ExecProcNodeMtd function)
{
    node->ExecProcNodeReal = function;
    node->ExecProcNode = ExecProcNodeFirst; /* wrapper for first call */
}

static TupleTableSlot *
ExecProcNodeFirst(PlanState *node)
{
    check_stack_depth(); /* once, on first execution */
    if (node->instrument)
        node->ExecProcNode = ExecProcNodeInstr; /* EXPLAIN ANALYZE path */
    else
        node->ExecProcNode = node->ExecProcNodeReal; /* fast path henceforth */
    return node->ExecProcNode(node);
}

This is a deliberate optimization: the expensive stack-depth check and the instrumentation branch are paid once per node per query, not once per tuple. After the first call, a non-instrumented node’s ExecProcNode points straight at ExecSeqScan (or whatever), so the hot path is a single indirect call with no wrapper overhead.
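The self-rewiring pointer is easy to reproduce in isolation. The sketch below is a toy model with hypothetical names (ToyNode, toy_first, toy_instr, toy_real); it has no stack-depth check and its "instrumentation" is a counter, but the pointer rewiring follows the same shape as the excerpt above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct ToyNode ToyNode;
typedef int (*ToyProcMtd)(ToyNode *node);

struct ToyNode
{
    ToyProcMtd proc;        /* what callers invoke (may be a wrapper) */
    ToyProcMtd proc_real;   /* the node's real next() */
    bool instrument;        /* request the instrumented path? */
    int first_calls;        /* how often the first-call wrapper ran */
    int instr_calls;        /* how often the instrumentation wrapper ran */
    int counter;            /* stand-in for the node's output */
};

/* The node's real work: return 0, 1, 2, ... on successive calls. */
int toy_real(ToyNode *node) { return node->counter++; }

/* Instrumentation wrapper: count the call, then do the real work. */
int toy_instr(ToyNode *node)
{
    node->instr_calls++;
    return node->proc_real(node);
}

/* First-call wrapper: do one-time work, then take itself out of the path. */
int toy_first(ToyNode *node)
{
    node->first_calls++;
    node->proc = node->instrument ? toy_instr : node->proc_real;
    return node->proc(node);
}

/* Install the real function behind the first-call wrapper. */
void toy_set_proc(ToyNode *node, ToyProcMtd fn)
{
    assert(fn != NULL);
    node->proc_real = fn;
    node->proc = toy_first;
}

/* The caller's view: always one indirect call. */
int toy_proc_node(ToyNode *node) { return node->proc(node); }
```

After the first toy_proc_node() call, a non-instrumented node's proc pointer equals toy_real, so later calls never touch the wrapper; an instrumented node settles on toy_instr instead.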

sequenceDiagram
    participant EP as ExecutePlan
    participant NL as NestLoopState
    participant OUT as outer SeqScanState
    participant IN as inner SeqScanState

    EP->>NL: ExecProcNode(root)
    NL->>OUT: ExecProcNode(outer)
    OUT-->>NL: slot (dept row) or empty
    NL->>IN: ExecProcNode(inner)
    IN-->>NL: slot (emp row) or empty
    note over NL: join match? project combined row
    NL-->>EP: result slot (one joined row)
    note over EP: send to dest, loop again
    EP->>NL: ExecProcNode(root)
    note over NL,IN: inner exhausted -> advance outer, rescan inner
    EP->>NL: ExecProcNode(root)
    NL-->>EP: empty slot (TupIsNull) -> done

Figure 3 — Demand-pull recursion for a two-table nested-loop join. One ExecProcNode on the root cascades down to the leaf scans; tuples bubble back up; the join transforms the stream. ExecutePlan keeps calling the root until it returns an empty slot. This is the iterator model’s “control flows down, tuples flow up” property in PostgreSQL form.

ExecutePlan is the literal realization of “repeatedly call next() on the top of the pipeline.” It loops on ExecProcNode(planstate), stops when the slot is empty, optionally strips junk columns, ships the tuple to the destination, and honors a tuple-count limit (for cursor FETCH n):

// ExecutePlan — src/backend/executor/execMain.c (condensed)
for (;;)
{
    ResetPerTupleExprContext(estate); /* free last tuple's scratch memory */
    slot = ExecProcNode(planstate); /* the demand-pull next() */
    if (TupIsNull(slot)) /* empty slot == end of data */
        break;
    if (estate->es_junkFilter != NULL)
        slot = ExecFilterJunk(estate->es_junkFilter, slot);
    if (sendTuples)
    {
        if (!dest->receiveSlot(slot, dest)) /* client closed? */
            break;
    }
    if (operation == CMD_SELECT)
        (estate->es_processed)++;
    current_tuple_count++;
    if (numberTuples && numberTuples == current_tuple_count)
        break; /* honored FETCH count limit */
}
if (!(estate->es_top_eflags & EXEC_FLAG_BACKWARD))
    ExecShutdownNode(planstate); /* early resource release */

Three details earn their place. (1) ResetPerTupleExprContext at the top of every iteration is the per-tuple memory-bounding discipline from §“Common DBMS Design”: the previous tuple’s transient allocations are bulk-freed before this tuple starts. (2) The empty-slot test TupIsNull(slot) is the “no more tuples” sentinel: PostgreSQL signals end-of-data with an empty slot, not always a C NULL. (3) The loop increments es_processed for SELECT only; for INSERT/UPDATE/DELETE the ModifyTable node counts its own modified rows. current_tuple_count, by contrast, advances for every tuple the root returns, which is why numberTuples (the cursor limit) is documented to apply only to retrieved tuples.
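
The count-limited pull loop, and the way it resumes across calls, can be modeled with a toy source. ToySource and toy_execute_plan are hypothetical names; the sketch assumes a source whose cursor persists between calls, which is what lets a second call continue where the first stopped (the cursor FETCH pattern). As in the loop above, a count of 0 means "no limit".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct ToySource
{
    const int *rows;
    size_t nrows;
    size_t pos;         /* survives between toy_execute_plan() calls */
} ToySource;

/* The toy next(): false is the end-of-data sentinel. */
static bool toy_next(ToySource *src, int *out)
{
    if (src->pos >= src->nrows)
        return false;
    *out = src->rows[src->pos++];
    return true;
}

/*
 * Pull up to number_tuples rows from the root (0 = no limit) into out[].
 * Returns how many rows were delivered.
 */
size_t toy_execute_plan(ToySource *root, size_t number_tuples,
                        int *out, size_t out_cap)
{
    size_t current_tuple_count = 0;

    assert(root != NULL && out != NULL);
    for (;;)
    {
        int v;

        if (current_tuple_count == out_cap)
            break;              /* receiver full: stop before pulling */
        if (!toy_next(root, &v))
            break;              /* end of data */
        out[current_tuple_count++] = v;
        if (number_tuples && number_tuples == current_tuple_count)
            break;              /* honored the count limit */
    }
    return current_tuple_count;
}
```

Calling toy_execute_plan with a limit of 2 twice and then with 0 on a five-row source yields two rows, two rows, then the last one; a further call returns 0 because the source is exhausted.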

Every Exec* returns a TupleTableSlot *. The slot is a handle, not a tuple: it holds a pointer to a TupleTableSlotOps vtable plus a cache of deformed column values, and the vtable decides what physical tuple (if any) sits underneath.

// TupleTableSlot — src/include/executor/tuptable.h
typedef struct TupleTableSlot
{
    NodeTag type;
    uint16 tts_flags; /* TTS_FLAG_EMPTY, _SHOULDFREE, _FIXED, _SLOW */
    AttrNumber tts_nvalid; /* # of leading columns already deformed */
    const TupleTableSlotOps *const tts_ops; /* the vtable (== the slot "type") */
    TupleDesc tts_tupleDescriptor; /* row shape */
    Datum *tts_values; /* deformed column values (cache) */
    bool *tts_isnull; /* per-column null flags */
    MemoryContext tts_mcxt;
    ItemPointerData tts_tid; /* TID of the stored tuple, if any */
    Oid tts_tableOid;
} TupleTableSlot;

The four built-in slot types are four const TupleTableSlotOps vtables, each filling in init / clear / getsomeattrs / materialize / copyslot / get_heap_tuple / etc. for one backing:

// the four slot vtables — src/include/executor/tuptable.h
extern PGDLLIMPORT const TupleTableSlotOps TTSOpsVirtual;
extern PGDLLIMPORT const TupleTableSlotOps TTSOpsHeapTuple;
extern PGDLLIMPORT const TupleTableSlotOps TTSOpsMinimalTuple;
extern PGDLLIMPORT const TupleTableSlotOps TTSOpsBufferHeapTuple;

| Slot type | Backing | Typical producer |
| --- | --- | --- |
| TTSOpsVirtual | none: tts_values/tts_isnull are the tuple | projections, VALUES, computed join output |
| TTSOpsHeapTuple | a palloc’d HeapTuple the slot owns | tuples formed in memory (e.g. by heap_form_tuple) |
| TTSOpsMinimalTuple | a MinimalTuple (no transaction/visibility header fields) the slot owns | sort output, hash tables, tuplestores (including FunctionScan results) |
| TTSOpsBufferHeapTuple | a HeapTuple pointing into a pinned shared buffer | SeqScan / IndexScan reading heap pages |

The BufferHeapTuple slot is the one that ties the executor to the storage layer: it does not copy the row out of the buffer pool, it pins the buffer and points into it, deferring both the copy and the pin-release. The MinimalTuple slot is the one that crosses the parallel-query shm_mq boundary, because a minimal tuple has no transaction header to carry. The Virtual slot is the cheapest and the one a projection produces — it never “deforms” because its columns are its tts_values array.

flowchart TB
  CONS["any consumer node<br/>(join, agg, sort, dest)"]
  CONS -->|"slot_getattr / slot_getsomeattrs"| SLOT
  subgraph SLOT["TupleTableSlot (uniform handle)"]
    VT["tts_ops -> vtable<br/>(getsomeattrs / materialize / copyslot)"]
    CACHE["tts_values[] + tts_isnull[]<br/>tts_nvalid = deformed watermark"]
  end
  VT --> V["TTSOpsVirtual<br/>(values ARE the tuple)"]
  VT --> H["TTSOpsHeapTuple<br/>(owned HeapTuple)"]
  VT --> M["TTSOpsMinimalTuple<br/>(owned MinimalTuple)"]
  VT --> B["TTSOpsBufferHeapTuple<br/>(HeapTuple in pinned buffer)"]
  B -.-> BUF[("shared buffer pool")]

Figure 4 — The slot abstraction. A consumer asks the slot for column values through one interface; the tts_ops vtable dispatches to the backing’s deform/materialize routines. The BufferHeapTuple backing points into a pinned shared buffer rather than copying — the seam between the executor and the buffer manager.
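
The dispatch in Figure 4 can be sketched with two toy backings behind one handle. All names are hypothetical (ToySlot, ToySlotOps, ToyOpsVirtual, ToyOpsPacked), columns are plain ints, and the "packed" backing stores one byte per column; the real vtable has many more callbacks.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_NATTS 3

typedef struct ToySlot ToySlot;

/* The vtable: one entry here, selected per backing. */
typedef struct ToySlotOps
{
    const char *name;
    void (*getsomeattrs)(ToySlot *slot, int natts);
} ToySlotOps;

struct ToySlot
{
    const ToySlotOps *ops;          /* which backing this slot uses */
    int nvalid;                     /* decoded-column watermark */
    int values[TOY_NATTS];          /* decoded column cache */
    const unsigned char *packed;    /* used only by the packed backing */
};

/* Virtual backing: the values array is the tuple, nothing to decode. */
static void toy_virtual_getsomeattrs(ToySlot *slot, int natts)
{
    (void) slot;
    (void) natts;
    assert(slot->nvalid == TOY_NATTS);
}

/* Packed backing: decode one byte per column up to the requested one. */
static void toy_packed_getsomeattrs(ToySlot *slot, int natts)
{
    while (slot->nvalid < natts)
    {
        slot->values[slot->nvalid] = slot->packed[slot->nvalid];
        slot->nvalid++;
    }
}

const ToySlotOps ToyOpsVirtual = { "virtual", toy_virtual_getsomeattrs };
const ToySlotOps ToyOpsPacked = { "packed", toy_packed_getsomeattrs };

/* The uniform accessor: consumers call this and nothing else. */
int toy_slot_getattr(ToySlot *slot, int attnum)
{
    assert(attnum >= 1 && attnum <= TOY_NATTS);
    if (slot->nvalid < attnum)
        slot->ops->getsomeattrs(slot, attnum);
    return slot->values[attnum - 1];
}

/* A consumer written once against the handle, not against a backing. */
int toy_sum_first_two(ToySlot *slot)
{
    return toy_slot_getattr(slot, 1) + toy_slot_getattr(slot, 2);
}
```

toy_sum_first_two works unchanged on both backings: the virtual slot answers from its array immediately, while the packed slot decodes exactly two columns on demand and leaves the third untouched.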

A slot is created by MakeTupleTableSlot, which co-allocates the tts_values/tts_isnull arrays with the slot when the descriptor is known and calls the vtable’s init:

// MakeTupleTableSlot — src/backend/executor/execTuples.c (condensed)
basesz = tts_ops->base_slot_size;
if (tupleDesc)
    allocsz = MAXALIGN(basesz)
        + MAXALIGN(tupleDesc->natts * sizeof(Datum))
        + MAXALIGN(tupleDesc->natts * sizeof(bool));
else
    allocsz = basesz;
slot = palloc0(allocsz);
*((const TupleTableSlotOps **) &slot->tts_ops) = tts_ops; /* set the vtable */
slot->type = T_TupleTableSlot;
slot->tts_flags |= TTS_FLAG_EMPTY; /* born empty */
slot->tts_tupleDescriptor = tupleDesc;
/* ... point tts_values/tts_isnull into the co-allocation ... */
slot->tts_ops->init(slot); /* backing-specific init */

A freshly made slot is empty (TTS_FLAG_EMPTY), and TupIsNull treats both a C NULL pointer and an empty slot as “no tuple”:

// TupIsNull — src/include/executor/tuptable.h
#define TupIsNull(slot) \
    ((slot) == NULL || TTS_EMPTY(slot))

To put a tuple into a slot, the executor uses one of the ExecStore* routines matched to the slot type — ExecStoreHeapTuple, ExecStoreMinimalTuple, ExecStoreBufferHeapTuple, or, for a virtual slot whose tts_values were filled directly, ExecStoreVirtualTuple, which simply clears the empty flag and declares all columns valid:

// ExecStoreVirtualTuple — src/backend/executor/execTuples.c
TupleTableSlot *
ExecStoreVirtualTuple(TupleTableSlot *slot)
{
    Assert(TTS_EMPTY(slot));
    slot->tts_flags &= ~TTS_FLAG_EMPTY;
    slot->tts_nvalid = slot->tts_tupleDescriptor->natts; /* all valid */
    return slot;
}

tts_nvalid is the deform watermark: it records how many leading columns of the current tuple have been decoded into tts_values. When something asks for column n, slot_getsomeattrs deforms only up to n (if not already done) by calling the vtable’s getsomeattrs:

// slot_getsomeattrs (inline) — src/include/executor/tuptable.h, calls execTuples.c
static inline void
slot_getsomeattrs(TupleTableSlot *slot, int attnum)
{
    if (slot->tts_nvalid < attnum)
        slot_getsomeattrs_int(slot, attnum); /* deform the gap, advance nvalid */
}

A query selecting dept (column 1) from a 12-column row never decodes columns 2–12. The expression machinery exploits this: the README explains that an ExprState’s step array “begins with EEOP_*_FETCHSOME steps that ensure that the relevant tuples have been deconstructed to make the required columns directly available (cf. slot_getsomeattrs()),” so each Var-fetch is “little more than an array lookup.” The mechanics of those steps live in postgres-expression-eval.md.

EState is the per-execution global — one per executor invocation, shared by every node in the tree (each PlanState.state points at it). Its run-time-relevant fields:

// EState (abridged) — src/include/nodes/execnodes.h
typedef struct EState
{
    NodeTag type;
    ScanDirection es_direction; /* forward / backward / no-movement */
    Snapshot es_snapshot; /* the registered query snapshot */
    Snapshot es_crosscheck_snapshot;
    List *es_range_table; /* RangeTblEntry list */
    PlannedStmt *es_plannedstmt;
    JunkFilter *es_junkFilter;
    CommandId es_output_cid; /* CID to stamp inserted/deleted tuples */
    ResultRelInfo **es_result_relations; /* DML target tables */
    ParamListInfo es_param_list_info; /* external params */
    ParamExecData *es_param_exec_vals; /* internal params (subplans) */
    MemoryContext es_query_cxt; /* PER-QUERY arena: holds everything */
    List *es_tupleTable; /* all the TupleTableSlots */
    uint64 es_processed; /* tuples processed this ExecutorRun */
    int es_top_eflags;
    List *es_subplanstates;
    ExprContext *es_per_tuple_exprcontext; /* PER-OUTPUT-TUPLE arena */
    struct EPQState *es_epq_active; /* non-NULL inside an EvalPlanQual recheck */
    bool es_use_parallel_mode;
    /* ... parallel-worker counters, DSA area, etc. ... */
} EState;

The two memory lifetimes the README spells out:

  • Per-query context (es_query_cxt): created by CreateExecutorState, it holds the PlanState tree, the ExprState trees, the slots, everything. Teardown is a single MemoryContextDelete inside FreeExecutorState — no retail pfree, “rather than messing with retail pfree’s and probable storage leaks, we just destroy the memory context.”
  • Per-tuple context (each ExprContext.ecxt_per_tuple_memory, plus the top-level es_per_tuple_exprcontext): scratch space for evaluating one tuple’s quals and projections, reset once per tuple. ResetExprContext / ResetPerTupleExprContext are the bulk-free.
flowchart TB
  subgraph QCTX["es_query_cxt — per-query context (lives whole query)"]
    PST["PlanState tree"]
    EXS["ExprState trees<br/>(compiled quals + tlists)"]
    SLOTS["TupleTableSlots<br/>(es_tupleTable)"]
  end
  subgraph TCTX["per-tuple ExprContext memory (reset every tuple)"]
    SCR["qual / projection scratch<br/>function-call working storage"]
  end
  RESET["ResetExprContext / ResetPerTupleExprContext<br/>(bulk-free, once per tuple)"]
  RESET -.->|empties| TCTX
  FREE["FreeExecutorState -> MemoryContextDelete"]
  FREE -.->|destroys| QCTX

Figure 5 — Two memory lifetimes. The per-query context holds the entire state machine and is freed in one operation at ExecutorEnd. The per-tuple context holds only the current tuple’s transient scratch and is reset once per tuple, bounding intra-query memory regardless of how many tuples flow. (Source: executor/README “Memory Management”.)
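
The two lifetimes can be imitated with a pair of bump allocators. This is a sketch under stated assumptions, not the MemoryContext API: `ToyArena` and every `toy_*` function are hypothetical, the arenas are fixed-size, and the 100-byte scratch allocation is an arbitrary stand-in for qual and projection working storage. It illustrates why a per-tuple reset keeps the scratch footprint flat while the per-query arena is released in a single call.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical bump allocator standing in for a memory context. */
typedef struct ToyArena
{
    char  *base;
    size_t size;
    size_t used;
} ToyArena;

static int toy_arena_init(ToyArena *a, size_t size)
{
    a->base = malloc(size);
    a->size = size;
    a->used = 0;
    return a->base != NULL;
}

/* Allocate from the arena; there is deliberately no per-chunk free. */
static void *toy_alloc(ToyArena *a, size_t n)
{
    size_t aligned = (n + 7) & ~(size_t) 7;   /* keep chunks 8-byte aligned */
    void  *p;

    if (aligned > a->size - a->used)
        return NULL;
    p = a->base + a->used;
    a->used += aligned;
    return p;
}

/* Bulk-free everything in the arena but keep the arena itself. */
static void toy_arena_reset(ToyArena *a)
{
    a->used = 0;
}

/* Destroy the arena and everything allocated from it in one operation. */
static void toy_arena_delete(ToyArena *a)
{
    free(a->base);
    a->base = NULL;
    a->size = 0;
    a->used = 0;
}

/* Process ntuples rows; returns the peak per-tuple arena usage in bytes. */
static size_t toy_run_query(ToyArena *per_query, ToyArena *per_tuple, int ntuples)
{
    size_t peak = 0;
    int   *rows_seen = toy_alloc(per_query, sizeof(int)); /* long-lived state */

    if (rows_seen == NULL)
        return 0;
    *rows_seen = 0;
    for (int i = 0; i < ntuples; i++)
    {
        char *scratch;

        toy_arena_reset(per_tuple);           /* free the last tuple's scratch */
        scratch = toy_alloc(per_tuple, 100);  /* this tuple's working storage */
        if (scratch == NULL)
            break;
        scratch[0] = (char) i;
        (*rows_seen)++;
        if (per_tuple->used > peak)
            peak = per_tuple->used;
    }
    return peak;
}
```

With a 256-byte per-tuple arena, a thousand tuples each needing 100 bytes of scratch still fit, because the reset returns the arena to empty before every tuple; without the reset the third allocation would already fail.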

The es_epq_active field is the executor’s hook for EvalPlanQual — the READ COMMITTED update-recheck machinery where, on a concurrent update, the query is re-run for the single modified row to see whether it still passes the quals. The executor framework only carries the flag and the substitute-tuple plumbing (see the scan entry point next); the recheck policy belongs to the txn/MVCC layer and postgres-mvcc-snapshots.md.

The scan entry point — ExecScan and ExecScanExtended


Every leaf scan (SeqScan, IndexScan, SampleScan, SubqueryScan, …) shares one piece of boilerplate: fetch a candidate tuple from the access method, check it against the node’s qual, project the surviving columns, loop until a tuple passes or the source is exhausted. That boilerplate is ExecScan / ExecScanExtended in execScan.[ch]. ExecScan just pulls the node’s qual / projection / EPQ state and forwards to the always-inlined core:

// ExecScan — src/backend/executor/execScan.c
TupleTableSlot *
ExecScan(ScanState *node,
         ExecScanAccessMtd accessMtd, /* per-node "get next raw tuple" */
         ExecScanRecheckMtd recheckMtd) /* per-node EPQ recheck */
{
    EPQState *epqstate = node->ps.state->es_epq_active;
    ExprState *qual = node->ps.qual;
    ProjectionInfo *projInfo = node->ps.ps_ProjInfo;
    return ExecScanExtended(node, accessMtd, recheckMtd,
                            epqstate, qual, projInfo);
}

ExecScanExtended is marked pg_attribute_always_inline so the compiler can specialize it per call site, eliminating the qual == NULL and projInfo == NULL branches entirely when a scan has neither. Its loop:

// ExecScanExtended — src/include/executor/execScan.h (condensed)
ExprContext *econtext = node->ps.ps_ExprContext;
if (!qual && !projInfo) /* no filter, no projection */
{
    ResetExprContext(econtext);
    return ExecScanFetch(node, epqstate, accessMtd, recheckMtd); /* raw tuple */
}
ResetExprContext(econtext); /* free last tuple's scratch */
for (;;)
{
    TupleTableSlot *slot = ExecScanFetch(node, epqstate, accessMtd, recheckMtd);
    if (TupIsNull(slot)) /* source exhausted */
    {
        if (projInfo)
            return ExecClearTuple(projInfo->pi_state.resultslot);
        else
            return slot;
    }
    econtext->ecxt_scantuple = slot; /* expose tuple to Var refs in qual/tlist */
    if (qual == NULL || ExecQual(qual, econtext))
    {
        if (projInfo)
            return ExecProject(projInfo); /* project surviving columns */
        else
            return slot; /* return raw scan tuple */
    }
    else
        InstrCountFiltered1(node, 1); /* count a qual-rejected row */
    ResetExprContext(econtext); /* tuple failed qual -> free + retry */
}

This is where the “fast path” the README hints at lives: a SELECT * with no WHERE (no qual, and a tlist that exactly matches the scan descriptor, so ps_ProjInfo == NULL) takes the if (!qual && !projInfo) branch and returns the access method’s slot untouched — zero projection, zero qual call. ExecAssignScanProjectionInfo is what decides projection can be skipped, leaving ps_ProjInfo NULL when “the requested tlist exactly matches the underlying tuple type.”
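
The shape of that loop is small enough to reproduce outside the executor. The sketch below is an illustration only: `ToyRow`, `ToyScan`, `toy_exec_scan`, and the callback typedefs are hypothetical, rows are plain structs instead of slots, and a NULL return stands in for the empty-slot sentinel. It keeps the three ingredients named above: a pluggable access method, an optional qual, an optional projection, plus the early return when neither qual nor projection is present.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct ToyRow { int id; int dept; } ToyRow;

typedef struct ToyScan
{
    const ToyRow *rows;     /* the "table" */
    size_t nrows;
    size_t pos;             /* scan cursor */
    size_t filtered;        /* rows rejected by the qual */
    ToyRow result;          /* projection result slot */
} ToyScan;

typedef const ToyRow *(*ToyAccessMtd)(ToyScan *scan);
typedef bool (*ToyQual)(const ToyRow *row);
typedef void (*ToyProject)(const ToyRow *in, ToyRow *out);

/* Access method: return the next raw row, or NULL when exhausted. */
static const ToyRow *toy_seq_next(ToyScan *scan)
{
    return scan->pos < scan->nrows ? &scan->rows[scan->pos++] : NULL;
}

/* Fetch / qual / project loop with a no-qual-no-projection fast path. */
static const ToyRow *toy_exec_scan(ToyScan *scan, ToyAccessMtd access,
                                   ToyQual qual, ToyProject project)
{
    if (qual == NULL && project == NULL)
        return access(scan);            /* fast path: hand back the raw row */

    for (;;)
    {
        const ToyRow *row = access(scan);

        if (row == NULL)
            return NULL;                /* source exhausted */
        if (qual == NULL || qual(row))
        {
            if (project == NULL)
                return row;             /* raw row passed the qual */
            project(row, &scan->result);
            return &scan->result;       /* projected row */
        }
        scan->filtered++;               /* qual rejected the row; retry */
    }
}

/* Example qual and projection used to exercise the loop. */
static bool toy_dept_is_7(const ToyRow *row)
{
    return row->dept == 7;
}

static void toy_keep_dept_only(const ToyRow *in, ToyRow *out)
{
    out->id = 0;
    out->dept = in->dept;
}
```

Each call returns at most one row, so a caller drives the scan by calling `toy_exec_scan` repeatedly until it returns NULL, the same contract the surrounding pull loop relies on.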

ExecScanFetch (also always-inlined) is the indirection to the per-node access method, with the EvalPlanQual substitution layered in front. Outside an EPQ recheck (epqstate == NULL, the common case) the EPQ block is skipped and ExecScanFetch reduces to return (*accessMtd)(node); at a call site that passes a constant NULL for epqstate, the always-inline expansion also lets the compiler drop the block entirely:

// ExecScanFetch — src/include/executor/execScan.h (common-case skeleton)
static pg_attribute_always_inline TupleTableSlot *
ExecScanFetch(ScanState *node, EPQState *epqstate,
              ExecScanAccessMtd accessMtd, ExecScanRecheckMtd recheckMtd)
{
    CHECK_FOR_INTERRUPTS();
    if (epqstate != NULL)
    {
        /* inside an EvalPlanQual recheck: return the substitute test tuple,
         * or an empty slot if this rel's EPQ tuple was already consumed */
        /* ... relsubs_slot / relsubs_rowmark handling ... */
    }
    return (*accessMtd) (node); /* the node's real next-tuple routine */
}

accessMtd is the per-node function that actually talks to storage — e.g. SeqNext for a sequential scan, which calls the table AM’s table_scan_getnextslot. Those per-node access methods, and the table/index AM layer they call, are the subject of postgres-scan-nodes.md and the storage-engine docs; this document stops at the ExecScan boilerplate they all plug into.
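
The substitution-in-front-of-the-access-method idea can be sketched as follows. This is a heavily simplified model, not the real EPQ logic: `ToyEPQ`, `ToyFetchNode`, and the `toy_*` functions are hypothetical, the recheck callback is omitted, and the "substitute once, then empty" behaviour is taken only from the comment in the skeleton above. The wrapper `toy_plain_fetch` passes a literal NULL, which is the kind of call site where an inlining compiler can typically discard the EPQ branch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct ToyTuple { int id; } ToyTuple;

/* Hypothetical, heavily simplified stand-in for the EPQ recheck state. */
typedef struct ToyEPQ
{
    const ToyTuple *substitute;  /* the single row being rechecked */
    bool done;                   /* set once the substitute was handed out */
} ToyEPQ;

typedef struct ToyFetchNode
{
    const ToyTuple *rows;
    size_t nrows;
    size_t pos;
} ToyFetchNode;

typedef const ToyTuple *(*ToyFetchAccessMtd)(ToyFetchNode *node);

/* The node's real next-tuple routine. */
static const ToyTuple *toy_fetch_seq_next(ToyFetchNode *node)
{
    return node->pos < node->nrows ? &node->rows[node->pos++] : NULL;
}

/* Fetch with the recheck substitution layered in front of the access method. */
static inline const ToyTuple *
toy_scan_fetch(ToyFetchNode *node, ToyEPQ *epq, ToyFetchAccessMtd access)
{
    if (epq != NULL)
    {
        if (epq->done)
            return NULL;         /* substitute already consumed */
        epq->done = true;
        return epq->substitute;  /* return the test tuple, not a table row */
    }
    return access(node);         /* common case: ask the access method */
}

/* Call site with a constant NULL: the branch above is dead code here. */
static const ToyTuple *toy_plain_fetch(ToyFetchNode *node)
{
    return toy_scan_fetch(node, NULL, toy_fetch_seq_next);
}
```

During a recheck the underlying cursor does not move: the substitute is returned exactly once and the scan then reports end-of-data, so the plan is evaluated for that one row only.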

For INSERT / UPDATE / DELETE / MERGE, the actual table mutation is not in ExecutePlan; it happens inside a top-level ModifyTable plan node (ExecInitModifyTable / ExecModifyTable). ExecutePlan still pulls tuples from the root, but for these commands the root is the ModifyTable node: the plan tree below it produces the new column values plus a “junk” row-identity column (a CTID for a heap table), and ModifyTable performs the insert/update/delete via the table AM. If there is a RETURNING clause, ModifyTable emits the computed rows as its output (which is why hasReturning makes sendTuples true); otherwise it returns nothing and the row counting is done by ModifyTable itself, not the CMD_SELECT counter in ExecutePlan. The per-node mechanics of ModifyTable (tuple routing for partitions, trigger firing, ON CONFLICT, MERGE action dispatch) are deep enough for their own treatment and are deferred to a dedicated node doc; the framework fact to carry is that writes are a node, pulled like any other.
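
A toy version of "writes are a node" makes the two behaviours concrete. Everything here is hypothetical (`ToyDmlNode`, `toy_values_next`, `toy_modify_next`, `toy_execute_plan`), the "table" is a small fixed array, and only INSERT is modelled. The point it illustrates is the one made above: the driver just pulls from the root, and whether any rows come back depends on the RETURNING flag, while the insert count is kept by the modifying node itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_TABLE_CAP 8

typedef struct ToyDmlNode ToyDmlNode;
typedef const int *(*ToyDmlNextFn)(ToyDmlNode *self);

struct ToyDmlNode
{
    ToyDmlNextFn next;          /* per-node "get next tuple" pointer */
    ToyDmlNode *child;          /* input subtree, or NULL for a leaf */
    const int *src;             /* leaf only: values to produce */
    size_t nsrc;
    size_t pos;
    int table[TOY_TABLE_CAP];   /* modify node only: the target "table" */
    size_t ntable;              /* modify node only: rows inserted so far */
    bool returning;             /* modify node only: emit each inserted row? */
};

/* Leaf: produce the source values one at a time. */
static const int *toy_values_next(ToyDmlNode *self)
{
    return self->pos < self->nsrc ? &self->src[self->pos++] : NULL;
}

/* Modify node: pull from the child, insert, optionally return the row. */
static const int *toy_modify_next(ToyDmlNode *self)
{
    const int *in;

    while (self->ntable < TOY_TABLE_CAP &&
           (in = self->child->next(self->child)) != NULL)
    {
        const int *stored;

        self->table[self->ntable] = *in;        /* the actual "write" */
        stored = &self->table[self->ntable++];
        if (self->returning)
            return stored;                      /* RETURNING: emit the row */
    }
    return NULL;                                /* nothing (more) to emit */
}

/* Driver: pull from the root until it is exhausted; count rows sent out. */
static size_t toy_execute_plan(ToyDmlNode *root)
{
    size_t sent = 0;

    while (root->next(root) != NULL)
        sent++;
    return sent;
}
```

Without RETURNING a single pull on the root performs every insert and comes back empty, so the driver sends zero rows; with RETURNING the same tree yields one row per insert.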

ExecEndNode mirrors ExecInitNode: a nodeTag switch to the per-node ExecEnd* routine, each of which recurses into its children and releases that node’s resources (closing relations, dropping buffer pins, freeing tuplestores). The README is explicit that this is not about freeing memory — “it’s not really critical for ExecEndNode to free any memory; it’ll all go away in FreeExecutorState anyway” — but about releasing non-memory resources (relation locks, buffer pins) that the memory-context delete would otherwise leak. ExecShutdownNode is a lighter, earlier pass (called from ExecutePlan when no backward scan is possible) that stops asynchronous resource consumption — notably shutting down parallel workers and propagating their buffer-usage stats into the Gather node.
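
The division of labour at shutdown, where memory goes away wholesale but pins and locks must be handed back explicitly, can be shown with a recursive close over a toy tree. The names (`ToyPlanState`, `toy_init_node`, `toy_end_node`, `toy_pins_outstanding`) are hypothetical, a simple counter stands in for buffer pins, and the children-first order is a choice made for this sketch rather than a claim about any particular ExecEnd* routine.

```c
#include <assert.h>
#include <stddef.h>

typedef struct ToyPlanState
{
    struct ToyPlanState *left;
    struct ToyPlanState *right;
    int pins_held;      /* hypothetical non-memory resource owned by the node */
} ToyPlanState;

/* Stand-in for an external resource manager's view of outstanding pins. */
static int toy_pins_outstanding = 0;

/* "Init": link the node into the tree and acquire its resources. */
static void toy_init_node(ToyPlanState *node, ToyPlanState *left,
                          ToyPlanState *right, int pins)
{
    node->left = left;
    node->right = right;
    node->pins_held = pins;
    toy_pins_outstanding += pins;
}

/* "End": recurse into the children, then release this node's resources.
 * No memory is freed here; in this model that is the arena's job. */
static void toy_end_node(ToyPlanState *node)
{
    if (node == NULL)
        return;
    toy_end_node(node->left);
    toy_end_node(node->right);
    toy_pins_outstanding -= node->pins_held;
    node->pins_held = 0;            /* makes a second end call harmless */
}
```

Ending the root of a two-scan join returns the outstanding pin count to zero, and deleting the memory that held the three nodes afterwards would leak nothing external.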

Anchor on symbol names, not line numbers. The PostgreSQL source moves between releases; a function or struct name is the stable handle. Use git grep -n '<symbol>' src/backend/executor/ to locate the current position. The line numbers in the position-hint table were observed at commit 273fe94 (REL_18_STABLE) and are quick hints only.

  • ExecutorStart / standard_ExecutorStart — create EState, register the snapshot, set es_output_cid, begin AFTER-trigger scope, call InitPlan.
  • ExecutorRun / standard_ExecutorRun — start the dest receiver, call ExecutePlan unless the direction is no-movement, accumulate es_total_processed.
  • ExecutorFinish / standard_ExecutorFinish — ExecPostprocessPlan (run unfinished ModifyTable nodes) then AfterTriggerEndQuery.
  • ExecutorEnd / standard_ExecutorEnd — ExecEndNode, unregister both snapshots, FreeExecutorState.
  • ExecutorRewind — ExecReScan the root for cursor rewind (SELECT only).
  • InitPlan — permissions, range table, row marks, initial pruning, subplan init, then ExecInitNode(root); derives result TupleDesc and junk filter.
  • ExecutePlan — the demand-pull loop: ResetPerTupleExprContext → ExecProcNode → junk filter → dest->receiveSlot → count → limit check.
  • ExecCheckPermissions / ExecCheckOneRelPerms — per-relation and per-column ACL checks driven from InitPlan.
  • ExecInitNode — the open() dispatch; nodeTag switch to ExecInit*, recurses children, installs the ExecProcNode wrapper.
  • ExecSetExecProcNode — install ExecProcNodeFirst as the first-call wrapper, stash the real function in ExecProcNodeReal.
  • ExecProcNodeFirst — one-time stack check, then rewire ExecProcNode to the real function or ExecProcNodeInstr.
  • ExecProcNodeInstr — instrumentation wrapper used under EXPLAIN ANALYZE (InstrStartNode / InstrStopNode around the real call).
  • MultiExecProcNode — the bulk variant for nodes that return a whole structure (a hash table, a bitmap) rather than one tuple at a time: Hash, BitmapIndexScan, BitmapAnd, BitmapOr.
  • ExecEndNode — the close() dispatch; symmetric nodeTag switch to ExecEnd*.
  • ExecShutdownNode / ExecShutdownNode_walker — early async-resource release; shuts down Gather / GatherMerge workers, Hash, foreign and custom scans.
  • ExecSetTupleBound — push a FETCH n limit down through bound-aware nodes (Sort, IncrementalSort, Append, Gather, …) for bounded sort.
  • ExecProcNode (inline) — the public next(): rescan-if-chgParam, then node->ExecProcNode(node).
  • ExecProcNodeMtd (typedef, execnodes.h) — the per-node function-pointer type returning a TupleTableSlot *.
  • ExecReScan — reset a subtree to re-emit its output sequence (driven by changed Params); the README’s “rescan command.”
  • struct PlanState — abstract base; the ExecProcNode pointer, lefttree/righttree edges, qual, ps_ProjInfo, ps_ResultTupleSlot, ps_ExprContext.
  • struct EState — per-execution global; es_query_cxt, es_snapshot, es_direction, es_tupleTable, es_per_tuple_exprcontext, es_epq_active.
  • struct ExprContext — per-tuple memory (ecxt_per_tuple_memory) plus the ecxt_scantuple / ecxt_innertuple / ecxt_outertuple slots that Var references resolve against.
  • struct ScanState — the scan-node base: ss_currentRelation, ss_currentScanDesc (the cursor), ss_ScanTupleSlot.
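
The first-call wrapper listed above (ExecSetExecProcNode, ExecProcNodeFirst, ExecProcNodeInstr) is an instance of a self-replacing function pointer, which can be sketched in a few lines. All names below are hypothetical (`ToyProcNode`, `toy_set_proc`, `toy_proc_first`, `toy_proc_instr`, `toy_exec_proc`); the one-time work is reduced to a counter, and the rescan-if-chgParam step of the real inline is not modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct ToyProcNode ToyProcNode;
typedef int (*ToyProcMtd)(ToyProcNode *node);

struct ToyProcNode
{
    ToyProcMtd proc;        /* what callers invoke */
    ToyProcMtd proc_real;   /* the real per-node routine */
    bool instrument;        /* wrap calls with instrumentation? */
    int first_calls;        /* times the first-call wrapper ran */
    int instr_calls;        /* times the instrumentation wrapper ran */
    int counter;            /* toy payload produced by the node */
};

/* Instrumentation wrapper: bookkeeping around the real call. */
static int toy_proc_instr(ToyProcNode *node)
{
    node->instr_calls++;
    return node->proc_real(node);
}

/* First-call wrapper: do one-time work, then rewire the pointer. */
static int toy_proc_first(ToyProcNode *node)
{
    node->first_calls++;    /* stands in for the one-time stack check */
    node->proc = node->instrument ? toy_proc_instr : node->proc_real;
    return node->proc(node);
}

/* Install the wrapper and stash the real routine. */
static void toy_set_proc(ToyProcNode *node, ToyProcMtd real)
{
    node->proc_real = real;
    node->proc = toy_proc_first;
}

/* The public "next": a single indirect call. */
static inline int toy_exec_proc(ToyProcNode *node)
{
    return node->proc(node);
}

/* A trivial per-node routine for demonstration. */
static int toy_count_up(ToyProcNode *node)
{
    return ++node->counter;
}
```

After the first call the pointer refers directly to the real routine (or to the instrumentation wrapper), so the one-time check costs nothing on later calls.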

Slot abstraction (tuptable.h, execTuples.c)

  • struct TupleTableSlot — the uniform handle; tts_ops, tts_values / tts_isnull, tts_nvalid, tts_flags.
  • struct TupleTableSlotOps — the vtable (init / clear / getsomeattrs / materialize / copyslot / get_heap_tuple / …).
  • TTSOpsVirtual / TTSOpsHeapTuple / TTSOpsMinimalTuple / TTSOpsBufferHeapTuple — the four backing implementations.
  • TupIsNull (macro) — NULL-or-empty test, the end-of-data sentinel.
  • MakeTupleTableSlot / ExecAllocTableSlot / ExecInitScanTupleSlot — slot construction.
  • ExecStoreHeapTuple / ExecStoreMinimalTuple / ExecStoreVirtualTuple — put a tuple into a slot of the matching type.
  • ExecClearTuple (inline) — return a slot to the empty state.
  • slot_getsomeattrs / slot_getsomeattrs_int — lazy left-prefix deform up to attnum, advancing tts_nvalid.
  • ExecScan — the public scan driver; reads qual / projInfo / EPQ and forwards to ExecScanExtended.
  • ExecScanExtended (always-inline) — fetch / qual / project loop with the no-qual-no-project fast path.
  • ExecScanFetch (always-inline) — CHECK_FOR_INTERRUPTS + EPQ substitution + (*accessMtd)(node).
  • ExecAssignScanProjectionInfo — decide whether projection can be skipped (leaves ps_ProjInfo NULL when the tlist matches the scan descriptor).
  • ExecScanReScan — clear the scan slot and reset EPQ done-flags on rescan.
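
Two of the slot conventions listed above, the ops pointer that doubles as a type tag and the NULL-or-empty end-of-data test, fit in a short sketch. The types and macros here (`ToyTypedSlot`, `ToySlotOps`, `TOY_IS_VIRTUAL`, `TOY_IS_HEAP`, `ToyTupIsNull`) are hypothetical miniatures that assume only two backings and a single `clear` callback; they are not the tuptable.h definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct ToyTypedSlot ToyTypedSlot;

/* Miniature vtable: one callback is enough to show the dispatch. */
typedef struct ToySlotOps
{
    const char *name;
    void (*clear)(ToyTypedSlot *slot);
} ToySlotOps;

struct ToyTypedSlot
{
    const ToySlotOps *ops;  /* vtable pointer, also the slot's type tag */
    bool empty;             /* true when the slot holds no tuple */
    int payload;            /* stand-in for the stored tuple */
};

static void toy_clear_payload(ToyTypedSlot *slot)
{
    slot->payload = 0;
}

/* Two backings; their addresses are what the type tests compare against. */
static const ToySlotOps ToyOpsVirtual = {"virtual", toy_clear_payload};
static const ToySlotOps ToyOpsHeap = {"heap", toy_clear_payload};

#define TOY_IS_VIRTUAL(slot) ((slot)->ops == &ToyOpsVirtual)
#define TOY_IS_HEAP(slot)    ((slot)->ops == &ToyOpsHeap)

/* End-of-data test: a NULL pointer and an empty slot both count. */
#define ToyTupIsNull(slot)   ((slot) == NULL || (slot)->empty)

/* Put a value into the slot and mark it non-empty. */
static ToyTypedSlot *toy_typed_store(ToyTypedSlot *slot, int payload)
{
    slot->payload = payload;
    slot->empty = false;
    return slot;
}

/* Return the slot to the empty state via its vtable. */
static ToyTypedSlot *toy_typed_clear(ToyTypedSlot *slot)
{
    slot->ops->clear(slot);
    slot->empty = true;
    return slot;
}
```

Because a cleared slot satisfies the same test as a NULL pointer, a node can signal exhaustion by returning its own emptied result slot, and the caller needs only one check.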

Position hints (as of 2026-06-05, REL_18 273fe94)

| Symbol | File | Line |
| --- | --- | --- |
| ExecutorStart | execMain.c | 122 |
| standard_ExecutorStart | execMain.c | 141 |
| standard_ExecutorRun | execMain.c | 307 |
| standard_ExecutorFinish | execMain.c | 415 |
| standard_ExecutorEnd | execMain.c | 475 |
| ExecCheckPermissions | execMain.c | 582 |
| InitPlan | execMain.c | 836 |
| ExecutePlan | execMain.c | 1660 |
| ExecInitNode | execProcnode.c | 142 |
| ExecSetExecProcNode | execProcnode.c | 430 |
| ExecProcNodeFirst | execProcnode.c | 448 |
| ExecProcNodeInstr | execProcnode.c | 479 |
| MultiExecProcNode | execProcnode.c | 507 |
| ExecEndNode | execProcnode.c | 562 |
| ExecShutdownNode | execProcnode.c | 772 |
| ExecSetTupleBound | execProcnode.c | 848 |
| ExecProcNode (inline) | executor.h | 310 |
| ExecProcNodeMtd (typedef) | execnodes.h | 1140 |
| struct PlanState | execnodes.h | 1149 |
| struct EState | execnodes.h | 649 |
| struct ExprContext | execnodes.h | 262 |
| struct ScanState | execnodes.h | 1609 |
| struct TupleTableSlot | tuptable.h | 114 |
| struct TupleTableSlotOps | tuptable.h | 134 |
| TupIsNull (macro) | tuptable.h | 310 |
| MakeTupleTableSlot | execTuples.c | 1301 |
| ExecStoreVirtualTuple | execTuples.c | 1741 |
| ExecScan | execScan.c | 47 |
| ExecScanReScan | execScan.c | 108 |
| ExecScanFetch (inline) | execScan.h | 32 |
| ExecScanExtended (inline) | execScan.h | 160 |

Each entry leads with a fact about the current source at commit 273fe94 (REL_18_STABLE), readable without any other materials. The trailing sentence shows how it was checked. Open questions follow as recorded gaps.

  • All four executor entry points are hook-dispatchers over a standard_* body. ExecutorStart/Run/Finish/End each test a *_hook global and otherwise call standard_*. The hook globals (ExecutorStart_hook …) are defined at the top of execMain.c and initialized to NULL. Verified by reading the four functions on 2026-06-05.

  • ExecProcNode is an inline function, not a switch; per-node dispatch is a function pointer set at init. The inline lives in executor.h and calls node->ExecProcNode(node). The pointer is installed by ExecSetExecProcNode (in execProcnode.c) as the first-call wrapper ExecProcNodeFirst, which then rewires itself to the bare per-node function or ExecProcNodeInstr. Verified by reading all three on 2026-06-05. The per-tuple cost is one indirect call after the first invocation.

  • End-of-data is signaled by an empty TupleTableSlot, tested with TupIsNull, not by a universal C NULL. ExecutePlan breaks on TupIsNull(slot); TupIsNull (in tuptable.h) is ((slot) == NULL || TTS_EMPTY(slot)). Verified by reading both. Individual nodes may return a NULL pointer or a cleared slot; both satisfy TupIsNull.

  • There are exactly four built-in slot backings. TTSOpsVirtual, TTSOpsHeapTuple, TTSOpsMinimalTuple, TTSOpsBufferHeapTuple are declared extern PGDLLIMPORT const in tuptable.h and defined in execTuples.c. Verified by reading both. The tts_ops pointer doubles as the slot’s type tag (the TTS_IS_* macros compare against the vtable address).

  • The state tree and the plan tree are isomorphic one-for-one, with the documented Append/MergeAppend pruning exception. ExecInitNode’s nodeTag switch has one ExecInit* per plan node type, and the README states the exception explicitly (“the executor state’s subnode array will become out of sequence to the plan’s subplan list” under run-time pruning). Verified by reading ExecInitNode and the README on 2026-06-05.

  • The whole executor invocation lives in one per-query memory context freed in a single operation. CreateExecutorState creates es_query_cxt; standard_ExecutorStart/Run/Finish switch into it; FreeExecutorState (called by standard_ExecutorEnd) destroys it. The README’s Memory Management section states the intent (“we just destroy the memory context”). Verified by reading the entry points; the MemoryContextDelete itself is inside FreeExecutorState in execUtils.c, which was not separately opened — taken on the README’s word plus the switch/free pairing.

  • The query snapshot is registered for the run and unregistered at end. standard_ExecutorStart does estate->es_snapshot = RegisterSnapshot(queryDesc->snapshot) and asserts GetActiveSnapshot() == queryDesc->snapshot; standard_ExecutorEnd calls UnregisterSnapshot(estate->es_snapshot). Verified by reading both.

  • ExecScanExtended has a branch-eliminated fast path for no-qual-no-projection scans. The if (!qual && !projInfo) early return in execScan.h, combined with the pg_attribute_always_inline attribute, lets the compiler drop the qual/projection branches at each call site. ExecAssignScanProjectionInfo is what leaves ps_ProjInfo NULL when the tlist matches the scan tuple type. Verified by reading execScan.h and ExecAssignScanProjectionInfo in execScan.c.

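  • ExecScanFetch skips the EvalPlanQual path when not in an EPQ recheck, and the path can be compiled away where epqstate is a compile-time NULL. The whole EPQ block is under if (epqstate != NULL). Through ExecScan, epqstate is node->ps.state->es_epq_active, a run-time value that is NULL outside an EPQ recheck, so the block costs one well-predicted branch; a caller that passes a literal NULL to ExecScanExtended lets pg_attribute_always_inline eliminate the block as dead code. Verified by reading ExecScan and ExecScanFetch.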

  1. Where exactly is es_per_tuple_exprcontext reset relative to per-node ExprContexts? ExecutePlan calls ResetPerTupleExprContext(estate) (the top-level context), while ExecScanExtended resets the node’s own ps_ExprContext. The division of labor between the EState-level per-tuple context and each node’s per-tuple context is not fully traced here. Investigation path: read CreateExprContext / GetPerTupleExprContext in execUtils.c and the comment block above ExprContext in execnodes.h.

  2. What is the exact set of nodes that support MultiExecProcNode, and is it closed? The REL_18 switch lists Hash, BitmapIndexScan, BitmapAnd, BitmapOr. Whether custom-scan or foreign-scan providers can extend this set, or it is hard-closed to those four, was not verified. Investigation path: grep for MultiExec* definitions and check the custom-scan API in nodeCustom.c.

  3. How does asynchronous execution (ExecAsyncRequest / ExecAppendAsyncEventWait) interleave with the synchronous pull loop? The README documents an async path for Append over async-capable ForeignScan children, but this document covers only the synchronous demand-pull spine. Investigation path: read nodeAppend.c’s async event loop and execAsync.c; likely its own follow-up note.

Beyond PostgreSQL — Comparative Designs & Research Frontiers


Pointers, not analysis. Each bullet is a starting handle for a follow-up doc; depth here is intentionally shallow.

  • CUBRID’s XASL tree vs. PostgreSQL’s PlanState tree. CUBRID executes an XASL (eXtended Access Specification Language) tree, also a pull-style operator tree, but the plan and execution state are less cleanly separated than PostgreSQL’s read-only-Plan / mutable-PlanState split. A side-by-side of how each engine achieves plan reuse and parallel re-execution would sharpen what PostgreSQL buys with the strict immutability rule.

  • Vectorized / batch-at-a-time execution. The classic critique of the tuple-at-a-time Volcano model is per-tuple interpreter overhead — one next() call and one slot per row. MonetDB/X100 (Boncz, Zukowski & Nes, CIDR 2005, “MonetDB/X100: Hyper-Pipelining Query Execution”) and the column stores process vectors of values per call instead. The C-Store / Vertica lineage (dbms-papers/cstore.md, vertica-7-years.md) is the production embodiment. PostgreSQL stays tuple-at-a-time in core; JIT compilation (postgres-jit.md) is its answer to the interpreter overhead instead of vectorization.

  • Push-based / data-centric compiled execution. Neumann’s “Efficiently Compiling Efficient Query Plans for Modern Hardware” (VLDB 2011, the HyPer engine) inverts the iterator: instead of pulling tuples up, it compiles the plan into tight push-based loops that keep a tuple in CPU registers across operator boundaries. This is the producer-driven direction DSC §15.7.2.1 names. A comparison would quantify what PostgreSQL’s pull model costs in L1/branch-prediction terms versus its simplicity and composability.

  • The Volcano exchange operator and PostgreSQL parallelism. Graefe’s “Volcano — An Extensible and Parallel Query Evaluation System” (IEEE TKDE 1994) introduced the exchange operator that turns a serial iterator tree into a parallel one without changing the operators. PostgreSQL’s Gather / GatherMerge nodes plus shm_mq are a variant of the same idea — a parallel-aware node forks the sub-plan into background workers and re-merges their streams. The cross-reference is postgres-parallel-query.md; mapping Gather onto the exchange-operator abstraction would tie the implementation back to the theory.

  • Morsel-driven parallelism. Leis et al., “Morsel-Driven Parallelism” (SIGMOD 2014) replaces the exchange operator with a work-stealing scheduler over small tuple “morsels,” better at load balancing on many-core machines than PostgreSQL’s per-worker sub-plan split. Relevant if PostgreSQL parallel query is ever measured against modern NUMA hardware.

  • src/backend/executor/README — the authoritative design doc: the demand-pull pipeline model, the Plan/State two-tree split, expression vs. state trees, memory management (per-query and per-tuple contexts), the Query Processing Control Flow sketch, EvalPlanQual, and asynchronous execution.

Textbook chapters (under knowledge/research/dbms-general/)

  • Database System Concepts (Silberschatz, Korth & Sudarshan, 7e), Ch. 15 “Query Processing”, §15.7 “Evaluation of Expressions” — §15.7.2 “Pipelining” (materialized vs. pipelined; benefits) and §15.7.2.1 “Implementation of Pipelining” (demand-driven vs. producer-driven; the open()/next()/close() iterator). Ch. 22 §22.5 for the Volcano exchange-operator model in parallel processing.

Papers (under knowledge/research/dbms-papers/)

  • Architecture of a Database System (Hellerstein, Stonebraker & Hamilton, FnT 2007) — fntdb07-architecture.md, §1.1 / §4: the relational query processor as a suite of operators, SQL served in a “pull model.”
  • (Comparative, not yet captured) Graefe 1994 “Volcano”; Boncz et al. 2005 “MonetDB/X100”; Neumann 2011 “Compiling Efficient Query Plans”; Leis et al. 2014 “Morsel-Driven Parallelism” — see §“Beyond PostgreSQL”.

PostgreSQL source (under /data/hgryoo/references/postgres/, REL_18 273fe94)

  • src/backend/executor/execMain.c — lifecycle entry points, InitPlan, ExecutePlan, permissions.
  • src/backend/executor/execProcnode.c — ExecInitNode / ExecProcNode wrappers / ExecEndNode dispatch, MultiExecProcNode, ExecShutdownNode, ExecSetTupleBound.
  • src/backend/executor/execTuples.c — slot construction and ExecStore* / slot_getsomeattrs; the four TTSOps* vtable definitions.
  • src/backend/executor/execScan.c + src/include/executor/execScan.h — the scan fetch/qual/project boilerplate.
  • src/include/executor/executor.h — the inline ExecProcNode, expression-eval prototypes.
  • src/include/executor/tuptable.h — TupleTableSlot, TupleTableSlotOps, TupIsNull, the TTSOps* declarations.
  • src/include/nodes/execnodes.h — PlanState, EState, ExprContext, ScanState, ExecProcNodeMtd.

Cross-references (sibling docs that own adjacent mechanism)

  • postgres-planner-overview.md — how the read-only Plan tree is built.
  • postgres-node-trees.md — the Node / Plan / PlanState type system.
  • postgres-expression-eval.md — ExprState / ExprEvalStep quals and projections, slot_getsomeattrs fetch steps.
  • postgres-scan-nodes.md — per-node access methods (SeqNext, etc.) that plug into ExecScan.
  • postgres-portals-prepared.md — the portal layer that calls ExecutorStart/Run/End and supplies the snapshot.