Skip to content

PostgreSQL Scan Nodes — Sequential, Index, Bitmap, and TID Scans

Contents:

Leaf scan nodes are where the Volcano iterator model meets physical storage. Every query plan tree has scans at its leaves — the nodes that actually read tuples from a relation — and every other node (join, sort, aggregate) sits above them, pulling tuples upward on demand. Database System Concepts (Silberschatz, 7e, ch. 15 “Query Processing”) distinguishes scan strategies by the path they take through the data:

  • Full table scan. The engine reads every page of the relation regardless of which tuples are needed, then filters rows with the selection predicate. The cost is proportional to the number of blocks in the relation and is independent of the selectivity of the predicate. For low-selectivity predicates over large relations, this is often the cheapest path because storage I/O is sequential and predictable (§12.3).
  • Index scan. The engine first traverses an index ordered by the search key, retrieving only the record identifiers (RIDs) for tuples that satisfy the index predicate, then fetches each heap tuple by its RID. The cost has two components: the index traversal cost and the heap fetch cost. When the index is highly selective and the relation is large, the heap fetch cost dominates because each RID may point to a different disk block (random I/O). When the index is weakly selective, the total random I/O cost can exceed a full scan.
  • Index-only scan. If the query projects only attributes that are stored in the index (a covering index), the engine can satisfy the query entirely from the index without touching the heap. The textbook calls this a covering-index scan or key-only scan. The cost is the index traversal cost alone (§15.3.1).

The textbook also motivates the bitmap-based scan, sometimes called multiple-index scan (§15.3.3): when several indexes each cover part of the predicate, the engine evaluates them independently, forms the intersection (AND) or union (OR) of the result bitmaps, and then fetches only the heap pages that appear in the final bitmap. The key insight is that sorting the heap accesses by block number converts a large number of random I/Os into a mostly sequential pass. Modern commercial engines use this technique under names like “index merge” (MySQL) or “bitmap AND/OR” (Oracle, PostgreSQL).

Two design axes govern every real scan implementation:

  1. Predicate evaluation placement. The engine can push predicates down into the scan (evaluated per-tuple before projecting), or evaluate them above the scan at the join/filter layer. PostgreSQL pushes qual evaluation into ExecScan so that unqualified tuples never leave the scan node — only qualifying tuples travel up the plan tree.

  2. Visibility gating. A tuple stored in the heap may be invisible to the current snapshot even if the index points to it (see postgres-mvcc-snapshots.md). The scan must apply the MVCC visibility check as part of tuple retrieval. This check can be deferred (heap fetch followed by visibility check) or bypassed (visibility map, discussed below).

Almost every production engine converges on the same structural choices for scan nodes. Understanding this shared engineering context makes PostgreSQL’s specific dials easier to read.

Scans are expressed as a pair of AM callbacks: an access function that returns the next raw tuple, and a recheck function that re-evaluates index quals on the heap tuple when the index is lossy (i.e., it stored a page-level predicate rather than a precise per-tuple predicate). PostgreSQL names these ExecScanAccessMtd and ExecScanRecheckMtd. The outer loop — qual evaluation, projection, EPQ handling — is factored into a shared ExecScan/ExecScanExtended function shared by all six node types. Only the access function and the recheck function differ per node type.

All engines defer opening a scan descriptor until the first next() call. This matters because plans are built before execution, and some plan nodes (e.g., an inner side of a nested loop) are re-opened on every outer row. Opening the scan descriptor on Init would waste work for never-executed branches. PostgreSQL checks ss_currentScanDesc == NULL on the first call to the access function and opens the descriptor there.

Parallel table scans (both sequential and index) divide work via a shared scan descriptor allocated in dynamic shared memory (DSM). The convention across all PostgreSQL scan node types is a four-function protocol: Estimate (size the DSM region), InitializeDSM (leader sets up the shared descriptor), ReInitializeDSM (reset for a rescan), InitializeWorker (worker attaches to the shared descriptor). The actual work-division logic lives in the AM (table_parallelscan_* and index_parallelscan_* families).

Qual specialization by presence of conditions

Section titled “Qual specialization by presence of conditions”

A scan with no qual and no projection is trivially faster than one with both — the tight inner loop does nothing but fetch the next tuple and return it. PostgreSQL’s ExecInitSeqScan sets ExecProcNode to one of five specialized variants: ExecSeqScan, ExecSeqScanWithQual, ExecSeqScanWithProject, ExecSeqScanWithQualProject, or ExecSeqScanEPQ. Each variant passes const-NULL arguments for the absent components, which the compiler uses to eliminate dead branches in the common fast path.

Bitmap scans split into two distinct plan nodes: a BitmapIndexScan (or a BitmapAnd/BitmapOr tree of them) runs as the outer subplan, returning a TIDBitmap via MultiExecProcNode rather than individual tuples. The BitmapHeapScan node then iterates over the bitmap, grouping accesses by heap block number to amortize random I/O. The bitmap itself is in-memory when small (exact mode, one bit per TID) and degrades gracefully to page-level lossy mode when it overflows work_mem.

PostgreSQL provides six leaf scan node types, all wired into the shared ExecScan framework. The nodes split cleanly into three families:

flowchart TD
  ES["ExecScan / ExecScanExtended<br/>qual eval + projection + EPQ"]
  SS["SeqScan<br/>table_beginscan + table_scan_getnextslot"]
  IS["IndexScan<br/>index_beginscan + index_getnext_slot<br/>+ heap fetch"]
  IOS["IndexOnlyScan<br/>index_beginscan xs_want_itup=true<br/>VM check → skip heap if all-visible"]
  BHS["BitmapHeapScan<br/>MultiExecProcNode → TIDBitmap<br/>table_scan_bitmap_next_tuple"]
  BIS["BitmapIndexScan<br/>MultiExecProcNode returns TIDBitmap"]
  TS["TidScan<br/>table_tuple_fetch_row_version<br/>on explicit ctid list"]
  TRS["TidRangeScan<br/>table_beginscan_tidrange<br/>on ctid range bounds"]
  ES --> SS
  ES --> IS
  ES --> IOS
  ES --> BHS
  BHS --> BIS
  ES --> TS
  ES --> TRS

Figure 1 — Six scan node types share the ExecScan outer loop. BitmapHeapScan’s inner subplan is a BitmapIndexScan (or an And/Or tree of them) that returns a TIDBitmap via MultiExecProcNode.

Shared scan loop: ExecScan and ExecScanExtended

Section titled “Shared scan loop: ExecScan and ExecScanExtended”

ExecScan is a thin wrapper that reads the epqstate, qual, and projInfo from the node and delegates to ExecScanExtended. The extended form accepts them as explicit parameters so that the specialized SeqScan variants can pass compile-time NULL constants.

// ExecScan — src/backend/executor/execScan.c
TupleTableSlot *
ExecScan(ScanState *node,
ExecScanAccessMtd accessMtd,
ExecScanRecheckMtd recheckMtd)
{
EPQState *epqstate = node->ps.state->es_epq_active;
ExprState *qual = node->ps.qual;
ProjectionInfo *projInfo = node->ps.ps_ProjInfo;
return ExecScanExtended(node, accessMtd, recheckMtd,
epqstate, qual, projInfo);
}

The shared loop inside ExecScanExtended calls accessMtd to get the next raw tuple, applies the qual (if non-NULL), projects (if non-NULL), and handles EPQ re-evaluation — none of this logic is duplicated per scan node.

SeqScan — full table scan via the table AM

Section titled “SeqScan — full table scan via the table AM”

SeqScan is the simplest node. SeqNext opens a TableScanDesc on the first call via table_beginscan, then calls table_scan_getnextslot in a direction determined by estate->es_direction.

// SeqNext — src/backend/executor/nodeSeqscan.c
static TupleTableSlot *
SeqNext(SeqScanState *node)
{
TableScanDesc scandesc = node->ss.ss_currentScanDesc;
EState *estate = node->ss.ps.state;
ScanDirection direction = estate->es_direction;
TupleTableSlot *slot = node->ss.ss_ScanTupleSlot;
if (scandesc == NULL)
{
scandesc = table_beginscan(node->ss.ss_currentRelation,
estate->es_snapshot, 0, NULL);
node->ss.ss_currentScanDesc = scandesc;
}
if (table_scan_getnextslot(scandesc, direction, slot))
return slot;
return NULL;
}

The recheck function SeqRecheck always returns true — sequential scans never use index keys, so there is nothing to recheck.

ExecInitSeqScan selects one of five specialized ExecProcNode variants based on the presence of qual, projection, and EPQ:

// ExecInitSeqScan (excerpt) — src/backend/executor/nodeSeqscan.c
if (scanstate->ss.ps.state->es_epq_active != NULL)
scanstate->ss.ps.ExecProcNode = ExecSeqScanEPQ;
else if (scanstate->ss.ps.qual == NULL)
{
if (scanstate->ss.ps.ps_ProjInfo == NULL)
scanstate->ss.ps.ExecProcNode = ExecSeqScan;
else
scanstate->ss.ps.ExecProcNode = ExecSeqScanWithProject;
}
else
{
if (scanstate->ss.ps.ps_ProjInfo == NULL)
scanstate->ss.ps.ExecProcNode = ExecSeqScanWithQual;
else
scanstate->ss.ps.ExecProcNode = ExecSeqScanWithQualProject;
}

Each variant passes NULL for absent components: ExecSeqScan passes NULL, NULL, NULL for epqstate, qual, projInfo — the compiler eliminates the corresponding branches in ExecScanExtended.

Parallel support is the standard four-function protocol: ExecSeqScanEstimate sizes the DSM region via table_parallelscan_estimate; ExecSeqScanInitializeDSM allocates the shared descriptor and calls table_parallelscan_initialize; ExecSeqScanInitializeWorker attaches via table_beginscan_parallel.

IndexNext opens an IndexScanDesc via index_beginscan (lazy, first call), applies any runtime scan keys, then loops on index_getnext_slot. Each call to index_getnext_slot returns the index tuple and sets xs_recheck if the index is lossy:

// IndexNext (excerpt) — src/backend/executor/nodeIndexscan.c
while (index_getnext_slot(scandesc, direction, slot))
{
CHECK_FOR_INTERRUPTS();
if (scandesc->xs_recheck)
{
econtext->ecxt_scantuple = slot;
if (!ExecQualAndReset(node->indexqualorig, econtext))
{
InstrCountFiltered2(node, 1);
continue;
}
}
return slot;
}
node->iss_ReachedEnd = true;
return ExecClearTuple(slot);

index_getnext_slot fetches the heap tuple at the TID returned by the index AM. The visibility check is performed inside the heap AM using the query snapshot (estate->es_snapshot). The xs_recheck flag is set by lossy index AMs (GIN, GiST when the index predicate does not store full per-tuple precision) to trigger re-evaluation of the original index qual against the fetched heap tuple.

IndexNextWithReorder handles the case where the query includes ORDER BY expressions on index operators: it fetches tuples into a pairing-heap reorder queue (iss_ReorderQueue) and emits them in order, re-checking distance values when the index returns approximate ordering.

ExecIndexBuildScanKeys (line 1158) translates the planner’s IndexScan.indexqual list into the ScanKey array that the index AM expects. Runtime keys — quals whose right-hand side must be evaluated at execution time (e.g., WHERE col = $1) — are separated into iss_RuntimeKeys and evaluated via ExecIndexEvalRuntimeKeys before the first index_rescan.

IndexOnlyScan — heap bypass via visibility map

Section titled “IndexOnlyScan — heap bypass via visibility map”

IndexOnlyScan is IndexScan with one structural addition: ioss_ScanDesc->xs_want_itup = true tells the index AM to return the index tuple rather than fetching from the heap. Before returning the tuple, IndexOnlyNext consults the visibility map to decide whether a heap fetch is still necessary:

// IndexOnlyNext (excerpt) — src/backend/executor/nodeIndexonlyscan.c
while ((tid = index_getnext_tid(scandesc, direction)) != NULL)
{
bool tuple_from_heap = false;
CHECK_FOR_INTERRUPTS();
if (!VM_ALL_VISIBLE(scandesc->heapRelation,
ItemPointerGetBlockNumber(tid),
&node->ioss_VMBuffer))
{
/* must visit heap to check visibility */
InstrCountTuples2(node, 1);
if (!index_fetch_heap(scandesc, node->ioss_TableSlot))
continue; /* no visible tuple */
ExecClearTuple(node->ioss_TableSlot);
tuple_from_heap = true;
}
/* fill slot from index tuple (xs_hitup preferred over xs_itup) */
if (scandesc->xs_hitup)
ExecForceStoreHeapTuple(scandesc->xs_hitup, slot, false);
else if (scandesc->xs_itup)
StoreIndexTuple(node, slot, scandesc->xs_itup, scandesc->xs_itupdesc);
else
elog(ERROR, "no data returned for index-only scan");
...
}

If VM_ALL_VISIBLE is set for the heap page containing the TID, the scan skips the heap fetch entirely. If not, it calls index_fetch_heap to obtain the heap tuple (to verify visibility) but still returns the index tuple as the data source. This means the heap fetch is used only for its visibility side-effect, not to carry column data. The ioss_VMBuffer pin persists across calls to amortize repeated visibility-map lookups for the same page.

StoreIndexTuple unpacks the index tuple into the scan slot’s attribute array. For indexes on name-typed columns, the tuple must be padded to NAMEDATALEN; ioss_NameCStringAttNums tracks those column numbers.

BitmapHeapScan — two-phase bitmap-driven heap scan

Section titled “BitmapHeapScan — two-phase bitmap-driven heap scan”

BitmapHeapScan executes as two plan nodes cooperating through MultiExecProcNode. The outer subplan is a BitmapIndexScan (or a tree of BitmapAnd/BitmapOr nodes) that returns a TIDBitmap without going through the standard tuple interface. BitmapHeapScan calls MultiExecProcNode exactly once (deferred to the first BitmapHeapNext call via BitmapTableScanSetup) to build the bitmap, then hands it to the heap AM for page-ordered iteration.

// BitmapTableScanSetup (excerpt) — src/backend/executor/nodeBitmapHeapscan.c
static void
BitmapTableScanSetup(BitmapHeapScanState *node)
{
/* serial path: build TIDBitmap from subplan */
node->tbm = (TIDBitmap *) MultiExecProcNode(outerPlanState(node));
if (!node->tbm || !IsA(node->tbm, TIDBitmap))
elog(ERROR, "unrecognized result from subplan");
tbmiterator = tbm_begin_iterate(node->tbm, dsa, InvalidDsaPointer);
/* open heap scan descriptor for bitmap-driven access */
if (!node->ss.ss_currentScanDesc)
node->ss.ss_currentScanDesc =
table_beginscan_bm(node->ss.ss_currentRelation,
node->ss.ps.state->es_snapshot, 0, NULL);
node->ss.ss_currentScanDesc->st.rs_tbmiterator = tbmiterator;
node->initialized = true;
}

BitmapHeapNext then calls table_scan_bitmap_next_tuple in a loop. The heap AM iterates over the bitmap in block order. When the bitmap contains a lossy page entry (the TIDBitmap overflowed work_mem and fell back to page-level granularity), node->recheck is set to true and BitmapHeapNext re-evaluates bitmapqualorig against each tuple on that page:

// BitmapHeapNext (excerpt) — src/backend/executor/nodeBitmapHeapscan.c
while (table_scan_bitmap_next_tuple(node->ss.ss_currentScanDesc,
slot, &node->recheck, ...))
{
CHECK_FOR_INTERRUPTS();
if (node->recheck)
{
econtext->ecxt_scantuple = slot;
if (!ExecQualAndReset(node->bitmapqualorig, econtext))
{
InstrCountFiltered2(node, 1);
ExecClearTuple(slot);
continue;
}
}
return slot;
}
return ExecClearTuple(slot);

For parallel bitmap scans, the leader calls tbm_prepare_shared_iterate to publish the iterator DSA pointer in pstate->tbmiterator; workers attach via tbm_begin_iterate with that pointer. The BitmapShouldInitializeSharedState / BitmapDoneInitializingSharedState pair uses a spinlock + condition variable to ensure only the first worker builds the bitmap.

The two-phase structure deserves a closer look because it is the single most important architectural difference between BitmapHeapScan and the index-driven IndexScan. In an IndexScan, the index traversal and the heap fetch are interleaved tuple-by-tuple: fetch one TID, fetch its heap tuple, return it, repeat. Each heap fetch can land on an arbitrary block, so for a weakly-selective predicate the heap access pattern is effectively random. BitmapHeapScan breaks that interleaving by inserting a materialization barrier — the TIDBitmap — between the index pass and the heap pass. The first phase drains the entire BitmapIndexScan subplan via a single MultiExecProcNode call, accumulating every matching TID into the bitmap. Only after the bitmap is complete does the second phase begin, walking the heap in strictly ascending block-number order. This is the I/O-coalescing win: the same set of heap blocks that an IndexScan would visit in random order is now visited sequentially, and each block is read at most once even if many matching tuples live on it.

flowchart TD
  subgraph EXECSCAN["ExecScan / ExecScanExtended — shared outer loop"]
    ESLOOP["accessMtd → qual → project → EPQ"]
  end

  subgraph FAMILY["Scan-node family — one accessMtd each"]
    SS["SeqNext<br/>table_scan_getnextslot"]
    IS["IndexNext<br/>index_getnext_slot + heap fetch"]
    IOS["IndexOnlyNext<br/>index_getnext_tid + VM_ALL_VISIBLE gate"]
    TS["TidNext<br/>table_tuple_fetch_row_version"]
    TRS["TidRangeNext<br/>table_scan_getnextslot_tidrange"]
    BHN["BitmapHeapNext<br/>table_scan_bitmap_next_tuple"]
  end

  ESLOOP --> SS
  ESLOOP --> IS
  ESLOOP --> IOS
  ESLOOP --> TS
  ESLOOP --> TRS
  ESLOOP --> BHN

  subgraph PHASE1["Bitmap phase 1 — build (deferred, runs once)"]
    INIT{"node->initialized?"}
    SETUP["BitmapTableScanSetup"]
    MEPN["MultiExecProcNode<br/>(BitmapIndexScan / BitmapAnd / BitmapOr)"]
    TBM["TIDBitmap<br/>exact: 1 bit/TID — lossy: 1 bit/page"]
    SHARED["pstate? → tbm_prepare_shared_iterate<br/>+ BitmapDoneInitializingSharedState"]
    ITER["tbm_begin_iterate<br/>table_beginscan_bm → rs_tbmiterator"]
  end

  subgraph PHASE2["Bitmap phase 2 — heap fetch (per tuple, block-ordered)"]
    NEXT["table_scan_bitmap_next_tuple"]
    RECHK{"node->recheck?<br/>(lossy page)"}
    QUAL["ExecQualAndReset<br/>bitmapqualorig"]
    EMIT["return slot"]
    DROP["InstrCountFiltered2<br/>ExecClearTuple → loop"]
  end

  BHN --> INIT
  INIT -- no --> SETUP
  INIT -- yes --> NEXT
  SETUP --> MEPN --> TBM --> SHARED --> ITER --> NEXT
  NEXT --> RECHK
  RECHK -- no --> EMIT
  RECHK -- yes --> QUAL
  QUAL -- pass --> EMIT
  QUAL -- fail --> DROP --> NEXT

Figure 2 — Every leaf scan node plugs a single accessMtd into the shared ExecScan/ExecScanExtended loop. BitmapHeapNext is unique in that its accessMtd is itself two-phased: a deferred one-time build (BitmapTableScanSetupMultiExecProcNodeTIDBitmap) followed by a block-ordered heap walk (table_scan_bitmap_next_tuple) with per-tuple recheck of bitmapqualorig on lossy pages.

The deferral is what makes the node->initialized guard load-bearing: BitmapTableScanSetup runs lazily on the first BitmapHeapNext call, not at ExecInitBitmapHeapScan time. This mirrors the lazy scan-descriptor pattern used by every other node — an inner bitmap-heap scan re-driven by a nested loop pays the bitmap-build cost only when it is actually pulled from. The recheck branch (node->recheck) is set per-page by table_scan_bitmap_next_tuple: the heap AM raises it whenever the current page came from a lossy bitmap entry, because a lossy entry only records “some tuple on this page matched” and cannot say which. Exact entries carry per-TID precision and need no recheck, so the recheck cost is paid only on the pages that overflowed work_mem into lossy mode. The two instrumentation counters threaded through the call — node->stats.lossy_pages and node->stats.exact_pages — are exactly what EXPLAIN (ANALYZE) surfaces as “Heap Blocks: exact=N lossy=M”, making the exact/lossy split observable at the SQL level.

TidScan handles queries of the form WHERE ctid = '(0,1)' or WHERE ctid = ANY(array_expression) or WHERE CURRENT OF cursor. TidExprListCreate parses the TidScan.tidquals expression list at ExecInitTidScan time, compiling ordinary expressions into ExprState objects and recording CURRENT OF cursor references. TidListEval evaluates those expressions at runtime to produce the ItemPointer list stored in tss_TidList.

// TidNext (excerpt) — src/backend/executor/nodeTidscan.c
static TupleTableSlot *
TidNext(TidScanState *node)
{
/* evaluate tid expressions on first call */
if (node->tss_TidList == NULL)
TidListEval(node);
while (node->tss_TidPtr < node->tss_NumTids)
{
ItemPointerData tid = node->tss_TidList[node->tss_TidPtr++];
/* ... CURRENT OF cursor resolution ... */
if (!table_tuple_fetch_row_version(heapRelation, &tid,
estate->es_snapshot, slot))
continue;
return slot;
}
return ExecClearTuple(slot);
}

TidListEval deduplicates and sorts the TID list so that table_tuple_fetch_row_version calls proceed in physical order, reducing random I/O when the list is large. CURRENT OF resolution calls GetCurrentCID / table_tuple_tid_valid to resolve the cursor’s current position.

TidRangeScan — contiguous ctid range scan

Section titled “TidRangeScan — contiguous ctid range scan”

TidRangeScan is a PG14-era addition for queries of the form WHERE ctid >= '(10,0)' AND ctid < '(20,0)'. The planner recognizes range comparisons on ctid and emits a TidRangeScan plan node. MakeTidOpExpr (called from ExecInitTidRangeScan) parses the comparison operator and direction:

// MakeTidOpExpr (excerpt) — src/backend/executor/nodeTidrangescan.c
switch (expr->opno)
{
case TIDLessEqOperator:
tidopexpr->exprtype = TIDEXPR_UPPER_BOUND;
tidopexpr->inclusive = true; break;
case TIDLessOperator:
tidopexpr->exprtype = TIDEXPR_UPPER_BOUND;
tidopexpr->inclusive = false; break;
/* ... TIDGreater*, invert if needed ... */
}

TidRangeNext calls table_beginscan_tidrange (deferred to first call) and then table_scan_getnextslot_tidrange to retrieve tuples within the bounds. The heap AM uses the TID range to restrict the block range it reads, so this is more efficient than a full sequential scan filtered by a ctid predicate.

SymbolRole
ExecScanThin wrapper; reads epqstate, qual, projInfo from node, delegates to ExecScanExtended.
ExecScanExtendedShared outer loop: calls accessMtd, applies qual, projects, handles EPQ. Defined in execScan.h (inline).
ExecScanReScanClears the scan tuple slot; handles EPQ relsubs_done reset for ForeignScan and CustomScan.
ExecAssignScanProjectionInfoSets ps_ProjInfo to NULL if the target list matches the scan tuple descriptor exactly (no-op projection).
SymbolRole
SeqNextAccess function: lazy table_beginscan, then table_scan_getnextslot.
SeqRecheckAlways true — no index keys to recheck.
ExecSeqScanFast path: no qual, no projection, no EPQ.
ExecSeqScanWithQualFast path: qual present, no projection.
ExecSeqScanWithProjectFast path: projection present, no qual.
ExecSeqScanWithQualProjectFull path: qual + projection.
ExecSeqScanEPQEPQ active path (uses ExecScan, not ExecScanExtended).
ExecInitSeqScanAllocates SeqScanState, opens relation, initializes slot, selects specialized ExecProcNode.
ExecEndSeqScanCalls table_endscan if descriptor open.
ExecReScanSeqScanCalls table_rescan; then ExecScanReScan.
ExecSeqScanEstimatetable_parallelscan_estimateshm_toc_estimate_chunk.
ExecSeqScanInitializeDSMtable_parallelscan_initialize; table_beginscan_parallel.
ExecSeqScanInitializeWorkershm_toc_lookuptable_beginscan_parallel.

nodeIndexscan.c — index-driven heap fetch

Section titled “nodeIndexscan.c — index-driven heap fetch”
SymbolRole
IndexNextAccess function: lazy index_beginscan; index_getnext_slot loop with xs_recheck recheck.
IndexNextWithReorderVariant: fetches into iss_ReorderQueue (pairing heap) for approximate ORDER BY.
ExecIndexScanExecProcNode wrapper calling ExecScan with IndexNext/IndexRecheck.
ExecInitIndexScanAllocates IndexScanState; opens heap + index relations; calls ExecIndexBuildScanKeys.
ExecIndexBuildScanKeysTranslates planner indexqual list to ScanKey[]; separates constant vs. runtime keys.
ExecReScanIndexScanRe-evaluates runtime keys if params changed; calls index_rescan.
ExecEndIndexScanindex_endscan; closes index and (if needed) heap relation.

nodeIndexonlyscan.c — index-only (no heap for all-visible pages)

Section titled “nodeIndexonlyscan.c — index-only (no heap for all-visible pages)”
SymbolRole
IndexOnlyNextindex_getnext_tidVM_ALL_VISIBLE check → optional index_fetch_heapStoreIndexTuple.
StoreIndexTupleUnpacks IndexTuple into TupleTableSlot; handles name-typed column padding.
ExecIndexOnlyScanExecProcNode wrapper; checks recheckqual if xs_recheck.
ExecInitIndexOnlyScanAllocates IndexOnlyScanState; sets xs_want_itup = true; initializes ioss_VMBuffer = InvalidBuffer.
ExecEndIndexOnlyScanindex_endscan; releases ioss_VMBuffer if pinned.

nodeBitmapHeapscan.c — two-phase bitmap scan

Section titled “nodeBitmapHeapscan.c — two-phase bitmap scan”
SymbolRole
BitmapTableScanSetupCalls MultiExecProcNode to build TIDBitmap; opens table_beginscan_bm; initializes parallel shared state.
BitmapHeapNexttable_scan_bitmap_next_tuple loop; rechecks bitmapqualorig on lossy pages.
ExecBitmapHeapScanExecProcNode calling ExecScan with BitmapHeapNext/BitmapHeapRecheck.
ExecInitBitmapHeapScanAllocates BitmapHeapScanState; initializes outer (bitmap index) subplan.
BitmapDoneInitializingSharedStateSpinlock + ConditionVariableBroadcast to release waiting workers after leader builds bitmap.

nodeTidscan.c / nodeTidrangescan.c — ctid-targeted scans

Section titled “nodeTidscan.c / nodeTidrangescan.c — ctid-targeted scans”
SymbolRole
TidExprListCreateParses TidScan.tidquals; compiles to ExprState[]; flags CURRENT OF.
TidListEvalEvaluates TID expressions; deduplicates + sorts; resolves CURRENT OF cursor.
TidNextIterates tss_TidList; calls table_tuple_fetch_row_version per TID.
ExecInitTidScanAllocates TidScanState; calls TidExprListCreate.
MakeTidOpExprParses ctid range operator into TidOpExpr (upper/lower bound, inclusive).
TidRangeNextLazy table_beginscan_tidrange; table_scan_getnextslot_tidrange loop.
ExecInitTidRangeScanAllocates TidRangeScanState; builds TidOpExpr list from plan quals.

Position hints (as of 2026-06-05, commit 273fe94)

Section titled “Position hints (as of 2026-06-05, commit 273fe94)”
SymbolFileLine
ExecScansrc/backend/executor/execScan.c47
ExecScanReScansrc/backend/executor/execScan.c108
ExecAssignScanProjectionInfosrc/backend/executor/execScan.c81
SeqNextsrc/backend/executor/nodeSeqscan.c51
ExecSeqScansrc/backend/executor/nodeSeqscan.c110
ExecInitSeqScansrc/backend/executor/nodeSeqscan.c207
ExecEndSeqScansrc/backend/executor/nodeSeqscan.c289
ExecReScanSeqScansrc/backend/executor/nodeSeqscan.c317
ExecSeqScanEstimatesrc/backend/executor/nodeSeqscan.c343
ExecSeqScanInitializeDSMsrc/backend/executor/nodeSeqscan.c361
ExecSeqScanInitializeWorkersrc/backend/executor/nodeSeqscan.c399
IndexNextsrc/backend/executor/nodeIndexscan.c80
IndexNextWithReordersrc/backend/executor/nodeIndexscan.c169
ExecInitIndexScansrc/backend/executor/nodeIndexscan.c909
ExecIndexBuildScanKeyssrc/backend/executor/nodeIndexscan.c1158
ExecIndexScansrc/backend/executor/nodeIndexscan.c521
IndexOnlyNextsrc/backend/executor/nodeIndexonlyscan.c61
ExecInitIndexOnlyScansrc/backend/executor/nodeIndexonlyscan.c528
BitmapTableScanSetupsrc/backend/executor/nodeBitmapHeapscan.c63
BitmapHeapNextsrc/backend/executor/nodeBitmapHeapscan.c126
BitmapDoneInitializingSharedStatesrc/backend/executor/nodeBitmapHeapscan.c181
BitmapHeapRechecksrc/backend/executor/nodeBitmapHeapscan.c193
ExecBitmapHeapScansrc/backend/executor/nodeBitmapHeapscan.c212
ExecInitBitmapHeapScansrc/backend/executor/nodeBitmapHeapscan.c333
BitmapShouldInitializeSharedStatesrc/backend/executor/nodeBitmapHeapscan.c420
StoreIndexTuplesrc/backend/executor/nodeIndexonlyscan.c269
ExecIndexOnlyScansrc/backend/executor/nodeIndexonlyscan.c337
ExecEndIndexOnlyScansrc/backend/executor/nodeIndexonlyscan.c399
table_scan_bitmap_next_tuplesrc/include/access/tableam.h1932
TidExprListCreatesrc/backend/executor/nodeTidscan.c70
TidListEvalsrc/backend/executor/nodeTidscan.c134
TidNextsrc/backend/executor/nodeTidscan.c312
ExecInitTidScansrc/backend/executor/nodeTidscan.c499
MakeTidOpExprsrc/backend/executor/nodeTidrangescan.c56
TidRangeNextsrc/backend/executor/nodeTidrangescan.c222
ExecInitTidRangeScansrc/backend/executor/nodeTidrangescan.c359
ScanStatesrc/include/nodes/execnodes.h1609
SeqScanStatesrc/include/nodes/execnodes.h1621
IndexScanStatesrc/include/nodes/execnodes.h1698
IndexOnlyScanStatesrc/include/nodes/execnodes.h1750

Verified against REL_18_STABLE, commit 273fe94.

Confirmed:

  • ExecSeqScan specialization into five variants (ExecSeqScan, ExecSeqScanWithQual, ExecSeqScanWithProject, ExecSeqScanWithQualProject, ExecSeqScanEPQ) is present in nodeSeqscan.c exactly as described. This optimization was introduced in PG16/PG17 era; it is present in REL_18.
  • ExecScanExtended is declared as an static inline function in src/include/executor/execScan.h (not in execScan.c). The call in ExecScan at line 59 confirms this.
  • VM_ALL_VISIBLE macro usage in IndexOnlyNext (line 161) and the ioss_VMBuffer field in IndexOnlyScanState are present.
  • BitmapTableScanSetup (renamed from the earlier per-tuple setup approach) and the rs_tbmiterator field assignment at line 115 of nodeBitmapHeapscan.c are confirmed present in REL_18.
  • TidRangeScan node type and table_beginscan_tidrange / table_scan_getnextslot_tidrange AM entry points exist in REL_18.

Unresolved / future drift risk:

  • ExecScanExtended is an inline in the header; its internals (the qual-eval + projection loop) are not shown in the code excerpts above but are referenced by the five SeqScan variant descriptions. The line number for ExecScanExtended is in the header file (execScan.h, around line 138–165) and is excluded from the position-hint table above since the table covers .c files only.
  • iss_NameCStringAttNums / iss_NameCStringCount fields in IndexOnlyScanState are REL_18 additions (name-typed column padding for index-only scans). They exist in execnodes.h but the corresponding StoreIndexTuple logic was not read in full.
  • The parallel bitmap scan shared-state lifecycle (BitmapShouldInitializeSharedState, pstate->tbmiterator DSA pointer) is present but the full parallel path through dsa_area allocation was not traced end-to-end.

Beyond PostgreSQL — Comparative Designs & Research Frontiers

Section titled “Beyond PostgreSQL — Comparative Designs & Research Frontiers”

Oracle and DB2: access-path separation from execution

Section titled “Oracle and DB2: access-path separation from execution”

Oracle’s execution engine separates access path from filter more aggressively than PostgreSQL. A single TABLE ACCESS FULL node in Oracle’s plan carries both the storage-layer read and the filter predicate; there is no separate Filter node above it in many plans. This matches what PostgreSQL does (qual pushed into ExecScan), but Oracle additionally performs storage-index (SI) filtering at the block level via the buffer cache, avoiding even loading a block’s worth of tuples when the block-level predicate is false. PostgreSQL has no equivalent block-level filter; the visibility map covers only all-visible, not predicate filtering.

MySQL/InnoDB: clustered index changes the heap-fetch cost model

Section titled “MySQL/InnoDB: clustered index changes the heap-fetch cost model”

InnoDB stores data in a clustered index (the primary key B-tree holds the full tuple). A secondary-index scan is therefore always a two-step operation: secondary-index lookup → clustered-index fetch (the equivalent of PostgreSQL’s IndexScan heap fetch). But because the clustered index is ordered by primary key, sequential primary-key range scans are always full-sequential reads. PostgreSQL has no clustered index; CLUSTER TABLE ... USING index is a one-time physical sort, not a maintained structure. This makes PostgreSQL’s index-only scan via the visibility map particularly important for covering-index workloads.

Covering indexes and index-only scans across engines

Section titled “Covering indexes and index-only scans across engines”

The visibility-map gate in PostgreSQL’s IndexOnlyScan is a design-specific tradeoff: unlike Oracle (which uses SCN-based row directory in the index) or SQL Server (which stores row versions in the index page itself via the row-versioning store), PostgreSQL stores visibility state per-page in the visibility map. The upside is that VACUUM-driven visibility-map maintenance amortizes the visibility cost across all subsequent index-only scans on those pages; the downside is that freshly-inserted or recently-updated pages require a heap fetch until VACUUM runs and sets the all-visible bit.

Research frontier: late materialization and columnar projection

Section titled “Research frontier: late materialization and columnar projection”

The textbook’s “projection pushdown” principle extends to late materialization in columnar stores (Abadi et al., VLDB 2007 “Column-Stores vs. Row-Stores”): don’t fetch all columns of a tuple at the scan layer; pass only the column identifiers upward and reconstruct the full tuple only when a filter predicate has narrowed the row set. PostgreSQL’s heap AM is a row store, so ExecScan must return a full TupleTableSlot at each step. The TupleTableSlot virtual-tuple mechanism (tts_values[] + tts_isnull[]) partially decouples column materialization from heap fetching, but full late materialization requires a columnar AM. The table_am API (postgres-table-am.md) is the extensibility hook for columnar engines that want to provide their own scan node variants via the custom-scan interface.

The two-phase bitmap scan closely matches what Database Internals (Petrov, ch. 7 “Log-Structured Storage” — discussed in the context of merge reads) calls I/O coalescing: convert a set of random lookups into a sorted sequential pass. PostgreSQL’s TIDBitmap implements exact mode (one bit per TID, cheap recheck) and lossy mode (one bit per page, mandatory recheck), with automatic degradation when work_mem is exhausted. Research systems such as Blink-tree indexes and ARIES-IM extend this to buffer-managed bitmaps shared across concurrent queries; PostgreSQL’s bitmap is per-query and lives in operator memory.

In-tree source files (commit 273fe94, REL_18_STABLE):

  • src/backend/executor/execScan.c — shared outer scan loop
  • src/backend/executor/nodeSeqscan.c — sequential scan
  • src/backend/executor/nodeIndexscan.c — index scan + scan-key build
  • src/backend/executor/nodeIndexonlyscan.c — index-only scan
  • src/backend/executor/nodeBitmapHeapscan.c — bitmap heap scan
  • src/backend/executor/nodeBitmapIndexscan.c — bitmap index scan (subplan)
  • src/backend/executor/nodeTidscan.c — ctid point scan
  • src/backend/executor/nodeTidrangescan.c — ctid range scan
  • src/include/nodes/execnodes.hScanState, SeqScanState, IndexScanState, IndexOnlyScanState, BitmapHeapScanState, TidScanState, TidRangeScanState
  • src/include/executor/execScan.hExecScanExtended inline, ExecScanAccessMtd, ExecScanRecheckMtd

Cross-references (knowledge base):

  • postgres-executor.md — Volcano iterator model, TupleTableSlot, ExecProcNode dispatch
  • postgres-table-am.mdtable_beginscan, table_scan_getnextslot, table_tuple_fetch_row_version, table_beginscan_tidrange
  • postgres-nbtree.mdindex_beginscan, index_getnext_slot, scan-key semantics
  • postgres-heap-am.mdtable_beginscan_bm, table_scan_bitmap_next_tuple
  • postgres-mvcc-snapshots.md — visibility checks, snapshot semantics
  • postgres-buffer-manager.md — buffer pin and visibility map buffer management
  • postgres-path-generation.md — planner path types that produce these plan nodes

Textbook references:

  • Database System Concepts, Silberschatz, 7e, §12.3 (file scan), §15.3 (index scan), §15.3.3 (bitmap scan)
  • Database Internals, Petrov, ch. 7 (I/O coalescing, read amplification)