PostgreSQL Scan Nodes — Sequential, Index, Bitmap, and TID Scans
Contents:
- Theoretical Background
- Common DBMS Design
- PostgreSQL’s Approach
- Source Walkthrough
- Source verification (as of 2026-06-05)
- Beyond PostgreSQL — Comparative Designs & Research Frontiers
- Sources
Theoretical Background
Section titled “Theoretical Background”Leaf scan nodes are where the Volcano iterator model meets physical storage. Every query plan tree has scans at its leaves — the nodes that actually read tuples from a relation — and every other node (join, sort, aggregate) sits above them, pulling tuples upward on demand. Database System Concepts (Silberschatz, 7e, ch. 15 “Query Processing”) distinguishes scan strategies by the path they take through the data:
- Full table scan. The engine reads every page of the relation regardless of which tuples are needed, then filters rows with the selection predicate. The cost is proportional to the number of blocks in the relation and is independent of the selectivity of the predicate. For low-selectivity predicates over large relations, this is often the cheapest path because storage I/O is sequential and predictable (§12.3).
- Index scan. The engine first traverses an index ordered by the search key, retrieving only the record identifiers (RIDs) for tuples that satisfy the index predicate, then fetches each heap tuple by its RID. The cost has two components: the index traversal cost and the heap fetch cost. When the index is highly selective and the relation is large, the heap fetch cost dominates because each RID may point to a different disk block (random I/O). When the index is weakly selective, the total random I/O cost can exceed a full scan.
- Index-only scan. If the query projects only attributes that are stored in the index (a covering index), the engine can satisfy the query entirely from the index without touching the heap. The textbook calls this a covering-index scan or key-only scan. The cost is the index traversal cost alone (§15.3.1).
The textbook also motivates the bitmap-based scan, sometimes called multiple-index scan (§15.3.3): when several indexes each cover part of the predicate, the engine evaluates them independently, forms the intersection (AND) or union (OR) of the result bitmaps, and then fetches only the heap pages that appear in the final bitmap. The key insight is that sorting the heap accesses by block number converts a large number of random I/Os into a mostly sequential pass. Modern commercial engines use this technique under names like “index merge” (MySQL) or “bitmap AND/OR” (Oracle, PostgreSQL).
Two design axes govern every real scan implementation:
-
Predicate evaluation placement. The engine can push predicates down into the scan (evaluated per-tuple before projecting), or evaluate them above the scan at the join/filter layer. PostgreSQL pushes qual evaluation into
ExecScanso that unqualified tuples never leave the scan node — only qualifying tuples travel up the plan tree. -
Visibility gating. A tuple stored in the heap may be invisible to the current snapshot even if the index points to it (see
postgres-mvcc-snapshots.md). The scan must apply the MVCC visibility check as part of tuple retrieval. This check can be deferred (heap fetch followed by visibility check) or bypassed (visibility map, discussed below).
Common DBMS Design
Section titled “Common DBMS Design”Almost every production engine converges on the same structural choices for scan nodes. Understanding this shared engineering context makes PostgreSQL’s specific dials easier to read.
A two-function access-method interface
Section titled “A two-function access-method interface”Scans are expressed as a pair of AM callbacks: an access function
that returns the next raw tuple, and a recheck function that
re-evaluates index quals on the heap tuple when the index is
lossy (i.e., it stored a page-level predicate rather than a
precise per-tuple predicate). PostgreSQL names these
ExecScanAccessMtd and ExecScanRecheckMtd. The outer loop —
qual evaluation, projection, EPQ handling — is factored into a
shared ExecScan/ExecScanExtended function shared by all six node
types. Only the access function and the recheck function differ per
node type.
Lazy scan-descriptor initialization
Section titled “Lazy scan-descriptor initialization”All engines defer opening a scan descriptor until the first next()
call. This matters because plans are built before execution, and some
plan nodes (e.g., an inner side of a nested loop) are re-opened on
every outer row. Opening the scan descriptor on Init would waste
work for never-executed branches. PostgreSQL checks
ss_currentScanDesc == NULL on the first call to the access function
and opens the descriptor there.
Parallel scan as a DSM extension
Section titled “Parallel scan as a DSM extension”Parallel table scans (both sequential and index) divide work via a
shared scan descriptor allocated in dynamic shared memory (DSM).
The convention across all PostgreSQL scan node types is a four-function
protocol: Estimate (size the DSM region), InitializeDSM (leader
sets up the shared descriptor), ReInitializeDSM (reset for a
rescan), InitializeWorker (worker attaches to the shared descriptor).
The actual work-division logic lives in the AM (table_parallelscan_*
and index_parallelscan_* families).
Qual specialization by presence of conditions
Section titled “Qual specialization by presence of conditions”A scan with no qual and no projection is trivially faster than one
with both — the tight inner loop does nothing but fetch the next
tuple and return it. PostgreSQL’s ExecInitSeqScan sets
ExecProcNode to one of five specialized variants:
ExecSeqScan, ExecSeqScanWithQual, ExecSeqScanWithProject,
ExecSeqScanWithQualProject, or ExecSeqScanEPQ. Each variant
passes const-NULL arguments for the absent components, which the
compiler uses to eliminate dead branches in the common fast path.
Bitmap scan as a two-phase operation
Section titled “Bitmap scan as a two-phase operation”Bitmap scans split into two distinct plan nodes: a
BitmapIndexScan (or a BitmapAnd/BitmapOr tree of them) runs
as the outer subplan, returning a TIDBitmap via MultiExecProcNode
rather than individual tuples. The BitmapHeapScan node then
iterates over the bitmap, grouping accesses by heap block number to
amortize random I/O. The bitmap itself is in-memory when small
(exact mode, one bit per TID) and degrades gracefully to page-level
lossy mode when it overflows work_mem.
PostgreSQL’s Approach
Section titled “PostgreSQL’s Approach”PostgreSQL provides six leaf scan node types, all wired into the
shared ExecScan framework. The nodes split cleanly into three
families:
flowchart TD ES["ExecScan / ExecScanExtended<br/>qual eval + projection + EPQ"] SS["SeqScan<br/>table_beginscan + table_scan_getnextslot"] IS["IndexScan<br/>index_beginscan + index_getnext_slot<br/>+ heap fetch"] IOS["IndexOnlyScan<br/>index_beginscan xs_want_itup=true<br/>VM check → skip heap if all-visible"] BHS["BitmapHeapScan<br/>MultiExecProcNode → TIDBitmap<br/>table_scan_bitmap_next_tuple"] BIS["BitmapIndexScan<br/>MultiExecProcNode returns TIDBitmap"] TS["TidScan<br/>table_tuple_fetch_row_version<br/>on explicit ctid list"] TRS["TidRangeScan<br/>table_beginscan_tidrange<br/>on ctid range bounds"] ES --> SS ES --> IS ES --> IOS ES --> BHS BHS --> BIS ES --> TS ES --> TRS
Figure 1 — Six scan node types share the ExecScan outer loop. BitmapHeapScan’s inner subplan is a BitmapIndexScan (or an And/Or tree of them) that returns a TIDBitmap via MultiExecProcNode.
Shared scan loop: ExecScan and ExecScanExtended
Section titled “Shared scan loop: ExecScan and ExecScanExtended”ExecScan is a thin wrapper that reads the epqstate, qual, and
projInfo from the node and delegates to ExecScanExtended. The
extended form accepts them as explicit parameters so that the
specialized SeqScan variants can pass compile-time NULL constants.
// ExecScan — src/backend/executor/execScan.cTupleTableSlot *ExecScan(ScanState *node, ExecScanAccessMtd accessMtd, ExecScanRecheckMtd recheckMtd){ EPQState *epqstate = node->ps.state->es_epq_active; ExprState *qual = node->ps.qual; ProjectionInfo *projInfo = node->ps.ps_ProjInfo;
return ExecScanExtended(node, accessMtd, recheckMtd, epqstate, qual, projInfo);}The shared loop inside ExecScanExtended calls accessMtd to get
the next raw tuple, applies the qual (if non-NULL), projects (if
non-NULL), and handles EPQ re-evaluation — none of this logic is
duplicated per scan node.
SeqScan — full table scan via the table AM
Section titled “SeqScan — full table scan via the table AM”SeqScan is the simplest node. SeqNext opens a TableScanDesc
on the first call via table_beginscan, then calls
table_scan_getnextslot in a direction determined by
estate->es_direction.
// SeqNext — src/backend/executor/nodeSeqscan.cstatic TupleTableSlot *SeqNext(SeqScanState *node){ TableScanDesc scandesc = node->ss.ss_currentScanDesc; EState *estate = node->ss.ps.state; ScanDirection direction = estate->es_direction; TupleTableSlot *slot = node->ss.ss_ScanTupleSlot;
if (scandesc == NULL) { scandesc = table_beginscan(node->ss.ss_currentRelation, estate->es_snapshot, 0, NULL); node->ss.ss_currentScanDesc = scandesc; } if (table_scan_getnextslot(scandesc, direction, slot)) return slot; return NULL;}The recheck function SeqRecheck always returns true — sequential
scans never use index keys, so there is nothing to recheck.
ExecInitSeqScan selects one of five specialized ExecProcNode
variants based on the presence of qual, projection, and EPQ:
// ExecInitSeqScan (excerpt) — src/backend/executor/nodeSeqscan.cif (scanstate->ss.ps.state->es_epq_active != NULL) scanstate->ss.ps.ExecProcNode = ExecSeqScanEPQ;else if (scanstate->ss.ps.qual == NULL){ if (scanstate->ss.ps.ps_ProjInfo == NULL) scanstate->ss.ps.ExecProcNode = ExecSeqScan; else scanstate->ss.ps.ExecProcNode = ExecSeqScanWithProject;}else{ if (scanstate->ss.ps.ps_ProjInfo == NULL) scanstate->ss.ps.ExecProcNode = ExecSeqScanWithQual; else scanstate->ss.ps.ExecProcNode = ExecSeqScanWithQualProject;}Each variant passes NULL for absent components:
ExecSeqScan passes NULL, NULL, NULL for epqstate, qual,
projInfo — the compiler eliminates the corresponding branches in
ExecScanExtended.
Parallel support is the standard four-function protocol:
ExecSeqScanEstimate sizes the DSM region via
table_parallelscan_estimate; ExecSeqScanInitializeDSM allocates
the shared descriptor and calls table_parallelscan_initialize;
ExecSeqScanInitializeWorker attaches via table_beginscan_parallel.
IndexScan — index-driven heap fetch
Section titled “IndexScan — index-driven heap fetch”IndexNext opens an IndexScanDesc via index_beginscan (lazy,
first call), applies any runtime scan keys, then loops on
index_getnext_slot. Each call to index_getnext_slot returns the
index tuple and sets xs_recheck if the index is lossy:
// IndexNext (excerpt) — src/backend/executor/nodeIndexscan.cwhile (index_getnext_slot(scandesc, direction, slot)){ CHECK_FOR_INTERRUPTS(); if (scandesc->xs_recheck) { econtext->ecxt_scantuple = slot; if (!ExecQualAndReset(node->indexqualorig, econtext)) { InstrCountFiltered2(node, 1); continue; } } return slot;}node->iss_ReachedEnd = true;return ExecClearTuple(slot);index_getnext_slot fetches the heap tuple at the TID returned by
the index AM. The visibility check is performed inside the heap AM
using the query snapshot (estate->es_snapshot). The xs_recheck
flag is set by lossy index AMs (GIN, GiST when the index predicate
does not store full per-tuple precision) to trigger re-evaluation of
the original index qual against the fetched heap tuple.
IndexNextWithReorder handles the case where the query includes
ORDER BY expressions on index operators: it fetches tuples into a
pairing-heap reorder queue (iss_ReorderQueue) and emits them in
order, re-checking distance values when the index returns approximate
ordering.
ExecIndexBuildScanKeys (line 1158) translates the planner’s
IndexScan.indexqual list into the ScanKey array that the index
AM expects. Runtime keys — quals whose right-hand side must be
evaluated at execution time (e.g., WHERE col = $1) — are separated
into iss_RuntimeKeys and evaluated via ExecIndexEvalRuntimeKeys
before the first index_rescan.
IndexOnlyScan — heap bypass via visibility map
Section titled “IndexOnlyScan — heap bypass via visibility map”IndexOnlyScan is IndexScan with one structural addition:
ioss_ScanDesc->xs_want_itup = true tells the index AM to return
the index tuple rather than fetching from the heap. Before returning
the tuple, IndexOnlyNext consults the visibility map to decide
whether a heap fetch is still necessary:
// IndexOnlyNext (excerpt) — src/backend/executor/nodeIndexonlyscan.cwhile ((tid = index_getnext_tid(scandesc, direction)) != NULL){ bool tuple_from_heap = false; CHECK_FOR_INTERRUPTS();
if (!VM_ALL_VISIBLE(scandesc->heapRelation, ItemPointerGetBlockNumber(tid), &node->ioss_VMBuffer)) { /* must visit heap to check visibility */ InstrCountTuples2(node, 1); if (!index_fetch_heap(scandesc, node->ioss_TableSlot)) continue; /* no visible tuple */ ExecClearTuple(node->ioss_TableSlot); tuple_from_heap = true; }
/* fill slot from index tuple (xs_hitup preferred over xs_itup) */ if (scandesc->xs_hitup) ExecForceStoreHeapTuple(scandesc->xs_hitup, slot, false); else if (scandesc->xs_itup) StoreIndexTuple(node, slot, scandesc->xs_itup, scandesc->xs_itupdesc); else elog(ERROR, "no data returned for index-only scan"); ...}If VM_ALL_VISIBLE is set for the heap page containing the TID,
the scan skips the heap fetch entirely. If not, it calls
index_fetch_heap to obtain the heap tuple (to verify visibility)
but still returns the index tuple as the data source. This means
the heap fetch is used only for its visibility side-effect, not to
carry column data. The ioss_VMBuffer pin persists across calls to
amortize repeated visibility-map lookups for the same page.
StoreIndexTuple unpacks the index tuple into the scan slot’s
attribute array. For indexes on name-typed columns, the tuple
must be padded to NAMEDATALEN; ioss_NameCStringAttNums tracks
those column numbers.
BitmapHeapScan — two-phase bitmap-driven heap scan
Section titled “BitmapHeapScan — two-phase bitmap-driven heap scan”BitmapHeapScan executes as two plan nodes cooperating through
MultiExecProcNode. The outer subplan is a BitmapIndexScan (or a
tree of BitmapAnd/BitmapOr nodes) that returns a TIDBitmap
without going through the standard tuple interface. BitmapHeapScan
calls MultiExecProcNode exactly once (deferred to the first
BitmapHeapNext call via BitmapTableScanSetup) to build the
bitmap, then hands it to the heap AM for page-ordered iteration.
// BitmapTableScanSetup (excerpt) — src/backend/executor/nodeBitmapHeapscan.cstatic voidBitmapTableScanSetup(BitmapHeapScanState *node){ /* serial path: build TIDBitmap from subplan */ node->tbm = (TIDBitmap *) MultiExecProcNode(outerPlanState(node)); if (!node->tbm || !IsA(node->tbm, TIDBitmap)) elog(ERROR, "unrecognized result from subplan");
tbmiterator = tbm_begin_iterate(node->tbm, dsa, InvalidDsaPointer);
/* open heap scan descriptor for bitmap-driven access */ if (!node->ss.ss_currentScanDesc) node->ss.ss_currentScanDesc = table_beginscan_bm(node->ss.ss_currentRelation, node->ss.ps.state->es_snapshot, 0, NULL);
node->ss.ss_currentScanDesc->st.rs_tbmiterator = tbmiterator; node->initialized = true;}BitmapHeapNext then calls table_scan_bitmap_next_tuple in a
loop. The heap AM iterates over the bitmap in block order. When the
bitmap contains a lossy page entry (the TIDBitmap overflowed
work_mem and fell back to page-level granularity), node->recheck
is set to true and BitmapHeapNext re-evaluates
bitmapqualorig against each tuple on that page:
// BitmapHeapNext (excerpt) — src/backend/executor/nodeBitmapHeapscan.cwhile (table_scan_bitmap_next_tuple(node->ss.ss_currentScanDesc, slot, &node->recheck, ...)){ CHECK_FOR_INTERRUPTS(); if (node->recheck) { econtext->ecxt_scantuple = slot; if (!ExecQualAndReset(node->bitmapqualorig, econtext)) { InstrCountFiltered2(node, 1); ExecClearTuple(slot); continue; } } return slot;}return ExecClearTuple(slot);For parallel bitmap scans, the leader calls
tbm_prepare_shared_iterate to publish the iterator DSA pointer in
pstate->tbmiterator; workers attach via tbm_begin_iterate with
that pointer. The BitmapShouldInitializeSharedState / BitmapDoneInitializingSharedState pair
uses a spinlock + condition variable to ensure only the first
worker builds the bitmap.
The two-phase structure deserves a closer look because it is the
single most important architectural difference between
BitmapHeapScan and the index-driven IndexScan. In an
IndexScan, the index traversal and the heap fetch are interleaved
tuple-by-tuple: fetch one TID, fetch its heap tuple, return it,
repeat. Each heap fetch can land on an arbitrary block, so for a
weakly-selective predicate the heap access pattern is effectively
random. BitmapHeapScan breaks that interleaving by inserting a
materialization barrier — the TIDBitmap — between the index pass
and the heap pass. The first phase drains the entire
BitmapIndexScan subplan via a single MultiExecProcNode call,
accumulating every matching TID into the bitmap. Only after the
bitmap is complete does the second phase begin, walking the heap in
strictly ascending block-number order. This is the I/O-coalescing
win: the same set of heap blocks that an IndexScan would visit in
random order is now visited sequentially, and each block is read at
most once even if many matching tuples live on it.
flowchart TD
subgraph EXECSCAN["ExecScan / ExecScanExtended — shared outer loop"]
ESLOOP["accessMtd → qual → project → EPQ"]
end
subgraph FAMILY["Scan-node family — one accessMtd each"]
SS["SeqNext<br/>table_scan_getnextslot"]
IS["IndexNext<br/>index_getnext_slot + heap fetch"]
IOS["IndexOnlyNext<br/>index_getnext_tid + VM_ALL_VISIBLE gate"]
TS["TidNext<br/>table_tuple_fetch_row_version"]
TRS["TidRangeNext<br/>table_scan_getnextslot_tidrange"]
BHN["BitmapHeapNext<br/>table_scan_bitmap_next_tuple"]
end
ESLOOP --> SS
ESLOOP --> IS
ESLOOP --> IOS
ESLOOP --> TS
ESLOOP --> TRS
ESLOOP --> BHN
subgraph PHASE1["Bitmap phase 1 — build (deferred, runs once)"]
INIT{"node->initialized?"}
SETUP["BitmapTableScanSetup"]
MEPN["MultiExecProcNode<br/>(BitmapIndexScan / BitmapAnd / BitmapOr)"]
TBM["TIDBitmap<br/>exact: 1 bit/TID — lossy: 1 bit/page"]
SHARED["pstate? → tbm_prepare_shared_iterate<br/>+ BitmapDoneInitializingSharedState"]
ITER["tbm_begin_iterate<br/>table_beginscan_bm → rs_tbmiterator"]
end
subgraph PHASE2["Bitmap phase 2 — heap fetch (per tuple, block-ordered)"]
NEXT["table_scan_bitmap_next_tuple"]
RECHK{"node->recheck?<br/>(lossy page)"}
QUAL["ExecQualAndReset<br/>bitmapqualorig"]
EMIT["return slot"]
DROP["InstrCountFiltered2<br/>ExecClearTuple → loop"]
end
BHN --> INIT
INIT -- no --> SETUP
INIT -- yes --> NEXT
SETUP --> MEPN --> TBM --> SHARED --> ITER --> NEXT
NEXT --> RECHK
RECHK -- no --> EMIT
RECHK -- yes --> QUAL
QUAL -- pass --> EMIT
QUAL -- fail --> DROP --> NEXT
Figure 2 — Every leaf scan node plugs a single accessMtd into the
shared ExecScan/ExecScanExtended loop. BitmapHeapNext is unique
in that its accessMtd is itself two-phased: a deferred one-time
build (BitmapTableScanSetup → MultiExecProcNode → TIDBitmap)
followed by a block-ordered heap walk (table_scan_bitmap_next_tuple)
with per-tuple recheck of bitmapqualorig on lossy pages.
The deferral is what makes the node->initialized guard load-bearing:
BitmapTableScanSetup runs lazily on the first BitmapHeapNext call,
not at ExecInitBitmapHeapScan time. This mirrors the lazy
scan-descriptor pattern used by every other node — an inner
bitmap-heap scan re-driven by a nested loop pays the bitmap-build cost
only when it is actually pulled from. The recheck branch
(node->recheck) is set per-page by table_scan_bitmap_next_tuple:
the heap AM raises it whenever the current page came from a lossy
bitmap entry, because a lossy entry only records “some tuple on this
page matched” and cannot say which. Exact entries carry per-TID
precision and need no recheck, so the recheck cost is paid only on the
pages that overflowed work_mem into lossy mode. The two
instrumentation counters threaded through the call —
node->stats.lossy_pages and node->stats.exact_pages — are exactly
what EXPLAIN (ANALYZE) surfaces as “Heap Blocks: exact=N lossy=M”,
making the exact/lossy split observable at the SQL level.
TidScan — point lookup by ctid
Section titled “TidScan — point lookup by ctid”TidScan handles queries of the form WHERE ctid = '(0,1)' or
WHERE ctid = ANY(array_expression) or WHERE CURRENT OF cursor.
TidExprListCreate parses the TidScan.tidquals expression list
at ExecInitTidScan time, compiling ordinary expressions into
ExprState objects and recording CURRENT OF cursor references.
TidListEval evaluates those expressions at runtime to produce
the ItemPointer list stored in tss_TidList.
// TidNext (excerpt) — src/backend/executor/nodeTidscan.cstatic TupleTableSlot *TidNext(TidScanState *node){ /* evaluate tid expressions on first call */ if (node->tss_TidList == NULL) TidListEval(node);
while (node->tss_TidPtr < node->tss_NumTids) { ItemPointerData tid = node->tss_TidList[node->tss_TidPtr++]; /* ... CURRENT OF cursor resolution ... */ if (!table_tuple_fetch_row_version(heapRelation, &tid, estate->es_snapshot, slot)) continue; return slot; } return ExecClearTuple(slot);}TidListEval deduplicates and sorts the TID list so that
table_tuple_fetch_row_version calls proceed in physical order,
reducing random I/O when the list is large. CURRENT OF resolution
calls GetCurrentCID / table_tuple_tid_valid to resolve the
cursor’s current position.
TidRangeScan — contiguous ctid range scan
Section titled “TidRangeScan — contiguous ctid range scan”TidRangeScan is a PG14-era addition for queries of the form
WHERE ctid >= '(10,0)' AND ctid < '(20,0)'. The planner
recognizes range comparisons on ctid and emits a TidRangeScan
plan node. MakeTidOpExpr (called from ExecInitTidRangeScan)
parses the comparison operator and direction:
// MakeTidOpExpr (excerpt) — src/backend/executor/nodeTidrangescan.cswitch (expr->opno){ case TIDLessEqOperator: tidopexpr->exprtype = TIDEXPR_UPPER_BOUND; tidopexpr->inclusive = true; break; case TIDLessOperator: tidopexpr->exprtype = TIDEXPR_UPPER_BOUND; tidopexpr->inclusive = false; break; /* ... TIDGreater*, invert if needed ... */}TidRangeNext calls table_beginscan_tidrange (deferred to first
call) and then table_scan_getnextslot_tidrange to retrieve tuples
within the bounds. The heap AM uses the TID range to restrict the
block range it reads, so this is more efficient than a full sequential
scan filtered by a ctid predicate.
Source Walkthrough
Section titled “Source Walkthrough”execScan.c — shared outer loop
Section titled “execScan.c — shared outer loop”| Symbol | Role |
|---|---|
ExecScan | Thin wrapper; reads epqstate, qual, projInfo from node, delegates to ExecScanExtended. |
ExecScanExtended | Shared outer loop: calls accessMtd, applies qual, projects, handles EPQ. Defined in execScan.h (inline). |
ExecScanReScan | Clears the scan tuple slot; handles EPQ relsubs_done reset for ForeignScan and CustomScan. |
ExecAssignScanProjectionInfo | Sets ps_ProjInfo to NULL if the target list matches the scan tuple descriptor exactly (no-op projection). |
nodeSeqscan.c — sequential scan
Section titled “nodeSeqscan.c — sequential scan”| Symbol | Role |
|---|---|
SeqNext | Access function: lazy table_beginscan, then table_scan_getnextslot. |
SeqRecheck | Always true — no index keys to recheck. |
ExecSeqScan | Fast path: no qual, no projection, no EPQ. |
ExecSeqScanWithQual | Fast path: qual present, no projection. |
ExecSeqScanWithProject | Fast path: projection present, no qual. |
ExecSeqScanWithQualProject | Full path: qual + projection. |
ExecSeqScanEPQ | EPQ active path (uses ExecScan, not ExecScanExtended). |
ExecInitSeqScan | Allocates SeqScanState, opens relation, initializes slot, selects specialized ExecProcNode. |
ExecEndSeqScan | Calls table_endscan if descriptor open. |
ExecReScanSeqScan | Calls table_rescan; then ExecScanReScan. |
ExecSeqScanEstimate | table_parallelscan_estimate → shm_toc_estimate_chunk. |
ExecSeqScanInitializeDSM | table_parallelscan_initialize; table_beginscan_parallel. |
ExecSeqScanInitializeWorker | shm_toc_lookup → table_beginscan_parallel. |
nodeIndexscan.c — index-driven heap fetch
Section titled “nodeIndexscan.c — index-driven heap fetch”| Symbol | Role |
|---|---|
IndexNext | Access function: lazy index_beginscan; index_getnext_slot loop with xs_recheck recheck. |
IndexNextWithReorder | Variant: fetches into iss_ReorderQueue (pairing heap) for approximate ORDER BY. |
ExecIndexScan | ExecProcNode wrapper calling ExecScan with IndexNext/IndexRecheck. |
ExecInitIndexScan | Allocates IndexScanState; opens heap + index relations; calls ExecIndexBuildScanKeys. |
ExecIndexBuildScanKeys | Translates planner indexqual list to ScanKey[]; separates constant vs. runtime keys. |
ExecReScanIndexScan | Re-evaluates runtime keys if params changed; calls index_rescan. |
ExecEndIndexScan | index_endscan; closes index and (if needed) heap relation. |
nodeIndexonlyscan.c — index-only (no heap for all-visible pages)
Section titled “nodeIndexonlyscan.c — index-only (no heap for all-visible pages)”| Symbol | Role |
|---|---|
IndexOnlyNext | index_getnext_tid → VM_ALL_VISIBLE check → optional index_fetch_heap → StoreIndexTuple. |
StoreIndexTuple | Unpacks IndexTuple into TupleTableSlot; handles name-typed column padding. |
ExecIndexOnlyScan | ExecProcNode wrapper; checks recheckqual if xs_recheck. |
ExecInitIndexOnlyScan | Allocates IndexOnlyScanState; sets xs_want_itup = true; initializes ioss_VMBuffer = InvalidBuffer. |
ExecEndIndexOnlyScan | index_endscan; releases ioss_VMBuffer if pinned. |
nodeBitmapHeapscan.c — two-phase bitmap scan
Section titled “nodeBitmapHeapscan.c — two-phase bitmap scan”| Symbol | Role |
|---|---|
BitmapTableScanSetup | Calls MultiExecProcNode to build TIDBitmap; opens table_beginscan_bm; initializes parallel shared state. |
BitmapHeapNext | table_scan_bitmap_next_tuple loop; rechecks bitmapqualorig on lossy pages. |
ExecBitmapHeapScan | ExecProcNode calling ExecScan with BitmapHeapNext/BitmapHeapRecheck. |
ExecInitBitmapHeapScan | Allocates BitmapHeapScanState; initializes outer (bitmap index) subplan. |
BitmapDoneInitializingSharedState | Spinlock + ConditionVariableBroadcast to release waiting workers after leader builds bitmap. |
nodeTidscan.c / nodeTidrangescan.c — ctid-targeted scans
Section titled “nodeTidscan.c / nodeTidrangescan.c — ctid-targeted scans”| Symbol | Role |
|---|---|
TidExprListCreate | Parses TidScan.tidquals; compiles to ExprState[]; flags CURRENT OF. |
TidListEval | Evaluates TID expressions; deduplicates + sorts; resolves CURRENT OF cursor. |
TidNext | Iterates tss_TidList; calls table_tuple_fetch_row_version per TID. |
ExecInitTidScan | Allocates TidScanState; calls TidExprListCreate. |
MakeTidOpExpr | Parses ctid range operator into TidOpExpr (upper/lower bound, inclusive). |
TidRangeNext | Lazy table_beginscan_tidrange; table_scan_getnextslot_tidrange loop. |
ExecInitTidRangeScan | Allocates TidRangeScanState; builds TidOpExpr list from plan quals. |
Position hints (as of 2026-06-05, commit 273fe94)
Section titled “Position hints (as of 2026-06-05, commit 273fe94)”| Symbol | File | Line |
|---|---|---|
ExecScan | src/backend/executor/execScan.c | 47 |
ExecScanReScan | src/backend/executor/execScan.c | 108 |
ExecAssignScanProjectionInfo | src/backend/executor/execScan.c | 81 |
SeqNext | src/backend/executor/nodeSeqscan.c | 51 |
ExecSeqScan | src/backend/executor/nodeSeqscan.c | 110 |
ExecInitSeqScan | src/backend/executor/nodeSeqscan.c | 207 |
ExecEndSeqScan | src/backend/executor/nodeSeqscan.c | 289 |
ExecReScanSeqScan | src/backend/executor/nodeSeqscan.c | 317 |
ExecSeqScanEstimate | src/backend/executor/nodeSeqscan.c | 343 |
ExecSeqScanInitializeDSM | src/backend/executor/nodeSeqscan.c | 361 |
ExecSeqScanInitializeWorker | src/backend/executor/nodeSeqscan.c | 399 |
IndexNext | src/backend/executor/nodeIndexscan.c | 80 |
IndexNextWithReorder | src/backend/executor/nodeIndexscan.c | 169 |
ExecInitIndexScan | src/backend/executor/nodeIndexscan.c | 909 |
ExecIndexBuildScanKeys | src/backend/executor/nodeIndexscan.c | 1158 |
ExecIndexScan | src/backend/executor/nodeIndexscan.c | 521 |
IndexOnlyNext | src/backend/executor/nodeIndexonlyscan.c | 61 |
ExecInitIndexOnlyScan | src/backend/executor/nodeIndexonlyscan.c | 528 |
BitmapTableScanSetup | src/backend/executor/nodeBitmapHeapscan.c | 63 |
BitmapHeapNext | src/backend/executor/nodeBitmapHeapscan.c | 126 |
BitmapDoneInitializingSharedState | src/backend/executor/nodeBitmapHeapscan.c | 181 |
BitmapHeapRecheck | src/backend/executor/nodeBitmapHeapscan.c | 193 |
ExecBitmapHeapScan | src/backend/executor/nodeBitmapHeapscan.c | 212 |
ExecInitBitmapHeapScan | src/backend/executor/nodeBitmapHeapscan.c | 333 |
BitmapShouldInitializeSharedState | src/backend/executor/nodeBitmapHeapscan.c | 420 |
StoreIndexTuple | src/backend/executor/nodeIndexonlyscan.c | 269 |
ExecIndexOnlyScan | src/backend/executor/nodeIndexonlyscan.c | 337 |
ExecEndIndexOnlyScan | src/backend/executor/nodeIndexonlyscan.c | 399 |
table_scan_bitmap_next_tuple | src/include/access/tableam.h | 1932 |
TidExprListCreate | src/backend/executor/nodeTidscan.c | 70 |
TidListEval | src/backend/executor/nodeTidscan.c | 134 |
TidNext | src/backend/executor/nodeTidscan.c | 312 |
ExecInitTidScan | src/backend/executor/nodeTidscan.c | 499 |
MakeTidOpExpr | src/backend/executor/nodeTidrangescan.c | 56 |
TidRangeNext | src/backend/executor/nodeTidrangescan.c | 222 |
ExecInitTidRangeScan | src/backend/executor/nodeTidrangescan.c | 359 |
ScanState | src/include/nodes/execnodes.h | 1609 |
SeqScanState | src/include/nodes/execnodes.h | 1621 |
IndexScanState | src/include/nodes/execnodes.h | 1698 |
IndexOnlyScanState | src/include/nodes/execnodes.h | 1750 |
Source verification (as of 2026-06-05)
Section titled “Source verification (as of 2026-06-05)”Verified against REL_18_STABLE, commit 273fe94.
Confirmed:
ExecSeqScanspecialization into five variants (ExecSeqScan,ExecSeqScanWithQual,ExecSeqScanWithProject,ExecSeqScanWithQualProject,ExecSeqScanEPQ) is present innodeSeqscan.cexactly as described. This optimization was introduced in PG16/PG17 era; it is present in REL_18.ExecScanExtendedis declared as anstatic inlinefunction insrc/include/executor/execScan.h(not inexecScan.c). The call inExecScanat line 59 confirms this.VM_ALL_VISIBLEmacro usage inIndexOnlyNext(line 161) and theioss_VMBufferfield inIndexOnlyScanStateare present.BitmapTableScanSetup(renamed from the earlier per-tuple setup approach) and thers_tbmiteratorfield assignment at line 115 ofnodeBitmapHeapscan.care confirmed present in REL_18.TidRangeScannode type andtable_beginscan_tidrange/table_scan_getnextslot_tidrangeAM entry points exist in REL_18.
Unresolved / future drift risk:
ExecScanExtendedis an inline in the header; its internals (the qual-eval + projection loop) are not shown in the code excerpts above but are referenced by the five SeqScan variant descriptions. The line number forExecScanExtendedis in the header file (execScan.h, around line 138–165) and is excluded from the position-hint table above since the table covers.cfiles only.iss_NameCStringAttNums/iss_NameCStringCountfields inIndexOnlyScanStateare REL_18 additions (name-typed column padding for index-only scans). They exist inexecnodes.hbut the correspondingStoreIndexTuplelogic was not read in full.- The parallel bitmap scan shared-state lifecycle
(
BitmapShouldInitializeSharedState,pstate->tbmiteratorDSA pointer) is present but the full parallel path throughdsa_areaallocation was not traced end-to-end.
Beyond PostgreSQL — Comparative Designs & Research Frontiers
Section titled “Beyond PostgreSQL — Comparative Designs & Research Frontiers”Oracle and DB2: access-path separation from execution
Section titled “Oracle and DB2: access-path separation from execution”Oracle’s execution engine separates access path from filter
more aggressively than PostgreSQL. A single TABLE ACCESS FULL node
in Oracle’s plan carries both the storage-layer read and the filter
predicate; there is no separate Filter node above it in many plans.
This matches what PostgreSQL does (qual pushed into ExecScan), but
Oracle additionally performs storage-index (SI) filtering at the
block level via the buffer cache, avoiding even loading a block’s
worth of tuples when the block-level predicate is false. PostgreSQL
has no equivalent block-level filter; the visibility map covers only
all-visible, not predicate filtering.
MySQL/InnoDB: clustered index changes the heap-fetch cost model
Section titled “MySQL/InnoDB: clustered index changes the heap-fetch cost model”InnoDB stores data in a clustered index (the primary key B-tree
holds the full tuple). A secondary-index scan is therefore always a
two-step operation: secondary-index lookup → clustered-index fetch
(the equivalent of PostgreSQL’s IndexScan heap fetch). But because
the clustered index is ordered by primary key, sequential primary-key
range scans are always full-sequential reads. PostgreSQL has no
clustered index; CLUSTER TABLE ... USING index is a one-time
physical sort, not a maintained structure. This makes PostgreSQL’s
index-only scan via the visibility map particularly important for
covering-index workloads.
Covering indexes and index-only scans across engines
Section titled “Covering indexes and index-only scans across engines”The visibility-map gate in PostgreSQL’s IndexOnlyScan is a
design-specific tradeoff: unlike Oracle (which uses SCN-based row
directory in the index) or SQL Server (which stores row versions in
the index page itself via the row-versioning store), PostgreSQL stores
visibility state per-page in the visibility map. The upside is that
VACUUM-driven visibility-map maintenance amortizes the visibility
cost across all subsequent index-only scans on those pages; the
downside is that freshly-inserted or recently-updated pages require
a heap fetch until VACUUM runs and sets the all-visible bit.
Research frontier: late materialization and columnar projection
Section titled “Research frontier: late materialization and columnar projection”The textbook’s “projection pushdown” principle extends to
late materialization in columnar stores (Abadi et al., VLDB
2007 “Column-Stores vs. Row-Stores”): don’t fetch all columns of a
tuple at the scan layer; pass only the column identifiers upward
and reconstruct the full tuple only when a filter predicate has
narrowed the row set. PostgreSQL’s heap AM is a row store, so
ExecScan must return a full TupleTableSlot at each step.
The TupleTableSlot virtual-tuple mechanism (tts_values[] +
tts_isnull[]) partially decouples column materialization from
heap fetching, but full late materialization requires a columnar
AM. The table_am API (postgres-table-am.md) is the
extensibility hook for columnar engines that want to provide
their own scan node variants via the custom-scan interface.
Bitmap scans and I/O coalescing
Section titled “Bitmap scans and I/O coalescing”The two-phase bitmap scan closely matches what Database Internals
(Petrov, ch. 7 “Log-Structured Storage” — discussed in the context
of merge reads) calls I/O coalescing: convert a set of random
lookups into a sorted sequential pass. PostgreSQL’s TIDBitmap
implements exact mode (one bit per TID, cheap recheck) and lossy
mode (one bit per page, mandatory recheck), with automatic degradation
when work_mem is exhausted. Research systems such as Blink-tree
indexes and ARIES-IM extend this to buffer-managed bitmaps shared
across concurrent queries; PostgreSQL’s bitmap is per-query and
lives in operator memory.
Sources
Section titled “Sources”In-tree source files (commit 273fe94, REL_18_STABLE):
src/backend/executor/execScan.c— shared outer scan loopsrc/backend/executor/nodeSeqscan.c— sequential scansrc/backend/executor/nodeIndexscan.c— index scan + scan-key buildsrc/backend/executor/nodeIndexonlyscan.c— index-only scansrc/backend/executor/nodeBitmapHeapscan.c— bitmap heap scansrc/backend/executor/nodeBitmapIndexscan.c— bitmap index scan (subplan)src/backend/executor/nodeTidscan.c— ctid point scansrc/backend/executor/nodeTidrangescan.c— ctid range scansrc/include/nodes/execnodes.h—ScanState,SeqScanState,IndexScanState,IndexOnlyScanState,BitmapHeapScanState,TidScanState,TidRangeScanStatesrc/include/executor/execScan.h—ExecScanExtendedinline,ExecScanAccessMtd,ExecScanRecheckMtd
Cross-references (knowledge base):
postgres-executor.md— Volcano iterator model,TupleTableSlot,ExecProcNodedispatchpostgres-table-am.md—table_beginscan,table_scan_getnextslot,table_tuple_fetch_row_version,table_beginscan_tidrangepostgres-nbtree.md—index_beginscan,index_getnext_slot, scan-key semanticspostgres-heap-am.md—table_beginscan_bm,table_scan_bitmap_next_tuplepostgres-mvcc-snapshots.md— visibility checks, snapshot semanticspostgres-buffer-manager.md— buffer pin and visibility map buffer managementpostgres-path-generation.md— planner path types that produce these plan nodes
Textbook references:
- Database System Concepts, Silberschatz, 7e, §12.3 (file scan), §15.3 (index scan), §15.3.3 (bitmap scan)
- Database Internals, Petrov, ch. 7 (I/O coalescing, read amplification)