PostgreSQL Join Executors — Nested Loop, Merge Join, and Hash Join
Contents:
- Theoretical Background
- Common DBMS Design
- PostgreSQL’s Approach
- Source Walkthrough
- Source verification (as of 2026-06-05)
- Beyond PostgreSQL — Comparative Designs & Research Frontiers
- Sources
Theoretical Background
Section titled “Theoretical Background”A join combines tuples from two relations whenever they satisfy a join
predicate. It is the operation the relational model is built around and the
one query processing spends the most effort optimizing, because the number
of candidate tuple pairs is the product of the two input sizes — a naive
“compare every pair” is O(|R| × |S|). Database System Concepts
(Silberschatz, Korth & Sudarshan, 7e, ch. 15 “Query Processing”, §15.5
“Join Operation”) frames the physical-operator question as: given that the
optimizer has chosen a join order and which input is outer vs. inner,
which algorithm evaluates one binary join cheaply? The textbook names
three families, and every mainstream engine — PostgreSQL included —
implements exactly these three.
-
Nested-loop join (§15.5.1). The brute-force algorithm: for each tuple
t_rin the outer relationr, scan the entire inner relationsand emit every pair(t_r, t_s)that satisfies the predicateθ. Cost isn_r × b_s + b_rblock transfers in the worst case (the inner re-read once per outer tuple). It needs no preconditions on the inputs — no sort order, no index, no equality predicate — which is why it is the universal fallback and the only choice for non-equijoins (r.a < s.b) and cross products. The crucial refinement is the indexed nested-loop join (§15.5.2): if there is an index on the inner relation’s join attribute, the inner “scan” becomes an index lookup keyed on the current outer tuple’s value, collapsing the inner cost fromb_stoc(the index probe cost) per outer tuple. This turns the worst algorithm into often the best one for a small outer and a selective inner index. -
Merge join (§15.5.4, “sort-merge join”). If both inputs are sorted on the join attribute(s), a single coordinated pass over both suffices: advance whichever cursor points at the smaller value until the two meet, emit all matching pairs, then advance past the equal group. Cost is linear in the input sizes once sorted —
b_r + b_s— but the inputs must be sorted on the join key, which either the planner pays for with an explicit sort, or which comes free from an ordered index scan. The subtlety the textbook stresses is duplicate handling: when the outer side has several tuples with the same key, the inner cursor must be rewound to the first matching inner tuple for each such outer tuple. The algorithm therefore needs a way to mark an inner position and restore to it. -
Hash join (§15.5.5). For an equijoin, partition both inputs by a hash function
hon the join attribute. Tuples that join must land in the same partition, so the join reduces to, for each partition, building an in-memory hash table on the (smaller) build input and probing it with the (larger) probe input. Cost is linear —3(b_r + b_s)with one partitioning pass — provided the build partition fits in memory. When it does not, the GRACE / hybrid hash join (§15.5.5.1–15.5.5.2) recursively partitions to disk: the build relation is split intonpartitions written to temporary files, and each partition’s build/probe is processed independently so that only one partition’s hash table need be memory-resident at a time. Hybrid hash join keeps the first partition in memory during the partitioning pass to avoid writing it out at all. Hash join works only for equijoins (the hash function presumes equality semantics) but needs no sort order and no index.
The textbook’s decision summary (§15.5.6 “Complex Joins” and the cost discussion) is the mental model the optimizer encodes: nested loop wins for tiny inputs or a selective inner index; merge join wins when inputs are already sorted (or a sort is cheap and the output order is useful); hash join wins for large unsorted equijoin inputs that exceed what a sort would comfortably cost. The execution engine’s job is to implement all three faithfully and let the planner pick; this document traces those three implementations in the REL_18 source.
Two properties carry over from the executor framework
(postgres-executor.md) and shape all three nodes:
- Demand-pull iterator interface. Each join is a
PlanStatewhoseExecProcNodereturns one joinedTupleTableSlotper call, pulling input tuples from itslefttree(outer) andrighttree(inner) viaExecProcNodeon the children. State (which outer tuple we are on, the merge phase, the hash batch) persists between calls in the node struct. - State lives in the node, control recurses down. A single
next()on the join cascades intonext()calls on the children; the join transforms the two input streams into one output stream.
Common DBMS Design
Section titled “Common DBMS Design”The textbook gives the algorithms. This section names the engineering
conventions a production iterator engine adopts to render those
algorithms as composable pull nodes — the patterns PostgreSQL shares with
other systems, so that the named symbols in ## PostgreSQL's Approach read
as one point in a shared design space.
Outer drives, inner is the re-readable side
Section titled “Outer drives, inner is the re-readable side”In a pull engine the asymmetry between the two inputs is concrete: the outer input is scanned once, front to back; the inner input must be re-readable — scanned afresh for each outer tuple (nested loop), rewound to a marked position (merge join), or fully materialized into a hash table (hash join). Engines encode this by making the inner child support a rescan / mark-restore protocol, and by choosing which physical operator can sit on the inner side of each join type. The cost asymmetry is why join order and inner/outer assignment are planner decisions, not executor ones.
Parameterized rescan for indexed nested loop
Section titled “Parameterized rescan for indexed nested loop”The indexed nested-loop join is only fast if the inner index scan is keyed on the current outer tuple’s value. The convention is a parameter-passing channel: the join, on each new outer tuple, writes the relevant outer column values into a set of run-time parameter slots, flags those parameters as changed, and triggers a rescan of the inner subtree. The inner index scan reads its scan keys from those parameter slots, so each “rescan” is actually a fresh, cheap index probe. Without this channel the inner side would re-scan its whole relation per outer tuple and the index would be useless.
Mark and restore for merge-join duplicate groups
Section titled “Mark and restore for merge-join duplicate groups”Merge join over duplicate keys needs to re-join a group of equal inner tuples against each of several equal outer tuples. The convention is a mark / restore pair on the inner stream: mark the position of the first inner tuple in an equal group, advance through the group joining, and when the next outer tuple has the same key, restore the inner cursor back to the mark. Engines implement this either in the access method (an index scan can remember a TID) or by interposing a materialize buffer that caches recently read inner tuples so they can be replayed.
Build the smaller side; spill when memory is exceeded
Section titled “Build the smaller side; spill when memory is exceeded”Hash join’s linear cost holds only while the build-side hash table fits in the working-memory budget. Every engine therefore (1) builds on the input the planner estimated smaller, and (2) has a graceful-degradation strategy when the estimate is wrong at run time: partition both inputs into batches by extra bits of the hash value, keep one batch in memory, and spill the rest to temporary files, processing batches sequentially. The number of batches is kept a power of two so that doubling it merely inspects one more hash bit, and tuples already on disk need not all be re-read to re-partition. This is the hybrid-hash / GRACE machinery, and its correctness hinges on the invariant that the same hash bits decide both the in-memory bucket and the on-disk batch.
One join base type: predicate split and outer-join fill
Section titled “One join base type: predicate split and outer-join fill”All three algorithms share outer-join semantics and a two-way predicate split that production engines factor into a common base:
joinqualvs.otherqual. The join predicate decides whether two tuples “match” (and so whether an outer tuple has been satisfied, which matters for outer joins); the other predicate (WHEREresidue) is an additional filter that must also pass before a row is emitted but does not count toward match status. Splitting them is required to getLEFT JOIN ... ONvs.WHEREsemantics right.- Null-fill slots. A
LEFT/RIGHT/FULL/ANTIjoin must emit a row with NULLs on the unmatched side. Engines pre-build a constant all-NULLs slot for the side that may be filled and substitute it into the projection when an outer (or inner) tuple finds no match. single_matchshort-circuit. For semijoins and joins the planner has proven unique on the inner side, the engine stops after the first inner match for a given outer tuple — there is no point scanning further.
Theory ↔ PostgreSQL mapping
Section titled “Theory ↔ PostgreSQL mapping”| Theory / convention | PostgreSQL name |
|---|---|
Nested-loop next() | ExecNestLoop (in nodeNestloop.c) |
| Indexed-NL parameter channel | NestLoop.nestParams → ParamExecData slots, ExecReScan(innerPlan) |
Merge-join next() state machine | ExecMergeJoin (11 EXEC_MJ_* states) |
| Mark / restore on inner | ExecMarkPos / ExecRestrPos, mj_MarkedTupleSlot |
| Per-key comparison | MJCompare over MergeJoinClause[] + SortSupport |
Hash-join next() state machine | ExecHashJoinImpl (6 HJ_* states) |
| Build-side hash table | HashJoinTable built by MultiExecHash / ExecHashTableInsert |
| Bucket-and-batch assignment | ExecHashGetBucketAndBatch |
| Probe a bucket | ExecScanHashBucket |
| Batch growth / spill | ExecHashIncreaseNumBatches, ExecHashJoinSaveTuple, BufFile |
| Memory-budget sizing | ExecChooseHashTableSize, get_hash_memory_limit |
| Shared join base | JoinState (js.joinqual, js.ps.qual, js.single_match) |
| Null-fill slot | ExecInitNullTupleSlot → *_NullInnerTupleSlot / *_NullOuterTupleSlot |
| Result projection | ExecProject(node->js.ps.ps_ProjInfo) |
The executor framework (lifecycle, ExecProcNode dispatch, the
TupleTableSlot abstraction, ExecReScan) is owned by
postgres-executor.md. The leaf inputs these joins pull from — sequential
and index scans, including the parameterized index scan that makes
indexed nested loop fast, and the Material / Sort nodes that feed merge
join — are owned by postgres-scan-nodes.md. The compiled joinqual /
otherqual / hash-clause ExprStates and ExecQual / ExecProject are
owned by postgres-expression-eval.md. This document covers the three
join node bodies and the hash-table build/probe/spill machinery they
drive.
PostgreSQL’s Approach
Section titled “PostgreSQL’s Approach”PostgreSQL implements the three textbook algorithms as three executor node
files, all built on a shared JoinState base, all returning one projected
TupleTableSlot per ExecProcNode call.
flowchart TB
subgraph JS["JoinState (shared base, execnodes.h)"]
JQ["jointype<br/>joinqual (match predicate)<br/>ps.qual = otherqual (WHERE residue)<br/>single_match"]
end
NL["NestLoopState<br/>ExecNestLoop<br/>nl_NeedNewOuter / nl_MatchedOuter"]
MJ["MergeJoinState<br/>ExecMergeJoin<br/>mj_JoinState (11 states)<br/>mj_MarkedTupleSlot"]
HJ["HashJoinState<br/>ExecHashJoinImpl<br/>hj_JoinState (6 states)<br/>hj_HashTable"]
JS --> NL
JS --> MJ
JS --> HJ
HJ -. "inner child" .-> HASH["HashState<br/>MultiExecHash<br/>HashJoinTable (buckets + batches)"]
Figure 1 — The three join executors and their shared JoinState base.
Every join carries jointype, the match predicate joinqual, the residual
filter otherqual (stored in the generic ps.qual), and the
single_match short-circuit flag. Hash join is special in that its inner
input is always a dedicated Hash node whose MultiExecHash builds the
HashJoinTable the join then probes.
Four design decisions give the join layer its shape:
- Outer is
lefttree, inner isrighttree.outerPlanState(node)/innerPlanState(node)read the two children. The outer is pulled once front-to-back; the inner is the re-readable side (rescanned, mark/restored, or hashed). - The match/filter split is in the base.
js.joinqualis the predicate that decides “matched” (and thus drives outer-join and antijoin logic);js.ps.qual(the generic node qual) is theotherqualthat must also pass before a row is emitted. Both are compiledExprStates. - Outer joins use a pre-built NULL slot.
ExecInitNullTupleSlotcreates an all-NULLs virtual slot for the fillable side, substituted intoecontext->ecxt_innertuple/ecxt_outertuplebefore projecting an unmatched row. - Hash join’s build side is a separate node. The
Hashnode (nodeHash.c) is not a normal pull node — itsExecProcNodeerrors out; it is driven once viaMultiExecProcNodeto build the whole table, then theHashJoinnode probes that table directly.
Nested loop — ExecNestLoop
Section titled “Nested loop — ExecNestLoop”ExecNestLoop is the most literal rendering of the textbook algorithm. Its
loop carries two flags: nl_NeedNewOuter (we have exhausted the inner for
the current outer and must advance) and nl_MatchedOuter (did the current
outer find any inner match — needed for LEFT/ANTI fill). When a new
outer tuple is needed it pulls one, and — the key step for indexed nested
loop — pushes outer column values into PARAM_EXEC slots and rescans the
inner:
// ExecNestLoop — src/backend/executor/nodeNestloop.c (condensed)if (node->nl_NeedNewOuter){ outerTupleSlot = ExecProcNode(outerPlan); if (TupIsNull(outerTupleSlot)) return NULL; /* outer exhausted -> join done */
econtext->ecxt_outertuple = outerTupleSlot; node->nl_NeedNewOuter = false; node->nl_MatchedOuter = false;
/* push outer Vars into the inner scan's PARAM_EXEC slots */ foreach(lc, nl->nestParams) { NestLoopParam *nlp = (NestLoopParam *) lfirst(lc); ParamExecData *prm = &(econtext->ecxt_param_exec_vals[nlp->paramno]);
prm->value = slot_getattr(outerTupleSlot, nlp->paramval->varattno, &(prm->isnull)); /* Flag parameter value as changed */ innerPlan->chgParam = bms_add_member(innerPlan->chgParam, nlp->paramno); } ExecReScan(innerPlan); /* now an index probe, not a re-read */}Setting innerPlan->chgParam is what makes the rescan cheap: recall from
postgres-executor.md that ExecProcNode checks chgParam and calls
ExecReScan before pulling — and a parameterized index scan reads its scan
keys from these very ParamExecData slots, so the “rescan” issues a new
index lookup keyed on the current outer value. For a plain (non-indexed)
nested loop, nestParams is empty and ExecReScan(innerPlan) simply
restarts the inner scan from the beginning.
The body then pulls the next inner tuple and tests the predicates:
// ExecNestLoop — inner advance and qual test (condensed)innerTupleSlot = ExecProcNode(innerPlan);econtext->ecxt_innertuple = innerTupleSlot;
if (TupIsNull(innerTupleSlot)){ node->nl_NeedNewOuter = true; if (!node->nl_MatchedOuter && (node->js.jointype == JOIN_LEFT || node->js.jointype == JOIN_ANTI)) { /* outer had no match: emit NULL-extended row if otherqual passes */ econtext->ecxt_innertuple = node->nl_NullInnerTupleSlot; if (otherqual == NULL || ExecQual(otherqual, econtext)) return ExecProject(node->js.ps.ps_ProjInfo); } continue; /* go get a new outer tuple */}
if (ExecQual(joinqual, econtext)) /* match predicate */{ node->nl_MatchedOuter = true; if (node->js.jointype == JOIN_ANTI) { /* antijoin: a match means skip */ node->nl_NeedNewOuter = true; continue; } if (node->js.single_match) /* semijoin / unique inner */ node->nl_NeedNewOuter = true; if (otherqual == NULL || ExecQual(otherqual, econtext)) return ExecProject(node->js.ps.ps_ProjInfo); /* emit joined row */}The structure is the same shared shape all three nodes use: joinqual
decides MatchedOuter; for JOIN_ANTI a match suppresses output and a
non-match (at inner-exhaustion) emits the NULL-filled row; for
single_match (semijoin or planner-proven unique inner) one match advances
to the next outer; otherqual is the final filter before ExecProject.
flowchart TB
START["ExecNestLoop call"] --> NEED{"nl_NeedNewOuter?"}
NEED -->|yes| OUT["ExecProcNode(outer)"]
OUT --> OUTNULL{"outer NULL?"}
OUTNULL -->|yes| DONE["return NULL (join done)"]
OUTNULL -->|no| PARAM["push outer Vars to PARAM_EXEC<br/>set chgParam, ExecReScan(inner)"]
PARAM --> INNER
NEED -->|no| INNER["ExecProcNode(inner)"]
INNER --> INNULL{"inner NULL?"}
INNULL -->|yes| FILL["LEFT/ANTI and not matched?<br/>emit NULL-filled row"]
FILL --> NEED
INNULL -->|no| JQ{"joinqual?"}
JQ -->|no| INNER
JQ -->|yes| OQ{"otherqual?"}
OQ -->|no| INNER
OQ -->|yes| EMIT["ExecProject -> return joined row"]
Figure 2 — ExecNestLoop control flow. The outer-tuple branch (left) is
where the parameterized-inner-rescan happens; the inner-tuple branch (right)
is the per-pair qual test. Inner exhaustion both triggers a new outer and is
the point at which an unmatched LEFT/ANTI outer tuple is NULL-filled.
ExecReScanNestLoop deliberately does not rescan the inner plan: the
inner is rescanned per outer tuple inside the main loop, and rescanning it
here would corrupt the run-time-keyed inner index scan. It only rescans the
outer (unless the outer itself has chgParam, in which case ExecProcNode
will do it) and resets the two flags.
Merge join — ExecMergeJoin
Section titled “Merge join — ExecMergeJoin”Merge join assumes both inputs arrive sorted on the merge keys (the
planner guarantees this with index scans or explicit Sort nodes). The
node is coded as an explicit state machine with eleven states; the
file-header comment gives the algorithm skeleton, and the states map onto it
directly. The per-call entry reads the node, resets the per-tuple context,
and dispatches on mj_JoinState:
// ExecMergeJoin — src/backend/executor/nodeMergejoin.c (state dispatch, condensed)for (;;){ switch (node->mj_JoinState) { case EXEC_MJ_INITIALIZE_OUTER: /* first call: fetch first outer */ outerTupleSlot = ExecProcNode(outerPlan); node->mj_OuterTupleSlot = outerTupleSlot; switch (MJEvalOuterValues(node)) { ... } /* -> INITIALIZE_INNER */ break; case EXEC_MJ_SKIP_TEST: /* compare; advance the smaller side */ compareResult = MJCompare(node); if (compareResult == 0) { if (!node->mj_SkipMarkRestore) ExecMarkPos(innerPlan); /* mark start of group */ MarkInnerTuple(node->mj_InnerTupleSlot, node); node->mj_JoinState = EXEC_MJ_JOINTUPLES; } else if (compareResult < 0) node->mj_JoinState = EXEC_MJ_SKIPOUTER_ADVANCE; else node->mj_JoinState = EXEC_MJ_SKIPINNER_ADVANCE; break; /* ... JOINTUPLES, NEXTINNER, NEXTOUTER, TESTOUTER, ENDOUTER, ENDINNER ... */ }}The comparison itself is not done by running the merge-equality
operator; the planner hands the executor a list of leftexpr = rightexpr
clauses plus the btree opfamily / collation / sort direction for each, and
MJExamineQuals turns each into a MergeJoinClauseData holding a
SortSupport. MJEvalOuterValues / MJEvalInnerValues evaluate the key
expressions for the current tuples; MJCompare then walks the clauses
calling ApplySortComparator, returning 0 / <0 / >0:
// MJCompare — src/backend/executor/nodeMergejoin.c (condensed)for (i = 0; i < mergestate->mj_NumClauses; i++){ MergeJoinClause clause = &mergestate->mj_Clauses[i];
if (clause->lisnull && clause->risnull) { nulleqnull = true; /* NULL "=" NULL -> not a real match */ continue; } result = ApplySortComparator(clause->ldatum, clause->lisnull, clause->rdatum, clause->risnull, &clause->ssup); if (result != 0) break;}/* NULL=NULL or a constant-false joinqual must not report equality */if (result == 0 && (nulleqnull || mergestate->mj_ConstFalseJoin)) result = 1; /* treat as outer > inner: advance inner */The duplicate-group handling is the heart of the algorithm. When
SKIP_TEST finds an equal pair it marks the inner position
(ExecMarkPos on the inner plan, plus a copy into mj_MarkedTupleSlot),
then JOINTUPLES emits the pair and NEXTINNER advances the inner through
the equal group. When the inner group ends, NEXTOUTER fetches the next
outer tuple and TESTOUTER compares it to the marked inner tuple: if they
are still equal, the outer side also had duplicates, so ExecRestrPos
rewinds the inner back to the mark to re-join the whole inner group against
this new outer tuple:
// ExecMergeJoin — EXEC_MJ_TESTOUTER (condensed)innerTupleSlot = node->mj_MarkedTupleSlot;(void) MJEvalInnerValues(node, innerTupleSlot);compareResult = MJCompare(node);
if (compareResult == 0) /* new outer still equals marked inner */{ if (!node->mj_SkipMarkRestore) { ExecRestrPos(innerPlan); /* rewind inner to the mark */ node->mj_InnerTupleSlot = innerTupleSlot; } node->mj_JoinState = EXEC_MJ_JOINTUPLES; /* re-join the group */}else if (compareResult > 0) /* new outer past the marked group */{ /* reload current inner and resume skipping */ ... node->mj_JoinState = EXEC_MJ_SKIP_TEST;}When the planner can prove mark/restore is unnecessary (e.g. the inner is
unique on the key so groups have size one), it sets skip_mark_restore and
the node skips the ExecMarkPos / ExecRestrPos calls entirely
(mj_SkipMarkRestore). Outer joins are handled by the doFillOuter /
doFillInner flags and the MJFillOuter / MJFillInner helpers, which
emit a NULL-extended row (subject to otherqual) for any tuple that exits a
SKIP/ADVANCE/END state without having matched.
flowchart TB
IO["INITIALIZE_OUTER<br/>fetch first outer"] --> II["INITIALIZE_INNER<br/>fetch first inner"]
II --> ST{"SKIP_TEST<br/>MJCompare"}
ST -->|"outer < inner"| SOA["SKIPOUTER_ADVANCE<br/>(fill outer if LEFT)"]
ST -->|"outer > inner"| SIA["SKIPINNER_ADVANCE<br/>(fill inner if RIGHT)"]
ST -->|"equal"| MARK["ExecMarkPos inner<br/>-> JOINTUPLES"]
SOA --> ST
SIA --> ST
MARK --> JT["JOINTUPLES<br/>emit pair, joinqual+otherqual"]
JT --> NI["NEXTINNER<br/>advance inner in group"]
NI -->|"still equal"| JT
NI -->|"group ended"| NO["NEXTOUTER<br/>advance outer"]
NO --> TO{"TESTOUTER<br/>new outer == mark?"}
TO -->|"equal"| RESTR["ExecRestrPos inner<br/>-> JOINTUPLES"]
TO -->|"greater"| ST
RESTR --> JT
Figure 3 — ExecMergeJoin as a state machine. The left SKIP_* states
synchronize the two cursors; the right JOINTUPLES/NEXTINNER/NEXTOUTER/
TESTOUTER cycle emits an equal group and — via the mark set in SKIP_TEST
and the ExecRestrPos in TESTOUTER — re-joins it against each duplicate
outer key. The ENDOUTER/ENDINNER states (not shown) drain outer-join
fill rows after one input is exhausted.
Hash join — ExecHashJoinImpl
Section titled “Hash join — ExecHashJoinImpl”Hash join is the hybrid-hash algorithm, also a state machine — six HJ_*
states — but with an asymmetry the other two lack: its inner input is a
dedicated Hash node that builds the whole hash table up front, before
any probing. The core driver is ExecHashJoinImpl, marked
pg_attribute_always_inline and instantiated twice, as the
parallel-oblivious ExecHashJoin and the parallel-aware
ExecParallelHashJoin, so the compiler removes the dead parallel branches
from each. The build state runs once:
// ExecHashJoinImpl — HJ_BUILD_HASHTABLE (condensed, serial path)case HJ_BUILD_HASHTABLE: /* (optional) empty-outer short-circuit for inner joins */ if (!HJ_FILL_INNER(node) && !parallel && ...) { node->hj_FirstOuterTupleSlot = ExecProcNode(outerNode); if (TupIsNull(node->hj_FirstOuterTupleSlot)) return NULL; /* empty outer -> no inner join */ }
hashtable = ExecHashTableCreate(hashNode); node->hj_HashTable = hashtable;
hashNode->hashtable = hashtable; (void) MultiExecProcNode((PlanState *) hashNode); /* BUILD the table */
if (hashtable->totalTuples == 0 && !HJ_FILL_OUTER(node)) return NULL; /* empty inner -> no rows */
node->hj_JoinState = HJ_NEED_NEW_OUTER; /* FALL THRU */MultiExecProcNode on the Hash child runs MultiExecHash →
MultiExecPrivateHash, which pulls every inner tuple, computes its hash
value, and inserts it into the table (or spills it to a batch file). The
build loop is the literal partitioning pass:
// MultiExecPrivateHash — src/backend/executor/nodeHash.c (condensed)for (;;){ slot = ExecProcNode(outerNode); /* "outer" of Hash = inner of join */ if (TupIsNull(slot)) break; econtext->ecxt_outertuple = slot; hashdatum = ExecEvalExprSwitchContext(node->hash_expr, econtext, &isnull); if (!isnull) { uint32 hashvalue = DatumGetUInt32(hashdatum); int bucketNumber = ExecHashGetSkewBucket(hashtable, hashvalue); if (bucketNumber != INVALID_SKEW_BUCKET_NO) ExecHashSkewTableInsert(hashtable, slot, hashvalue, bucketNumber); else ExecHashTableInsert(hashtable, slot, hashvalue); /* normal insert */ hashtable->totalTuples += 1; }}Probing then runs the HJ_NEED_NEW_OUTER → HJ_SCAN_BUCKET cycle: fetch an
outer tuple, compute its hash value, find its bucket and batch, and walk the
bucket’s chain testing the hash clauses. ExecHashGetBucketAndBatch is the
function whose correctness the whole spill scheme rests on — it splits one
hash value into the in-memory bucket index (low bits) and the batch index
(rotated high bits):
// ExecHashGetBucketAndBatch — src/backend/executor/nodeHash.cif (nbatch > 1){ *bucketno = hashvalue & (nbuckets - 1); *batchno = pg_rotate_right32(hashvalue, hashtable->log2_nbuckets) & (nbatch - 1);}else{ *bucketno = hashvalue & (nbuckets - 1); *batchno = 0;}// ExecScanHashBucket — src/backend/executor/nodeHash.c (condensed)if (hashTuple != NULL) hashTuple = hashTuple->next.unshared; /* continue current chain */else hashTuple = hashtable->buckets.unshared[hjstate->hj_CurBucketNo];
while (hashTuple != NULL){ if (hashTuple->hashvalue == hashvalue) /* cheap pre-filter */ { inntuple = ExecStoreMinimalTuple(HJTUPLE_MINTUPLE(hashTuple), hjstate->hj_HashTupleSlot, false); econtext->ecxt_innertuple = inntuple; if (ExecQualAndReset(hjclauses, econtext)) /* real equality test */ { hjstate->hj_CurTuple = hashTuple; /* remember position */ return true; } } hashTuple = hashTuple->next.unshared;}return false; /* bucket exhausted */Note the two-level test: a chained hash table stores the full hashvalue in
each HashJoinTuple header, so ExecScanHashBucket skips the expensive
ExecQual for tuples whose hash differs, only running the hash-clause
ExprState on hash-collision candidates.
Build/probe and batching (spill to disk)
Section titled “Build/probe and batching (spill to disk)”The build-side table is sized up front by ExecChooseHashTableSize from the
planner’s row estimate and the hash_mem budget (get_hash_memory_limit),
choosing an initial nbuckets and nbatch (both powers of two). But the
estimate can be wrong, so ExecHashTableInsert measures real space used and
doubles the batch count when the table outgrows spaceAllowed:
// ExecHashTableInsert — src/backend/executor/nodeHash.c (condensed)ExecHashGetBucketAndBatch(hashtable, hashvalue, &bucketno, &batchno);if (batchno == hashtable->curbatch){ /* in-memory: prepend to bucket chain */ hashTuple = (HashJoinTuple) dense_alloc(hashtable, HJTUPLE_OVERHEAD + tuple->t_len); hashTuple->hashvalue = hashvalue; memcpy(HJTUPLE_MINTUPLE(hashTuple), tuple, tuple->t_len); hashTuple->next.unshared = hashtable->buckets.unshared[bucketno]; hashtable->buckets.unshared[bucketno] = hashTuple;
hashtable->spaceUsed += HJTUPLE_OVERHEAD + tuple->t_len; if (hashtable->spaceUsed + hashtable->nbuckets_optimal * sizeof(HashJoinTuple) > hashtable->spaceAllowed) ExecHashIncreaseNumBatches(hashtable); /* over budget -> split */}else{ /* belongs to a later batch: spill to a temp file */ Assert(batchno > hashtable->curbatch); ExecHashJoinSaveTuple(tuple, hashvalue, &hashtable->innerBatchFile[batchno], hashtable);}ExecHashIncreaseNumBatches doubles nbatch, enlarges the per-batch
BufFile arrays, then walks every in-memory tuple and re-tests its batch
number — tuples whose batchno is no longer curbatch are flushed to their
new batch file and freed from memory. Because nbatch is a power of two, a
doubling exposes exactly one more hash bit, so roughly half the in-memory
tuples migrate and the rest stay. Spilling itself is ExecHashJoinSaveTuple,
which lazily creates a temp BufFile and appends hashvalue + the
MinimalTuple:
// ExecHashJoinSaveTuple — src/backend/executor/nodeHashjoin.c (condensed)if (file == NULL){ MemoryContext oldctx = MemoryContextSwitchTo(hashtable->spillCxt); file = BufFileCreateTemp(false); *fileptr = file; MemoryContextSwitchTo(oldctx);}BufFileWrite(file, &hashvalue, sizeof(uint32)); /* hash, then the tuple */BufFileWrite(file, tuple, tuple->t_len);When the current batch’s outer side is exhausted, HJ_NEED_NEW_BATCH calls
ExecHashJoinNewBatch, which closes the finished outer file, loads the next
inner batch file back into the (now empty) hash table, and reopens that
batch’s outer file for probing. Batch 0 is special: its inner tuples are
already in memory from the build pass, and its outer tuples are read
directly from the outer child (those that hash to a later batch are spilled
to outer batch files on the fly by HJ_NEED_NEW_OUTER). This is exactly the
hybrid-hash optimization — the first partition never leaves memory.
flowchart TB BUILD["HJ_BUILD_HASHTABLE<br/>MultiExecHash builds table<br/>(may spill batches 1..n)"] --> NNO["HJ_NEED_NEW_OUTER<br/>fetch outer, hash it"] NNO -->|"outer NULL"| NNB NNO -->|"belongs to later batch"| SPILL["save to outer batch file<br/>stay in NEED_NEW_OUTER"] SPILL --> NNO NNO -->|"this batch"| SB["HJ_SCAN_BUCKET<br/>walk chain, ExecScanHashBucket"] SB -->|"match + quals"| EMIT["ExecProject -> return row"] SB -->|"no more matches"| FOT["HJ_FILL_OUTER_TUPLE<br/>(LEFT: emit NULL-filled)"] FOT --> NNO NNB["HJ_NEED_NEW_BATCH<br/>ExecHashJoinNewBatch:<br/>load next inner batch file"] -->|"more batches"| NNO NNB -->|"no more batches"| DONE["return NULL (join done)"]
Figure 4 — ExecHashJoinImpl hybrid-hash state machine (serial path). The
build runs once; the NEED_NEW_OUTER/SCAN_BUCKET cycle probes the current
batch; NEED_NEW_BATCH loads the next spilled batch from disk. Outer tuples
that hash to a later batch are spilled rather than probed. HJ_FILL_INNER_TUPLES
(right/full joins, not shown) scans the table for unmatched inner tuples
before advancing the batch.
Source Walkthrough
Section titled “Source Walkthrough”Anchor on symbol names, not line numbers. A function or struct name is the stable handle across releases. Use
git grep -n '<symbol>' src/backend/executor/to locate the current position. The line numbers in the position-hint table were observed at commit273fe94(REL_18_STABLE) and are quick hints only.
Nested loop (nodeNestloop.c)
Section titled “Nested loop (nodeNestloop.c)”ExecNestLoop— thenext()body:nl_NeedNewOuterdrives a new outer fetch +nestParamspush +ExecReScan(innerPlan); the inner advance testsjoinqualthenotherqual; inner exhaustion handlesLEFT/ANTINULL-fill vianl_NullInnerTupleSlot.ExecInitNestLoop— buildsNestLoopState; inits outer child, then setsEXEC_FLAG_REWINDon the inner only ifnestParams == NIL(a parameterized inner is rescanned with fresh params, so REWIND is pointless);single_match = inner_unique || JOIN_SEMI; builds the null-inner slot forLEFT/ANTI.ExecReScanNestLoop— rescans the outer (only if itschgParamis null) and resetsnl_NeedNewOuter/nl_MatchedOuter; deliberately does not rescan the inner (per-outer rescan happens in the main loop).ExecEndNestLoop— recurseExecEndNodeinto both children.
Merge join (nodeMergejoin.c)
Section titled “Merge join (nodeMergejoin.c)”ExecMergeJoin— the eleven-state machine (EXEC_MJ_INITIALIZE_OUTER/INNER,SKIP_TEST,SKIPOUTER_ADVANCE/SKIPINNER_ADVANCE,JOINTUPLES,NEXTINNER,NEXTOUTER,TESTOUTER,ENDOUTER/ENDINNER).MJExamineQuals— turns the planner’sleftexpr = rightexprmergeclause list + opfamily/collation/direction metadata into aMergeJoinClauseData[]array, each with aSortSupportcomparator (BTSORTSUPPORT_PROC, or a shim overBTORDER_PROC).MJEvalOuterValues/MJEvalInnerValues— evaluate the merge-key expressions intoclause->ldatum/rdatum; returnMJEVAL_MATCHABLE/MJEVAL_NONMATCHABLE(NULL key) /MJEVAL_ENDOFJOIN(input exhausted or first-column NULL that ends the join).MJCompare— walk the clauses withApplySortComparator; map NULL=NULL andmj_ConstFalseJointo a non-equal result so the inner advances.MarkInnerTuple(macro) /ExecMarkPos/ExecRestrPos— the mark/restore on the inner plan that re-joins duplicate outer keys against an equal inner group;mj_MarkedTupleSlotcaches the marked tuple.MJFillOuter/MJFillInner— emit a NULL-extended row (subject tootherqual) for an unmatched outer/inner tuple in an outer join.ExecInitMergeJoin— buildsMergeJoinState; creates two extraExprContexts (mj_OuterEContext/mj_InnerEContext) so key evaluation survives the per-tuple reset; passesEXEC_FLAG_MARKto the inner child unlessskip_mark_restore; setsmj_ExtraMarksfor aMaterialinner.ExecReScanMergeJoin/ExecEndMergeJoin— reset toEXEC_MJ_INITIALIZE_OUTER; teardown.
Hash join (nodeHashjoin.c)
Section titled “Hash join (nodeHashjoin.c)”ExecHashJoinImpl— the always-inlined six-state machine (HJ_BUILD_HASHTABLE,HJ_NEED_NEW_OUTER,HJ_SCAN_BUCKET,HJ_FILL_OUTER_TUPLE,HJ_FILL_INNER_TUPLES,HJ_NEED_NEW_BATCH).ExecHashJoin/ExecParallelHashJoin— the two thin wrappers that instantiateExecHashJoinImpl(pstate, false/true).ExecHashJoinOuterGetTuple/ExecParallelHashJoinOuterGetTuple— fetch the next outer tuple: from the child plan on batch 0 (computing its hash value viahj_OuterHash), or replayed from an outer batchBufFilefor later batches.ExecHashJoinNewBatch/ExecParallelHashJoinNewBatch— advance to the next batch, loading its inner file into the table and reopening its outer file; return false when no batches remain.ExecHashJoinSaveTuple— appendhashvalue+MinimalTupleto a (lazily created) per-batch tempBufFileinspillCxt.ExecInitHashJoin— buildsHashJoinState; the inner child is always aHashnode; compileshashclausesand per-side hash expressions.
Hash table build/probe/spill (nodeHash.c)
Section titled “Hash table build/probe/spill (nodeHash.c)”ExecHash— errors out: theHashnode is driven byMultiExecProcNode, not the per-tupleExecProcNodeconvention.MultiExecHash→MultiExecPrivateHash/MultiExecParallelHash— pull every inner tuple, hash it, andExecHashTableInsert(or skew-insert).ExecHashTableCreate/ExecChooseHashTableSize— allocate theHashJoinTable; sizenbuckets/nbatchfrom the row estimate andget_hash_memory_limit;NTUP_PER_BUCKET == 1is the load-factor target.ExecHashTableInsert— insert in-memory (prepend to bucket chain viadense_alloc) or spill toinnerBatchFile[batchno]; triggerExecHashIncreaseNumBuckets/ExecHashIncreaseNumBatcheswhen thresholds are exceeded.ExecHashGetBucketAndBatch— split a hash value into bucket (low bits) and batch (rotated high bits); the invariant the spill scheme depends on.ExecHashIncreaseNumBatches— doublenbatch, re-test every in-memory tuple, flush those now belonging to a later batch.ExecScanHashBucket/ExecParallelScanHashBucket— walk a bucket chain; hashvalue pre-filter thenExecQualAndReset(hashclauses).ExecPrepHashTableForUnmatched/ExecScanHashTableForUnmatched— the right/full-join pass over unmatched inner tuples (HJ_FILL_INNER_TUPLES).HashJoinTableData(hashjoin.h) —nbuckets/nbatch/spaceUsed/spaceAllowed/curbatch/buckets/innerBatchFile/outerBatchFile.
Shared join base (execnodes.h)
Section titled “Shared join base (execnodes.h)”struct JoinState— base embedded by all three:jointype,joinqual,single_match; the genericps.qualholdsotherqual.struct NestLoopState/MergeJoinState/HashJoinState— the three concrete state nodes.
Position hints (as of 2026-06-05, REL_18 273fe94)
Section titled “Position hints (as of 2026-06-05, REL_18 273fe94)”| Symbol | File | Line |
|---|---|---|
ExecNestLoop | nodeNestloop.c | 60 |
ExecInitNestLoop | nodeNestloop.c | 262 |
ExecEndNestLoop | nodeNestloop.c | 361 |
ExecReScanNestLoop | nodeNestloop.c | 381 |
ExecMergeJoin | nodeMergejoin.c | 594 |
MJExamineQuals | nodeMergejoin.c | 175 |
MJEvalOuterValues | nodeMergejoin.c | 289 |
MJEvalInnerValues | nodeMergejoin.c | 336 |
MJCompare | nodeMergejoin.c | 386 |
MJFillOuter | nodeMergejoin.c | 447 |
MJFillInner | nodeMergejoin.c | 478 |
ExecInitMergeJoin | nodeMergejoin.c | 1439 |
ExecEndMergeJoin | nodeMergejoin.c | 1636 |
ExecReScanMergeJoin | nodeMergejoin.c | 1652 |
ExecHashJoinImpl | nodeHashjoin.c | 221 |
ExecHashJoin | nodeHashjoin.c | 684 |
ExecParallelHashJoin | nodeHashjoin.c | 700 |
ExecInitHashJoin | nodeHashjoin.c | 716 |
ExecHashJoinOuterGetTuple | nodeHashjoin.c | 979 |
ExecHashJoinNewBatch | nodeHashjoin.c | 1130 |
ExecHashJoinSaveTuple | nodeHashjoin.c | 1414 |
MultiExecHash | nodeHash.c | 105 |
MultiExecPrivateHash | nodeHash.c | 138 |
ExecHashTableCreate | nodeHash.c | 446 |
ExecChooseHashTableSize | nodeHash.c | 658 |
ExecHashIncreaseNumBatches | nodeHash.c | 1030 |
ExecHashIncreaseNumBuckets | nodeHash.c | 1587 |
ExecHashTableInsert | nodeHash.c | 1749 |
ExecHashGetBucketAndBatch | nodeHash.c | 1960 |
ExecScanHashBucket | nodeHash.c | 1992 |
NTUP_PER_BUCKET (macro) | nodeHash.c | 655 |
HashJoinTableData | hashjoin.h | 298 |
Source verification (as of 2026-06-05)
Section titled “Source verification (as of 2026-06-05)”Each entry leads with a fact about the current source at commit
273fe94(REL_18_STABLE), readable without other materials. The trailing sentence shows how it was checked.
Verified facts
Section titled “Verified facts”-
Nested loop pushes outer Vars into PARAM_EXEC slots and rescans the inner only via
chgParam.ExecNestLoop’snl_NeedNewOuterbranch iteratesnl->nestParams, doesslot_getattrintoecontext->ecxt_param_exec_vals[paramno], setsinnerPlan->chgParam = bms_add_member(...), then callsExecReScan(innerPlan). Verified by readingExecNestLoopon 2026-06-05. -
ExecInitNestLoopsuppressesEXEC_FLAG_REWINDfor a parameterized inner. The init setseflags |= EXEC_FLAG_REWINDwhennode->nestParams == NILand clears it otherwise, beforeExecInitNode(innerPlan(node), ...). Verified by readingExecInitNestLoop. -
Merge join uses a
SortSupportcomparator, not the merge-equality operator, to compare keys.MJExamineQualsfetchesBTSORTSUPPORT_PROC(or shimsBTORDER_PROC) into each clause’sssup;MJComparecallsApplySortComparator. The mergeclause operator is asserted to be the opfamily equality operator (COMPARE_EQ). Verified by readingMJExamineQualsandMJCompare. -
Merge join marks the first inner tuple of an equal group and restores to it for duplicate outer keys.
EXEC_MJ_SKIP_TESTon equality callsExecMarkPos(innerPlan)(unlessmj_SkipMarkRestore) andMarkInnerTuple;EXEC_MJ_TESTOUTERon a still-equal new outer callsExecRestrPos(innerPlan). Verified by reading both states. The inner child is initialized withEXEC_FLAG_MARKunlessskip_mark_restore(read inExecInitMergeJoin). -
The
Hashnode is built byMultiExecProcNode, and itsExecProcNodedeliberately errors.ExecHashcallselog(ERROR, "Hash node does not support ExecProcNode call convention");HJ_BUILD_HASHTABLEcalls(void) MultiExecProcNode((PlanState *) hashNode)afterExecHashTableCreate. Verified by readingExecHash,MultiExecHash, andExecHashJoinImpl. -
A hash value is split into bucket (low bits) and batch (rotated high bits), keeping
nbatcha power of two.ExecHashGetBucketAndBatchdoes*bucketno = hashvalue & (nbuckets - 1)and*batchno = pg_rotate_right32(hashvalue, log2_nbuckets) & (nbatch - 1). Verified by reading the function.ExecHashIncreaseNumBatchesdoesnbatch = oldnbatch * 2. Verified by reading it. -
Build-side spill is driven by measured
spaceUsedexceedingspaceAllowed, not the planner estimate alone.ExecHashTableInsertaccumulatesspaceUsedand callsExecHashIncreaseNumBatcheswhenspaceUsed + nbuckets_optimal * sizeof(HashJoinTuple) > spaceAllowed.ExecChooseHashTableSizesets the initialnbatch/nbucketsfromget_hash_memory_limit. Verified by reading both. -
ExecScanHashBucketpre-filters on the stored hash value before running the hash-clauseExecQual. EachHashJoinTuplestoreshashvalue; the bucket walk runsExecQualAndReset(hjclauses, econtext)only whenhashTuple->hashvalue == hashvalue. Verified by reading the function. -
The same
joinqual/otherqual/single_matchshape appears in all three nodes. Each testsExecQual(joinqual, ...)to set matched status, thenotherqualbeforeExecProject, and short-circuits onjs.single_match(= inner_unique || JOIN_SEMI). Verified by reading the emit paths inExecNestLoop,EXEC_MJ_JOINTUPLES, andHJ_SCAN_BUCKET.
Open questions
Section titled “Open questions”-
Parallel hash join’s barrier choreography is only sketched here. The
PHJ_BUILD_*/PHJ_BATCH_*/PHJ_GROW_*phase machine (theBarrier-coordinated shared state inExecParallelHashJoinandMultiExecParallelHash) is summarized but not traced state-by-state. Investigation path: read thenodeHashjoin.cheader comment block andExecParallelHashJoinNewBatch; likely its own follow-up underpostgres-parallel-query.md. -
The skew optimization (MCV-based skew buckets) is named but not detailed.
ExecHashBuildSkewHash/ExecHashGetSkewBucket/ExecHashSkewTableInsertkeep the most-common inner values in a separate skew hash table so skewed outer tuples never spill. The cost model and thenum_skew_mcvschoice were not traced. Investigation path: readExecHashBuildSkewHashand theskewTableplanner field. -
How
Materialinteracts with merge join’s mark/restore vs. an ordered index scan.mj_ExtraMarksis set only for aMaterialinner without REWIND; index scans must not get extra marks. The exact cost/correctness tradeoff (when the planner inserts aMaterialbelow the inner of a merge join) belongs topostgres-scan-nodes.md/postgres-path-generation.md.
Beyond PostgreSQL — Comparative Designs & Research Frontiers
Section titled “Beyond PostgreSQL — Comparative Designs & Research Frontiers”Pointers, not analysis. Each bullet is a starting handle for a follow-up doc; depth here is intentionally shallow.
-
Block / batch nested loop and the death of the inner-rescan tax. PostgreSQL’s
ExecNestLoopre-reads the inner once per outer tuple. The textbook block-nested-loop refinement (DSC §15.5.1) amortizes that over an outer block, and modern engines push further: SQL Server’s “batch mode” and Oracle’s vector joins process groups of outer rows per inner pass. PostgreSQL leans on the parameterized-index-scan path instead — keeping nested loop attractive only when the inner probe is itself cheap — and onMemoize(a caching node above a parameterized inner) to skip repeated identical inner rescans. A comparison would quantify whatMemoizebuys versus a true block join. -
One-pass vs. recursive (GRACE) partitioning. PostgreSQL’s hybrid hash join does a single partitioning level: if a batch still overflows after doubling
nbatch, it disables further growth and lets the batch exceed the budget (the “best effort” the file header describes). GRACE hash join (Kitsuregawa et al., 1983) and the recursive-partitioning variants in column stores partition recursively until each partition fits. The tradeoff is partitioning passes vs. the risk of a memory blowout on a badly-skewed key; mapping PostgreSQL’s growth-disable heuristic onto the GRACE recursion would sharpen when each wins. -
Radix / cache-conscious hash joins. The classic result of Manegold, Boncz & Kersten (“Optimizing Main-Memory Join on Modern Hardware”, IEEE TKDE 2002) and the later Balkesen et al. SIGMOD 2013 studies show that a TLB- and cache-aware radix partition pass beats a single large chained hash table on modern CPUs even when everything fits in RAM. PostgreSQL’s
ExecScanHashBucketis a pointer-chasing chained table — simple and pointer-stable across batch growth, but not cache-optimized. Relevant if PostgreSQL hash join is ever measured against in-memory analytic engines. -
Sort-merge vs. hash for the analytic mainstream. The long-running “sort-merge vs. hash join” debate (Kim et al., “Sort vs. Hash Revisited”, VLDB 2009; Albutiu et al., “Massively Parallel Sort-Merge Joins”, VLDB 2012) turns on SIMD-accelerated sorting and NUMA bandwidth. PostgreSQL keeps both algorithms and lets the cost model choose; it has neither SIMD sort nor NUMA-aware merge. The cross-reference for how the planner chooses among the three is
postgres-path-generation.md. -
Worst-case-optimal and multi-way joins. All three PostgreSQL joins are binary — a multi-relation join is a left-deep (or bushy) tree of binary joins, which can blow up on cyclic queries. Worst-case-optimal join algorithms (Ngo et al.’s “Leapfrog Triejoin”, Veldhuizen 2014; the AGM bound) join all relations at once and are provably better on such queries. No mainstream row store ships them in core; a note on why (the planner, storage, and API assumptions they violate) would be a useful frontier pointer.
-
CUBRID’s join executors for contrast. CUBRID executes joins inside its XASL tree with nested-loop, sort-merge, and hash-list scan variants; its hash join historically materialized the inner into a “hash list scan” rather than a spillable multi-batch table. A side-by-side of CUBRID’s hash-list-scan spill behavior against PostgreSQL’s hybrid-hash batching would show two different answers to the same memory-budget problem (cross-ref the CUBRID code-analysis tree).
Sources
Section titled “Sources”Textbook chapters (under knowledge/research/dbms-general/)
Section titled “Textbook chapters (under knowledge/research/dbms-general/)”- Database System Concepts (Silberschatz, Korth & Sudarshan, 7e), Ch. 15 “Query Processing”, §15.5 “Join Operation” — §15.5.1 nested-loop and block nested-loop, §15.5.2 indexed nested-loop, §15.5.4 merge join (sort-merge, with the mark/restore duplicate-handling discussion), §15.5.5 hash join including §15.5.5.1–15.5.5.2 GRACE / hybrid (recursive) partitioning and the build/probe cost model.
In-tree design comment
Section titled “In-tree design comment”src/backend/executor/nodeHashjoin.cheader — the authoritative description of the hybrid-hash algorithm, lazy vs. eager batch sizing (serial vs. parallel), the power-of-two batch-doubling rule, the best-effort growth-disable heuristic, and the full parallel barrier phase machine (PHJ_BUILD_*,PHJ_BATCH_*,PHJ_GROW_*).src/backend/executor/nodeMergejoin.cheader — the merge-join algorithm skeleton (theJoin { ... }pseudocode) and the worked outer/inner duplicate examples annotated with theEXEC_MJ_*states.
PostgreSQL source (under /data/hgryoo/references/postgres/, REL_18 273fe94)
Section titled “PostgreSQL source (under /data/hgryoo/references/postgres/, REL_18 273fe94)”src/backend/executor/nodeNestloop.c—ExecNestLoop,ExecInitNestLoop(parameterized-inner REWIND logic),ExecReScanNestLoop.src/backend/executor/nodeMergejoin.c—ExecMergeJoinstate machine,MJExamineQuals,MJEvalOuterValues/MJEvalInnerValues,MJCompare,MJFillOuter/MJFillInner,ExecInitMergeJoin.src/backend/executor/nodeHashjoin.c—ExecHashJoinImplstate machine,ExecHashJoinOuterGetTuple,ExecHashJoinNewBatch,ExecHashJoinSaveTuple,ExecInitHashJoin.src/backend/executor/nodeHash.c—MultiExecHash/MultiExecPrivateHash,ExecHashTableCreate,ExecChooseHashTableSize,ExecHashTableInsert,ExecHashGetBucketAndBatch,ExecHashIncreaseNumBatches,ExecScanHashBucket.src/include/executor/hashjoin.h—HashJoinTableData, the bucket/batch/spill fields,HJTUPLE_*macros.src/include/nodes/execnodes.h—JoinState,NestLoopState,MergeJoinState,HashJoinState.
Cross-references (sibling docs that own adjacent mechanism)
Section titled “Cross-references (sibling docs that own adjacent mechanism)”postgres-executor.md— the demand-pull framework,ExecProcNodedispatch,ExecReScan, theTupleTableSlotabstraction, per-tuple memory.postgres-scan-nodes.md— the leaf scans the joins pull from, including the parameterized index scan (run-time scan keys fromPARAM_EXEC) and theMaterial/Sortnodes feeding merge join.postgres-expression-eval.md— compiledjoinqual/otherqual/ hash-clauseExprStates,ExecQual,ExecProject, the per-side hash and merge-key expressions.postgres-path-generation.md/postgres-planner-overview.md— how the planner chooses among the three join algorithms and assigns inner/outer.