Skip to content

PostgreSQL Join Executors — Nested Loop, Merge Join, and Hash Join

Contents:

A join combines tuples from two relations whenever they satisfy a join predicate. It is the operation the relational model is built around and the one query processing spends the most effort optimizing, because the number of candidate tuple pairs is the product of the two input sizes — a naive “compare every pair” is O(|R| × |S|). Database System Concepts (Silberschatz, Korth & Sudarshan, 7e, ch. 15 “Query Processing”, §15.5 “Join Operation”) frames the physical-operator question as: given that the optimizer has chosen a join order and which input is outer vs. inner, which algorithm evaluates one binary join cheaply? The textbook names three families, and every mainstream engine — PostgreSQL included — implements exactly these three.

  1. Nested-loop join (§15.5.1). The brute-force algorithm: for each tuple t_r in the outer relation r, scan the entire inner relation s and emit every pair (t_r, t_s) that satisfies the predicate θ. Cost is n_r × b_s + b_r block transfers in the worst case (the inner re-read once per outer tuple). It needs no preconditions on the inputs — no sort order, no index, no equality predicate — which is why it is the universal fallback and the only choice for non-equijoins (r.a < s.b) and cross products. The crucial refinement is the indexed nested-loop join (§15.5.2): if there is an index on the inner relation’s join attribute, the inner “scan” becomes an index lookup keyed on the current outer tuple’s value, collapsing the inner cost from b_s to c (the index probe cost) per outer tuple. This turns the worst algorithm into often the best one for a small outer and a selective inner index.

  2. Merge join (§15.5.4, “sort-merge join”). If both inputs are sorted on the join attribute(s), a single coordinated pass over both suffices: advance whichever cursor points at the smaller value until the two meet, emit all matching pairs, then advance past the equal group. Cost is linear in the input sizes once sorted — b_r + b_s — but the inputs must be sorted on the join key, which either the planner pays for with an explicit sort, or which comes free from an ordered index scan. The subtlety the textbook stresses is duplicate handling: when the outer side has several tuples with the same key, the inner cursor must be rewound to the first matching inner tuple for each such outer tuple. The algorithm therefore needs a way to mark an inner position and restore to it.

  3. Hash join (§15.5.5). For an equijoin, partition both inputs by a hash function h on the join attribute. Tuples that join must land in the same partition, so the join reduces to, for each partition, building an in-memory hash table on the (smaller) build input and probing it with the (larger) probe input. Cost is linear — 3(b_r + b_s) with one partitioning pass — provided the build partition fits in memory. When it does not, the GRACE / hybrid hash join (§15.5.5.1–15.5.5.2) recursively partitions to disk: the build relation is split into n partitions written to temporary files, and each partition’s build/probe is processed independently so that only one partition’s hash table need be memory-resident at a time. Hybrid hash join keeps the first partition in memory during the partitioning pass to avoid writing it out at all. Hash join works only for equijoins (the hash function presumes equality semantics) but needs no sort order and no index.

The textbook’s decision summary (§15.5.6 “Complex Joins” and the cost discussion) is the mental model the optimizer encodes: nested loop wins for tiny inputs or a selective inner index; merge join wins when inputs are already sorted (or a sort is cheap and the output order is useful); hash join wins for large unsorted equijoin inputs that exceed what a sort would comfortably cost. The execution engine’s job is to implement all three faithfully and let the planner pick; this document traces those three implementations in the REL_18 source.

Two properties carry over from the executor framework (postgres-executor.md) and shape all three nodes:

  • Demand-pull iterator interface. Each join is a PlanState whose ExecProcNode returns one joined TupleTableSlot per call, pulling input tuples from its lefttree (outer) and righttree (inner) via ExecProcNode on the children. State (which outer tuple we are on, the merge phase, the hash batch) persists between calls in the node struct.
  • State lives in the node, control recurses down. A single next() on the join cascades into next() calls on the children; the join transforms the two input streams into one output stream.

The textbook gives the algorithms. This section names the engineering conventions a production iterator engine adopts to render those algorithms as composable pull nodes — the patterns PostgreSQL shares with other systems, so that the named symbols in ## PostgreSQL's Approach read as one point in a shared design space.

Outer drives, inner is the re-readable side

Section titled “Outer drives, inner is the re-readable side”

In a pull engine the asymmetry between the two inputs is concrete: the outer input is scanned once, front to back; the inner input must be re-readable — scanned afresh for each outer tuple (nested loop), rewound to a marked position (merge join), or fully materialized into a hash table (hash join). Engines encode this by making the inner child support a rescan / mark-restore protocol, and by choosing which physical operator can sit on the inner side of each join type. The cost asymmetry is why join order and inner/outer assignment are planner decisions, not executor ones.

Parameterized rescan for indexed nested loop

Section titled “Parameterized rescan for indexed nested loop”

The indexed nested-loop join is only fast if the inner index scan is keyed on the current outer tuple’s value. The convention is a parameter-passing channel: the join, on each new outer tuple, writes the relevant outer column values into a set of run-time parameter slots, flags those parameters as changed, and triggers a rescan of the inner subtree. The inner index scan reads its scan keys from those parameter slots, so each “rescan” is actually a fresh, cheap index probe. Without this channel the inner side would re-scan its whole relation per outer tuple and the index would be useless.

Mark and restore for merge-join duplicate groups

Section titled “Mark and restore for merge-join duplicate groups”

Merge join over duplicate keys needs to re-join a group of equal inner tuples against each of several equal outer tuples. The convention is a mark / restore pair on the inner stream: mark the position of the first inner tuple in an equal group, advance through the group joining, and when the next outer tuple has the same key, restore the inner cursor back to the mark. Engines implement this either in the access method (an index scan can remember a TID) or by interposing a materialize buffer that caches recently read inner tuples so they can be replayed.

Build the smaller side; spill when memory is exceeded

Section titled “Build the smaller side; spill when memory is exceeded”

Hash join’s linear cost holds only while the build-side hash table fits in the working-memory budget. Every engine therefore (1) builds on the input the planner estimated smaller, and (2) has a graceful-degradation strategy when the estimate is wrong at run time: partition both inputs into batches by extra bits of the hash value, keep one batch in memory, and spill the rest to temporary files, processing batches sequentially. The number of batches is kept a power of two so that doubling it merely inspects one more hash bit, and tuples already on disk need not all be re-read to re-partition. This is the hybrid-hash / GRACE machinery, and its correctness hinges on the invariant that the same hash bits decide both the in-memory bucket and the on-disk batch.

One join base type: predicate split and outer-join fill

Section titled “One join base type: predicate split and outer-join fill”

All three algorithms share outer-join semantics and a two-way predicate split that production engines factor into a common base:

  • joinqual vs. otherqual. The join predicate decides whether two tuples “match” (and so whether an outer tuple has been satisfied, which matters for outer joins); the other predicate (WHERE residue) is an additional filter that must also pass before a row is emitted but does not count toward match status. Splitting them is required to get LEFT JOIN ... ON vs. WHERE semantics right.
  • Null-fill slots. A LEFT/RIGHT/FULL/ANTI join must emit a row with NULLs on the unmatched side. Engines pre-build a constant all-NULLs slot for the side that may be filled and substitute it into the projection when an outer (or inner) tuple finds no match.
  • single_match short-circuit. For semijoins and joins the planner has proven unique on the inner side, the engine stops after the first inner match for a given outer tuple — there is no point scanning further.
Theory / conventionPostgreSQL name
Nested-loop next()ExecNestLoop (in nodeNestloop.c)
Indexed-NL parameter channelNestLoop.nestParamsParamExecData slots, ExecReScan(innerPlan)
Merge-join next() state machineExecMergeJoin (11 EXEC_MJ_* states)
Mark / restore on innerExecMarkPos / ExecRestrPos, mj_MarkedTupleSlot
Per-key comparisonMJCompare over MergeJoinClause[] + SortSupport
Hash-join next() state machineExecHashJoinImpl (6 HJ_* states)
Build-side hash tableHashJoinTable built by MultiExecHash / ExecHashTableInsert
Bucket-and-batch assignmentExecHashGetBucketAndBatch
Probe a bucketExecScanHashBucket
Batch growth / spillExecHashIncreaseNumBatches, ExecHashJoinSaveTuple, BufFile
Memory-budget sizingExecChooseHashTableSize, get_hash_memory_limit
Shared join baseJoinState (js.joinqual, js.ps.qual, js.single_match)
Null-fill slotExecInitNullTupleSlot*_NullInnerTupleSlot / *_NullOuterTupleSlot
Result projectionExecProject(node->js.ps.ps_ProjInfo)

The executor framework (lifecycle, ExecProcNode dispatch, the TupleTableSlot abstraction, ExecReScan) is owned by postgres-executor.md. The leaf inputs these joins pull from — sequential and index scans, including the parameterized index scan that makes indexed nested loop fast, and the Material / Sort nodes that feed merge join — are owned by postgres-scan-nodes.md. The compiled joinqual / otherqual / hash-clause ExprStates and ExecQual / ExecProject are owned by postgres-expression-eval.md. This document covers the three join node bodies and the hash-table build/probe/spill machinery they drive.

PostgreSQL implements the three textbook algorithms as three executor node files, all built on a shared JoinState base, all returning one projected TupleTableSlot per ExecProcNode call.

flowchart TB
  subgraph JS["JoinState (shared base, execnodes.h)"]
    JQ["jointype<br/>joinqual (match predicate)<br/>ps.qual = otherqual (WHERE residue)<br/>single_match"]
  end
  NL["NestLoopState<br/>ExecNestLoop<br/>nl_NeedNewOuter / nl_MatchedOuter"]
  MJ["MergeJoinState<br/>ExecMergeJoin<br/>mj_JoinState (11 states)<br/>mj_MarkedTupleSlot"]
  HJ["HashJoinState<br/>ExecHashJoinImpl<br/>hj_JoinState (6 states)<br/>hj_HashTable"]
  JS --> NL
  JS --> MJ
  JS --> HJ
  HJ -. "inner child" .-> HASH["HashState<br/>MultiExecHash<br/>HashJoinTable (buckets + batches)"]

Figure 1 — The three join executors and their shared JoinState base. Every join carries jointype, the match predicate joinqual, the residual filter otherqual (stored in the generic ps.qual), and the single_match short-circuit flag. Hash join is special in that its inner input is always a dedicated Hash node whose MultiExecHash builds the HashJoinTable the join then probes.

Four design decisions give the join layer its shape:

  1. Outer is lefttree, inner is righttree. outerPlanState(node) / innerPlanState(node) read the two children. The outer is pulled once front-to-back; the inner is the re-readable side (rescanned, mark/restored, or hashed).
  2. The match/filter split is in the base. js.joinqual is the predicate that decides “matched” (and thus drives outer-join and antijoin logic); js.ps.qual (the generic node qual) is the otherqual that must also pass before a row is emitted. Both are compiled ExprStates.
  3. Outer joins use a pre-built NULL slot. ExecInitNullTupleSlot creates an all-NULLs virtual slot for the fillable side, substituted into econtext->ecxt_innertuple / ecxt_outertuple before projecting an unmatched row.
  4. Hash join’s build side is a separate node. The Hash node (nodeHash.c) is not a normal pull node — its ExecProcNode errors out; it is driven once via MultiExecProcNode to build the whole table, then the HashJoin node probes that table directly.

ExecNestLoop is the most literal rendering of the textbook algorithm. Its loop carries two flags: nl_NeedNewOuter (we have exhausted the inner for the current outer and must advance) and nl_MatchedOuter (did the current outer find any inner match — needed for LEFT/ANTI fill). When a new outer tuple is needed it pulls one, and — the key step for indexed nested loop — pushes outer column values into PARAM_EXEC slots and rescans the inner:

// ExecNestLoop — src/backend/executor/nodeNestloop.c (condensed)
if (node->nl_NeedNewOuter)
{
outerTupleSlot = ExecProcNode(outerPlan);
if (TupIsNull(outerTupleSlot))
return NULL; /* outer exhausted -> join done */
econtext->ecxt_outertuple = outerTupleSlot;
node->nl_NeedNewOuter = false;
node->nl_MatchedOuter = false;
/* push outer Vars into the inner scan's PARAM_EXEC slots */
foreach(lc, nl->nestParams)
{
NestLoopParam *nlp = (NestLoopParam *) lfirst(lc);
ParamExecData *prm = &(econtext->ecxt_param_exec_vals[nlp->paramno]);
prm->value = slot_getattr(outerTupleSlot,
nlp->paramval->varattno, &(prm->isnull));
/* Flag parameter value as changed */
innerPlan->chgParam = bms_add_member(innerPlan->chgParam,
nlp->paramno);
}
ExecReScan(innerPlan); /* now an index probe, not a re-read */
}

Setting innerPlan->chgParam is what makes the rescan cheap: recall from postgres-executor.md that ExecProcNode checks chgParam and calls ExecReScan before pulling — and a parameterized index scan reads its scan keys from these very ParamExecData slots, so the “rescan” issues a new index lookup keyed on the current outer value. For a plain (non-indexed) nested loop, nestParams is empty and ExecReScan(innerPlan) simply restarts the inner scan from the beginning.

The body then pulls the next inner tuple and tests the predicates:

// ExecNestLoop — inner advance and qual test (condensed)
innerTupleSlot = ExecProcNode(innerPlan);
econtext->ecxt_innertuple = innerTupleSlot;
if (TupIsNull(innerTupleSlot))
{
node->nl_NeedNewOuter = true;
if (!node->nl_MatchedOuter &&
(node->js.jointype == JOIN_LEFT || node->js.jointype == JOIN_ANTI))
{
/* outer had no match: emit NULL-extended row if otherqual passes */
econtext->ecxt_innertuple = node->nl_NullInnerTupleSlot;
if (otherqual == NULL || ExecQual(otherqual, econtext))
return ExecProject(node->js.ps.ps_ProjInfo);
}
continue; /* go get a new outer tuple */
}
if (ExecQual(joinqual, econtext)) /* match predicate */
{
node->nl_MatchedOuter = true;
if (node->js.jointype == JOIN_ANTI) { /* antijoin: a match means skip */
node->nl_NeedNewOuter = true;
continue;
}
if (node->js.single_match) /* semijoin / unique inner */
node->nl_NeedNewOuter = true;
if (otherqual == NULL || ExecQual(otherqual, econtext))
return ExecProject(node->js.ps.ps_ProjInfo); /* emit joined row */
}

The structure is the same shared shape all three nodes use: joinqual decides MatchedOuter; for JOIN_ANTI a match suppresses output and a non-match (at inner-exhaustion) emits the NULL-filled row; for single_match (semijoin or planner-proven unique inner) one match advances to the next outer; otherqual is the final filter before ExecProject.

flowchart TB
  START["ExecNestLoop call"] --> NEED{"nl_NeedNewOuter?"}
  NEED -->|yes| OUT["ExecProcNode(outer)"]
  OUT --> OUTNULL{"outer NULL?"}
  OUTNULL -->|yes| DONE["return NULL (join done)"]
  OUTNULL -->|no| PARAM["push outer Vars to PARAM_EXEC<br/>set chgParam, ExecReScan(inner)"]
  PARAM --> INNER
  NEED -->|no| INNER["ExecProcNode(inner)"]
  INNER --> INNULL{"inner NULL?"}
  INNULL -->|yes| FILL["LEFT/ANTI and not matched?<br/>emit NULL-filled row"]
  FILL --> NEED
  INNULL -->|no| JQ{"joinqual?"}
  JQ -->|no| INNER
  JQ -->|yes| OQ{"otherqual?"}
  OQ -->|no| INNER
  OQ -->|yes| EMIT["ExecProject -> return joined row"]

Figure 2 — ExecNestLoop control flow. The outer-tuple branch (left) is where the parameterized-inner-rescan happens; the inner-tuple branch (right) is the per-pair qual test. Inner exhaustion both triggers a new outer and is the point at which an unmatched LEFT/ANTI outer tuple is NULL-filled.

ExecReScanNestLoop deliberately does not rescan the inner plan: the inner is rescanned per outer tuple inside the main loop, and rescanning it here would corrupt the run-time-keyed inner index scan. It only rescans the outer (unless the outer itself has chgParam, in which case ExecProcNode will do it) and resets the two flags.

Merge join assumes both inputs arrive sorted on the merge keys (the planner guarantees this with index scans or explicit Sort nodes). The node is coded as an explicit state machine with eleven states; the file-header comment gives the algorithm skeleton, and the states map onto it directly. The per-call entry reads the node, resets the per-tuple context, and dispatches on mj_JoinState:

// ExecMergeJoin — src/backend/executor/nodeMergejoin.c (state dispatch, condensed)
for (;;)
{
switch (node->mj_JoinState)
{
case EXEC_MJ_INITIALIZE_OUTER: /* first call: fetch first outer */
outerTupleSlot = ExecProcNode(outerPlan);
node->mj_OuterTupleSlot = outerTupleSlot;
switch (MJEvalOuterValues(node)) { ... } /* -> INITIALIZE_INNER */
break;
case EXEC_MJ_SKIP_TEST: /* compare; advance the smaller side */
compareResult = MJCompare(node);
if (compareResult == 0) {
if (!node->mj_SkipMarkRestore)
ExecMarkPos(innerPlan); /* mark start of group */
MarkInnerTuple(node->mj_InnerTupleSlot, node);
node->mj_JoinState = EXEC_MJ_JOINTUPLES;
} else if (compareResult < 0)
node->mj_JoinState = EXEC_MJ_SKIPOUTER_ADVANCE;
else
node->mj_JoinState = EXEC_MJ_SKIPINNER_ADVANCE;
break;
/* ... JOINTUPLES, NEXTINNER, NEXTOUTER, TESTOUTER, ENDOUTER, ENDINNER ... */
}
}

The comparison itself is not done by running the merge-equality operator; the planner hands the executor a list of leftexpr = rightexpr clauses plus the btree opfamily / collation / sort direction for each, and MJExamineQuals turns each into a MergeJoinClauseData holding a SortSupport. MJEvalOuterValues / MJEvalInnerValues evaluate the key expressions for the current tuples; MJCompare then walks the clauses calling ApplySortComparator, returning 0 / <0 / >0:

// MJCompare — src/backend/executor/nodeMergejoin.c (condensed)
for (i = 0; i < mergestate->mj_NumClauses; i++)
{
MergeJoinClause clause = &mergestate->mj_Clauses[i];
if (clause->lisnull && clause->risnull) {
nulleqnull = true; /* NULL "=" NULL -> not a real match */
continue;
}
result = ApplySortComparator(clause->ldatum, clause->lisnull,
clause->rdatum, clause->risnull,
&clause->ssup);
if (result != 0)
break;
}
/* NULL=NULL or a constant-false joinqual must not report equality */
if (result == 0 && (nulleqnull || mergestate->mj_ConstFalseJoin))
result = 1; /* treat as outer > inner: advance inner */

The duplicate-group handling is the heart of the algorithm. When SKIP_TEST finds an equal pair it marks the inner position (ExecMarkPos on the inner plan, plus a copy into mj_MarkedTupleSlot), then JOINTUPLES emits the pair and NEXTINNER advances the inner through the equal group. When the inner group ends, NEXTOUTER fetches the next outer tuple and TESTOUTER compares it to the marked inner tuple: if they are still equal, the outer side also had duplicates, so ExecRestrPos rewinds the inner back to the mark to re-join the whole inner group against this new outer tuple:

// ExecMergeJoin — EXEC_MJ_TESTOUTER (condensed)
innerTupleSlot = node->mj_MarkedTupleSlot;
(void) MJEvalInnerValues(node, innerTupleSlot);
compareResult = MJCompare(node);
if (compareResult == 0) /* new outer still equals marked inner */
{
if (!node->mj_SkipMarkRestore)
{
ExecRestrPos(innerPlan); /* rewind inner to the mark */
node->mj_InnerTupleSlot = innerTupleSlot;
}
node->mj_JoinState = EXEC_MJ_JOINTUPLES; /* re-join the group */
}
else if (compareResult > 0) /* new outer past the marked group */
{
/* reload current inner and resume skipping */
...
node->mj_JoinState = EXEC_MJ_SKIP_TEST;
}

When the planner can prove mark/restore is unnecessary (e.g. the inner is unique on the key so groups have size one), it sets skip_mark_restore and the node skips the ExecMarkPos / ExecRestrPos calls entirely (mj_SkipMarkRestore). Outer joins are handled by the doFillOuter / doFillInner flags and the MJFillOuter / MJFillInner helpers, which emit a NULL-extended row (subject to otherqual) for any tuple that exits a SKIP/ADVANCE/END state without having matched.

flowchart TB
  IO["INITIALIZE_OUTER<br/>fetch first outer"] --> II["INITIALIZE_INNER<br/>fetch first inner"]
  II --> ST{"SKIP_TEST<br/>MJCompare"}
  ST -->|"outer < inner"| SOA["SKIPOUTER_ADVANCE<br/>(fill outer if LEFT)"]
  ST -->|"outer > inner"| SIA["SKIPINNER_ADVANCE<br/>(fill inner if RIGHT)"]
  ST -->|"equal"| MARK["ExecMarkPos inner<br/>-> JOINTUPLES"]
  SOA --> ST
  SIA --> ST
  MARK --> JT["JOINTUPLES<br/>emit pair, joinqual+otherqual"]
  JT --> NI["NEXTINNER<br/>advance inner in group"]
  NI -->|"still equal"| JT
  NI -->|"group ended"| NO["NEXTOUTER<br/>advance outer"]
  NO --> TO{"TESTOUTER<br/>new outer == mark?"}
  TO -->|"equal"| RESTR["ExecRestrPos inner<br/>-> JOINTUPLES"]
  TO -->|"greater"| ST
  RESTR --> JT

Figure 3 — ExecMergeJoin as a state machine. The left SKIP_* states synchronize the two cursors; the right JOINTUPLES/NEXTINNER/NEXTOUTER/ TESTOUTER cycle emits an equal group and — via the mark set in SKIP_TEST and the ExecRestrPos in TESTOUTER — re-joins it against each duplicate outer key. The ENDOUTER/ENDINNER states (not shown) drain outer-join fill rows after one input is exhausted.

Hash join is the hybrid-hash algorithm, also a state machine — six HJ_* states — but with an asymmetry the other two lack: its inner input is a dedicated Hash node that builds the whole hash table up front, before any probing. The core driver is ExecHashJoinImpl, marked pg_attribute_always_inline and instantiated twice, as the parallel-oblivious ExecHashJoin and the parallel-aware ExecParallelHashJoin, so the compiler removes the dead parallel branches from each. The build state runs once:

// ExecHashJoinImpl — HJ_BUILD_HASHTABLE (condensed, serial path)
case HJ_BUILD_HASHTABLE:
/* (optional) empty-outer short-circuit for inner joins */
if (!HJ_FILL_INNER(node) && !parallel && ...)
{
node->hj_FirstOuterTupleSlot = ExecProcNode(outerNode);
if (TupIsNull(node->hj_FirstOuterTupleSlot))
return NULL; /* empty outer -> no inner join */
}
hashtable = ExecHashTableCreate(hashNode);
node->hj_HashTable = hashtable;
hashNode->hashtable = hashtable;
(void) MultiExecProcNode((PlanState *) hashNode); /* BUILD the table */
if (hashtable->totalTuples == 0 && !HJ_FILL_OUTER(node))
return NULL; /* empty inner -> no rows */
node->hj_JoinState = HJ_NEED_NEW_OUTER;
/* FALL THRU */

MultiExecProcNode on the Hash child runs MultiExecHashMultiExecPrivateHash, which pulls every inner tuple, computes its hash value, and inserts it into the table (or spills it to a batch file). The build loop is the literal partitioning pass:

// MultiExecPrivateHash — src/backend/executor/nodeHash.c (condensed)
for (;;)
{
slot = ExecProcNode(outerNode); /* "outer" of Hash = inner of join */
if (TupIsNull(slot))
break;
econtext->ecxt_outertuple = slot;
hashdatum = ExecEvalExprSwitchContext(node->hash_expr, econtext, &isnull);
if (!isnull)
{
uint32 hashvalue = DatumGetUInt32(hashdatum);
int bucketNumber = ExecHashGetSkewBucket(hashtable, hashvalue);
if (bucketNumber != INVALID_SKEW_BUCKET_NO)
ExecHashSkewTableInsert(hashtable, slot, hashvalue, bucketNumber);
else
ExecHashTableInsert(hashtable, slot, hashvalue); /* normal insert */
hashtable->totalTuples += 1;
}
}

Probing then runs the HJ_NEED_NEW_OUTERHJ_SCAN_BUCKET cycle: fetch an outer tuple, compute its hash value, find its bucket and batch, and walk the bucket’s chain testing the hash clauses. ExecHashGetBucketAndBatch is the function whose correctness the whole spill scheme rests on — it splits one hash value into the in-memory bucket index (low bits) and the batch index (rotated high bits):

// ExecHashGetBucketAndBatch — src/backend/executor/nodeHash.c
if (nbatch > 1)
{
*bucketno = hashvalue & (nbuckets - 1);
*batchno = pg_rotate_right32(hashvalue,
hashtable->log2_nbuckets) & (nbatch - 1);
}
else
{
*bucketno = hashvalue & (nbuckets - 1);
*batchno = 0;
}
// ExecScanHashBucket — src/backend/executor/nodeHash.c (condensed)
if (hashTuple != NULL)
hashTuple = hashTuple->next.unshared; /* continue current chain */
else
hashTuple = hashtable->buckets.unshared[hjstate->hj_CurBucketNo];
while (hashTuple != NULL)
{
if (hashTuple->hashvalue == hashvalue) /* cheap pre-filter */
{
inntuple = ExecStoreMinimalTuple(HJTUPLE_MINTUPLE(hashTuple),
hjstate->hj_HashTupleSlot, false);
econtext->ecxt_innertuple = inntuple;
if (ExecQualAndReset(hjclauses, econtext)) /* real equality test */
{
hjstate->hj_CurTuple = hashTuple; /* remember position */
return true;
}
}
hashTuple = hashTuple->next.unshared;
}
return false; /* bucket exhausted */

Note the two-level test: a chained hash table stores the full hashvalue in each HashJoinTuple header, so ExecScanHashBucket skips the expensive ExecQual for tuples whose hash differs, only running the hash-clause ExprState on hash-collision candidates.

The build-side table is sized up front by ExecChooseHashTableSize from the planner’s row estimate and the hash_mem budget (get_hash_memory_limit), choosing an initial nbuckets and nbatch (both powers of two). But the estimate can be wrong, so ExecHashTableInsert measures real space used and doubles the batch count when the table outgrows spaceAllowed:

// ExecHashTableInsert — src/backend/executor/nodeHash.c (condensed)
ExecHashGetBucketAndBatch(hashtable, hashvalue, &bucketno, &batchno);
if (batchno == hashtable->curbatch)
{
/* in-memory: prepend to bucket chain */
hashTuple = (HashJoinTuple) dense_alloc(hashtable, HJTUPLE_OVERHEAD + tuple->t_len);
hashTuple->hashvalue = hashvalue;
memcpy(HJTUPLE_MINTUPLE(hashTuple), tuple, tuple->t_len);
hashTuple->next.unshared = hashtable->buckets.unshared[bucketno];
hashtable->buckets.unshared[bucketno] = hashTuple;
hashtable->spaceUsed += HJTUPLE_OVERHEAD + tuple->t_len;
if (hashtable->spaceUsed + hashtable->nbuckets_optimal * sizeof(HashJoinTuple)
> hashtable->spaceAllowed)
ExecHashIncreaseNumBatches(hashtable); /* over budget -> split */
}
else
{
/* belongs to a later batch: spill to a temp file */
Assert(batchno > hashtable->curbatch);
ExecHashJoinSaveTuple(tuple, hashvalue,
&hashtable->innerBatchFile[batchno], hashtable);
}

ExecHashIncreaseNumBatches doubles nbatch, enlarges the per-batch BufFile arrays, then walks every in-memory tuple and re-tests its batch number — tuples whose batchno is no longer curbatch are flushed to their new batch file and freed from memory. Because nbatch is a power of two, a doubling exposes exactly one more hash bit, so roughly half the in-memory tuples migrate and the rest stay. Spilling itself is ExecHashJoinSaveTuple, which lazily creates a temp BufFile and appends hashvalue + the MinimalTuple:

// ExecHashJoinSaveTuple — src/backend/executor/nodeHashjoin.c (condensed)
if (file == NULL)
{
MemoryContext oldctx = MemoryContextSwitchTo(hashtable->spillCxt);
file = BufFileCreateTemp(false);
*fileptr = file;
MemoryContextSwitchTo(oldctx);
}
BufFileWrite(file, &hashvalue, sizeof(uint32)); /* hash, then the tuple */
BufFileWrite(file, tuple, tuple->t_len);

When the current batch’s outer side is exhausted, HJ_NEED_NEW_BATCH calls ExecHashJoinNewBatch, which closes the finished outer file, loads the next inner batch file back into the (now empty) hash table, and reopens that batch’s outer file for probing. Batch 0 is special: its inner tuples are already in memory from the build pass, and its outer tuples are read directly from the outer child (those that hash to a later batch are spilled to outer batch files on the fly by HJ_NEED_NEW_OUTER). This is exactly the hybrid-hash optimization — the first partition never leaves memory.

flowchart TB
  BUILD["HJ_BUILD_HASHTABLE<br/>MultiExecHash builds table<br/>(may spill batches 1..n)"] --> NNO["HJ_NEED_NEW_OUTER<br/>fetch outer, hash it"]
  NNO -->|"outer NULL"| NNB
  NNO -->|"belongs to later batch"| SPILL["save to outer batch file<br/>stay in NEED_NEW_OUTER"]
  SPILL --> NNO
  NNO -->|"this batch"| SB["HJ_SCAN_BUCKET<br/>walk chain, ExecScanHashBucket"]
  SB -->|"match + quals"| EMIT["ExecProject -> return row"]
  SB -->|"no more matches"| FOT["HJ_FILL_OUTER_TUPLE<br/>(LEFT: emit NULL-filled)"]
  FOT --> NNO
  NNB["HJ_NEED_NEW_BATCH<br/>ExecHashJoinNewBatch:<br/>load next inner batch file"] -->|"more batches"| NNO
  NNB -->|"no more batches"| DONE["return NULL (join done)"]

Figure 4 — ExecHashJoinImpl hybrid-hash state machine (serial path). The build runs once; the NEED_NEW_OUTER/SCAN_BUCKET cycle probes the current batch; NEED_NEW_BATCH loads the next spilled batch from disk. Outer tuples that hash to a later batch are spilled rather than probed. HJ_FILL_INNER_TUPLES (right/full joins, not shown) scans the table for unmatched inner tuples before advancing the batch.

Anchor on symbol names, not line numbers. A function or struct name is the stable handle across releases. Use git grep -n '<symbol>' src/backend/executor/ to locate the current position. The line numbers in the position-hint table were observed at commit 273fe94 (REL_18_STABLE) and are quick hints only.

  • ExecNestLoop — the next() body: nl_NeedNewOuter drives a new outer fetch + nestParams push + ExecReScan(innerPlan); the inner advance tests joinqual then otherqual; inner exhaustion handles LEFT/ANTI NULL-fill via nl_NullInnerTupleSlot.
  • ExecInitNestLoop — builds NestLoopState; inits outer child, then sets EXEC_FLAG_REWIND on the inner only if nestParams == NIL (a parameterized inner is rescanned with fresh params, so REWIND is pointless); single_match = inner_unique || JOIN_SEMI; builds the null-inner slot for LEFT/ANTI.
  • ExecReScanNestLoop — rescans the outer (only if its chgParam is null) and resets nl_NeedNewOuter / nl_MatchedOuter; deliberately does not rescan the inner (per-outer rescan happens in the main loop).
  • ExecEndNestLoop — recurse ExecEndNode into both children.
  • ExecMergeJoin — the eleven-state machine (EXEC_MJ_INITIALIZE_OUTER/INNER, SKIP_TEST, SKIPOUTER_ADVANCE/SKIPINNER_ADVANCE, JOINTUPLES, NEXTINNER, NEXTOUTER, TESTOUTER, ENDOUTER/ENDINNER).
  • MJExamineQuals — turns the planner’s leftexpr = rightexpr mergeclause list + opfamily/collation/direction metadata into a MergeJoinClauseData[] array, each with a SortSupport comparator (BTSORTSUPPORT_PROC, or a shim over BTORDER_PROC).
  • MJEvalOuterValues / MJEvalInnerValues — evaluate the merge-key expressions into clause->ldatum / rdatum; return MJEVAL_MATCHABLE / MJEVAL_NONMATCHABLE (NULL key) / MJEVAL_ENDOFJOIN (input exhausted or first-column NULL that ends the join).
  • MJCompare — walk the clauses with ApplySortComparator; map NULL=NULL and mj_ConstFalseJoin to a non-equal result so the inner advances.
  • MarkInnerTuple (macro) / ExecMarkPos / ExecRestrPos — the mark/restore on the inner plan that re-joins duplicate outer keys against an equal inner group; mj_MarkedTupleSlot caches the marked tuple.
  • MJFillOuter / MJFillInner — emit a NULL-extended row (subject to otherqual) for an unmatched outer/inner tuple in an outer join.
  • ExecInitMergeJoin — builds MergeJoinState; creates two extra ExprContexts (mj_OuterEContext / mj_InnerEContext) so key evaluation survives the per-tuple reset; passes EXEC_FLAG_MARK to the inner child unless skip_mark_restore; sets mj_ExtraMarks for a Material inner.
  • ExecReScanMergeJoin / ExecEndMergeJoin — reset to EXEC_MJ_INITIALIZE_OUTER; teardown.
  • ExecHashJoinImpl — the always-inlined six-state machine (HJ_BUILD_HASHTABLE, HJ_NEED_NEW_OUTER, HJ_SCAN_BUCKET, HJ_FILL_OUTER_TUPLE, HJ_FILL_INNER_TUPLES, HJ_NEED_NEW_BATCH).
  • ExecHashJoin / ExecParallelHashJoin — the two thin wrappers that instantiate ExecHashJoinImpl(pstate, false/true).
  • ExecHashJoinOuterGetTuple / ExecParallelHashJoinOuterGetTuple — fetch the next outer tuple: from the child plan on batch 0 (computing its hash value via hj_OuterHash), or replayed from an outer batch BufFile for later batches.
  • ExecHashJoinNewBatch / ExecParallelHashJoinNewBatch — advance to the next batch, loading its inner file into the table and reopening its outer file; return false when no batches remain.
  • ExecHashJoinSaveTuple — append hashvalue + MinimalTuple to a (lazily created) per-batch temp BufFile in spillCxt.
  • ExecInitHashJoin — builds HashJoinState; the inner child is always a Hash node; compiles hashclauses and per-side hash expressions.
  • ExecHash — errors out: the Hash node is driven by MultiExecProcNode, not the per-tuple ExecProcNode convention.
  • MultiExecHashMultiExecPrivateHash / MultiExecParallelHash — pull every inner tuple, hash it, and ExecHashTableInsert (or skew-insert).
  • ExecHashTableCreate / ExecChooseHashTableSize — allocate the HashJoinTable; size nbuckets / nbatch from the row estimate and get_hash_memory_limit; NTUP_PER_BUCKET == 1 is the load-factor target.
  • ExecHashTableInsert — insert in-memory (prepend to bucket chain via dense_alloc) or spill to innerBatchFile[batchno]; trigger ExecHashIncreaseNumBuckets / ExecHashIncreaseNumBatches when thresholds are exceeded.
  • ExecHashGetBucketAndBatch — split a hash value into bucket (low bits) and batch (rotated high bits); the invariant the spill scheme depends on.
  • ExecHashIncreaseNumBatches — double nbatch, re-test every in-memory tuple, flush those now belonging to a later batch.
  • ExecScanHashBucket / ExecParallelScanHashBucket — walk a bucket chain; hashvalue pre-filter then ExecQualAndReset(hashclauses).
  • ExecPrepHashTableForUnmatched / ExecScanHashTableForUnmatched — the right/full-join pass over unmatched inner tuples (HJ_FILL_INNER_TUPLES).
  • HashJoinTableData (hashjoin.h) — nbuckets / nbatch / spaceUsed / spaceAllowed / curbatch / buckets / innerBatchFile / outerBatchFile.
  • struct JoinState — base embedded by all three: jointype, joinqual, single_match; the generic ps.qual holds otherqual.
  • struct NestLoopState / MergeJoinState / HashJoinState — the three concrete state nodes.

Position hints (as of 2026-06-05, REL_18 273fe94)

Section titled “Position hints (as of 2026-06-05, REL_18 273fe94)”
SymbolFileLine
ExecNestLoopnodeNestloop.c60
ExecInitNestLoopnodeNestloop.c262
ExecEndNestLoopnodeNestloop.c361
ExecReScanNestLoopnodeNestloop.c381
ExecMergeJoinnodeMergejoin.c594
MJExamineQualsnodeMergejoin.c175
MJEvalOuterValuesnodeMergejoin.c289
MJEvalInnerValuesnodeMergejoin.c336
MJComparenodeMergejoin.c386
MJFillOuternodeMergejoin.c447
MJFillInnernodeMergejoin.c478
ExecInitMergeJoinnodeMergejoin.c1439
ExecEndMergeJoinnodeMergejoin.c1636
ExecReScanMergeJoinnodeMergejoin.c1652
ExecHashJoinImplnodeHashjoin.c221
ExecHashJoinnodeHashjoin.c684
ExecParallelHashJoinnodeHashjoin.c700
ExecInitHashJoinnodeHashjoin.c716
ExecHashJoinOuterGetTuplenodeHashjoin.c979
ExecHashJoinNewBatchnodeHashjoin.c1130
ExecHashJoinSaveTuplenodeHashjoin.c1414
MultiExecHashnodeHash.c105
MultiExecPrivateHashnodeHash.c138
ExecHashTableCreatenodeHash.c446
ExecChooseHashTableSizenodeHash.c658
ExecHashIncreaseNumBatchesnodeHash.c1030
ExecHashIncreaseNumBucketsnodeHash.c1587
ExecHashTableInsertnodeHash.c1749
ExecHashGetBucketAndBatchnodeHash.c1960
ExecScanHashBucketnodeHash.c1992
NTUP_PER_BUCKET (macro)nodeHash.c655
HashJoinTableDatahashjoin.h298

Each entry leads with a fact about the current source at commit 273fe94 (REL_18_STABLE), readable without other materials. The trailing sentence shows how it was checked.

  • Nested loop pushes outer Vars into PARAM_EXEC slots and rescans the inner only via chgParam. ExecNestLoop’s nl_NeedNewOuter branch iterates nl->nestParams, does slot_getattr into econtext->ecxt_param_exec_vals[paramno], sets innerPlan->chgParam = bms_add_member(...), then calls ExecReScan(innerPlan). Verified by reading ExecNestLoop on 2026-06-05.

  • ExecInitNestLoop suppresses EXEC_FLAG_REWIND for a parameterized inner. The init sets eflags |= EXEC_FLAG_REWIND when node->nestParams == NIL and clears it otherwise, before ExecInitNode(innerPlan(node), ...). Verified by reading ExecInitNestLoop.

  • Merge join uses a SortSupport comparator, not the merge-equality operator, to compare keys. MJExamineQuals fetches BTSORTSUPPORT_PROC (or shims BTORDER_PROC) into each clause’s ssup; MJCompare calls ApplySortComparator. The mergeclause operator is asserted to be the opfamily equality operator (COMPARE_EQ). Verified by reading MJExamineQuals and MJCompare.

  • Merge join marks the first inner tuple of an equal group and restores to it for duplicate outer keys. EXEC_MJ_SKIP_TEST on equality calls ExecMarkPos(innerPlan) (unless mj_SkipMarkRestore) and MarkInnerTuple; EXEC_MJ_TESTOUTER on a still-equal new outer calls ExecRestrPos(innerPlan). Verified by reading both states. The inner child is initialized with EXEC_FLAG_MARK unless skip_mark_restore (read in ExecInitMergeJoin).

  • The Hash node is built by MultiExecProcNode, and its ExecProcNode deliberately errors. ExecHash calls elog(ERROR, "Hash node does not support ExecProcNode call convention"); HJ_BUILD_HASHTABLE calls (void) MultiExecProcNode((PlanState *) hashNode) after ExecHashTableCreate. Verified by reading ExecHash, MultiExecHash, and ExecHashJoinImpl.

  • A hash value is split into bucket (low bits) and batch (rotated high bits), keeping nbatch a power of two. ExecHashGetBucketAndBatch does *bucketno = hashvalue & (nbuckets - 1) and *batchno = pg_rotate_right32(hashvalue, log2_nbuckets) & (nbatch - 1). Verified by reading the function. ExecHashIncreaseNumBatches does nbatch = oldnbatch * 2. Verified by reading it.

  • Build-side spill is driven by measured spaceUsed exceeding spaceAllowed, not the planner estimate alone. ExecHashTableInsert accumulates spaceUsed and calls ExecHashIncreaseNumBatches when spaceUsed + nbuckets_optimal * sizeof(HashJoinTuple) > spaceAllowed. ExecChooseHashTableSize sets the initial nbatch/nbuckets from get_hash_memory_limit. Verified by reading both.

  • ExecScanHashBucket pre-filters on the stored hash value before running the hash-clause ExecQual. Each HashJoinTuple stores hashvalue; the bucket walk runs ExecQualAndReset(hjclauses, econtext) only when hashTuple->hashvalue == hashvalue. Verified by reading the function.

  • The same joinqual/otherqual/single_match shape appears in all three nodes. Each tests ExecQual(joinqual, ...) to set matched status, then otherqual before ExecProject, and short-circuits on js.single_match (= inner_unique || JOIN_SEMI). Verified by reading the emit paths in ExecNestLoop, EXEC_MJ_JOINTUPLES, and HJ_SCAN_BUCKET.

  1. Parallel hash join’s barrier choreography is only sketched here. The PHJ_BUILD_* / PHJ_BATCH_* / PHJ_GROW_* phase machine (the Barrier-coordinated shared state in ExecParallelHashJoin and MultiExecParallelHash) is summarized but not traced state-by-state. Investigation path: read the nodeHashjoin.c header comment block and ExecParallelHashJoinNewBatch; likely its own follow-up under postgres-parallel-query.md.

  2. The skew optimization (MCV-based skew buckets) is named but not detailed. ExecHashBuildSkewHash / ExecHashGetSkewBucket / ExecHashSkewTableInsert keep the most-common inner values in a separate skew hash table so skewed outer tuples never spill. The cost model and the num_skew_mcvs choice were not traced. Investigation path: read ExecHashBuildSkewHash and the skewTable planner field.

  3. How Material interacts with merge join’s mark/restore vs. an ordered index scan. mj_ExtraMarks is set only for a Material inner without REWIND; index scans must not get extra marks. The exact cost/correctness tradeoff (when the planner inserts a Material below the inner of a merge join) belongs to postgres-scan-nodes.md / postgres-path-generation.md.

Beyond PostgreSQL — Comparative Designs & Research Frontiers

Section titled “Beyond PostgreSQL — Comparative Designs & Research Frontiers”

Pointers, not analysis. Each bullet is a starting handle for a follow-up doc; depth here is intentionally shallow.

  • Block / batch nested loop and the death of the inner-rescan tax. PostgreSQL’s ExecNestLoop re-reads the inner once per outer tuple. The textbook block-nested-loop refinement (DSC §15.5.1) amortizes that over an outer block, and modern engines push further: SQL Server’s “batch mode” and Oracle’s vector joins process groups of outer rows per inner pass. PostgreSQL leans on the parameterized-index-scan path instead — keeping nested loop attractive only when the inner probe is itself cheap — and on Memoize (a caching node above a parameterized inner) to skip repeated identical inner rescans. A comparison would quantify what Memoize buys versus a true block join.

  • One-pass vs. recursive (GRACE) partitioning. PostgreSQL’s hybrid hash join does a single partitioning level: if a batch still overflows after doubling nbatch, it disables further growth and lets the batch exceed the budget (the “best effort” the file header describes). GRACE hash join (Kitsuregawa et al., 1983) and the recursive-partitioning variants in column stores partition recursively until each partition fits. The tradeoff is partitioning passes vs. the risk of a memory blowout on a badly-skewed key; mapping PostgreSQL’s growth-disable heuristic onto the GRACE recursion would sharpen when each wins.

  • Radix / cache-conscious hash joins. The classic result of Manegold, Boncz & Kersten (“Optimizing Main-Memory Join on Modern Hardware”, IEEE TKDE 2002) and the later Balkesen et al. SIGMOD 2013 studies show that a TLB- and cache-aware radix partition pass beats a single large chained hash table on modern CPUs even when everything fits in RAM. PostgreSQL’s ExecScanHashBucket is a pointer-chasing chained table — simple and pointer-stable across batch growth, but not cache-optimized. Relevant if PostgreSQL hash join is ever measured against in-memory analytic engines.

  • Sort-merge vs. hash for the analytic mainstream. The long-running “sort-merge vs. hash join” debate (Kim et al., “Sort vs. Hash Revisited”, VLDB 2009; Albutiu et al., “Massively Parallel Sort-Merge Joins”, VLDB 2012) turns on SIMD-accelerated sorting and NUMA bandwidth. PostgreSQL keeps both algorithms and lets the cost model choose; it has neither SIMD sort nor NUMA-aware merge. The cross-reference for how the planner chooses among the three is postgres-path-generation.md.

  • Worst-case-optimal and multi-way joins. All three PostgreSQL joins are binary — a multi-relation join is a left-deep (or bushy) tree of binary joins, which can blow up on cyclic queries. Worst-case-optimal join algorithms (Ngo et al.’s “Leapfrog Triejoin”, Veldhuizen 2014; the AGM bound) join all relations at once and are provably better on such queries. No mainstream row store ships them in core; a note on why (the planner, storage, and API assumptions they violate) would be a useful frontier pointer.

  • CUBRID’s join executors for contrast. CUBRID executes joins inside its XASL tree with nested-loop, sort-merge, and hash-list scan variants; its hash join historically materialized the inner into a “hash list scan” rather than a spillable multi-batch table. A side-by-side of CUBRID’s hash-list-scan spill behavior against PostgreSQL’s hybrid-hash batching would show two different answers to the same memory-budget problem (cross-ref the CUBRID code-analysis tree).

Textbook chapters (under knowledge/research/dbms-general/)

Section titled “Textbook chapters (under knowledge/research/dbms-general/)”
  • Database System Concepts (Silberschatz, Korth & Sudarshan, 7e), Ch. 15 “Query Processing”, §15.5 “Join Operation” — §15.5.1 nested-loop and block nested-loop, §15.5.2 indexed nested-loop, §15.5.4 merge join (sort-merge, with the mark/restore duplicate-handling discussion), §15.5.5 hash join including §15.5.5.1–15.5.5.2 GRACE / hybrid (recursive) partitioning and the build/probe cost model.
  • src/backend/executor/nodeHashjoin.c header — the authoritative description of the hybrid-hash algorithm, lazy vs. eager batch sizing (serial vs. parallel), the power-of-two batch-doubling rule, the best-effort growth-disable heuristic, and the full parallel barrier phase machine (PHJ_BUILD_*, PHJ_BATCH_*, PHJ_GROW_*).
  • src/backend/executor/nodeMergejoin.c header — the merge-join algorithm skeleton (the Join { ... } pseudocode) and the worked outer/inner duplicate examples annotated with the EXEC_MJ_* states.

PostgreSQL source (under /data/hgryoo/references/postgres/, REL_18 273fe94)

Section titled “PostgreSQL source (under /data/hgryoo/references/postgres/, REL_18 273fe94)”
  • src/backend/executor/nodeNestloop.cExecNestLoop, ExecInitNestLoop (parameterized-inner REWIND logic), ExecReScanNestLoop.
  • src/backend/executor/nodeMergejoin.cExecMergeJoin state machine, MJExamineQuals, MJEvalOuterValues/MJEvalInnerValues, MJCompare, MJFillOuter/MJFillInner, ExecInitMergeJoin.
  • src/backend/executor/nodeHashjoin.cExecHashJoinImpl state machine, ExecHashJoinOuterGetTuple, ExecHashJoinNewBatch, ExecHashJoinSaveTuple, ExecInitHashJoin.
  • src/backend/executor/nodeHash.cMultiExecHash/MultiExecPrivateHash, ExecHashTableCreate, ExecChooseHashTableSize, ExecHashTableInsert, ExecHashGetBucketAndBatch, ExecHashIncreaseNumBatches, ExecScanHashBucket.
  • src/include/executor/hashjoin.hHashJoinTableData, the bucket/batch/spill fields, HJTUPLE_* macros.
  • src/include/nodes/execnodes.hJoinState, NestLoopState, MergeJoinState, HashJoinState.

Cross-references (sibling docs that own adjacent mechanism)

Section titled “Cross-references (sibling docs that own adjacent mechanism)”
  • postgres-executor.md — the demand-pull framework, ExecProcNode dispatch, ExecReScan, the TupleTableSlot abstraction, per-tuple memory.
  • postgres-scan-nodes.md — the leaf scans the joins pull from, including the parameterized index scan (run-time scan keys from PARAM_EXEC) and the Material / Sort nodes feeding merge join.
  • postgres-expression-eval.md — compiled joinqual / otherqual / hash-clause ExprStates, ExecQual, ExecProject, the per-side hash and merge-key expressions.
  • postgres-path-generation.md / postgres-planner-overview.md — how the planner chooses among the three join algorithms and assigns inner/outer.