CUBRID Hash Join — Build/Probe Pattern, Hash-Scan Primitives, and Spill Behaviour
Contents:
- Theoretical Background
- Common DBMS Design
- CUBRID’s Approach
  - XASL shape — HASHJOIN_PROC and the manager/context split
  - Driver — qexec_hash_join and the empty-side fast paths
  - Build phase
  - Probe phase
  - Hash-scan primitive — three table layouts
  - Hash-key normalisation, type coercion, and NULL handling
  - Spill — grace-style equi-hash partitioning
  - Cost model integration — qo_hjoin_cost and qo_examine_hash_join
  - Plan-tree integration — make_hashjoin_proc and dispatch
  - Variants — inner, left-outer, right-outer, parallel
- Source Walkthrough
- Cross-check Notes
- Open Questions
- Sources
Theoretical Background
A hash join materialises one side of an equi-join into an
associative table keyed on the join columns, then walks the other
side and looks each tuple up in that table. The strategy is the
direct counter-proposal to the two older join families: nested-loop
(quadratic in the absence of an index) and sort-merge (linear after
two sorts). Hash join trades memory for time — when the smaller side
fits in memory it runs in O(|R| + |S|) with one I/O pass over each
side, which is the best any single-pass equi-join can achieve.
The canonical references are DeWitt, Katz, Olken, Shapiro, Stonebraker and Wood, Implementation Techniques for Main Memory Database Systems (SIGMOD 1984), and Shapiro, Join Processing in Database Systems with Large Main Memories (TODS 1986), which together formalised the classic hash join (read the build side, hash it, then probe with the second side) and the GRACE hash join (Kitsuregawa, Tanaka, Moto-oka, 1983) which partitions both sides up-front so each partition join fits in memory. Graefe, Query Evaluation Techniques for Large Databases (CSUR 1993) §3 enumerates four design points worth naming because every modern engine, including CUBRID, sits at some combination of them:
-
Classic hash join. One pass: build R into a hash table, then scan S and probe. Only correct when R fits in memory; otherwise the table thrashes and the algorithm degenerates to a multi-pass sequential scan over S per memory-resident chunk of R. The simplicity is real — there is no partitioning logic — but the memory contract is brittle.
-
GRACE hash join. Two passes. Phase 1 reads both R and S once and hash-partitions them into B buckets keyed on h_partition(join_key). Phase 2 joins partition pairs (R_i, S_i) independently with classic hash join. Provided each R_i fits in memory, the algorithm completes in two passes over each side. Pure GRACE writes everything to disk before any join work happens, so the first matching tuple is delayed by |R|+|S| I/O — bad for short queries.
-
Hybrid hash join (Shapiro 1986). Mix: keep the first partition’s R rows in an in-memory hash table during phase 1, so probing tuples for partition 0 join immediately; spill the rest. This recovers classic-hash latency for the easy case while keeping GRACE’s correctness guarantee on the hard case.
-
Recursive partitioning. When even one partition’s R does not fit, re-partition that pair with a different hash function and recurse. Required only for severe data skew or when the partition count was under-estimated.
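Condensed to code, the GRACE scheme is just "same partition function on both sides, then join the pairs". A toy sketch under stated assumptions (integer keys, fixed capacities, and a nested-loop stand-in for the per-partition classic hash join; every name here is illustrative, not CUBRID's):

```c
#include <assert.h>
#include <string.h>

#define B 4			/* partition fan-out */
#define MAXP 64			/* per-partition capacity, sketch only */

/* hypothetical partition hash; it must be the SAME for R and S so that
   equal keys always land in the same partition pair */
static int h_partition (int key) { return (int) ((key * 2654435761u) % B); }

typedef struct { int keys[MAXP]; int n; } partition;

static void split (const int *rel, int n, partition parts[B])
{
  for (int i = 0; i < n; i++)
    {
      partition *p = &parts[h_partition (rel[i])];
      p->keys[p->n++] = rel[i];
    }
}

/* per-partition join; a nested loop stands in for the classic hash build */
static int join_pair (const partition *r, const partition *s)
{
  int matches = 0;
  for (int i = 0; i < r->n; i++)
    for (int j = 0; j < s->n; j++)
      if (r->keys[i] == s->keys[j])
	matches++;
  return matches;
}

int grace_join (const int *r, int rn, const int *s, int sn)
{
  partition rp[B], sp[B];
  memset (rp, 0, sizeof rp);
  memset (sp, 0, sizeof sp);
  split (r, rn, rp);		/* phase 1: one pass over each side */
  split (s, sn, sp);
  int total = 0;
  for (int b = 0; b < B; b++)	/* phase 2: join partition pairs */
    total += join_pair (&rp[b], &sp[b]);
  return total;
}
```

The correctness argument is entirely in `h_partition`: because both sides are split with the same function, a cross-partition match is impossible, so each pair can be joined in isolation.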
Three orthogonal considerations cut across all four:
-
Hash-table layout. Open addressing, separate chaining with per-bucket lists, or a directory-of-buckets like a textbook in-memory extendible hash. CUBRID’s in-memory path uses chained hashing (mht_hls) and its on-disk path uses a directory-style extendible hash (FHS, “file hash structure” — see cubrid-extendible-hash.md).
-
Build-side selection. Since the hash table is the memory hog, the smaller relation goes on build. With pre-computed cardinalities this is a CBO decision; without them it can be decided dynamically by reading both sides up to a threshold and taking as build whichever side runs to completion first (sometimes called role reversal).
-
NULL semantics. SQL equality returns UNKNOWN against NULL, so hash-keyed NULLs never join in equi-joins — they have to be either filtered out at hash time or routed through a fallback that emits null-padded outer tuples for outer joins.
The textbook Garcia-Molina/Ullman/Widom Database Systems: The
Complete Book §15.4–§15.5 contrasts hash join with sort-merge on the
same axes and shows that for an equi-join with no useful sort-order
output requirement, hybrid hash beats sort-merge by roughly a factor
of log(|R|/M) (the sort-merge needs that many merge passes; hash
needs one if |R|<M·B for B partitions). Sort-merge wins only when
some downstream operator (typically a MERGE JOIN or a
GROUP BY/ORDER BY on the join key) wants a sorted output for free.
CUBRID consumes this design space at two levels:
- The optimizer (qo_hjoin_cost, qo_examine_hash_join) admits hash join as one of three join methods alongside nested-loop and sort-merge, but with a strong gating bias: by default, qo_examine_hash_join returns immediately unless the user wrote /*+ USE_HASH */ on the inner relation. The cost model itself uses a fixed CPU-overhead constant for build/probe and treats spill as free (the IO term is currently disabled), so the hint is the effective decision lever in production.
- The executor (qexec_hash_join) realises all four of Graefe’s designs at once: it picks an in-memory layout when the build side fits, a hybrid layout (in-memory index, on-disk tuple storage) when the index fits but the tuples do not, an extendible-hash file when even the index does not fit, and falls back to grace-style partitioning (hjoin_try_partition) when the chosen primitive’s per-partition memory budget is still exceeded.
Common DBMS Design
Every relational engine that supports hash join exposes the same seam: a physical operator at the executor side that owns build and probe phases, and a cost rule at the optimizer side that picks the operator over nested-loop and sort-merge. Naming the common vocabulary up front makes the CUBRID identifiers in the next section read as one set of dial settings.
PostgreSQL — nodeHashjoin.c + nodeHash.c
PostgreSQL splits the operator into two: Hash (a blocking node that
materialises its child into a HashJoinTable) and HashJoin (the
probing node). The build runs at Hash’s first ExecProcNode; the
probe runs in HashJoin’s next() until the inner hash table is
exhausted for the current outer tuple. PostgreSQL has used a hybrid
hash since v7.x, with BatchFile-backed grace partitioning when
work_mem is exceeded; v11 added parallel-aware hash join with
shared barriers between worker batches. The hash function is the
hashfunc registered for the join column’s datatype; multi-column
keys hash by combining per-column hashes with (h<<5)|(h>>27) ^ h2.
Build-side selection is a planner decision based on
pg_class.reltuples/pg_class.relpages; it does not flip dynamically.
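The rotate-and-XOR combiner described above is small enough to sketch stand-alone (this is an illustration, not PostgreSQL's actual source); it also shows why the combination is order-sensitive:

```c
#include <assert.h>
#include <stdint.h>

/* rotate-left by 5, the (h<<5)|(h>>27) from the text */
uint32_t rotl5 (uint32_t h) { return (h << 5) | (h >> 27); }

/* combine per-column hash words: rotate the accumulator, XOR the next
   column's word; swapping column order changes the result, so (a, b)
   and (b, a) keys hash differently */
uint32_t combine_hashes (const uint32_t *col_hashes, int ncols)
{
  uint32_t h = 0;
  for (int i = 0; i < ncols; i++)
    h = rotl5 (h) ^ col_hashes[i];
  return h;
}
```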
MySQL — HashJoinIterator (8.0+)
MySQL only got hash join in 8.0.18, replacing block-nested-loop for
many cases. The iterator is in sql/iterators/hash_join_iterator.cc.
It keeps an in-memory mem_root hash table of the build side; on
overflow it spills both inputs to per-batch chunk files and processes
batch pairs in turn (so MySQL’s mode is essentially GRACE without the
hybrid optimisation). MySQL deliberately does not implement parallel
hash join in the iterator; parallelism is handled at the partition
level for its parallel-query feature.
SQL Server — Hash Match operator
SQL Server picks among In-Memory Hash, GRACE, and Recursive
Hash as the same physical operator (Hash Match), with the
selection driven at runtime by memory grant. When the build side
exceeds the grant, the operator spills bucket pages to TempDB; if a
partition still does not fit, it re-hashes with a secondary function
(“recursive hash”), and as a last resort falls through to a “bailout”
that loops over residual bucket files. The optimizer’s join rule
evaluates all three join methods (PhyOp_HashJoin, PhyOp_LoopsJoin,
PhyOp_MergeJoin) and prunes by cost; there is no “hint required”
gate as in CUBRID’s default configuration.
Oracle — HASH JOIN operator with HJ tagged variants
Oracle exposes HASH JOIN (in-memory build-then-probe), HASH JOIN SEMI, HASH JOIN ANTI, and HASH JOIN OUTER as distinct execution
plan operators. Spill is to PGA temporary segments. Oracle’s
optimizer decides at parse time, not at execution time, by way of
_hash_join_enabled and the cost model. The OUTER variant emits
null-padded tuples directly out of the same probe loop with a
“matched” bit on each build entry, similar to PostgreSQL’s
HJ_FILL_INNER_TUPLES machinery.
Theory ↔ CUBRID mapping
| Theoretical concept | CUBRID name |
|---|---|
| Hash-join physical operator | HASHJOIN_PROC (XASL xasl->type); HASHJOIN_PROC_NODE (xasl.h) |
| Build/probe driver | qexec_hash_join → hjoin_execute_internal (query_hash_join.c) |
| Per-execution shared state | HASHJOIN_MANAGER (query_hash_join.h) |
| Per-partition execution context | HASHJOIN_CONTEXT (query_hash_join.h) |
| Per-side fetch / domain / list-id bundle | HASHJOIN_FETCH_INFO (query_hash_join.h) |
| Hash-table primitive (build/probe surface) | HASH_LIST_SCAN (query_hash_scan.h); members temp_key, temp_new_key, memory.hash_table, file.hash_table |
| Hash-table layout selector | HASH_METHOD ∈ {NOT_USE, IN_MEM, HYBRID, HASH_FILE} (query_hash_scan.h) |
| In-memory hash table | MHT_HLS_TABLE from mht_create_hls (memory_hash.c) |
| Disk-backed hash table | FHSID (file hash structure ID) opened by fhs_create (query_hash_scan.c) |
| Per-tuple key descriptor | HASH_SCAN_KEY (val_count, values[]) |
| Per-tuple value descriptor (in-mem stores tuple, hybrid stores VPID/offset) | HASH_SCAN_VALUE (union of tuple and pos) |
| Hash-key computation | qdata_hash_scan_key (query_hash_scan.c) |
| Build phase | hjoin_build + hjoin_build_key (query_hash_join.c) |
| Probe phase (inner) | hjoin_probe + hjoin_probe_key |
| Probe phase (outer-join, with null-pad fallback) | hjoin_outer_probe |
| GRACE-style partitioning trigger | hjoin_check_partition → HASHJOIN_STATUS_PARTITION (query_hash_join.c) |
| GRACE-style splitter | hjoin_split_qlist (query_hash_join.c) |
| Memory budget for one partition | PRM_ID_MAX_HASH_LIST_SCAN_SIZE (max_hash_list_scan_size, default 16 MB) |
| Per-partition fill margin | PARTITION_FILL_FACTOR = 0.8 (query_hash_join.c) |
| Cost-model entry | qo_hjoin_cost (query_planner.c:3604) |
| Optimizer admission gate | qo_examine_hash_join (query_planner.c:6618) |
| Hash-join CPU weight constants | HJ_BUILD_CPU_OVERHEAD_FACTOR = 30, HJ_PROBE_CPU_OVERHEAD_FACTOR = 20 (query_planner.c:82–83) |
| Plan-to-XASL lowering | make_hashjoin_proc (plan_generation.c:553); pt_to_hashjoin_proc (xasl_generation.c:14889) |
| Executor dispatch | case HASHJOIN_PROC: in qexec_execute_mainblock_internal (query_executor.c:14748) |
| Output-tuple writer | hjoin_merge_tuple_to_list_id (query_hash_join.c) |
| Parallel-execution backend | parallel_query::hash_join::execute_partitions (px_hash_join.cpp) |
CUBRID’s Approach
CUBRID’s hash join is a Volcano-shaped blocking-build/streaming-probe
operator with a dedicated XASL type (HASHJOIN_PROC). Unlike the
adjacent MERGELIST_PROC operator (sort-merge), it does not consume
its inputs through a generic SCAN_ID chain — it owns a custom driver
(qexec_hash_join) that pulls both inputs as already-materialised
list files from the XASL’s aptr_list, drives Build and Probe in
sequence, and writes the joined output into the parent XASL’s
list_id.
XASL shape — HASHJOIN_PROC and the manager/context split
The XASL node carries two HASHJOIN_INPUT substructures (one per
side) and a QFILE_LIST_MERGE_INFO describing the join key columns
and output projection:
```c
// hashjoin_proc_node — src/query/xasl.h
typedef struct hashjoin_proc_node
{
  HASHJOIN_INPUT outer;
  HASHJOIN_INPUT inner;
  QFILE_LIST_MERGE_INFO merge_info;
#if defined (SERVER_MODE) || defined (SA_MODE)
  HASHJOIN_DOMAIN_INFO domain_info;
  HASHJOIN_STATS_GROUP stats_group;
#endif
} HASHJOIN_PROC_NODE;
```

The two HASHJOIN_INPUT structs carry a child XASL_NODE *xasl
(the sub-block that produces the side’s list file) plus a
REGU_VARIABLE_LIST regu_list_pred for evaluating during-join
predicates. The merge_info is the same QFILE_LIST_MERGE_INFO
sort-merge uses, with ls_column_cnt (number of join-key columns),
ls_outer_column[] / ls_inner_column[] (column positions in each
side’s tuple), ls_pos_cnt / ls_pos_list[] /
ls_outer_inner_list[] (output projection: which side and which
column for each output value), and join_type.
When qexec_hash_join runs it builds two further structs that exist
only for the duration of one execution — HASHJOIN_MANAGER and
HASHJOIN_CONTEXT:
```c
// hashjoin_manager — src/query/query_hash_join.h
typedef struct hashjoin_manager
{
  HASHJOIN_INPUT *outer;	/* points into proc.hashjoin */
  HASHJOIN_INPUT *inner;
  QFILE_LIST_MERGE_INFO *merge_info;
  JOIN_TYPE join_type;
  int key_cnt;
  PRED_EXPR *during_join_pred;
  int num_parallel_threads;
  QUERY_ID query_id;
  VAL_DESCR *val_descr;
  HASHJOIN_CONTEXT single_context;	/* fast-path single partition */
  HASHJOIN_CONTEXT *contexts;	/* one per grace partition */
  UINT32 context_cnt;
  QFILE_TUPLE_VALUE_TYPE_LIST type_list;
  HASHJOIN_MERGE_METHOD qlist_merge_method;
  parallel_query::worker_manager *px_worker_manager;
  ...
} HASHJOIN_MANAGER;
```

The split is intentional: HASHJOIN_MANAGER holds everything that
is constant across partitions (input pointers, join type, output
schema, parallel worker pool); HASHJOIN_CONTEXT holds everything
specific to one partition (build side, probe side, hash scan, output list-id).
For the no-spill case there is exactly one context
(single_context); when grace partitioning kicks in, contexts[] is
allocated with one entry per partition, and the manager iterates
through them.
flowchart LR
XASL["XASL_NODE<br/>type=HASHJOIN_PROC<br/>aptr_list=[outer, inner]<br/>list_id=output"]
PROC["HASHJOIN_PROC_NODE<br/>(in xasl->proc.hashjoin)<br/>outer.xasl, inner.xasl,<br/>merge_info, domain_info,<br/>stats_group"]
MGR["HASHJOIN_MANAGER<br/>(stack-local in qexec_hash_join)<br/>join_type, key_cnt,<br/>type_list, contexts[]"]
CTX["HASHJOIN_CONTEXT<br/>outer/inner FETCH_INFO,<br/>build, probe,<br/>hash_scan, list_id, status"]
HLS["HASH_LIST_SCAN<br/>temp_key, temp_new_key,<br/>memory.hash_table OR<br/>file.hash_table"]
XASL --> PROC
XASL -- aptr_list[0] --> OUT["outer XASL → list_id"]
XASL -- aptr_list[1] --> INR["inner XASL → list_id"]
PROC --> MGR
MGR --> CTX
CTX --> HLS
Driver — qexec_hash_join and the empty-side fast paths
The driver is dispatched by the XASL interpreter. The relevant arm of
qexec_execute_mainblock_internal is small:
```c
// qexec_execute_mainblock_internal — src/query/query_executor.c
case HASHJOIN_PROC:
  if (qexec_hash_join (thread_p, xasl, xasl_state->query_id, &xasl_state->vd) != NO_ERROR)
    {
      GOTO_EXIT_ON_ERROR;
    }
  break;
```

Because both inputs are already materialised by the time the executor
hits this case (their XASLs sit in aptr_list and have already run in
the pre-pass loop for (xptr2 = xptr->aptr_list; ...) { qexec_execute_mainblock (xptr2, ...); }),
qexec_hash_join can read their final cardinalities directly
from outer.xasl->list_id->tuple_cnt and decide build-side and
partitioning up-front.
The driver’s first job is to cope with empty sides cheaply:
```c
// qexec_hash_join — src/query/query_hash_join.c (condensed)
int
qexec_hash_join (THREAD_ENTRY *thread_p, XASL_NODE *xasl, QUERY_ID query_id, VAL_DESCR *val_descr)
{
  HASHJOIN_MANAGER manager;
  HASHJOIN_CONTEXT *single_context;
  HASHJOIN_STATUS status, part_status;

  hjoin_init_manager (thread_p, &manager, xasl, query_id, val_descr);
  single_context = &manager.single_context;

  status = hjoin_check_empty_inputs (&manager, single_context);
  switch (status)
    {
    case HASHJOIN_STATUS_FILL_NULL_VALUES:
      hjoin_outer_fill_null_values (thread_p, &manager, single_context);
      break;
    case HASHJOIN_STATUS_TRY:
      part_status = hjoin_try_partition (thread_p, &manager, single_context);
      switch (part_status)
	{
	case HASHJOIN_STATUS_SINGLE:
	  hjoin_execute (..., single_context);
	  break;
	case HASHJOIN_STATUS_PARTITION:
	  hjoin_execute_partitions (...);
	  break;
	case HASHJOIN_STATUS_PARALLEL:
	  parallel_query::hash_join::execute_partitions (...);
	  break;
	}
      break;
    case HASHJOIN_STATUS_END:
      /* nothing to do; output is empty */
      break;
    }
  qfile_copy_list_id (xasl->list_id, single_context->list_id, ...);
  hjoin_clear_manager (thread_p, &manager);
  ...
}
```

hjoin_check_empty_inputs is the first decision:
```c
// hjoin_check_empty_inputs — src/query/query_hash_join.c (condensed)
switch (manager->join_type)
  {
  case JOIN_INNER:
    status = (outer_tuple_cnt == 0 || inner_tuple_cnt == 0)
      ? HASHJOIN_STATUS_END : HASHJOIN_STATUS_TRY;
    break;
  case JOIN_LEFT:
    status = (outer_tuple_cnt == 0) ? HASHJOIN_STATUS_END
      : (inner_tuple_cnt == 0) ? HASHJOIN_STATUS_FILL_NULL_VALUES
      : HASHJOIN_STATUS_TRY;
    break;
  case JOIN_RIGHT:
    status = (inner_tuple_cnt == 0) ? HASHJOIN_STATUS_END
      : (outer_tuple_cnt == 0) ? HASHJOIN_STATUS_FILL_NULL_VALUES
      : HASHJOIN_STATUS_TRY;
    break;
  }
```

Three statuses: END (one or both sides are empty in a way that
produces no output), FILL_NULL_VALUES (the preserved side of an
outer join is non-empty but the null-supplying side is empty — emit
null-padded preserved-side tuples directly without ever building a
hash table), and TRY (proceed with build/probe). The
FILL_NULL_VALUES shortcut pays off for outer joins against empty
lookup tables, where the entire build phase would be wasted.
flowchart TB
START([qexec_hash_join])
CHK{hjoin_check_empty_inputs}
END_NULL[HASHJOIN_STATUS_END<br/>output is empty]
FILL[hjoin_outer_fill_null_values<br/>emit null-padded preserved rows]
TRY{hjoin_try_partition}
SINGLE[HASHJOIN_STATUS_SINGLE<br/>hjoin_execute on single_context]
PART[HASHJOIN_STATUS_PARTITION<br/>hjoin_execute_partitions<br/>over contexts[]]
PXL[HASHJOIN_STATUS_PARALLEL<br/>parallel_query::hash_join::<br/>execute_partitions]
COPY[qfile_copy_list_id<br/>→ xasl->list_id]
START --> CHK
CHK -- END --> END_NULL --> COPY
CHK -- FILL_NULL_VALUES --> FILL --> COPY
CHK -- TRY --> TRY
TRY -- SINGLE --> SINGLE --> COPY
TRY -- PARTITION --> PART --> COPY
TRY -- PARALLEL --> PXL --> COPY
COPY --> DONE([clear manager])
Build phase
hjoin_build is the build loop. Three things matter here: which side
plays build, what is fetched into the key, and what is stored as the
value.
Build-side selection is per-context, in hjoin_init_context:
```c
// hjoin_init_context — src/query/query_hash_join.c (condensed)
switch (manager->join_type)
  {
  case JOIN_INNER:
    if (outer->list_id->tuple_cnt < inner->list_id->tuple_cnt)
      {
	context->build = outer;
	context->probe = inner;
      }
    else if (outer->list_id->tuple_cnt == inner->list_id->tuple_cnt
	     && outer->list_id->page_cnt < inner->list_id->page_cnt)
      {
	context->build = outer;
	context->probe = inner;
      }
    else
      {
	context->build = inner;
	context->probe = outer;
      }
    break;
  case JOIN_LEFT:		/* preserve outer; null-pad inner */
    context->build = inner;
    context->probe = outer;
    break;
  case JOIN_RIGHT:
    context->build = outer;
    context->probe = inner;
    break;
  }
```

For inner join the smaller side wins, with page count as the tie-breaker (page count reflects tuple width × tuple count, so it tracks total bytes more closely than tuple count alone). For outer join the choice is forced by semantics: the null-supplying side has to be on build, so its hash table can be probed and any miss can fire null-padding for the preserved side.
Per-tuple build in the no-partition case:
```c
// hjoin_build — src/query/query_hash_join.c (condensed)
while ((scan_code = qfile_scan_list_next (thread_p, &build->list_scan_id,
					  &build->tuple_record, PEEK)) == S_SUCCESS)
  {
    error = hjoin_fetch_key (thread_p, build, &build->tuple_record, key,
			     NULL /* compare_key */, &need_skip_next);
    if (need_skip_next)
      {
	need_skip_next = false;
	continue;		/* NULL in any join column */
      }
    hash_scan->curr_hash_key = qdata_hash_scan_key (key, UINT_MAX, hash_method);
    error = hjoin_build_key (thread_p, hash_scan, &build->list_scan_id, &build->tuple_record);
  }
```

The loop is the literal Volcano shape: pull the next tuple,
materialise the join-key columns into a HASH_SCAN_KEY, hash, insert.
The interesting wrinkle is need_skip_next: tuples whose join column
contains any NULL are silently dropped from the build side, because
SQL’s = on NULL never yields TRUE. (For outer-join’s preserved-NULL
tuples the dropping happens on the probe side instead — they are
routed to a “fill null” partition; see “Spill” below.)
In the partitioned case the tuple has already been hashed during
the splitter pass, and the hash key is stashed in the tuple’s first
value slot; hjoin_build reads it back rather than recomputing it:
```c
// hjoin_build (partitioned arm) — src/query/query_hash_join.c (condensed)
tuple_value = build->tuple_record.tpl + QFILE_TUPLE_LENGTH_SIZE;
tuple_value += QFILE_TUPLE_VALUE_HEADER_LENGTH;
hash_scan->curr_hash_key = (UINT32) OR_GET_INT (tuple_value);
hjoin_build_key (thread_p, hash_scan, &build->list_scan_id, &build->tuple_record);
```

This is the same trick PostgreSQL plays in batched hash joins: hash once at partition time, reuse the hash word at build/probe time.
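The stash-and-reuse trick can be sketched in isolation. The offsets below are placeholder widths standing in for QFILE_TUPLE_LENGTH_SIZE and QFILE_TUPLE_VALUE_HEADER_LENGTH, and both function names are hypothetical:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical tuple layout (placeholder widths, not CUBRID's actual
   QFILE_* constants):
   [ tuple length : 4 ][ value header : 8 ][ hash word : 4 ] ...      */
enum { TUPLE_LENGTH_SIZE = 4, VALUE_HEADER_LENGTH = 8 };

/* splitter pass: write the hash word into the tuple's first value slot */
void stash_hash (char *tpl, uint32_t hash_word)
{
  memcpy (tpl + TUPLE_LENGTH_SIZE + VALUE_HEADER_LENGTH,
	  &hash_word, sizeof hash_word);
}

/* build/probe pass: read it back instead of recomputing the hash */
uint32_t read_stashed_hash (const char *tpl)
{
  uint32_t h;
  memcpy (&h, tpl + TUPLE_LENGTH_SIZE + VALUE_HEADER_LENGTH, sizeof h);
  return h;
}
```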
hjoin_build_key is the polymorphic insert. It dispatches on
hash_scan->hash_list_scan_type:
```c
// hjoin_build_key — src/query/query_hash_join.c (condensed)
switch (hash_scan->hash_list_scan_type)
  {
  case HASH_METH_IN_MEM:
    hash_value = qdata_alloc_hscan_value (thread_p, tuple_record->tpl);
    mht_put_hls (hash_scan->memory.hash_table,
		 (void *) &hash_scan->curr_hash_key, (void *) hash_value);
    break;
  case HASH_METH_HYBRID:
    hash_value = qdata_alloc_hscan_value_OID (thread_p, list_scan_id);
    mht_put_hls (hash_scan->memory.hash_table,
		 (void *) &hash_scan->curr_hash_key, (void *) hash_value);
    break;
  case HASH_METH_HASH_FILE:
    SET_TFTID (tftid, list_scan_id->curr_vpid.volid,
	       list_scan_id->curr_vpid.pageid, list_scan_id->curr_offset);
    fhs_insert (thread_p, hash_scan->file.hash_table,
		(void *) &hash_scan->curr_hash_key, &tftid);
    break;
  }
```

The arms differ only in what is stored as the value: a deep-copied
tuple (in-memory), a (volid, pageid, offset) tuple position
(hybrid), or a TFTID written to an extendible-hash bucket
(hash-file). The key — a 32-bit hash word — is the same in all three.
Probe phase
hjoin_probe is the inner-join probe loop. It mirrors the build
loop’s outer shape (a qfile_scan_list_next over the probe side’s
list file) and inside it runs an inner do { } while (true) loop
that walks the chain of build entries hashing to the same key:
```c
// hjoin_probe — src/query/query_hash_join.c (condensed)
while ((scan_code = qfile_scan_list_next (thread_p, &probe->list_scan_id,
					  &probe->tuple_record, PEEK)) == S_SUCCESS)
  {
    if (manager->context_cnt == 0)
      {
	/* fresh fetch — no pre-computed hash on tuple */
	hjoin_fetch_key (thread_p, probe, &probe->tuple_record, key, NULL, &need_skip_next);
	if (need_skip_next)
	  {
	    continue;		/* NULL → no inner match possible */
	  }
	hash_scan->curr_hash_key = qdata_hash_scan_key (key, UINT_MAX, hash_method);
      }
    else
      {
	/* partitioned — read pre-computed hash word out of tuple slot 0 */
	...
      }

    do
      {
	hjoin_probe_key (thread_p, hash_scan, &build->list_scan_id, &build->tuple_record);
	if (build->tuple_record.tpl == NULL)
	  {
	    break;		/* not found */
	  }
	if (manager->context_cnt != 0)
	  {
	    hjoin_fetch_key (thread_p, probe, &probe->tuple_record, key, NULL, &need_skip_next);
	  }
	hjoin_fetch_key (thread_p, build, &build->tuple_record, found_key,
			 key /* compare_key */, &need_skip_next);
	if (need_skip_next)
	  {
	    continue;		/* hash collision: same hash, different key */
	  }
	hjoin_merge_tuple_to_list_id (thread_p, list_id, &outer->tuple_record,
				      &inner->tuple_record, manager->merge_info,
				      &overflow_record);
      }
    while (true);
  }
```

The inner loop is what makes the operator tolerate hash collisions
correctly: hjoin_probe_key returns any build entry whose hash word
equals the current probe hash; the body then fetches the build
tuple’s key columns into found_key and re-compares them against
the probe key. If the columns disagree (same hash, different value
— a collision), need_skip_next is set and the loop goes after the
next chained entry. Only when keys match exactly does
hjoin_merge_tuple_to_list_id write the joined output tuple.
This is the same correctness guard PostgreSQL has in ExecHashJoin
when it does ExecQual(hjclauses, econtext) after hashing — the hash
is a fast filter, not a definitive match.
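The "fast filter, not definitive match" point is easy to demonstrate with a deliberately weak hash, where collisions are trivial to construct. A self-contained sketch with illustrative names, not CUBRID code:

```c
#include <assert.h>
#include <stdint.h>

/* deliberately weak hash: keys 1 and 9 collide (both hash to 1) */
static uint32_t weak_hash (int64_t key) { return (uint32_t) (key % 8); }

/* returns 1 only on a true key match, never on a bare hash-word match;
   the key re-compare is the collision guard the chain walk performs */
int probe_matches (int64_t build_key, int64_t probe_key)
{
  if (weak_hash (build_key) != weak_hash (probe_key))
    return 0;			/* fast reject: different hash words */
  return build_key == probe_key;	/* definitive: compare real keys */
}
```

Dropping the final comparison would silently join key 1 against key 9; that is exactly the bug the found_key re-compare in the probe loop prevents.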
hjoin_probe_key itself dispatches on the table layout:
```c
// hjoin_probe_key — src/query/query_hash_join.c (condensed)
switch (hash_scan->hash_list_scan_type)
  {
  case HASH_METH_IN_MEM:
    hash_value = (tuple_record->tpl == NULL)
      ? mht_get_hls (hash_scan->memory.hash_table, (void *) &hash_scan->curr_hash_key,
		     (void **) &hash_scan->memory.curr_hash_entry)
      : mht_get_next_hls (hash_scan->memory.hash_table, ...);
    if (hash_value != NULL)
      {
	tuple_record->tpl = hash_value->tuple;
      }
    break;

  case HASH_METH_HYBRID:
    hash_value = (tuple_record->tpl == NULL) ? mht_get_hls (...) : mht_get_next_hls (...);
    if (hash_value != NULL)
      {
	MAKE_TUPLE_POSTION (tuple_position, hash_value->pos, list_scan_id);
	qfile_jump_scan_tuple_position (thread_p, list_scan_id, &tuple_position,
					tuple_record, PEEK);
      }
    break;

  case HASH_METH_HASH_FILE:
    eh_search = (tuple_record->tpl == NULL)
      ? fhs_search (thread_p, hash_scan, &tftid)
      : fhs_search_next (thread_p, hash_scan, &tftid);
    if (eh_search == EH_KEY_FOUND)
      {
	MAKE_TFTID_TO_TUPLE_POSTION (tuple_position, tftid, list_scan_id);
	qfile_jump_scan_tuple_position (thread_p, list_scan_id, &tuple_position,
					tuple_record, PEEK);
      }
    break;
  }
```

The tuple_record->tpl == NULL check is the chain-walk anchor: on
first call for a probe key, ask for the first matching entry; on
subsequent calls, ask for the next. The hybrid and hash-file paths
follow up the position lookup with a qfile_jump_scan_tuple_position
that re-reads the tuple from the build-side list file — this is the
extra page fetch that is the cost of avoiding deep-copying tuples
into the hash table.
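The first/next chain-walk protocol generalises to any chained table. A minimal sketch in the spirit of mht_get_hls / mht_get_next_hls (the entry layout and function name here are hypothetical):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* a chained bucket entry, loosely modelled on HENTRY_HLS */
typedef struct entry
{
  uint32_t hash_word;
  const char *tuple;
  struct entry *next;		/* chain within one bucket */
} entry;

/* first call:  anchor == NULL, start at the bucket head;
   later calls: resume just after the previously returned entry */
const entry *chain_get (const entry *bucket_head, const entry *anchor,
			uint32_t hash_word)
{
  const entry *e = (anchor == NULL) ? bucket_head : anchor->next;
  for (; e != NULL; e = e->next)
    if (e->hash_word == hash_word)
      return e;
  return NULL;			/* chain exhausted for this probe key */
}
```

The caller's loop shape is then exactly the probe loop's: pass NULL to anchor the walk, then feed each returned entry back in until NULL comes back.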
hjoin_outer_probe is the outer-join variant. Structurally it is the
inner-join probe with one extra branch: before going to the next
probe tuple, check whether at least one match was emitted; if not,
emit a null-padded record using the preserved side’s tuple and
outer->fill_record / inner->fill_record (one of which is NULL):
```c
// hjoin_outer_probe — src/query/query_hash_join.c (condensed)
any_record_added = false;
do
  {
    /* same chain walk as hjoin_probe, with during_join_pred check */
    ...
    any_record_added = true;
  }
while (true);

if (!any_record_added)
  {
    /* fill null for unmatched preserved row */
    hjoin_merge_tuple_to_list_id (thread_p, list_id, outer->fill_record,
				  inner->fill_record, manager->merge_info,
				  &overflow_record);
  }
```

during_join_pred is the same hook PostgreSQL uses to attach
inequality predicates from ON clauses to the hash-join probe loop;
it lets R LEFT JOIN S ON R.k = S.k AND R.x > S.x evaluate the second
predicate per matched-pair during the chain walk, not as a
post-filter, so a non-qualifying inequality does not count as a
“match” and the null-padding still fires.
flowchart TB
P0([Open probe scan])
P1[qfile_scan_list_next<br/>over probe list_id]
P2{NULL in<br/>join key?}
P3[fetch_key into HASH_SCAN_KEY]
P4[hash_key = qdata_hash_scan_key]
P5[hjoin_probe_key:<br/>first match in chain]
P6{found?}
P7[fetch build key,<br/>compare_key against probe key]
P8{collision<br/>same h diff k?}
P9{during_join_pred?}
P10[merge_tuple_to_list_id<br/>append output row]
P11[hjoin_probe_key:<br/>next match in chain]
P12[outer-only:<br/>any_record_added?]
P13[fill null-padded row]
P0 --> P1 --> P2
P2 -- yes inner --> P1
P2 -- yes outer --> P13 --> P1
P2 -- no --> P3 --> P4 --> P5 --> P6
P6 -- no --> P12
P6 -- yes --> P7 --> P8
P8 -- collision --> P11 --> P5
P8 -- match --> P9
P9 -- false --> P11
P9 -- true --> P10 --> P11
P12 -- inner / matched --> P1
P12 -- outer / no match --> P13 --> P1
Hash-scan primitive — three table layouts
HASH_LIST_SCAN is the primitive shared between hash join and hash
list scan (CUBRID’s term for hash-keyed IN-list and uncorrelated
subquery probing — see cubrid-post-processing.md). The struct is in
query_hash_scan.h:
```c
// hash_list_scan — src/query/query_hash_scan.h
struct hash_list_scan
{
  regu_variable_list_node *build_regu_list;	/* used by hash list scan only */
  regu_variable_list_node *probe_regu_list;	/* used by hash list scan only */
  HASH_SCAN_KEY *temp_key;	/* probe key buffer */
  HASH_SCAN_KEY *temp_new_key;	/* coerced/build key buffer */
  union
  {
    struct
    {
      MHT_HLS_TABLE *hash_table;	/* in-memory hash, chained */
      HENTRY_HLS_PTR curr_hash_entry;	/* iterator state */
    } memory;
    struct
    {
      FHSID *hash_table;	/* extendible hash file */
      OID curr_oid;		/* iterator state */
      bool is_dk_bucket;	/* duplicate-key bucket flag */
    } file;
  };
  HASH_METHOD hash_list_scan_type;	/* IN_MEM | HYBRID | HASH_FILE */
  unsigned int curr_hash_key;
  bool need_coerce_type;
};
```

The layout selection is made once per HASHJOIN_CONTEXT in
hjoin_scan_init, off the build side’s page count and tuple count
versus a single memory budget mem_limit:
```c
// hjoin_scan_init — src/query/query_hash_join.c (condensed)
mem_limit = prm_get_bigint_value (PRM_ID_MAX_HASH_LIST_SCAN_SIZE);

if ((UINT64) list_id->page_cnt * DB_PAGESIZE <= mem_limit)
  {
    hash_scan->hash_list_scan_type = HASH_METH_IN_MEM;
    hash_scan->memory.hash_table = mht_create_hls ("Hash Join", list_id->tuple_cnt, NULL, NULL);
  }
else if ((UINT64) list_id->tuple_cnt * (sizeof (HENTRY_HLS) + sizeof (QFILE_TUPLE_SIMPLE_POS))
	 <= mem_limit)
  {
    hash_scan->hash_list_scan_type = HASH_METH_HYBRID;
    hash_scan->memory.hash_table = mht_create_hls ("Hash Join", list_id->tuple_cnt, NULL, NULL);
  }
else
  {
    hash_scan->hash_list_scan_type = HASH_METH_HASH_FILE;
    hash_scan->file.hash_table = (FHSID *) db_private_alloc (thread_p, sizeof (FHSID));
    fhs_create (thread_p, hash_scan->file.hash_table, list_id->tuple_cnt);
  }
```

Three thresholds, three layouts:
| Layout | Condition | Where keys live | Where tuples live |
|---|---|---|---|
| HASH_METH_IN_MEM | page_cnt * DB_PAGESIZE ≤ mem_limit | mht_hls | inside the hash entry (deep-copied) |
| HASH_METH_HYBRID | tuples don’t fit but tuple_cnt * (HENTRY+POS) ≤ mem_limit | mht_hls | original list-file pages, addressed by (volid,pageid,offset) |
| HASH_METH_HASH_FILE | even the index doesn’t fit | FHSID directory + bucket pages | original list-file pages, addressed by TFTID |
The hybrid mode is CUBRID’s native middle-ground between classic
in-memory hash and pure GRACE: the hash index fits, but the hash
payload does not; on probe-side hits, the operator pays one extra
page fetch (qfile_jump_scan_tuple_position) per match to re-read the
tuple from the original list file. This is cheaper than the pure
hash-file path because the index probe is still in memory.
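The three thresholds reduce to a pure function of the build side's size. A sketch that takes the page size and index-entry size as parameters (stand-ins for DB_PAGESIZE and sizeof (HENTRY_HLS) + sizeof (QFILE_TUPLE_SIMPLE_POS)):

```c
#include <assert.h>
#include <stdint.h>

typedef enum { METH_IN_MEM, METH_HYBRID, METH_HASH_FILE } hash_method;

/* mirrors hjoin_scan_init's cascade: whole build side in memory, else
   just the hash index in memory, else everything on disk */
hash_method choose_layout (uint64_t page_cnt, uint64_t tuple_cnt,
			   uint64_t page_size, uint64_t index_entry_size,
			   uint64_t mem_limit)
{
  if (page_cnt * page_size <= mem_limit)
    return METH_IN_MEM;		/* tuples AND index fit */
  if (tuple_cnt * index_entry_size <= mem_limit)
    return METH_HYBRID;		/* only the index fits */
  return METH_HASH_FILE;	/* spill the index to an extendible hash */
}
```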
The HASH_METH_HASH_FILE path leans on the same extendible-hash
machinery the catalog uses (cubrid-extendible-hash.md): a directory
of bucket VPIDs, doubled when a bucket overflows its local depth, with
duplicate-key chains via “dk” buckets. Memory consumption is bounded
by the directory size plus one bucket-page-fix at a time; the rest is
on disk.
```mermaid
flowchart TB
    subgraph IN_MEM["HASH_METH_IN_MEM"]
        direction LR
        K1[hash_key] --> B1[bucket head]
        B1 --> E1["HENTRY_HLS<br/>{key=hash_key,<br/>data=HASH_SCAN_VALUE.tuple}"]
        E1 --> E2["HENTRY_HLS"]
    end
    subgraph HYBRID["HASH_METH_HYBRID"]
        direction LR
        K2[hash_key] --> B2[bucket head]
        B2 --> E3["HENTRY_HLS<br/>{key=hash_key,<br/>data=HASH_SCAN_VALUE.pos<br/>={vpid,offset}}"]
        E3 -. jump .-> LF[list-file page]
        LF --> T2["tuple"]
    end
    subgraph FILE["HASH_METH_HASH_FILE"]
        direction LR
        K3[hash_key] --> DIR["FHSID directory"]
        DIR --> BKT["bucket page"]
        BKT --> R3["{TFTID = (volid,pageid,offset)}"]
        R3 -. jump .-> LF2[list-file page]
        LF2 --> T3["tuple"]
    end
```
Hash-key normalisation, type coercion, and NULL handling
Two relations joined on R.k = S.k only have a meaningful hash join
if R.k and S.k agree on type, precision, scale, and collation;
otherwise the same logical value can hash to two different words and
the join silently misses rows. CUBRID’s defence is hjoin_init_domain_info,
called once per context:
```c
// hjoin_init_domain_info — src/query/query_hash_join.c (condensed)
for (domain_index = 0; domain_index < domain_cnt; domain_index++)
  {
    outer_type = TP_DOMAIN_TYPE (outer_domains[domain_index]);
    inner_type = TP_DOMAIN_TYPE (inner_domains[domain_index]);

    if (outer_type == inner_type)
      common_type = outer_type;
    else
      {
        need_coerce_domains = true;
        common_type = (tp_more_general_type (outer_type, inner_type) > 0) ? outer_type : inner_type;
      }

    if (outer_precision == inner_precision && outer_scale == inner_scale)
      {
        common_precision = inner_precision;
        common_scale = inner_scale;
      }
    else
      {
        need_coerce_domains = true;
        if (common_type == DB_TYPE_NUMERIC)
          {
            common_scale = MAX (outer_scale, inner_scale);
            common_precision = MAX (outer_integral, inner_integral) + common_scale;
            common_precision = MIN (common_precision, DB_MAX_NUMERIC_PRECISION);
          }
        else
          {
            common_precision = MAX (outer_precision, inner_precision);
            common_scale = 0;
          }
      }

    if (need_coerce_domains)
      coerce_domains[domain_index] = tp_domain_cache (...);
  }
```

The result is domain_info->coerce_domains[] — one common domain per
join column, used by hjoin_fetch_key to coerce values into a shared
representation before hashing. Without this, hashing INT 1 against
BIGINT 1 would produce two different hash words.
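The NUMERIC branch of the rule above can be checked in isolation. A sketch, assuming a maximum precision of 38 and taking "integral digits" to mean precision minus scale (both assumptions, not lifted from the source):

```c
#include <assert.h>

#define MAX_NUMERIC_PRECISION 38 /* assumed cap, standing in for DB_MAX_NUMERIC_PRECISION */

struct num_domain { int precision; int scale; };

static int imax (int a, int b) { return a > b ? a : b; }
static int imin (int a, int b) { return a < b ? a : b; }

/* Common NUMERIC domain: widest scale wins, keep enough integral
   digits for either side, cap total precision. */
struct num_domain
common_numeric_domain (struct num_domain outer, struct num_domain inner)
{
  struct num_domain common;
  int outer_integral = outer.precision - outer.scale;
  int inner_integral = inner.precision - inner.scale;

  common.scale = imax (outer.scale, inner.scale);
  common.precision = imax (outer_integral, inner_integral) + common.scale;
  common.precision = imin (common.precision, MAX_NUMERIC_PRECISION);
  return common;
}
```

For example, NUMERIC(10,2) joined with NUMERIC(8,4) yields a common NUMERIC(12,4): 8 integral digits from the first side plus the 4-digit scale of the second.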
hjoin_fetch_key is the per-tuple key materialiser. It walks the
tuple’s value slots, picks out the join columns by index
(value_indexes[key_index]), and for each:
```c
// hjoin_fetch_key — src/query/query_hash_join.c (condensed)
if (QFILE_GET_TUPLE_VALUE_FLAG (tuple_value) == V_UNBOUND)
  goto skip_next; /* NULL → tuple drops out of join */

or_init (&buf, tuple_value + QFILE_TUPLE_VALUE_HEADER_SIZE, value_size);

if (need_coerce_domains && coerce_domains[key_index] != NULL
    && coerce_domains[key_index] != domains[key_index])
  {
    domains[key_index]->type->data_readval (&buf, &pre_coerce_value, ...);
    tp_value_coerce (&pre_coerce_value, key->values[key_index], coerce_domains[key_index]);
  }
else
  {
    domains[key_index]->type->data_readval (&buf, key->values[key_index], ...);
  }

if (compare_key != NULL)
  {
    /* collision check on probe side */
    compare_result = tp_value_compare (key->values[key_index], compare_key->values[key_index], 0, 0);
    if (compare_result != DB_EQ)
      goto skip_next;
  }
```

Two roles in one function: the first call (with compare_key=NULL)
materialises the key for hashing; the second call (with
compare_key=key, where key is the probe key already materialised)
re-fetches the build tuple’s key columns and verifies them against the
probe key — the collision check the probe loop relies on.
qdata_hash_scan_key is the actual hash function:
```c
// qdata_hash_scan_key — src/query/query_hash_scan.c
unsigned int
qdata_hash_scan_key (const void *key, unsigned int ht_size, HASH_METHOD hash_method)
{
  HASH_SCAN_KEY *ckey = (HASH_SCAN_KEY *) key;
  unsigned int hash_val = 0, tmp_hash_val;

  for (i = 0; i < ckey->val_count; i++)
    {
      hash_val = ROTL32 (hash_val, 13);
      tmp_hash_val = mht_get_hash_number (ht_size, ckey->values[i]);
      hash_val ^= tmp_hash_val;
      if (hash_val == 0)
        hash_val = tmp_hash_val;
    }

  if (hash_method == HASH_METH_HASH_FILE)
    hash_val = fhs_hash (&hash_val); /* extra mixing for FHSID */

  return hash_val;
}
```

Multi-column keys are combined by the standard XOR-with-rotate scheme
(ROTL32(h,13) ^ h_i), with a guard against the all-zero state
collapsing the chain. For the disk path one extra round of mixing
(fhs_hash) is applied because FHSID’s directory uses the high
bits of the hash word as the bucket index, and an unmixed hash leaks
type-specific patterns into those bits.
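The combiner can be reproduced in isolation. In this sketch the per-column hashes are passed in directly instead of calling `mht_get_hash_number`, since only the rotate-XOR combining step is under discussion:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* 32-bit rotate-left, as in the ROTL32(h, 13) step. */
static uint32_t
rotl32 (uint32_t x, int r)
{
  return (x << r) | (x >> (32 - r));
}

/* Combine per-column hash words: rotate the running hash, XOR in the
   column hash, and guard against the running hash collapsing to zero. */
uint32_t
combine_hashes (const uint32_t *col_hashes, size_t n)
{
  uint32_t hash_val = 0;
  for (size_t i = 0; i < n; i++)
    {
      hash_val = rotl32 (hash_val, 13);
      hash_val ^= col_hashes[i];
      if (hash_val == 0)
        hash_val = col_hashes[i]; /* zero-state guard */
    }
  return hash_val;
}
```

Because each prior column's contribution is rotated before the next XOR, the scheme is order-sensitive — `(1, 2)` and `(2, 1)` produce different words, which a plain XOR fold would not.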
NULL handling is uniform: V_UNBOUND in any join column drops the
tuple from the join. For inner joins this is correct (NULL never
equates). For outer joins, the splitter additionally routes
NULL-keyed preserved-side tuples to a dedicated “fill null”
partition; see the next section.
Spill — grace-style equi-hash partitioning
When the build side’s footprint exceeds the layout’s threshold for
a single-partition run, hjoin_check_partition splits the work into
multiple partitions, GRACE-style. The decision is in
hjoin_check_partition:
```c
// hjoin_check_partition — src/query/query_hash_join.c (condensed)
mem_limit = prm_get_bigint_value (PRM_ID_MAX_HASH_LIST_SCAN_SIZE);
min_tuple_cnt = MIN (outer_list_id->tuple_cnt, inner_list_id->tuple_cnt);

part_cnt = CEIL_PTVDIV ((sizeof (HENTRY_HLS) + sizeof (QFILE_TUPLE_SIMPLE_POS)) * min_tuple_cnt,
                        mem_limit * PARTITION_FILL_FACTOR);

if (part_cnt > 1)
  {
    if (IS_OUTER_JOIN_TYPE (manager->join_type))
      part_cnt += 1; /* one extra partition for NULL-keyed preserved rows */
    manager->context_cnt = part_cnt;
    return HASHJOIN_STATUS_PARTITION;
  }
return HASHJOIN_STATUS_SINGLE;
```

The partition count is sized so that one partition’s hash index
(plus tuple-position pointers — the worst case being the HASH_METH_HYBRID
overhead) fits in 80% of mem_limit. The 80% margin —
PARTITION_FILL_FACTOR — is a fudge factor against bucket
imbalance: a perfectly hashed input would fit exactly at 100%, but
real hash distributions have a long tail.
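The sizing is a ceiling division over the discounted budget. A sketch, assuming `CEIL_PTVDIV` is a plain ceiling division, with the 0.8 fill factor kept integral and stand-in struct sizes:

```c
#include <assert.h>
#include <stdint.h>

#define ENTRY_SIZE 40ULL        /* stand-in for sizeof (HENTRY_HLS) */
#define POS_SIZE   16ULL        /* stand-in for sizeof (QFILE_TUPLE_SIMPLE_POS) */
#define FILL_NUM   8            /* 0.8 expressed as 8/10 to stay in integers */
#define FILL_DEN   10

/* Partition count so one partition's hybrid-layout hash index fits in
   80% of mem_limit; outer joins reserve one extra "fill null" partition. */
uint64_t
partition_count (uint64_t min_tuple_cnt, uint64_t mem_limit, int is_outer_join)
{
  uint64_t need = (ENTRY_SIZE + POS_SIZE) * min_tuple_cnt;
  uint64_t budget = mem_limit * FILL_NUM / FILL_DEN;
  uint64_t part_cnt = (need + budget - 1) / budget; /* ceiling division */

  if (part_cnt > 1 && is_outer_join)
    part_cnt += 1; /* dedicated partition for NULL-keyed preserved rows */
  return part_cnt;
}
```

With a 1 MiB budget, 100 build tuples stay single-partition, while 100,000 tuples split into 7 data partitions (8 for an outer join, counting the fill-null partition).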
The split itself is hjoin_split_qlist, run twice (once for outer,
once for inner). It is the GRACE phase-1 splitter:
```c
// hjoin_split_qlist — src/query/query_hash_join.c (condensed)
while ((scan_code = qfile_scan_list_next (thread_p, &list_scan_id, &tuple_record, PEEK)) == S_SUCCESS)
  {
    hjoin_fetch_key (thread_p, split_info->fetch_info, &tuple_record, temp_key, NULL, &need_skip_next);
    if (need_skip_next)
      {
        if (is_outer_join)
          part_id = part_cnt - 1; /* last partition := "fill null" bucket */
        else
          continue; /* inner: drop NULL-keyed tuples */
      }
    else
      {
        hash_key = qdata_hash_scan_key (temp_key, UINT_MAX, HASH_METH_IN_MEM);
        part_id = is_outer_join ? hash_key % (part_cnt - 1) : hash_key % (part_cnt);
        hjoin_update_tuple_hash_key (thread_p, &tuple_record, hash_key);
      }

    qfile_add_tuple_to_list (thread_p, part_list_id[part_id], tuple_record.tpl);
  }
```

Three details worth calling out:
- **Hash word is cached on the tuple.** `hjoin_update_tuple_hash_key` writes the 32-bit hash into a reserved slot at the head of the tuple, so the build/probe phase reads it back rather than re-hashing every tuple a second time. This is paid for at XASL-generation time by leaving an extra `MAX_ALIGNMENT`-sized value slot at tuple position 0 with an initial sentinel of `-1`; `hjoin_update_tuple_hash_key` asserts that sentinel is still there before overwriting.

- **Outer-join "fill null" partition.** For `JOIN_LEFT`/`JOIN_RIGHT`, one extra partition is allocated and reserved for tuples whose preserved-side join column is NULL. These tuples never match anything in the build side, so their join is decided by SQL’s outer-join semantics alone — null-pad and emit. `hjoin_execute` detects the last partition specially:

  ```c
  // hjoin_execute — src/query/query_hash_join.c (excerpt)
  if (IS_OUTER_JOIN_TYPE (manager->join_type)
      && context == &manager->contexts[manager->context_cnt - 1])
    {
      status = (status == HASHJOIN_STATUS_TRY) ? HASHJOIN_STATUS_FILL_NULL_VALUES : status;
    }
  ```

  The dedicated `hjoin_outer_fill_null_values` then walks the partition’s preserved-side list file and emits one null-padded tuple per row, without touching the build side at all.

- **Append-vs-temp dance.** Each partition’s output list-file is built up via a `temp_part_list_id[]` working set — when the in-memory buffer of one partition fills, the splitter flushes it to the partition’s "real" list-file (`part_list_id[]`) and starts a new buffer. This is the equivalent of PostgreSQL’s `BufFile`-backed batch tapes; the difference is that CUBRID’s `QFILE_LIST_ID` is the same abstraction the executor already uses for every materialised intermediate, so the partition tapes integrate transparently with the existing `qfile_scan_list_next` machinery.
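The routing rule can be isolated: NULL keys go to the reserved last partition for outer joins (and leave the join entirely for inner joins), and everything else is distributed by hash modulo the data-partition count. A minimal sketch:

```c
#include <assert.h>
#include <stdint.h>

#define DROP_TUPLE -1 /* inner join: NULL-keyed tuple leaves the join */

/* Partition routing in the style of hjoin_split_qlist.  For outer
   joins the last partition is reserved for NULL-keyed preserved-side
   rows, so hashed rows are spread over only part_cnt - 1 partitions. */
int
route_tuple (uint32_t hash_key, int key_is_null, int part_cnt, int is_outer_join)
{
  if (key_is_null)
    return is_outer_join ? part_cnt - 1 : DROP_TUPLE;

  return is_outer_join ? (int) (hash_key % (uint32_t) (part_cnt - 1))
                       : (int) (hash_key % (uint32_t) part_cnt);
}
```

Note the outer-join modulus is `part_cnt - 1`: the fill-null partition is unreachable by hash, so a NULL key can never collide with a hashed row's partition.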
After splitting, hjoin_execute_partitions walks contexts[]
sequentially:
```c
// hjoin_execute_partitions — src/query/query_hash_join.c (condensed)
for (context_index = 0; context_index < context_cnt; context_index++)
  {
    current_context = &manager->contexts[context_index];
    hjoin_execute (thread_p, manager, current_context);
    hjoin_merge_qlist (thread_p, manager, current_context);
  }
```

hjoin_merge_qlist handles the per-partition output stitching — by
default qlist_merge_method == HASHJOIN_MERGE_CONNECT, which
chains list files end-to-end via qfile_connect_list rather than
copying tuples; this is the cheap option and is correct because each
partition’s output is independent. The fallbacks
HASHJOIN_MERGE_APPEND (copy-append) and HASHJOIN_MERGE_COMBINE
(union with deduplication) exist for code paths that the operator
does not currently exercise.
```mermaid
stateDiagram-v2
    [*] --> Check
    Check --> SingleRun : part_cnt == 1
    Check --> Partition : part_cnt > 1
    Partition --> SplitOuter : hjoin_split_qlist(outer)
    SplitOuter --> SplitInner : hjoin_split_qlist(inner)
    SplitInner --> ParallelOk : SERVER_MODE && enough pages
    SplitInner --> Sequential : single-thread fallback
    ParallelOk --> ForkWorkers : px_worker_manager::reserve
    ForkWorkers --> Joined : barrier on parallel::execute_partitions
    Sequential --> NextCtx
    NextCtx --> RunCtx : hjoin_execute(ctx[i])
    RunCtx --> MergeCtx : hjoin_merge_qlist
    MergeCtx --> NextCtx : i++
    MergeCtx --> Done : i == part_cnt
    SingleRun --> RunSingle : hjoin_execute(single_context)
    RunSingle --> Done
    Joined --> Done
    Done --> [*]
```
Cost model integration — qo_hjoin_cost and qo_examine_hash_join
The optimizer side of hash join lives in two places. qo_hjoin_cost
(in query_planner.c) computes the four-slot cost vector
(fixed_cpu_cost, fixed_io_cost, variable_cpu_cost,
variable_io_cost) for a candidate hash-join plan node. The model is
deliberately simple and entirely in CPU terms:
```c
// qo_hjoin_cost — src/optimizer/query_planner.c (condensed)
inner_cardinality = inner_plan_p->info->cardinality;
outer_cardinality = outer_plan_p->info->cardinality;

plan_p->fixed_cpu_cost = outer_plan_p->fixed_cpu_cost + inner_plan_p->fixed_cpu_cost;
plan_p->fixed_io_cost = outer_plan_p->fixed_io_cost + inner_plan_p->fixed_io_cost;
plan_p->variable_cpu_cost = outer_plan_p->variable_cpu_cost + inner_plan_p->variable_cpu_cost;
plan_p->variable_io_cost = outer_plan_p->variable_io_cost + inner_plan_p->variable_io_cost;

/* if inner plays build */
inner_build_cpu_cost = inner_cardinality * QO_CPU_WEIGHT * HJ_BUILD_CPU_OVERHEAD_FACTOR;
inner_build_cpu_cost += outer_cardinality * QO_CPU_WEIGHT * HJ_PROBE_CPU_OVERHEAD_FACTOR;

/* if outer plays build */
outer_build_cpu_cost = inner_cardinality * QO_CPU_WEIGHT * HJ_PROBE_CPU_OVERHEAD_FACTOR;
outer_build_cpu_cost += outer_cardinality * QO_CPU_WEIGHT * HJ_BUILD_CPU_OVERHEAD_FACTOR;

switch (plan_p->plan_un.join.join_type)
  {
  case JOIN_LEFT: /* outer preserved → inner is build */
    plan_p->variable_cpu_cost += inner_build_cpu_cost;
    break;
  case JOIN_RIGHT: /* inner preserved → outer is build */
    plan_p->variable_cpu_cost += outer_build_cpu_cost;
    break;
  case JOIN_INNER: /* whichever is cheaper */
    plan_p->variable_cpu_cost += MIN (inner_build_cpu_cost + inner_build_io_cost,
                                      outer_build_cpu_cost + outer_build_io_cost);
    break;
  }
```

The two cost constants are defined at the top of query_planner.c:

```c
// query_planner.c — cost constants
#define HJ_BUILD_CPU_OVERHEAD_FACTOR 30
#define HJ_PROBE_CPU_OVERHEAD_FACTOR 20
```

Build is taken to be ~50% more expensive per tuple than probe — a
reasonable approximation given that build allocates and inserts a
hash entry while probe only walks a chain. Crucially, the IO term is
disabled at present (the mem_limit IO surcharge is wrapped in a
#if 0), with the comment “No need to increase weight since
partitioned hash join is used even when mem_limit is exceeded.” This
is the optimizer’s own admission that it under-models the spill cost
— the cost difference between an in-memory single-partition run and a
spilled multi-partition run is currently zero in the planner’s eyes.
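With the constants above, the variable-CPU surcharge reduces to simple arithmetic. A sketch that picks the cheaper build side for inner joins, with the IO terms omitted (matching their current `#if 0` state); the `join_type` encoding here is an illustrative stand-in, not the engine's enum:

```c
#include <assert.h>
#include <stdint.h>

#define BUILD_FACTOR 30 /* HJ_BUILD_CPU_OVERHEAD_FACTOR */
#define PROBE_FACTOR 20 /* HJ_PROBE_CPU_OVERHEAD_FACTOR */

/* Variable-CPU surcharge in units of QO_CPU_WEIGHT.
   join_type: 0 = inner, 1 = left (inner builds), 2 = right (outer builds). */
uint64_t
hjoin_cpu_surcharge (uint64_t outer_card, uint64_t inner_card, int join_type)
{
  uint64_t inner_builds = inner_card * BUILD_FACTOR + outer_card * PROBE_FACTOR;
  uint64_t outer_builds = outer_card * BUILD_FACTOR + inner_card * PROBE_FACTOR;

  switch (join_type)
    {
    case 1:  return inner_builds;                 /* JOIN_LEFT  */
    case 2:  return outer_builds;                 /* JOIN_RIGHT */
    default: return inner_builds < outer_builds   /* JOIN_INNER: cheaper side */
                    ? inner_builds : outer_builds;
    }
}
```

For a 1000-row outer and 100-row inner, building on the inner costs 100·30 + 1000·20 = 23,000 units against 32,000 for building on the outer, so the inner join picks the inner as build — the smaller side wins, as expected.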
qo_examine_hash_join is the admission gate. It is called from
planner_visit_node’s STEP 5-5 alongside qo_examine_nl_join,
qo_examine_idx_join, and qo_examine_merge_join:
```c
// planner_visit_node — src/optimizer/query_planner.c (excerpt)
/* STEP 5-5: examine hash-join */
if (!bitset_is_empty (&sm_join_terms))
  {
    kept += qo_examine_hash_join (new_info, join_type, head_info, tail_info,
                                  &sm_join_terms, &duj_terms, &afj_terms,
                                  &sarged_terms, &pinned_subqueries);
  }
```

The interesting part is qo_examine_hash_join’s gating logic — which
is how the operator stays out of the optimizer’s hot path until
explicitly requested:
```c
// qo_examine_hash_join — src/optimizer/query_planner.c (condensed)
mem_limit = prm_get_bigint_value (PRM_ID_MAX_HASH_LIST_SCAN_SIZE);
if (mem_limit <= 0)
  goto exit; /* hash list scan disabled at all */

if (bitset_intersects (sarged_terms, &info->env->fake_terms))
  goto exit; /* fake terms incompatible */

for (each term in hash_join_terms)
  if (QO_IS_PATH_TERM (term) && QO_TERM_JOIN_TYPE (term) != JOIN_INNER)
    goto exit; /* path terms forbid hash-join */

if (QO_NODE_HINT (inner_node) & PT_HINT_NO_USE_HASH)
  goto exit;
else if (QO_NODE_HINT (inner_node) & PT_HINT_USE_HASH)
  /* fall through */;
else if (QO_NODE_HINT (inner_node) & (PT_HINT_USE_NL | PT_HINT_USE_IDX | PT_HINT_USE_MERGE))
  goto exit;
else
  {
#if TEST_HASH_JOIN_ENABLE
    /* fall through */
#else
    goto exit; /* default: hash-join is not considered */
#endif
  }
```

Decoded, the rules are:
| Hint state | Result |
|---|---|
| `NO_USE_HASH` | skip |
| `USE_HASH` | consider |
| `USE_NL` / `USE_IDX` / `USE_MERGE` | skip (other join method forced) |
| (no hint) | skip (compile-time `TEST_HASH_JOIN_ENABLE` flips this) |
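The decision table collapses to a few bit tests, checked in priority order. A sketch with hypothetical bit values standing in for the `PT_HINT_*` enum:

```c
#include <assert.h>

/* Hypothetical stand-ins for the PT_HINT_* bits. */
#define HINT_USE_NL      0x01u
#define HINT_USE_IDX     0x02u
#define HINT_USE_MERGE   0x04u
#define HINT_USE_HASH    0x08u
#define HINT_NO_USE_HASH 0x10u

/* 1 = hash join may be considered, 0 = skipped.  `test_enable` models
   the compile-time TEST_HASH_JOIN_ENABLE flag for the no-hint default. */
int
hash_join_admissible (unsigned hints, int test_enable)
{
  if (hints & HINT_NO_USE_HASH)
    return 0;            /* explicitly forbidden — wins over everything */
  if (hints & HINT_USE_HASH)
    return 1;            /* explicitly requested */
  if (hints & (HINT_USE_NL | HINT_USE_IDX | HINT_USE_MERGE))
    return 0;            /* another join method forced */
  return test_enable;    /* default path */
}
```

The ordering matters: `NO_USE_HASH` is checked first, so combining it with `USE_HASH` still skips the operator.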
In the production build, TEST_HASH_JOIN_ENABLE is not defined, so
hash join is never considered without an explicit /*+ USE_HASH */
hint on the inner relation. This is a deliberately conservative
posture: hash join is on by feature, but off by cost-policy. The cost
model exists, but is gated. Workloads that benefit from hash join
(typical analytical equi-joins of large fact and dimension tables)
must be marked up by the developer. The path-term and key-limit
checks further exclude semi-correctness-sensitive configurations
where the join’s row-multiplicity contract matters.
If admission passes, qo_join_new constructs a QO_PLAN of method
QO_JOINMETHOD_HASH_JOIN, which qo_check_plan_on_info then ranks
against any other plans for the same DP slot.
Plan-tree integration — make_hashjoin_proc and dispatch
When qo_to_xasl walks the surviving plan tree, hash-join nodes are
lowered by make_hashjoin_proc (in plan_generation.c):
```c
// make_hashjoin_proc — src/optimizer/plan_generation.c
static XASL_NODE *
make_hashjoin_proc (QO_ENV *env, QO_PLAN *plan, XASL_NODE *outer_xasl, XASL_NODE *inner_xasl,
                    QO_PROJECTION_INFO *projection_info)
{
  ...
  HASHJOIN_PROC_NODE *proc;
  ...
  xasl = pt_to_hashjoin_proc (parser, outer_xasl, inner_xasl);
  ...
}
```

pt_to_hashjoin_proc (in the parser-side xasl_generation.c) sets
xasl->type = HASHJOIN_PROC, populates proc->outer.xasl /
proc->inner.xasl, and fills merge_info from the plan’s join
key-column metadata (ls_outer_column[], ls_inner_column[],
ls_pos_list[], etc.). The two child XASLs are also linked into the
parent’s aptr_list so that the executor’s pre-pass evaluates them
before the case HASHJOIN_PROC: arm of the dispatcher fires.
There is one extra check, check_hashjoin_xasl, that runs at
make_hashjoin_proc time:
```c
// check_hashjoin_xasl — src/optimizer/plan_generation.c (condensed)
for (hashjoin_xasl = xasl->aptr_list; hashjoin_xasl != NULL; hashjoin_xasl = hashjoin_xasl->next)
  if (hashjoin_xasl->type == HASHJOIN_PROC)
    break;

if (hashjoin_xasl == NULL
    || hashjoin_xasl->aptr_list == NULL /* outer */
    || hashjoin_xasl->aptr_list->next == NULL /* inner */)
  goto error_exit;

/* both sides present and merge_info column counts > 0 */
```

The check guarantees that the lowered XASL has a well-formed
HASHJOIN_PROC node with both children attached and at least one
join-key column — a hash join with zero key columns would be a
Cartesian product, which the operator does not implement.
The xasl_to_stream.c packing path (xts_process_hashjoin_proc) and
its corresponding sizing function (xts_sizeof_hashjoin_proc) ensure
the node round-trips through the wire format that the broker uses to
ship XASL trees to the server.
Variants — inner, left-outer, right-outer, parallel
CUBRID’s hash join supports three SQL join shapes plus a parallel flavour:
- `JOIN_INNER` — the canonical case. Build-side is whichever is smaller; collisions are filtered by the post-hash key compare; output is ungrouped.
- `JOIN_LEFT` (left outer) — outer (LHS) is preserved, inner (RHS) is null-supplying. Inner is forced to build (so that any outer row failing to match can fire null-padding from the probe-side fallback). The fallback is in `hjoin_outer_probe`’s `if (!any_record_added)` branch.
- `JOIN_RIGHT` (right outer) — symmetric. Outer is forced to build.
- Parallel — when the partition count is > 1 and the build side is large enough (`min_page_cnt` clears the `compute_parallel_degree` threshold) and the worker pool grants ≥2 workers, `hjoin_try_parallel` upgrades the run to `HASHJOIN_STATUS_PARALLEL` and dispatches to `parallel_query::hash_join::execute_partitions`. Each worker takes an independent partition pair and runs `hjoin_execute` over it; the results are concatenated under a per-manager mutex. The parallel path piggybacks on the same `HASHJOIN_CONTEXT[]` array as the sequential path, with extra synchronisation in `HASHJOIN_SHARED_SPLIT_INFO` and `HASHJOIN_SHARED_JOIN_INFO`.
Three notable absences: there is no full-outer join path
(JOIN_OUTER / JOIN_FULL is rejected by hjoin_init_context’s
assert_release_error (false) default arm), no semi/anti-join
specialisation (CUBRID rewrites EXISTS / IN subqueries elsewhere
and the rewritten form does not reach HASHJOIN_PROC), and no
recursive (re-)partitioning (after one round of grace partitioning,
hjoin_scan_init for an oversized partition will simply fall through
to HASH_METH_HASH_FILE and accept the on-disk overhead).
The “fill null values” arm — hjoin_outer_fill_null_values — is
notable for being a complete bypass of the build/probe machinery:
when the null-supplying side is empty, no hash table is created at
all; the operator just walks the preserved side and emits one
null-padded record per row.
```mermaid
flowchart LR
    TYPE{join_type}
    INN[JOIN_INNER<br/>build = smaller side]
    LEF[JOIN_LEFT<br/>build = inner RHS]
    RIG[JOIN_RIGHT<br/>build = outer LHS]
    TYPE --> INN
    TYPE --> LEF
    TYPE --> RIG
    INN --> PIN[hjoin_probe<br/>no null pad]
    LEF --> POL[hjoin_outer_probe<br/>null-pad outer<br/>on probe miss]
    RIG --> POR[hjoin_outer_probe<br/>null-pad inner<br/>on probe miss]
    PIN --> OUT[xasl->list_id]
    POL --> OUT
    POR --> OUT
    EMP{empty side?}
    TYPE -. inner empty .-> EMP
    EMP -- preserved empty --> NOP[empty output]
    EMP -- supplying empty --> FILL[hjoin_outer_fill_null_values<br/>walk preserved-only,<br/>null-pad each row]
    FILL --> OUT
```
Source Walkthrough
Symbols grouped by responsibility. Refer to the position-hints table at the end for (file, line) pairs as observed when this doc was last updated.
Driver and lifecycle
- `qexec_hash_join` — public entry, called from `qexec_execute_mainblock_internal`’s `case HASHJOIN_PROC`.
- `hjoin_init_manager`, `hjoin_clear_manager` — `HASHJOIN_MANAGER` lifecycle.
- `hjoin_init_context`, `hjoin_clear_context`, `hjoin_destroy_qlist` — `HASHJOIN_CONTEXT` lifecycle (build-side selection lives here).
- `hjoin_check_empty_inputs` — `HASHJOIN_STATUS_END` / `FILL_NULL_VALUES` / `TRY` decision.
- `hjoin_execute` — per-context build+probe driver.
- `hjoin_execute_internal` — open list scans, run `hjoin_build`, open probe scan, run `hjoin_probe` or `hjoin_outer_probe`.
- `hjoin_outer_fill_null_values` — fast path for outer joins with empty null-supplying side.
- `hjoin_execute_partitions` — sequential walk over `contexts[]` for partitioned execution.
Build / probe primitives
- `hjoin_fetch_key` — read join-column values out of one tuple, optionally compare against an existing key.
- `hjoin_update_tuple_hash_key` — write the 32-bit hash word into the tuple’s reserved value slot during partitioning.
- `hjoin_build`, `hjoin_build_key` — the build loop and per-tuple insert (dispatches to `mht_put_hls` or `fhs_insert`).
- `hjoin_probe`, `hjoin_outer_probe`, `hjoin_probe_key` — probe loops and chain-walking primitive (dispatches to `mht_get_hls` / `mht_get_next_hls` / `fhs_search` / `fhs_search_next`).
- `hjoin_merge_tuple_to_list_id`, `hjoin_merge_tuple` — write the joined row to the output list-file, with overflow fallback for rows > one page.
Hash-scan primitive (shared with hash list scan)
- `qdata_alloc_hscan_key`, `qdata_free_hscan_key` — `HASH_SCAN_KEY` allocator/deallocator.
- `qdata_alloc_hscan_value`, `qdata_alloc_hscan_value_OID`, `qdata_free_hscan_value` — `HASH_SCAN_VALUE` allocators (in-mem-tuple vs hybrid-position).
- `qdata_build_hscan_key`, `qdata_copy_hscan_key`, `qdata_copy_hscan_key_without_alloc` — key materialisers from a reguvar list.
- `qdata_hash_scan_key` — the hash function (multi-column rotate-XOR; `fhs_hash` round for HASH_FILE).
- `qdata_hscan_key_compare`, `qdata_hscan_key_eq` — collision resolvers.
- `qdata_print_hash_scan_entry` — debug dumper.
- `hjoin_scan_init`, `hjoin_scan_clear` — `HASH_LIST_SCAN` lifecycle plus method selection (`HASH_METH_IN_MEM` / `HYBRID` / `HASH_FILE`).
Domain handling
- `hjoin_init_domain_info` — compute common domain per join column, set `coerce_domains[]`, set `need_coerce_domains` flag.
Partitioning (spill / grace)
- `hjoin_check_partition` — decide whether to partition based on `mem_limit` and `min_tuple_cnt`.
- `hjoin_try_partition` — top-level partitioning driver, including fallback to `SINGLE` on error and dispatch to parallel.
- `hjoin_prepare_partition` — allocate `contexts[]` and per-partition list-files.
- `hjoin_split_qlist` — phase-1 splitter: hash, route to partition, cache hash word on tuple.
- `hjoin_init_split_info`, `hjoin_clear_split_info` — `HASHJOIN_SPLIT_INFO` lifecycle.
- `hjoin_init_shared_split_info`, `hjoin_clear_shared_split_info` — parallel synchronisation primitives.
- `hjoin_merge_qlist` — concatenate per-partition output list-files via `qfile_connect_list` (or fall back to copy-append / union-with-distinct for alternate merge methods).
Parallel
- `hjoin_try_parallel` — admission gate (`compute_parallel_degree` + `worker_manager::try_reserve_workers`).
- `parallel_query::hash_join::build_partitions`, `parallel_query::hash_join::execute_partitions` — worker-managed splitter and per-partition runner (in `src/query/parallel/px_hash_join/px_hash_join.cpp`).
Optimizer side
- `qo_hjoin_cost` — four-slot cost vector for a hash-join plan node.
- `qo_examine_hash_join` — admission gate (hint check, path-term exclusion, key-limit exclusion).
- `make_hashjoin_proc`, `pt_to_hashjoin_proc` — plan-to-XASL lowering.
- `check_hashjoin_xasl` — well-formedness check on the lowered XASL.
- `xts_process_hashjoin_proc`, `xts_sizeof_hashjoin_proc` — XASL-stream packing.
- `qdump_print_hashjoin_proc_node` — query plan dumper.
Tracing / profiling
- `hjoin_trace_start`, `hjoin_trace_end`, `hjoin_trace_merge_stats` — per-step elapsed-time / fetch / ioread accumulation.
- `hjoin_profile_start`, `hjoin_profile_end` (compile-time gated by `HASHJOIN_PROFILE_TIME`) — finer-grained per-step profiling.
- `hjoin_trace_get_worker_stats`, `hjoin_trace_drain_worker_stats` — parallel-stats merging.
Position hints (as of 2026-05-01)
| Symbol | File | Line |
|---|---|---|
| `qexec_hash_join` | src/query/query_hash_join.c | 182 |
| `hjoin_execute_partitions` | src/query/query_hash_join.c | 327 |
| `hjoin_execute` | src/query/query_hash_join.c | 408 |
| `hjoin_outer_fill_null_values` | src/query/query_hash_join.c | 463 |
| `hjoin_execute_internal` | src/query/query_hash_join.c | 609 |
| `hjoin_init_manager` | src/query/query_hash_join.c | 723 |
| `hjoin_clear_manager` | src/query/query_hash_join.c | 882 |
| `hjoin_init_domain_info` | src/query/query_hash_join.c | 970 |
| `hjoin_try_partition` | src/query/query_hash_join.c | 1166 |
| `hjoin_check_partition` | src/query/query_hash_join.c | 1320 |
| `hjoin_prepare_partition` | src/query/query_hash_join.c | 1377 |
| `hjoin_build_partitions` | src/query/query_hash_join.c | 1534 |
| `hjoin_split_qlist` | src/query/query_hash_join.c | 1629 |
| `hjoin_merge_qlist` | src/query/query_hash_join.c | 1828 |
| `hjoin_try_parallel` | src/query/query_hash_join.c | 1952 |
| `hjoin_init_split_info` | src/query/query_hash_join.c | 2070 |
| `hjoin_clear_split_info` | src/query/query_hash_join.c | 2127 |
| `hjoin_init_shared_split_info` | src/query/query_hash_join.c | 2198 |
| `hjoin_init_context` | src/query/query_hash_join.c | 2298 |
| `hjoin_clear_context` | src/query/query_hash_join.c | 2394 |
| `hjoin_scan_init` | src/query/query_hash_join.c | 2465 |
| `hjoin_scan_clear` | src/query/query_hash_join.c | 2581 |
| `hjoin_check_empty_inputs` | src/query/query_hash_join.c | 2637 |
| `hjoin_fetch_key` | src/query/query_hash_join.c | 2701 |
| `hjoin_update_tuple_hash_key` | src/query/query_hash_join.c | 2849 |
| `hjoin_build` | src/query/query_hash_join.c | 2874 |
| `hjoin_build_key` | src/query/query_hash_join.c | 3046 |
| `hjoin_probe` | src/query/query_hash_join.c | 3129 |
| `hjoin_outer_probe` | src/query/query_hash_join.c | 3364 |
| `hjoin_probe_key` | src/query/query_hash_join.c | 3700 |
| `hjoin_merge_tuple_to_list_id` | src/query/query_hash_join.c | 3838 |
| `hjoin_merge_tuple` | src/query/query_hash_join.c | 3904 |
| `hjoin_trace_merge_stats` | src/query/query_hash_join.c | 4163 |
| `qdata_alloc_hscan_key` | src/query/query_hash_scan.c | 206 |
| `qdata_hash_scan_key` | src/query/query_hash_scan.c | 287 |
| `qdata_hscan_key_eq` | src/query/query_hash_scan.c | 363 |
| `qdata_build_hscan_key` | src/query/query_hash_scan.c | 379 |
| `qdata_alloc_hscan_value` | src/query/query_hash_scan.c | 634 |
| `qdata_alloc_hscan_value_OID` | src/query/query_hash_scan.c | 669 |
| `scan_build_hash_list_scan` | src/query/scan_manager.c | 8574 |
| `scan_hash_probe_next` | src/query/scan_manager.c | 8789 |
| `qo_hjoin_cost` | src/optimizer/query_planner.c | 3604 |
| `HJ_BUILD_CPU_OVERHEAD_FACTOR` | src/optimizer/query_planner.c | 82 |
| `HJ_PROBE_CPU_OVERHEAD_FACTOR` | src/optimizer/query_planner.c | 83 |
| `qo_examine_hash_join` | src/optimizer/query_planner.c | 6617 |
| Hash-join admission in `planner_visit_node` | src/optimizer/query_planner.c | 8021 |
| `make_hashjoin_proc` | src/optimizer/plan_generation.c | 553 |
| `check_hashjoin_xasl` | src/optimizer/plan_generation.c | 1856 |
| `pt_to_hashjoin_proc` | src/parser/xasl_generation.c | 14889 |
| `case HASHJOIN_PROC` in executor | src/query/query_executor.c | 14748 |
| `xts_process_hashjoin_proc` | src/query/xasl_to_stream.c | 3609 |
| `hashjoin_proc_node` struct | src/query/xasl.h | 376 |
| `HASH_LIST_SCAN` struct | src/query/query_hash_scan.h | 110 |
| `HASHJOIN_MANAGER` struct | src/query/query_hash_join.h | 319 |
| `HASHJOIN_CONTEXT` struct | src/query/query_hash_join.h | 297 |
| `PRM_ID_MAX_HASH_LIST_SCAN_SIZE` | src/base/system_parameter.c | 3517 |
Cross-check Notes
- **Optimizer ↔ executor symmetry on memory budget.** Both `qo_examine_hash_join` (admission) and `hjoin_check_partition` (spill trigger) read the same `PRM_ID_MAX_HASH_LIST_SCAN_SIZE`. The parameter therefore controls both whether hash join is considered at all (when set to zero, the optimizer disables it) and the partition count when it is. This duplication is correct but fragile: a future refactor that changes one site has to chase the other. The optimizer doc (cubrid-query-optimizer.md) references the four-slot cost model but does not call out the `qo_examine_hash_join`-level zero-check.
- **Cost model under-models spill.** As noted in `qo_hjoin_cost`, the IO surcharge for spilled partitions is `#if 0`-ed out with the comment that partitioned hash join "is used even when mem_limit is exceeded". This means the optimizer treats an in-memory hash join and a spilled hash join as having identical IO cost. In practice, the spill path does pay real disk I/O on the per-partition list-file and the FHSID bucket pages, so the planner under-estimates spilled hash-join cost. For workloads where this matters the user is expected to use the `USE_HASH` hint as a forcing function rather than a tie-breaker, which fits the conservative admission posture.
- **`USE_HASH` hint is the practical switch.** `TEST_HASH_JOIN_ENABLE` is not set in the production build, so without `/*+ USE_HASH(inner_alias) */` on the query, hash join is never generated even when the cost model would prefer it. This is consistent with the executor doc's silence on hash join (cubrid-query-executor.md mentions `qexec_hash_join` only by name in the dispatch table) and with the post-processing doc's (cubrid-post-processing.md) treatment of the shared `mht_hls` machinery.
- **Hash-scan primitive is shared with hash list scan and hash group-by.** `MHT_HLS_TABLE`, `HASH_SCAN_KEY`, `HASH_SCAN_VALUE`, `qdata_hash_scan_key`, `qdata_build_hscan_key`, `qdata_alloc_hscan_value`, `fhs_*` are all shared between `query_hash_join.c` (this doc), `scan_manager.c`'s `scan_build_hash_list_scan` / `scan_next_hash_list_scan` / `scan_hash_probe_next` (hash list scan, used by uncorrelated subqueries and IN-list probes — see cubrid-query-executor.md's scan-manager section), and `query_executor.c`'s `qexec_hash_gby_*` family (hash group-by — see cubrid-post-processing.md's sort-vs-hash GROUP BY section). Any evolution of the hash key encoding has to land in all three.
- **Splitter assumes a reserved tuple slot.** `hjoin_split_qlist` and `hjoin_update_tuple_hash_key` write the 32-bit hash word into the first tuple value slot, asserting the slot is exactly `MAX_ALIGNMENT` bytes wide and was initialised to a `-1` sentinel. This is a contract with the XASL generator (`pt_to_hashjoin_proc` / `make_hashjoin_proc`); any future change to the tuple shape that reorders or shrinks slot 0 will silently break hash-join partitioning. The contract is not asserted at XASL build time, only at split time.
- **Build-side selection is per-partition, not per-query.** For inner join, `hjoin_init_context` recomputes `build = smaller side` for each partition independently. After splitting, one partition can have a build side that is a different side from another partition's, depending on how the split distributed rows. This is correct per-partition (each partition is a self-contained classic hash join), but it means a single query's trace output can show mixed build sides for a partitioned hash join. The deck for the QP module does not call this out.
- **`hjoin_outer_probe` collision-skip on partitioned probe is an asserted-impossible case.** When the probe side is partitioned, the probe key is fetched lazily inside the chain-walk loop; if `need_skip_next` fires there, it would mean a NULL appeared in a pre-partitioned probe tuple, which the splitter is supposed to have already routed to the fill-null partition. The code defensively emits a null-padded row and continues, but the `assert_release_error (false)` shows the operator does not expect this branch to fire in well-formed XASL. Useful diagnostic if someone changes the splitter behaviour.
- **No collation in the hash function.** `qdata_hash_scan_key` calls `mht_get_hash_number`, which goes through the type-specific hashers; for string types the hash includes the collation only insofar as the underlying primitive encodes it. The post-hash key compare (`tp_value_compare` in `hjoin_fetch_key`) does respect collation via `pr_clone_value` / `tp_value_coerce`. The system therefore relies on the post-hash compare to catch any collation-driven pseudo-collisions; it is not the case that two collation-equal strings necessarily hash to the same word.
- **`HASHJOIN_MERGE_CONNECT` is the only exercised merge path.** `hjoin_init_manager` hard-codes `qlist_merge_method = HASHJOIN_MERGE_CONNECT`. The `MERGE_COMBINE` and `MERGE_APPEND` arms of `hjoin_merge_qlist` are reachable only by editing this line, so they are effectively dead code as of this revision. They are presumably retained for future work (e.g., distinct-result hash joins) where the partition outputs would need union-with-deduplication semantics.
Open Questions
- **Bloom-filter pre-probe.** Modern hash joins (PostgreSQL parallel hash, MonetDB, DuckDB) push a Bloom filter from the build side down to the probe-side scan, so non-matching probe rows can be discarded before they cross the operator boundary. CUBRID's hash join does not implement this; whether it would pay for itself given the conservative admission policy (only `USE_HASH`-hinted joins reach the operator) is an open question.
- **Adaptive role reversal.** When the build side turns out larger than the cost-model estimate (because cardinality estimation was off), some engines (SQL Server's Hash Match, recent PostgreSQL versions in some shapes) abort the build and restart with the other side as build. CUBRID's `hjoin_init_context` decides the build side once per partition based on `tuple_cnt`/`page_cnt` already captured in the materialised list-file headers, which are exact — not estimates — so role reversal is not necessary for the correctness it usually corrects. However, when the estimate that fed into `qo_examine_hash_join`'s admission was wrong, a hash join may be chosen for a query whose actual sizes would favour nested-loop; CUBRID has no fallback to switch operators at execution time.
- **Recursive partitioning under skew.** When a single partition's build side is still too large after one round of grace partitioning (heavy skew on the join key), CUBRID falls through to `HASH_METH_HASH_FILE` and accepts the on-disk extendible-hash overhead. SQL Server's "recursive hash" instead re-partitions with a different hash function. For workloads with severe skew on join keys (hot-key dimensions), CUBRID's path is correct but slower than the alternative.
- **Parallel build coordination.** The parallel path partitions first, then assigns one partition pair per worker; within a partition the build is single-threaded. PostgreSQL since v11 has a shared-memory parallel build, in which workers cooperatively insert into one shared hash table. CUBRID's parallel hash join is partition-level only; whether shared-build parallelism is on the roadmap is not documented in the source.
- **Semi-/anti-join specialisation.** `EXISTS`/`NOT EXISTS`/uncorrelated `IN` are currently rewritten to standard joins or routed through hash list scan; there is no `HASHJOIN_PROC` semi-join variant that can early-out after the first build match. The cost saving for selective semi-joins could be significant.
- **Cost-model IO term.** The `#if 0`-ed IO surcharge in `qo_hjoin_cost` is a known underestimate of spilled hash-join cost. The justification ("partitioned hash join is used even when mem_limit is exceeded") explains why the model gives up, but not what it should be replaced with. A page-based estimator that multiplies `min_page_cnt` by `max(1, (build_size / mem_limit))` would be a small change with a large impact on join-method selection in workloads dominated by spilled hash joins.
- **No full-outer hash join.** `hjoin_init_context`'s default arm is `assert_release_error (false)`, which rejects `JOIN_OUTER`. PostgreSQL implements full-outer hash join with a "matched" bit per build entry plus a final pass that emits unmatched build rows; CUBRID does not. Full-outer joins are presumably routed to sort-merge.
Sources
- `src/query/query_hash_join.c`, `src/query/query_hash_join.h` — driver, manager/context, build/probe phases, partitioning, parallel glue.
- `src/query/query_hash_scan.c`, `src/query/query_hash_scan.h` — shared hash-scan primitive (key/value allocation, hash function, FHSID extendible-hash file).
- `src/query/scan_manager.c` — `scan_build_hash_list_scan`, `scan_next_hash_list_scan`, `scan_hash_probe_next` (hash list scan, the sibling consumer of `HASH_LIST_SCAN`).
- `src/query/query_executor.c` — `qexec_execute_mainblock_internal`'s `case HASHJOIN_PROC` dispatch (line ~14748).
- `src/query/xasl.h` — `HASHJOIN_PROC_NODE`, `HASHJOIN_INPUT`, `QFILE_LIST_MERGE_INFO` definitions.
- `src/query/xasl_to_stream.c` — `xts_process_hashjoin_proc`, `xts_sizeof_hashjoin_proc` packing.
- `src/query/query_dump.c` — `qdump_print_hashjoin_proc_node` plan-tree printer.
- `src/optimizer/query_planner.c` — `qo_hjoin_cost`, `qo_examine_hash_join`, hash-join CPU constants (`HJ_BUILD_CPU_OVERHEAD_FACTOR`, `HJ_PROBE_CPU_OVERHEAD_FACTOR`), call site in `planner_visit_node`'s STEP 5-5.
- `src/optimizer/plan_generation.c` — `make_hashjoin_proc`, `check_hashjoin_xasl`.
- `src/parser/xasl_generation.c` — `pt_to_hashjoin_proc`.
- `src/query/parallel/px_hash_join/` — parallel-execution back-end.
- `src/base/system_parameter.c` — `PRM_ID_MAX_HASH_LIST_SCAN_SIZE` parameter declaration.
- DeWitt, Katz, Olken, Shapiro, Stonebraker, Wood, Implementation Techniques for Main Memory Database Systems, SIGMOD 1984.
- Shapiro, Join Processing in Database Systems with Large Main Memories, ACM TODS 1986.
- Kitsuregawa, Tanaka, Moto-oka, Application of Hash to Data Base Machine and Its Architecture, New Generation Computing 1983 (GRACE).
- Graefe, Query Evaluation Techniques for Large Databases, ACM Computing Surveys 1993, §3 (hash join survey).
- Graefe, Volcano — An Extensible and Parallel Query Evaluation System, IEEE TKDE 1994 (operator integration).
- Petrov, Database Internals, ch. 14 “Query Execution” (sort vs hash join).
- Garcia-Molina, Ullman, Widom, Database Systems: The Complete Book, ch. 15.4–15.5 (hash join cost analysis).
- Silberschatz, Korth, Sudarshan, Database System Concepts, ch. on join operators (hash join + sort-merge comparison).