
CUBRID Hash Join — Build/Probe Pattern, Hash-Scan Primitives, and Spill Behaviour


A hash join materialises one side of an equi-join into an associative table keyed on the join columns, then walks the other side and looks each tuple up in that table. The strategy is the direct counterpoint to the two older join families: nested-loop (quadratic without an index) and sort-merge (linear after two sorts). Hash join trades memory for time: when the smaller side fits in memory it runs in O(|R| + |S|) with one I/O pass over each side, the best any single-pass equi-join can achieve.

The canonical references are DeWitt, Katz, Olken, Shapiro, Stonebraker and Wood, Implementation Techniques for Main Memory Database Systems (SIGMOD 1984), and Shapiro, Join Processing in Database Systems with Large Main Memories (TODS 1986), which together formalised the classic hash join (read the build side, hash it, then probe with the second side). The GRACE hash join (Kitsuregawa, Tanaka, Moto-oka, 1983) instead partitions both sides up-front so that each partition join fits in memory. Graefe, Query Evaluation Techniques for Large Databases (CSUR 1993), §3 enumerates four design points worth naming because every modern engine, including CUBRID, sits at some combination of them:

  1. Classic hash join. One pass: build R into a hash table, then scan S and probe. Only correct when R fits in memory; otherwise the table thrashes and the algorithm degenerates to a multi-pass sequential scan over S per memory-resident chunk of R. The simplicity is real — there is no partitioning logic — but the memory contract is brittle.

  2. GRACE hash join. Two passes. Phase 1 reads both R and S once and hash-partitions them into B buckets keyed on h_partition(join_key). Phase 2 joins partition pairs (R_i, S_i) independently with classic hash join. Provided each R_i fits in memory, the algorithm completes in two passes over each side. Pure GRACE writes everything to disk before any join work happens, so the first matching tuple is delayed by |R|+|S| I/O — bad for short queries.

  3. Hybrid hash join (Shapiro 1986). Mix: keep the first partition’s R rows in an in-memory hash table during phase 1, so probing tuples for partition 0 join immediately; spill the rest. This recovers classic-hash latency for the easy case while keeping GRACE’s correctness guarantee on the hard case.

  4. Recursive partitioning. When even one partition’s R does not fit, re-partition that pair with a different hash function and recurse. Required only for severe data skew or when the partition count was under-estimated.

Three orthogonal considerations cut across all four:

  • Hash-table layout. Open addressing, separate chaining with per-bucket lists, or a directory-of-buckets like a textbook in-memory extendible hash. CUBRID’s in-memory path uses chained hashing (mht_hls) and its on-disk path uses a directory-style extendible hash (FHS, “file hash structure” — see cubrid-extendible-hash.md).

  • Build-side selection. Since the hash table is the memory hog, the smaller relation goes on build. With pre-computed cardinalities this is a CBO decision; without them the choice can be made dynamically by reading both sides up to a memory threshold and assigning the build role to whichever side is exhausted first (sometimes called role reversal).

  • NULL semantics. SQL equality returns UNKNOWN against NULL, so hash-keyed NULLs never join in equi-joins — they have to be either filtered out at hash time or routed through a fallback that emits null-padded outer tuples for outer joins.

The textbook Garcia-Molina/Ullman/Widom Database Systems: The Complete Book §15.4–§15.5 contrasts hash join with sort-merge on the same axes and shows that for an equi-join with no useful sort-order output requirement, hybrid hash beats sort-merge by roughly a factor of log(|R|/M) (the sort-merge needs that many merge passes; hash needs one if |R|<M·B for B partitions). Sort-merge wins only when some downstream operator (typically a MERGE JOIN or a GROUP BY/ORDER BY on the join key) wants a sorted output for free.

CUBRID consumes this design space at two levels:

  • The optimizer (qo_hjoin_cost, qo_examine_hash_join) admits hash join as one of three join methods alongside nested-loop and sort-merge, but with a strong gating bias: by default, qo_examine_hash_join returns immediately unless the user wrote /*+ USE_HASH */ on the inner relation. The cost model itself uses a fixed CPU-overhead constant for build/probe and treats spill as free (the I/O term is currently disabled), so the hint is the effective decision lever in production.
  • The executor (qexec_hash_join) realises all four of Graefe’s designs at once: it picks an in-memory layout when the build side fits, a hybrid layout (in-memory index, on-disk tuple storage) when the index fits but the tuples do not, an extendible-hash file when even the index does not fit, and falls back to grace-style partitioning (hjoin_try_partition) when the chosen primitive’s per-partition memory budget is still exceeded.

Every relational engine that supports hash join exposes the same seam: a physical operator at the executor side that owns build and probe phases, and a cost rule at the optimizer side that picks the operator over nested-loop and sort-merge. Naming the common vocabulary up front makes the CUBRID identifiers in the next section read as one set of dial settings.

PostgreSQL — nodeHashjoin.c + nodeHash.c


PostgreSQL splits the operator into two nodes: Hash (a blocking node that materialises its child into a HashJoinTable) and HashJoin (the probing node). The build runs at Hash’s first ExecProcNode; the probe runs in HashJoin’s next() until the inner hash table is exhausted for the current outer tuple. PostgreSQL has used a hybrid hash since v7.x, with BufFile-backed grace partitioning when work_mem is exceeded; v11 added parallel-aware hash join with shared barriers between worker batches. The hash function is the hash support function registered for the join column’s datatype; multi-column keys hash by rotating the accumulated word left one bit and XOR-ing in each column’s hash. Build-side selection is a planner decision based on pg_class.reltuples/pg_class.relpages; it does not flip dynamically.

MySQL — HashJoinIterator (8.0+)


MySQL only got hash join in 8.0.18, replacing block-nested-loop for many cases. The iterator is in sql/iterators/hash_join_iterator.cc. It keeps an in-memory mem_root hash table of the build side; on overflow it spills both inputs to per-batch chunk files and processes batch pairs in turn (so MySQL’s mode is essentially GRACE without the hybrid optimisation). MySQL deliberately does not implement parallel hash join in the iterator; parallelism is handled at the partition level for its parallel-query feature.

SQL Server — Hash Match (in-memory, GRACE, recursive)

SQL Server picks among in-memory hash, GRACE, and recursive hash within the same physical operator (Hash Match), with the selection driven at runtime by the memory grant. When the build side exceeds the grant, the operator spills bucket pages to tempdb; if a partition still does not fit, it re-hashes with a secondary function (“recursive hash”), and as a last resort falls through to a “bailout” that loops over residual bucket files. The optimizer’s join rule evaluates all three join methods (PhyOp_HashJoin, PhyOp_LoopsJoin, PhyOp_MergeJoin) and prunes by cost; there is no “hint required” gate as in CUBRID’s default configuration.

Oracle — HASH JOIN operator with HJ tagged variants


Oracle exposes HASH JOIN (in-memory build-then-probe), HASH JOIN SEMI, HASH JOIN ANTI, and HASH JOIN OUTER as distinct execution plan operators. Spill is to PGA temporary segments. Oracle’s optimizer decides at parse time, not at execution time, by way of _hash_join_enabled and the cost model. The OUTER variant emits null-padded tuples directly out of the same probe loop with a “matched” bit on each build entry, similar to PostgreSQL’s HJ_NULL_INNER machinery.

| Theoretical concept | CUBRID name |
| --- | --- |
| Hash-join physical operator | HASHJOIN_PROC (XASL xasl->type); HASHJOIN_PROC_NODE (xasl.h) |
| Build/probe driver | qexec_hash_join → hjoin_execute_internal (query_hash_join.c) |
| Per-execution shared state | HASHJOIN_MANAGER (query_hash_join.h) |
| Per-partition execution context | HASHJOIN_CONTEXT (query_hash_join.h) |
| Per-side fetch / domain / list-id bundle | HASHJOIN_FETCH_INFO (query_hash_join.h) |
| Hash-table primitive (build/probe surface) | HASH_LIST_SCAN (query_hash_scan.h); members temp_key, temp_new_key, memory.hash_table, file.hash_table |
| Hash-table layout selector | HASH_METHOD ∈ {NOT_USE, IN_MEM, HYBRID, HASH_FILE} (query_hash_scan.h) |
| In-memory hash table | MHT_HLS_TABLE from mht_create_hls (memory_hash.c) |
| Disk-backed hash table | FHSID (file hash structure ID) opened by fhs_create (query_hash_scan.c) |
| Per-tuple key descriptor | HASH_SCAN_KEY (val_count, values[]) |
| Per-tuple value descriptor (in-mem stores tuple, hybrid stores VPID/offset) | HASH_SCAN_VALUE (union of tuple and pos) |
| Hash-key computation | qdata_hash_scan_key (query_hash_scan.c) |
| Build phase | hjoin_build + hjoin_build_key (query_hash_join.c) |
| Probe phase (inner) | hjoin_probe + hjoin_probe_key |
| Probe phase (outer-join, with null-pad fallback) | hjoin_outer_probe |
| GRACE-style partitioning trigger | hjoin_check_partition → HASHJOIN_STATUS_PARTITION (query_hash_join.c) |
| GRACE-style splitter | hjoin_split_qlist (query_hash_join.c) |
| Memory budget for one partition | PRM_ID_MAX_HASH_LIST_SCAN_SIZE (max_hash_list_scan_size, default 16 MB) |
| Per-partition fill margin | PARTITION_FILL_FACTOR = 0.8 (query_hash_join.c) |
| Cost-model entry | qo_hjoin_cost (query_planner.c:3604) |
| Optimizer admission gate | qo_examine_hash_join (query_planner.c:6618) |
| Hash-join CPU weight constants | HJ_BUILD_CPU_OVERHEAD_FACTOR = 30, HJ_PROBE_CPU_OVERHEAD_FACTOR = 20 (query_planner.c:82–83) |
| Plan-to-XASL lowering | make_hashjoin_proc (plan_generation.c:553); pt_to_hashjoin_proc (xasl_generation.c:14889) |
| Executor dispatch | case HASHJOIN_PROC: in qexec_execute_mainblock_internal (query_executor.c:14748) |
| Output-tuple writer | hjoin_merge_tuple_to_list_id (query_hash_join.c) |
| Parallel-execution backend | parallel_query::hash_join::execute_partitions (px_hash_join.cpp) |

CUBRID’s hash join is a Volcano-shaped blocking-build/streaming-probe operator with a dedicated XASL type (HASHJOIN_PROC). Unlike the adjacent MERGELIST_PROC operator (sort-merge), it does not consume its inputs through a generic SCAN_ID chain — it owns a custom driver (qexec_hash_join) that pulls both inputs as already-materialised list files from the XASL’s aptr_list, drives Build and Probe in sequence, and writes the joined output into the parent XASL’s list_id.

XASL shape — HASHJOIN_PROC and the manager/context split


The XASL node carries two HASHJOIN_INPUT substructures (one per side) and a QFILE_LIST_MERGE_INFO describing the join key columns and output projection:

// hashjoin_proc_node — src/query/xasl.h
typedef struct hashjoin_proc_node
{
  HASHJOIN_INPUT outer;
  HASHJOIN_INPUT inner;
  QFILE_LIST_MERGE_INFO merge_info;
#if defined (SERVER_MODE) || defined (SA_MODE)
  HASHJOIN_DOMAIN_INFO domain_info;
  HASHJOIN_STATS_GROUP stats_group;
#endif
} HASHJOIN_PROC_NODE;

The two HASHJOIN_INPUT structs carry a child XASL_NODE *xasl (the sub-block that produces the side’s list file) plus a REGU_VARIABLE_LIST regu_list_pred for evaluating during-join predicates. The merge_info is the same QFILE_LIST_MERGE_INFO sort-merge uses, with ls_column_cnt (number of join-key columns), ls_outer_column[] / ls_inner_column[] (column positions in each side’s tuple), ls_pos_cnt / ls_pos_list[] / ls_outer_inner_list[] (output projection: which side and which column for each output value), and join_type.

When qexec_hash_join runs it builds two further structs that exist only for the duration of one execution — HASHJOIN_MANAGER and HASHJOIN_CONTEXT:

// hashjoin_manager — src/query/query_hash_join.h
typedef struct hashjoin_manager
{
  HASHJOIN_INPUT *outer;	/* points into proc.hashjoin */
  HASHJOIN_INPUT *inner;
  QFILE_LIST_MERGE_INFO *merge_info;
  JOIN_TYPE join_type;
  int key_cnt;
  PRED_EXPR *during_join_pred;
  int num_parallel_threads;
  QUERY_ID query_id;
  VAL_DESCR *val_descr;
  HASHJOIN_CONTEXT single_context;	/* fast-path single partition */
  HASHJOIN_CONTEXT *contexts;	/* one per grace partition */
  UINT32 context_cnt;
  QFILE_TUPLE_VALUE_TYPE_LIST type_list;
  HASHJOIN_MERGE_METHOD qlist_merge_method;
  parallel_query::worker_manager *px_worker_manager;
  ...
} HASHJOIN_MANAGER;

The split is intentional: HASHJOIN_MANAGER holds everything that is constant across partitions (input pointers, join type, output schema, parallel worker pool); HASHJOIN_CONTEXT holds everything specific to one (build side, probe side, hash scan, output list-id). For the no-spill case there is exactly one context (single_context); when grace partitioning kicks in, contexts[] is allocated with one entry per partition, and the manager iterates through them.

flowchart LR
  XASL["XASL_NODE<br/>type=HASHJOIN_PROC<br/>aptr_list=[outer, inner]<br/>list_id=output"]
  PROC["HASHJOIN_PROC_NODE<br/>(in xasl->proc.hashjoin)<br/>outer.xasl, inner.xasl,<br/>merge_info, domain_info,<br/>stats_group"]
  MGR["HASHJOIN_MANAGER<br/>(stack-local in qexec_hash_join)<br/>join_type, key_cnt,<br/>type_list, contexts[]"]
  CTX["HASHJOIN_CONTEXT<br/>outer/inner FETCH_INFO,<br/>build, probe,<br/>hash_scan, list_id, status"]
  HLS["HASH_LIST_SCAN<br/>temp_key, temp_new_key,<br/>memory.hash_table OR<br/>file.hash_table"]
  XASL --> PROC
  XASL -- aptr_list[0] --> OUT["outer XASL → list_id"]
  XASL -- aptr_list[1] --> INR["inner XASL → list_id"]
  PROC --> MGR
  MGR --> CTX
  CTX --> HLS

Driver — qexec_hash_join and the empty-side fast paths


The driver is dispatched by the XASL interpreter. The relevant arm of qexec_execute_mainblock_internal is small:

// qexec_execute_mainblock_internal — src/query/query_executor.c
case HASHJOIN_PROC:
  if (qexec_hash_join (thread_p, xasl, xasl_state->query_id, &xasl_state->vd) != NO_ERROR)
    {
      GOTO_EXIT_ON_ERROR;
    }
  break;

Because both inputs are already materialised by the time the executor hits this case (their XASLs sit in aptr_list and have already run during the interpreter’s pre-pass loop over aptr_list, which calls qexec_execute_mainblock on each entry), qexec_hash_join can read their final cardinalities directly from outer.xasl->list_id->tuple_cnt and decide build-side and partitioning up-front.

The driver’s first job is to cope with empty sides cheaply:

// qexec_hash_join — src/query/query_hash_join.c (condensed)
int
qexec_hash_join (THREAD_ENTRY *thread_p, XASL_NODE *xasl,
		 QUERY_ID query_id, VAL_DESCR *val_descr)
{
  HASHJOIN_MANAGER manager;
  HASHJOIN_CONTEXT *single_context;
  HASHJOIN_STATUS status, part_status;

  hjoin_init_manager (thread_p, &manager, xasl, query_id, val_descr);
  single_context = &manager.single_context;

  status = hjoin_check_empty_inputs (&manager, single_context);
  switch (status)
    {
    case HASHJOIN_STATUS_FILL_NULL_VALUES:
      hjoin_outer_fill_null_values (thread_p, &manager, single_context);
      break;
    case HASHJOIN_STATUS_TRY:
      part_status = hjoin_try_partition (thread_p, &manager, single_context);
      switch (part_status)
	{
	case HASHJOIN_STATUS_SINGLE:
	  hjoin_execute (..., single_context);
	  break;
	case HASHJOIN_STATUS_PARTITION:
	  hjoin_execute_partitions (...);
	  break;
	case HASHJOIN_STATUS_PARALLEL:
	  parallel_query::hash_join::execute_partitions (...);
	  break;
	}
      break;
    case HASHJOIN_STATUS_END:
      /* nothing to do; output is empty */
      break;
    }

  qfile_copy_list_id (xasl->list_id, single_context->list_id, ...);
  hjoin_clear_manager (thread_p, &manager);
  ...
}

hjoin_check_empty_inputs is the first decision:

// hjoin_check_empty_inputs — src/query/query_hash_join.c (condensed)
switch (manager->join_type)
  {
  case JOIN_INNER:
    status = (outer_tuple_cnt == 0 || inner_tuple_cnt == 0)
      ? HASHJOIN_STATUS_END : HASHJOIN_STATUS_TRY;
    break;
  case JOIN_LEFT:
    status = (outer_tuple_cnt == 0)
      ? HASHJOIN_STATUS_END
      : (inner_tuple_cnt == 0)
      ? HASHJOIN_STATUS_FILL_NULL_VALUES
      : HASHJOIN_STATUS_TRY;
    break;
  case JOIN_RIGHT:
    status = (inner_tuple_cnt == 0)
      ? HASHJOIN_STATUS_END
      : (outer_tuple_cnt == 0)
      ? HASHJOIN_STATUS_FILL_NULL_VALUES
      : HASHJOIN_STATUS_TRY;
    break;
  }

Three statuses: END (one or both sides are empty in a way that produces no output), FILL_NULL_VALUES (the preserved side of an outer join is non-empty but the null-supplying side is empty — emit null-padded preserved-side tuples directly without ever building a hash table), and TRY (proceed with build/probe). The FILL_NULL_VALUES shortcut pays off for outer joins against empty lookup tables, where the entire build phase would be wasted.

flowchart TB
  START([qexec_hash_join])
  CHK{hjoin_check_empty_inputs}
  END_NULL[HASHJOIN_STATUS_END<br/>output is empty]
  FILL[hjoin_outer_fill_null_values<br/>emit null-padded preserved rows]
  TRY{hjoin_try_partition}
  SINGLE[HASHJOIN_STATUS_SINGLE<br/>hjoin_execute on single_context]
  PART[HASHJOIN_STATUS_PARTITION<br/>hjoin_execute_partitions<br/>over contexts[]]
  PXL[HASHJOIN_STATUS_PARALLEL<br/>parallel_query::hash_join::<br/>execute_partitions]
  COPY[qfile_copy_list_id<br/>→ xasl->list_id]
  START --> CHK
  CHK -- END --> END_NULL --> COPY
  CHK -- FILL_NULL_VALUES --> FILL --> COPY
  CHK -- TRY --> TRY
  TRY -- SINGLE --> SINGLE --> COPY
  TRY -- PARTITION --> PART --> COPY
  TRY -- PARALLEL --> PXL --> COPY
  COPY --> DONE([clear manager])

hjoin_build is the build loop. Three things matter here: which side plays build, what is fetched into the key, and what is stored as the value.

Build-side selection is per-context, in hjoin_init_context:

// hjoin_init_context — src/query/query_hash_join.c (condensed)
switch (manager->join_type)
  {
  case JOIN_INNER:
    if (outer->list_id->tuple_cnt < inner->list_id->tuple_cnt)
      {
	context->build = outer;
	context->probe = inner;
      }
    else if (outer->list_id->tuple_cnt == inner->list_id->tuple_cnt
	     && outer->list_id->page_cnt < inner->list_id->page_cnt)
      {
	context->build = outer;
	context->probe = inner;
      }
    else
      {
	context->build = inner;
	context->probe = outer;
      }
    break;
  case JOIN_LEFT:
    /* preserve outer; null-pad inner */
    context->build = inner;
    context->probe = outer;
    break;
  case JOIN_RIGHT:
    context->build = outer;
    context->probe = inner;
    break;
  }

For inner join the smaller side wins, with page count as the tie-breaker (page-count proxies for tuple-width × tuple-count, so it is a closer proxy for total bytes than tuple count alone). For outer join the choice is forced by semantics: the null-supplying side has to be on build, so its hash table can be probed and any miss can fire null-padding for the preserved side.

Per-tuple build in the no-partition case:

// hjoin_build — src/query/query_hash_join.c (condensed)
while ((scan_code = qfile_scan_list_next (thread_p, &build->list_scan_id,
					  &build->tuple_record, PEEK)) == S_SUCCESS)
  {
    error = hjoin_fetch_key (thread_p, build, &build->tuple_record, key,
			     NULL /* compare_key */, &need_skip_next);
    if (need_skip_next)
      {
	need_skip_next = false;
	continue;		/* NULL in any join column */
      }

    hash_scan->curr_hash_key = qdata_hash_scan_key (key, UINT_MAX, hash_method);
    error = hjoin_build_key (thread_p, hash_scan,
			     &build->list_scan_id, &build->tuple_record);
  }

The loop is the literal Volcano shape: pull the next tuple, materialise the join-key columns into a HASH_SCAN_KEY, hash, insert. The interesting wrinkle is need_skip_next: tuples whose join column contains any NULL are silently dropped from the build side, because SQL’s = on NULL never yields TRUE. (For outer-join’s preserved-NULL tuples the dropping happens on the probe side instead — they are routed to a “fill null” partition; see “Spill” below.)

In the partitioned case the tuple has already been hashed during the splitter pass, and the hash key is stashed in the tuple’s first value slot; hjoin_build reads it back rather than recomputing it:

// hjoin_build (partitioned arm) — src/query/query_hash_join.c (condensed)
tuple_value = build->tuple_record.tpl + QFILE_TUPLE_LENGTH_SIZE;
tuple_value += QFILE_TUPLE_VALUE_HEADER_LENGTH;
hash_scan->curr_hash_key = (UINT32) OR_GET_INT (tuple_value);
hjoin_build_key (thread_p, hash_scan, &build->list_scan_id, &build->tuple_record);

This is the same trick PostgreSQL plays in batched hash joins: hash once at partition time, reuse the hash word at build/probe time.

hjoin_build_key is the polymorphic insert. It dispatches on hash_scan->hash_list_scan_type:

// hjoin_build_key — src/query/query_hash_join.c (condensed)
switch (hash_scan->hash_list_scan_type)
  {
  case HASH_METH_IN_MEM:
    hash_value = qdata_alloc_hscan_value (thread_p, tuple_record->tpl);
    mht_put_hls (hash_scan->memory.hash_table,
		 (void *) &hash_scan->curr_hash_key, (void *) hash_value);
    break;
  case HASH_METH_HYBRID:
    hash_value = qdata_alloc_hscan_value_OID (thread_p, list_scan_id);
    mht_put_hls (hash_scan->memory.hash_table,
		 (void *) &hash_scan->curr_hash_key, (void *) hash_value);
    break;
  case HASH_METH_HASH_FILE:
    SET_TFTID (tftid, list_scan_id->curr_vpid.volid,
	       list_scan_id->curr_vpid.pageid, list_scan_id->curr_offset);
    fhs_insert (thread_p, hash_scan->file.hash_table,
		(void *) &hash_scan->curr_hash_key, &tftid);
    break;
  }

The arms differ only in what is stored as the value: a deep-copied tuple (in-memory), a (volid, pageid, offset) tuple position (hybrid), or a TFTID written to an extendible-hash bucket (hash-file). The key — a 32-bit hash word — is the same in all three.

hjoin_probe is the inner-join probe loop. It mirrors the build loop’s outer shape (a qfile_scan_list_next over the probe side’s list file) and inside it runs an inner do { } while (true) loop that walks the chain of build entries hashing to the same key:

// hjoin_probe — src/query/query_hash_join.c (condensed)
while ((scan_code = qfile_scan_list_next (thread_p, &probe->list_scan_id,
					  &probe->tuple_record, PEEK)) == S_SUCCESS)
  {
    if (manager->context_cnt == 0)
      {
	/* fresh fetch — no pre-computed hash on tuple */
	hjoin_fetch_key (thread_p, probe, &probe->tuple_record, key, NULL, &need_skip_next);
	if (need_skip_next)
	  {
	    continue;		/* NULL → no inner match possible */
	  }
	hash_scan->curr_hash_key = qdata_hash_scan_key (key, UINT_MAX, hash_method);
      }
    else
      {
	/* partitioned — read pre-computed hash word out of tuple slot 0 */
	...
      }

    do
      {
	hjoin_probe_key (thread_p, hash_scan, &build->list_scan_id, &build->tuple_record);
	if (build->tuple_record.tpl == NULL)
	  {
	    break;		/* not found */
	  }
	if (manager->context_cnt != 0)
	  {
	    hjoin_fetch_key (thread_p, probe, &probe->tuple_record, key, NULL, &need_skip_next);
	  }
	hjoin_fetch_key (thread_p, build, &build->tuple_record, found_key,
			 key /* compare_key */, &need_skip_next);
	if (need_skip_next)
	  {
	    continue;		/* hash collision: same hash, different key */
	  }
	hjoin_merge_tuple_to_list_id (thread_p, list_id,
				      &outer->tuple_record, &inner->tuple_record,
				      manager->merge_info, &overflow_record);
      }
    while (true);
  }

The inner loop is what makes the operator tolerate hash collisions correctly: hjoin_probe_key returns any build entry whose hash word equals the current probe hash; the body then fetches the build tuple’s key columns into found_key and re-compares them against the probe key. If the columns disagree (same hash, different value — a collision), need_skip_next is set and the loop goes after the next chained entry. Only when keys match exactly does hjoin_merge_tuple_to_list_id write the joined output tuple.

This is the same correctness guard PostgreSQL has in ExecHashJoin when it does ExecQual(hjclauses, econtext) after hashing — the hash is a fast filter, not a definitive match.

hjoin_probe_key itself dispatches on the table layout:

// hjoin_probe_key — src/query/query_hash_join.c (condensed)
switch (hash_scan->hash_list_scan_type)
  {
  case HASH_METH_IN_MEM:
    hash_value = (tuple_record->tpl == NULL)
      ? mht_get_hls (hash_scan->memory.hash_table,
		     (void *) &hash_scan->curr_hash_key,
		     (void **) &hash_scan->memory.curr_hash_entry)
      : mht_get_next_hls (hash_scan->memory.hash_table, ...);
    if (hash_value != NULL)
      {
	tuple_record->tpl = hash_value->tuple;
      }
    break;
  case HASH_METH_HYBRID:
    hash_value = (tuple_record->tpl == NULL)
      ? mht_get_hls (...) : mht_get_next_hls (...);
    if (hash_value != NULL)
      {
	MAKE_TUPLE_POSTION (tuple_position, hash_value->pos, list_scan_id);
	qfile_jump_scan_tuple_position (thread_p, list_scan_id,
					&tuple_position, tuple_record, PEEK);
      }
    break;
  case HASH_METH_HASH_FILE:
    eh_search = (tuple_record->tpl == NULL)
      ? fhs_search (thread_p, hash_scan, &tftid)
      : fhs_search_next (thread_p, hash_scan, &tftid);
    if (eh_search == EH_KEY_FOUND)
      {
	MAKE_TFTID_TO_TUPLE_POSTION (tuple_position, tftid, list_scan_id);
	qfile_jump_scan_tuple_position (thread_p, list_scan_id,
					&tuple_position, tuple_record, PEEK);
      }
    break;
  }

The tuple_record->tpl == NULL check is the chain-walk anchor: on first call for a probe key, ask for the first matching entry; on subsequent calls, ask for the next. The hybrid and hash-file paths follow up the position lookup with a qfile_jump_scan_tuple_position that re-reads the tuple from the build-side list file — this is the extra page fetch that is the cost of avoiding deep-copying tuples into the hash table.

hjoin_outer_probe is the outer-join variant. Structurally it is the inner-join probe with one extra branch: before going to the next probe tuple, check whether at least one match was emitted; if not, emit a null-padded record using the preserved side’s tuple and outer->fill_record / inner->fill_record (one of which is NULL):

// hjoin_outer_probe — src/query/query_hash_join.c (condensed)
any_record_added = false;
do
  {
    /* same chain walk as hjoin_probe, with during_join_pred check */
    ...
    any_record_added = true;
  }
while (true);

if (!any_record_added)
  {
    /* fill null for unmatched preserved row */
    hjoin_merge_tuple_to_list_id (thread_p, list_id,
				  outer->fill_record, inner->fill_record,
				  manager->merge_info, &overflow_record);
  }

during_join_pred is the same hook PostgreSQL uses to attach inequality predicates from ON clauses to the hash-join probe loop; it lets R LEFT JOIN S ON R.k = S.k AND R.x > S.x evaluate the second predicate per matched-pair during the chain walk, not as a post-filter, so a non-qualifying inequality does not count as a “match” and the null-padding still fires.

flowchart TB
  P0([Open probe scan])
  P1[qfile_scan_list_next<br/>over probe list_id]
  P2{NULL in<br/>join key?}
  P3[fetch_key into HASH_SCAN_KEY]
  P4[hash_key = qdata_hash_scan_key]
  P5[hjoin_probe_key:<br/>first match in chain]
  P6{found?}
  P7[fetch build key,<br/>compare_key against probe key]
  P8{collision<br/>same h diff k?}
  P9{during_join_pred?}
  P10[merge_tuple_to_list_id<br/>append output row]
  P11[hjoin_probe_key:<br/>next match in chain]
  P12[outer-only:<br/>any_record_added?]
  P13[fill null-padded row]
  P0 --> P1 --> P2
  P2 -- yes inner --> P1
  P2 -- yes outer --> P13 --> P1
  P2 -- no --> P3 --> P4 --> P5 --> P6
  P6 -- no --> P12
  P6 -- yes --> P7 --> P8
  P8 -- collision --> P11 --> P5
  P8 -- match --> P9
  P9 -- false --> P11
  P9 -- true --> P10 --> P11
  P12 -- inner / matched --> P1
  P12 -- outer / no match --> P13 --> P1

Hash-scan primitive — three table layouts


HASH_LIST_SCAN is the primitive shared between hash join and hash list scan (CUBRID’s term for hash-keyed IN-list and uncorrelated subquery probing — see cubrid-post-processing.md). The struct is in query_hash_scan.h:

// hash_list_scan — src/query/query_hash_scan.h
struct hash_list_scan
{
  regu_variable_list_node *build_regu_list;	/* used by hash list scan only */
  regu_variable_list_node *probe_regu_list;	/* used by hash list scan only */
  HASH_SCAN_KEY *temp_key;	/* probe key buffer */
  HASH_SCAN_KEY *temp_new_key;	/* coerced/build key buffer */
  union
  {
    struct
    {
      MHT_HLS_TABLE *hash_table;	/* in-memory hash, chained */
      HENTRY_HLS_PTR curr_hash_entry;	/* iterator state */
    } memory;
    struct
    {
      FHSID *hash_table;	/* extendible hash file */
      OID curr_oid;		/* iterator state */
      bool is_dk_bucket;	/* duplicate-key bucket flag */
    } file;
  };
  HASH_METHOD hash_list_scan_type;	/* IN_MEM | HYBRID | HASH_FILE */
  unsigned int curr_hash_key;
  bool need_coerce_type;
};

The layout selection is made once per HASHJOIN_CONTEXT in hjoin_scan_init, off the build side’s page count and tuple count versus a single memory budget mem_limit:

// hjoin_scan_init — src/query/query_hash_join.c (condensed)
mem_limit = prm_get_bigint_value (PRM_ID_MAX_HASH_LIST_SCAN_SIZE);

if ((UINT64) list_id->page_cnt * DB_PAGESIZE <= mem_limit)
  {
    hash_scan->hash_list_scan_type = HASH_METH_IN_MEM;
    hash_scan->memory.hash_table =
      mht_create_hls ("Hash Join", list_id->tuple_cnt, NULL, NULL);
  }
else if ((UINT64) list_id->tuple_cnt
	 * (sizeof (HENTRY_HLS) + sizeof (QFILE_TUPLE_SIMPLE_POS)) <= mem_limit)
  {
    hash_scan->hash_list_scan_type = HASH_METH_HYBRID;
    hash_scan->memory.hash_table =
      mht_create_hls ("Hash Join", list_id->tuple_cnt, NULL, NULL);
  }
else
  {
    hash_scan->hash_list_scan_type = HASH_METH_HASH_FILE;
    hash_scan->file.hash_table =
      (FHSID *) db_private_alloc (thread_p, sizeof (FHSID));
    fhs_create (thread_p, hash_scan->file.hash_table, list_id->tuple_cnt);
  }

Two threshold tests carve out three layouts:

| Layout | Condition | Where keys live | Where tuples live |
| --- | --- | --- | --- |
| HASH_METH_IN_MEM | page_cnt * DB_PAGESIZE ≤ mem_limit | mht_hls | inside the hash entry (deep-copied) |
| HASH_METH_HYBRID | tuples don’t fit but tuple_cnt * (HENTRY+POS) ≤ mem_limit | mht_hls | original list-file pages, addressed by (volid,pageid,offset) |
| HASH_METH_HASH_FILE | even the index doesn’t fit | FHSID directory + bucket pages | original list-file pages, addressed by TFTID |

The hybrid mode is CUBRID’s native middle-ground between classic in-memory hash and pure GRACE: the hash index fits, but the hash payload does not; on probe-side hits, the operator pays one extra page fetch (qfile_jump_scan_tuple_position) per match to re-read the tuple from the original list file. This is cheaper than the pure hash-file path because the index probe is still in memory.

The HASH_METH_HASH_FILE path leans on the same extendible-hash machinery the catalog uses (cubrid-extendible-hash.md): a directory of bucket VPIDs, doubled when a bucket overflows its local depth, with duplicate-key chains via “dk” buckets. Memory consumption is bounded by the directory size plus one bucket-page-fix at a time; the rest is on disk.

flowchart TB
  subgraph IN_MEM["HASH_METH_IN_MEM"]
    direction LR
    K1[hash_key] --> B1[bucket head]
    B1 --> E1["HENTRY_HLS<br/>{key=hash_key,<br/>data=HASH_SCAN_VALUE.tuple}"]
    E1 --> E2["HENTRY_HLS"]
  end
  subgraph HYBRID["HASH_METH_HYBRID"]
    direction LR
    K2[hash_key] --> B2[bucket head]
    B2 --> E3["HENTRY_HLS<br/>{key=hash_key,<br/>data=HASH_SCAN_VALUE.pos<br/>={vpid,offset}}"]
    E3 -. jump --> LF[list-file page]
    LF --> T2["tuple"]
  end
  subgraph FILE["HASH_METH_HASH_FILE"]
    direction LR
    K3[hash_key] --> DIR["FHSID directory"]
    DIR --> BKT["bucket page"]
    BKT --> R3["{TFTID = (volid,pageid,offset)}"]
    R3 -. jump --> LF2[list-file page]
    LF2 --> T3["tuple"]
  end

Hash-key normalisation, type coercion, and NULL handling


Two relations joined on R.k = S.k only have a meaningful hash join if R.k and S.k agree on type, precision, scale, and collation; otherwise the same logical value can hash to two different words and the join silently misses rows. CUBRID’s defence is hjoin_init_domain_info, called once per context:

// hjoin_init_domain_info — src/query/query_hash_join.c (condensed)
for (domain_index = 0; domain_index < domain_cnt; domain_index++)
  {
    outer_type = TP_DOMAIN_TYPE (outer_domains[domain_index]);
    inner_type = TP_DOMAIN_TYPE (inner_domains[domain_index]);

    if (outer_type == inner_type)
      {
	common_type = outer_type;
      }
    else
      {
	need_coerce_domains = true;
	common_type = (tp_more_general_type (outer_type, inner_type) > 0)
	  ? outer_type : inner_type;
      }

    if (outer_precision == inner_precision && outer_scale == inner_scale)
      {
	common_precision = inner_precision;
	common_scale = inner_scale;
      }
    else
      {
	need_coerce_domains = true;
	if (common_type == DB_TYPE_NUMERIC)
	  {
	    common_scale = MAX (outer_scale, inner_scale);
	    common_precision = MAX (outer_integral, inner_integral) + common_scale;
	    common_precision = MIN (common_precision, DB_MAX_NUMERIC_PRECISION);
	  }
	else
	  {
	    common_precision = MAX (outer_precision, inner_precision);
	    common_scale = 0;
	  }
      }

    if (need_coerce_domains)
      {
	coerce_domains[domain_index] = tp_domain_cache (...);
      }
  }

The result is domain_info->coerce_domains[] — one common domain per join column, used by hjoin_fetch_key to coerce values into a shared representation before hashing. Without this, hashing INT 1 against BIGINT 1 would produce two different hash words.
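A toy illustration of the failure mode the coercion step prevents — a hypothetical sketch in which byte-wise FNV-1a stands in for the engine's type-specific hasher (the real path goes through mht_get_hash_number): hashing the raw storage bytes of INT 1 and BIGINT 1 yields different words because the byte strings differ in length, while widening the INT to the common BIGINT domain first restores agreement.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Byte-wise FNV-1a, a stand-in for a type-specific hasher; the names and
   algorithm here are illustrative, not CUBRID's. */
static uint32_t
fnv1a (const void *p, size_t n)
{
  const unsigned char *b = (const unsigned char *) p;
  uint32_t h = 2166136261u;
  for (size_t i = 0; i < n; i++)
    {
      h ^= b[i];
      h *= 16777619u;
    }
  return h;
}

/* Hash a value through its raw storage — what would happen with no
   domain coercion. */
static uint32_t hash_int32 (int32_t v) { return fnv1a (&v, sizeof v); }
static uint32_t hash_int64 (int64_t v) { return fnv1a (&v, sizeof v); }
```

Without the coercion step the same logical value 1 lands in different buckets on the build and probe sides; coercing the INT side to the common BIGINT domain before hashing makes both sides feed identical bytes to the hasher, which is exactly what coerce_domains[] arranges.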

hjoin_fetch_key is the per-tuple key materialiser. It walks the tuple’s value slots, picks out the join columns by index (value_indexes[key_index]), and for each:

// hjoin_fetch_key — src/query/query_hash_join.c (condensed)
if (QFILE_GET_TUPLE_VALUE_FLAG (tuple_value) == V_UNBOUND)
  goto skip_next;               /* NULL → tuple drops out of join */

or_init (&buf, tuple_value + QFILE_TUPLE_VALUE_HEADER_SIZE, value_size);

if (need_coerce_domains && coerce_domains[key_index] != NULL
    && coerce_domains[key_index] != domains[key_index])
  {
    domains[key_index]->type->data_readval (&buf, &pre_coerce_value, ...);
    tp_value_coerce (&pre_coerce_value, key->values[key_index],
                     coerce_domains[key_index]);
  }
else
  {
    domains[key_index]->type->data_readval (&buf, key->values[key_index], ...);
  }

if (compare_key != NULL)
  {
    /* collision check on probe side */
    compare_result = tp_value_compare (key->values[key_index],
                                       compare_key->values[key_index], 0, 0);
    if (compare_result != DB_EQ)
      goto skip_next;
  }

Two roles in one function: the first call (with compare_key=NULL) materialises the key for hashing; the second call (with compare_key=key, where key is the probe key already materialised) re-fetches the build tuple’s key columns and verifies them against the probe key — the collision check the probe loop relies on.

qdata_hash_scan_key is the actual hash function:

// qdata_hash_scan_key — src/query/query_hash_scan.c
unsigned int
qdata_hash_scan_key (const void *key, unsigned int ht_size,
                     HASH_METHOD hash_method)
{
  HASH_SCAN_KEY *ckey = (HASH_SCAN_KEY *) key;
  unsigned int hash_val = 0, tmp_hash_val;
  int i;

  for (i = 0; i < ckey->val_count; i++)
    {
      hash_val = ROTL32 (hash_val, 13);
      tmp_hash_val = mht_get_hash_number (ht_size, ckey->values[i]);
      hash_val ^= tmp_hash_val;
      if (hash_val == 0)
        hash_val = tmp_hash_val;
    }

  if (hash_method == HASH_METH_HASH_FILE)
    hash_val = fhs_hash (&hash_val);    /* extra mixing for FHSID */

  return hash_val;
}

Multi-column keys are combined by the standard XOR-with-rotate scheme (ROTL32(h,13) ^ h_i), with a guard against the all-zero state collapsing the chain. For the disk path one extra round of mixing (fhs_hash) is applied because FHSID’s directory uses the high bits of the hash word as the bucket index, and an unmixed hash leaks type-specific patterns into those bits.
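The combining step can be reproduced in isolation. A minimal sketch (hypothetical helper names; the per-column words stand in for mht_get_hash_number output):

```c
#include <assert.h>
#include <stdint.h>

static uint32_t
rotl32 (uint32_t x, int r)
{
  return (x << r) | (x >> (32 - r));
}

/* Multi-column combiner mirroring qdata_hash_scan_key: rotate the
   accumulator, XOR in the next column's hash word, and guard against the
   all-zero state.  Without the guard, a zero accumulator would erase the
   contribution of every earlier column. */
static uint32_t
combine_hashes (const uint32_t *col_hash, int n)
{
  uint32_t h = 0;
  for (int i = 0; i < n; i++)
    {
      h = rotl32 (h, 13);
      h ^= col_hash[i];
      if (h == 0)
        h = col_hash[i];
    }
  return h;
}
```

The rotation is what makes the scheme order-sensitive: keys (a, b) and (b, a) combine to different words, which a plain XOR of the column hashes would not guarantee.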

NULL handling is uniform: V_UNBOUND in any join column drops the tuple from the join. For inner joins this is correct (NULL never equates). For outer joins, the splitter additionally routes NULL-keyed preserved-side tuples to a dedicated “fill null” partition; see the next section.

Spill — grace-style equi-hash partitioning

When the build side’s footprint exceeds the layout’s threshold for a single-partition run, hjoin_check_partition splits the work into multiple partitions, GRACE-style. The decision is in hjoin_check_partition:

// hjoin_check_partition — src/query/query_hash_join.c (condensed)
mem_limit = prm_get_bigint_value (PRM_ID_MAX_HASH_LIST_SCAN_SIZE);
min_tuple_cnt = MIN (outer_list_id->tuple_cnt, inner_list_id->tuple_cnt);
part_cnt = CEIL_PTVDIV ((sizeof (HENTRY_HLS) + sizeof (QFILE_TUPLE_SIMPLE_POS))
                        * min_tuple_cnt,
                        mem_limit * PARTITION_FILL_FACTOR);
if (part_cnt > 1)
  {
    if (IS_OUTER_JOIN_TYPE (manager->join_type))
      part_cnt += 1;    /* one extra partition for NULL-keyed preserved rows */
    manager->context_cnt = part_cnt;
    return HASHJOIN_STATUS_PARTITION;
  }
return HASHJOIN_STATUS_SINGLE;

The partition count is sized so that one partition’s hash index (plus tuple-position pointers, i.e. the HASH_METH_HYBRID worst case) fits in 80% of mem_limit. The 80% margin — PARTITION_FILL_FACTOR — is a fudge factor against bucket imbalance: a perfectly hashed input would fit exactly at 100%, but real hash distributions have a long tail.
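Plugging in concrete numbers makes the sizing tangible. The per-entry footprint below is an assumed stand-in (the real figure is sizeof (HENTRY_HLS) + sizeof (QFILE_TUPLE_SIMPLE_POS)), and the fill factor is expressed as the ratio 4/5 to stay in integer arithmetic:

```c
#include <assert.h>
#include <stdint.h>

#define ENTRY_BYTES 48          /* assumed per-tuple index footprint */

/* Sketch of hjoin_check_partition's arithmetic: total index bytes needed,
   divided (rounding up, as CEIL_PTVDIV does) by 80% of the memory limit. */
static int64_t
compute_part_cnt (int64_t min_tuple_cnt, int64_t mem_limit)
{
  int64_t need = ENTRY_BYTES * min_tuple_cnt;
  int64_t budget = mem_limit * 4 / 5;   /* PARTITION_FILL_FACTOR = 0.8 */
  return (need + budget - 1) / budget;
}
```

With a 10-million-tuple build side and an 8 MiB budget (an assumed parameter value) the index needs ~480 MB, so the splitter fans out to 72 partitions; a 1,000-tuple input yields part_cnt == 1 and stays on the single-partition path.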

The split itself is hjoin_split_qlist, run twice (once for outer, once for inner). It is the GRACE phase-1 splitter:

// hjoin_split_qlist — src/query/query_hash_join.c (condensed)
while ((scan_code = qfile_scan_list_next (thread_p, &list_scan_id,
                                          &tuple_record, PEEK)) == S_SUCCESS)
  {
    hjoin_fetch_key (thread_p, split_info->fetch_info, &tuple_record,
                     temp_key, NULL, &need_skip_next);
    if (need_skip_next)
      {
        if (is_outer_join)
          part_id = part_cnt - 1;   /* last partition := "fill null" bucket */
        else
          continue;                 /* inner: drop NULL-keyed tuples */
      }
    else
      {
        hash_key = qdata_hash_scan_key (temp_key, UINT_MAX, HASH_METH_IN_MEM);
        part_id = is_outer_join ? hash_key % (part_cnt - 1)
                                : hash_key % part_cnt;
        hjoin_update_tuple_hash_key (thread_p, &tuple_record, hash_key);
      }
    qfile_add_tuple_to_list (thread_p, part_list_id[part_id], tuple_record.tpl);
  }

Three details worth calling out:

  • Hash word is cached on the tuple. hjoin_update_tuple_hash_key writes the 32-bit hash into a reserved slot at the head of the tuple, so the build/probe phase reads it back rather than re-hashing every tuple a second time. This is paid for at XASL-generation time by leaving an extra MAX_ALIGNMENT-sized value slot at tuple position 0 with an initial sentinel of -1; hjoin_update_tuple_hash_key asserts that sentinel is still there before overwriting.

  • Outer-join “fill null” partition. For JOIN_LEFT / JOIN_RIGHT, one extra partition is allocated and reserved for tuples whose preserved-side join column is NULL. These tuples never match anything in the build side, so their join is decided by SQL’s outer-join semantics alone — null-pad and emit. hjoin_execute detects the last partition specially:

    // hjoin_execute — src/query/query_hash_join.c (excerpt)
    if (IS_OUTER_JOIN_TYPE (manager->join_type)
        && context == &manager->contexts[manager->context_cnt - 1])
      {
        status = (status == HASHJOIN_STATUS_TRY)
                 ? HASHJOIN_STATUS_FILL_NULL_VALUES : status;
      }

    The dedicated hjoin_outer_fill_null_values then walks the partition’s preserved-side list file and emits one null-padded tuple per row, without touching the build side at all.

  • Append-vs-temp dance. Each partition’s output list-file is built up via a temp_part_list_id[] working set — when the in-memory buffer of one partition fills, the splitter flushes it to the partition’s “real” list-file (part_list_id[]) and starts a new buffer. This is the equivalent of PostgreSQL’s BufFile-backed batch tapes; the difference is that CUBRID’s QFILE_LIST_ID is the same abstraction the executor already uses for every materialised intermediate, so the partition tapes integrate transparently with the existing qfile_scan_list_next machinery.
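The routing rule the splitter applies to each tuple reduces to a few lines; a minimal sketch (hypothetical helper, mirroring the part_id logic in hjoin_split_qlist):

```c
#include <assert.h>
#include <stdint.h>

/* Route one tuple to a partition: NULL-keyed tuples go to the last
   ("fill null") partition for outer joins and are dropped (-1 here) for
   inner joins; everything else goes by hash modulo the number of regular
   partitions.  Outer joins reserve the last slot, hence part_cnt - 1. */
static int
route_partition (uint32_t hash_key, int key_is_null,
                 int is_outer_join, int part_cnt)
{
  if (key_is_null)
    return is_outer_join ? part_cnt - 1 : -1;
  return is_outer_join ? (int) (hash_key % (part_cnt - 1))
                       : (int) (hash_key % part_cnt);
}
```

Note the invariant this encodes: for outer joins the regular partitions number part_cnt - 1, so a NULL-keyed tuple can never collide with a hashed one.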

After splitting, hjoin_execute_partitions walks contexts[] sequentially:

// hjoin_execute_partitions — src/query/query_hash_join.c (condensed)
for (context_index = 0; context_index < context_cnt; context_index++)
  {
    current_context = &manager->contexts[context_index];
    hjoin_execute (thread_p, manager, current_context);
    hjoin_merge_qlist (thread_p, manager, current_context);
  }

hjoin_merge_qlist handles the per-partition output stitching — by default qlist_merge_method == HASHJOIN_MERGE_CONNECT, which chains list files end-to-end via qfile_connect_list rather than copying tuples; this is the cheap option and is correct because each partition’s output is independent. The fallbacks HASHJOIN_MERGE_APPEND (copy-append) and HASHJOIN_MERGE_COMBINE (union with deduplication) exist for code paths that the operator does not currently exercise.
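The cost difference between CONNECT and APPEND is the classic pointer-splice vs copy trade-off. A toy head/tail list standing in for QFILE_LIST_ID (hypothetical structure — qfile_connect_list splices list-file page chains, not heap nodes) shows why connecting is O(1) per partition:

```c
#include <assert.h>
#include <stddef.h>

struct node { int val; struct node *next; };
struct list { struct node *head, *tail; size_t len; };

/* Append one node in O(1) via the tail pointer. */
static void
list_push (struct list *l, struct node *n)
{
  n->next = NULL;
  if (l->head == NULL)
    l->head = l->tail = n;
  else
    {
      l->tail->next = n;
      l->tail = n;
    }
  l->len++;
}

/* Splice src onto dst in O(1), leaving src empty — the shape of
   HASHJOIN_MERGE_CONNECT.  A copy-append (MERGE_APPEND) would instead
   walk src and duplicate every element. */
static void
list_connect (struct list *dst, struct list *src)
{
  if (src->head == NULL)
    return;
  if (dst->head == NULL)
    *dst = *src;
  else
    {
      dst->tail->next = src->head;
      dst->tail = src->tail;
      dst->len += src->len;
    }
  src->head = src->tail = NULL;
  src->len = 0;
}
```

The splice is safe here for the same reason it is safe in the operator: each partition's output is produced independently, so no ordering or deduplication is needed across the pieces.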

stateDiagram-v2
  [*] --> Check
  Check --> SingleRun : part_cnt == 1
  Check --> Partition : part_cnt > 1
  Partition --> SplitOuter : hjoin_split_qlist(outer)
  SplitOuter --> SplitInner : hjoin_split_qlist(inner)
  SplitInner --> ParallelOk : SERVER_MODE && enough pages
  SplitInner --> Sequential : single-thread fallback
  ParallelOk --> ForkWorkers : px_worker_manager::reserve
  ForkWorkers --> Joined : barrier on parallel::execute_partitions
  Sequential --> NextCtx
  NextCtx --> RunCtx : hjoin_execute(ctx[i])
  RunCtx --> MergeCtx : hjoin_merge_qlist
  MergeCtx --> NextCtx : i++
  MergeCtx --> Done : i == part_cnt
  SingleRun --> RunSingle : hjoin_execute(single_context)
  RunSingle --> Done
  Joined --> Done
  Done --> [*]

Cost model integration — qo_hjoin_cost and qo_examine_hash_join

The optimizer side of hash join lives in two places. qo_hjoin_cost (in query_planner.c) computes the four-slot cost vector (fixed_cpu_cost, fixed_io_cost, variable_cpu_cost, variable_io_cost) for a candidate hash-join plan node. The model is deliberately simple and entirely in CPU terms:

// qo_hjoin_cost — src/optimizer/query_planner.c (condensed)
inner_cardinality = inner_plan_p->info->cardinality;
outer_cardinality = outer_plan_p->info->cardinality;

plan_p->fixed_cpu_cost = outer_plan_p->fixed_cpu_cost + inner_plan_p->fixed_cpu_cost;
plan_p->fixed_io_cost = outer_plan_p->fixed_io_cost + inner_plan_p->fixed_io_cost;
plan_p->variable_cpu_cost = outer_plan_p->variable_cpu_cost + inner_plan_p->variable_cpu_cost;
plan_p->variable_io_cost = outer_plan_p->variable_io_cost + inner_plan_p->variable_io_cost;

/* if inner plays build */
inner_build_cpu_cost = inner_cardinality * QO_CPU_WEIGHT * HJ_BUILD_CPU_OVERHEAD_FACTOR;
inner_build_cpu_cost += outer_cardinality * QO_CPU_WEIGHT * HJ_PROBE_CPU_OVERHEAD_FACTOR;

/* if outer plays build */
outer_build_cpu_cost = inner_cardinality * QO_CPU_WEIGHT * HJ_PROBE_CPU_OVERHEAD_FACTOR;
outer_build_cpu_cost += outer_cardinality * QO_CPU_WEIGHT * HJ_BUILD_CPU_OVERHEAD_FACTOR;

switch (plan_p->plan_un.join.join_type)
  {
  case JOIN_LEFT:               /* outer preserved → inner is build */
    plan_p->variable_cpu_cost += inner_build_cpu_cost;
    break;
  case JOIN_RIGHT:              /* inner preserved → outer is build */
    plan_p->variable_cpu_cost += outer_build_cpu_cost;
    break;
  case JOIN_INNER:              /* whichever is cheaper */
    plan_p->variable_cpu_cost +=
      MIN (inner_build_cpu_cost + inner_build_io_cost,   /* *_io_cost terms */
           outer_build_cpu_cost + outer_build_io_cost);  /* elided above */
    break;
  }

The two cost constants are defined at the top of query_planner.c:

// query_planner.c — cost constants
#define HJ_BUILD_CPU_OVERHEAD_FACTOR 30
#define HJ_PROBE_CPU_OVERHEAD_FACTOR 20

Build is taken to be ~50% more expensive per tuple than probe — a reasonable approximation given that build allocates and inserts a hash entry while probe only walks a chain. Crucially, the IO term is disabled at present (the mem_limit IO surcharge is wrapped in a #if 0), with the comment “No need to increase weight since partitioned hash join is used even when mem_limit is exceeded.” This is the optimizer’s own admission that it under-models the spill cost — the cost difference between an in-memory single-partition run and a spilled multi-partition run is currently zero in the planner’s eyes.
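The build/probe asymmetry and the JOIN_INNER MIN can be checked with a back-of-envelope calculation. In this sketch QO_CPU_WEIGHT is normalised to 1 to keep the arithmetic exact (the real constant lives in query_planner.c) and the disabled IO terms are left out:

```c
#include <assert.h>

#define HJ_BUILD_CPU_OVERHEAD_FACTOR 30
#define HJ_PROBE_CPU_OVERHEAD_FACTOR 20

/* Cost of running the join with the inner side as build, mirroring
   qo_hjoin_cost's inner_build_cpu_cost (weight normalised to 1). */
static double
build_inner_cost (double outer_card, double inner_card)
{
  return inner_card * HJ_BUILD_CPU_OVERHEAD_FACTOR
       + outer_card * HJ_PROBE_CPU_OVERHEAD_FACTOR;
}

static double
build_outer_cost (double outer_card, double inner_card)
{
  return inner_card * HJ_PROBE_CPU_OVERHEAD_FACTOR
       + outer_card * HJ_BUILD_CPU_OVERHEAD_FACTOR;
}

/* JOIN_INNER picks whichever role assignment is cheaper. */
static double
inner_join_cost (double outer_card, double inner_card)
{
  double a = build_inner_cost (outer_card, inner_card);
  double b = build_outer_cost (outer_card, inner_card);
  return a < b ? a : b;
}
```

For a 1,000-row outer probing a 100-row inner, building the inner costs 100·30 + 1000·20 = 23,000 weight units versus 32,000 the other way round, so the model prefers building the smaller side, as expected.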

qo_examine_hash_join is the admission gate. It is called from planner_visit_node’s STEP 5-5 alongside qo_examine_nl_join, qo_examine_idx_join, and qo_examine_merge_join:

// planner_visit_node — src/optimizer/query_planner.c (excerpt)
/* STEP 5-5: examine hash-join */
if (!bitset_is_empty (&sm_join_terms))
  {
    kept += qo_examine_hash_join (new_info, join_type, head_info, tail_info,
                                  &sm_join_terms, &duj_terms, &afj_terms,
                                  &sarged_terms, &pinned_subqueries);
  }

The interesting part is qo_examine_hash_join’s gating logic — which is how the operator stays out of the optimizer’s hot path until explicitly requested:

// qo_examine_hash_join — src/optimizer/query_planner.c (condensed)
mem_limit = prm_get_bigint_value (PRM_ID_MAX_HASH_LIST_SCAN_SIZE);
if (mem_limit <= 0)
  goto exit;                    /* hash list scan disabled entirely */

if (bitset_intersects (sarged_terms, &info->env->fake_terms))
  goto exit;                    /* fake terms incompatible */

for (each term in hash_join_terms)
  if (QO_IS_PATH_TERM (term) && QO_TERM_JOIN_TYPE (term) != JOIN_INNER)
    goto exit;                  /* path terms forbid hash-join */

if (QO_NODE_HINT (inner_node) & PT_HINT_NO_USE_HASH)
  goto exit;
else if (QO_NODE_HINT (inner_node) & PT_HINT_USE_HASH)
  /* fall through */ ;
else if (QO_NODE_HINT (inner_node) & (PT_HINT_USE_NL | PT_HINT_USE_IDX | PT_HINT_USE_MERGE))
  goto exit;
else
  {
#if TEST_HASH_JOIN_ENABLE
    /* fall through */
#else
    goto exit;                  /* default: hash-join is not considered */
#endif
  }

Decoded, the rules are:

| Hint state | Result |
| --- | --- |
| NO_USE_HASH | skip |
| USE_HASH | consider |
| USE_NL / USE_IDX / USE_MERGE | skip (other join method forced) |
| (no hint) | skip (compile-time TEST_HASH_JOIN_ENABLE flips this) |

In the production build, TEST_HASH_JOIN_ENABLE is not defined, so hash join is never considered without an explicit /*+ USE_HASH */ hint on the inner relation. This is a deliberately conservative posture: the operator and its cost model exist, but admission is gated by policy. Workloads that benefit from hash join (typically analytical equi-joins of large fact and dimension tables) must be annotated by the developer. The path-term and key-limit checks further exclude configurations where the join’s row-multiplicity contract would be at risk.

If admission passes, qo_join_new constructs a QO_PLAN of method QO_JOINMETHOD_HASH_JOIN, which qo_check_plan_on_info then ranks against any other plans for the same DP slot.

Plan-tree integration — make_hashjoin_proc and dispatch

When qo_to_xasl walks the surviving plan tree, hash-join nodes are lowered by make_hashjoin_proc (in plan_generation.c):

// make_hashjoin_proc — src/optimizer/plan_generation.c
static XASL_NODE *
make_hashjoin_proc (QO_ENV *env, QO_PLAN *plan, XASL_NODE *outer_xasl,
                    XASL_NODE *inner_xasl, QO_PROJECTION_INFO *projection_info)
{
  ...
  HASHJOIN_PROC_NODE *proc;
  ...
  xasl = pt_to_hashjoin_proc (parser, outer_xasl, inner_xasl);
  ...
}

pt_to_hashjoin_proc (in the parser-side xasl_generation.c) sets xasl->type = HASHJOIN_PROC, populates proc->outer.xasl / proc->inner.xasl, and fills merge_info from the plan’s join key-column metadata (ls_outer_column[], ls_inner_column[], ls_pos_list[], etc.). The two child XASLs are also linked into the parent’s aptr_list so that the executor’s pre-pass evaluates them before the case HASHJOIN_PROC: arm of the dispatcher fires.

There is one extra check, check_hashjoin_xasl, that runs at make_hashjoin_proc time:

// check_hashjoin_xasl — src/optimizer/plan_generation.c (condensed)
for (hashjoin_xasl = xasl->aptr_list; hashjoin_xasl != NULL;
     hashjoin_xasl = hashjoin_xasl->next)
  if (hashjoin_xasl->type == HASHJOIN_PROC)
    break;

if (hashjoin_xasl == NULL
    || hashjoin_xasl->aptr_list == NULL           /* outer */
    || hashjoin_xasl->aptr_list->next == NULL     /* inner */)
  goto error_exit;

/* both sides present and merge_info column counts > 0 */

The check guarantees that the lowered XASL has a well-formed HASHJOIN_PROC node with both children attached and at least one join-key column — a hash join with zero key columns would be a Cartesian product, which the operator does not implement.

The xasl_to_stream.c packing path (xts_process_hashjoin_proc) and its corresponding sizing function (xts_sizeof_hashjoin_proc) ensure the node round-trips through the wire format that the broker uses to ship XASL trees to the server.

Variants — inner, left-outer, right-outer, parallel

CUBRID’s hash join supports three SQL join shapes plus a parallel flavour:

  • JOIN_INNER — the canonical case. Build-side is whichever is smaller; collisions are filtered by the post-hash key compare; output is ungrouped.

  • JOIN_LEFT (left outer) — outer (LHS) is preserved, inner (RHS) is null-supplying. Inner is forced to build (so that any outer row failing to match can fire null-padding from the probe-side fallback). The fallback is in hjoin_outer_probe’s if (!any_record_added) branch.

  • JOIN_RIGHT (right outer) — symmetric. Outer is forced to build.

  • Parallel — when the partition count is > 1 and the build side is large enough (min_page_cnt clears the compute_parallel_degree threshold) and the worker pool grants ≥2 workers, hjoin_try_parallel upgrades the run to HASHJOIN_STATUS_PARALLEL and dispatches to parallel_query::hash_join::execute_partitions. Each worker takes an independent partition pair and runs hjoin_execute over it; the results are concatenated under a per-manager mutex. The parallel path piggybacks on the same HASHJOIN_CONTEXT[] array as the sequential path, with extra synchronisation in HASHJOIN_SHARED_SPLIT_INFO and HASHJOIN_SHARED_JOIN_INFO.

Three notable absences: there is no full-outer join path (JOIN_OUTER / JOIN_FULL is rejected by hjoin_init_context’s assert_release_error (false) default arm), no semi/anti-join specialisation (CUBRID rewrites EXISTS / IN subqueries elsewhere and the rewritten form does not reach HASHJOIN_PROC), and no recursive (re-)partitioning (after one round of grace partitioning, hjoin_scan_init for an oversized partition will simply fall through to HASH_METH_HASH_FILE and accept the on-disk overhead).

The “fill null values” arm — hjoin_outer_fill_null_values — is notable for being a complete bypass of the build/probe machinery: when the null-supplying side is empty, no hash table is created at all; the operator just walks the preserved side and emits one null-padded record per row.

flowchart LR
  TYPE{join_type}
  INN[JOIN_INNER<br/>build = smaller side]
  LEF[JOIN_LEFT<br/>build = inner RHS]
  RIG[JOIN_RIGHT<br/>build = outer LHS]
  TYPE --> INN
  TYPE --> LEF
  TYPE --> RIG

  INN --> PIN[hjoin_probe<br/>no null pad]
  LEF --> POL[hjoin_outer_probe<br/>null-pad outer<br/>on probe miss]
  RIG --> POR[hjoin_outer_probe<br/>null-pad inner<br/>on probe miss]

  PIN --> OUT[xasl->list_id]
  POL --> OUT
  POR --> OUT

  EMP{empty side?}
  TYPE -. inner empty .-> EMP
  EMP -- preserved empty --> NOP[empty output]
  EMP -- supplying empty --> FILL[hjoin_outer_fill_null_values<br/>walk preserved-only,<br/>null-pad each row]
  FILL --> OUT

Symbols grouped by responsibility. Refer to the position-hints table at the end for (file, line) pairs as observed when this doc was last updated.

Driver and lifecycle

  • qexec_hash_join — public entry, called from qexec_execute_mainblock_internal’s case HASHJOIN_PROC.
  • hjoin_init_manager, hjoin_clear_manager — HASHJOIN_MANAGER lifecycle.
  • hjoin_init_context, hjoin_clear_context, hjoin_destroy_qlist — HASHJOIN_CONTEXT lifecycle (build-side selection lives here).
  • hjoin_check_empty_inputs — HASHJOIN_STATUS_END / FILL_NULL_VALUES / TRY decision.
  • hjoin_execute — per-context build+probe driver.
  • hjoin_execute_internal — open list scans, run hjoin_build, open probe scan, run hjoin_probe or hjoin_outer_probe.
  • hjoin_outer_fill_null_values — fast path for outer joins with empty null-supplying side.
  • hjoin_execute_partitions — sequential walk over contexts[] for partitioned execution.

Build / probe primitives

  • hjoin_fetch_key — read join-column values out of one tuple, optionally compare against an existing key.
  • hjoin_update_tuple_hash_key — write the 32-bit hash word into the tuple’s reserved value slot during partitioning.
  • hjoin_build, hjoin_build_key — the build loop and per-tuple insert (dispatches to mht_put_hls or fhs_insert).
  • hjoin_probe, hjoin_outer_probe, hjoin_probe_key — probe loops and chain-walking primitive (dispatches to mht_get_hls / mht_get_next_hls / fhs_search / fhs_search_next).
  • hjoin_merge_tuple_to_list_id, hjoin_merge_tuple — write the joined row to the output list-file, with overflow fallback for rows > one page.

Hash-scan primitive (shared with hash list scan)

  • qdata_alloc_hscan_key, qdata_free_hscan_key — HASH_SCAN_KEY allocator/deallocator.
  • qdata_alloc_hscan_value, qdata_alloc_hscan_value_OID, qdata_free_hscan_value — HASH_SCAN_VALUE allocators (in-mem-tuple vs hybrid-position).
  • qdata_build_hscan_key, qdata_copy_hscan_key, qdata_copy_hscan_key_without_alloc — key materialisers from a reguvar list.
  • qdata_hash_scan_key — the hash function (multi-column rotate-XOR; fhs_hash round for HASH_FILE).
  • qdata_hscan_key_compare, qdata_hscan_key_eq — collision resolvers.
  • qdata_print_hash_scan_entry — debug dumper.
  • hjoin_scan_init, hjoin_scan_clear — HASH_LIST_SCAN lifecycle plus method selection (HASH_METH_IN_MEM / HYBRID / HASH_FILE).

Domain handling

  • hjoin_init_domain_info — compute common domain per join column, set coerce_domains[], set need_coerce_domains flag.

Partitioning (spill / grace)

  • hjoin_check_partition — decide whether to partition based on mem_limit and min_tuple_cnt.
  • hjoin_try_partition — top-level partitioning driver, including fallback to SINGLE on error and dispatch to parallel.
  • hjoin_prepare_partition — allocate contexts[] and per-partition list-files.
  • hjoin_split_qlist — phase-1 splitter: hash, route to partition, cache hash word on tuple.
  • hjoin_init_split_info, hjoin_clear_split_info — HASHJOIN_SPLIT_INFO lifecycle.
  • hjoin_init_shared_split_info, hjoin_clear_shared_split_info — parallel synchronisation primitives.
  • hjoin_merge_qlist — concatenate per-partition output list-files via qfile_connect_list (or fall back to copy-append / union-with-distinct for alternate merge methods).

Parallel

  • hjoin_try_parallel — admission gate (compute_parallel_degree + worker_manager::try_reserve_workers).
  • parallel_query::hash_join::build_partitions, parallel_query::hash_join::execute_partitions — worker-managed splitter and per-partition runner (in src/query/parallel/px_hash_join/px_hash_join.cpp).

Optimizer side

  • qo_hjoin_cost — four-slot cost vector for a hash-join plan node.
  • qo_examine_hash_join — admission gate (hint check, path-term exclusion, key-limit exclusion).
  • make_hashjoin_proc, pt_to_hashjoin_proc — plan-to-XASL lowering.
  • check_hashjoin_xasl — well-formedness check on the lowered XASL.
  • xts_process_hashjoin_proc, xts_sizeof_hashjoin_proc — XASL-stream packing.
  • qdump_print_hashjoin_proc_node — query plan dumper.

Tracing / profiling

  • hjoin_trace_start, hjoin_trace_end, hjoin_trace_merge_stats — per-step elapsed-time / fetch / ioread accumulation.
  • hjoin_profile_start, hjoin_profile_end (compile-time gated by HASHJOIN_PROFILE_TIME) — finer-grained per-step profiling.
  • hjoin_trace_get_worker_stats, hjoin_trace_drain_worker_stats — parallel-stats merging.
| Symbol | File | Line |
| --- | --- | --- |
| qexec_hash_join | src/query/query_hash_join.c | 182 |
| hjoin_execute_partitions | src/query/query_hash_join.c | 327 |
| hjoin_execute | src/query/query_hash_join.c | 408 |
| hjoin_outer_fill_null_values | src/query/query_hash_join.c | 463 |
| hjoin_execute_internal | src/query/query_hash_join.c | 609 |
| hjoin_init_manager | src/query/query_hash_join.c | 723 |
| hjoin_clear_manager | src/query/query_hash_join.c | 882 |
| hjoin_init_domain_info | src/query/query_hash_join.c | 970 |
| hjoin_try_partition | src/query/query_hash_join.c | 1166 |
| hjoin_check_partition | src/query/query_hash_join.c | 1320 |
| hjoin_prepare_partition | src/query/query_hash_join.c | 1377 |
| hjoin_build_partitions | src/query/query_hash_join.c | 1534 |
| hjoin_split_qlist | src/query/query_hash_join.c | 1629 |
| hjoin_merge_qlist | src/query/query_hash_join.c | 1828 |
| hjoin_try_parallel | src/query/query_hash_join.c | 1952 |
| hjoin_init_split_info | src/query/query_hash_join.c | 2070 |
| hjoin_clear_split_info | src/query/query_hash_join.c | 2127 |
| hjoin_init_shared_split_info | src/query/query_hash_join.c | 2198 |
| hjoin_init_context | src/query/query_hash_join.c | 2298 |
| hjoin_clear_context | src/query/query_hash_join.c | 2394 |
| hjoin_scan_init | src/query/query_hash_join.c | 2465 |
| hjoin_scan_clear | src/query/query_hash_join.c | 2581 |
| hjoin_check_empty_inputs | src/query/query_hash_join.c | 2637 |
| hjoin_fetch_key | src/query/query_hash_join.c | 2701 |
| hjoin_update_tuple_hash_key | src/query/query_hash_join.c | 2849 |
| hjoin_build | src/query/query_hash_join.c | 2874 |
| hjoin_build_key | src/query/query_hash_join.c | 3046 |
| hjoin_probe | src/query/query_hash_join.c | 3129 |
| hjoin_outer_probe | src/query/query_hash_join.c | 3364 |
| hjoin_probe_key | src/query/query_hash_join.c | 3700 |
| hjoin_merge_tuple_to_list_id | src/query/query_hash_join.c | 3838 |
| hjoin_merge_tuple | src/query/query_hash_join.c | 3904 |
| hjoin_trace_merge_stats | src/query/query_hash_join.c | 4163 |
| qdata_alloc_hscan_key | src/query/query_hash_scan.c | 206 |
| qdata_hash_scan_key | src/query/query_hash_scan.c | 287 |
| qdata_hscan_key_eq | src/query/query_hash_scan.c | 363 |
| qdata_build_hscan_key | src/query/query_hash_scan.c | 379 |
| qdata_alloc_hscan_value | src/query/query_hash_scan.c | 634 |
| qdata_alloc_hscan_value_OID | src/query/query_hash_scan.c | 669 |
| scan_build_hash_list_scan | src/query/scan_manager.c | 8574 |
| scan_hash_probe_next | src/query/scan_manager.c | 8789 |
| qo_hjoin_cost | src/optimizer/query_planner.c | 3604 |
| HJ_BUILD_CPU_OVERHEAD_FACTOR | src/optimizer/query_planner.c | 82 |
| HJ_PROBE_CPU_OVERHEAD_FACTOR | src/optimizer/query_planner.c | 83 |
| qo_examine_hash_join | src/optimizer/query_planner.c | 6617 |
| Hash-join admission in planner_visit_node | src/optimizer/query_planner.c | 8021 |
| make_hashjoin_proc | src/optimizer/plan_generation.c | 553 |
| check_hashjoin_xasl | src/optimizer/plan_generation.c | 1856 |
| pt_to_hashjoin_proc | src/parser/xasl_generation.c | 14889 |
| case HASHJOIN_PROC in executor | src/query/query_executor.c | 14748 |
| xts_process_hashjoin_proc | src/query/xasl_to_stream.c | 3609 |
| hashjoin_proc_node struct | src/query/xasl.h | 376 |
| HASH_LIST_SCAN struct | src/query/query_hash_scan.h | 110 |
| HASHJOIN_MANAGER struct | src/query/query_hash_join.h | 319 |
| HASHJOIN_CONTEXT struct | src/query/query_hash_join.h | 297 |
| PRM_ID_MAX_HASH_LIST_SCAN_SIZE | src/base/system_parameter.c | 3517 |
  • Optimizer ↔ executor symmetry on memory budget. Both qo_examine_hash_join (admission) and hjoin_check_partition (spill trigger) read the same PRM_ID_MAX_HASH_LIST_SCAN_SIZE. The parameter therefore controls both whether hash join is considered at all (when set to zero, the optimizer disables it) and the partition count when it is. This duplication is correct but fragile: a future refactor that changes one site has to chase the other. The optimizer doc (cubrid-query-optimizer.md) references the four-slot cost model but does not call out the qo_examine_hash_join-level zero-check.

  • Cost model under-models spill. As noted in qo_hjoin_cost, the IO surcharge for spilled partitions is #if 0-ed out with the comment that partitioned hash join “is used even when mem_limit is exceeded”. This means the optimizer treats an in-memory hash join and a spilled hash join as having identical IO cost. In practice, the spill path does pay real disk I/O on the per-partition list-file and the FHSID bucket pages, so the planner under-estimates spilled hash-join cost. For workloads where this matters the user is expected to use the USE_HASH hint as a forcing function rather than a tie-breaker, which fits the conservative admission posture.

  • USE_HASH hint is the practical switch. TEST_HASH_JOIN_ENABLE is not set in the production build, so without /*+ USE_HASH(inner_alias) */ on the query, hash join is never generated even when the cost model would prefer it. This is consistent with the executor doc’s silence on hash join (cubrid-query-executor.md mentions qexec_hash_join only by name in the dispatch table) and with the post-processing doc’s emphasis on cubrid-post-processing.md for shared mht_hls machinery.

  • Hash-scan primitive is shared with hash list scan and hash group-by. MHT_HLS_TABLE, HASH_SCAN_KEY, HASH_SCAN_VALUE, qdata_hash_scan_key, qdata_build_hscan_key, qdata_alloc_hscan_value, fhs_* are all shared between query_hash_join.c (this doc), scan_manager.c’s scan_build_hash_list_scan / scan_next_hash_list_scan / scan_hash_probe_next (hash list scan, used by uncorrelated subqueries and IN-list probes — see cubrid-query-executor.md’s scan-manager section), and query_executor.c’s qexec_hash_gby_* family (hash group-by — see cubrid-post-processing.md’s sort-vs-hash GROUP BY section). Any evolution of the hash key encoding has to land in all three.

  • Splitter assumes a reserved tuple slot. hjoin_split_qlist and hjoin_update_tuple_hash_key write the 32-bit hash word into the first tuple value slot, asserting the slot is exactly MAX_ALIGNMENT bytes wide and was initialised to a -1 sentinel. This is a contract with the XASL generator (pt_to_hashjoin_proc/make_hashjoin_proc); any future change to the tuple shape that reorders or shrinks slot 0 will silently break hash-join partitioning. The contract is not asserted at XASL build time, only at split time.

  • Build-side selection is per-partition, not per-query. For inner join, hjoin_init_context recomputes build = smaller side for each partition independently. After splitting, one partition can have a build-side that is a different side from another partition’s, depending on how the split distributed rows. This is correct per-partition (each partition is a self-contained classic hash join), but it means a single query’s trace output can show mixed build sides for a partitioned hash join. The deck for the QP module does not call this out.

  • hjoin_outer_probe collision-skip on partitioned probe is an asserted-impossible case. When the probe side is partitioned, the probe key is fetched lazily inside the chain-walk loop; if need_skip_next fires there, it would mean a NULL appeared in a pre-partitioned probe tuple, which the splitter is supposed to have already routed to the fill-null partition. The code defensively emits a null-padded row and continues, but the assert_release_error (false) shows the operator does not expect this branch to fire in well-formed XASL. Useful diagnostic if someone changes the splitter behaviour.

  • No collation in the hash function. qdata_hash_scan_key calls mht_get_hash_number which goes through the type-specific hashers; for string types the hash includes the collation only insofar as the underlying primitive encodes it. The post-hash key compare (tp_value_compare in hjoin_fetch_key) does respect collation via pr_clone_value / tp_value_coerce. The system therefore relies on the post-hash compare to catch any collation-driven pseudo-collisions; it is not the case that two collation-equal strings necessarily hash to the same word.

  • HASHJOIN_MERGE_CONNECT is the only exercised merge path. hjoin_init_manager hard-codes qlist_merge_method = HASHJOIN_MERGE_CONNECT. The MERGE_COMBINE and MERGE_APPEND arms of hjoin_merge_qlist are reachable only by editing this line, so they are effectively dead code as of this revision. They are presumably retained for future work (e.g., distinct-result hash joins) where the partition outputs would need union-with-deduplication semantics.

  • Bloom-filter pre-probe. Modern hash joins (PostgreSQL parallel hash, MonetDB, DuckDB) push a Bloom filter from the build side down to the probe-side scan, so non-matching probe rows can be discarded before they cross the operator boundary. CUBRID’s hash join does not implement this; whether it would pay for itself given the conservative admission policy (only USE_HASH-hinted joins reach the operator) is an open question.

  • Adaptive role reversal. When the build side turns out larger than the cost-model estimate (because cardinality estimation was off), some engines (SQL Server’s Hash Match, recent PostgreSQL versions in some shapes) abort the build and restart with the other side as build. CUBRID’s hjoin_init_context decides build side once per partition based on tuple_cnt/page_cnt already captured in the materialised list-file headers, which are exact — not estimates — so role reversal is not necessary for the correctness it usually corrects. However, when the estimate that fed into qo_examine_hash_join’s admission was wrong, a hash join may be chosen for a query whose actual sizes would favour nested-loop; CUBRID has no fallback to switch operators at execution time.

  • Recursive partitioning under skew. When a single partition’s build side is still too large after one round of GRACE partitioning (heavy skew on the join key), CUBRID falls through to HASH_METH_HASH_FILE and accepts the on-disk extendible-hash overhead. SQL Server’s “recursive hash” instead re-partitions with a different hash function. For workloads with severe skew on join keys (hot-key dimensions), CUBRID’s path is correct but slower than the alternative.
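
The alternative the bullet names can be sketched as a level-salted partition hash: a hypothetical part_hash (not a CUBRID function) mixes a per-recursion-level salt into the key hash, so keys that collapsed into one partition at level N spread differently at level N+1. Note the limit of the technique: a single hot key value still cannot be split; re-partitioning only helps when the skew involves several distinct keys.

```c
#include <stdint.h>

/* Level-salted partition hash: XOR a level-dependent salt into the key
 * hash, then run a splitmix64-style finaliser so the salt diffuses into
 * every output bit before the bucket is taken. Illustrative only. */
static uint32_t part_hash (uint64_t key_hash, int level, uint32_t nbuckets)
{
  uint64_t x = key_hash ^ (0x9e3779b97f4a7c15ull * (uint64_t) (level + 1));
  x ^= x >> 30; x *= 0xbf58476d1ce4e5b9ull;
  x ^= x >> 27; x *= 0x94d049bb133111ebull;
  x ^= x >> 31;
  return (uint32_t) (x % nbuckets);
}
```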

  • Parallel build coordination. The parallel path partitions first, then assigns one partition pair per worker; within a partition the build is single-threaded. PostgreSQL since v11 has a shared-memory parallel build, where workers cooperatively insert into one shared hash table. CUBRID’s parallel hash join is at the partition-level only; whether shared-build parallelism is on the roadmap is not documented in the source.

  • Semi-/anti-join specialisation. EXISTS / NOT EXISTS / uncorrelated IN are currently rewritten to standard joins or routed through hash list scan; there is no HASHJOIN_PROC semi-join variant that can early-out after the first build match. The cost saving for selective semi-joins could be significant.
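
The early-out such a variant would enable is small but structural. A minimal sketch with illustrative types (not CUBRID's): an inner-join probe must walk the whole bucket chain and emit one row per build match, while an EXISTS-style semi-join can return as soon as the first key hit is found, skipping the duplicate chain entirely; the 0 return is exactly the NOT EXISTS emission point.

```c
#include <stdint.h>
#include <stddef.h>

typedef struct hj_entry { uint64_t key; struct hj_entry *next; } hj_entry_t;

/* Semi-join probe: stop at the first match instead of walking the full
 * duplicate chain as an inner join must. */
static int semi_probe (const hj_entry_t *bucket, uint64_t probe_key)
{
  for (const hj_entry_t *e = bucket; e != NULL; e = e->next)
    if (e->key == probe_key)
      return 1;   /* EXISTS: emit probe row once, ignore remaining dups */
  return 0;       /* NOT EXISTS: emit probe row here instead */
}
```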

  • Cost-model IO term. The #if 0-ed IO surcharge in qo_hjoin_cost is a known underestimate of spilled hash-join cost. The justification (“partitioned hash join is used even when mem_limit is exceeded”) explains why the model gives up but not what it should be replaced with. A page-based estimator that multiplies min_page_cnt by max(1, (build_size / mem_limit)) would be a small change with large impact on join-method selection in workloads dominated by spilled hash joins.
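
The proposed estimator is small enough to state directly. A sketch with hypothetical names, in whatever units qo_hjoin_cost already carries for min_page_cnt and mem_limit:

```c
/* Page-based IO surcharge for a (possibly) spilled hash join: one pass
 * over min_page_cnt pages when the build fits in memory, scaled by the
 * number of build-side memory multiples when it does not. */
static double hjoin_spill_io_cost (double min_page_cnt, double build_size,
                                   double mem_limit)
{
  double passes = build_size / mem_limit;
  if (passes < 1.0)
    passes = 1.0;               /* in-memory build: single pass */
  return min_page_cnt * passes;
}
```

With build_size at twice mem_limit the IO term doubles, which is the signal the current #if 0-ed code cannot give the planner.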

  • No full-outer hash join. hjoin_init_context’s default arm is assert_release_error (false), which rejects JOIN_OUTER. PostgreSQL implements full-outer hash join with a “matched” bit per build entry plus a final pass to emit unmatched build rows; CUBRID does not. Full-outer joins are presumably routed to sort-merge.
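
The PostgreSQL technique mentioned above can be sketched with illustrative types (none of this is CUBRID code): a matched flag per build row, set during the probe phase, plus a final sweep over the build side that finds the rows to emit null-extended.

```c
#include <stdint.h>
#include <stddef.h>

typedef struct fo_row { uint64_t key; int matched; struct fo_row *next; } fo_row_t;

/* Probe phase: mark every matching build row; a 0 return tells the
 * caller to emit the probe row null-extended (the right-outer half). */
static int fo_probe (fo_row_t *bucket, uint64_t probe_key)
{
  int found = 0;
  for (fo_row_t *r = bucket; r != NULL; r = r->next)
    if (r->key == probe_key)
      {
        r->matched = 1;   /* emit joined row here */
        found = 1;
      }
  return found;
}

/* Final pass, after the probe side is exhausted: every unmatched build
 * row is emitted null-extended (the left-outer half). Here we only
 * count them, standing in for the emission. */
static int fo_unmatched (const fo_row_t *rows)
{
  int n = 0;
  for (const fo_row_t *r = rows; r != NULL; r = r->next)
    if (!r->matched)
      n++;
  return n;
}
```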

  • src/query/query_hash_join.c, src/query/query_hash_join.h — driver, manager/context, build/probe phases, partitioning, parallel glue.
  • src/query/query_hash_scan.c, src/query/query_hash_scan.h — shared hash-scan primitive (key/value allocation, hash function, FHSID extendible-hash file).
  • src/query/scan_manager.c — scan_build_hash_list_scan, scan_next_hash_list_scan, scan_hash_probe_next (hash list scan, the sibling consumer of HASH_LIST_SCAN).
  • src/query/query_executor.c — qexec_execute_mainblock_internal’s case HASHJOIN_PROC dispatch (line ~14748).
  • src/query/xasl.h — HASHJOIN_PROC_NODE, HASHJOIN_INPUT, QFILE_LIST_MERGE_INFO definitions.
  • src/query/xasl_to_stream.c — xts_process_hashjoin_proc, xts_sizeof_hashjoin_proc packing.
  • src/query/query_dump.c — qdump_print_hashjoin_proc_node plan-tree printer.
  • src/optimizer/query_planner.c — qo_hjoin_cost, qo_examine_hash_join, hash-join CPU constants (HJ_BUILD_CPU_OVERHEAD_FACTOR, HJ_PROBE_CPU_OVERHEAD_FACTOR), call-site in planner_visit_node’s STEP 5-5.
  • src/optimizer/plan_generation.c — make_hashjoin_proc, check_hashjoin_xasl.
  • src/parser/xasl_generation.c — pt_to_hashjoin_proc.
  • src/query/parallel/px_hash_join/ — parallel-execution back-end.
  • src/base/system_parameter.c — PRM_ID_MAX_HASH_LIST_SCAN_SIZE parameter declaration.
  • DeWitt, Katz, Olken, Shapiro, Stonebraker, Wood, Implementation Techniques for Main Memory Database Systems, SIGMOD 1984.
  • Shapiro, Join Processing in Database Systems with Large Main Memories, ACM TODS 1986.
  • Kitsuregawa, Tanaka, Moto-oka, Application of Hash to Data Base Machine and Its Architecture, New Generation Computing 1983 (GRACE).
  • Graefe, Query Evaluation Techniques for Large Databases, ACM Computing Surveys 1993, §3 (hash join survey).
  • Graefe, Volcano — An Extensible and Parallel Query Evaluation System, IEEE TKDE 1994 (operator integration).
  • Petrov, Database Internals, ch. 14 “Query Execution” (sort vs hash join).
  • Garcia-Molina, Ullman, Widom, Database Systems: The Complete Book, ch. 15.4–15.5 (hash join cost analysis).
  • Silberschatz, Korth, Sudarshan, Database System Concepts, ch. on join operators (hash join + sort-merge comparison).