CUBRID Extendible Hash — Disk-Resident Hash File With Doubling Directory and Local Depth
Contents:
- Theoretical Background
- Common DBMS Design
- CUBRID’s Approach
- Source Walkthrough
- Cross-check Notes
- Open Questions
- Sources
Theoretical Background
The extendible hashing scheme was introduced by Fagin, Nievergelt, Pippenger & Strong in Extendible Hashing — A Fast Access Method for Dynamic Files (TODS 1979) as the disk-friendly answer to a question B+Trees do not solve cleanly: how does a hash file grow? An in-memory hash table grows by reallocating a larger array and re-hashing all entries; that is unacceptable on a disk where each bucket is a page and each rehash is a scan of the entire file. Linear hashing (Litwin, 1980) gave one answer — split one bucket at a time in a fixed round-robin order, regardless of which bucket overflowed. Extendible hashing gave the other — split the bucket that overflowed, and double the directory only when the split exhausts the addressing depth.
Database Internals (Petrov, ch. 7 §“Hash Files”) lays out the
core data structure. A pseudo-key is computed from the user key
by a uniform hash function (output: a fixed-width integer, in
practice 32 bits). The directory is an array of pointers
indexed by the leftmost N bits of the pseudo-key, where N
is the directory’s global depth (d). The directory holds
2^d entries. Each entry points to a bucket — a single page
containing one or more (key, value) records.
Each bucket carries its own local depth (d_b), which is the
number of leftmost bits the bucket has so far been
“differentiated by”. The invariants are:
- d_b ≤ d for every bucket.
- If d_b < d, then 2^(d - d_b) directory entries all point at this same bucket — the bucket is shared.
- If d_b = d, then exactly one directory entry points at it.
When an insert finds the target bucket full:
- If d_b < d (the bucket is shared): allocate a sibling, redistribute records by their (d_b + 1)-th bit, set both buckets' local depth to d_b + 1, and update the half of the directory entries that previously pointed at the original bucket so they now point at the sibling. The directory itself does not change size.
- If d_b = d: the addressing space is exhausted. Double the directory (d := d + 1, copy each old pointer twice into the new directory), then perform the first-case split.
Deletion runs the dual: when a bucket falls below an underflow
threshold and its sibling has the same local depth and they
together fit, merge them and decrement the local depth. If after
the merge no bucket has local depth d, halve the directory.
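To make the split and doubling rules concrete, here is a minimal in-memory toy, not CUBRID code: the bucket capacity, names, and types are invented, and everything lives in RAM rather than on pages. It applies exactly the two cases above: split when d_b < d, double the directory first when d_b = d.

```c
/* Toy in-memory extendible hash; illustrative only (names and capacity invented). */
#include <stdio.h>
#include <stdlib.h>

#define BUCKET_CAP 2                      /* tiny capacity so splits happen quickly */

typedef struct bucket
{
  int local_depth;                        /* d_b */
  int nkeys;
  unsigned keys[BUCKET_CAP];
} BUCKET;

static int global_depth = 0;              /* d */
static BUCKET **directory;                /* 2^d entries, possibly sharing buckets */

static unsigned
top_bits (unsigned h, int d)
{
  return d == 0 ? 0 : h >> (32 - d);      /* leftmost d bits address the directory */
}

static void
insert (unsigned pseudo_key)
{
  for (;;)
    {
      BUCKET *b = directory[top_bits (pseudo_key, global_depth)];
      if (b->nkeys < BUCKET_CAP)
        {
          b->keys[b->nkeys++] = pseudo_key;
          return;
        }
      if (b->local_depth == global_depth)
        {
          /* case 2: addressing exhausted, so double the directory and copy each pointer twice */
          size_t old_size = (size_t) 1 << global_depth;
          directory = realloc (directory, sizeof (BUCKET *) * old_size * 2);
          for (size_t i = old_size; i-- > 0;)
            {
              directory[2 * i] = directory[i];
              directory[2 * i + 1] = directory[i];
            }
          global_depth++;
        }
      /* case 1: split the overflowed bucket on its (d_b + 1)-th leftmost bit */
      BUCKET *sibling = calloc (1, sizeof (BUCKET));
      b->local_depth++;
      sibling->local_depth = b->local_depth;
      int kept = 0;
      for (int i = 0; i < b->nkeys; i++)
        {
          unsigned k = b->keys[i];
          if ((k >> (32 - b->local_depth)) & 1u)
            sibling->keys[sibling->nkeys++] = k;
          else
            b->keys[kept++] = k;
        }
      b->nkeys = kept;
      /* repoint the upper half of the directory run that used to share b */
      for (size_t i = 0; i < ((size_t) 1 << global_depth); i++)
        if (directory[i] == b && ((i >> (global_depth - b->local_depth)) & 1u))
          directory[i] = sibling;
      /* loop around: re-resolve the target bucket and retry the insert */
    }
}

int
main (void)
{
  directory = malloc (sizeof (BUCKET *));
  directory[0] = calloc (1, sizeof (BUCKET));   /* d = 0, one bucket with d_b = 0 */
  unsigned sample[] = { 0x10000000u, 0x20000000u, 0x90000000u, 0xC0000000u, 0xC8000000u };
  for (size_t i = 0; i < sizeof sample / sizeof sample[0]; i++)
    insert (sample[i]);
  printf ("global depth after inserts: %d\n", global_depth);
  return 0;
}
```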
Two properties drive the design’s appeal:
- Constant-time lookup. A pseudo-key prefix locates the directory entry; one bucket read returns the value. No tree descent, no log-N comparisons.
- No rehash on growth. A directory doubling is an O(2^d) pointer copy; bucket splits affect exactly two pages. Other buckets are untouched.
The cost is two-level indirection (directory page plus bucket page:
two buffer fixes per lookup, where an in-memory table needs only a
single array probe) and the directory's size — 2^d pointers can
balloon if the hash function clusters. Engines that adopt it
either size the directory generously up front or accept the
occasional doubling cost.
CUBRID’s implementation, the subject of this document, adopts
Fagin’s scheme almost literally: an EHID rooted at a directory
file, slotted bucket pages with a small EHASH_BUCKET_HEADER
holding the local depth, a 32-bit pseudo-key produced by
ehash_hash, and split/merge logic conditioned on local-depth /
global-depth comparison. The interesting CUBRID-specific
ingredients live around the edges: WAL integration through
RVEH_* records, system-op brackets around split/merge,
slotted-page-style binary search inside buckets, and the
restricted set of key types (DB_TYPE_STRING, DB_TYPE_OBJECT)
the implementation actually exercises today.
Common DBMS Design
Disk-resident hash files appear in many engines. The shape is shared; the differences are in WAL integration, key-type flexibility, and what the hash file is for.
Where extendible hashing usually shows up
- PostgreSQL hash index — disk-resident dynamic hashing on user tables (internally a linear-hashing variant; see the last bullet). Through PG 9.6 the index was not WAL-logged and could not be replicated; PG 10 added WAL records and crash safety, closing a long-standing complaint. The directory is a meta-page-rooted array of bucket pointers; bucket overflow spills to overflow pages chained from the bucket. PostgreSQL exposes hash indexes directly to users for point-lookup keys.
- BerkeleyDB DB_HASH — dynamic hashing as one of the primary access methods (Seltzer & Yigit's extended linear hashing rather than a Fagin-style doubling directory). Deletions can optionally compact buckets, but the file rarely shrinks.
- Oracle hash clusters — clustered tables grouped on a hash function over the cluster key. Not extendible in Fagin's sense (the bucket count is fixed at create time), but the user-facing motivation — point lookup with one I/O — is the same.
- DB2 LOAD with HASH — bulk-load-aware variants that build the hash file bottom-up, much like a B+Tree bulk load.
- Linear hashing — Litwin's split-one-bucket-at-a-time scheme is the principal alternative. PostgreSQL's hash index has always been a linear-hashing variant; the PG 10 rework kept the on-disk layout and added WAL logging.
Where CUBRID sits
CUBRID's extendible hash is not a user-facing index type.
There is no CREATE INDEX ... USING HASH; users get B+Tree only.
Internally, the hash file backs a small set of name → OID
directories where the workload is dominated by point lookup and
range scans are not needed:
- Class-name → OID directory. When the SQL compiler resolves SELECT … FROM employee it must turn the identifier employee into the class OID that owns the heap file. The mapping is stored in boot_Db_parm->classname_table, a permanent extendible hash file built at database boot (xehash_create in boot_sr.c:5001).
- Catalog representation directory (catalog_Id.xhid). xehash_create (… DB_TYPE_OBJECT, …) in system_catalog.c:2635 builds a hash file keyed on class OIDs, returning representation IDs used to retrieve the schema record from the catalog heap. Each DROP CLASS emits an ehash_delete against this directory (system_catalog.c:2230).
- UPDATE/DELETE OID dedup. When an UPDATE or DELETE statement may visit a row through more than one access path (e.g., the union of two indexes), the executor needs to suppress double-application. qexec_init_upddel_ehash_files (query_executor.c:9822) creates one temporary hash file per target class, keyed on the row's OID. Each candidate row is ehash_search-ed first; if absent, ehash_insert records it and the update proceeds; if present, the row is skipped. Files are destroyed at statement end via qexec_destroy_upddel_ehash_files.
- Hash list / hash set scans (src/query/query_hash_scan.{c,h}) reuse EHID and the directory file format for the in-memory hash side of the HASH-LIST-SCAN / HASH-SET-SCAN operators. This path uses file_create_ehash_dir directly (the dir-file creator) rather than going through xehash_create, because only the directory representation is reused, not the bucket manager.
The pattern across all four use cases is the same: point lookup of a small key against a small-to-medium set of entries, with no requirement for range scans. That is exactly the workload extendible hashing wins on against B+Trees, and exactly the constraint that justified building the module.
Theory ↔ CUBRID mapping
| Concept | CUBRID name |
|---|---|
| Hash file identifier | EHID = { VFID vfid; INT32 pageid } (storage_common.h:212) |
| Directory header | EHASH_DIR_HEADER (extendible_hash.c:103) |
| Directory entry | EHASH_DIR_RECORD = { VPID bucket_vpid } (extendible_hash.c:120) |
| Bucket header | EHASH_BUCKET_HEADER = { char local_depth } (extendible_hash.c:127) |
| Pseudo-key | EHASH_HASH_KEY = unsigned int (extendible_hash.c:99) |
| Search return | EH_SEARCH { EH_KEY_FOUND, EH_KEY_NOTFOUND, EH_ERROR_OCCURRED } (storage_common.h:381) |
| Internal split/merge result | EHASH_RESULT enum (extendible_hash.c:132) |
| Top-N-bits extraction | GETBITS(value, pos, n) macro (extendible_hash.c) |
| First-N-bit dir lookup | FIND_OFFSET(hash_key, depth) macro |
| WAL record family | RVEH_REPLACE, RVEH_INSERT, RVEH_DELETE, RVEH_INIT_BUCKET, RVEH_CONNECT_BUCKET, RVEH_INC_COUNTER, RVEH_INIT_DIR, RVEH_INIT_NEW_DIR_PAGE (recovery.h:105-112) |
CUBRID’s Approach
The module fits in two files plus a handful of recovery hooks
and is best understood layer by layer: the on-disk layout
(EHID, directory file, bucket pages), the read path
(ehash_search), the write path (ehash_insert and the
split / directory-doubling logic), the delete path
(ehash_delete and the merge / directory-shrink logic), and the
WAL integration that makes all of the above crash-safe.
Overall picture
```mermaid
flowchart TB
subgraph EHID["EHID = { vfid, pageid }"]
DIRROOT["dir root page<br/>EHASH_DIR_HEADER<br/>· first directory chunk"]
DIRPAGES["directory pages 1..N<br/>(EHASH_DIR_RECORD[])"]
DIRROOT --> DIRPAGES
end
subgraph BUCKETS["bucket file (separate VFID)"]
BUC0["bucket 0<br/>local_depth=2"]
BUC1["bucket 1<br/>local_depth=2"]
BUC2["bucket 2<br/>local_depth=1"]
end
DIRROOT -. dir_header.bucket_file .-> BUCKETS
DIRPAGES -- "VPID pointers" --> BUC0
DIRPAGES -- "VPID pointers" --> BUC1
DIRPAGES -- "VPID pointers x2 (shared)" --> BUC2
subgraph LOOKUP["Lookup"]
L1["ehash_hash(key) → 32-bit pseudo_key"]
L2["FIND_OFFSET(pseudo_key, dir.depth) → entry idx"]
L3["ehash_dir_locate → (dir page no, offset in page)"]
L4["ehash_locate_slot (binary search inside bucket)"]
L1 --> L2 --> L3 --> L4
end
```
The figure encodes three boundaries. Directory vs. bucket:
the directory is one file (rooted at EHID.pageid); the
bucket pool is a different file (the bucket VFID lives in
the directory header). Shared vs. unique: a directory entry
with d_b < d shares its bucket page with 2^(d - d_b) - 1
peers; an entry with d_b == d owns its bucket alone.
Read vs. write: lookup is a pure walk of the addressing
chain; insert and delete may modify the directory, the
sibling, and the LSN-protected on-disk state under a system
operation.
EHID and on-disk layout
```c
// EHID — src/storage/storage_common.h:212
typedef struct ehid EHID;
struct ehid
{
  VFID vfid;     /* The directory file (volid + fileid) */
  INT32 pageid;  /* The first (root) page of the directory */
};
```

An EHID is 12 bytes (4-byte volid, 4-byte fileid, 4-byte
pageid; see OR_EHID_SIZE in object_representation_constants.h).
It’s the on-disk handle for an extendible hash file — written
into catalog records (or_pack_ehid in
object_representation.c:1754) and into the database boot
parameter page that holds boot_Db_parm->classname_table.
The first page of the directory file is pointed at by
EHID.pageid and starts with the directory header:
```c
// EHASH_DIR_HEADER — src/storage/extendible_hash.c:103
typedef struct ehash_dir_header EHASH_DIR_HEADER;
struct ehash_dir_header
{
  VFID bucket_file;     /* bucket file identifier */
  VFID overflow_file;   /* overflow (long-string) file identifier */
  int local_depth_count[EHASH_HASH_KEY_BITS + 1];  /* hist[i] = number of buckets with local_depth = i;
                                                      used to detect when global depth can shrink */
  DB_TYPE key_type;     /* DB_TYPE_STRING or DB_TYPE_OBJECT */
  short depth;          /* GLOBAL DEPTH d */
  char alignment;       /* slot alignment for bucket pages */
};
```

After the header, the same first page carries directory
entries up to EHASH_NUM_FIRST_PAGES. Subsequent directory
pages carry EHASH_NUM_NON_FIRST_PAGES entries each (no
header). Entry layout is trivial:
```c
// EHASH_DIR_RECORD — src/storage/extendible_hash.c:120
typedef struct ehash_dir_record EHASH_DIR_RECORD;
struct ehash_dir_record
{
  VPID bucket_vpid;  /* bucket pointer */
};
```

The helper ehash_dir_locate (extendible_hash.c:384) maps a
directory entry index to (page_no, in-page offset):
```c
// ehash_dir_locate — src/storage/extendible_hash.c:384
static void
ehash_dir_locate (int *out_page_no_p, int *out_offset_p)
{
  int page_no, offset = *out_offset_p;

  if (offset < EHASH_NUM_FIRST_PAGES)
    {
      offset = offset * sizeof (EHASH_DIR_RECORD) + EHASH_DIR_HEADER_SIZE;
      page_no = 0;
    }
  else
    {
      offset -= EHASH_NUM_FIRST_PAGES;
      page_no = offset / EHASH_NUM_NON_FIRST_PAGES + 1;
      offset = (offset % EHASH_NUM_NON_FIRST_PAGES) * sizeof (EHASH_DIR_RECORD);
    }

  *out_page_no_p = page_no;
  *out_offset_p = offset;
}
```

The directory thus appears as a single contiguous array of
2^depth EHASH_DIR_RECORDs that happens to span multiple
pages. The first page is special only in that its layout
starts after EHASH_DIR_HEADER_SIZE.
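To get a concrete feel for the mapping, the sketch below replays the same arithmetic with made-up constants. The constant names and values here are placeholders, not the real EHASH_NUM_FIRST_PAGES / EHASH_NUM_NON_FIRST_PAGES values (those are derived from the page and header sizes at compile time).

```c
/* Worked example of the directory-addressing math; all constants are illustrative. */
#include <stdio.h>

#define DIR_RECORD_SIZE        8      /* sizeof (EHASH_DIR_RECORD): one VPID           */
#define DIR_HEADER_SIZE        80     /* stand-in for EHASH_DIR_HEADER_SIZE            */
#define NUM_FIRST_PAGE_ENTRIES 2000   /* stand-in for EHASH_NUM_FIRST_PAGES            */
#define NUM_OTHER_PAGE_ENTRIES 2040   /* stand-in for EHASH_NUM_NON_FIRST_PAGES        */

static void
dir_locate (int entry_idx, int *page_no, int *offset)
{
  if (entry_idx < NUM_FIRST_PAGE_ENTRIES)
    {
      *page_no = 0;                                 /* root page: skip the header */
      *offset = entry_idx * DIR_RECORD_SIZE + DIR_HEADER_SIZE;
    }
  else
    {
      entry_idx -= NUM_FIRST_PAGE_ENTRIES;          /* later pages are densely packed */
      *page_no = entry_idx / NUM_OTHER_PAGE_ENTRIES + 1;
      *offset = (entry_idx % NUM_OTHER_PAGE_ENTRIES) * DIR_RECORD_SIZE;
    }
}

int
main (void)
{
  int page, off;
  dir_locate (1500, &page, &off);   /* lands in the root page */
  printf ("entry 1500 -> page %d offset %d\n", page, off);
  dir_locate (5000, &page, &off);   /* lands in a later directory page */
  printf ("entry 5000 -> page %d offset %d\n", page, off);
  return 0;
}
```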
Buckets live in the bucket file (a separate VFID,
dir_header.bucket_file). Each bucket page is a slotted page
(PAGE_EHASH ptype) whose slot 0 holds a one-byte
EHASH_BUCKET_HEADER with the local depth, and whose
remaining slots each hold one (OID || key_bytes) record.
The bucket page is initialized through
ehash_initialize_bucket_new_page (extendible_hash.c:628),
called by file_alloc as the page-init callback.
```c
// EHASH_BUCKET_HEADER — src/storage/extendible_hash.c:127
typedef struct ehash_bucket_header EHASH_BUCKET_HEADER;
struct ehash_bucket_header
{
  char local_depth;  /* The local depth d_b of this bucket */
};
```

The bucket page uses UNANCHORED_KEEP_SEQUENCE slot ordering
(the spage_initialize argument): slot numbers are not stable,
and may be renumbered by inserts and deletes. That choice is
deliberate: bucket records are kept sorted by user key, so the
bucket can be searched by binary search over slot numbers, and
maintaining that order requires the slotted-page layer to be
willing to renumber slots on insert.
EHID creation — xehash_create
The public creator is xehash_create
(extendible_hash.c:866); it forwards to
ehash_create_helper (extendible_hash.c:954):
```c
// xehash_create — src/storage/extendible_hash.c:866
EHID *
xehash_create (THREAD_ENTRY *thread_p, EHID *ehid_p, DB_TYPE key_type, int exp_num_entries,
               OID *class_oid_p, int attr_id, bool is_tmp)
{
  return ehash_create_helper (thread_p, ehid_p, key_type, exp_num_entries, class_oid_p, attr_id, is_tmp);
}
```

Helper steps (condensed from extendible_hash.c:954-1178):
- Validate key_type. CUBRID's runtime only enables DB_TYPE_STRING and DB_TYPE_OBJECT; the rest are guarded under ENABLE_UNUSED_FUNCTION. The assert key_type == DB_TYPE_STRING || key_type == DB_TYPE_OBJECT sits at the top of the helper.
- Estimate the bucket-page count from exp_num_entries. A string-key budget assumes 20 bytes/key + OID + slot overhead (4) per entry; an OID-key budget uses the actual key size. exp_num_entries < 0 collapses to a one-page estimate.
- Compute the bucket-page slot alignment as max(sizeof(int), sizeof(OID.pageid), key_size), capped at sizeof(int).
- Create the bucket file with file_create_ehash (registers it as a FILE_EXTENDIBLE_HASH file with the FILE_EHASH_DES descriptor {class_oid, attr_id}). Allocate bucket page 0 with file_alloc, passing ehash_initialize_bucket_new_page as the init callback — that callback writes the bucket header (local_depth = 0) into slot 0 and emits an RVEH_INIT_BUCKET log record.
- Create the directory file with file_create_ehash_dir (a sibling helper in file_manager.c:3285 that registers the file as FILE_EXTENDIBLE_HASH_DIR). Allocate page 0 with the generic page-type init.
- Lay down the directory header in the dir root page: global depth 0, key type, alignment, bucket file VFID, null overflow file VFID, local_depth_count[0] = 1 (one bucket exists, with depth 0).
- Write the single directory entry — a pointer to bucket page 0.
- Emit RVEH_INIT_DIR (a redo-only record, because the file itself will be reaped by recovery if the create transaction aborts).
- Return the populated EHID to the caller.
The starting state is therefore: depth = 0, one directory
entry, one bucket with local_depth = 0. Every key hashes to
that one bucket until it overflows.
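For orientation, a hypothetical call might look like the fragment below. It follows the xehash_create signature quoted above; the variable names, the expected-entry count, and the surrounding server context (thread_p, a resolved class OID) are assumptions rather than copied from a real caller (the actual permanent creators are boot_sr.c and system_catalog.c).

```c
/* Hypothetical caller sketch; assumes CUBRID server headers and a THREAD_ENTRY. */
EHID name_dir;

VFID_SET_NULL (&name_dir.vfid);
name_dir.pageid = NULL_PAGEID;

if (xehash_create (thread_p, &name_dir, DB_TYPE_STRING,
                   1024 /* expected entries */, class_oid_p,
                   -1 /* attr_id: not bound to an attribute */,
                   false /* permanent, WAL-logged */) == NULL)
  {
    /* creation failed; the error has already been set by the callee */
  }
/* starting state: depth 0, one directory entry, one bucket with local_depth 0 */
```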
Pseudo-key — ehash_hash
ehash_hash (extendible_hash.c:4347) dispatches by key
type:
```c
// ehash_hash — src/storage/extendible_hash.c:4347 (condensed)
static EHASH_HASH_KEY
ehash_hash (void *original_key_p, DB_TYPE key_type)
{
  switch (key_type)
    {
    case DB_TYPE_STRING:
      return ehash_hash_string_type (key, original_key_p);
    case DB_TYPE_OBJECT:
      return ehash_hash_eight_bytes_type (key);   /* OID is 8 useful bytes */
    /* … other types only under ENABLE_UNUSED_FUNCTION … */
    default:
      er_set (ER_FATAL_ERROR_SEVERITY, ARG_FILE_LINE, ER_EH_CORRUPTED, 0);
    }
}
```

The string variant (ehash_hash_string_type,
extendible_hash.c:4156) trims trailing spaces, applies a
classic folding method (sum 4-byte windows of the key into
one 32-bit accumulator), then runs three independent string
hashes from memory_hash — mht_1strhash, mht_2strhash,
mht_3strhash — on the four bytes of the folded value, packing
their results into the high three bytes of the final
pseudo-key, and putting the byte sum in the low byte. The
intent is to spread bits across the 32-bit pseudo-key space
even when the input strings share long common prefixes (which
is common for class names: customer, customer_address,
customer_phone).
The eight-bytes variant (used for OIDs) folds two ints with
htonl and runs a final byte-sum pass to scatter the result.
For OIDs, since (volid, pageid, slotid) is already nearly
unique, the goal is “do not concentrate in one quadrant of the
key space” rather than full avalanche.
The output is unsigned int (EHASH_HASH_KEY,
extendible_hash.c:99), 32 bits — so global depth d cannot
exceed 32. The macro GETBITS extracts a left-adjusted bit
field, FIND_OFFSET(hash_key, depth) extracts the leftmost
depth bits.
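The exact macro text lives in extendible_hash.c; what follows is a plausible stand-alone reconstruction of the same extraction, written as functions so the arithmetic can be checked by hand. The names get_bits / find_offset and the 1-based-from-MSB convention are assumptions made for illustration.

```c
/* Plausible shape of the leftmost-bit extraction; not the actual GETBITS / FIND_OFFSET text. */
#include <stdio.h>

#define HASH_KEY_BITS 32u

/* pos is 1-based from the most significant bit; assumes pos + n - 1 <= 32 and n < 32 */
static unsigned int
get_bits (unsigned int value, unsigned int pos, unsigned int n)
{
  return (value >> (HASH_KEY_BITS - (pos + n - 1))) & ((1u << n) - 1u);
}

/* directory index = leftmost `depth` bits of the pseudo-key (depth stays well below 32) */
static unsigned int
find_offset (unsigned int pseudo_key, unsigned int depth)
{
  return depth == 0 ? 0 : get_bits (pseudo_key, 1, depth);
}

int
main (void)
{
  unsigned int pk = 0xB4000000u;                            /* 1011 0100 ...          */
  printf ("depth 2 -> entry %u\n", find_offset (pk, 2));    /* 0b10   = 2             */
  printf ("depth 4 -> entry %u\n", find_offset (pk, 4));    /* 0b1011 = 11            */
  return 0;
}
```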
Lookup — ehash_search
Lookup is short:
```c
// ehash_search — src/storage/extendible_hash.c:1379 (condensed)
EH_SEARCH
ehash_search (THREAD_ENTRY *thread_p, EHID *ehid_p, void *key_p, OID *value_p)
{
  if (ehid_p == NULL || key_p == NULL)
    return EH_KEY_NOTFOUND;

  dir_root_page_p = ehash_find_bucket_vpid_with_hash (thread_p, ehid_p, key_p,
                                                      PGBUF_LATCH_READ, PGBUF_LATCH_READ, /* root S, bucket S */
                                                      &bucket_vpid, NULL, NULL);
  if (bucket_vpid.pageid == NULL_PAGEID)
    goto end;                   /* directory hole */

  bucket_page_p = ehash_fix_old_page (thread_p, &ehid_p->vfid, &bucket_vpid, PGBUF_LATCH_READ);
  if (!ehash_locate_slot (thread_p, bucket_page_p, dir_header_p->key_type, key_p, &slot_id))
    {
      result = EH_KEY_NOTFOUND;
      goto end;
    }

  spage_get_record (thread_p, bucket_page_p, slot_id, &recdes, PEEK);
  ehash_read_oid_from_record (recdes.data, value_p);
  result = EH_KEY_FOUND;

end:
  if (bucket_page_p)
    pgbuf_unfix_and_init (thread_p, bucket_page_p);
  if (dir_root_page_p)
    pgbuf_unfix_and_init (thread_p, dir_root_page_p);
  return result;
}
```

ehash_find_bucket_vpid_with_hash
(extendible_hash.c:1327) is the addressing routine: fix the
dir root page with an S latch, hash the key, compute
location = FIND_OFFSET(pseudo_key, depth), then
ehash_find_bucket_vpid (extendible_hash.c:1295) figures
out whether the entry is in the root page or in some later
directory page, fixes the latter if necessary, and copies the
bucket_vpid out.
If the directory pointer is NULL_PAGEID, the bucket has been
deleted (an empty cell that hasn’t yet been merged away);
search returns EH_KEY_NOTFOUND immediately.
Otherwise the bucket page is fixed with a read latch, and
ehash_locate_slot (extendible_hash.c:2479) runs a
binary search inside the bucket. The bucket maintains
records in user-key order (the slotted page uses the
UNANCHORED_KEEP_SEQUENCE ordering noted above); the binary search uses
ehash_compare_key (extendible_hash.c:2223), which
dispatches by key type and returns -1 / 0 / +1.
The lookup therefore costs two page fixes (dir root +
bucket — almost always both buffer-pool hits for a hot table)
and one in-page binary search of O(log m) where m is
the bucket’s record count.
Insert — ehash_insert
The public entry point delegates to ehash_insert_helper
with an initial shared latch on the directory root, on
the optimistic assumption that the insert will not require a
directory expansion:
```c
// ehash_insert — src/storage/extendible_hash.c:1461
void *
ehash_insert (THREAD_ENTRY *thread_p, EHID *ehid_p, void *key_p, OID *value_p)
{
  if (ehid_p == NULL || key_p == NULL)
    return NULL;

  return ehash_insert_helper (thread_p, ehid_p, key_p, value_p, S_LOCK, NULL);
}
```

ehash_insert_helper (extendible_hash.c:1715) is the
orchestrator. Its lock-promotion pattern is the analogue of
B+Tree’s “restart from root with X” — the helper releases the
shared root latch and recursively re-enters with X_LOCK
whenever it discovers it must mutate the directory.
```mermaid
flowchart TB
start["ehash_insert_helper(key, value, lock=S)"]
start --> resolve["ehash_find_bucket_vpid_with_hash<br/>S/X latch dir, S latch bucket-vpid"]
resolve --> case_null{"bucket_vpid<br/>== NULL?"}
case_null -- "yes" --> retry1{"lock == S?"}
retry1 -- "yes" --> restart1["release dir,<br/>recurse with X_LOCK"]
retry1 -- "no" --> create["ehash_insert_to_bucket_after_create<br/>(allocate new bucket, connect, log sysop)"]
case_null -- "no" --> attempt["ehash_insert_bucket_after_extend_if_need"]
attempt --> simple["ehash_insert_to_bucket<br/>(write or replace)"]
simple --> ok{"result"}
ok -- "OK" --> done["pgbuf_unfix dir; return key"]
ok -- "BUCKET_FULL" --> retry2{"lock == S?"}
retry2 -- "yes" --> restart2["release dir,<br/>recurse with X_LOCK"]
retry2 -- "no" --> split["ehash_extend_bucket<br/>(split + maybe expand directory)"]
split --> reinsert["ehash_insert_to_bucket on chosen side"]
reinsert --> done
```
Inserting into a bucket — ehash_insert_to_bucket
The straight-line case (room available, key not present)
runs ehash_insert_to_bucket (extendible_hash.c:1837):
- ehash_locate_slot to find where the key would sit.
- If found (the key already exists) — replace: copy the old record, write the new OID into its OID slot in place, emit RVEH_DELETE undo (which restores the old OID on abort) and RVEH_REPLACE redo (which re-applies the new OID). This means ehash_insert has upsert semantics — the second insert of the same key replaces the value.
- Otherwise — ehash_compose_record builds a record of the form (OID || key_bytes) (small variant) or (OID || prefix_vpid || prefix) (REC_BIGONE, the long-string overflow variant — guarded under ENABLE_UNUSED_FUNCTION and not active in current builds), then spage_insert_at puts it at the located slot.
- If spage_insert_at returns SP_DOESNT_FIT, the bucket is full — return EHASH_BUCKET_FULL to the caller (which is the signal that triggers the split path).
- Emit RVEH_INSERT undo and redo records. The undo record carries the EHID plus the inserted record (the undo handler ehash_rv_insert_undo reads the EHID, then calls ehash_rv_delete to remove the entry).
- pgbuf_set_dirty (bucket_page_p, DONT_FREE).
The shape of the WAL records — undo carries enough to identify the EHID + the record-to-remove, redo carries slot-id + the record-to-insert — is logical undo, physical redo, the classical ARIES discipline for variable-length record stores.
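At the call-site level the upsert rule reads like the fragment below. It is illustrative only: it assumes a server THREAD_ENTRY, an EHID created earlier with a string key type, and arbitrary OID values; the ehash_insert / ehash_search signatures are the ones quoted in this document.

```c
/* Illustrative fragment; assumes CUBRID server headers (OID, OID_EQ, EH_KEY_FOUND). */
OID first = { 100 /* pageid */, 1 /* slotid */, 0 /* volid */ };
OID second = { 200, 2, 0 };
OID found;

ehash_insert (thread_p, &ehid, "employee", &first);    /* plain insert           */
ehash_insert (thread_p, &ehid, "employee", &second);   /* same key: replaces it  */

if (ehash_search (thread_p, &ehid, "employee", &found) == EH_KEY_FOUND)
  {
    assert (OID_EQ (&found, &second));                 /* the later value wins   */
  }
```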
Splitting a bucket — ehash_split_bucket and the depth dance
When ehash_insert_to_bucket returns EHASH_BUCKET_FULL and
the caller holds an X latch on the directory,
ehash_extend_bucket (extendible_hash.c:1546) is the
orchestrator. It runs under a system operation
(log_sysop_start / log_sysop_commit) so a crash in the
middle does not leave the directory inconsistent:
```mermaid
sequenceDiagram
participant H as ehash_insert_helper
participant E as ehash_extend_bucket
participant S as ehash_split_bucket
participant D as ehash_expand_directory
participant C as ehash_connect_bucket
participant L as log_manager (sysop)
H->>E: bucket full, X latch held
E->>L: log_sysop_start
E->>S: split_bucket(buc_pgptr, key)
S->>S: log RVEH_REPLACE undo (full bucket image)
S->>S: ehash_find_first_bit_position (compute new local depth)
S->>S: file_alloc sibling page (init via ehash_initialize_bucket_new_page)
S->>S: ehash_distribute_records_into_two_bucket
S->>S: log RVEH_REPLACE redo for bucket and sibling
S-->>E: sibling page ptr, old/new local_depth, new VPID
E->>E: ehash_adjust_local_depth (-1 at old, +2 at new)
alt new_local_depth > global depth
E->>D: ehash_expand_directory (double the directory)
end
alt new_local_depth - old_local_depth > 1
E->>C: connect old slot range to NULL_VPID
E->>C: connect lower half to original bucket
end
E->>C: connect upper half to sibling
E->>L: log_sysop_commit
E-->>H: sibling_page_p, new_bit
```
The interesting steps:
- ehash_find_first_bit_position (extendible_hash.c:2545) walks the bucket's records (computing each record's pseudo-key with ehash_get_pseudo_key), XORs them against the first record's pseudo-key, and locates the first bit position at which any record differs. That position becomes the new local depth — i.e. CUBRID does not blindly do local_depth + 1; it picks the smallest local depth that actually separates the records. This avoids creating two buckets where one is empty (which would force an immediate re-split). A toy rendering of this computation follows the list.
- ehash_distribute_records_into_two_bucket (extendible_hash.c:2605) iterates records: those whose local_depth-th pseudo-key bit is 1 move to the sibling; the rest stay. The order in the source bucket is preserved in both halves, which keeps both buckets sorted by user key (the pseudo-key is independent of the user-key ordering, and binary search inside the bucket relies on user-key order, not pseudo-key bits).
- ehash_expand_directory (extendible_hash.c:2792) doubles the directory by walking it from the last entry backward, copying each old entry 2^(new_depth - old_depth) times into the expanded space. New directory pages, if needed, are allocated with file_alloc_multiple and initialized via ehash_initialize_dir_new_page, which emits RVEH_INIT_NEW_DIR_PAGE. Each modified existing dir page is logged with paired RVEH_REPLACE undo+redo of the full page image.
- ehash_connect_bucket (extendible_hash.c:3067) rewrites a contiguous range of directory entries to point at a given VPID. The number of entries to rewrite is 2^(global_depth - local_depth); the range is computed by setting the low (global_depth - local_depth) bits of the hash-derived location alternately to all ones (set_bits) and zeros, then using ehash_dir_locate to translate the endpoints into page-and-offset. Each affected directory page emits one RVEH_CONNECT_BUCKET record carrying an EHASH_REPETITION = { VPID, count } payload. The matching redo function ehash_rv_connect_bucket_redo (extendible_hash.c:5429) just rewrites count slots. This keeps the WAL volume O(1) per affected dir page rather than O(count * sizeof(VPID)).
- ehash_adjust_local_depth (extendible_hash.c:3633) updates dir_header.local_depth_count[depth] += delta on the dir root page and emits an RVEH_INC_COUNTER undo+redo log record (ehash_rv_increment is the recovery handler). This histogram is what drives the directory-shrink decision later: when local_depth_count[depth] falls to 0 for the current dir_header.depth, the directory has fewer meaningful bits than its width, and shrinking is profitable.
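A toy rendering of the first-differing-bit idea. This is not the CUBRID function itself: the array-of-pseudo-keys setting, the function name, and the standalone form are invented, but the computation (XOR against the first pseudo-key, take the highest differing bit position) matches the description above.

```c
/* Illustrative only; mirrors the "smallest separating depth" idea. */
#include <stdio.h>

static int
first_differing_bit (const unsigned int *pseudo_keys, int n)
{
  unsigned int diff = 0;
  for (int i = 1; i < n; i++)
    diff |= pseudo_keys[0] ^ pseudo_keys[i];
  if (diff == 0)
    return -1;                      /* all pseudo-keys identical: no depth separates them */

  int pos = 1;                      /* 1-based from the most significant bit */
  while (!(diff & 0x80000000u))
    {
      diff <<= 1;
      pos++;
    }
  return pos;                       /* candidate new local depth */
}

int
main (void)
{
  unsigned int keys[] = { 0xA1000000u, 0xA3000000u, 0xA2000000u };
  /* the keys agree on their leftmost 6 bits (101000) and first differ at bit 7 */
  printf ("new local depth = %d\n", first_differing_bit (keys, 3));   /* prints 7 */
  return 0;
}
```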
Bucket creation when the directory entry is NULL
A delete can leave a directory entry set to NULL_VPID (the
bucket was emptied, deallocated, and the directory wasn’t
shrunk). On the next insert that hashes there, the helper
calls ehash_insert_to_bucket_after_create
(extendible_hash.c:1472):
- ehash_find_depth (extendible_hash.c:3183) walks the neighborhood of the NULL slot to find the smallest local depth that distinguishes this entry's range from the buckets sharing the high-order bits with it. The resulting depth feeds the new bucket's header.
- file_alloc allocates the bucket page, initialized via ehash_initialize_bucket_new_page with the chosen depth.
- ehash_connect_bucket rewrites the 2^(d - d_b) directory entries in the run to point at the new bucket.
- ehash_adjust_local_depth (+1) updates the histogram.
- ehash_insert_to_bucket writes the user record.
The whole sequence is bracketed in log_sysop_start /
log_sysop_commit.
Delete — ehash_delete
ehash_delete (extendible_hash.c:3418) is the inverse of
ehash_insert. It descends with S on the dir root and X
on the bucket — there is no second-attempt promotion path
because deletion does not need to mutate the directory in
the common case:
```c
// ehash_delete — src/storage/extendible_hash.c:3418 (condensed)
void *
ehash_delete (THREAD_ENTRY *thread_p, EHID *ehid_p, void *key_p)
{
  dir_root_page_p = ehash_find_bucket_vpid_with_hash (thread_p, ehid_p, key_p,
                                                      PGBUF_LATCH_READ, PGBUF_LATCH_WRITE, /* root S, bucket X */
                                                      &bucket_vpid, NULL, &location);
  if (bucket_vpid.pageid == NULL_PAGEID)
    { /* not found */ }

  bucket_page_p = ehash_fix_old_page (… PGBUF_LATCH_WRITE);
  if (!ehash_locate_slot (… &slot_no))
    { /* ER_EH_UNKNOWN_KEY */ }

  spage_get_record (thread_p, bucket_page_p, slot_no, &bucket_recdes, PEEK);

  /* logical undo log: EHID || full record */
  log_append_undo_data2 (thread_p, RVEH_DELETE, &ehid_p->vfid, NULL, bucket_recdes.type,
                         log_recdes_undo.length, log_recdes_undo.data);
  /* physical redo log: slot-id with key-type prefix */
  log_append_redo_data2 (thread_p, RVEH_DELETE, &ehid_p->vfid, bucket_page_p, slot_no,
                         log_recdes_redo.length, log_recdes_redo.data);

  spage_delete (thread_p, bucket_page_p, slot_no);
  pgbuf_set_dirty (thread_p, bucket_page_p, DONT_FREE);

  /* Decide whether to merge */
  if (bucket became empty || bucket underflows below EHASH_UNDERFLOW_THRESHOLD)
    {
      if (ehash_check_merge_possible (… S_LOCK …) == EHASH_SUCCESSFUL_COMPLETION)
        do_merge = true;
    }

  pgbuf_unfix dir, bucket;
  if (do_merge)
    ehash_merge (thread_p, ehid_p, key_p, is_temp);

  return key_p;
}
```

The thresholds:
```c
// src/storage/extendible_hash.c:69-91
#define EHASH_OVERFLOW_RATE   0.9   /* sibling fullness ceiling for merge */
#define EHASH_UNDERFLOW_RATE  0.4   /* below this -> try to merge */
#define EHASH_OVERFLOW_THRESHOLD  (EHASH_OVERFLOW_RATE * DB_PAGESIZE)
#define EHASH_UNDERFLOW_THRESHOLD (EHASH_UNDERFLOW_RATE * DB_PAGESIZE)
```

The asymmetry between underflow (40%) and overflow (90%) prevents yo-yo split-merge oscillation: a delete triggers a merge only when the bucket has dropped below 40% full and the sibling, after absorbing the remaining records, would still sit below the 90% ceiling. In other words — merge only if the result won't immediately re-split.
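A minimal sketch of that decision. The page size, function name, and free-standing form are invented for illustration; in the source the check is spread across ehash_delete and ehash_check_merge_possible.

```c
/* Illustrative merge-decision check under the two thresholds described above. */
#include <stdbool.h>

#define PAGESIZE            16384                   /* stand-in for DB_PAGESIZE */
#define UNDERFLOW_THRESHOLD (0.4 * PAGESIZE)        /* cf. EHASH_UNDERFLOW_RATE */
#define OVERFLOW_THRESHOLD  (0.9 * PAGESIZE)        /* cf. EHASH_OVERFLOW_RATE  */

/* after a delete: should we try to fold this bucket into its sibling? */
static bool
should_try_merge (int bucket_used_bytes, int sibling_used_bytes)
{
  if (bucket_used_bytes >= UNDERFLOW_THRESHOLD)
    return false;                                   /* bucket is still healthy */

  /* merge only if the combined contents stay under the 90% ceiling,
     so the merged bucket does not immediately re-split */
  return (bucket_used_bytes + sibling_used_bytes) <= OVERFLOW_THRESHOLD;
}
```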
Merge — ehash_merge
ehash_merge (extendible_hash.c:3771) re-acquires the
directory under an X latch and re-validates that the bucket
still wants to merge (a concurrent insert might have refilled
it). The decision branches on
ehash_check_merge_possible (extendible_hash.c:3274):
- EHASH_NO_SIBLING_BUCKET — the bucket has local depth 0 (single-bucket file) or its sibling slot is null.
  - If the bucket is now empty and its local depth > 0, the bucket page is file_dealloc-ed, all directory entries pointing at it are reset to NULL_VPID, and local_depth_count[old_depth] is decremented. ehash_shrink_directory_if_need then checks whether the histogram allows global-depth reduction.
- EHASH_FULL_SIBLING_BUCKET — the sibling cannot absorb the remaining records under the 90% rule. Merge is abandoned; the bucket stays underfilled.
- EHASH_SUCCESSFUL_COMPLETION — actual merge: ehash_merge_permanent (extendible_hash.c:3655) moves the records to the sibling, updates the sibling's local depth via ehash_find_depth, deallocates the now-empty source bucket page, and rewrites the directory entries that pointed at the source to now point at the sibling. The histogram is updated by two ehash_adjust_local_depth calls (-2 at old depth, +1 at new depth), and ehash_shrink_directory_if_need may then halve the directory.
The whole merge sequence runs under log_sysop_start /
log_sysop_commit — the same system-op envelope that brackets
splits.
Directory shrink — ehash_shrink_directory_if_need / ehash_shrink_directory
The histogram-driven shrink trigger:
```c
// ehash_shrink_directory_if_need — src/storage/extendible_hash.c:3615
static void
ehash_shrink_directory_if_need (THREAD_ENTRY *thread_p, EHID *ehid_p,
                                EHASH_DIR_HEADER *dir_header_p, bool is_temp)
{
  int i = dir_header_p->depth;

  while (dir_header_p->local_depth_count[i] == 0)
    i--;

  if ((dir_header_p->depth - i) > 1)
    ehash_shrink_directory (thread_p, ehid_p, i + 1, is_temp);
}
```

Note the threshold is > 1, not >= 1: the directory shrinks
only when at least two depth levels are unused. This avoids
shrink-grow oscillation when the live max depth straddles a
boundary. ehash_shrink_directory (extendible_hash.c:3982)
implements the inverse of expansion: walk the new directory
left-to-right, copying every 2^times-th entry from the old
directory, log each modified page with paired RVEH_REPLACE
undo+redo, and file_dealloc the trailing dir pages.
Concurrency and latching
CUBRID's extendible hash uses page latches only — there are no separate hash-level locks. The discipline:
- Search — S-latch the directory root page; from it, optionally fix the secondary directory page with S; then S-latch the bucket. All three latches can be held briefly; the dir root and dir page are released as soon as the bucket VPID is extracted. The bucket S-latch is held through the binary search and the value extraction.
- Insert (fast path, no split) — S-latch the dir root, X-latch the bucket. The bucket X-latch covers locate ▸ insert ▸ log ▸ set-dirty.
- Insert (split or directory-extend) — restart with an X-latch on the dir root, X-latch on the bucket, X-latch on the sibling, X-latch on every directory page touched by the rewrite (ehash_connect_bucket / ehash_expand_directory). All under one system op.
- Delete (fast path, no merge) — S-latch the dir root, X-latch the bucket. Merge promotion releases both, then re-fixes with X / X.
- Merge / shrink — X on dir root, X on bucket, X on sibling, X on each affected dir page. System-op bracket.
The shared model is therefore fix-the-page, search, maybe-promote, log, release — much like B+Tree, but without latch coupling because there is no tree path to descend. The addressing chain is shallow (dir root page → dir body page → bucket page); a descent is at most three fixes.
The cost of the optimistic-S-latch-then-restart pattern in
ehash_insert is two descents when the bucket happens to
be full: one to discover the BUCKET_FULL condition under S,
then again under X. The benefit is that the common case —
insert into a non-full bucket — never holds an X latch on the
directory root, so concurrent searches of unrelated keys are
not blocked by inserts. With a workload dominated by
short-name-to-OID lookups (boot-time class resolution, query
compilation), this trade-off pays off.
WAL integration — RVEH_* records
CUBRID emits eight distinct recovery indices for extendible
hash (recovery.h:105-112):
| Index | Producer | Recovery handler | Purpose |
|---|---|---|---|
RVEH_REPLACE = 58 | ehash_split_bucket, ehash_expand_directory, ehash_shrink_directory, ehash_merge_permanent | log_rv_copy_char / ehash_rv_…_redo | Full-page-image undo/redo of dir or bucket page |
RVEH_INSERT = 59 | ehash_insert_to_bucket | ehash_rv_insert_redo / ehash_rv_insert_undo | Per-record bucket insertion |
RVEH_DELETE = 60 | ehash_delete, ehash_rv_delete | ehash_rv_delete_redo / ehash_rv_delete_undo | Per-record bucket deletion |
RVEH_INIT_BUCKET = 61 | ehash_initialize_bucket_new_page | ehash_rv_init_bucket_redo | Re-initialize a freshly allocated bucket page (slotted-page init + header) |
RVEH_CONNECT_BUCKET = 62 | ehash_connect_bucket | ehash_rv_connect_bucket_redo | Rewrite a contiguous run of directory entries to a single VPID |
RVEH_INC_COUNTER = 63 | ehash_adjust_local_depth | ehash_rv_increment | Adjust local_depth_count[i] by ±delta |
RVEH_INIT_DIR = 64 | ehash_create_helper | ehash_rv_init_dir_redo | Initialize the dir root page on file create |
RVEH_INIT_NEW_DIR_PAGE = 65 | ehash_initialize_dir_new_page | ehash_rv_init_dir_new_page_redo | Initialize a directory expansion page |
The dispatch table that wires these into the recovery
function array sits in transaction/recovery.c:415-452. The
record’s data buffer carries enough payload that the handler
can be re-run idempotently on the page that the LSN
identifies — for RVEH_INSERT that’s (short rec_type, OID, key_bytes); for RVEH_DELETE redo it’s the same plus the
slot id; for RVEH_DELETE undo it’s (EHID, OID, key_bytes)
because the undo handler must re-locate the bucket
(it cannot trust the LSN’s page identity — split or merge may
have moved the record).
The undo discipline is logical for inserts and deletes,
physical for splits and merges (page-image replace).
Logical-undo for inserts means
ehash_rv_insert_undo looks up the entry by EHID + key and
calls ehash_rv_delete to remove it; physical undo would have
to reverse the slotted-page state byte-for-byte and would
break if the bucket has been split since the original insert.
The split/merge path uses physical undo (full DB_PAGESIZE
image) because the page is already X-latched under a system
op and there’s no safe way to compute a “logical reverse” of
a redistribution; replaying the page image is correct and
simple. The cost — one full page image per modified page —
is bounded by the system-op envelope: a single split logs at
most three page-images (source bucket, sibling bucket, one
dir page), which is tolerable.
Use cases inside CUBRID — the four call sites
Verified by grep -rn "ehash_\|EHID\|xehash_" src/:
- Class-name → OID directory. EHID classname_table lives in the database boot parameter (boot_sr.c:122). It is created at first-time database format with xehash_create (… DB_TYPE_STRING, …, rootclass_oid, …, false) (boot_sr.c:5001) and persists for the database's lifetime. Lookups (ehash_search), inserts on CREATE TABLE, and deletes on DROP TABLE route through locator_* (locator manager) wrappers; the underlying calls are the standard public API.
- Catalog representation directory. Inside system_catalog.c, the catalog index catalog_id_p->xhid is created with xehash_create (… DB_TYPE_OBJECT, 1, oid_Root_class_oid, -1, false) (system_catalog.c:2635). It maps a class OID to its catalog representation entry. DROP CLASS removes the mapping with ehash_delete (thread_p, &catalog_Id.xhid, key) (system_catalog.c:2230).
- UPDATE/DELETE OID dedup. qexec_init_upddel_ehash_files (query_executor.c:9822) allocates xasl->upd_del_class_cnt temporary EHIDs at the start of each multi-class UPDATE/DELETE plan, with xehash_create (… DB_TYPE_OBJECT, -1, NULL, 0, true) — note is_tmp == true, so logging is suppressed. During execution qexec_upddel_add_unique_oid_to_ehid (query_executor.c:1015) calls ehash_search first; if the OID is found, the row is NULL-ed out as already processed, otherwise ehash_insert records it. The files are destroyed at statement end via qexec_destroy_upddel_ehash_files.
- Hash-list / hash-set scans. src/query/query_hash_scan.{c,h} defines FILE_HASH_SCAN structures that embed EHID ehid and reuse the directory file format (via file_create_ehash_dir) for the materialized hash side of the HASH-LIST-SCAN and HASH-SET-SCAN operators. This is the only call site that does not go through xehash_create — it builds only the directory portion and runs its own bucket logic on top of it.
There are no other callers in the engine. In particular,
user-defined indexes never use extendible hashing — only
B+Tree (btree.c).
Restrictions and current state
- Key types in production: only DB_TYPE_STRING and DB_TYPE_OBJECT. The implementation has code under ENABLE_UNUSED_FUNCTION for DB_TYPE_DOUBLE, DB_TYPE_FLOAT, DB_TYPE_INTEGER, DB_TYPE_BIGINT, DB_TYPE_SHORT, DB_TYPE_DATE, DB_TYPE_TIME, DB_TYPE_TIMESTAMP, DB_TYPE_DATETIME, DB_TYPE_MONETARY — but those branches are unreachable in current builds because the assert at the top of ehash_create_helper forbids any other key type.
- Long-string overflow: also ENABLE_UNUSED_FUNCTION-gated. The EHASH_DIR_HEADER::overflow_file field is initialized to VFID_NULL and never populated in current builds, even for class names — the assert (strlen ((char *) key_p) < 256) in ehash_insert_helper caps string keys below 256 bytes, and the long-string code paths are dead.
- xehash_destroy (extendible_hash.c:1259) only works on temporary hash files (it uses file_destroy (…, true)). Permanent files are deleted indirectly via file_destroy on the directory and bucket VFIDs at higher levels (e.g. the class-name table goes away only when the database itself is wiped).
- No MVCC integration. Extendible hash is pre-MVCC: there is no version chain, no MVCC visibility filter, no BTREE_OP_PURPOSE-style enum. Concurrent visibility is enforced at a higher layer (the call site serializes class-name modifications with a directory-level lock from lock_manager.c).
Source Walkthrough
Anchor on symbol names, not line numbers. Position hints are scoped to the document's
updated:date.
Header types and layout
- EHID (storage_common.h) — public 12-byte handle { VFID vfid; INT32 pageid }.
- EH_SEARCH (storage_common.h) — return type of ehash_search (EH_KEY_FOUND, EH_KEY_NOTFOUND, EH_ERROR_OCCURRED).
- EHASH_DIR_HEADER (extendible_hash.c) — directory root-page header.
- EHASH_DIR_RECORD (extendible_hash.c) — single-VPID directory entry.
- EHASH_BUCKET_HEADER (extendible_hash.c) — slot-0 one-byte bucket header carrying the local depth.
- EHASH_REPETITION (extendible_hash.c) — the (VPID, count) payload of RVEH_CONNECT_BUCKET.
- EHASH_HASH_KEY (extendible_hash.c) — the unsigned int pseudo-key type.
- EHASH_RESULT enum (extendible_hash.c) — internal return code (SUCCESSFUL_COMPLETION, BUCKET_FULL, BUCKET_UNDERFLOW, BUCKET_EMPTY, FULL_SIBLING_BUCKET, NO_SIBLING_BUCKET, ERROR_OCCURRED).
- GETBITS, FIND_OFFSET, GETBIT, SETBIT, CLEARBIT (extendible_hash.c) — bit-manipulation macros.
- EHASH_OVERFLOW_THRESHOLD, EHASH_UNDERFLOW_THRESHOLD, EHASH_DIR_HEADER_SIZE, EHASH_NUM_FIRST_PAGES, EHASH_NUM_NON_FIRST_PAGES, EHASH_HASH_KEY_BITS (extendible_hash.c) — page geometry constants.
- RVEH_* recovery indices (transaction/recovery.h) — the eight-record family.
Public API
- xehash_create (extendible_hash.c) — constructor for permanent or temporary hash files.
- xehash_destroy (extendible_hash.c) — destructor for temporary hash files only.
- ehash_search (extendible_hash.c) — point lookup.
- ehash_insert (extendible_hash.c) — upsert.
- ehash_delete (extendible_hash.c) — point delete.
- ehash_map (extendible_hash.c) — full-scan apply.
- ehash_dump, ehash_print_bucket, ehash_dump_bucket (extendible_hash.c) — debug dumpers.
Addressing
- ehash_dir_locate (extendible_hash.c) — directory index → (page_no, offset) translator.
- ehash_find_bucket_vpid (extendible_hash.c) — given a directory header and an entry index, return the bucket VPID.
- ehash_find_bucket_vpid_with_hash (extendible_hash.c) — hash-then-locate convenience wrapper used by all three CRUD operations.
- ehash_fix_old_page, ehash_fix_ehid_page, ehash_fix_nth_page (extendible_hash.c) — page-buffer fix wrappers with the PAGE_EHASH ptype check.
Hash function
- ehash_hash (extendible_hash.c) — type-dispatched entry.
- ehash_hash_string_type (extendible_hash.c) — folding + mht_*strhash triple pass.
- ehash_hash_eight_bytes_type (extendible_hash.c) — OID-aware folding.
- ehash_get_pseudo_key (extendible_hash.c) — re-hashes a record extracted from a bucket (used during split).
Bucket-page operations
- ehash_initialize_bucket_new_page (extendible_hash.c) — page-init callback for file_alloc.
- ehash_compose_record (extendible_hash.c) — pack (OID, key) into a RECDES.
- ehash_write_key_to_record (extendible_hash.c) — copy key bytes by type.
- ehash_compare_key (extendible_hash.c) — type-dispatched three-way compare.
- ehash_binary_search_bucket (extendible_hash.c) — log-N search inside a slotted bucket.
- ehash_locate_slot (extendible_hash.c) — wrapper that handles the empty-bucket edge case.
- ehash_get_key_size (extendible_hash.c) — key size for fixed-size key types.
Insert path
- ehash_insert_helper (extendible_hash.c) — top-level orchestrator with S→X promotion.
- ehash_insert_to_bucket (extendible_hash.c) — straight insert; emits RVEH_INSERT undo+redo.
- ehash_insert_to_bucket_after_create (extendible_hash.c) — bucket creation when the directory entry was NULL.
- ehash_insert_bucket_after_extend_if_need (extendible_hash.c) — bucket-full retry that calls split.
- ehash_extend_bucket (extendible_hash.c) — split + optional dir-doubling orchestrator under a sysop.
- ehash_split_bucket (extendible_hash.c) — physical bucket split (allocate sibling, distribute records).
- ehash_find_first_bit_position (extendible_hash.c) — smallest local-depth bump that actually separates records.
- ehash_distribute_records_into_two_bucket (extendible_hash.c) — slot-by-slot move of records to the sibling.
- ehash_expand_directory (extendible_hash.c) — directory doubling, including new-page allocation.
- ehash_initialize_dir_new_page (extendible_hash.c) — page-init callback for new directory pages.
Delete path
- ehash_delete (extendible_hash.c) — public delete.
- ehash_check_merge_possible (extendible_hash.c) — merge feasibility check (sibling exists, has the same depth, has room).
- ehash_merge (extendible_hash.c) — re-validates and performs the merge.
- ehash_merge_permanent (extendible_hash.c) — physical record migration to the sibling.
- ehash_find_depth (extendible_hash.c) — derive the new local depth from the directory neighborhood after a merge.
- ehash_shrink_directory_if_need (extendible_hash.c) — histogram-driven shrink trigger (> 1 slack).
- ehash_shrink_directory (extendible_hash.c) — directory halving (or 4×, 8×, …) with page deallocation.
Directory pointer maintenance
- ehash_connect_bucket (extendible_hash.c) — rewrite a range of directory entries to one VPID; emits RVEH_CONNECT_BUCKET.
- ehash_adjust_local_depth (extendible_hash.c) — apply a delta to local_depth_count[depth]; emits RVEH_INC_COUNTER.
Recovery handlers
- ehash_rv_init_bucket_redo (extendible_hash.c) — redo of RVEH_INIT_BUCKET.
- ehash_rv_init_dir_redo (extendible_hash.c) — redo of RVEH_INIT_DIR.
- ehash_rv_init_dir_new_page_redo (extendible_hash.c) — redo of RVEH_INIT_NEW_DIR_PAGE.
- ehash_rv_insert_redo / ehash_rv_insert_undo (extendible_hash.c) — RVEH_INSERT (physical redo, logical undo).
- ehash_rv_delete_redo / ehash_rv_delete_undo (extendible_hash.c) — RVEH_DELETE (physical redo, logical undo).
- ehash_rv_delete (extendible_hash.c) — internal logical delete used by _rv_insert_undo; does not check merge conditions and does not log undo (it generates a compensating, redo-only CLR).
- ehash_rv_increment (extendible_hash.c) — RVEH_INC_COUNTER.
- ehash_rv_connect_bucket_redo (extendible_hash.c) — RVEH_CONNECT_BUCKET.
- RV_fun[] registrations in transaction/recovery.c:415-452.
Serialization
- or_pack_ehid / or_unpack_ehid (object_representation.c) — 12-byte EHID on-disk form.
- OR_GET_EHID / OR_PUT_EHID (object_representation.h) — primitive byte-level codecs.
- ehash_read_oid_from_record / ehash_write_oid_to_record (extendible_hash.c) — OID byte order in bucket records.
- ehash_read_ehid_from_record / ehash_write_ehid_to_record (extendible_hash.c) — EHID payload codec for log records.
Position hints as of 2026-04-30
| Symbol | File | Line |
|---|---|---|
EHID (struct) | storage_common.h | 212 |
EH_SEARCH (enum) | storage_common.h | 381 |
OR_EHID_SIZE | object_representation_constants.h | 87 |
EHASH_HASH_KEY (typedef) | extendible_hash.c | 99 |
EHASH_DIR_HEADER (struct) | extendible_hash.c | 103 |
EHASH_DIR_RECORD (struct) | extendible_hash.c | 120 |
EHASH_BUCKET_HEADER (struct) | extendible_hash.c | 127 |
EHASH_RESULT (enum) | extendible_hash.c | 132 |
EHASH_REPETITION (struct) | extendible_hash.c | 144 |
GETBITS / FIND_OFFSET (macros) | extendible_hash.c | 193 |
ehash_dir_locate | extendible_hash.c | 384 |
ehash_initialize_bucket_new_page | extendible_hash.c | 628 |
ehash_initialize_dir_new_page | extendible_hash.c | 712 |
ehash_rv_init_dir_new_page_redo | extendible_hash.c | 735 |
xehash_create | extendible_hash.c | 866 |
ehash_get_key_size | extendible_hash.c | 879 |
ehash_create_helper | extendible_hash.c | 954 |
ehash_fix_old_page | extendible_hash.c | 1188 |
ehash_fix_ehid_page | extendible_hash.c | 1218 |
ehash_fix_nth_page | extendible_hash.c | 1236 |
xehash_destroy | extendible_hash.c | 1259 |
ehash_find_bucket_vpid | extendible_hash.c | 1295 |
ehash_find_bucket_vpid_with_hash | extendible_hash.c | 1327 |
ehash_search | extendible_hash.c | 1379 |
ehash_insert | extendible_hash.c | 1461 |
ehash_insert_to_bucket_after_create | extendible_hash.c | 1472 |
ehash_extend_bucket | extendible_hash.c | 1546 |
ehash_insert_bucket_after_extend_if_need | extendible_hash.c | 1641 |
ehash_insert_helper | extendible_hash.c | 1715 |
ehash_insert_to_bucket | extendible_hash.c | 1837 |
ehash_compose_record | extendible_hash.c | 2172 |
ehash_compare_key | extendible_hash.c | 2223 |
ehash_binary_search_bucket | extendible_hash.c | 2401 |
ehash_locate_slot | extendible_hash.c | 2479 |
ehash_get_pseudo_key | extendible_hash.c | 2509 |
ehash_find_first_bit_position | extendible_hash.c | 2545 |
ehash_distribute_records_into_two_bucket | extendible_hash.c | 2605 |
ehash_split_bucket | extendible_hash.c | 2684 |
ehash_expand_directory | extendible_hash.c | 2792 |
ehash_connect_bucket | extendible_hash.c | 3067 |
ehash_find_depth | extendible_hash.c | 3183 |
ehash_check_merge_possible | extendible_hash.c | 3274 |
ehash_delete | extendible_hash.c | 3418 |
ehash_shrink_directory_if_need | extendible_hash.c | 3615 |
ehash_adjust_local_depth | extendible_hash.c | 3633 |
ehash_merge_permanent | extendible_hash.c | 3655 |
ehash_merge | extendible_hash.c | 3771 |
ehash_shrink_directory | extendible_hash.c | 3982 |
ehash_hash_string_type | extendible_hash.c | 4156 |
ehash_hash_eight_bytes_type | extendible_hash.c | 4244 |
ehash_hash | extendible_hash.c | 4347 |
ehash_apply_each | extendible_hash.c | 4395 |
ehash_map | extendible_hash.c | 4518 |
ehash_dump | extendible_hash.c | 4598 |
ehash_rv_init_bucket_redo | extendible_hash.c | 5023 |
ehash_rv_init_dir_redo | extendible_hash.c | 5072 |
ehash_rv_insert_redo | extendible_hash.c | 5089 |
ehash_rv_insert_undo | extendible_hash.c | 5126 |
ehash_rv_delete_redo | extendible_hash.c | 5181 |
ehash_rv_delete_undo | extendible_hash.c | 5220 |
ehash_rv_delete | extendible_hash.c | 5293 |
ehash_rv_increment | extendible_hash.c | 5405 |
ehash_rv_connect_bucket_redo | extendible_hash.c | 5429 |
RVEH_REPLACE … RVEH_INIT_NEW_DIR_PAGE | transaction/recovery.h | 105-112 |
RV_fun[] ehash entries | transaction/recovery.c | 415-452 |
xehash_create (boot, classname_table) | transaction/boot_sr.c | 5001 |
xehash_create (catalog rep dir) | storage/system_catalog.c | 2635 |
ehash_delete (catalog rep dir) | storage/system_catalog.c | 2230 |
qexec_init_upddel_ehash_files | query/query_executor.c | 9822 |
qexec_destroy_upddel_ehash_files | query/query_executor.c | 9875 |
qexec_upddel_add_unique_oid_to_ehid | query/query_executor.c | 1015 |
Cross-check Notes
A grep across the source tree (grep -rn "ehash_\|EHID\|xehash_" src/) confirms the four call sites listed in ## CUBRID's Approach § "Use cases inside CUBRID":
- boot_sr.c: boot_Db_parm->classname_table is the only permanent EHID created at database boot. Lookups go through locator_* wrappers that ultimately call ehash_search.
- system_catalog.c: catalog representation directory created at server start; ehash_delete is called on DROP CLASS.
- query_executor.c: temporary EHIDs for UPDATE/DELETE OID dedup, created with is_tmp=true so logging is suppressed.
- query_hash_scan.c: reuses only the directory-file format, not the bucket logic.
Subsystems that do not use extendible hashing today, contrary to what one might guess from the public API:
- User indexes. CREATE INDEX … USING HASH is not parsed; user indexes are always B+Tree (btree.c).
- Unique-constraint violation tracking. Despite the document brief's hint, current btree_unique plumbing does not reach into extendible_hash.c. The unique-constraint enforcement path in btree.c (btree_find_oid_and_its_page, BTREE_NEED_UNIQUE_CHECK) and the global stats table (btree_unique_stats, GLOBAL_UNIQUE_STATS_TABLE) work entirely within the B+Tree subsystem; no ehash_* calls appear in btree.c or btree_unique.{cpp,hpp}.
- Catalog scan / catalog name lookup beyond the representation directory. Catalog tables are heap-stored and accessed by B+Tree; only the rep-id directory uses extendible hash.
The implementation contains a substantial volume of code
guarded by ENABLE_UNUSED_FUNCTION (long-string overflow,
multi-byte numeric key types, in-bucket record sorting
fallbacks). None of those branches is reachable in current
builds — the asserts in ehash_create_helper and the cap on
key length in ehash_insert_helper actively fence them off.
A future change that needs broader key support or oversized
keys would have to revive (and reverify) those paths.
The on-disk record header for a bucket record is (OID, key_bytes) — note that the OID is first, before the key.
This differs from the B+Tree leaf record where the key
precedes the OID. The reason is that extendible hash records
are physically positioned in the bucket by their key (binary
search compares key-first), but the record stores the value
ahead of the key: callers of ehash_search only need the OID,
and ehash_read_oid_from_record can read it without scanning
the variable-length key.
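A self-contained sketch of that layout. The struct, sizes, and helper name are illustrative (the real codecs are ehash_write_oid_to_record / ehash_read_oid_from_record, and they additionally respect the bucket page's alignment rules).

```c
/* Minimal (OID || key_bytes) record-layout illustration; not CUBRID code. */
#include <stdio.h>
#include <string.h>

typedef struct { int pageid; short slotid; short volid; } TOY_OID;

/* record layout: [ TOY_OID ][ key bytes ... ], value first, key second */
static void
read_oid_from_record (const char *record, TOY_OID *out)
{
  memcpy (out, record, sizeof (TOY_OID));   /* no need to parse the variable-length key */
}

int
main (void)
{
  char record[64];
  TOY_OID oid = { 1234, 7, 0 }, back;
  const char *key = "employee";

  memcpy (record, &oid, sizeof oid);                      /* OID first    */
  memcpy (record + sizeof oid, key, strlen (key) + 1);    /* then the key */

  read_oid_from_record (record, &back);
  printf ("pageid=%d slotid=%d volid=%d key=%s\n",
          back.pageid, back.slotid, back.volid, record + sizeof oid);
  return 0;
}
```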
The ehash_create_helper documentation comment claims the
function “creates two files on this volume: one for the
directory and one for the buckets” — but in fact, the bucket
file’s volid comes from ehid->vfid.volid, not from the
directory’s volid, because the directory’s VFID isn’t yet
assigned at the time the bucket file is created
(bucket_vfid.volid = ehid_p->vfid.volid; precedes
file_create_ehash_dir). The two files end up on the same
volume because the dir-file creation later sets
dir_vfid.volid = bucket_vfid.volid. The end result is
correct; the reading order is mildly counter-intuitive.
The ehash_get_key_size function distinguishes
DB_TYPE_STRING by returning 1 rather than the key length,
because string keys have variable length determined per record.
This is a sentinel; callers check key_type == DB_TYPE_STRING
explicitly and bypass the size computation.
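A sketch of that dispatch, assuming CUBRID's DB_TYPE and OID types are in scope. The function name and the -1 return for unsupported types are invented; only the sentinel convention follows the description above.

```c
/* Illustrative only: mirrors the sentinel convention described above. */
static int
toy_get_key_size (DB_TYPE key_type)
{
  switch (key_type)
    {
    case DB_TYPE_OBJECT:
      return (int) sizeof (OID);   /* fixed-width key */
    case DB_TYPE_STRING:
      return 1;                    /* sentinel: the real length is per record */
    default:
      return -1;                   /* not reachable in current builds */
    }
}
```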
Open Questions
- Is the PG-style "hash bucket overflow chain" worth adding for catalog lookups? CUBRID's bucket size is one page; if a degenerate hash collision puts more than one page worth of records into the same hash address, the implementation handles it by recursively splitting until the local depth is large enough. With a 32-bit pseudo-key and mht_*strhash the collision space is empirically small, but a pathological class-naming convention (e.g. all classes named with the same 8-character prefix) could in principle exhaust the addressing depth. PostgreSQL's per-bucket overflow chain handles this case by accepting a long chain on one bucket rather than doubling the directory. Investigation: what global-depth values are observed in real CUBRID deployments with thousands of classes?
- Why slot 0 holds the bucket header. The choice of using slot 0 of a slotted page for EHASH_BUCKET_HEADER rather than reserving page-header space is unusual. The likely reason is reuse of the heap-page slot machinery (spage_initialize, spage_get_record, spage_insert) without a separate "ehash page" page-buffer integration — the cost is a one-byte allocator overhead per page (the slot-0 record). Verification: is there a code path that ever scans the bucket starting at slot 1 and assumes the header is in the page header rather than slot 0?
- Concurrent-split correctness with the optimistic S latch on the directory. ehash_insert_helper holds an S latch on the dir root while it discovers a bucket-full condition, then releases and re-acquires X. Between the release and the X-fix, another thread could split the same bucket. The X-fixed insert would then see a non-full bucket on the second try, the "BUCKET_FULL" branch would not be taken, and the insert would succeed without a second split. Is this benign in all cases, or is there a window where the new hash location (after the directory state change) differs from what the first-pass ehash_find_bucket_vpid_with_hash computed? The recursive call starts from the top, so it should re-resolve the bucket — but the comment in ehash_find_bucket_vpid_with_hash does not explicitly document the safety property.
- Why EHASH_OVERFLOW_RATE = 0.9 and EHASH_UNDERFLOW_RATE = 0.4? The 0.9 / 0.4 split looks deliberate (the gap prevents oscillation) but the specific values are not parameterized and not commented. Was there empirical workload tuning, or are these inherited from a textbook reference?
- Logical undo vs. physical undo: when does a split-undo need to know about previous merges? The split path logs full page images (physical) for the bucket and sibling under a system op. If a transaction inserts (causing a split), then deletes (causing a merge of the sibling back into the bucket), then aborts, the undo replays the delete-undo (which is a logical re-insert) followed by the split-undo (which restores the pre-split bucket image). Does that order produce a consistent state, or is there a constraint on system-op ordering that is not visible in the code?
- Why does RVEH_CONNECT_BUCKET log as undoredo and not as paired undo+redo? The single record carries a before-image (bef_length bytes from the page) and an after-payload (EHASH_REPETITION). The undo handler must therefore know how to invert the repetition into a list of distinct VPIDs — but that is not what ehash_rv_connect_bucket_redo does. Either the undo path actually replays the original bef_length bytes (in which case it is "physical undo, structured redo"), or there is a missing handler. Investigation path: read the RV_fun[] entry for RVEH_CONNECT_BUCKET to see whether a separate undo function is registered or whether log_rv_copy_char is used as a generic page-image-replay undo.
- Why is there no ehash_destroy for permanent files? xehash_destroy only handles temporary files. The permanent class-name table is never destroyed individually — it lives for the database's lifetime and is reaped along with the database itself when the volume is dropped. If a future feature needed to destroy a permanent extendible hash file (e.g., for online schema renames at a database level), the implementation would need a permanent-file variant; the current shape suggests no such requirement has surfaced.
Sources
Sibling docs
Section titled “Sibling docs”knowledge/code-analysis/cubrid/cubrid-btree.md— the alternative on-disk index that user indexes use; same WAL/page-buffer substrate, different access pattern.knowledge/code-analysis/cubrid/cubrid-page-buffer-manager.md—pgbuf_fix,PGBUF_LATCH_READ/PGBUF_LATCH_WRITE,pgbuf_set_dirtydiscipline used by every ehash routine.knowledge/code-analysis/cubrid/cubrid-heap-manager.md— the slotted-page substrate that bucket pages reuse.knowledge/code-analysis/cubrid/cubrid-log-manager.md—log_append_undoredo_data2, system-ops, and theRV_fun[]recovery dispatch table.knowledge/code-analysis/cubrid/cubrid-recovery-manager.md— redo/undo passes that consumeRVEH_*records.knowledge/code-analysis/cubrid/cubrid-catalog-manager.md— the rep-id directory caller.
Textbook chapters (under knowledge/research/dbms-general/)
- Database Internals (Petrov), Ch. 7 §“Hash Files” — the global-vs-local-depth model.
- Fagin, Nievergelt, Pippenger & Strong, Extendible Hashing — A Fast Access Method for Dynamic Files, ACM TODS 4(3), 1979 — the original paper.
- Litwin, Linear Hashing: A New Tool for File and Table Addressing, VLDB 1980 — the alternative.
CUBRID source (/data/hgryoo/references/cubrid/)
- src/storage/extendible_hash.c — the implementation.
- src/storage/extendible_hash.h — the public API.
- src/storage/storage_common.h — EHID, EH_SEARCH.
- src/storage/system_catalog.c, system_catalog.h — catalog rep-id directory caller.
- src/transaction/boot_sr.c — class-name table caller and database-boot creator.
- src/transaction/recovery.h, recovery.c — RVEH_* codes and RV_fun[] registrations.
- src/query/query_executor.c — UPDATE/DELETE OID-dedup caller (qexec_init_upddel_ehash_files, qexec_upddel_add_unique_oid_to_ehid).
- src/query/query_hash_scan.{c,h} — directory-file reuse for HASH-LIST-SCAN / HASH-SET-SCAN.
- src/base/object_representation.c, object_representation.h, object_representation_constants.h — or_pack_ehid / OR_EHID_* codecs.
- src/storage/file_manager.c, file_manager.h — file_create_ehash, file_create_ehash_dir, FILE_EHASH_DES.