
CUBRID Extendible Hash — Disk-Resident Hash File With Doubling Directory and Local Depth

The extendible hashing scheme was introduced by Fagin, Nievergelt, Pippenger & Strong in Extendible Hashing — A Fast Access Method for Dynamic Files (TODS 1979) as the disk-friendly answer to a question B+Trees do not solve cleanly: how does a hash file grow? An in-memory hash table grows by reallocating a larger array and re-hashing all entries; that is unacceptable on a disk where each bucket is a page and each rehash is a scan of the entire file. Linear hashing (Litwin, 1980) gave one answer — split one bucket at a time in a fixed round-robin order, regardless of which bucket overflowed. Extendible hashing gave the other — split the bucket that overflowed, and double the directory only when the split exhausts the addressing depth.

Database Internals (Petrov, ch. 7 §“Hash Files”) lays out the core data structure. A pseudo-key is computed from the user key by a uniform hash function (output: a fixed-width integer, in practice 32 bits). The directory is an array of pointers indexed by the leftmost N bits of the pseudo-key, where N is the directory’s global depth (d). The directory holds 2^d entries. Each entry points to a bucket — a single page containing one or more (key, value) records.

Each bucket carries its own local depth (d_b), which is the number of leftmost bits the bucket has so far been “differentiated by”. The invariants are:

  1. d_b ≤ d for every bucket.
  2. If d_b < d, then 2^(d - d_b) directory entries all point at this same bucket — the bucket is shared.
  3. If d_b = d, then exactly one directory entry points at it.

When an insert finds the target bucket full:

  • If d_b < d (the bucket is shared): allocate a sibling, re-distribute records by their d_b + 1-th bit, set both buckets’ local depth to d_b + 1, and update the half of the directory entries that previously pointed at the original bucket so they now point at the sibling. The directory itself does not change size.
  • If d_b = d: the addressing space is exhausted. Double the directory (d := d + 1, copy each old pointer twice into the new directory), then perform the case-1 split. (A sketch of this decision follows the list.)
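
A minimal in-memory sketch of the split-or-double decision (illustrative only — the function, macro, and printf tracing are hypothetical, not CUBRID code):

// split_or_double — textbook extendible-hashing decision, sketched in C
#include <stdint.h>
#include <stdio.h>

/* Directory index = leftmost `depth` bits of a 32-bit pseudo-key. */
#define DIR_INDEX(pseudo_key, depth) \
  ((depth) == 0 ? 0u : (uint32_t) (pseudo_key) >> (32 - (depth)))

static void
split_or_double (int *global_depth, int local_depth)
{
  if (local_depth < *global_depth)
    {
      /* Shared bucket: split it, raise its local depth by one, and repoint
       * half of the 2^(d - d_b) sharing directory entries at the sibling.
       * The directory keeps its size. */
      printf ("split only: repoint %d directory entries\n",
              1 << (*global_depth - local_depth - 1));
    }
  else
    {
      /* local depth == global depth: double the directory first (copy every
       * pointer twice), then split as in the shared case. */
      *global_depth += 1;
      printf ("directory doubled to 2^%d entries, then split\n", *global_depth);
    }
}

int
main (void)
{
  int d = 2;
  split_or_double (&d, 1);      /* shared bucket: no doubling */
  split_or_double (&d, 2);      /* exhausted depth: double, then split */
  return 0;
}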

Deletion runs the dual: when a bucket falls below an underflow threshold and its sibling has the same local depth and they together fit, merge them and decrement the local depth. If after the merge no bucket has local depth d, halve the directory.

Two properties drive the design’s appeal:

  • Constant-time lookup. The leftmost bits of the pseudo-key locate the directory entry; one bucket read returns the value. No tree descent, no log-N comparisons.
  • No rehash on growth. A directory doubling is an O(2^d) pointer copy; bucket splits affect exactly two pages. Other buckets are untouched.

The cost is two-level indirection (directory page plus bucket page — two buffer fixes per lookup, versus a single probe for an in-memory table) and the directory’s size — 2^d pointers can balloon if the hash function clusters. Engines that adopt it either size the directory generously up front or accept the occasional doubling cost.

CUBRID’s implementation, the subject of this document, adopts Fagin’s scheme almost literally: an EHID rooted at a directory file, slotted bucket pages with a small EHASH_BUCKET_HEADER holding the local depth, a 32-bit pseudo-key produced by ehash_hash, and split/merge logic conditioned on local-depth / global-depth comparison. The interesting CUBRID-specific ingredients live around the edges: WAL integration through RVEH_* records, system-op brackets around split/merge, slotted-page-style binary search inside buckets, and the restricted set of key types (DB_TYPE_STRING, DB_TYPE_OBJECT) the implementation actually exercises today.

Disk-resident hash files appear in many engines. The shape is shared; the differences are in WAL integration, key-type flexibility, and what the hash file is for.

  • PostgreSQL hash index — a disk-resident hash index on user tables (internally a linear-hashing variant rather than Fagin’s extendible scheme). Through PG 9.6 the index was not WAL-logged and could not be replicated; PG 10 added WAL records and crash safety, closing a long-standing complaint. The index is rooted at a meta page that maps hash values to bucket pages; bucket overflow spills to overflow pages chained from the bucket. PG exposes hash indexes to users for equality-only point lookups.
  • BerkeleyDB DB_HASH — a disk-resident hash as one of the primary access methods, based on extended linear hashing; deletions can optionally compact buckets but the file rarely shrinks.
  • Oracle hash clusters — clustered tables grouped on a hash function over the cluster key. Not extendible in Fagin’s sense (the bucket count is fixed at create time), but the user-facing motivation — point lookup with one I/O — is the same.
  • DB2 LOAD with HASH — bulk-load aware variants that build the hash file bottom-up like a B+Tree bulk load.
  • Linear hashing — Litwin’s split-one-bucket-at-a-time scheme is the principal alternative; it underpins both PostgreSQL’s hash index and BerkeleyDB’s DB_HASH, whereas CUBRID follows Fagin’s directory-doubling design. The PG 10 hash-index work kept the on-disk layout and added WAL logging.

CUBRID’s extendible hash is not a user-facing index type. There is no CREATE INDEX ... USING HASH; users get B+Tree only. Internally, the hash file backs a small set of name → OID directories where the workload is dominated by point lookup and range scans are not needed:

  1. Class-name → OID directory. When the SQL compiler resolves SELECT … FROM employee it must turn the identifier employee into the class OID that owns the heap file. The mapping is stored in boot_Db_parm->classname_table, a permanent extendible hash file built at database boot (xehash_create in boot_sr.c:5001).
  2. Catalog representation directory (catalog_Id.xhid). xehash_create (… DB_TYPE_OBJECT, …) in system_catalog.c:2635 builds a hash file keyed on class OIDs, returning representation IDs used to retrieve the schema record from the catalog heap. Each DROP CLASS emits an ehash_delete against this directory (system_catalog.c:2230).
  3. UPDATE/DELETE OID-dedup. When an UPDATE or DELETE statement may visit a row through more than one access path (e.g., union of two indexes), the executor needs to suppress double-application. qexec_init_upddel_ehash_files (query_executor.c:9822) creates one temporary hash file per target class, keyed on the row’s OID. Each candidate row is ehash_search-ed first; if absent, ehash_insert records it and the update proceeds; if present, the row is skipped. Files are destroyed at statement end via qexec_destroy_upddel_ehash_files.
  4. Hash list / hash set scans (src/query/query_hash_scan.{c,h}) re-use EHID and the directory file format for the in-memory hash side of HASH-LIST-SCAN / HASH-SET-SCAN operators. This path uses file_create_ehash_dir directly (the dir-file creator) rather than going through xehash_create, because only the directory representation is reused, not the bucket manager.

The pattern across all four use cases is the same: point lookup of a small key against a small-to-medium set of entries, with no requirement for range scans. That is exactly the workload extendible hashing wins on against B+Trees, and exactly the constraint that justified building the module.

| Concept | CUBRID name |
| --- | --- |
| Hash file identifier | EHID = { VFID vfid; INT32 pageid } (storage_common.h:212) |
| Directory header | EHASH_DIR_HEADER (extendible_hash.c:103) |
| Directory entry | EHASH_DIR_RECORD = { VPID bucket_vpid } (extendible_hash.c:120) |
| Bucket header | EHASH_BUCKET_HEADER = { char local_depth } (extendible_hash.c:127) |
| Pseudo-key | EHASH_HASH_KEY = unsigned int (extendible_hash.c:99) |
| Search return | EH_SEARCH { EH_KEY_FOUND, EH_KEY_NOTFOUND, EH_ERROR_OCCURRED } (storage_common.h:381) |
| Internal split/merge result | EHASH_RESULT enum (extendible_hash.c:132) |
| Top-N-bits extraction | GETBITS(value, pos, n) macro (extendible_hash.c) |
| First-N-bit dir lookup | FIND_OFFSET(hash_key, depth) macro |
| WAL record family | RVEH_REPLACE, RVEH_INSERT, RVEH_DELETE, RVEH_INIT_BUCKET, RVEH_CONNECT_BUCKET, RVEH_INC_COUNTER, RVEH_INIT_DIR, RVEH_INIT_NEW_DIR_PAGE (recovery.h:105-112) |

The module fits in two files plus a handful of recovery hooks and is best understood layer by layer: the on-disk layout (EHID, directory file, bucket pages), the read path (ehash_search), the write path (ehash_insert and the split / directory-doubling logic), the delete path (ehash_delete and the merge / directory-shrink logic), and the WAL integration that makes all of the above crash-safe.

flowchart TB
  subgraph EHID["EHID = { vfid, pageid }"]
    DIRROOT["dir root page<br/>EHASH_DIR_HEADER<br/>· first directory chunk"]
    DIRPAGES["directory pages 1..N<br/>(EHASH_DIR_RECORD[])"]
    DIRROOT --> DIRPAGES
  end
  subgraph BUCKETS["bucket file (separate VFID)"]
    BUC0["bucket 0<br/>local_depth=2"]
    BUC1["bucket 1<br/>local_depth=2"]
    BUC2["bucket 2<br/>local_depth=1"]
  end
  DIRROOT -. dir_header.bucket_file .-> BUCKETS
  DIRPAGES -- "VPID pointers" --> BUC0
  DIRPAGES -- "VPID pointers" --> BUC1
  DIRPAGES -- "VPID pointers x2 (shared)" --> BUC2
  subgraph LOOKUP["Lookup"]
    L1["ehash_hash(key) → 32-bit pseudo_key"]
    L2["FIND_OFFSET(pseudo_key, dir.depth) → entry idx"]
    L3["ehash_dir_locate → (dir page no, offset in page)"]
    L4["ehash_locate_slot (binary search inside bucket)"]
    L1 --> L2 --> L3 --> L4
  end

The figure encodes three boundaries. (directory / bucket) the directory is one file (rooted at EHID.pageid), the bucket pool is a different file (the bucket VFID lives in the directory header). (shared / unique) a directory entry with d_b < d shares the bucket page with 2^(d - d_b) - 1 peers; an entry with d_b == d owns its bucket alone. (read / write) lookup is a pure walk of the addressing chain; insert and delete may modify the directory, the sibling, and the LSN-protected on-disk state under a system operation.

// EHID — src/storage/storage_common.h:212
typedef struct ehid EHID;
struct ehid
{
  VFID vfid;        /* The directory file (volid + fileid) */
  INT32 pageid;     /* The first (root) page of the directory */
};

An EHID is 12 bytes (4-byte volid, 4-byte fileid, 4-byte pageid; see OR_EHID_SIZE in object_representation_constants.h). It’s the on-disk handle for an extendible hash file — written into catalog records (or_pack_ehid in object_representation.c:1754) and into the database boot parameter page that holds boot_Db_parm->classname_table.

The first page of the directory file is pointed at by EHID.pageid and starts with the directory header:

// EHASH_DIR_HEADER — src/storage/extendible_hash.c:103
typedef struct ehash_dir_header EHASH_DIR_HEADER;
struct ehash_dir_header
{
  VFID bucket_file;     /* bucket file identifier */
  VFID overflow_file;   /* overflow (long-string) file identifier */
  int local_depth_count[EHASH_HASH_KEY_BITS + 1];
                        /* hist[i] = number of buckets with local_depth = i;
                           used to detect when global depth can shrink */
  DB_TYPE key_type;     /* DB_TYPE_STRING or DB_TYPE_OBJECT */
  short depth;          /* GLOBAL DEPTH d */
  char alignment;       /* slot alignment for bucket pages */
};

After the header, the same first page carries up to EHASH_NUM_FIRST_PAGES directory entries. Subsequent directory pages each carry EHASH_NUM_NON_FIRST_PAGES entries (no header). Entry layout is trivial:

// EHASH_DIR_RECORD — src/storage/extendible_hash.c:120
typedef struct ehash_dir_record EHASH_DIR_RECORD;
struct ehash_dir_record
{
  VPID bucket_vpid;     /* bucket pointer */
};

The macro ehash_dir_locate (extendible_hash.c:384) maps a directory entry index to (page_no, in-page offset):

// ehash_dir_locate — src/storage/extendible_hash.c:384
static void
ehash_dir_locate (int *out_page_no_p, int *out_offset_p)
{
  int page_no, offset = *out_offset_p;

  if (offset < EHASH_NUM_FIRST_PAGES)
    {
      /* the entry lives in the dir root page, after the header */
      offset = offset * sizeof (EHASH_DIR_RECORD) + EHASH_DIR_HEADER_SIZE;
      page_no = 0;
    }
  else
    {
      /* the entry lives in a later directory page (no header) */
      offset -= EHASH_NUM_FIRST_PAGES;
      page_no = offset / EHASH_NUM_NON_FIRST_PAGES + 1;
      offset = (offset % EHASH_NUM_NON_FIRST_PAGES) * sizeof (EHASH_DIR_RECORD);
    }

  *out_page_no_p = page_no;
  *out_offset_p = offset;
}

The directory thus appears as a single contiguous array of 2^depth EHASH_DIR_RECORDs that happens to span multiple pages. The first page is special only in that its layout starts after EHASH_DIR_HEADER_SIZE.

Buckets live in the bucket file (a separate VFID, dir_header.bucket_file). Each bucket page is a slotted page (PAGE_EHASH ptype) whose slot 0 holds a one-byte EHASH_BUCKET_HEADER with the local depth, and whose remaining slots each hold one (OID || key_bytes) record. The bucket page is initialized through ehash_initialize_bucket_new_page (extendible_hash.c:628), called by file_alloc as the page-init callback.

// EHASH_BUCKET_HEADER — src/storage/extendible_hash.c:127
typedef struct ehash_bucket_header EHASH_BUCKET_HEADER;
struct ehash_bucket_header
{
  char local_depth;     /* The local depth d_b of this bucket */
};

The bucket page uses UNANCHORED_KEEP_SEQUENCE slot ordering (the spage_initialize argument): slot order may change as records are inserted or deleted, so slot ids are not stable. That choice is deliberate — bucket records are kept physically sorted by user key so the bucket can be binary-searched, and inserting into the middle of that order requires the slotted-page layer to be willing to renumber slots.

The public creator is xehash_create (extendible_hash.c:866); it forwards to ehash_create_helper (extendible_hash.c:954):

// xehash_create — src/storage/extendible_hash.c:866
EHID *
xehash_create (THREAD_ENTRY *thread_p, EHID *ehid_p, DB_TYPE key_type, int exp_num_entries,
               OID *class_oid_p, int attr_id, bool is_tmp)
{
  return ehash_create_helper (thread_p, ehid_p, key_type, exp_num_entries,
                              class_oid_p, attr_id, is_tmp);
}

Helper steps (condensed from extendible_hash.c:954-1178):

  1. Validate key_type. CUBRID’s runtime only enables DB_TYPE_STRING and DB_TYPE_OBJECT; the rest are guarded under ENABLE_UNUSED_FUNCTION. The assert key_type == DB_TYPE_STRING || key_type == DB_TYPE_OBJECT sits at the top of the helper.
  2. Estimate the bucket-page count from exp_num_entries. A string key budget assumes 20 bytes/key + OID + slot overhead (4) per entry; an OID key budget uses the actual key size. exp_num_entries < 0 collapses to a one-page estimate.
  3. Compute the bucket-page slot alignment as max(sizeof(int), sizeof(OID.pageid), key_size) capped at sizeof(int).
  4. Create the bucket file with file_create_ehash (registers it as a FILE_EXTENDIBLE_HASH file with the FILE_EHASH_DES descriptor {class_oid, attr_id}). Allocate bucket page 0 with file_alloc, passing ehash_initialize_bucket_new_page as the init callback — that callback writes the bucket header (local_depth = 0) into slot 0 and emits an RVEH_INIT_BUCKET log record.
  5. Create the directory file with file_create_ehash_dir (a sibling helper in file_manager.c:3285 that registers the file as FILE_EXTENDIBLE_HASH_DIR). Allocate page 0 with the generic page-type init.
  6. Lay down the directory header in the dir root page: global depth 0, key type, alignment, bucket file VFID, null overflow file VFID, local_depth_count[0] = 1 (one bucket exists, with depth 0).
  7. Write the single directory entry — pointer to bucket page 0.
  8. Emit RVEH_INIT_DIR (a redo-only record because the file itself will be reaped by recovery if the create transaction aborts).
  9. Return the populated EHID to the caller.

The starting state is therefore: depth = 0, one directory entry, one bucket with local_depth = 0. Every key hashes to that one bucket until it overflows.

ehash_hash (extendible_hash.c:4347) dispatches by key type:

// ehash_hash — src/storage/extendible_hash.c:4347 (condensed)
static EHASH_HASH_KEY
ehash_hash (void *original_key_p, DB_TYPE key_type)
{
  char *key = (char *) original_key_p;

  switch (key_type)
    {
    case DB_TYPE_STRING:
      return ehash_hash_string_type (key, original_key_p);
    case DB_TYPE_OBJECT:
      return ehash_hash_eight_bytes_type (key);   /* OID is 8 useful bytes */
      /* … other types only under ENABLE_UNUSED_FUNCTION … */
    default:
      er_set (ER_FATAL_ERROR_SEVERITY, ARG_FILE_LINE, ER_EH_CORRUPTED, 0);
      return 0;
    }
}

The string variant (ehash_hash_string_type, extendible_hash.c:4156) trims trailing spaces, applies a classic folding method (sum 4-byte windows of the key into one 32-bit accumulator), then runs three independent string hashes from memory_hash.c — mht_1strhash, mht_2strhash, mht_3strhash — on the four bytes of the folded value, packing their results into the high three bytes of the final pseudo-key and putting the byte sum in the low byte. The intent is to spread bits across the 32-bit pseudo-key space even when the input strings share long common prefixes (which is common for class names: customer, customer_address, customer_phone).
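
As an illustration of the folding idea only — not the exact ehash_hash_string_type code — a sketch that sums the key in 4-byte windows (the function name is hypothetical):

// fold_key_sketch — illustrative folding pass, not CUBRID code
#include <stdint.h>
#include <string.h>

static uint32_t
fold_key_sketch (const char *key, size_t len)
{
  uint32_t acc = 0;

  while (len >= 4)
    {
      uint32_t window;

      memcpy (&window, key, 4);     /* one 4-byte window of the key */
      acc += window;
      key += 4;
      len -= 4;
    }
  while (len-- > 0)
    {
      acc += (uint32_t) (unsigned char) *key++;     /* fold the 1..3 byte tail */
    }
  return acc;
}

The real function goes further: it feeds the folded value through the three mht_*strhash functions and assembles their outputs into the final 32-bit pseudo-key.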

The eight-bytes variant (used for OIDs) folds two ints with htonl and runs a final byte-sum pass to scatter the result. For OIDs, since (volid, pageid, slotid) is already nearly unique, the goal is “do not concentrate in one quadrant of the key space” rather than full avalanche.

The output is unsigned int (EHASH_HASH_KEY, extendible_hash.c:99), 32 bits — so global depth d cannot exceed 32. The macro GETBITS extracts a left-adjusted bit field, FIND_OFFSET(hash_key, depth) extracts the leftmost depth bits.
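
What the leftmost-bits extraction amounts to can be shown in a few lines (a sketch assuming a 32-bit pseudo-key; the real GETBITS macro is more general and the name below is hypothetical):

// find_offset_sketch — leftmost-`depth`-bits extraction (illustrative)
#include <stdint.h>
#include <assert.h>

static uint32_t
find_offset_sketch (uint32_t pseudo_key, int depth)
{
  assert (depth >= 0 && depth <= 32);
  if (depth == 0)
    {
      return 0;                         /* single directory entry, index 0 */
    }
  return pseudo_key >> (32 - depth);    /* directory index in [0, 2^depth) */
}

For example, with depth = 3 a pseudo-key whose top bits are 110… maps to directory index 6.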

Lookup is short:

// ehash_search — src/storage/extendible_hash.c:1379 (condensed)
EH_SEARCH
ehash_search (THREAD_ENTRY *thread_p, EHID *ehid_p, void *key_p, OID *value_p)
{
  /* condensed: local declarations and most error handling omitted */
  if (ehid_p == NULL || key_p == NULL)
    {
      return EH_KEY_NOTFOUND;
    }

  dir_root_page_p = ehash_find_bucket_vpid_with_hash (thread_p, ehid_p, key_p,
                                                      PGBUF_LATCH_READ, PGBUF_LATCH_READ,   /* root S, bucket S */
                                                      &bucket_vpid, NULL, NULL);
  if (bucket_vpid.pageid == NULL_PAGEID)
    {
      result = EH_KEY_NOTFOUND;         /* directory hole: the key cannot exist */
      goto end;
    }

  bucket_page_p = ehash_fix_old_page (thread_p, &ehid_p->vfid, &bucket_vpid, PGBUF_LATCH_READ);
  if (!ehash_locate_slot (thread_p, bucket_page_p, dir_header_p->key_type, key_p, &slot_id))
    {
      result = EH_KEY_NOTFOUND;
      goto end;
    }

  spage_get_record (thread_p, bucket_page_p, slot_id, &recdes, PEEK);
  ehash_read_oid_from_record (recdes.data, value_p);
  result = EH_KEY_FOUND;

end:
  if (bucket_page_p)
    {
      pgbuf_unfix_and_init (thread_p, bucket_page_p);
    }
  if (dir_root_page_p)
    {
      pgbuf_unfix_and_init (thread_p, dir_root_page_p);
    }
  return result;
}

ehash_find_bucket_vpid_with_hash (extendible_hash.c:1327) is the addressing routine: fix the dir root page with an S latch, hash the key, compute location = FIND_OFFSET(pseudo_key, depth), then ehash_find_bucket_vpid (extendible_hash.c:1295) figures out whether the entry is in the root page or in some later directory page, fixes the latter if necessary, and copies the bucket_vpid out.

If the directory pointer is NULL_PAGEID, the bucket has been deleted (an empty cell that hasn’t yet been merged away); search returns EH_KEY_NOTFOUND immediately.

Otherwise the bucket page is fixed with a read latch, and ehash_locate_slot (extendible_hash.c:2479) runs a binary search inside the bucket. The bucket keeps records in user-key order (the slotted page uses UNANCHORED_KEEP_SEQUENCE ordering); the binary search uses ehash_compare_key (extendible_hash.c:2223), which dispatches by key type and returns -1 / 0 / +1.
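
A generic sketch of that in-bucket search (illustrative; the callback and helper names are hypothetical, and the real code reads records through spage_get_record rather than an array):

// locate_slot_sketch — binary search over key-ordered records (illustrative)
#include <stdbool.h>

typedef int (*key_cmp_fn) (const void *key, const void *record);

static bool
locate_slot_sketch (const void **records, int num_records, const void *key,
                    key_cmp_fn cmp, int *slot_out)
{
  int lo = 0;
  int hi = num_records - 1;

  while (lo <= hi)
    {
      int mid = lo + (hi - lo) / 2;
      int c = cmp (key, records[mid]);

      if (c == 0)
        {
          *slot_out = mid + 1;          /* +1: slot 0 is the bucket header */
          return true;
        }
      if (c < 0)
        {
          hi = mid - 1;
        }
      else
        {
          lo = mid + 1;
        }
    }
  *slot_out = lo + 1;                   /* insertion point, preserving key order */
  return false;
}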

The lookup therefore costs two page fixes (dir root + bucket — almost always both buffer-pool hits for a hot table) and one in-page binary search of O(log m) where m is the bucket’s record count.

The public entry point delegates to ehash_insert_helper with an initial shared latch on the directory root, on the optimistic assumption that the insert will not require a directory expansion:

// ehash_insert — src/storage/extendible_hash.c:1461
void *
ehash_insert (THREAD_ENTRY *thread_p, EHID *ehid_p, void *key_p, OID *value_p)
{
  if (ehid_p == NULL || key_p == NULL)
    {
      return NULL;
    }
  return ehash_insert_helper (thread_p, ehid_p, key_p, value_p, S_LOCK, NULL);
}

ehash_insert_helper (extendible_hash.c:1715) is the orchestrator. Its lock-promotion pattern is the analogue of B+Tree’s “restart from root with X” — the helper releases the shared root latch and recursively re-enters with X_LOCK whenever it discovers it must mutate the directory.

flowchart TB
  start["ehash_insert_helper(key, value, lock=S)"]
  start --> resolve["ehash_find_bucket_vpid_with_hash<br/>S/X latch dir, S latch bucket-vpid"]
  resolve --> case_null{"bucket_vpid<br/>== NULL?"}
  case_null -- "yes" --> retry1{"lock == S?"}
  retry1 -- "yes" --> restart1["release dir,<br/>recurse with X_LOCK"]
  retry1 -- "no" --> create["ehash_insert_to_bucket_after_create<br/>(allocate new bucket, connect, log sysop)"]
  case_null -- "no" --> attempt["ehash_insert_bucket_after_extend_if_need"]
  attempt --> simple["ehash_insert_to_bucket<br/>(write or replace)"]
  simple --> ok{"result"}
  ok -- "OK" --> done["pgbuf_unfix dir; return key"]
  ok -- "BUCKET_FULL" --> retry2{"lock == S?"}
  retry2 -- "yes" --> restart2["release dir,<br/>recurse with X_LOCK"]
  retry2 -- "no" --> split["ehash_extend_bucket<br/>(split + maybe expand directory)"]
  split --> reinsert["ehash_insert_to_bucket on chosen side"]
  reinsert --> done

Inserting into a bucket — ehash_insert_to_bucket

The straight-line case (room available, key not present) runs ehash_insert_to_bucket (extendible_hash.c:1837):

  1. ehash_locate_slot to find where the key would sit.
  2. If found (key already exists) — replace: copy the old record, write the new OID into its OID slot in place, emit RVEH_DELETE undo (which restores the old OID on abort) and RVEH_REPLACE redo (which re-applies the new OID). This means ehash_insert is upsert semantics — the second insert of the same key replaces the value.
  3. Otherwise — ehash_compose_record builds a record of the form (OID || key_bytes) (small variant) or (OID || prefix_vpid || prefix) (REC_BIGONE, the long-string overflow variant — guarded under ENABLE_UNUSED_FUNCTION and not active in current builds), then spage_insert_at puts it at the located slot.
  4. If spage_insert_at returns SP_DOESNT_FIT, the bucket is full — return EHASH_BUCKET_FULL to the caller (which is the signal that triggers the split path).
  5. Emit RVEH_INSERT undo and redo records. The undo record carries the EHID + the inserted record (the undo handler ehash_rv_insert_undo reads the EHID, then calls ehash_rv_delete to remove the entry).
  6. pgbuf_set_dirty (bucket_page_p, DONT_FREE).

The shape of the WAL records — undo carries enough to identify the EHID + the record-to-remove, redo carries slot-id + the record-to-insert — is logical undo, physical redo, the classical ARIES discipline for variable-length record stores.

Splitting a bucket — ehash_split_bucket and the depth dance

When ehash_insert_to_bucket returns EHASH_BUCKET_FULL and the caller holds an X latch on the directory, ehash_extend_bucket (extendible_hash.c:1546) is the orchestrator. It runs under a system operation (log_sysop_start / log_sysop_commit) so a crash in the middle does not leave the directory inconsistent:

sequenceDiagram
  participant H as ehash_insert_helper
  participant E as ehash_extend_bucket
  participant S as ehash_split_bucket
  participant D as ehash_expand_directory
  participant C as ehash_connect_bucket
  participant L as log_manager (sysop)

  H->>E: bucket full, X latch held
  E->>L: log_sysop_start
  E->>S: split_bucket(buc_pgptr, key)
  S->>S: log RVEH_REPLACE undo (full bucket image)
  S->>S: ehash_find_first_bit_position (compute new local depth)
  S->>S: file_alloc sibling page (init via ehash_initialize_bucket_new_page)
  S->>S: ehash_distribute_records_into_two_bucket
  S->>S: log RVEH_REPLACE redo for bucket and sibling
  S-->>E: sibling page ptr, old/new local_depth, new VPID
  E->>E: ehash_adjust_local_depth (-1 at old, +2 at new)
  alt new_local_depth > global depth
    E->>D: ehash_expand_directory (double the directory)
  end
  alt new_local_depth - old_local_depth > 1
    E->>C: connect old slot range to NULL_VPID
    E->>C: connect lower half to original bucket
  end
  E->>C: connect upper half to sibling
  E->>L: log_sysop_commit
  E-->>H: sibling_page_p, new_bit

The interesting steps:

  • ehash_find_first_bit_position (extendible_hash.c:2545) walks the bucket’s records (computing each record’s pseudo-key with ehash_get_pseudo_key), XORs them against the first record’s pseudo-key, and locates the first bit position at which any record differs (sketched after this list). That position becomes the new local depth — i.e. CUBRID does not blindly do local_depth + 1; it picks the smallest local depth that actually separates the records. This avoids creating two buckets where one is empty (which would force an immediate re-split).
  • ehash_distribute_records_into_two_bucket (extendible_hash.c:2605) iterates records: those whose local_depth-th pseudo-key bit is 1 move to the sibling; the rest stay. The order in the source bucket is preserved in both halves, which keeps both buckets sorted by user key (because the pseudo-key is independent of the user-key ordering, and binary search inside the bucket relies on user-key order, not pseudo-key bits).
  • ehash_expand_directory (extendible_hash.c:2792) doubles the directory by walking it from the last entry backward, copying each old entry 2^(new_depth - old_depth) times into the expanded space. New directory pages, if needed, are allocated with file_alloc_multiple and initialized via ehash_initialize_dir_new_page which emits RVEH_INIT_NEW_DIR_PAGE. Each modified existing dir page is logged with paired RVEH_REPLACE undo+redo of the full page image.
  • ehash_connect_bucket (extendible_hash.c:3067) rewrites a contiguous range of directory entries to point at a given VPID. The number of entries to rewrite is 2^(global_depth - local_depth); the range is computed by setting the low (global_depth - local_depth) bits of the hash-derived location alternately to all-ones (set_bits) and zeros, then using ehash_dir_locate to translate the endpoints into page-and-offset. Each affected directory page emits one RVEH_CONNECT_BUCKET record carrying an EHASH_REPETITION = { VPID, count } payload. The matching redo function ehash_rv_connect_bucket_redo (extendible_hash.c:5429) just rewrites count slots. This keeps the WAL volume O(1) per affected dir page rather than O(count * sizeof(VPID)).
  • ehash_adjust_local_depth (extendible_hash.c:3633) updates dir_header.local_depth_count[depth] += delta on the dir root page and emits an RVEH_INC_COUNTER undo+redo log record (ehash_rv_increment is the recovery handler). This histogram is what drives the directory-shrink decision later: when local_depth_count[depth] falls to 0 for the current dir_header.depth, the directory has fewer meaningful bits than its width, and shrinking is profitable.
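
The XOR trick behind ehash_find_first_bit_position, sketched on an array of pseudo-keys (illustrative; the real function iterates bucket slots and recomputes each pseudo-key via ehash_get_pseudo_key):

// first_differing_bit_sketch — smallest depth that separates the records
#include <stdint.h>

static int
first_differing_bit_sketch (const uint32_t *pseudo_keys, int n)
{
  uint32_t diff = 0;
  int pos = 0;                  /* 0 = leftmost (most significant) bit */
  int i;

  for (i = 1; i < n; i++)
    {
      diff |= pseudo_keys[0] ^ pseudo_keys[i];  /* OR of all differences vs. key 0 */
    }
  if (diff == 0)
    {
      return -1;                /* every record hashes identically: no depth separates them */
    }
  while ((diff & 0x80000000u) == 0)
    {
      diff <<= 1;
      pos++;
    }
  return pos + 1;               /* 1-based bit position = smallest separating local depth */
}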

Bucket creation when the directory entry is NULL

A delete can leave a directory entry set to NULL_VPID (the bucket was emptied, deallocated, and the directory wasn’t shrunk). On the next insert that hashes there, the helper calls ehash_insert_to_bucket_after_create (extendible_hash.c:1472):

  1. ehash_find_depth (extendible_hash.c:3183) walks the neighborhood of the NULL slot to find the smallest local depth that distinguishes this entry’s range from the buckets sharing the high-order bits with it. The resulting depth feeds the new bucket’s header.
  2. file_alloc allocates the bucket page, initialized via ehash_initialize_bucket_new_page with the chosen depth.
  3. ehash_connect_bucket rewrites the 2^(d - d_b) directory entries in the run to point at the new bucket.
  4. ehash_adjust_local_depth(+1) updates the histogram.
  5. ehash_insert_to_bucket writes the user record.

The whole sequence is bracketed in log_sysop_start / log_sysop_commit.

ehash_delete (extendible_hash.c:3418) is the inverse of ehash_insert. It descends with S on the dir root and X on the bucket — there is no second-attempt promotion path because deletion does not need to mutate the directory in the common case:

// ehash_delete — src/storage/extendible_hash.c:3418 (condensed)
void *
ehash_delete (THREAD_ENTRY *thread_p, EHID *ehid_p, void *key_p)
{
  dir_root_page_p = ehash_find_bucket_vpid_with_hash (thread_p, ehid_p, key_p,
                                                      PGBUF_LATCH_READ, PGBUF_LATCH_WRITE,  /* root S, bucket X */
                                                      &bucket_vpid, NULL, &location);
  if (bucket_vpid.pageid == NULL_PAGEID)
    {
      /* not found */
    }

  bucket_page_p = ehash_fix_old_page (… PGBUF_LATCH_WRITE);
  if (!ehash_locate_slot (… &slot_no))
    {
      /* ER_EH_UNKNOWN_KEY */
    }

  spage_get_record (thread_p, bucket_page_p, slot_no, &bucket_recdes, PEEK);

  /* logical undo log: EHID || full record */
  log_append_undo_data2 (thread_p, RVEH_DELETE, &ehid_p->vfid, NULL,
                         bucket_recdes.type, log_recdes_undo.length, log_recdes_undo.data);
  /* physical redo log: slot-id with key-type prefix */
  log_append_redo_data2 (thread_p, RVEH_DELETE, &ehid_p->vfid, bucket_page_p,
                         slot_no, log_recdes_redo.length, log_recdes_redo.data);

  spage_delete (thread_p, bucket_page_p, slot_no);
  pgbuf_set_dirty (thread_p, bucket_page_p, DONT_FREE);

  /* Decide whether to merge */
  if (bucket became empty || bucket underflows below EHASH_UNDERFLOW_THRESHOLD)
    {
      if (ehash_check_merge_possible (… S_LOCK …) == EHASH_SUCCESSFUL_COMPLETION)
        {
          do_merge = true;
        }
    }

  pgbuf_unfix dir, bucket;
  if (do_merge)
    {
      ehash_merge (thread_p, ehid_p, key_p, is_temp);
    }
  return key_p;
}

The thresholds:

// src/storage/extendible_hash.c:69-91
#define EHASH_OVERFLOW_RATE 0.9 /* sibling fullness ceiling for merge */
#define EHASH_UNDERFLOW_RATE 0.4 /* below this -> try to merge */
#define EHASH_OVERFLOW_THRESHOLD (EHASH_OVERFLOW_RATE * DB_PAGESIZE)
#define EHASH_UNDERFLOW_THRESHOLD (EHASH_UNDERFLOW_RATE * DB_PAGESIZE)

The asymmetry between underflow (40%) and overflow (90%) prevents yo-yo split-merge oscillation: a delete merges buckets only when the bucket has dropped below 40% full and the combined contents of the bucket and its sibling would still fit below 90% of a page. In other words — merge only if the result won’t immediately re-split.
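
A minimal sketch of that check, with hypothetical byte counts standing in for what the real code reads from the slotted-page headers:

// should_merge_sketch — underflow/overflow asymmetry (illustrative)
#include <stdbool.h>

#define PAGE_SIZE_SKETCH        16384   /* hypothetical stand-in for DB_PAGESIZE */
#define UNDERFLOW_THRESHOLD_SK  (0.4 * PAGE_SIZE_SKETCH)
#define OVERFLOW_THRESHOLD_SK   (0.9 * PAGE_SIZE_SKETCH)

static bool
should_merge_sketch (int bucket_used_bytes, int sibling_used_bytes)
{
  /* merge only if the bucket is under-filled AND the combined result stays
   * below the overflow ceiling, so the merge cannot trigger an immediate re-split */
  return bucket_used_bytes < UNDERFLOW_THRESHOLD_SK
    && (bucket_used_bytes + sibling_used_bytes) < OVERFLOW_THRESHOLD_SK;
}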

ehash_merge (extendible_hash.c:3771) re-acquires the directory under an X latch and re-validates that the bucket still wants to merge (a concurrent insert might have refilled it). The decision branches on ehash_check_merge_possible (extendible_hash.c:3274):

  • EHASH_NO_SIBLING_BUCKET — the bucket has local depth 0 (single-bucket file) or its sibling slot is null.
    • If the bucket is now empty and its local depth > 0, the bucket page is file_dealloc-ed, all directory entries pointing at it are reset to NULL_VPID, and local_depth_count[old_depth] is decremented. ehash_shrink_directory_if_need then checks whether the histogram allows global-depth reduction.
  • EHASH_FULL_SIBLING_BUCKET — the sibling cannot absorb the remaining records under the 90% rule. Merge is abandoned; the bucket stays underfilled.
  • EHASH_SUCCESSFUL_COMPLETION — actual merge: ehash_merge_permanent (extendible_hash.c:3655) moves the records to the sibling, updates the sibling’s local depth via ehash_find_depth, deallocates the now-empty source bucket page, and rewrites the directory entries that pointed at the source to now point at the sibling. The histogram is updated by two ehash_adjust_local_depth calls (-2 at old depth, +1 at new depth), and ehash_shrink_directory_if_need may then halve the directory.

The whole merge sequence runs under log_sysop_start / log_sysop_commit — the same system-op envelope that brackets splits.

Directory shrink — ehash_shrink_directory_if_need / ehash_shrink_directory

The histogram-driven shrink trigger:

// ehash_shrink_directory_if_need — src/storage/extendible_hash.c:3615
static void
ehash_shrink_directory_if_need (THREAD_ENTRY *thread_p, EHID *ehid_p,
                                EHASH_DIR_HEADER *dir_header_p, bool is_temp)
{
  int i = dir_header_p->depth;

  while (dir_header_p->local_depth_count[i] == 0)
    {
      i--;                      /* find the deepest local depth still in use */
    }
  if ((dir_header_p->depth - i) > 1)
    {
      ehash_shrink_directory (thread_p, ehid_p, i + 1, is_temp);
    }
}

Note the threshold is > 1, not >= 1: the directory shrinks only when at least two depth levels are unused. This avoids shrink-grow oscillation when the live max depth straddles a boundary. ehash_shrink_directory (extendible_hash.c:3982) implements the inverse of expansion: walk the new directory left-to-right, copying every 2^times-th entry from the old directory, log each modified page with paired RVEH_REPLACE undo+redo, and file_dealloc the trailing dir pages.
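
The copy pattern of the shrink itself can be sketched in a few lines (illustrative; the real code walks directory pages and logs each modified page, and entries here are plain ints rather than VPIDs):

// shrink_dir_sketch — keep every 2^times-th entry of the old directory
static void
shrink_dir_sketch (const int *old_dir, int old_entries, int *new_dir, int times)
{
  int step = 1 << times;        /* 2^times old entries collapse into one new entry */
  int i;

  for (i = 0; i < old_entries / step; i++)
    {
      /* entries that differed only in the dropped low-order depth bits all
       * pointed at the same bucket, so keeping the first of each group loses nothing */
      new_dir[i] = old_dir[i * step];
    }
}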

CUBRID’s extendible hash uses page latches only — there are no separate hash-level locks. The discipline:

  • Search — S-latch the directory root page; from it, optionally fix the secondary directory page with S; then S-latch the bucket. All three latches can be held briefly; the dir root and dir page are released as soon as the bucket VPID is extracted. The bucket S-latch is held through the binary search and the value extraction.
  • Insert (fast path, no split) — S-latch the dir root, X-latch the bucket. The bucket X-latch covers locate ▸ insert ▸ log ▸ set-dirty.
  • Insert (split or directory-extend) — restart with X-latch on the dir root, X-latch on bucket, X-latch on sibling, X-latch on every directory page touched by the rewrite (ehash_connect_bucket / ehash_expand_directory). All under one system op.
  • Delete (fast path, no merge) — S-latch the dir root, X-latch the bucket. Merge promotion releases both, then re-fixes with X / X.
  • Merge / shrink — X on dir root, X on bucket, X on sibling, X on each affected dir page. System-op bracket.

The shared model is therefore fix-the-page, search, maybe-promote, log, release — much like B+Tree, but without latch-coupling because there’s no tree path. The directory is one-deep (header-page → body-page → bucket-page); descent is at most three fixes.

The cost of the optimistic-S-latch-then-restart pattern in ehash_insert is two descents when the bucket happens to be full: one to discover the BUCKET_FULL condition under S, then again under X. The benefit is that the common case — insert into a non-full bucket — never holds an X latch on the directory root, so concurrent searches of unrelated keys are not blocked by inserts. With a workload dominated by short-name-to-OID lookups (boot-time class resolution, query compilation), this trade-off is well-suited.
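
The retry shape, reduced to a stub (illustrative only — the enum names and the stub insert are hypothetical, not CUBRID’s API):

// insert_with_promotion_sketch — optimistic S latch, restart with X on BUCKET_FULL
#include <stdio.h>

typedef enum { LATCH_S, LATCH_X } latch_mode_sketch;
typedef enum { INSERT_OK, INSERT_BUCKET_FULL } insert_result_sketch;

/* Stub standing in for one full descent: pretend the bucket is full so the
 * shared-latch attempt fails and the exclusive-latch retry succeeds. */
static insert_result_sketch
try_insert_sketch (latch_mode_sketch mode)
{
  return (mode == LATCH_S) ? INSERT_BUCKET_FULL : INSERT_OK;
}

int
main (void)
{
  if (try_insert_sketch (LATCH_S) == INSERT_BUCKET_FULL)
    {
      /* release everything and restart from the directory root under X;
       * only this second descent may split the bucket or double the directory */
      insert_result_sketch r = try_insert_sketch (LATCH_X);

      printf ("second attempt under X: %s\n", r == INSERT_OK ? "ok" : "still full");
    }
  return 0;
}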

CUBRID emits eight distinct recovery indices for extendible hash (recovery.h:105-112):

| Index | Producer | Recovery handler | Purpose |
| --- | --- | --- | --- |
| RVEH_REPLACE = 58 | ehash_split_bucket, ehash_expand_directory, ehash_shrink_directory, ehash_merge_permanent | log_rv_copy_char / ehash_rv_…_redo | Full-page-image undo/redo of dir or bucket page |
| RVEH_INSERT = 59 | ehash_insert_to_bucket | ehash_rv_insert_redo / ehash_rv_insert_undo | Per-record bucket insertion |
| RVEH_DELETE = 60 | ehash_delete, ehash_rv_delete | ehash_rv_delete_redo / ehash_rv_delete_undo | Per-record bucket deletion |
| RVEH_INIT_BUCKET = 61 | ehash_initialize_bucket_new_page | ehash_rv_init_bucket_redo | Re-initialize a freshly allocated bucket page (slotted-page init + header) |
| RVEH_CONNECT_BUCKET = 62 | ehash_connect_bucket | ehash_rv_connect_bucket_redo | Rewrite a contiguous run of directory entries to a single VPID |
| RVEH_INC_COUNTER = 63 | ehash_adjust_local_depth | ehash_rv_increment | Adjust local_depth_count[i] by ±delta |
| RVEH_INIT_DIR = 64 | ehash_create_helper | ehash_rv_init_dir_redo | Initialize the dir root page on file create |
| RVEH_INIT_NEW_DIR_PAGE = 65 | ehash_initialize_dir_new_page | ehash_rv_init_dir_new_page_redo | Initialize a directory expansion page |

The dispatch table that wires these into the recovery function array sits in transaction/recovery.c:415-452. The record’s data buffer carries enough payload that the handler can be re-run idempotently on the page that the LSN identifies — for RVEH_INSERT that’s (short rec_type, OID, key_bytes); for RVEH_DELETE redo it’s the same plus the slot id; for RVEH_DELETE undo it’s (EHID, OID, key_bytes) because the undo handler must re-locate the bucket (it cannot trust the LSN’s page identity — split or merge may have moved the record).

The undo discipline is logical for inserts and deletes, physical for splits and merges (page-image replace). Logical-undo for inserts means ehash_rv_insert_undo looks up the entry by EHID + key and calls ehash_rv_delete to remove it; physical undo would have to reverse the slotted-page state byte-for-byte and would break if the bucket has been split since the original insert. The split/merge path uses physical undo (full DB_PAGESIZE image) because the page is already X-latched under a system op and there’s no safe way to compute a “logical reverse” of a redistribution; replaying the page image is correct and simple. The cost — one full page image per modified page — is bounded by the system-op envelope: a single split logs at most three page-images (source bucket, sibling bucket, one dir page), which is tolerable.
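
The distinction reduces to two very different compensating actions (a sketch with hypothetical names; the page size and the delete callback are placeholders):

// undo_shapes_sketch — physical page-image undo vs. logical by-key undo
#include <string.h>
#include <stdbool.h>

#define PAGE_SIZE_SKETCH 16384  /* hypothetical page size */

/* Physical undo (split/merge path): restore the saved before-image of the
 * whole page — always correct, one page image per modified page in the log. */
static void
physical_page_undo_sketch (char *page, const char *before_image)
{
  memcpy (page, before_image, PAGE_SIZE_SKETCH);
}

/* Logical undo (insert path): the record may have migrated to another bucket
 * since the insert, so undo re-resolves the key through the current directory
 * and deletes wherever the record lives now — no page id in the payload. */
typedef struct hash_file_sketch hash_file_sketch;       /* opaque in this sketch */
typedef bool (*delete_by_key_fn) (hash_file_sketch *hf, const void *key);

static bool
logical_insert_undo_sketch (hash_file_sketch *hf, const void *key, delete_by_key_fn delete_by_key)
{
  return delete_by_key (hf, key);
}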

Use cases inside CUBRID — the four call sites

Verified by grep -rn "ehash_\|EHID\|xehash_" src/:

  1. Class-name → OID directory. EHID classname_table lives in the database boot parameter (boot_sr.c:122). It is created at first-time database format with xehash_create (… DB_TYPE_STRING, …, rootclass_oid, …, false) (boot_sr.c:5001) and persists for the database’s lifetime. Lookups (ehash_search), inserts on CREATE TABLE and deletes on DROP TABLE route through locator_* (locator manager) wrappers; the underlying calls are the standard public API.
  2. Catalog representation directory. Inside system_catalog.c, the catalog index catalog_id_p->xhid is created with xehash_create (… DB_TYPE_OBJECT, 1, oid_Root_class_oid, -1, false) (system_catalog.c:2635). It maps a class OID to its catalog representation entry. DROP CLASS removes the mapping with ehash_delete (thread_p, &catalog_Id.xhid, key) (system_catalog.c:2230).
  3. UPDATE/DELETE OID dedup. qexec_init_upddel_ehash_files (query_executor.c:9822) allocates xasl->upd_del_class_cnt temporary EHIDs at the start of each multi-class UPDATE/DELETE plan, with xehash_create (… DB_TYPE_OBJECT, -1, NULL, 0, true) — note is_tmp == true so logging is suppressed. During execution qexec_upddel_add_unique_oid_to_ehid (query_executor.c:1015) calls ehash_search first; if the OID is found, the row is NULL-ed out as already processed, otherwise ehash_insert records it. The files are destroyed at statement end via qexec_destroy_upddel_ehash_files.
  4. Hash-list / hash-set scans. src/query/query_hash_scan.{c,h} defines FILE_HASH_SCAN structures that embed EHID ehid and reuse the directory file format (via file_create_ehash_dir) for the materialized hash side of HASH-LIST-SCAN and HASH-SET-SCAN operators. This is the only call site that does not go through xehash_create — it builds only the directory portion and runs its own bucket logic on top of it.

There are no other callers in the engine. In particular, user-defined indexes never use extendible hashing — only B+Tree (btree.c).

  • Key types in production: only DB_TYPE_STRING and DB_TYPE_OBJECT. The implementation has code under ENABLE_UNUSED_FUNCTION for DB_TYPE_DOUBLE, DB_TYPE_FLOAT, DB_TYPE_INTEGER, DB_TYPE_BIGINT, DB_TYPE_SHORT, DB_TYPE_DATE, DB_TYPE_TIME, DB_TYPE_TIMESTAMP, DB_TYPE_DATETIME, DB_TYPE_MONETARY — but those branches are unreachable in current builds because the assert at the top of ehash_create_helper forbids any other key type.
  • Long-string overflow: also ENABLE_UNUSED_FUNCTION-gated. The EHASH_DIR_HEADER::overflow_file field is initialized to VFID_NULL and never populated in current builds, even for class names — the assert (strlen ((char *) key_p) < 256) in ehash_insert_helper enforces a 256-byte cap on string keys and the long-string code paths are dead.
  • xehash_destroy (extendible_hash.c:1259) only works on temporary hash files (it uses file_destroy (…, true)). Permanent files are deleted indirectly via file_destroy on the directory and bucket VFIDs at higher levels (e.g. during DROP TABLE for the class-name table during database wipe).
  • No MULTI_VERSION integration. Extendible hash is pre-MVCC: there is no version chain, no MVCC visibility filter, no BTREE_OP_PURPOSE-style enum. Concurrent visibility is enforced at a higher layer (the call site serializes class-name modifications with a directory-level lock from lock_manager.c).

Anchor on symbol names, not line numbers. Position hints are scoped to the document’s updated: date.

  • EHID (storage_common.h) — public 12-byte handle { VFID vfid; INT32 pageid }.
  • EH_SEARCH (storage_common.h) — return type of ehash_search (EH_KEY_FOUND, EH_KEY_NOTFOUND, EH_ERROR_OCCURRED).
  • EHASH_DIR_HEADER (extendible_hash.c) — directory root-page header.
  • EHASH_DIR_RECORD (extendible_hash.c) — single-VPID directory entry.
  • EHASH_BUCKET_HEADER (extendible_hash.c) — slot-0 one-byte bucket header carrying local depth.
  • EHASH_REPETITION (extendible_hash.c) — the (VPID, count) payload of RVEH_CONNECT_BUCKET.
  • EHASH_HASH_KEY (extendible_hash.c) — the unsigned int pseudo-key type.
  • EHASH_RESULT enum (extendible_hash.c) — internal return code (SUCCESSFUL_COMPLETION, BUCKET_FULL, BUCKET_UNDERFLOW, BUCKET_EMPTY, FULL_SIBLING_BUCKET, NO_SIBLING_BUCKET, ERROR_OCCURRED).
  • GETBITS, FIND_OFFSET, GETBIT, SETBIT, CLEARBIT (extendible_hash.c) — bit manipulation macros.
  • EHASH_OVERFLOW_THRESHOLD, EHASH_UNDERFLOW_THRESHOLD, EHASH_DIR_HEADER_SIZE, EHASH_NUM_FIRST_PAGES, EHASH_NUM_NON_FIRST_PAGES, EHASH_HASH_KEY_BITS (extendible_hash.c) — page geometry constants.
  • RVEH_* recovery indices (transaction/recovery.h) — the eight-record family.
  • xehash_create (extendible_hash.c) — constructor for permanent or temporary hash files.
  • xehash_destroy (extendible_hash.c) — destructor for temporary hash files only.
  • ehash_search (extendible_hash.c) — point lookup.
  • ehash_insert (extendible_hash.c) — upsert.
  • ehash_delete (extendible_hash.c) — point delete.
  • ehash_map (extendible_hash.c) — full-scan apply.
  • ehash_dump, ehash_print_bucket, ehash_dump_bucket (extendible_hash.c) — debug dumpers.
  • ehash_dir_locate (extendible_hash.c) — directory index → (page_no, offset) translator.
  • ehash_find_bucket_vpid (extendible_hash.c) — given a directory header and an entry index, return the bucket VPID.
  • ehash_find_bucket_vpid_with_hash (extendible_hash.c) — hash-then-locate convenience wrapper used by all three CRUD operations.
  • ehash_fix_old_page, ehash_fix_ehid_page, ehash_fix_nth_page (extendible_hash.c) — page-buffer fix wrappers with PAGE_EHASH ptype check.
  • ehash_hash (extendible_hash.c) — type-dispatched entry.
  • ehash_hash_string_type (extendible_hash.c) — folding + mht_*strhash triple-pass.
  • ehash_hash_eight_bytes_type (extendible_hash.c) — OID-aware folding.
  • ehash_get_pseudo_key (extendible_hash.c) — re-hashes a record extracted from a bucket (used during split).
  • ehash_initialize_bucket_new_page (extendible_hash.c) — page-init callback for file_alloc.
  • ehash_compose_record (extendible_hash.c) — pack (OID, key) into a RECDES.
  • ehash_write_key_to_record (extendible_hash.c) — copy key bytes by type.
  • ehash_compare_key (extendible_hash.c) — type-dispatched three-way compare.
  • ehash_binary_search_bucket (extendible_hash.c) — log-N search inside a slotted bucket.
  • ehash_locate_slot (extendible_hash.c) — wrapper that handles the empty-bucket edge.
  • ehash_get_key_size (extendible_hash.c) — fixed-size key types.
  • ehash_insert_helper (extendible_hash.c) — top-level orchestrator with S→X promotion.
  • ehash_insert_to_bucket (extendible_hash.c) — straight insert; emits RVEH_INSERT undo+redo.
  • ehash_insert_to_bucket_after_create (extendible_hash.c) — bucket creation when the directory entry was NULL.
  • ehash_insert_bucket_after_extend_if_need (extendible_hash.c) — bucket-full retry that calls split.
  • ehash_extend_bucket (extendible_hash.c) — split + optional dir-doubling orchestrator under sysop.
  • ehash_split_bucket (extendible_hash.c) — physical bucket-split (allocate sibling, distribute records).
  • ehash_find_first_bit_position (extendible_hash.c) — smallest local-depth bump that actually separates records.
  • ehash_distribute_records_into_two_bucket (extendible_hash.c) — slot-by-slot move of records to the sibling.
  • ehash_expand_directory (extendible_hash.c) — directory doubling, including new-page allocation.
  • ehash_initialize_dir_new_page (extendible_hash.c) — page-init callback for new directory pages.
  • ehash_delete (extendible_hash.c) — public delete.
  • ehash_check_merge_possible (extendible_hash.c) — merge feasibility check (sibling exists, has same depth, has room).
  • ehash_merge (extendible_hash.c) — re-validates and performs the merge.
  • ehash_merge_permanent (extendible_hash.c) — physical record migration to sibling.
  • ehash_find_depth (extendible_hash.c) — derive the new local depth from the directory neighborhood after a merge.
  • ehash_shrink_directory_if_need (extendible_hash.c) — histogram-driven shrink trigger (> 1 slack).
  • ehash_shrink_directory (extendible_hash.c) — directory halving (or 4×, 8×, …) with page deallocation.
  • ehash_connect_bucket (extendible_hash.c) — rewrite a range of directory entries to one VPID; emits RVEH_CONNECT_BUCKET.
  • ehash_adjust_local_depth (extendible_hash.c) — apply delta to local_depth_count[depth]; emits RVEH_INC_COUNTER.
  • ehash_rv_init_bucket_redo (extendible_hash.c) — redo of RVEH_INIT_BUCKET.
  • ehash_rv_init_dir_redo (extendible_hash.c) — redo of RVEH_INIT_DIR.
  • ehash_rv_init_dir_new_page_redo (extendible_hash.c) — redo of RVEH_INIT_NEW_DIR_PAGE.
  • ehash_rv_insert_redo / ehash_rv_insert_undo (extendible_hash.c) — RVEH_INSERT (physical redo, logical undo).
  • ehash_rv_delete_redo / ehash_rv_delete_undo (extendible_hash.c) — RVEH_DELETE (physical redo, logical undo).
  • ehash_rv_delete (extendible_hash.c) — internal logical delete used by _rv_insert_undo; does not check merge conditions and does not log undo (it generates a compensating (CLR) redo only).
  • ehash_rv_increment (extendible_hash.c) — RVEH_INC_COUNTER.
  • ehash_rv_connect_bucket_redo (extendible_hash.c) — RVEH_CONNECT_BUCKET.
  • RV_fun[] registrations in transaction/recovery.c:415-452.
  • or_pack_ehid / or_unpack_ehid (object_representation.c) — 12-byte EHID on-disk form.
  • OR_GET_EHID / OR_PUT_EHID (object_representation.h) — primitive byte-level codecs.
  • ehash_read_oid_from_record / ehash_write_oid_to_record (extendible_hash.c) — OID byte order in bucket records.
  • ehash_read_ehid_from_record / ehash_write_ehid_to_record (extendible_hash.c) — EHID payload codec for log records.

| Symbol | File | Line |
| --- | --- | --- |
| EHID (struct) | storage_common.h | 212 |
| EH_SEARCH (enum) | storage_common.h | 381 |
| OR_EHID_SIZE | object_representation_constants.h | 87 |
| EHASH_HASH_KEY (typedef) | extendible_hash.c | 99 |
| EHASH_DIR_HEADER (struct) | extendible_hash.c | 103 |
| EHASH_DIR_RECORD (struct) | extendible_hash.c | 120 |
| EHASH_BUCKET_HEADER (struct) | extendible_hash.c | 127 |
| EHASH_RESULT (enum) | extendible_hash.c | 132 |
| EHASH_REPETITION (struct) | extendible_hash.c | 144 |
| GETBITS / FIND_OFFSET (macros) | extendible_hash.c | 193 |
| ehash_dir_locate | extendible_hash.c | 384 |
| ehash_initialize_bucket_new_page | extendible_hash.c | 628 |
| ehash_initialize_dir_new_page | extendible_hash.c | 712 |
| ehash_rv_init_dir_new_page_redo | extendible_hash.c | 735 |
| xehash_create | extendible_hash.c | 866 |
| ehash_get_key_size | extendible_hash.c | 879 |
| ehash_create_helper | extendible_hash.c | 954 |
| ehash_fix_old_page | extendible_hash.c | 1188 |
| ehash_fix_ehid_page | extendible_hash.c | 1218 |
| ehash_fix_nth_page | extendible_hash.c | 1236 |
| xehash_destroy | extendible_hash.c | 1259 |
| ehash_find_bucket_vpid | extendible_hash.c | 1295 |
| ehash_find_bucket_vpid_with_hash | extendible_hash.c | 1327 |
| ehash_search | extendible_hash.c | 1379 |
| ehash_insert | extendible_hash.c | 1461 |
| ehash_insert_to_bucket_after_create | extendible_hash.c | 1472 |
| ehash_extend_bucket | extendible_hash.c | 1546 |
| ehash_insert_bucket_after_extend_if_need | extendible_hash.c | 1641 |
| ehash_insert_helper | extendible_hash.c | 1715 |
| ehash_insert_to_bucket | extendible_hash.c | 1837 |
| ehash_compose_record | extendible_hash.c | 2172 |
| ehash_compare_key | extendible_hash.c | 2223 |
| ehash_binary_search_bucket | extendible_hash.c | 2401 |
| ehash_locate_slot | extendible_hash.c | 2479 |
| ehash_get_pseudo_key | extendible_hash.c | 2509 |
| ehash_find_first_bit_position | extendible_hash.c | 2545 |
| ehash_distribute_records_into_two_bucket | extendible_hash.c | 2605 |
| ehash_split_bucket | extendible_hash.c | 2684 |
| ehash_expand_directory | extendible_hash.c | 2792 |
| ehash_connect_bucket | extendible_hash.c | 3067 |
| ehash_find_depth | extendible_hash.c | 3183 |
| ehash_check_merge_possible | extendible_hash.c | 3274 |
| ehash_delete | extendible_hash.c | 3418 |
| ehash_shrink_directory_if_need | extendible_hash.c | 3615 |
| ehash_adjust_local_depth | extendible_hash.c | 3633 |
| ehash_merge_permanent | extendible_hash.c | 3655 |
| ehash_merge | extendible_hash.c | 3771 |
| ehash_shrink_directory | extendible_hash.c | 3982 |
| ehash_hash_string_type | extendible_hash.c | 4156 |
| ehash_hash_eight_bytes_type | extendible_hash.c | 4244 |
| ehash_hash | extendible_hash.c | 4347 |
| ehash_apply_each | extendible_hash.c | 4395 |
| ehash_map | extendible_hash.c | 4518 |
| ehash_dump | extendible_hash.c | 4598 |
| ehash_rv_init_bucket_redo | extendible_hash.c | 5023 |
| ehash_rv_init_dir_redo | extendible_hash.c | 5072 |
| ehash_rv_insert_redo | extendible_hash.c | 5089 |
| ehash_rv_insert_undo | extendible_hash.c | 5126 |
| ehash_rv_delete_redo | extendible_hash.c | 5181 |
| ehash_rv_delete_undo | extendible_hash.c | 5220 |
| ehash_rv_delete | extendible_hash.c | 5293 |
| ehash_rv_increment | extendible_hash.c | 5405 |
| ehash_rv_connect_bucket_redo | extendible_hash.c | 5429 |
| RVEH_REPLACE … RVEH_INIT_NEW_DIR_PAGE | transaction/recovery.h | 105-112 |
| RV_fun[] ehash entries | transaction/recovery.c | 415-452 |
| xehash_create (boot, classname_table) | transaction/boot_sr.c | 5001 |
| xehash_create (catalog rep dir) | storage/system_catalog.c | 2635 |
| ehash_delete (catalog rep dir) | storage/system_catalog.c | 2230 |
| qexec_init_upddel_ehash_files | query/query_executor.c | 9822 |
| qexec_destroy_upddel_ehash_files | query/query_executor.c | 9875 |
| qexec_upddel_add_unique_oid_to_ehid | query/query_executor.c | 1015 |

A grep across the source tree (grep -rn "ehash_\|EHID\|xehash_" src/) confirms the four call sites listed in ## CUBRID's Approach § "Use cases inside CUBRID":

  • boot_sr.c: boot_Db_parm->classname_table is the only permanent-EHID created at database boot. Lookups go through locator_* wrappers that ultimately call ehash_search.
  • system_catalog.c: catalog representation directory created at server start; ehash_delete is called on DROP CLASS.
  • query_executor.c: temporary EHIDs for UPDATE/DELETE OID dedup, created with is_tmp=true so logging is suppressed.
  • query_hash_scan.c: reuses only the directory-file format, not the bucket logic.

Subsystems that do not use extendible hashing today, contrary to what one might guess from the public API:

  • User indexes. CREATE INDEX … USING HASH is not parsed; user indexes are always B+Tree (btree.c).
  • Unique-constraint violation tracking. Despite the document brief’s hint, current btree_unique plumbing does not reach into extendible_hash.c. The unique-constraint enforcement path in btree.c (btree_find_oid_and_its_page, BTREE_NEED_UNIQUE_CHECK) and the global stats table (btree_unique_stats, GLOBAL_UNIQUE_STATS_TABLE) work entirely within the B+Tree subsystem; no ehash_* calls appear in btree.c or btree_unique.{cpp,hpp}.
  • Catalog scan / catalog name lookup beyond representation directory. Catalog tables are heap-stored and accessed by B+Tree; only the rep-id directory uses extendible hash.

The implementation contains a substantial volume of code guarded by ENABLE_UNUSED_FUNCTION (long-string overflow, multi-byte numeric key types, in-bucket record sorting fallbacks). None of those branches is reachable in current builds — the asserts in ehash_create_helper and the cap on key length in ehash_insert_helper actively fence them off. A future change that needs broader key support or oversized keys would have to revive (and reverify) those paths.

The on-disk layout of a bucket record is (OID, key_bytes) — the OID comes first, before the key. This is the opposite of a B+Tree leaf record, where the key precedes the OID. The reason: extendible hash records are physically positioned in the bucket by their key (the binary search compares keys), but the payload puts the value first because callers of ehash_search only need the OID, and ehash_read_oid_from_record can read it from the front of the record without scanning the variable-length key.
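
A self-contained sketch of that layout (the OID stand-in is declared locally so the sketch compiles on its own; its exact field shapes are an assumption, and the function names are hypothetical):

// bucket_record_sketch — (OID || key_bytes) layout, value first
#include <stdint.h>
#include <string.h>

typedef struct
{
  int32_t pageid;       /* assumed field shapes; the real OID lives in oid.h */
  int16_t slotid;
  int16_t volid;
} oid_sketch;

/* a lookup can read the OID from offset 0 without touching the
 * variable-length key that follows it */
static void
read_oid_from_record_sketch (const char *rec, oid_sketch *out)
{
  memcpy (out, rec, sizeof (oid_sketch));
}

static size_t
compose_record_sketch (char *rec, const oid_sketch *oid, const char *key, size_t key_len)
{
  memcpy (rec, oid, sizeof (oid_sketch));               /* OID first */
  memcpy (rec + sizeof (oid_sketch), key, key_len);     /* key bytes after it */
  return sizeof (oid_sketch) + key_len;
}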

The ehash_create_helper documentation comment claims the function “creates two files on this volume: one for the directory and one for the buckets” — but in fact, the bucket file’s volid comes from ehid->vfid.volid, not from the directory’s volid, because the directory’s VFID isn’t yet assigned at the time the bucket file is created (bucket_vfid.volid = ehid_p->vfid.volid; precedes file_create_ehash_dir). The two files end up on the same volume because the dir-file creation later sets dir_vfid.volid = bucket_vfid.volid. The end result is correct; the reading order is mildly counter-intuitive.

The ehash_get_key_size function distinguishes DB_TYPE_STRING by returning 1 rather than the key length, because string keys have variable length determined per record. This is a sentinel; callers check key_type == DB_TYPE_STRING explicitly and bypass the size computation.

  1. Is the PG-style “hash bucket overflow chain” worth adding for catalog lookups? CUBRID’s bucket size is one page; if a degenerate hash collision puts more than one page worth of records into the same hash address, the implementation handles it by recursively splitting until the local depth is large enough. With a 32-bit pseudo-key and mht_*strhash the collision space is empirically small, but a pathological class-naming convention (e.g. all classes named with the same 8-character prefix) could in principle exhaust the addressing depth. PostgreSQL’s per-bucket overflow chain handles this case by accepting a long chain on one bucket rather than doubling the directory. Investigation: what global-depth values are observed in real CUBRID deployments with thousands of classes?

  2. Why slot-0 holds the bucket header. The choice of using slot 0 of a slotted page for EHASH_BUCKET_HEADER rather than reserving page-header space is unusual. The likely reason is reuse of the heap-page slot machinery (spage_initialize, spage_get_record, spage_insert) without a separate “ehash page” page-buffer integration — the cost is a one-byte allocator overhead per page (the slot-0 record). Verification: is there a code path that ever scans the bucket starting at slot 1 and assumes the header is in the page header rather than slot 0?

  3. Concurrent-split correctness with optimistic S latch on dir. ehash_insert_helper holds an S latch on the dir root while it discovers a bucket-full, then releases and re-acquires X. Between the release and the X-fix, another thread could split the same bucket. The X-fixed insert would then see a non-full bucket on the second try, the “BUCKET_FULL” branch would not be taken, and the insert would succeed without a second split. Is this benign in all cases, or is there a window where the new hash location (after the directory state change) differs from what the first-pass ehash_find_bucket_vpid_with_hash computed? The recursive call starts from the top, so it should re-resolve the bucket — but the comment in ehash_find_bucket_vpid_with_hash does not explicitly document the safety property.

  4. Why EHASH_OVERFLOW_RATE = 0.9 and EHASH_UNDERFLOW_RATE = 0.4? The 0.9 / 0.4 split looks deliberate (the gap prevents oscillation) but the specific values are not parameterized and not commented. Was there empirical workload tuning, or are these inherited from a textbook reference?

  5. Logical-undo vs physical-undo: when does a split-undo need to know about previous merges? The split path logs full page images (physical) for the bucket and sibling under a system op. If a transaction inserts (causing a split), then deletes (causing a merge of the sibling back into the bucket), then aborts, the undo replays the delete-undo (which is a logical re-insert) followed by the split-undo (which restores the pre-split bucket image). Does the order produce a consistent state, or is there a constraint on system-op ordering that’s not visible in the code?

  6. Why does RVEH_CONNECT_BUCKET log as undoredo and not as paired undo+redo? The single record carries a before-image (bef_length bytes from the page) and an after-payload (EHASH_REPETITION). The undo handler must therefore know how to invert the repetition into a list of distinct VPIDs — but that’s not what ehash_rv_connect_bucket_redo does. Either the undo path actually replays the original bef_length bytes (in which case it’s “physical undo, structured redo”), or there’s a missing handler. Investigation path: read RV_fun[] entry for RVEH_CONNECT_BUCKET to see whether a separate undo function is registered or whether log_rv_copy_char is used as a generic page-image-replay undo.

  7. Why no ehash_destroy for permanent files? xehash_destroy only handles temporary files. The permanent class-name table is never destroyed individually — it lives for the database’s lifetime and is reaped along with the database itself when the volume is dropped. If a future feature needed to destroy a permanent extendible hash file (e.g., for online schema-renames at a database level), the implementation would need a permanent-file variant; the current shape suggests no such requirement has surfaced.

  • knowledge/code-analysis/cubrid/cubrid-btree.md — the alternative on-disk index that user indexes use; same WAL/page-buffer substrate, different access pattern.
  • knowledge/code-analysis/cubrid/cubrid-page-buffer-manager.md — pgbuf_fix, PGBUF_LATCH_READ / PGBUF_LATCH_WRITE, pgbuf_set_dirty discipline used by every ehash routine.
  • knowledge/code-analysis/cubrid/cubrid-heap-manager.md — the slotted-page substrate that bucket pages reuse.
  • knowledge/code-analysis/cubrid/cubrid-log-manager.md — log_append_undoredo_data2, system-ops, and the RV_fun[] recovery dispatch table.
  • knowledge/code-analysis/cubrid/cubrid-recovery-manager.md — redo/undo passes that consume RVEH_* records.
  • knowledge/code-analysis/cubrid/cubrid-catalog-manager.md — the rep-id directory caller.

Textbook chapters (under knowledge/research/dbms-general/)

  • Database Internals (Petrov), Ch. 7 §“Hash Files” — the global-vs-local-depth model.
  • Fagin, Nievergelt, Pippenger & Strong, Extendible Hashing — A Fast Access Method for Dynamic Files, ACM TODS 4(3), 1979 — the original paper.
  • Litwin, Linear Hashing: A New Tool for File and Table Addressing, VLDB 1980 — the alternative.

CUBRID source (/data/hgryoo/references/cubrid/)

  • src/storage/extendible_hash.c — the implementation.
  • src/storage/extendible_hash.h — the public API.
  • src/storage/storage_common.h — EHID, EH_SEARCH.
  • src/storage/system_catalog.c, system_catalog.h — catalog rep-id directory caller.
  • src/transaction/boot_sr.c — class-name table caller and database-boot creator.
  • src/transaction/recovery.h, recovery.c — RVEH_* codes and RV_fun[] registrations.
  • src/query/query_executor.c — UPDATE/DELETE OID-dedup caller (qexec_init_upddel_ehash_files, qexec_upddel_add_unique_oid_to_ehid).
  • src/query/query_hash_scan.{c,h} — directory-file reuse for HASH-LIST-SCAN / HASH-SET-SCAN.
  • src/base/object_representation.c, object_representation.h, object_representation_constants.h — or_pack_ehid / OR_EHID_* codecs.
  • src/storage/file_manager.c, file_manager.h — file_create_ehash, file_create_ehash_dir, FILE_EHASH_DES.