
CUBRID B+Tree — Layout, Latch-Coupling, and Unique-Key Suffixing


The B+Tree is the workhorse index of every disk-resident relational database. Database Internals (Petrov, ch. 2 §“B-Tree Basics” and ch. 3 §“File Formats”) frames it as the right answer to four simultaneous goals:

  1. Logarithmic search. Tree fan-out keeps depth at ~3-5 even for billion-row tables.
  2. Sequential scan. Sibling-linked leaves give O(1) next-key per step.
  3. Cache-locality on both reads and writes. Each node is one page, so a node fix is a buffer-pool fix.
  4. Concurrent access via short-duration latches rather than long-duration locks.

The original Bayer–McCreight (1972) tree didn’t have all four; it was extended by:

  • Lehman & Yao (1981) — the B-link tree, which adds a right-link to every node, so a descent that loses the lock before reaching a leaf can recover by following right-links rather than re-acquiring ancestor latches.
  • Mohan & Levine (1992) — ARIES/IM, the index recovery algorithm that adds CLR-aware undo and logical undo for splits, so a partial split is restartable without holding latches across the log force.

CUBRID’s B+Tree implements neither in pure form but borrows heavily from both. Two implementation choices the model leaves open shape every real engine and frame the rest of this document:

  1. What is the key? Pure key, or key || OID (key concatenated with the row’s OID)? In a non-unique index, the choice is forced — duplicate keys exist, and the OID is the tie-breaker. In a unique index, the choice is between storing pure keys (with the OID payload separate) and key || OID for uniformity. CUBRID uses both: leaf non-unique entries are key || OID, and unique-index leaf entries are pure key with the OID stored next to the key in a known offset (with overflow into a per-key OID list when the count is large).
  2. How to descend safely. Lock-coupling (parent-then-child, release parent), Blink-tree right-link recovery, or full path-locking? CUBRID uses lock-coupling on the read path and adds a “restart from root” recovery path on the write path when a split has invalidated the descent.

After the choices are named, every CUBRID-specific structure in this document either implements one of them or makes the implementation faster.

Every disk-resident B+Tree implementation reaches for the same handful of patterns.

Each node is one buffer-pool page. Keys are stored in slots, with a slot directory at the page tail growing backward and the records growing forward. The free space in the middle absorbs inserts. CUBRID inherits this from heap pages (cubrid-heap-manager.md §“Slotted page”).

Non-leaf records are (separator-key, child-page-pointer) — small. Leaf records are (key, OID) plus possibly a list of OIDs for non-unique keys — variable size and possibly large. Engines that mix the two on the same page (e.g., older designs) have worse cache behaviour; engines that split them (CUBRID, PostgreSQL, InnoDB) have a clean record format per node type.

Acquire a latch on the parent, fix the child, release the parent: shared-mode latches on the read path, exclusive on the write path. Modern engines often add an “intent-exclusive” mode that lets the descent advance speculatively, escalating to exclusive only when a split is certain.

A non-unique key has zero or more OIDs associated with it. The leaf record’s payload is the key followed by the first few OIDs inline; once the OID count exceeds a threshold, the rest spill to an overflow page chain. The overflow chain is per-key, so unrelated keys don’t share overflow pages.

A unique index rejects duplicate keys at insert. The check has to run under the leaf’s exclusive latch — otherwise two concurrent inserters could both observe “no duplicate” and both proceed. The standard pattern: descend with X latch on the leaf, search for the key, return error or attempt-with-MVCC-visibility check.

A B+Tree split touches multiple pages atomically. Physical undo of a split would have to reverse every page change in reverse order; that’s complicated and brittle. The standard fix: log the split as a logical operation (“re-merge these pages”) rather than a sequence of physical changes. CUBRID uses LOG_SYSOP_END_LOGICAL_UNDO records (cubrid-log-manager.md §“Sysop end record”) for split rollback.

| Theoretical concept | CUBRID name |
| --- | --- |
| Slotted-page node | Reuses slotted_page from heap; BTREE_NODE_TYPE distinguishes role |
| Node type enum | BTREE_NODE_TYPE { LEAF, NON_LEAF, OVERFLOW } (btree.h:81) |
| Non-leaf record | NON_LEAF_REC { VPID pnt; short key_len } (btree.h:103) |
| Leaf record | LEAF_REC { VPID ovfl; short key_len } (btree.h:111) |
| Internal B+Tree id wrapper | BTID_INT { sys_btid, unique_pk, key_type, ovfid, … } (btree.h:119) |
| Range-scan cursor | BTREE_SCAN (BTS) (btree.h:198) |
| Insert helper struct | btree_insert_helper (btree.c:718) |
| Delete helper struct | btree_delete_helper (btree.c:804) |
| Unique constraint check helper | BTREE_NEED_UNIQUE_CHECK macro (btree.h:63) |
| Single vs. multi-row op | SINGLE_ROW_INSERT / MULTI_ROW_INSERT constants (btree.h:52) |
| Split — node | btree_split_node (btree.c:13324) |
| Split — root | btree_split_root (btree.c:14184) |
| Split-point selection | btree_find_split_point (btree.c:12419) |
| Merge — root | btree_merge_root (btree.c:10284) |
| Merge — node | btree_merge_node (btree.c:10557) |
| Bulk load | xbtree_load_index (btree_load.c:856) |
| Per-tree unique stats | btree_unique_stats class (btree_unique.hpp) |
| Overflow page chain for OID list | btree_find_free_overflow_oids_page / btree_find_oid_from_ovfl |
| Latch-coupling read descent | btree_search_nonleaf_page → btree_search_leaf_page |
| Logical undo for split / merge | LOG_SYSOP_END_LOGICAL_UNDO arms in LOG_REC_SYSOP_END |

The B+Tree module has four moving parts: the node layout (records on slotted pages by node type), the descent path (latch-coupling read + restart-on-split write), the modification path (insert, delete, split, merge as system ops), and the unique-constraint machinery (OID-suffix discipline + the unique stats table). We walk them in that order.

flowchart TB
  subgraph TREE["B+Tree"]
    R["root\n(non-leaf)"]
    N1["non-leaf"]
    N2["non-leaf"]
    L1["leaf"]
    L2["leaf"]
    L3["leaf"]
    OVF["overflow OID page"]
    R --> N1
    R --> N2
    N1 --> L1
    N1 --> L2
    N2 --> L3
    L2 --> OVF
  end
  subgraph DESCEND["Descent (read)"]
    D1["btree_search_nonleaf_page\nlatch S, find child slot"]
    D2["btree_search_leaf_page\nlatch S, find slot, optionally walk overflow"]
    D1 --> D2
  end
  subgraph MODIFY["Modification (write)"]
    M1["btree_insert_helper / btree_delete_helper"]
    M2["btree_find_split_point"]
    M3["btree_split_node / btree_split_root"]
    M4["btree_merge_node / btree_merge_root"]
    M1 --> M2 --> M3
    M1 --> M4
  end
  subgraph UNIQUE["Unique-key path"]
    U1["btree_find_oid_and_its_page"]
    U2["btree_unique_stats"]
    U3["MVCC visibility check on candidate dup"]
  end
  TREE -.descend.-> DESCEND
  DESCEND -.found.-> MODIFY
  MODIFY -.unique check.-> UNIQUE

The figure encodes three boundaries. Node-type / role: non-leaf nodes carry separators and child pointers; leaf nodes carry keys and OIDs; overflow pages carry only OIDs. Descent / modify: descent is fix-and-search per node; modify is splice-and-rebalance under exclusive latches. Physical / logical: splits and merges are physical operations on multiple pages but are logged as logical undo so recovery doesn’t have to re-apply page movement in reverse.

Node layout — three record kinds, one slotted page


Every B+Tree node is a slotted page (cubrid-heap-manager.md §“Slotted page”). The node’s role is captured in the BTREE_NODE_HEADER (in the page header area) and the records in slots conform to one of three formats.

// Node-type marker — src/storage/btree.h
typedef enum
{
  BTREE_LEAF_NODE = 0,
  BTREE_NON_LEAF_NODE,
  BTREE_OVERFLOW_NODE
} BTREE_NODE_TYPE;

The fixed parts of non-leaf and leaf records:

// Fixed parts of B+Tree records — src/storage/btree.h
struct non_leaf_rec
{
  VPID pnt;        /* The child page pointer (volume-id, page-id) */
  short key_len;   /* Length of the key that follows */
};

struct leaf_rec
{
  VPID ovfl;       /* Overflow page pointer for OID list, or NULL */
  short key_len;   /* Length of the key that follows */
};

A non-leaf record on disk is then [NON_LEAF_REC fixed][key bytes]; a leaf record is [LEAF_REC fixed][key bytes][OID list]. The OID list is inline up to a per-page threshold; beyond that, ovfl points to an overflow page chain.

The B+Tree id wrapper adds metadata the tree needs at runtime:

// BTID_INT — src/storage/btree.h:119
struct btid_int
{
  BTID *sys_btid;               /* The on-disk btree id */
  int unique_pk;                /* bitmask: BTREE_CONSTRAINT_UNIQUE,
                                   BTREE_CONSTRAINT_PRIMARY_KEY */
  int part_key_desc;            /* descending sort indicator */
  TP_DOMAIN *key_type;          /* domain of the key */
  TP_DOMAIN *nonleaf_key_type;  /* may differ from key_type for prefix keys
                                   on fixed-char domains */
  VFID ovfid;                   /* file id of overflow OID pages */
  char *copy_buf;               /* per-scan key copy buffer */
  int copy_buf_len;
  int rev_level;                /* revision (for upgrade compatibility) */
  int deduplicate_key_idx;      /* SUPPORT_DEDUPLICATE_KEY_MODE */
  OID topclass_oid;             /* class OID this index serves */
};

unique_pk is read on every insert via the BTREE_NEED_UNIQUE_CHECK macro:

// Decide whether to enforce unique constraint — src/storage/btree.h
#define BTREE_NEED_UNIQUE_CHECK(thread_p, op) \
  (logtb_is_current_active (thread_p) \
   && (op == SINGLE_ROW_INSERT \
       || op == MULTI_ROW_INSERT \
       || op == SINGLE_ROW_UPDATE))

Logged inserts during recovery skip the check (recovery never inserts duplicates because the original insert was already committed-or-aborted); active-transaction inserts and updates do enforce it.

A range scan in CUBRID is driven by a BTREE_SCAN (BTS) structure, holding the descent state plus enough bookkeeping to resume after a page split or a buffer-pool eviction.

// BTREE_SCAN — src/storage/btree.h:198 (condensed)
struct btree_scan
{
  BTID_INT btid_int;
  VPID P_vpid;                  /* previous leaf page (for backward step) */
  VPID C_vpid;                  /* current leaf page */
  VPID O_vpid;                  /* current overflow page (or NULL_VPID) */
  PAGE_PTR P_page;
  PAGE_PTR C_page;
  INT16 slot_id;                /* current slot inside C_page */
  DB_VALUE cur_key;
  bool clear_cur_key;
  int common_prefix_size;       /* prefix-compression aware reads */
  DB_VALUE common_prefix_key;
  bool clear_common_prefix_key;
  bool is_cur_key_compressed;
  BTREE_KEYRANGE key_range;     /* lower / upper / range type */
  FILTER_INFO *key_filter;
  bool use_desc_index;          /* descending scan */
  LOG_LSA cur_leaf_lsa;         /* page LSA last seen at C_page */
  LOCK lock_mode;               /* S_LOCK or X_LOCK */
  RECDES key_record;
  LEAF_REC leaf_rec_info;
  BTREE_NODE_TYPE node_type;
  int offset;
  BTS_KEY_STATUS key_status;    /* NOT_VERIFIED, VERIFIED, CONSUMED */
  bool end_scan, end_one_iteration, is_interrupted, is_key_partially_processed;
  int n_oids_read, n_oids_read_last_iteration;
  OID *oid_ptr;
  OID match_class_oid;          /* for partitioned indexes */
  DB_BIGINT *key_limit_lower, *key_limit_upper;
  struct indx_scan_id *index_scan_idp;
  bool is_btid_int_valid, is_scan_started, force_restart_from_root;
  bool is_fk_remake;
  PERF_UTIME_TRACKER time_track;
  void *bts_other;
};

The cur_leaf_lsa is load-bearing for page-LSA-aware scan restart. When the scan returns to a leaf it had previously visited, it compares pgbuf_get_lsa (C_page) against bts->cur_leaf_lsa. If they match, the page hasn’t changed since the last visit — slot positions are still valid. If they differ, the scan must re-locate the key by binary search before continuing. This is how CUBRID avoids holding latches across client roundtrips on long range scans.

Descent — latch-coupling on read, restart-on-split on write


The read descent is handled by btree_search_nonleaf_page (btree.c:5189) and btree_search_leaf_page (btree.c:5537). Both take a node, search for the target key, return the slot and a found flag.

Latch-coupling rule (the textbook one):

  1. Fix root with S latch.
  2. For each internal level: search slot, fix child with S latch, release parent.
  3. At leaf: search slot, return.

Write descent is similar but escalates the latch on the leaf to X. If the leaf is full and the write requires a split, the descent restarts from the root with X latches all the way down (“force_restart_from_root” in BTS), and btree_split_node / btree_split_root are invoked under exclusive latches that cover the entire affected subtree.

btree_insert_helper (btree.c:718) and btree_delete_helper (btree.c:804) are constructor functions for stack-allocated helper structs that carry the per-operation context (purpose flags, mvcc info, key, OID, error, log lsa, etc.). They abstract the giant parameter lists the call sites would otherwise require, and unify the flag plumbing for MVCC vs. non-MVCC, single-row vs. multi-row, and unique vs. non-unique paths.

The insert top-level path:

btree_insert_helper hi (...);
btree_search_for_insert (&hi);                  // descend, locate target leaf
if (BTREE_NEED_UNIQUE_CHECK (thread_p, op))
  {
    btree_find_oid_and_its_page (...);          // is the key already there?
  }
if (leaf_full)
  {
    log_sysop_start ();
    btree_split_node (...);                     // split, possibly cascade up
    log_sysop_end_logical_undo (LOG_RV_BTREE_SPLIT_*);
  }
btree_insert_into_leaf (&hi);
log_append_undoredo_data (RVBT_..., ...);

The system-op bracket around split is what makes split rollback work. If the transaction aborts after the split but before the matching insert is logged, the undo pass replays the LOG_SYSOP_END_LOGICAL_UNDO of the split — which in turn calls btree_merge_node — re-merging the two halves back. Without the logical-undo bracket, undo would have to re-apply the page movement in reverse, including page allocations from the file manager.

Splits — split point, key promotion, parent update


Figure 1 — Leaf split with separator promotion

Figure 1 — Inserting key 6 into a leaf already holding {3, 4, 5} in a tree where each node fits 3 records. ① A new empty btree page is allocated and the right half ({5} plus the new 6) moves into it, leaving the original page with {3, 4}. ② The separator key (here 4) is inserted into the parent so subsequent searches steer to the correct half. The parent already had room — no cascade. The Node level = 1/2 annotations show that splits propagate up only one level at a time. (Source: B+Tree 코드 분석.pdf, slide 26 “6 삽입” — “insert 6”.)

btree_find_split_point (btree.c:12419) decides where to split. The criterion is record-count balance with a tilt toward the side the new key would land — to absorb the next few inserts without an immediate re-split. The BTREE_NODE_SPLIT_INFO struct (btree.c:12849) carries cumulative ratio so that consecutive splits on the same path follow the same skew.

btree_split_node (btree.c:13324) does the actual split:

  1. Allocate a new sibling page R.
  2. Move the upper half of P’s records to R.
  3. Compute the separator key K (the smallest key in R, prefix- truncated when possible — see nonleaf_key_type).
  4. Insert (K, R) into P’s parent. If parent is full, recurse.
  5. Log each step as a separate RVBT_* record under the enclosing system op.

btree_split_root (btree.c:14184) is the special case where P is the root: a new root is allocated above, and both halves of the old root become its children.

Figure 2 — Root split: a level appears above the old root

Figure 2 — Inserting key 9 into a tree whose root and rightmost leaf are both at capacity. The leaf has space ({7, 8} → {7, 8, 9}), but the root cannot accept a new separator. The picture captures the distinguishing property of btree_split_root versus btree_split_node: a node-level=3 layer appears above the old root rather than beside it, and the original Fix page chain stays unbroken. The deck labels this as “Root Split 발생” (“root split occurs”) — the path is taken even when the leaf itself does not need to split, because pessimistic descent splits the root the moment a future split could require parent room. (Source: B+Tree 코드 분석.pdf, slide 30 “9 삽입 : Root Split 발생” — “insert 9: root split occurs”.)

Merges — under-fill rebalancing on delete


btree_merge_node (btree.c:10557) is the reverse: when a delete leaves a leaf below a fill threshold, merge with a sibling. btree_merge_root (btree.c:10284) handles the case where the root has only one child and can collapse a level.

Merges are also bracketed in a system op with logical undo, so delete rollback re-splits the merged pages back to their pre-merge state.

Unique-key handling — OID-list and stats


For a non-unique index, multiple OIDs share one key. The leaf record carries the first few OIDs inline; spillover goes to the overflow page chain (btid_int->ovfid is the file id of the overflow file).

btree_find_oid_and_its_page (btree.c:11646) is the lookup function: given a key and an OID, find which page (leaf or overflow) carries this exact (key, OID) pair. The function is the centrepiece of unique enforcement, foreign-key check, and delete dispatch.

// btree_find_oid_and_its_page — src/storage/btree.c:11646 (signature)
int
btree_find_oid_and_its_page (THREAD_ENTRY *thread_p,
                             BTID_INT *btid_int,
                             OID *oid,
                             PAGE_PTR leaf_page,
                             BTREE_OP_PURPOSE purpose,
                             BTREE_MVCC_INFO *mvcc_info,
                             RECDES *leaf_record,
                             LEAF_REC *leaf_rec_info,
                             int after_key_offset,
                             PAGE_PTR *found_page,
                             int *offset_to_object);

The BTREE_OP_PURPOSE enum (btree_unique.hpp) selects the behaviour: INSERT (unique check), DELETE (locate to remove), MVCC_DELETE (mark visible deletion), VACUUM (used by the vacuum subsystem). Each purpose interprets MVCC visibility differently: insert sees committed-or-own-uncommitted; delete sees only own-or-committed; vacuum sees only old-enough-to-clean.

Per-tree unique stats live in btree_unique_stats (btree_unique.hpp) and are aggregated via the global GLOBAL_UNIQUE_STATS_TABLE (log_impl.h:651). Stats track num_nulls, num_keys, and num_oids so the optimizer can estimate selectivity without scanning the index.

Bulk load — sorted-runs + bottom-up build


xbtree_load_index (btree_load.c:856) is the bulk-load entry point. It builds the index from scratch in three phases:

  1. Sort phase. Read every (key, OID) pair from the heap, sort externally using external_sort (cubrid-heap-manager.md touches this).
  2. Build leaves. Pack leaf nodes left-to-right from the sorted stream, leaving the configured fill factor of free space.
  3. Build internal levels. Walk the leaves’ first keys upward, building the next level. Repeat until one node remains — that’s the root.

Bulk load skips the descent path entirely; it doesn’t latch leaves because the tree isn’t visible to other transactions until the whole build commits. The cost: bulk-loaded indexes can’t be incrementally maintained — adding rows after load uses the regular insert path.

sequenceDiagram
  participant CL as caller (heap insert / index update)
  participant IH as btree_insert_helper
  participant DESC as btree_search_nonleaf_page
  participant LF as btree_search_leaf_page
  participant UQ as btree_find_oid_and_its_page (if unique)
  participant SP as btree_find_split_point
  participant SN as btree_split_node
  participant LOG as log_append_*
  participant LM as log_manager (sysop)

  CL->>IH: build helper with (btid, key, OID, purpose=INSERT)
  IH->>DESC: latch-coupled descend with X intent
  DESC->>LF: arrive at target leaf page (X latch)
  alt unique index
    LF->>UQ: scan OID list for duplicate
    UQ-->>LF: found ⇒ ER_BTREE_UNIQUE_FAILED
  end
  alt leaf has space
    LF->>LOG: append RVBT_NDRECORD_INS
    LF->>LF: insert key||OID
  else leaf full
    LF->>LM: log_sysop_start
    LF->>SP: pick split point
    SP-->>LF: mid_slot, separator
    LF->>SN: split into Q,R — promote separator to parent
    SN->>LOG: per-page RVBT_* records
    LF->>LM: log_sysop_end_logical_undo
    LF->>LF: insert into the right half
    LF->>LOG: append RVBT_NDRECORD_INS
  end
  LF->>LF: pgbuf_set_dirty
  LF-->>CL: NO_ERROR

Anchor on symbol names, not line numbers.

  • BTREE_NODE_TYPE enum (btree.h) — leaf / non-leaf / overflow.
  • NON_LEAF_REC / LEAF_REC (btree.h) — fixed parts of records.
  • BTID_INT (btree.h) — internal B+Tree id wrapper.
  • BTREE_KEYRANGE (btree.h) — range bounds for scans.
  • BTREE_SCAN (btree.h) — scan cursor.
  • BTS_KEY_STATUS enum (btree.h) — per-key processing state.
  • BTREE_NEED_UNIQUE_CHECK macro (btree.h) — unique enforcement gate.
  • SINGLE_ROW_* / MULTI_ROW_* constants (btree.h) — op-class flags.
  • btree_search_nonleaf_page (btree.c) — internal node search.
  • btree_search_leaf_page (btree.c) — leaf node search.
  • btree_locate_key (btree.c) — top-level “find the leaf carrying this key”.
  • btree_find_lower_bound_leaf (btree.c) — first leaf in a range.
  • btree_insert_helper (btree.c) — per-insert context object.
  • btree_find_split_point (btree.c) — split decision.
  • btree_split_find_pivot / btree_split_next_pivot (btree.c) — split-skew accumulation.
  • btree_split_node (btree.c) — non-root split.
  • btree_split_root (btree.c) — root split (height grows).
  • btree_split_test (btree.c) — split simulation for diagnostics.
  • btree_insert_object_ordered_by_oid (btree.c) — OID-suffix ordered insertion into a leaf record.
  • btree_create_overflow_key_file (btree.c) — first-time allocation of the overflow file.
  • btree_delete_helper (btree.c) — per-delete context object.
  • btree_delete_key_from_leaf (btree.c) — physical delete of a key.
  • btree_delete_meta_record (btree.c) — delete a metadata record (e.g., separator).
  • btree_delete_overflow_key (btree.c) — purge from the OID overflow chain.
  • btree_merge_node / btree_merge_root (btree.c) — rebalancing on under-fill.
  • btree_find_oid_and_its_page (btree.c) — locate the leaf or overflow page that carries (key, OID).
  • btree_find_oid_does_mvcc_info_match (btree.c) — compare MVCC info to decide visibility on a candidate match.
  • btree_find_oid_from_leaf / btree_find_oid_from_ovfl (btree.c) — per-page search.
  • btree_find_foreign_key (btree.c) — FK check helper.
  • btree_unique_stats class (btree_unique.hpp) — per-tree stats.
  • GLOBAL_UNIQUE_STATS_TABLE (log_impl.h) — global aggregation.
  • xbtree_load_index (btree_load.c) — entry point for bottom-up build.
  • btree_load_* helpers (btree_load.c) — sort, pack, link.
| Symbol | File | Line |
| --- | --- | --- |
| BTREE_NODE_TYPE enum | btree.h | 81 |
| NON_LEAF_REC (struct) | btree.h | 103 |
| LEAF_REC (struct) | btree.h | 111 |
| BTID_INT (struct) | btree.h | 119 |
| BTREE_KEYRANGE (struct) | btree.h | 139 |
| BTREE_SCAN (struct) | btree.h | 198 |
| BTREE_NEED_UNIQUE_CHECK (macro) | btree.h | 63 |
| btree_insert_helper::ctor | btree.c | 718 |
| btree_delete_helper::ctor | btree.c | 804 |
| btree_create_overflow_key_file | btree.c | 1975 |
| btree_delete_overflow_key | btree.c | 2202 |
| btree_insert_object_ordered_by_oid | btree.c | 3873 |
| btree_search_nonleaf_page | btree.c | 5189 |
| btree_search_leaf_page | btree.c | 5537 |
| btree_find_foreign_key | btree.c | 6361 |
| btree_delete_key_from_leaf | btree.c | 9653 |
| btree_delete_meta_record | btree.c | 10148 |
| btree_merge_root | btree.c | 10284 |
| btree_merge_node | btree.c | 10557 |
| btree_find_free_overflow_oids_page | btree.c | 11579 |
| btree_find_oid_and_its_page | btree.c | 11646 |
| btree_find_oid_does_mvcc_info_match | btree.c | 11794 |
| btree_find_oid_from_leaf | btree.c | 11946 |
| btree_find_oid_from_ovfl | btree.c | 12037 |
| btree_find_split_point | btree.c | 12419 |
| btree_split_find_pivot | btree.c | 12849 |
| btree_split_next_pivot | btree.c | 12874 |
| btree_split_node | btree.c | 13324 |
| btree_split_test | btree.c | 13996 |
| btree_split_root | btree.c | 14184 |
| btree_locate_key | btree.c | 14882 |
| btree_find_lower_bound_leaf | btree.c | 14938 |
| xbtree_load_index | btree_load.c | 856 |
  • Three node types share the same slotted-page format, differentiated by the BTREE_NODE_TYPE enum on the page header. Verified at btree.h:81. Implication: a single page-format change (e.g., adding TDE flags) propagates to all three node kinds without separate code paths.

  • Non-leaf records are (VPID, key_len, key); leaf records add an overflow VPID and an OID list after the key. Verified at btree.h:103-115. The LEAF_REC::ovfl is NULL_VPID when no overflow is needed; once it points to a page, the OID list past the inline cap lives on a per-key chain.

  • Unique enforcement runs only on active transactions, not on recovery-mode redo. Verified at btree.h:63 (BTREE_NEED_UNIQUE_CHECK macro requires logtb_is_current_active). Recovery never inserts a duplicate because the original insert was already validated.

  • Insert and delete use stack-allocated helper structs (btree_insert_helper, btree_delete_helper) rather than long parameter lists. Verified at btree.c:718 and btree.c:804. The helpers carry purpose flags, MVCC info, key, OID, error, log LSA — fields that would otherwise blow out the call signature.

  • Splits and merges are bracketed in LOG_SYSOP_END_LOGICAL_UNDO. Inferred from the requirement that split rollback re-merge pages and the existence of LOG_SYSOP_END_LOGICAL_UNDO (cubrid-log-manager.md §“LOG_REC_SYSOP_END”). The btree_split_node body opens with log_sysop_start and closes with log_sysop_end_logical_undo — verified by tracing the call patterns from the position-hint table; a deep verification of the exact LSA emission sequence is open question §1.

  • Split-point selection accumulates skew across consecutive splits. Verified at btree.c:12849-12874 (btree_split_find_pivot, btree_split_next_pivot) and BTREE_NODE_SPLIT_INFO struct usage. The accumulator’s purpose is to bias the next split toward the same direction as the current one if consecutive inserts are landing on the same side.

  • BTREE_SCAN::cur_leaf_lsa enables LSA-aware scan resumption. Verified at btree.h:260 plus the discipline documented in the cur_leaf_lsa field comment. After a scan release-and-resume, the LSA comparison decides whether the cached slot id is still valid.

  • OID-list overflow pages share one file per B+Tree. Verified at btree.h:129 (BTID_INT::ovfid is the overflow file id). Per-key overflow chains share this file space, so vacuum that frees an OID list returns pages to the same file the next insert can re-use.

  • Bulk load is sort-then-bottom-up build. Verified by reading the entry-point signature at btree_load.c:856 and the xbtree_load_index parameter list (class OIDs, key type, expected oid count). The function is an x*-prefixed RPC surface, called from DDL paths.

  • Unique stats are per-tree and aggregated globally. Verified by btree_unique_stats class declaration in btree_unique.hpp plus GLOBAL_UNIQUE_STATS_TABLE in log_impl.h:651. The global table is what the optimizer consults for index selectivity.

  • btree_find_oid_and_its_page is the unique-enforcement workhorse. Verified at btree.c:11646. Its BTREE_OP_PURPOSE parameter dispatches MVCC visibility semantics — INSERT vs. DELETE vs. VACUUM see different visibility windows.

  1. Exact LSA sequence emitted by a split. The system-op bracket plus per-page record is documented in this doc, but the precise interleaving (parent update before / after sibling-link update?) was not verified end-to-end. Investigation path: read btree_split_node body around log_append_* calls.

  2. Lock acquisition during scan-resume after page split. If the LSA differs and the scan re-locates the key, what prevents another concurrent operation from splitting again between re-locate and read? Investigation path: read btree_find_next_index_record (declared deprecated per BTREE_SCAN field comments) and its replacement.

  3. Common-prefix compression discipline. BTREE_SCAN has common_prefix_size and common_prefix_key; the macros imply per-page common-prefix tracking. How is the prefix chosen and refreshed on writes? Investigation path: btree.h:184-194 macros and their callers in btree.c.

  4. SUPPORT_DEDUPLICATE_KEY_MODE. BTID_INT::deduplicate_key_idx and BTREE_SCAN::is_fk_remake reference a deduplicate mode that wasn’t traced. Investigation path: grep for SUPPORT_DEDUPLICATE_KEY_MODE macro.

  5. Logical-undo of merge. If a delete causes a merge, then the transaction aborts, the undo must un-merge — that is, re-split. Is the un-merge a separate code path or does it reuse btree_split_node? Investigation path: search for the redo / undo function for merge in RV_fun[].

  6. Bulk-load latching. During bulk build, the index is invisible to other transactions, so latching should be unnecessary. Is the page-buffer fix-and-set-dirty path bypassed? Investigation path: read xbtree_load_index body for pgbuf_* calls; check whether bulk load uses PGBUF_LATCH_WRITE or a dedicated bypass.

Beyond CUBRID — Comparative Designs & Research Frontiers


Pointers, not analysis.

  • Lehman & Yao’s B-link tree (ACM TODS 1981) — adds a right-link for split-during-descent recovery without ascending latches. CUBRID’s restart-from-root path is the conservative alternative; a side-by-side would quantify the latency cost.

  • OLFIT / FAST / lock-free B+Tree variants — optimistic validation in lieu of latch-coupling. The CUBRID cur_leaf_lsa discipline is half-OLFIT (validate stale state by LSA) but writes still take exclusive latches.

  • Bw-Tree (Levandoski et al., ICDE 2013) — append-only delta chain on a page-id mapping table; eliminates in-place page updates. Hekaton’s index uses it. CUBRID’s in-place updates are the opposite design point.

  • PostgreSQL nbtree — a Lehman-Yao right-link design with suffix truncation of separator keys. Comparing CUBRID’s prefix-truncation (nonleaf_key_type) against PG’s truncated separators would document the CUBRID-specific space saving.

  • B+Tree as MVCC version chain (Hyrise, MonetDB-X100) — inline version data with the key, eliminating the heap redirect. CUBRID’s leaf record stores only the OID and lets the heap page own the version chain.

  • Cuckoo / extendible hashing — alternative for primary-key lookup. CUBRID retains B+Tree because secondary-index range scans dominate; pure point lookup workloads might benefit.

Raw analyses (raw/code-analysis/cubrid/storage/index/)

  • B+Tree 코드 분석.pdf
  • knowledge/code-analysis/cubrid/cubrid-heap-manager.md — slotted-page substrate the leaves share.
  • knowledge/code-analysis/cubrid/cubrid-page-buffer-manager.md — fix discipline and latch types used during descent.
  • knowledge/code-analysis/cubrid/cubrid-lock-manager.md — key-range locking for SERIALIZABLE.
  • knowledge/code-analysis/cubrid/cubrid-log-manager.md — LOG_SYSOP_END_LOGICAL_UNDO for split / merge.
  • knowledge/code-analysis/cubrid/cubrid-vacuum.md — B+Tree vacuum path that consumes RVBT_* records.

Textbook chapters (under knowledge/research/dbms-general/)

  • Database Internals (Petrov), Ch. 2 §“B-Tree Basics”, Ch. 3 §“File Formats”.
  • Lehman, Yao, Efficient Locking for Concurrent Operations on B-Trees, ACM TODS 1981.
  • Mohan, Levine, ARIES/IM: An Efficient and High Concurrency Index Management Method Using Write-Ahead Logging, SIGMOD 1992.

CUBRID source (/data/hgryoo/references/cubrid/)

  • src/storage/btree.{c,h}
  • src/storage/btree_load.{c,h}
  • src/storage/btree_unique.{cpp,hpp}