CUBRID B+Tree — Code-Level Deep Dive
Where this document fits: The high-level analysis
cubrid-btree.mdcovers design intent and theoretical background. This document traces every branch and field at the code level. Each chapter is self-contained, but reading in order follows the full lifecycle of a single index key inside the kernel.
Contents:
Chapter 1: Data Structure Map
Section titled “Chapter 1: Data Structure Map”Question: what in-memory and on-disk structures describe a B+Tree node, record, and key/OID cell, and how do their fields relate? Theory lives in cubrid-btree.md (### Slotted-page nodes, ### Non-leaf vs. leaf record split, ### OID-list suffix in non-unique leaves); this chapter pins down the fields. Layering: slotted page -> node header (slot HEADER = 0) -> per-key records (NON_LEAF_REC/LEAF_REC fixed prefix + variable payload).
1.1 The slotted-page substrate: spage_header and spage_slot
Section titled “1.1 The slotted-page substrate: spage_header and spage_slot”// spage_header -- src/storage/slotted_page.hstruct spage_header{ PGNSLOTS num_slots; PGNSLOTS num_records; INT16 anchor_type; /* ANCHORED / ANCHORED_DONT_REUSE_SLOTS / UNANCHORED_* */ unsigned short alignment; int total_free; int cont_free; int offset_to_free_area; /* records grow up to here */ int reserved1; int flags; /* always SPAGE_HEADER_FLAG_NONE today */ unsigned int is_saving:1; /* space saving for recovery (undo) */ unsigned int need_update_best_hint:1; unsigned int reserved_bits:30;};| Field | Role | Why it exists |
|---|---|---|
num_slots | allocated slot entries (live + reusable) | slot array length; PGSLOTID index space |
num_records | slots currently holding a record | separates live from freed/reusable slots |
anchor_type | slot-stability policy; enum ANCHORED, ANCHORED_DONT_REUSE_SLOTS, UNANCHORED_ANY_SEQUENCE, UNANCHORED_KEEP_SEQUENCE | B+Tree uses UNANCHORED_KEEP_SEQUENCE; slot id = the key’s sorted position, later slots shift on insert/delete (Ch 3) |
alignment | record start alignment | BTREE_MAX_ALIGN so MVCCID/OID reads align |
total_free | all free bytes, may be fragmented | pre-insert feasibility check |
cont_free | contiguous run at offset_to_free_area | fits in place only if <= cont_free, else compact |
offset_to_free_area | record/free-space frontier | new bytes written here, then it advances |
reserved1, flags, reserved_bits | padding / future use | 8-byte alignment, forward-compatible |
is_saving | recovery space reservation active | rollback can restore freed space |
need_update_best_hint | heap stats hint | unused by B+Tree |
// spage_slot -- src/storage/slotted_page.hstruct spage_slot{ unsigned int offset_to_record:14; unsigned int record_length:14; unsigned int record_type:4; /* REC_HOME, REC_NEWHOME, ... */};| Field | Role | Why it exists |
|---|---|---|
offset_to_record | where record bytes start | indirection lets records move on compaction without changing slot ids |
record_length | record byte length | bounds the RECDES handed to B+Tree code |
record_type | REC_* discriminator | B+Tree records are REC_HOME; field exists for heap reuse |
Invariant — slot/record accounting.
num_records <= num_slotsandoffset_to_free_area + cont_freemust not overrun the front-growing slot array. Enforced byspage_insert*; a violation returnsSP_DOESNT_FIT-> split (Ch 6).
Invariant — slot order is key order. B+Tree pages are initialized with
UNANCHORED_KEEP_SEQUENCE(thespage_initializecalls inbtree.c): a slot id is the key’s ordinal position in the node, andspage_insert_at/spage_deleteshift the later slots (spage_shift_slot_up/_down) to preserve that order — this is what lets descent binary-search directly over slot ids (Ch 3). The flip side: aPGSLOTIDis positional, valid only while the page stays latched; a scan that unfixes the page must re-validate via the page LSA before reusing one (Ch 7). Compaction moves record bytes, never slot entries.
1.2 The B+Tree node header and root header
Section titled “1.2 The B+Tree node header and root header”Slot HEADER (= 0) is the node header. Non-root nodes carry btree_node_header; the root carries btree_root_header whose first field is an embedded btree_node_header, so both share node-level accessors.
// btree_node_header -- src/storage/btree_load.hstruct btree_node_header{ BTREE_NODE_SPLIT_INFO split_info; VPID prev_vpid; /* leaf prev pointer */ VPID next_vpid; /* leaf next pointer */ short node_level; /* Leaf == 1, Non_leaf > 1 */ short max_key_len; int common_prefix;};| Field | Role | Why it exists |
|---|---|---|
split_info | {float pivot; int index;} | biases the next split point (Ch 6) toward observed access, not always 50/50 |
prev_vpid | left sibling page id (leaf) | leaf doubly-linked chain range scans walk (Ch 7) |
next_vpid | right sibling page id (leaf) | B-link “next”; a scan on a just-split page follows it rightward |
node_level | node height, leaves == 1 | leaf/non-leaf discriminator used everywhere |
max_key_len | largest key length in subtree | sizing / split estimation without descending |
common_prefix | shared-prefix length on this node | prefix compression (COMMON_PREFIX_UNKNOWN = -1 when uncomputed) |
// btree_node_split_info -- src/storage/storage_common.hstruct btree_node_split_info { float pivot; int index; }; /* pivot = split_slot_id/num_keys; index = insert slot after split */// btree_root_header -- src/storage/btree_load.hstruct btree_root_header{ BTREE_NODE_HEADER node; /* root is also a node */ INT64 num_oids; INT64 num_nulls; INT64 num_keys; OID topclass_oid; /* topclass oid, or NULL OID for non-unique */ int unique_pk; struct { int rev_level:16; int deduplicate_key_idx:16; } _32; VFID ovfid; /* overflow (key) file */ MVCCID creator_mvccid; char packed_key_domain[1]; /* key type for the index — ALWAYS LAST */};
// BTREE_CONSTRAINT_TYPE -- src/storage/storage_common.htypedef enum { BTREE_CONSTRAINT_UNIQUE = 0x01, BTREE_CONSTRAINT_PRIMARY_KEY = 0x02 } BTREE_CONSTRAINT_TYPE;| Field | Role | Why it exists |
|---|---|---|
node | embedded btree_node_header | root is a real node; btree_get_node_header works on it too |
num_oids | total objects indexed | unique stats / estimates; maintained by delta logging |
num_nulls | NULL keys (not stored) | unique constraints ignore NULLs; counted apart |
num_keys | distinct keys | num_oids == num_keys for a healthy unique index |
topclass_oid | owning class OID, or NULL | unique single-class vs. non-unique multi-class |
unique_pk | bitmask BTREE_CONSTRAINT_UNIQUE (0x01), BTREE_CONSTRAINT_PRIMARY_KEY (0x02) | uniqueness enforcement + fixed object sizing (Ch 5) |
_32.rev_level | on-disk revision | format gate (BTREE_CURRENT_REV_LEVEL) |
_32.deduplicate_key_idx | dedup column index, stored idx+1 | SUPPORT_DEDUPLICATE_KEY_MODE; +1 lets 0 mean unset |
ovfid | overflow-key file id | over-long keys (Ch 10) |
creator_mvccid | index-creating transaction | index visibility during online load |
packed_key_domain | serialized key TP_DOMAIN | the key type; must be last (variable-length) |
Invariant —
packed_key_domainis trailing. As a flexible array member, nothing may follow it. Enforced bybtree_init_root_header(packs last) /btree_glean_root_header_info(unpacks). A field after it overwrites domain bytes and corrupts key comparison.
Invariant — unique counters.
num_oids == num_keysmust hold after every committed mutation on a unique index (NULLs innum_nulls). Enforced bybtree_change_root_header_delta, which adjusts all three atomically with the structural change (Ch 5).
1.3 Per-key record fixed prefixes: NON_LEAF_REC and LEAF_REC
Section titled “1.3 Per-key record fixed prefixes: NON_LEAF_REC and LEAF_REC”// non_leaf_rec -- src/storage/btree.hstruct non_leaf_rec { VPID pnt; short key_len; }; /* pnt = child page pointer */// leaf_rec -- src/storage/btree.hstruct leaf_rec { VPID ovfl; short key_len; }; /* ovfl = overflow-OID page */| Struct | Field | Role | Why it exists |
|---|---|---|---|
non_leaf_rec | pnt | child VPID for keys >= separator | the descent (Ch 3) follows it down |
non_leaf_rec | key_len | separator key length (-1 if overflowed) | key bytes after pnt; on-disk prefix NON_LEAF_RECORD_SIZE = DISK_VPID_ALIGNED_SIZE |
leaf_rec | ovfl | first overflow-OID page, NULL_VPID if none | OID spillover beyond record capacity (Ch 5) |
leaf_rec | key_len | inline key length (-1 if overflow key) | bounds key bytes before the OID list; LEAF_RECORD_SIZE = 0 — no fixed prefix, the first OID is the start and leaf_rec is only the parse target |
1.4 The full leaf-record byte layout
Section titled “1.4 The full leaf-record byte layout”Left to right: first OID (inline) -> class OID (only if BTREE_LEAF_RECORD_CLASS_OID) -> insert MVCCID (only if first OID’s volid has BTREE_OID_HAS_MVCC_INSID) -> delete MVCCID (only if BTREE_OID_HAS_MVCC_DELID) -> key bytes (leaf_rec.key_len long, or a VPID if overflow key) -> trailing OID list, each OID possibly carrying its own MVCC fields -> aligned overflow VPID at the tail (only if BTREE_LEAF_RECORD_OVERFLOW_OIDS). The discriminating bits live in two reused sub-fields of the first OID: record flags in slotid, per-OID MVCC flags in volid.
// leaf record flags -- src/storage/btree.c (in first OID's slotid)#define BTREE_LEAF_RECORD_FENCE ((short) 0x1000)#define BTREE_LEAF_RECORD_OVERFLOW_OIDS ((short) 0x2000)#define BTREE_LEAF_RECORD_OVERFLOW_KEY ((short) 0x4000)#define BTREE_LEAF_RECORD_CLASS_OID ((short) 0x8000)#define BTREE_LEAF_RECORD_MASK ((short) 0xF000)| Flag | Meaning when set | Consumer |
|---|---|---|
BTREE_LEAF_RECORD_FENCE | record is a fence (split boundary), not an object | scans skip (Ch 7); splits create (Ch 6) |
BTREE_LEAF_RECORD_OVERFLOW_OIDS | leaf_rec.ovfl set; trailing aligned VPID exists | BTREE_RECORD_OR_BUF_INIT shrinks parsed length by DB_ALIGN(DISK_VPID_SIZE, BTREE_MAX_ALIGN) so the tail VPID is not read as an OID |
BTREE_LEAF_RECORD_OVERFLOW_KEY | key replaced by a VPID into ovfid | key read/compare fetches the overflow page (Ch 10) |
BTREE_LEAF_RECORD_CLASS_OID | class OID stored after first OID | objects in one record span classes (non-unique multi-class) |
// per-OID MVCC flags -- src/storage/btree.c (in each OID's volid)#define BTREE_OID_HAS_MVCC_INSID ((short) 0x4000)#define BTREE_OID_HAS_MVCC_DELID ((short) 0x8000)#define BTREE_OID_MVCC_FLAGS_MASK ((short) 0xC000)BTREE_GET_MVCC_INFO_SIZE_FROM_FLAGS derives the byte count: both set -> 2 * OR_MVCCID_SIZE, one -> OR_MVCCID_SIZE, neither -> 0.
Invariant — flag namespace separation. Record flags are
slotidbits0xF000; MVCC flags arevolidbits0xC000.OVERFLOW_KEYandHAS_MVCC_INSIDare both 0x4000 in different sub-fields. Enforced byBTREE_OID_GET_RECORD_FLAGS(slotid) vs.BTREE_OID_GET_MVCC_FLAGS(volid); mixing them misparses.
Invariant — MVCC fields decode positionally. No length prefix: presence is known only from the OID’s
volidflags, so exactly those bytes must be consumed before the next field. Enforced byBTREE_RECORD_OR_BUF_INIT+ flag macros; ad-hoc arithmetic desyncs on the first variable-MVCC object.
flowchart TD
A["start: first OID"] --> B{"slotid has<br/>CLASS_OID?"}
B -->|yes| C["read class OID"]
B -->|no| D
C --> D{"volid has<br/>HAS_MVCC_INSID?"}
D -->|yes| E["read insert_mvccid"]
D -->|no| F
E --> F{"volid has<br/>HAS_MVCC_DELID?"}
F -->|yes| G["read delete_mvccid"]
F -->|no| H
G --> H{"slotid has<br/>OVERFLOW_KEY?"}
H -->|yes| I["key field = VPID into ovfid"]
H -->|no| J["read key_len key bytes"]
I --> K["trailing OID list"]
J --> K
K --> L{"slotid has<br/>OVERFLOW_OIDS?"}
L -->|yes| M["leaf_rec.ovfl = aligned tail VPID"]
L -->|no| N["done"]
M --> N
Figure 1-1: branch-complete parse of a single leaf record.
1.5 BTID_INT: the in-memory index descriptor
Section titled “1.5 BTID_INT: the in-memory index descriptor”Gleaned from the root header by btree_glean_root_header_info, threaded through nearly every B+Tree function.
// btid_int -- src/storage/btree.hstruct btid_int{ BTID *sys_btid; int unique_pk; int part_key_desc; /* last partial-key domain is descending */ TP_DOMAIN *key_type; TP_DOMAIN *nonleaf_key_type; /* differs from key_type under prefix keys */ VFID ovfid; char *copy_buf; /* key copy buffer, from INDX_SCAN_ID */ int copy_buf_len; int rev_level; int deduplicate_key_idx; OID topclass_oid;};| Field | Role | Why it exists |
|---|---|---|
sys_btid | persistent BTID (file + root page id) | on-disk index identity |
unique_pk | uniqueness/PK bitmask | unique checks (Ch 5) + fixed object sizing |
part_key_desc | last partial-key column is DESC | reverses comparison sign (BTREE_IS_PART_KEY_DESC) |
key_type | leaf key TP_DOMAIN | full-key comparison / serialization |
nonleaf_key_type | non-leaf domain, differs only with prefix keys | fixed-char separators stored as varying counterpart; non-leaf compare uses this |
ovfid | overflow-key file | over-long keys; mirrors root ovfid |
copy_buf / copy_buf_len | key-materialization scratch | avoids per-key malloc in scans; borrowed from scan id |
rev_level | on-disk format revision | guards format-dependent paths |
deduplicate_key_idx | dedup column index | dedup-mode key handling |
topclass_oid | owning class OID | class checks; NULL for non-unique multi-class |
Invariant —
key_typevs.nonleaf_key_type. Same domain for non-prefix indexes; for fixed-char prefix indexesnonleaf_key_typeis the varying counterpart frombtree_generate_prefix_domain, and separators must compare with it, neverkey_type. The wrong domain mis-orders descent (Ch 6).
1.6 BTREE_NODE_TYPE and BTREE_OP_PURPOSE
Section titled “1.6 BTREE_NODE_TYPE and BTREE_OP_PURPOSE”// BTREE_NODE_TYPE -- src/storage/btree.htypedef enum { BTREE_LEAF_NODE = 0, BTREE_NON_LEAF_NODE, BTREE_OVERFLOW_NODE } BTREE_NODE_TYPE;| Value | Page kind | Record layout |
|---|---|---|
BTREE_LEAF_NODE | leaf (node_level == 1) | leaf-record layout of 1.4 |
BTREE_NON_LEAF_NODE | internal (node_level > 1) | NON_LEAF_REC + separator key |
BTREE_OVERFLOW_NODE | overflow-OID page | btree_overflow_header + packed fixed-size OIDs |
BTREE_OP_PURPOSE is the dispatch key for the one shared engine (btree_insert_internal/btree_delete_internal):
// btree_op_purpose -- src/storage/btree.henum btree_op_purpose{ BTREE_OP_NO_OP, BTREE_OP_INSERT_NEW_OBJECT, BTREE_OP_INSERT_MVCC_DELID, BTREE_OP_INSERT_MARK_DELETED, BTREE_OP_INSERT_UNDO_PHYSICAL_DELETE, BTREE_OP_DELETE_OBJECT_PHYSICAL, BTREE_OP_DELETE_OBJECT_PHYSICAL_POSTPONED, BTREE_OP_DELETE_UNDO_INSERT, BTREE_OP_DELETE_UNDO_INSERT_UNQ_MULTIUPD, BTREE_OP_DELETE_UNDO_INSERT_DELID, BTREE_OP_DELETE_VACUUM_OBJECT, BTREE_OP_DELETE_VACUUM_INSID, BTREE_OP_NOTIFY_VACUUM, BTREE_OP_ONLINE_INDEX_IB_INSERT, BTREE_OP_ONLINE_INDEX_IB_DELETE, BTREE_OP_ONLINE_INDEX_TRAN_INSERT, BTREE_OP_ONLINE_INDEX_TRAN_INSERT_DF, BTREE_OP_ONLINE_INDEX_UNDO_TRAN_INSERT, BTREE_OP_ONLINE_INDEX_TRAN_DELETE, BTREE_OP_ONLINE_INDEX_UNDO_TRAN_DELETE};Five families: insert, physical/logical delete, vacuum, notify-vacuum, online-index load. INSERT_MVCC_DELID (add delete MVCCID on MVCC delete) vs. INSERT_MARK_DELETED (non-MVCC unique, vacuum-exempt) matters for Ch 8; DELETE_VACUUM_OBJECT (full remove) vs. DELETE_VACUUM_INSID (insert-MVCCID-only remove) for vacuum.
1.7 BTREE_MVCC_INFO and BTREE_OBJECT_INFO: the in-memory cell
Section titled “1.7 BTREE_MVCC_INFO and BTREE_OBJECT_INFO: the in-memory cell”Parsing the 1.4 layout yields one BTREE_OBJECT_INFO per object, whose MVCC sub-record is BTREE_MVCC_INFO:
// btree_mvcc_info -- src/storage/btree.hstruct btree_mvcc_info { short flags; MVCCID insert_mvccid; MVCCID delete_mvccid; };#define BTREE_MVCC_INFO_INITIALIZER { 0, MVCCID_ALL_VISIBLE, MVCCID_NULL }// btree_object_info -- src/storage/btree.hstruct btree_object_info { OID oid; OID class_oid; BTREE_MVCC_INFO mvcc_info; };| Struct | Field | Role | Why it exists |
|---|---|---|---|
btree_mvcc_info | flags | which MVCCIDs present | in-memory mirror of per-OID volid flags |
btree_mvcc_info | insert_mvccid | creating tx id (or MVCCID_ALL_VISIBLE) | read visibility; vacuum clears once all-visible |
btree_mvcc_info | delete_mvccid | deleting tx id (or MVCCID_NULL) | logical-delete marker; physical removal awaits vacuum |
btree_object_info | oid | indexed instance | the lookup answer |
btree_object_info | class_oid | owning class | multi-class non-unique indexes + lock targeting |
btree_object_info | mvcc_info | object MVCC state | scan visibility (Ch 7) |
Invariant —
flagsmirrors stored bits.BTREE_MVCC_INFO.flagsmust equal the per-OID flags packed into the record. Enforced by the sanctioned mutatorsBTREE_MVCC_INFO_SET_FIXED_SIZE/_CLEAR_FIXED_SIZE; disagreement makes the computed size wrong and reads the next object from the wrong offset. The fixed-size rule (BTREE_OBJECT_FIXED_SIZE,btree_load.h) forcesflagsfull — OID [+ class OID if unique] + both MVCCIDs — for an overflow-page object, a non-first leaf object, or the first object of a record that has overflow pages, so overflow pages address as fixed-stride arrays.
Containment: BTID_INT is gleaned from btree_root_header (embeds btree_node_header, embeds btree_node_split_info); node records are non_leaf_rec (pnt descends to a child) or leaf_rec (ovfl chains to a btree_overflow_header page of fixed-stride OIDs); each leaf record decodes to a list of btree_object_info, each holding a btree_mvcc_info.
1.8 Chapter summary — key takeaways
Section titled “1.8 Chapter summary — key takeaways”- Three layers. Each page is a slotted page (
spage_header+spage_slot[]) whose slot 0 (HEADER) holdsbtree_node_header(non-root) orbtree_root_header(root, embedding a node header); slots 1..N hold per-key records. LEAF_RECORD_SIZEis 0. A leaf record has no fixed on-disk prefix — the first OID is the start;leaf_rec/non_leaf_recare parse targets, and onlynon_leaf_rechas a real on-disk size (NON_LEAF_RECORD_SIZE).- Flags live in reused OID sub-fields. Record flags (
BTREE_LEAF_RECORD_*, mask0xF000) in the first OID’sslotid; MVCC flags (BTREE_OID_HAS_MVCC_*, mask0xC000) in each OID’svolid. 0x4000 differs per field — mask the correct sub-field. - Leaf records decode positionally. Optional class OID, insert/delete MVCCIDs, key (inline or overflow VPID), trailing OID list, optional aligned overflow VPID — present by flag, no length prefixes; the only safe walk is
BTREE_RECORD_OR_BUF_INITplus flag macros. BTID_INTis the threaded descriptor. Gleaned from the root header, it carriesunique_pk, separatekey_type/nonleaf_key_type(diverge only under prefix keys),ovfid, and dedup metadata into nearly every operation.BTREE_OP_PURPOSEis the dispatch key. One insert/delete engine serves five families (insert, delete, vacuum, notify, online-load); the purpose distinguishes MVCC delete from non-MVCC mark-deleted and full-remove from insid-only-remove.- Counters and flags stay coherent. Root-header
num_oids/num_keys/num_nullsare kept consistent by delta logging, andBTREE_MVCC_INFO.flagsmust mirror the stored OID flags — so record sizes and unique arithmetic never desynchronize across a crash or concurrent reader.
Chapter 2: Record Encoding Decoding and Module Initialization
Section titled “Chapter 2: Record Encoding Decoding and Module Initialization”Chapter 1 mapped the in-memory structs (BTID_INT, LEAF_REC, NON_LEAF_REC, BTREE_MVCC_INFO, BTREE_OBJECT_INFO). This chapter traces how a logical (key, OID, class_oid, MVCC info) tuple becomes the raw bytes of a slotted-page record and back, then covers the two per-operation scratch structs (btree_insert_helper, btree_delete_helper) every later chapter threads down the call stack, and the one-time overflow-key file bootstrap. For the why of overflow keys, the OID-list suffix, and unique-vs-non-unique layout see the companion’s Node layout and Unique-key handling; this chapter traces the bytes, not the theory.
2.1 The on-disk record grammar
Section titled “2.1 The on-disk record grammar”A leaf record’s parsing-control bits are not a separate field — they ride in the high nibble of the first object’s slot-id half-word (OR_OID_SLOTID):
// BTREE_LEAF_RECORD_* — src/storage/btree.c#define BTREE_LEAF_RECORD_FENCE ((short) 0x1000) /* fence key, not a real object */#define BTREE_LEAF_RECORD_OVERFLOW_OIDS ((short) 0x2000) /* OID list spills to overflow page */#define BTREE_LEAF_RECORD_OVERFLOW_KEY ((short) 0x4000) /* key spilled to overflow-key file */#define BTREE_LEAF_RECORD_CLASS_OID ((short) 0x8000) /* subclass OID present after instance OID */#define BTREE_LEAF_RECORD_MASK ((short) 0xF000)A second, independent pair rides in the first object’s volid half-word (OR_OID_VOLID) and governs MVCC fields per object: BTREE_OID_HAS_MVCC_INSID (0x4000), BTREE_OID_HAS_MVCC_DELID (0x8000). An OID is 8 bytes (pageid(4) @0 | slotid(2) @4 | volid(2) @6); a disk address uses only the low bits of slotid/volid, so each high nibble is free. This is the central encoding trick: the first object’s OID doubles as the record header.
flowchart LR
subgraph First["First object (also the record header)"]
O1["instance OID 8B<br/>slotid hi-nibble = record flags<br/>volid hi-nibble = MVCC flags"]
C1["class OID 8B<br/>if CLASS_OID flag"]
I1["insert MVCCID 8B<br/>if HAS_MVCC_INSID"]
D1["delete MVCCID 8B<br/>if HAS_MVCC_DELID"]
end
KEY["key bytes<br/>(or 6B overflow VPID if OVERFLOW_KEY)"]
REST["object 2..n<br/>OID + optional MVCCIDs"]
OVF["overflow-OIDs VPID 6B<br/>if OVERFLOW_OIDS"]
O1 --> C1 --> I1 --> D1 --> KEY --> REST --> OVF
Figure 2-1. Leaf-record byte grammar. The first object precedes the key; all additional objects follow it after an align32 pad. The trailing overflow-OIDs VPID is excluded from object iteration by BTREE_RECORD_OR_BUF_INIT (§2.5). A non-leaf record has no objects: it is the NON_LEAF_REC fixed portion (child VPID + key_len) followed by the key, where a negative key_len is the non-leaf equivalent of OVERFLOW_KEY.
2.2 Measuring a key: btree_get_disk_size_of_key
Section titled “2.2 Measuring a key: btree_get_disk_size_of_key”The caller needs the key’s packed length to size the buffer and decide whether it overflows. btree_get_disk_size_of_key is a thin wrapper:
// btree_get_disk_size_of_key — src/storage/btree.cint btree_get_disk_size_of_key (DB_VALUE * key) { if (key == NULL || DB_IS_NULL (key)) { assert (key != NULL && !DB_IS_NULL (key)); /* <- NULL keys never reach here in release */ return 0; } return pr_index_writeval_disk_size (key); /* <- delegates to the type's packer */}Its only branch is the NULL guard: a NULL key is the absence of a record (counted in the root header’s num_nulls), never a zero-length cell, so reaching it is a bug. The result feeds or_init sizing in btree_write_record and the overflow threshold in §2.9.
Invariant — key size is computed once and reused. The insert path stashes the length in
btree_insert_helper::key_len_in_pageafterBTREE_GET_KEY_LEN_IN_PAGE(which collapses keys>= BTREE_MAX_KEYLEN_INPAGEtoDISK_VPID_SIZE). If this measurement and the actualindex_writevaldisagreed, the slot would be mis-sized and adjacent records corrupted. The code enforces agreement by routing both through the samepr_typepacker.
2.3 Setting record flags: btree_leaf_set_flag
Section titled “2.3 Setting record flags: btree_leaf_set_flag”Every record-flag write goes through one chokepoint:
// btree_leaf_set_flag — src/storage/btree.cstatic void btree_leaf_set_flag (RECDES * recp, short record_flag) { short slot_id; assert ((short) (record_flag & ~BTREE_LEAF_RECORD_MASK) == 0); /* <- only the F000 nibble allowed */ slot_id = OR_GET_SHORT (recp->data + OR_OID_SLOTID); /* <- read first object's slotid+flags */ OR_PUT_SHORT (recp->data + OR_OID_SLOTID, slot_id | record_flag); /* <- OR in, preserve slotid */}It reads the word at OR_OID_SLOTID, ORs in the flag, writes it back. The assert catches a stray bit (or an MVCC 0x4000/0x8000 meant for the volid word) that would corrupt the first OID’s slot-id and silently redirect the object pointer. The write is non-destructive on the low bits because a real slot-id never uses the top nibble. MVCC flags use the sibling btree_record_object_set_mvcc_flags, the identical trick on OR_OID_VOLID.
2.4 Serializing a record: btree_write_record
Section titled “2.4 Serializing a record: btree_write_record”btree_write_record builds the record for both node types into a caller-supplied RECDES, writing through an OR_BUF cursor and setting flags as it goes. Figure 2-2 traces every branch.
flowchart TD
A["or_init buf over rec->data"] --> B{node_type == LEAF?}
B -- "non-leaf" --> NL["write fixed NON_LEAF_REC"]
B -- "leaf" --> L1["or_put_oid instance OID"]
L1 --> L2{UNIQUE and class != topclass?}
L2 -- yes --> L3["or_put_oid class OID<br/>set CLASS_OID flag"]
L2 -- no --> L4
L3 --> L4{mvcc_info != NULL?}
L4 -- yes --> L7["or_put_mvccid insid if HAS_INSID<br/>or_put_mvccid delid if HAS_DELID<br/>set MVCC flags in volid"]
L4 -- no --> KT
L7 --> KT
NL --> KT{NORMAL key?}
KT -- normal --> K1["pr_type->index_writeval key"]
KT -- overflow --> K3["leaf: set OVERFLOW_KEY flag<br/>btree_store_overflow_key, write 6B VPID"]
K1 --> AL["or_put_align32"]
K3 --> AL
AL --> FIN["rec->length = ptr-buffer<br/>rec->type = REC_HOME"]
Figure 2-2. Control flow of btree_write_record, all branches.
Every or_put_* is followed by if (error_code != NO_ERROR) { assert_release (false); return error_code; } — an OR_BUF overrun is a sizing bug, not a runtime condition. Leaf-path branches:
- First object —
or_put_oidwrites the instance OID, which is also the record header (§2.1). - Class OID —
or_put_oid+btree_leaf_set_flag (rec, BTREE_LEAF_RECORD_CLASS_OID)only when unique and the object’s class differs frombtid->topclass_oid(a PK on the base table stores none, saving 8 bytes). Flag and bytes move in lockstep. - MVCC — guarded by
mvcc_info != NULL;HAS_INSID/HAS_DELIDare independentor_put_mvccidcalls.btree_record_object_set_mvcc_flags (rec->data, mvcc_info->flags)commits the volid nibble after the bytes, so a partial record never claims fields it lacks.mvcc_info == NULLwrites neither bytes nor flags (non-MVCC bulk paths).
The non-leaf branch emits btree_write_fixed_portion_of_non_leaf_record_to_orbuf and ignores oid/class_oid/mvcc_info. The shared key-write then runs: NORMAL calls pr_type->index_writeval; OVERFLOW sets OVERFLOW_KEY (leaf only — non-leaf signals overflow via negative key_len), btree_store_overflow_key chains the key, and the 6-byte VPID is written in its place. Finally or_put_align32 pads to 4 bytes (the parser relies on this — §2.6), rec->length is the cursor delta, rec->type = REC_HOME.
Invariant — flags written during serialization must exactly predict the fields the parser skips.
btree_write_recordis the only producer,btree_read_record_without_decompressionthe only consumer; they share no state but the flag bits. If the writer setOVERFLOW_KEYwithout storing the key (or vice versa), the reader’sor_advancemath walks into the key region. Hence each flag is set immediately adjacent to itsor_put, never batched.
2.5 BTREE_RECORD_OR_BUF_INIT — the iteration boundary
Section titled “2.5 BTREE_RECORD_OR_BUF_INIT — the iteration boundary”Object loops never read the whole record blindly. BTREE_RECORD_OR_BUF_INIT sets an OR_BUF whose endptr excludes the trailing overflow-OIDs VPID:
// BTREE_RECORD_OR_BUF_INIT — src/storage/btree.c#define BTREE_RECORD_OR_BUF_INIT(buf, btree_rec) \ do { int size = (btree_rec)->length; \ if (btree_leaf_is_flaged (btree_rec, BTREE_LEAF_RECORD_OVERFLOW_OIDS)) \ { size -= DB_ALIGN (DISK_VPID_SIZE, BTREE_MAX_ALIGN); } /* <- hide ovfl pointer from loops */ \ or_init (&buf, (btree_rec)->data, size); } while (false)This lets §2.6 and §2.7 loop while (buf.ptr < buf.endptr) without mistaking the 6-byte overflow VPID for an object.
2.6 Parsing a record: btree_read_record and the without_decompression core
Section titled “2.6 Parsing a record: btree_read_record and the without_decompression core”btree_read_record is a two-step shell: it calls btree_read_record_without_decompression to parse fields and locate the key, then — for in-page leaf keys that are not fences — reattaches the node’s common prefix. The core walks the bytes, skipping exactly the fields the writer claimed to write, gated on the same conditions §2.4 used:
// btree_read_record_without_decompression — src/storage/btree.c (leaf field-skip, condensed)rc = or_advance (&buf, OR_OID_SIZE); /* skip instance oid */if (BTREE_IS_UNIQUE (btid->unique_pk) && btree_leaf_is_flaged (rec, BTREE_LEAF_RECORD_CLASS_OID)) { rc = or_advance (&buf, OR_OID_SIZE); } /* skip class oid (UNIQUE && flag) */if (btree_record_object_is_flagged (rec->data, BTREE_OID_HAS_MVCC_INSID)) { rc = or_advance (&buf, OR_MVCCID_SIZE); } /* skip insert mvccid */if (btree_record_object_is_flagged (rec->data, BTREE_OID_HAS_MVCC_DELID)) { rc = or_advance (&buf, OR_MVCCID_SIZE); } /* skip delete mvccid */if (btree_leaf_is_flaged (rec, BTREE_LEAF_RECORD_OVERFLOW_KEY)) { key_type = BTREE_OVERFLOW_KEY; }if (btree_leaf_is_flaged (rec, BTREE_LEAF_RECORD_OVERFLOW_OIDS)) { btree_leaf_get_vpid_for_overflow_oids (rec, &leaf_rec->ovfl); }else { VPID_SET_NULL (&leaf_rec->ovfl); }Record flags come from the slot-id nibble (btree_leaf_is_flaged), per-object MVCC flags from the volid nibble (btree_record_object_is_flagged); the class-OID skip needs BTREE_IS_UNIQUE and the flag, matching §2.4. Each or_advance has its own if (rc != NO_ERROR) return rc; (omitted). The non-leaf branch reads the fixed NON_LEAF_REC portion and treats key_len < 0 as BTREE_OVERFLOW_KEY. Key-read branches (shared):
- NORMAL key.
clear_keyistrueonly whenkey != NULL && copy_key == COPY_KEY_VALUE; copying a MIDXKEY/char/bit string routes throughbtid->copy_bufto avoid a malloc.index_readvaladvances the cursor andleaf_rec->key_lenis set from the delta. Passingkey == NULLmakesindex_readvalskip the key without materializing it — the cheap path scans use to reach the OID list. - OVERFLOW key. Reads the 6-byte VPID; if
key != NULLit follows the chain viabtree_load_overflow_keyand forces*clear_key = true, else*clear_key = false. A failedor_get_shortasserts and nulls the key.
The tail computes after_key_offset identically for every branch: a key->need_clear override forces *clear_key = true for self-clearing types, then buf.ptr = PTR_ALIGN (buf.ptr, OR_INT_SIZE) skips the writer’s align32 and *offset = CAST_BUFLEN (buf.ptr - buf.buffer) records the byte after the key. That *offset is the module’s most-reused number — where the OID list begins, passed everywhere as after_key_offset. btree_read_record then decompresses only for a leaf, non-overflow, non-fence key when btree_node_get_common_prefix returns n_prefix > 0: it reads slot 1 (the fence) and pr_midxkey_add_prefix splices prefix + suffix into *key (n_prefix < 0 asserts). See the companion’s node-layout section for prefix mechanics.
2.7 Counting objects: btree_leaf_get_first_object and btree_record_get_num_oids
Section titled “2.7 Counting objects: btree_leaf_get_first_object and btree_record_get_num_oids”btree_leaf_get_first_object returns the header-doubling object: it runs BTREE_RECORD_OR_BUF_INIT (record_buffer, recp) then btree_or_get_object (..., BTREE_LEAF_NODE, dummy_offset = 0, oidp, class_oid, mvcc_info), asserting NO_ERROR (the first object must always parse). btree_record_get_num_oids computes the total; its branches encode the storage regimes:
// btree_record_get_num_oids — src/storage/btree.c (condensed; the three branches)if (node_type == BTREE_LEAF_NODE) { rec_oid_cnt = 1; /* the object before the key */ BTREE_RECORD_OR_BUF_INIT (buf, rec); or_seek (&buf, after_key_offset); /* jump past the key */ if (BTREE_IS_UNIQUE (btid_int->unique_pk)) { /* unique: fixed-size objects */ fixed_object_size = BTREE_OBJECT_FIXED_SIZE (btid_int); assert ((CAST_BUFLEN (buf.endptr - buf.ptr) % fixed_object_size) == 0); return rec_oid_cnt + CAST_BUFLEN (buf.endptr - buf.ptr) / fixed_object_size; } while (buf.ptr < buf.endptr) { /* non-unique: variable size, walk it */ mvcc_flag = btree_record_object_get_mvcc_flags (buf.ptr); buf.ptr += OR_OID_SIZE + BTREE_GET_MVCC_INFO_SIZE_FROM_FLAGS (mvcc_flag); rec_oid_cnt++; } return rec_oid_cnt;}return CEIL_PTVDIV (rec->length, BTREE_OBJECT_FIXED_SIZE (btid_int)); /* overflow node: all fixed */Three regimes: unique leaf — non-first objects are fixed (BTREE_OBJECT_FIXED_SIZE = 2*OID + 2*MVCCID); one division of the post-key span gives the count, the modulo assert is the corruption tripwire. Non-unique leaf — variable objects (no class OID; MVCCIDs optional), so the loop advances by OR_OID_SIZE + BTREE_GET_MVCC_INFO_SIZE_FROM_FLAGS(...), the +1 first object added up front. Overflow node — no key, all fixed: rec->length / fixed_size via CEIL_PTVDIV.
Invariant —
after_key_offsetmust point exactly at the byte after the key. All three leaf strategies depend onor_seek (&buf, after_key_offset)landing on the first post-key object; that offset is §2.6’s*offsetverbatim. The unique-leaf modulo assert catches an off-by-one offset before it silently miscounts.
2.8 The per-operation helper structs
Section titled “2.8 The per-operation helper structs”Every mutation flows through one of two heap-free scratch structs, constructed once per top-level call and threaded down the recursion, grouping what would otherwise be a dozen parameters.
btree_insert_helper — every field
Section titled “btree_insert_helper — every field”| Field | Role | Why it exists |
|---|---|---|
obj_info | BTREE_OBJECT_INFO being inserted | the payload (OID + class OID + MVCC info) |
purpose | BTREE_OP_PURPOSE discriminator | selects new-insert / MVCC-delid / online-index / undo behavior |
op_type | single vs multi insert/modify | drives unique_stats_info collection |
unique_stats_info | btree_unique_stats * accumulator | defers unique-count maintenance to a batch flush |
key_len_in_page | packed key length (post BTREE_GET_KEY_LEN_IN_PAGE) | sizes splits without re-measuring |
nonleaf_latch_mode | PGBUF_LATCH_MODE for descent | READ optimistic, WRITE when a split is anticipated (Ch 3) |
is_first_try | true on first root fix | loads B-tree info from root once |
need_update_max_key_len | max-key-len bump to propagate down | children must follow a grown max |
is_crt_node_write_latched | current node already write-latched | skips latch promotion |
is_root | current node is the root | root has special split/merge handling |
is_unique_key_added_or_deleted | key count actually changed | distinguishes key-created vs OID-appended for stats |
is_unique_multi_update | multi-row update of unique index | tolerates >1 visible object mid-statement |
is_ha_enabled | HA replication active | forbids any transient unique violation |
log_operations | er_log tracing toggle | debug instrumentation |
is_null | the key is SQL NULL | NULLs are counted, not stored |
printed_key / printed_key_sha1 | readable key + its SHA1 | diagnostics for large keys |
insert_list | btree_insert_list * | batched multi-key insert driver |
leaf_addr | LOG_DATA_ADDR of target leaf | redo-logging anchor |
rcvindex | LOG_RCVINDEX id | selects the redo handler |
rv_keyval_data / _length | packed key+value undo image | logical undo payload |
rv_redo_data / _ptr | redo buffer + write cursor | physical redo image, built incrementally |
compensate_undo_nxlsa | next-undo LSA | CLR generation during undo |
is_system_op_started | a nested system op is open | tracks pending log_sysop_* commit/abort |
time_track | PERF_UTIME_TRACKER | performance statistics |
saved_locked_oid / _class_oid (SERVER_MODE) | object locked during unique check | carries the lock across latch-drop-and-retry |
Constructor descent-defaults worth memorizing: purpose (BTREE_OP_NO_OP), nonleaf_latch_mode (PGBUF_LATCH_READ) (start optimistic), is_first_try (true) (load btree info once), is_crt_node_write_latched (false), is_root (false), is_unique_key_added_or_deleted (true) (assume the key count changes until proven otherwise), is_system_op_started (false).
btree_delete_helper — every field
Section titled “btree_delete_helper — every field”| Field | Role | Why it exists |
|---|---|---|
object_info | BTREE_OBJECT_INFO to remove | the deletion target |
second_object_info | a second object info | undo of a unique-index insert needs the displaced object |
purpose | BTREE_OP_PURPOSE discriminator | same multiplexing role as insert helper |
nonleaf_latch_mode | descent latch mode | READ-optimistic default |
op_type | operation type (default SINGLE_ROW_DELETE) | gates unique_stats_info collection |
unique_stats_info | btree_unique_stats * | batch unique-count maintenance |
match_mvccinfo | BTREE_MVCC_INFO to match | finds the specific object version in a key’s OID list |
buffered_key | OR_BUF * of the key | search by pre-packed key, avoiding re-encode |
printed_key / printed_key_sha1 | readable key + SHA1 | diagnostics for large keys |
log_operations | tracing toggle | debug logging |
is_root | current node is root | root-specific merge handling |
is_first_search | first traversal | distinguishes initial descent from post-restart retry |
check_key_deleted | possible >1 visible object (MULTI_ROW_UPDATE) | decides whether to verify the key is gone |
is_key_deleted | key actually became empty | corrects collected stats if the key survived |
leaf_addr | LOG_DATA_ADDR of leaf | redo anchor |
rv_keyval_data / _length | undo image | logical undo of the delete |
rv_redo_data / _ptr | redo buffer + cursor | physical redo image |
reference_lsa | reference LSA | recovery/compensation reference point |
is_system_op_started | nested system op open | same commit/abort bookkeeping |
time_track | PERF_UTIME_TRACKER | performance stats |
The two structs mirror each other on the recovery block (leaf_addr, rv_keyval_data*, rv_redo_data*, is_system_op_started); btree_insert_helper_to_delete_helper and its inverse copy these across when one operation switches modes (e.g. an MVCC update that deletes then inserts).
Invariant —
is_system_op_startedmust be balanced on every exit path. If the codelog_sysop_starts (as §2.9 does) and sets this flag, everyreturn/goto errormustlog_sysop_commitorlog_sysop_abort. A leaked system op holds the log-append fixed and stalls recovery; the flag lets error paths ask “did I open one?” before unwinding.
2.9 First-time overflow-key file bootstrap
Section titled “2.9 First-time overflow-key file bootstrap”The overflow-key file is created lazily, the first time a key exceeds BTREE_MAX_KEYLEN_INPAGE (= DB_PAGESIZE / 8). The insert root prep guards creation with a double-checked pattern: it tests key_len >= BTREE_MAX_KEYLEN_INPAGE && VFID_ISNULL (&btid_int->ovfid) under a read latch; if true it pgbuf_promote_read_latch’s the root to WRITE (re-fixing on ER_PAGE_LATCH_PROMOTE_FAIL), sets insert_helper->is_crt_node_write_latched = true, then re-tests VFID_ISNULL because a concurrent inserter may have created the file in the latch window. Only then does it log_sysop_start and call btree_create_overflow_key_file (aborting the sysop and goto error on failure). The system operation makes the file persist even if the surrounding insert rolls back — it is permanent metadata, not transactional data.
// btree_create_overflow_key_file — src/storage/btree.c (condensed)VFID_SET_NULL (&btid->ovfid);des.btree_key_overflow.btid = *btid->sys_btid; /* structure copy */des.btree_key_overflow.class_oid = btid->topclass_oid;assert (!OID_ISNULL (&des.btree_key_overflow.class_oid));error_code = file_create_with_npages (thread_p, FILE_BTREE_OVERFLOW_KEY, 3, &des, &btid->ovfid);if (error_code != NO_ERROR) { return error_code; }error_code = heap_get_class_tde_algorithm (thread_p, &btid->topclass_oid, &tde_algo);if (error_code != NO_ERROR) { VFID_SET_NULL (&btid->ovfid); return error_code; } /* <- roll back VFID */error_code = file_apply_tde_algorithm (thread_p, &btid->ovfid, tde_algo);if (error_code != NO_ERROR) { VFID_SET_NULL (&btid->ovfid); return error_code; } /* <- roll back VFID */return error_code;Branch-complete: it nulls ovfid up front, creates a 3-page FILE_BTREE_OVERFLOW_KEY file tagged with the index’s BTID and top-class OID, then applies the class’s TDE (transparent data encryption) algorithm so encrypted tables encrypt their overflow keys. Every error path after the file is created re-nulls ovfid — a half-initialized file never leaves a dangling VFID in BTID_INT, keeping the caller’s VFID_ISNULL re-check honest on retry. The successful VFID is later persisted into the root header. The class_oid assert guards against an orphan overflow file with no owning class.
2.10 Chapter summary — key takeaways
Section titled “2.10 Chapter summary — key takeaways”- The first object’s OID is the record header — record flags in its
slotidnibble (btree_leaf_set_flag), MVCC flags in itsvolidnibble (btree_record_object_set_mvcc_flags); the low bits stay a valid disk address. btree_write_recordandbtree_read_record_without_decompressionare a producer/consumer pair joined only by those flag bits; the writer sets each flag adjacent to its field and the reader skips exactly those fields, so flags are never batched and everyor_put/or_advancehas its own error exit.*offset/after_key_offsetis the load-bearing number — computed once afterPTR_ALIGN, thenor_seek’d to bybtree_record_get_num_oidsand later iterators;BTREE_RECORD_OR_BUF_INIThides the trailing overflow VPID from those loops.- Object counting has three regimes: unique-leaf (one division, modulo-asserted), non-unique-leaf (per-object walk via
BTREE_GET_MVCC_INFO_SIZE_FROM_FLAGS), overflow (rec->length / fixed_size); the+1first object is added explicitly in the leaf cases. btree_insert_helper/btree_delete_helperare heap-free constructor-initialized scratch structs; their defaults (PGBUF_LATCH_READ,is_first_try = true) encode Ch 3’s optimistic descent and their mirrored recovery block feeds the logging chapters.- The overflow-key file is created lazily and exactly once, behind a double-checked
VFID_ISNULLinside a system operation so it survives rollback; every error path inbtree_create_overflow_key_filere-nullsovfidto keep the re-check honest.
Chapter 3: Descent and Latch Coupling
Section titled “Chapter 3: Descent and Latch Coupling”This chapter answers: given a key, which leaf owns it, and how do we get there without corrupting the tree under concurrency? The descent is shared by every insert, delete, scan-seek, and unique probe. For latch-coupling theory and the “restart from root” claim see the companion’s ### Latch-coupling on descent; this chapter corrects it: writes split pessimistically on descent (Ch 6) and only restart when a latch promotion fails.
3.1 The descent driver and its three callbacks
Section titled “3.1 The descent driver and its three callbacks”Every traversal flows through btree_search_key_and_apply_functions — a template where callers plug in three callbacks (fix-root / advance / process-leaf):
// btree_search_key_and_apply_functions -- src/storage/btree.cstart_btree_traversal: restart = false; is_leaf = false; root_function (..., &crt_page, &is_leaf, search_key, &stop, &restart, root_args); /* 1. fix root, load BTID_INT 1st try */ if (stop) goto end; if (restart) goto start_btree_traversal; /* stop: NULL/vacuum. restart: promote failed */ while (!is_leaf) /* 2. ADVANCE until leaf reached */ { advance_function (..., &crt_page, &advance_page, &is_leaf, search_key, &stop, &restart, advance_args); if (stop) goto end; if (restart) { if (advance_page) pgbuf_unfix_and_init (thread_p, advance_page); goto start_btree_traversal; } if (!is_leaf) /* latch-couple: release parent only AFTER child fixed */ { if (crt_page) pgbuf_unfix (thread_p, crt_page); crt_page = advance_page; advance_page = NULL; } } if (key_function) { key_function (...); if (restart) goto start_btree_traversal; } /* 3. operate on leaf */The callback triples (root / advance / key): read = btree_get_root_with_key / btree_advance_and_find_key / none; insert (Ch 4-6) and delete (Ch 8-9) swap in btree_fix_root_for_insert / _for_delete plus the btree_split_node_and_advance / btree_merge_node_and_advance advances and btree_key_insert_new_object / btree_key_delete_* key functions.
INVARIANT (no page leak across restart): on
restartthe engine unfixesadvance_pagebefore jumping tostart_btree_traversal, and the loop top unfixescrt_page; every advance callback also unfixes bothcrt_pageandchild_pagebefore settingrestart = true. A forgotten unfix pins the buffer-fix counter forever.
Figure 3-1 — the descent skeleton.
stateDiagram-v2
[*] --> FixRoot : start_btree_traversal
FixRoot --> End : stop true
FixRoot --> FixRoot : restart true
FixRoot --> Advance : root fixed, not leaf
FixRoot --> ProcessLeaf : root is also leaf
Advance --> End : stop true
Advance --> FixRoot : restart true, unfix child
Advance --> Advance : child fixed, release parent, descend
Advance --> ProcessLeaf : reached leaf
ProcessLeaf --> FixRoot : restart true
ProcessLeaf --> End : done
End --> [*]
3.2 BTREE_SEARCH_KEY_HELPER — the result carrier
Section titled “3.2 BTREE_SEARCH_KEY_HELPER — the result carrier”Every page-level search writes its verdict into one tiny struct, passed by pointer down the descent and back:
// struct btree_search_key_helper -- src/storage/btree.cstruct btree_search_key_helper{ enum fence_key_presence { NO_FENCE_KEY = 0, HAS_FENCE_KEY }; BTREE_SEARCH result; /* Result of key search. */ PGSLOTID slotid; /* Slot ID of found key or slot ID of the biggest key smaller then key (if not found). */ fence_key_presence has_fence_key;};| Field | Role | Why it exists |
|---|---|---|
result | A BTREE_SEARCH value (matrix below); the primary verdict. | Callers branch on found vs absent. |
slotid | On BTREE_KEY_FOUND, the slot holding the key; when absent, the biggest key smaller than the search key; NULL_SLOTID while uninitialized or on DB_UNK. | One struct serves both “found, here it is” and “absent, here it belongs”. |
has_fence_key | HAS_FENCE_KEY if the last record examined carried BTREE_LEAF_RECORD_FENCE. | The landed-on slot is a synthetic boundary, not data — skip in scans (§3.6). |
Role matrix for the values btree_search_*_page can set (subset of the 7-member BTREE_SEARCH enum in storage_common.h; BTREE_ERROR_OCCURRED and BTREE_ACTIVE_KEY_FOUND belong to OID/version search in Ch 5/8 and are never emitted here):
| Value | Set when / slotid / reaction |
|---|---|
BTREE_KEY_FOUND | exact non-fence match; slotid = the match; read/modify |
BTREE_KEY_NOTFOUND | DB_UNK compare; slotid = NULL_SLOTID; propagate error |
BTREE_KEY_SMALLER | key < all, slot 1; slotid = 1; go to prev sibling |
BTREE_KEY_BIGGER | key > all, off right end; slotid = key_cnt + 1; go to next sibling |
BTREE_KEY_BETWEEN | absent, inside [min,max]; slotid = next-bigger slot; owning leaf, insert here |
BTREE_SEARCH_KEY_HELPER_INITIALIZER seeds it to { BTREE_KEY_NOTFOUND, NULL_SLOTID, NO_FENCE_KEY }. btree_advance_and_find_key dispatches by node type (Figure 3-2): the leaf search fills all three fields; the non-leaf search fills only slotid, never the helper.
Figure 3-2 — who fills the helper.
flowchart LR A["btree_advance_and_find_key"] -->|leaf| LP["btree_search_leaf_page\nfills result+slotid+has_fence_key"] A -->|non-leaf| NP["btree_search_nonleaf_page\nfills only slotid"] LP --> H["BTREE_SEARCH_KEY_HELPER"]
3.3 Fixing the root
Section titled “3.3 Fixing the root”All three root callbacks call btree_fix_root_with_info, which pgbuf_fixes the root at latch_mode, reads the header (btree_get_root_header), and — only when btid_int_p != NULL — gleans metadata via btree_glean_root_header_info. BTID_INT (key type, unique flag, overflow-key VFID, topclass OID, …) loads only on the first fix; on restart the callbacks pass btid_int_p = NULL to reuse it:
// btree_fix_root_for_insert -- src/storage/btree.cif (insert_helper->is_first_try) *root_page = btree_fix_root_with_info (..., insert_helper->nonleaf_latch_mode, NULL, &root_header, btid_int);else { /* restart: re-fix, reset latch flags, skip all first-try-only work */ *root_page = btree_fix_root_with_info (..., insert_helper->nonleaf_latch_mode, NULL, NULL, NULL); insert_helper->is_crt_node_write_latched = false; insert_helper->need_update_max_key_len = false; return NO_ERROR; }insert_helper->is_first_try = false; /* <- never do the below twice */First-try-only branches in btree_fix_root_for_insert (guarded so they never repeat on restart):
purpose == UNDO_PHYSICAL_DELETE/ONLINE_INDEX_UNDO_TRAN_DELETE→ fill class OID for unique, computekey_len_in_page, return.- unique index → build
btree_unique_stats incr(delete_*vsinsert_*by purpose); accumulate intounique_stats_info(multi-row) orlogtb_tran_update_unique_stats; error →goto error. is_null→*stop = true, unfix root, return;INSERT_MVCC_DELID/INSERT_MARK_DELETED→ return.- else (real insert) → if
key_len >= BTREE_MAX_KEYLEN_INPAGEand no overflow-key file exists, promote root to WRITE and create it (§3.4); setkey_len_in_page, return.
btree_fix_root_for_delete mirrors this — first search loads BTID_INT, restart re-fixes and returns. Its first-fix if on delete_helper->purpose (prose summary, not a C excerpt): the two BTREE_OP_DELETE_VACUUM_* purposes return NO_ERROR immediately; the five undo purposes (_UNDO_INSERT*, _OBJECT_PHYSICAL_POSTPONED, _ONLINE_INDEX_UNDO_TRAN_INSERT) COPY_OID btid_int->topclass_oid into a NULL unique BTREE_DELETE_CLASS_OID, then return NO_ERROR; else (asserted btree_is_delete_object_purpose) update unique stats via incr.delete_* and set *stop = true if is_null.
Both fix the root with nonleaf_latch_mode, which defaults to PGBUF_LATCH_READ — writes start optimistic, escalating to WRITE only where a node must change.
3.4 Latch-coupling and where READ becomes WRITE
Section titled “3.4 Latch-coupling and where READ becomes WRITE”- Read path (
btree_advance_and_find_key): root and each child fixedPGBUF_LATCH_READ; the driver releases the parent only after the child is fixed. No promotion. - Write paths (split/merge advance): also start
READ; a latch is promoted to WRITE only where its node must be modified.
btree_split_node_and_advance (Ch 6) fixes the child WRITE eagerly when it is a leaf’s parent or max-key-len must grow, else keeps nonleaf_latch_mode. When a split is needed but the node is still READ-latched, it promotes in place and, on failure, restarts WRITE-mode:
// btree_split_node_and_advance -- src/storage/btree.cif (need_split && insert_helper->nonleaf_latch_mode == PGBUF_LATCH_READ && !insert_helper->is_crt_node_write_latched) { error_code = pgbuf_promote_read_latch (thread_p, crt_page, PGBUF_PROMOTE_ONLY_READER); /* always ONLY_READER */ if (error_code == ER_PAGE_LATCH_PROMOTE_FAIL) { /* could not promote -> unfix both pages, switch to WRITE-mode, restart from root */ insert_helper->nonleaf_latch_mode = PGBUF_LATCH_WRITE; *restart = true; pgbuf_unfix_and_init (thread_p, child_page); pgbuf_unfix_and_init (thread_p, *crt_page); insert_helper->is_crt_node_write_latched = false; return NO_ERROR; } insert_helper->is_crt_node_write_latched = true; /* promoted in place: keep descending */ }A later sibling promote targets the child node via pgbuf_promote_read_latch (..., &child_page, PGBUF_PROMOTE_SHARED_READER); both promotes fail-and-restart identically.
INVARIANT (no latch upgrade deadlock): the current-page promote uses
PGBUF_PROMOTE_ONLY_READERand the child-page promotePGBUF_PROMOTE_SHARED_READER— both fail rather than wait when another reader holds the page, after which the thread releases all latches and restarts WRITE-mode. Blocking on a WRITE upgrade while holding a READ latch is the textbook latch-upgrade deadlock; fail-and-restart makes it impossible.
This corrects the companion’s “restart from root”: a write does not restart per split — it splits pessimistically on descent (Ch 6); a restart happens only on a promotion failure, after which the WRITE-mode descent cannot loop.
3.5 btree_search_nonleaf_page — binary search and child selection
Section titled “3.5 btree_search_nonleaf_page — binary search and child selection”Slot 1 is a dummy negative-infinity key whose only payload is a child pointer:
// btree_search_nonleaf_page -- src/storage/btree.ckey_cnt = btree_node_number_of_keys (thread_p, page_ptr);if (key_cnt <= 0) return ER_FAILED; /* underflow guard */if (key_cnt == 1) /* only the dummy neg-inf key -> follow its pointer */ { *slot_id = 1; *child_vpid = non_leaf_rec.pnt; return NO_ERROR; }left = 2; right = key_cnt; /* <- start at 2: skip dummy slot 1, then narrow as §3.6 */while (left <= right) { middle = CEIL_PTVDIV ((left + right), 2); c = btree_compare_key (key, &temp_key, btid->key_type, 1, 1, &start_col); if (c == DB_UNK) return ER_FAILED; /* incomparable -> error */ if (c == 0) { *slot_id = middle; *child_vpid = non_leaf_rec.pnt; return NO_ERROR; } /* exact separator hit */ else if (c < 0) right = middle - 1; else left = middle + 1; /* (+ remember start_col) */ }Post-loop (left > right): c < 0 → child is left of middle (*slot_id = middle - 1, re-read its pnt); c >= 0 → child is middle (*child_vpid already holds non_leaf_rec.pnt). page_bounds (the update_boundary_* calls) is NULL for the read descent; start_col is a midxkey skip cursor. The non-leaf search emits only slot_id/child_vpid, never the helper (btree_advance_and_find_key lends search_key->slotid as scratch).
3.6 btree_search_leaf_page — found, not-found, and fence-key skipping
Section titled “3.6 btree_search_leaf_page — found, not-found, and fence-key skipping”At the leaf the post-conditions are richer; fence keys complicate equality:
// btree_search_leaf_page -- src/storage/btree.cleft = 1; right = key_cnt; /* search_key zeroed to NOTFOUND/NULL_SLOTID/NO_FENCE first */while (left <= right) /* middle = CEIL_PTVDIV((left+right),2); c = btree_compare_key(...) */ { if (c == DB_UNK) { search_key->result = BTREE_KEY_NOTFOUND; ... return error; } if (c == DB_EQ) { if (btree_leaf_is_flaged (&rec, BTREE_LEAF_RECORD_FENCE)) /* matched a FENCE record */ { search_key->has_fence_key = HAS_FENCE_KEY; if (middle == 1) c = DB_GT; /* lower fence -> "go right" */ else if (middle == key_cnt) /* upper fence -> key is past end */ { search_key->result = BTREE_KEY_BIGGER; search_key->slotid = key_cnt; return NO_ERROR; } } else { search_key->result = BTREE_KEY_FOUND; search_key->slotid = middle; return NO_ERROR; } } if (c < 0) right = middle - 1; else left = middle + 1; /* (narrow + remember start_col, as §3.5) */ }A fence key is a synthetic copy of a neighbor’s boundary that bounds prefix compression (companion’s ### Node layout); it must never be reported as a real match — hence the in-loop branches turn a lower-fence hit into “go right” (c = DB_GT) and an upper-fence hit into BTREE_KEY_BIGGER.
Post-loop (key absent) it flags has_fence_key if the last record was a fence, then sets the verdict per the §3.2 matrix: c < 0 → SMALLER if middle == 1 else BETWEEN, slotid = middle; c >= 0 → BIGGER if middle == key_cnt else BETWEEN, slotid = middle + 1. Per the header comment the key can equal the upper fence only on a scan resuming after an unfix/refix (Ch 7); an #ifndef NDEBUG block cross-checks via btree_leaf_is_key_between_min_max.
3.7 btree_locate_key — the single-key descent entry
Section titled “3.7 btree_locate_key — the single-key descent entry”This one-shot entry calls the engine with the default root function and btree_advance_and_find_key, passing reuse_btid_int = true so BTID_INT is not re-gleaned. Two branches: an error early-return (*leaf_page_out = NULL); and a success path setting *found_p = (search_key.result == BTREE_KEY_FOUND) and *slot_id = search_key.slotid, returning the leaf still latched (READ) — the building block for unique probes (Ch 5) and delete (Ch 8). *found_p is true only for BTREE_KEY_FOUND; the others give found = false with slot_id at the insertion point.
3.8 btree_compare_key — the comparison primitive
Section titled “3.8 btree_compare_key — the comparison primitive”Every binary search bottoms out here (returning DB_LT/DB_EQ/DB_GT/DB_UNK):
// btree_compare_key -- src/storage/btree.cif (DB_IS_NULL (key1)) { if (DB_IS_NULL (key2)) { assert (false); return DB_UNK; } return DB_LT; } /* NULL sorts low; both-NULL impossible */if (DB_IS_NULL (key2)) return DB_GT;if (dom_type == DB_TYPE_MIDXKEY) { c = pr_midxkey_compare (..., start_colp, &dummy_diff_column, dom_is_desc, NULL); if (dom_is_desc[0]) c = ((c == DB_GT) ? DB_LT : (c == DB_LT) ? DB_GT : c); /* DESC flips leading col */ }Beyond the NULL and DESC handling shown, incomparable types or a collation mismatch yield DB_UNK — a hard error for every search. start_colp is the midxkey column-skip cursor (unused for single-column keys).
3.9 Chapter summary — key takeaways
Section titled “3.9 Chapter summary — key takeaways”- One engine, three callbacks —
btree_search_key_and_apply_functionsruns fix-root → advance → process-leaf; operations differ only in callbacks (read:btree_get_root_with_key+btree_advance_and_find_key; write: split/merge advance). BTREE_SEARCH_KEY_HELPERis the universal verdict —result,slotid(match or insertion point),has_fence_key.BTID_INTloads exactly once on the first root fix; restarts re-fix withbtid_int_p = NULL, first-try-only blocks guarded byis_first_try.- Descent is latch-coupled, optimistically READ — child fixed before parent released, escalating only where a node must change.
- Restart is the exception — the write path splits pessimistically on descent (Ch 6); a restart happens only on
ER_PAGE_LATCH_PROMOTE_FAIL, after which it runs WRITE-mode, preventing the latch-upgrade deadlock. - Non-leaf vs. leaf search differ in output —
btree_search_nonleaf_pageskips dummy slot 1 and emits(slot_id, child_vpid);btree_search_leaf_pageemits a full helper and special-cases fence keys. btree_compare_keyis deterministic — NULL sorts low, DESC flips the leading column, incomparable types returnDB_UNK(a hard error).
Chapter 4: Inserting a Brand New Key
Section titled “Chapter 4: Inserting a Brand New Key”This chapter answers one question: when the descent reaches a leaf and the
key is not yet present, how is a fresh (key, first-OID) record built and
spliced into the leaf, and how are the header counters and recovery records
updated? We trace btree_insert_internal, how btree_split_node_and_advance
hands control to the leaf, the dispatch in btree_key_insert_new_object, and
then btree_key_insert_new_key in full. OID-append to an existing key is
Chapter 5; the split branch is Chapter 6. For record layout, latch-coupling
descent, and the unique-statistics model in the abstract, see the companion
cubrid-btree.md (“Record encoding”, “Latch coupling”); we do not re-derive
them. Structs were dissected in Chapter 1; here we touch only the helper fields
this path mutates.
4.1 The orchestration entry point: btree_insert_internal
Section titled “4.1 The orchestration entry point: btree_insert_internal”All three public insert entry points funnel into btree_insert_internal, a thin
setup-and-dispatch layer that never touches a page. It builds one stack-local
BTREE_INSERT_HELPER, picks the per-key function by purpose, then delegates
the descent to btree_search_key_and_apply_functions. The
purpose-to-function dispatch is the first branch point:
// btree_insert_internal -- src/storage/btree.cswitch (purpose) { case BTREE_OP_INSERT_UNDO_PHYSICAL_DELETE: LSA_COPY (&insert_helper.compensate_undo_nxlsa, undo_nxlsa); [[fallthrough]]; case BTREE_OP_INSERT_NEW_OBJECT: key_insert_func = btree_key_insert_new_object; break; /* <- our path */ case BTREE_OP_INSERT_MVCC_DELID: case BTREE_OP_INSERT_MARK_DELETED: key_insert_func = btree_key_find_and_insert_delete_mvccid; break; /* <- logical delete */ default: assert (false); return ER_FAILED;}For a brand-new key the purpose is BTREE_OP_INSERT_NEW_OBJECT. The NULL-key
flag is computed before the switch and is decisive — a NULL key is never
stored: insert_helper.is_null = key == NULL || DB_IS_NULL (key) || btree_multicol_key_is_null (key);. The work is one call threading three
callbacks (root-fix, descend/split, leaf-process) through the generic walker
btree_search_key_and_apply_functions. After it returns, three branches of
post-processing run: (1) saved-lock release (SERVER_MODE) — any predecessor
OID stashed in insert_helper.saved_locked_oid by unique-key locking is unlocked
now, success or not; (2) error short-circuit, else bump PSTAT_BT_NUM_INSERTS
and set *unique; (3) multi-update stats correction — for MULTI_ROW_UPDATE
on a unique index with HA disabled, if the key was not added/deleted
(!is_unique_key_added_or_deleted) the speculative descent delta is reverted
(revert delete_key_and_row(), re-apply add_row()). That is the only place the
speculative key-count delta is undone.
flowchart TD
A["btree_insert public entry"] --> B["btree_insert_internal\nbuild BTREE_INSERT_HELPER"]
B --> C{"purpose?"}
C -->|INSERT_NEW_OBJECT / UNDO_PHYSICAL_DELETE| D["key_insert_func =\nbtree_key_insert_new_object"]
C -->|MVCC_DELID / MARK_DELETED| E["key_insert_func =\nbtree_key_find_and_insert_delete_mvccid"]
D --> F["btree_search_key_and_apply_functions"]
E --> F
F --> G["btree_fix_root_for_insert\nunique stats, NULL gate"]
G --> H["btree_split_node_and_advance\ndescent + split"]
H --> I["leaf reached -> key_insert_func"]
I --> J["post: release lock, stats correction"]
Figure 4-1. btree_insert_internal orchestration and callback wiring.
4.2 The unique-check gate and the NULL-key early exit
Section titled “4.2 The unique-check gate and the NULL-key early exit”btree_fix_root_for_insert runs once per descent (guarded by is_first_try).
Beyond fixing the root it does unique-statistics accounting and computes
key_len_in_page (the packed key length the Chapter 6 split math needs). Two
gates here shape the new-key path.
The unique-stats gate keys on BTREE_IS_UNIQUE (btid_int->unique_pk)
alone. A nearby macro looks like it should be this gate but is not:
// BTREE_NEED_UNIQUE_CHECK -- src/storage/btree.h (defined but never referenced)#define BTREE_NEED_UNIQUE_CHECK(thread_p, op) \ (logtb_is_current_active (thread_p) \ && (op == SINGLE_ROW_INSERT || op == MULTI_ROW_INSERT || op == SINGLE_ROW_UPDATE))Its intended semantics — an active transaction (so rollback/recovery skip
live checking) plus an insert-shaped op type — record the design intent, but
no caller in src/ references it; it is dead code. The live stats bump is
gated only on BTREE_IS_UNIQUE, with the increment chosen by NULL-ness:
// btree_fix_root_for_insert -- src/storage/btree.cif (BTREE_IS_UNIQUE (btid_int->unique_pk)) { btree_unique_stats incr; /* ... MVCC-delete branch omitted ... */ if (insert_helper->is_null) incr.insert_null_and_row (); /* <- NULL: a null + a row, never a key */ else incr.insert_key_and_row (); /* <- speculative: assumes a new key WILL be added */ if (BTREE_IS_MULTI_ROW_OP (insert_helper->op_type)) (*insert_helper->unique_stats_info) += incr; /* multi-row sink */ else if (!btree_is_online_index_loading (insert_helper->purpose)) logtb_tran_update_unique_stats (thread_p, *btid, incr, true); /* single-row tran stats */ }Invariant — speculative key count is corrected, never left wrong. Violation: every duplicate-key insert reports a phantom key, corrupting optimizer cardinality.
insert_key_and_row()is applied before the descent knows the key exists; if the leaf already holds it (Chapter 5) the 4.1 multi-row correction reverts it, and single-row rollback reverses thelogtb_tran_update_unique_statsdelta.
The NULL-key early exit — a NULL key never reaches a leaf:
// btree_fix_root_for_insert -- src/storage/btree.cif (insert_helper->is_null) { *stop = true; /* <- halt walker at root */ pgbuf_unfix_and_init (thread_p, *root_page); return NO_ERROR; }Stats were already counted, so the NULL is accounted but not stored. *stop
makes the walker return without the leaf callback, so btree_key_insert_new_key
is reached only for non-NULL keys — its own assert (key != NULL && ...)
encodes that. Finally the function sets insert_helper->key_len_in_page = BTREE_GET_KEY_LEN_IN_PAGE (btree_get_disk_size_of_key (key)), possibly creating
an overflow-key file when key_len >= BTREE_MAX_KEYLEN_INPAGE.
4.3 Reaching the leaf: btree_split_node_and_advance
Section titled “4.3 Reaching the leaf: btree_split_node_and_advance”The recursive driver is dissected fully in Chapter 6. For this chapter only its leaf-reaching tail matters: after ensuring room for a child split (Chapter 6) it tests the node type:
// btree_split_node_and_advance -- src/storage/btree.cif (node_type == BTREE_LEAF_NODE) { assert (pgbuf_get_latch_mode (*crt_page) == PGBUF_LATCH_WRITE); /* <- leaves always WRITE-latched */ /* No other child. Notify called this is a leaf node and return the slot of new key. */ error_code = btree_search_leaf_page (thread_p, btid_int, *crt_page, key, search_key); if (error_code != NO_ERROR) { ASSERT_ERROR (); goto error; } *is_leaf = true; /* <- walker now calls key_insert_func instead of recursing */ return NO_ERROR; }When *is_leaf is set true, the walker stops recursing and invokes the key
function on *crt_page. The crucial outputs of btree_search_leaf_page are
search_key->slotid (the slot where the key belongs, by binary search) and
search_key->result (BTREE_KEY_FOUND if present, else BTREE_KEY_NOTFOUND /
BTREE_KEY_BETWEEN) — the pivot the next function branches on. The leaf is
always WRITE-latched on arrival, so the leaf path never promotes.
4.4 Leaf dispatch: btree_key_insert_new_object
Section titled “4.4 Leaf dispatch: btree_key_insert_new_object”This leaf callback sets up recovery scaffolding then splits between new key
(this chapter) and append (Chapter 5). First it stamps insert_helper->leaf_addr
(offset = search_key->slotid, pgptr = *leaf_page, vfid) and prepares the undo
keyval blob — the logical undo image (key + class OID + OID + MVCC info):
// btree_key_insert_new_object -- src/storage/btree.cif (insert_helper->purpose == BTREE_OP_INSERT_NEW_OBJECT /* || ONLINE_INDEX_TRAN_INSERT ... */) { insert_helper->rcvindex = BTREE_MVCC_INFO_IS_INSID_NOT_ALL_VISIBLE (BTREE_INSERT_MVCC_INFO (insert_helper)) ? RVBT_MVCC_INSERT_OBJECT : RVBT_NON_MVCC_INSERT_OBJECT; /* <- MVCC vs non-MVCC */ error_code = btree_rv_save_keyval_for_undo (btid_int, key, ..., &insert_helper->rv_keyval_data, ...);}else /* BTREE_OP_INSERT_UNDO_PHYSICAL_DELETE */ insert_helper->rcvindex = RVBT_RECORD_MODIFY_COMPENSATE; /* <- compensation, no undo keyval */Then the central branch — result != BTREE_KEY_FOUND is the new-key path of
this chapter; the else falls through to the append paths of Chapter 5:
// btree_key_insert_new_object -- src/storage/btree.cif (search_key->result != BTREE_KEY_FOUND) { /* Key doesn't exist. Insert new key. */ error_code = btree_key_insert_new_key (thread_p, btid_int, key, *leaf_page, insert_helper, search_key); /* ... free undo blob if reallocated, track time ... */ return NO_ERROR; }/* Key was found. Append new object to existing key. */if (BTREE_IS_UNIQUE (btid_int->unique_pk) && insert_helper->purpose == BTREE_OP_INSERT_NEW_OBJECT) { error_code = btree_key_lock_and_append_object_unique (..., restart, ...); /* <- Chapter 5 */ if (error_code != NO_ERROR) { ASSERT_ERROR (); goto error; } if (*restart == true) { goto exit; } /* <- re-descend: leaf changed under the lock wait */ }else { /* btree_key_append_object_non_unique (...) -- Chapter 5 */ }The found-key unique path may set *restart and goto exit to re-walk from the
root: if btree_key_lock_and_append_object_unique waited on a lock and the leaf
was reorganized meanwhile, it asks the walker to descend again — a fourth way to
leave the dispatch, with mechanics deferred to Chapter 5. The append branches
return through a shared exit:/error: epilogue that frees a heap-reallocated
undo buffer and records the perf timer; the new-key branch does the equivalent
cleanup inline and returns directly without reaching exit:. Only the
result != BTREE_KEY_FOUND branch concerns us.
4.5 Building and splicing the cell: btree_key_insert_new_key
Section titled “4.5 Building and splicing the cell: btree_key_insert_new_key”The heart of the chapter: build a leaf record holding the key and its single
first OID, insert at search_key->slotid, update header counters, emit one
undoredo log record.
4.5.1 — Mark added. insert_helper->is_unique_key_added_or_deleted = true;
confirms the speculative +key from 4.2 was right (reached only when the key was
not found), so no later correction fires.
4.5.2 — Overflow-key vs in-page key (three sub-branches).
// btree_key_insert_new_key -- src/storage/btree.ckey_len = btree_get_disk_size_of_key (key);if (key_len >= BTREE_MAX_KEYLEN_INPAGE) { key_type = BTREE_OVERFLOW_KEY; log_sysop_start (thread_p); /* <- (a) overflow key needs a nested system op */ insert_helper->is_system_op_started = true; }else { int diff_column = btree_node_get_common_prefix (thread_p, btid_int, leaf_page); if (diff_column > 0) /* <- (b) prefix compression */ { new_key = &local_key; pr_clone_value (key, new_key); pr_midxkey_remove_prefix (new_key, diff_column); } else if (diff_column < 0) { ASSERT_ERROR (); return diff_column; } /* <- (c) error: bad prefix, early return before any sysop */ }(a) oversized -> overflow key under a system op so the overflow-page allocation
commits/aborts atomically; (b) common prefix > 0 -> clone into local_key and
strip prefix columns; (c) diff_column < 0 -> propagate error (returns before
any sysop, so no abort needed).
4.5.3 — Encode via btree_write_record into a REC_HOME descriptor:
// btree_key_insert_new_key -- src/storage/btree.crecord.type = REC_HOME; record.data = PTR_ALIGN (data_buffer, MAX_ALIGNMENT); record.area_size = DB_PAGESIZE;error_code = btree_write_record (thread_p, btid_int, NULL /* no non-leaf node_rec */, new_key, BTREE_LEAF_NODE, key_type, key_len, false, BTREE_INSERT_CLASS_OID (insert_helper), BTREE_INSERT_OID (insert_helper), BTREE_INSERT_MVCC_INFO (insert_helper), &record);if (new_key == &local_key) pr_clear_value (&local_key); /* <- free cloned key on BOTH paths, before err check */if (error_code != NO_ERROR) { ASSERT_ERROR (); goto error; }For a leaf this writes the first OID, optionally a subclass OID (unique index +
class differs from topclass, sets BTREE_LEAF_RECORD_CLASS_OID), MVCC info, then
the packed key body (or a VPID to the overflow page). local_key is cleared
before the error check, so the clone never leaks.
4.5.4 — Splice into the slot: spage_insert_at.
// btree_key_insert_new_key -- src/storage/btree.cnode_header = btree_get_node_header (thread_p, leaf_page); /* ... NULL-check -> goto error ... */FI_TEST (thread_p, FI_TEST_BTREE_MANAGER_RANDOM_EXIT, 0);/* Nothing should fail after spage_insert_at! */if (spage_insert_at (thread_p, leaf_page, search_key->slotid, &record) != SP_SUCCESS) { assert_release (false); error_code = ER_FAILED; goto error; }Invariant —
spage_insert_atis the point of no return. Violation: a half-applied insert with no clean rollback, since redo is emitted right after. The Chapter 6 split pass already reservedbtree_get_max_new_data_sizebytes, soSP_DOESNT_FITcannot occur; the only remaining failure is a logic bug (assert_release).
Inside, spage_insert_at validates the slot id, then spage_find_empty_slot_at
branches: at num_slots it appends via spage_add_new_slot; between existing
keys it calls spage_take_slot_in_use, which slides the slot-directory entries
from slot_id up by one to free it. Either branch bumps the page-header counters
and spage_insert_data copies the bytes (Chapter 1). Inserting at
search_key->slotid preserves sorted order.
4.5.5 — Update the node header: max_key_len and split_info.
// btree_key_insert_new_key -- src/storage/btree.ckey_cnt = btree_node_number_of_keys (thread_p, leaf_page);key_len = BTREE_GET_KEY_LEN_IN_PAGE (key_len);if (key_len > node_header->max_key_len) { update_max_key_length = true; node_header->max_key_len = key_len; } /* <- redo must know too */btree_split_next_pivot (&node_header->split_info, (float) search_key->slotid / key_cnt, key_cnt);Key count is not written here — it is derived from the slot count, which
spage_insert_at already incremented. What this function maintains explicitly:
max_key_len— grown only if the new key is longer. The local booleanupdate_max_key_lengthtells the redo encoding (4.5.6) to pack the new key length so recovery rebuilds the header identically. This is leaf growth; the ancestor chain was bumped during descent (Chapter 6).split_info.pivot—btree_split_next_pivotfoldsslotid / key_cntinto a running average so a future split favours where inserts land (Chapter 6).
4.5.6 — Emit the recovery record into a local buffer (the helper’s smaller redo buffer may not hold a full record plus debug info):
// btree_key_insert_new_key -- src/storage/btree.cLOG_RV_RECORD_SET_MODIFY_MODE (&insert_helper->leaf_addr, LOG_RV_RECORD_INSERT); /* <- redo = "insert a slot" */if (update_max_key_length) { rv_redo_data_ptr = or_pack_int (rv_redo_data_ptr, key_len); /* <- replay the header growth */ BTREE_RV_SET_UPDATE_MAX_KEY_LEN (&insert_helper->leaf_addr); }memcpy (rv_redo_data_ptr, record.data, record.length); rv_redo_data_ptr += record.length;btree_rv_log_insert_object (thread_p, *insert_helper, insert_helper->leaf_addr, 0, rv_redo_data_length, NULL, rv_redo_data);Redo data is an optional packed key_len (only when the header grew, gated by
the BTREE_RV_SET_UPDATE_MAX_KEY_LEN address flag) followed by the raw record;
the matching undo is the logical keyval blob from 4.4.
btree_rv_log_insert_object chooses the log shape by purpose:
// btree_rv_log_insert_object -- src/storage/btree.cif (insert_helper.is_system_op_started) log_append_undoredo_data (thread_p, RVBT_RECORD_MODIFY_UNDOREDO, &addr, undo_length, redo_length, undo_data, redo_data);else switch (insert_helper.purpose) { case BTREE_OP_INSERT_NEW_OBJECT: /* undo logical, redo physical */ log_append_undoredo_data (thread_p, insert_helper.rcvindex, &addr, insert_helper.rv_keyval_data_length, redo_length, insert_helper.rv_keyval_data, redo_data); break; /* ... online-index and compensate cases ... */ }Invariant — undo is logical, redo is physical, for the normal new-key case. Violation: undo corrupts a reshaped page or leaks the overflow allocation. Rollback re-derives and physically removes the object rather than restoring raw bytes a concurrent thread may have reshaped. The exception: an overflow-key sysop (
is_system_op_started) logs physicalRVBT_RECORD_MODIFY_UNDOREDObracketed by the sysop.
4.5.7 — Close the system op and finish.
// btree_key_insert_new_key -- src/storage/btree.cif (insert_helper->is_system_op_started) btree_insert_sysop_end (thread_p, insert_helper);pgbuf_set_dirty (thread_p, leaf_page, DONT_FREE); /* <- keep latch, flag for flush */return NO_ERROR;
error: if (insert_helper->is_system_op_started) { log_sysop_abort (thread_p); insert_helper->is_system_op_started = false; } /* <- roll back overflow alloc */ return error_code;The error: label is the single cleanup point; its only real work is aborting a
started sysop (overflow-key case). The point-of-no-return invariant guarantees no
slotted-page change reachable from error: needs undoing.
flowchart TD
A["btree_key_insert_new_key"] --> B["is_unique_key_added_or_deleted = true"]
B --> C{"key_len >= MAX_KEYLEN_INPAGE?"}
C -->|yes| D["key_type = OVERFLOW\nlog_sysop_start"]
C -->|no| E{"common prefix?"}
E -->|diff_column > 0| F["clone + strip prefix into local_key"]
E -->|diff_column < 0| G["return error"]
E -->|0| H["use key as-is"]
D --> I["btree_write_record"]
F --> I
H --> I
I -->|error| Z["goto error"]
I -->|ok| J["spage_insert_at at slotid"]
J -->|not SP_SUCCESS| Z
J -->|SP_SUCCESS| K["update max_key_len if grown\nbtree_split_next_pivot"]
K --> L["pack redo: optional key_len + record"]
L --> M["btree_rv_log_insert_object\nundo logical / redo physical"]
M --> N{"sysop started?"}
N -->|yes| O["btree_insert_sysop_end"]
N -->|no| P["pgbuf_set_dirty; return NO_ERROR"]
O --> P
Z --> Q{"sysop started?"}
Q -->|yes| R["log_sysop_abort"]
Q -->|no| S["return error_code"]
R --> S
Figure 4-2. btree_key_insert_new_key branch-complete control flow.
4.6 Chapter summary — key takeaways
Section titled “4.6 Chapter summary — key takeaways”btree_insert_internalis pure setup: build the helper, selectbtree_key_insert_new_object, computeis_null, delegate to the walker; its only post-work is saved-lock release and the no-op-multi-update correction.- Unique stats are bumped optimistically in
btree_fix_root_for_insert(insert_key_and_row) before the leaf knows if the key exists; truth is restored by the 4.1 correction or undo logs. NULL keys are counted (insert_null_and_row) but stop at root via*stop = true. btree_split_node_and_advancereaches the leaf by setting*is_leaf = trueafterbtree_search_leaf_page; the leaf is always WRITE-latched, andsearch_key->resultis the branch pivot.btree_key_insert_new_objecthas four exits: new-key insert, unique append (which maygoto exiton*restartto re-descend, Chapter 5), non-unique append, and the error path.btree_key_insert_new_keyencodes the cell viabtree_write_recordthrough three sub-branches: oversized -> overflow key under a sysop, prefix-compressible -> clone-and-strip, plain -> as-is.spage_insert_atis the point of no return (append vsspage_take_slot_in_use); the Chapter 6 split pass guarantees room, soSP_DOESNT_FITwould be a bug. Header maintenance ismax_key_len(grown only if longer, echoed into redo viaupdate_max_key_length) andsplit_info.pivot; key count is implicit in the slot count.- Recovery for the normal new-key case is undo logical, redo physical; an
overflow-key sysop instead forces a physical
RVBT_RECORD_MODIFY_UNDOREDOpair bracketed by the sysop.
Chapter 5: Appending an OID and Enforcing Uniqueness
Section titled “Chapter 5: Appending an OID and Enforcing Uniqueness”Chapter 4 covered the first object for a key; this chapter handles the
case where the key already exists, so a second OID must be threaded
onto it. For a non-unique index that is a pure append; for a unique index
it is a locking protocol that may end in ER_BTREE_UNIQUE_FAILED. Three
sub-problems interleave: where the new OID goes (inline until a cap, then
an overflow chain off LEAF_REC::ovfl); whether a unique append is legal
(lock the current visible object, re-check under it); and what “visible”
means (purpose-dependent). For the slotted-page layout, the key || OID
encoding, and overflow files, see cubrid-btree.md §“Node layout” and
§“Unique-key handling”.
5.1 The two append entry points
Section titled “5.1 The two append entry points”The leaf-key dispatcher routes on BTREE_IS_UNIQUE (btid_int->unique_pk):
the non-unique path appends with no lock and no visibility check; the
unique path locks the existing visible object, decides whether the insert
violates uniqueness, then appends. Both end at
btree_key_append_object_into_ovf when the inline list is full.
5.2 btree_key_append_object_non_unique — the bare append
Section titled “5.2 btree_key_append_object_non_unique — the bare append”// btree_key_append_object_non_unique -- src/storage/btree.cn_objects = btree_record_get_num_oids (..., leaf_record, offset_after_key, BTREE_LEAF_NODE);if (n_objects < BTREE_MAX_OIDCOUNT_IN_LEAF_RECORD (btid_int)) /* <- branch A: room inline */ { btree_record_append_object (..., leaf_record, BTREE_LEAF_NODE, btree_obj, NULL, &..rv_redo_data_ptr); spage_update (...); return NO_ERROR; /* writes at tail; logical-undo + physical-redo log */ }BTREE_MVCC_INFO_SET_FIXED_SIZE (&btree_obj->mvcc_info); /* <- branch B: overflow objects are fixed size */error_code = btree_key_append_object_into_ovf (..., leaf_record, leaf_info, insert_helper, btree_obj);The cap BTREE_MAX_OIDCOUNT_IN_LEAF_RECORD (btree_load.h) is
BTREE_MAX_OIDLEN_INPAGE / BTREE_OBJECT_FIXED_SIZE. Branch A writes at
the record tail (descent already guaranteed leaf room, so no split).
Branch B forces fixed size — every overflow object carries both MVCCID
slots so the page is binary-searchable (§5.9).
Invariant: an inline OID list never exceeds
BTREE_MAX_OIDCOUNT_IN_LEAF_RECORD — enforced by the
n_objects < n_objects_limit gate (and a symmetric assert (==) in §5.7).
If violated, the record grows unbounded and breaks the split-point
arithmetic that assumes a bounded record.
5.3 btree_append_oid and btree_record_append_object
Section titled “5.3 btree_append_oid and btree_record_append_object”btree_append_oid is the mechanical writer (also used by the loader): a
bare 8-byte OID (OR_PUT_OID at rec->data + rec->length, then
rec->length += OR_OID_SIZE), no MVCC/class/recovery data. The richer
btree_record_append_object packs the full object; when the leaf record
already carries BTREE_LEAF_RECORD_OVERFLOW_OIDS it first steps
append_at back by DISK_VPID_ALIGNED_SIZE, saves the trailing VPID,
packs the object, then re-appends the VPID:
// btree_record_append_object -- src/storage/btree.cif (node_type == BTREE_LEAF_NODE && record->length > 0 && btree_leaf_is_flaged (record, BTREE_LEAF_RECORD_OVERFLOW_OIDS)) { append_at -= DISK_VPID_ALIGNED_SIZE; OR_GET_VPID (append_at, &ovf_vpid); } /* <- preserve ovfl link */So a new inline object is spliced in front of the trailing
LEAF_REC::ovfl link, keeping the link at a fixed position (the tail).
5.4 The inline-to-overflow transition
Section titled “5.4 The inline-to-overflow transition”btree_key_append_object_into_ovf reuses a partially-full chain page or
starts a new one:
// btree_key_append_object_into_ovf -- src/storage/btree.cbtree_find_free_overflow_oids_page (..., &leaf_record_info->ovfl, &overflow_page);if (overflow_page == NULL) /* <- no free page in chain */ error_code = btree_key_append_object_as_new_overflow (...); /* allocate + prepend new page */else /* <- a chain page has room */ { btree_key_append_object_to_overflow (..., overflow_page, append_object, insert_helper); pgbuf_unfix_and_init (thread_p, overflow_page); }The new-page branch calls btree_start_overflow_page to allocate and
seed the page, then btree_leaf_record_change_overflow_link to splice it
onto the leaf link and set BTREE_LEAF_RECORD_OVERFLOW_OIDS:
// btree_start_overflow_page -- src/storage/btree.c*new_page_ptr = btree_get_new_page (thread_p, btid_int, new_vpid, near_vpid);VPID_COPY (&ovf_header_info.next_vpid, first_overflow_vpid); /* <- prepend: new page -> old head */btree_init_overflow_header (...); spage_insert_at (..., *new_page_ptr, 1, &rec); /* one object, slot 1 */// ... redo-only log (RVBT_RECORD_MODIFY_NO_UNDO); undo handled logically by caller's sysop ...It prepends (leaf -> new -> old head -> ...): the new page’s
next_vpid points to the current head before the leaf link is rewritten.
Invariant: the overflow link and the BTREE_LEAF_RECORD_OVERFLOW_OIDS
flag are set together, under a system operation.
btree_key_append_object_as_new_overflow brackets allocation + link
rewrite in log_sysop_start / btree_insert_sysop_end, so a crash
between them is undone logically. If flag and link disagreed,
btree_record_append_object would read a garbage VPID at the tail.
5.5 The unique protocol — btree_key_lock_and_append_object_unique
Section titled “5.5 The unique protocol — btree_key_lock_and_append_object_unique”A unique index cannot just append: under MVCC a “duplicate” key may have
its visible version already logically deleted, in which case a new
version is legal. The protocol locks the current first object, then
re-decides. It saves the leaf LSA, calls btree_key_find_and_lock_unique
(§5.6) with the prior saved_locked_oid copied into
find_unique_helper.locked_oid at X_LOCK, then walks these branches:
flowchart TB
FAL["find_and_lock_unique"] --> E1{"error / *restart?"}
E1 -- yes --> RET["return / restart"]
E1 -- no --> NF{"result != KEY_FOUND\nor SERVER LSA split?"}
NF -- yes --> NEWK["restart, or insert_new_key\nkey vacuumed away"]
NF -- no --> FO{"found_object and\nINSERT_NEW_OBJECT?"}
FO -- no --> AP["append_object_unique"]
FO -- yes --> MU{"!multi_update or\nnum_visible > 1?"}
MU -- yes --> UV["unique violation"]
MU -- no --> HA{"is_ha_enabled?"}
HA -- yes --> UV
HA -- no --> AP
Figure 5-1 — Branch-complete flow of btree_key_lock_and_append_object_unique
(the !found_object RR-isolation branch also reaches unique violation).
The branch worth quoting is the violation decision:
// btree_key_lock_and_append_object_unique -- src/storage/btree.cif (find_unique_helper.found_object && insert_helper->purpose == BTREE_OP_INSERT_NEW_OBJECT) { if (insert_helper->is_unique_multi_update) // ... btree_get_num_visible_from_leaf_and_ovf with dirty snapshot -> num_visible ... if (!insert_helper->is_unique_multi_update || num_visible > 1) { /* <- real violation */ if (prm_get_bool_value (PRM_ID_UNIQUE_ERROR_KEY_VALUE)) return ER_UNIQUE_VIOLATION_WITHKEY; BTREE_SET_UNIQUE_VIOLATION_ERROR (...); return ER_BTREE_UNIQUE_FAILED; } else if (insert_helper->is_ha_enabled) return ER_REPL_MULTI_UPDATE_UNIQUE_VIOLATION; /* HA strictness */ insert_helper->is_unique_key_added_or_deleted = false; /* <- multi-update allowed */ OID_SET_NULL (&insert_helper->saved_locked_oid); /* keep lock on relocated object */} else insert_helper->is_unique_key_added_or_deleted = true; /* all deleted -> a new key */A live visible object plus a normal insert is a violation —
ER_BTREE_UNIQUE_FAILED, or ER_UNIQUE_VIOLATION_WITHKEY when
PRM_ID_UNIQUE_ERROR_KEY_VALUE wants the key echoed. The multi-update
exception tolerates a transient second version only when num_visible <= 1
under a dirty snapshot; under HA even that becomes
ER_REPL_MULTI_UPDATE_UNIQUE_VIOLATION. The RR branch (a !found_object
whose repeatable-read snapshot still counts a visible object) raises the
same error pair. When cleared, the path saves two objects for undo —
but only for an MVCC insert (rcvindex == RVBT_MVCC_INSERT_OBJECT) whose
key either gained no new visible object (!is_unique_key_added_or_deleted)
or whose first object was deleted by this same transaction (its delete
MVCCID equals our insert MVCCID). Then
btree_rv_save_keyval_for_undo_two_objects packs both versions and
rcvindex switches to RVBT_MVCC_INSERT_OBJECT_UNQ, so undo restores the
displaced object.
Invariant: at most one visible object per key in a unique index (two
transiently during multi-row update, never under HA) — enforced by the
num_visible > 1 and is_ha_enabled branches. Violating it lets two
committed rows share a unique key.
5.6 btree_key_find_and_lock_unique and the lock loop
Section titled “5.6 btree_key_find_and_lock_unique and the lock loop”btree_key_find_and_lock_unique dispatches on BTREE_IS_UNIQUE:
_of_unique inspects only the first object (btree_leaf_get_first_object);
_of_non_unique serves system-catalog indexes declared unique but
physically storing an OID list. The hot path:
// btree_key_find_and_lock_unique_of_unique -- src/storage/btree.cif (search_key->result != BTREE_KEY_FOUND) goto error_or_not_found; /* key gone */while (true) { btree_leaf_get_first_object (btid_int, &record, &unique_oid, &unique_class_oid, &mvcc_info); if (class filter mismatch) goto error_or_not_found; /* partition-scoped lookup */ if (locked_oid set and != unique_oid) /* first object changed */ { lock_unlock_object_donot_move_to_non2pl (...); OID_SET_NULL (&locked_oid); } switch (mvcc_satisfies_delete (thread_p, &mvcc_header)) { case INSERT_IN_PROGRESS: case DELETE_IN_PROGRESS: /* SA: assert_release(false); SERVER: fallthrough */ case CAN_DELETE: /* already locked -> found; else cond-lock; on fail unfix+uncond, *restart/break */ case DELETED: case SELF_DELETED: goto error_or_not_found; /* gone -> caller may insert */ default: assert_release (false); error_code = ER_FAILED; goto error_or_not_found; } }_of_non_unique differs materially: the record holds a list, so it
scans the whole leaf record then every overflow page rather than stopping
at the first object —
// btree_key_find_and_lock_unique_of_non_unique -- src/storage/btree.cwhile (true) { if (start_reading_leaf_record) { /* read leaf record; first object via btree_or_get_object */ } else if (buf.ptr == buf.endptr) { /* <- exhausted current record */ if (VPID_ISNULL (&next_overflow_vpid)) goto error_or_not_found; /* end of chain: not found */ overflow_page = pgbuf_fix (..., &next_overflow_vpid, OLD_PAGE, PGBUF_LATCH_READ, ...); if (prev_overflow_page) pgbuf_unfix_and_init (thread_p, prev_overflow_page); btree_get_next_overflow_vpid (..., &next_overflow_vpid); /* step the chain */ } btree_or_get_object (&buf, ..., &unique_oid, ...); /* next object */ if (class mismatch) continue; /* <- advance, NOT not-found */ switch (mvcc_satisfies_delete (...)) { /* same 5 cases as _of_unique */ } }Unlike _of_unique, a class mismatch does continue (advance to the next
list object) and a deleted object keeps scanning; not-found is reported
only when buf.ptr == buf.endptr and VPID_ISNULL(next_overflow_vpid).
Pages are fixed PGBUF_LATCH_READ, with the prev/current pair unfixed as
the scan steps.
Invariant: the first object of a unique leaf record is the newest
visible version, and it is the one that gets locked. Maintained by §5.5
plus btree_leaf_change_first_object (§5.8); _of_unique depends on it.
A stale first version would let an inserter lock the wrong object and miss
a real conflict.
BTREE_FIND_UNIQUE_HELPER threads through these functions:
| Field | Role | Why it exists |
|---|---|---|
oid | OID of the found object (output) | Which object was visible |
match_class_oid | Only this class; NULL = any | Partition-scoped lookup; mismatch = not-found |
lock_mode | Lock to take; insert = X_LOCK, FK = S_LOCK | Separates read-only find from insert |
snapshot | MVCC filter; NULL = none | FK/find set it; locking insert leaves NULL |
found_object | True iff found and locked | Success flag (§5.5) |
time_track | PERF_UTIME_TRACKER for PSTAT_BT_FIND_UNIQUE* | Per-phase counters |
locked_oid (SERVER_MODE) | Object currently locked here | Survives unfix/refix; detects a changed first object |
locked_class_oid (SERVER_MODE) | Class OID of locked_oid | Needed by lock_unlock_object_* |
locked_oid/locked_class_oid are absent in SA_MODE. Insert (unique) uses
X_LOCK + NULL snapshot (a live object drives the violation decision);
FK/find uses S_LOCK + a snapshot (a visible object exists, no exclusion).
5.7 btree_key_append_object_unique — keeping the newest first
Section titled “5.7 btree_key_append_object_unique — keeping the newest first”The new object must become the first object, so the current first is relocated to the tail:
// btree_key_append_object_unique -- src/storage/btree.cif (btree_record_get_num_oids (...) >= BTREE_MAX_OIDCOUNT_IN_LEAF_RECORD (btid_int)) error_code = btree_key_relocate_last_into_ovf (...); /* full: evict LAST inline object first */BTREE_MVCC_INFO_SET_FIXED_SIZE (&first_object->mvcc_info);btree_record_append_object (..., first_object, NULL, &..rv_redo_data_ptr); /* append OLD first at tail */btree_leaf_change_first_object (..., BTREE_INSERT_OID (insert_helper), ...); /* overwrite slot zero */spage_update (...);// log_append_undoredo_data with insert_helper->rcvindex (MVCC / NON_MVCC / *_UNQ); logical undo.When full, the last inline object is evicted to overflow by
btree_key_relocate_last_into_ovf under its own sysop; with room,
relocation is skipped. The two-step split is deliberate — the logging
system cannot mix a sysop with logical undo, so step 1 uses a sysop with
physical undo/redo and step 2 uses logical undo.
btree_key_relocate_last_into_ovf opens a log_sysop_start, pushes the
last object via btree_key_append_object_into_ovf, removes it from the
leaf with btree_record_remove_last_object, and logs
RVBT_RECORD_MODIFY_UNDOREDO. Its subtle branch is a re-read:
// btree_key_relocate_last_into_ovf -- src/storage/btree.cif (VPID_ISNULL (&leaf_record_info->ovfl) && btree_leaf_is_flaged (leaf_record, BTREE_LEAF_RECORD_OVERFLOW_OIDS)) { btree_read_record (...); btree_record_get_last_object (...); } /* <- first ovf page just added: re-read */The overflow append may itself have created the first overflow page, so
btree_leaf_record_change_overflow_link rewrote the leaf record and the
cached offset_to_last_object is stale — recompute it before
btree_record_remove_last_object or the wrong bytes get removed.
5.8 btree_leaf_change_first_object — overwriting slot zero
Section titled “5.8 btree_leaf_change_first_object — overwriting slot zero”This primitive replaces the in-record first object, growing or shrinking the record as the object’s size class changes:
// btree_leaf_change_first_object -- src/storage/btree.cif (old_rec_flag & BTREE_LEAF_RECORD_OVERFLOW_OIDS) { /* <- overflow -> FORCE fixed size */ old_object_size += 2 * OR_MVCCID_SIZE; new_object_size += 2 * OR_MVCCID_SIZE; BTREE_MVCC_INFO_SET_FIXED_SIZE (mvcc_info); if (BTREE_IS_UNIQUE (...) && !new_has_class_oid) /* unique + overflow forces a class OID too */ { new_rec_flag |= BTREE_LEAF_RECORD_CLASS_OID; new_object_size += OR_OID_SIZE; } }// else: no overflow -> variable size. Then RECORD_MOVE_DATA(recp, new_object_size, old_object_size).With no overflow the first object is variable-size; with overflow
it is forced fixed-size plus (for unique) a class OID, so
btree_leaf_get_first_object and the overflow binary search agree on a
uniform width. RECORD_MOVE_DATA slides the key and inline OIDs by
(new - old), reported through *key_offset.
Invariant: when a leaf record has an overflow chain, its first object
is fixed-width (and, if unique, includes a class OID) — enforced here and
by BTREE_MVCC_INFO_SET_FIXED_SIZE in the appenders. If violated,
btree_find_oid_from_ovfl’s binary search (dividing by
BTREE_OBJECT_FIXED_SIZE) reads mid-object garbage.
5.9 Locating an exact (key, OID) — the find-OID family
Section titled “5.9 Locating an exact (key, OID) — the find-OID family”Both the unique check and the delete path ask “which page holds this exact
(key, OID) under this purpose’s visibility?” btree_find_oid_and_its_page
scans the leaf linearly (btree_find_oid_from_leaf, variable-width,
skipping the packed key after object 0), then binary-searches each overflow
page (btree_find_oid_from_ovfl, fixed-width, OID-sorted):
// btree_find_oid_and_its_page -- src/storage/btree.cbtree_find_oid_from_leaf (..., oid, match_mvccinfo, purpose, offset_to_object, ...);if (*offset_to_object != NOT_FOUND) { *found_page = leaf_page; return NO_ERROR; } /* in leaf */if (VPID_ISNULL (&leaf_rec_info->ovfl)) return NO_ERROR; /* no chain: not found */do { overflow_page = pgbuf_fix (..., PGBUF_LATCH_WRITE, ...); /* walk chain, tracking *prev_page */ btree_find_oid_from_ovfl (..., overflow_page, oid, purpose, match_mvccinfo, offset_to_object, ...);} while (*offset_to_object == NOT_FOUND && !VPID_ISNULL (&overflow_vpid)); /* step via next_overflow_vpid */// found: *found_page = overflow_page; else unfix all + not foundPages are fixed PGBUF_LATCH_WRITE (mutating callers); prev_page is
tracked for relinking after removal; all pages are unfixed on the
not-found and error paths. The MVCC gate in both finders is
btree_find_oid_does_mvcc_info_match, purpose-dependent:
BTREE_OP_PURPOSE | Match rule |
|---|---|
DELETE_VACUUM_INSID | insert MVCCID equals match_mvccinfo->insert_mvccid |
DELETE_VACUUM_OBJECT, DELETE_OBJECT_PHYSICAL_POSTPONED | delete MVCCID equals match_mvccinfo->delete_mvccid (avoids a reused OID) |
DELETE_UNDO_INSERT, DELETE_UNDO_INSERT_UNQ_MULTIUPD | insert MVCCID matches the one being rolled back |
DELETE_UNDO_INSERT_DELID | delete MVCCID matches; insert MVCCID cross-checked only if also set |
DELETE_OBJECT_PHYSICAL, INSERT_MVCC_DELID, INSERT_MARK_DELETED, default | object has no valid delete MVCCID (alive) |
The purpose selects which MVCC field decides a match, so one finder serves insert, delete, and vacuum. The default branch (“no valid delete MVCCID”) drives unique-insert and physical-delete, so a reused OID whose old version is deleted-but-unvacuumed is correctly skipped.
5.10 Chapter summary — key takeaways
Section titled “5.10 Chapter summary — key takeaways”- Two entry points on
BTREE_IS_UNIQUE:btree_key_append_object_non_uniqueappends blindly;btree_key_lock_and_append_object_uniquelocks-then-verifies. - Inline until
BTREE_MAX_OIDCOUNT_IN_LEAF_RECORD, then overflow viabtree_key_append_object_into_ovf— reuse a chain page or prepend a new one (btree_start_overflow_page; flag + link set under a sysop). - Unique = lock first, then decide.
_of_uniqueX-locks the first object;_of_non_uniquescans the whole list. A live visible object meansER_BTREE_UNIQUE_FAILED, except aMULTI-ROW-UPDATEtolerates one transient extra — never under HA, never two. - Newest visible stays first —
btree_key_append_object_uniquerelocates the old first object to the tail (evicting the last inline object to overflow first if full), thenbtree_leaf_change_first_object. - Overflow forces fixed-width objects (both MVCCID slots, class OID for
unique) — the precondition for
btree_find_oid_from_ovfl’s binary search. - One finder, many purposes —
btree_find_oid_does_mvcc_info_matchmakes the same(key, OID)match or miss byBTREE_OP_PURPOSE, so insert/delete/undo/vacuum reusebtree_find_oid_and_its_page. - Restart and re-read are first-class:
*restartbounces to root, the LSA recheck forces a split restart, and the “first overflow page just created” re-read keepsbtree_key_relocate_last_into_ovfhonest.
Chapter 6: Node Split and Separator Promotion
Section titled “Chapter 6: Node Split and Separator Promotion”When the target leaf — or any ancestor on the descent path — has no room, CUBRID grows the tree via pessimistic split-on-descent, making growth safe without ever restarting from the root for capacity reasons. The companion (cubrid-btree.md, §“Descent” and Figures 2–3) covers why CUBRID splits eagerly under a system op; this chapter dissects how, branch by branch. Chapter 3 covered the descent calling the function below; Chapter 4 the leaf insert once room is guaranteed.
6.1 The pessimistic contract: split before you advance
Section titled “6.1 The pessimistic contract: split before you advance”The descent driver btree_search_key_and_apply_functions calls one advance function per level; for inserts it is btree_split_node_and_advance.
INVARIANT — “the child you advance into always has room.” Before returning a child page,
btree_split_node_and_advanceverifies the child can absorb the worst-case new entry, splitting it first if not. A chain split (a split forcing its parent to split, which forces its…) can never happen: when control reaches a node, its parent was guaranteed one free slot on the previous iteration. The leaf, visited last, is thus guaranteed room by induction.
It enforces this by measuring, at every level, the largest entry the operation could add against free space:
// btree_split_node_and_advance -- src/storage/btree.cmax_new_data_size = btree_get_max_new_data_size (..., node_type, max_key_len, insert_helper, false /*known*/);need_split = max_new_data_size > spage_get_free_space_without_saving (thread_p, child_page, NULL);max_key_len is insert_helper->key_len_in_page when need_update_max_key_len, else node_header->max_key_len. known_to_be_found = false: descent does not yet know whether the key exists, so it always reserves for a brand-new entry — the pessimistic estimate.
6.2 btree_get_max_new_data_size — the worst-case estimator
Section titled “6.2 btree_get_max_new_data_size — the worst-case estimator”An under-estimate would defeat the §6.1 invariant. The body is a node-type check then a switch (helper->purpose):
| Branch | Returned size | Why |
|---|---|---|
| Non-leaf node | NON_LEAF_ENTRY_MAX_SIZE(key_len) + SPAGE_SLOT_SIZE | A child split promotes one separator here; slot directory grows by one. |
| Leaf, key found | BTREE_OBJECT_FIXED_SIZE | Append an OID / upgrade first object / add an overflow link (Ch 5). No new slot. |
| Leaf, key not found | LEAF_ENTRY_MAX_SIZE(key_len) + SPAGE_SLOT_SIZE | Brand-new key record plus its slot (Ch 4). |
INSERT_MVCC_DELID / MARK_DELETED | OR_MVCCID_SIZE | Logical delete stamps a delete MVCCID into an existing object (Ch 8). |
| default | assert_release(false), ER_FAILED | Unhandled purpose — a programming error. |
Descent passes known_to_be_found = false for leaves, so the leaf estimate is the full new-key cost — largest of the insert cases.
6.3 btree_node_split_info — the skew accumulator
Section titled “6.3 btree_node_split_info — the skew accumulator”Every node header carries a BTREE_NODE_SPLIT_INFO that biases consecutive splits toward the hot side — the optimization for monotonic (auto-increment) key streams, where a 50/50 split would leave the left half permanently half-empty.
// btree_node_split_info -- src/storage/storage_common.hstruct btree_node_split_info{ float pivot; /* pivot = split_slot_id / num_keys */ int index; /* number of key insert after node split */};| Field | Role | Why it exists |
|---|---|---|
pivot | Running average in [0,1] of where past splits landed. 0.5 balanced; near 1.0 right-skewed (ascending); near 0.0 left-skewed. | Lets btree_split_find_pivot aim the next split at the hot side instead of the geometric middle, so the cold half is not wasted. |
index | Count of splits folded into the average (capped at key count); the CMA divisor. | Damps the average so one outlier insert cannot swing the pivot wildly. |
btree_write_default_split_info seeds a fresh node with pivot = BTREE_SPLIT_DEFAULT_PIVOT (0.5), index = 1 — start balanced.
btree_split_find_pivot turns the average into a target byte size, two branches:
- Balanced —
pivot == 0(uninit) orBTREE_SPLIT_LOWER_BOUND < pivot < BTREE_SPLIT_UPPER_BOUND: aim forCEIL_PTVDIV(total, 2); the[0.20,0.80]dead-band stops mild noise from skewing. - Skewed —
pivot ≤ 0.20or≥ 0.80: aim for(int)(total * pivot), clamped to[0.05,0.95](MIN_PIVOT/MAX_PIVOT) so ≥5% always lands each side.
btree_split_next_pivot folds the new split in: index = MIN(index+1, max_index); if pivot == 0 take new_value directly, else new_pivot = pivot + (new_value - pivot)/index (CMA), clamped to [0,1]. new_value = p_slot_id / key_cnt is where the just-promoted separator landed; ascending inserts feed values near 1.0, dragging pivot upward so future splits keep the dense right side and ship the sparse left half to the sibling.
6.4 btree_find_split_point — the size-balanced split point
Section titled “6.4 btree_find_split_point — the size-balanced split point”btree_split_node/btree_split_root delegate the where to btree_find_split_point, which returns the separator key and writes *mid_slot = last slot kept on the left. It is size-driven (spage_get_space_for_record bytes, not key counts, so fat keys split correctly). Three rules from its source comment: (1) get close to mid_size, (2) leave both sides room for the new entry and fences, (3) both sides keep ≥1 non-fence record. Three budgets back them: left_max_size/right_max_size = BTREE_NODE_MAX_SPLIT_SIZE - new_ent_size - (leaf only) new_fence_size; mid_size = btree_split_find_pivot(tot_rec, …); left_min_size = tot_rec - right_max_size.
The loop walks slots left to right accumulating left_size, charging each real record’s bytes — except at the leaf insert slot, where it charges new_ent_size (absent key) or existing + new_ent_size (present key, append grows in place). Per record:
- while
left_size < left_min_size(right still over budget):continue; if (left_size > MIN(left_max_size, mid_size)) { *mid_slot = i-1; break; }— cut one slot before overflow;if (i == stop_at) { *mid_slot = i; break; }— whole page minus fences stays left.
Two post-loop guards enforce rule #3: *mid_slot == start_with - 1 (left empty) → ++; *mid_slot == stop_at (right empty) → --; each gated by node-type / is_key_added_to_left / found.
Fence-key cost. For leaves the budget pre-subtracts new_fence_size = LEAF_FENCE_MAX_SIZE(max_key_len)+SPAGE_SLOT_SIZE from both sides: the split adds an upper fence to the left leaf and a lower fence to the right (a separator copy marked BTREE_LEAF_RECORD_FENCE letting range scans detect boundaries cheaply — Ch 7). The estimate stops the node overflowing when the fence materializes.
Separator selection. With *mid_slot fixed, it reads mid_key (last left) and next_key (first right). A leaf computes the shortest distinguishing prefix via btree_get_prefix_separator — unless either is an overflow key, where it clones next_key whole (a prefix could exceed the in-page max). A non-leaf clones next_key.
INVARIANT — “the separator strictly separates.” Every left-node key is
< separator ≤every right-node key.btree_get_prefix_separator(mid_key, next_key)returns the smallest value> mid_keyand≤ next_key. Violating it routes searches to the wrong subtree.
6.5 btree_split_node — splitting a non-root node into Q and R
Section titled “6.5 btree_split_node — splitting a non-root node into Q and R”btree_split_node takes parent P, the full node Q, a freshly allocated sibling R, the P-slot pointing at Q, and the descent key (all three pages already write-latched, asserted up front). Q keeps [1..leftcnt] + upper fence, R gets lower fence + [leftcnt+1..key_cnt], P gains a sep → R slot, and (leaf) Q.next → R.
Figure 6-1 — btree_split_node leaf-chain relink
flowchart LR Q["Q: [1..leftcnt]"] R["R: [leftcnt+1..]"] N["old Q.next"] Q -->|"next_vpid"| R R -->|"next_vpid"| N R -->|"prev_vpid"| Q N -->|"prev_vpid (step 8)"| R
Branch-complete walkthrough:
- Find the cut.
key_cnt = btree_node_number_of_keys(Q);≤ 0→goto exit_on_error.btree_find_split_point→sep_key,leftcnt; NULL/DB_NULLsep_key→ log and error out.rightcnt = key_cnt - leftcnt. - Fence record (leaf only). If leaf and
sep_key_len < BTREE_MAX_KEYLEN_INPAGE && ≤ qheader->max_key_len, write a leaf record fromsep_key, setBTREE_LEAF_RECORD_FENCE,flag_fence_insert = true. An overflow-key separator, orPRM_ID_USE_BTREE_FENCE_KEY = false, forcesflag_fence_insert = false. - Undo-log Q, set headers.
log_append_undo_data2(RVBT_COPYPAGE, …)snapshots the entire old Q for byte-for-byte rollback. Saveright_next_vpid = qheader->next_vpid. Sibling links exist only at the leaf level: leaves setqheader->next_vpid = *R_vpid,rheader->prev_vpid = *Q_vpid; non-leaf Qnext_vpidis nulled (VPID_SET_NULL) and R’s link fields keepbtree_init_node_headerdefaults (which redo-logs R’s header). R always inheritsrheader->next_vpid = right_next_vpid. - Move the upper half, fence Q. Optional lower fence at R slot 1, then for
i = 1..rightcntthe moving target stays atleftcnt+1: peek slotleftcnt+1of Q →spage_insert_at(R, j, …)→spage_delete(Q, leftcnt+1). Ifflag_fence_insert, Q’s upper fence goes in atleftcnt+1.btree_compress_node(Q), then redo-log Q (RVBT_COPYPAGE). - Redo-log R.
btree_compress_node(R), then redo-log R (RVBT_COPYPAGE). R has no undo log — newly allocated in the system op, abort just deallocates it. - Promote separator into P.
nleaf_rec.pnt = *R_vpid;key_type = BTREE_OVERFLOW_KEY/key_len = -1when too long, elseBTREE_NORMAL_KEY.p_slot_id++(insert after the slot pointing at Q),spage_insert_at(P, p_slot_id, &rec), undo/redoRVBT_NDRECORD_INS. P’s header is undo-logged,btree_split_next_pivot(&pheader->split_info, (float) p_slot_id / key_cnt, key_cnt)folds the landing slot in,max_key_lenbumped, header redo-logged. - Pick the child.
btree_compare_key(key, sep_key, …):DB_UNK→ assertion failure;< 0→*advance_vpid = *Q_vpid; else*R_vpid. - Relink the leaf neighbor. Only when R is a leaf: the page that used to follow Q gets its
prev_vpidre-pointed at R viabtree_get_next_page(R)+btree_set_vpid_previous_vpid(§6.7). exit_on_errorfreessep_keyand returns; inside the caller’s system op,log_sysop_abortreverses every page change.
INVARIANT — “the doubly-linked leaf chain stays consistent.” After a leaf split all four links hold:
Q.next = R,R.prev = Q,R.next = old-next,old-next.prev = R(step 3 sets the first three, step 8 the fourth). Skipping step 8 leaves a leaf whoseprev_vpidpoints at the now-too-full Q, corrupting backward range scans.
6.6 btree_split_root — the height-growth special case
Section titled “6.6 btree_split_root — the height-growth special case”The root must keep its page identity (it is named in the index catalog), so it cannot spawn one sibling. btree_split_root allocates two children Q and R, moves the halves into them, and rewrites root P to two separators (level L → L+1, slot 1 = neg-inf → Q, slot 2 = sep → R). Contrasting with §6.5:
- Undo-log the whole root (
RVBT_COPYPAGE); readkey_cnt(≤0→ error); snapshotsplit_info, setsplit_info.index = 1. - Find split point & fence as in §6.5.
neg_inf_key = sep_key— the separator is reused as the dummy negative-infinity key for the left pointer. - Grow the level.
pheader->node.node_level++, bumpmax_key_len, reset root split-info to default. Children getnode_level = new_level - 1and inheritmax_key_len/split_info. Same leaf-only sibling-link rule as §6.5 step 3. - Fill R (upper half). Optional lower fence, then move
rightcntrecords (peek/insert/deleteleftcnt+1of P). Redo-log RRVBT_INS_PGRECORDS(R is new). - Fill Q (lower half). Move
leftcntrecords by peeking slot 1 of P and deleting it (P shrinks from the front). If a fence is wanted, append Q’s upper fence; elsei--. Redo-log QRVBT_INS_PGRECORDS. - Rebuild root P. Redo-log a delete-all (
RVBT_DEL_PGRECORDS,rec_cnt = key_cnt), then write slot 1 =neg_inf_key → Q_vpid, slot 2 =sep_key → R_vpid(eachRVBT_NDRECORD_INS;BTREE_OVERFLOW_KEY/key_len = -1when too long). - Pick the child. Same
btree_compare_keythree-way branch:DB_UNK→ error;< 0→ Q; else R. exit_on_errorfreessep_key; the system op aborts, deallocates Q and R, and theRVBT_COPYPAGEundo restores P.
INVARIANT — “the root never changes page identity, and the leftmost separator is negative infinity.”
btree_split_rootrewrites P in place (no new VPID) and always installs a dummyneg_inf_keyin slot 1.btree_compare_keynever compares against slot 1 — searches< sep_keyroute to Q unconditionally — so the leftmost pointer covers all keys below the real separator. Delete-all + two inserts (not a page copy) keeps P’s identity stable across redo.
Logging asymmetry: btree_split_node uses whole-page RVBT_COPYPAGE for Q and R; btree_split_root uses record-set logging (RVBT_INS_PGRECORDS / RVBT_DEL_PGRECORDS / RVBT_NDRECORD_INS). Both share one system-op bracket, so abort is uniform.
6.7 btree_set_vpid_previous_vpid — the back-link rewrite
Section titled “6.7 btree_set_vpid_previous_vpid — the back-link rewrite”The smallest piece carries its own undo/redo so a split rollback restores the neighbor’s prev_vpid:
// btree_set_vpid_previous_vpid -- src/storage/btree.cif (page_p == NULL) return NO_ERROR; /* <- chain tail, no-op */header = btree_get_node_header (thread_p, page_p);if (header == NULL) return ER_FAILED;btree_node_header_undo_log (thread_p, &btid->sys_btid->vfid, page_p);header->prev_vpid = *prev; /* <- repoint backward link, then redo-log */Three branches: NULL page (chain tail) → no-op success; header read failure → ER_FAILED; else back-link rewritten with full undo/redo and pgbuf_set_dirty. Reused by leaf merge (Ch 9), hence standalone.
6.8 btree_split_node_and_advance — the full branch map
Section titled “6.8 btree_split_node_and_advance — the full branch map”The advance function is more than the need_split block. With node_type, need_split, and need_update_max_key_len computed (§6.1), it threads through these branches before returning. A root restart below means: on ER_PAGE_LATCH_PROMOTE_FAIL, set nonleaf_latch_mode = WRITE, *restart = true, unfix child+parent, is_crt_node_write_latched = false, return NO_ERROR (other promote errors → goto error). The nine branch entries, in source order:
- Root dispatch (
insert_helper->is_root): the root path mirrors the non-root flow in a parallel block (its own sysop, promotion, advance branches), but splits viabtree_split_root(two new pages, §6.6) notbtree_split_node. *crt_pagepromotion (need_split && nonleaf_latch_mode == READ && !is_crt_node_write_latched): promote the parent read→write (PGBUF_PROMOTE_ONLY_READER); on fail, root restart.- child promotion (
(need_split || need_update_max_key_len) && nonleaf_latch_mode == READ && !need_update_max_key_len && !is_child_leaf): promote the child (PGBUF_PROMOTE_SHARED_READER); on fail, root restart. need_update_max_key_len(ruleparent->max_key_len >= child->max_key_lenviolated): set childmax_key_len = key_len_in_page,btree_node_header_redo_log+pgbuf_set_dirty, propagateneed_update_max_key_len = true,is_crt_node_write_latched = trueso every descendant also updates.- Split + advance-to-child (
need_split,VPID_EQ(&advance_vpid, &child_vpid)): unfixnew_page1,log_sysop_commit,*advance_to_page = child_page,is_crt_node_write_latched = true, return. - Split + advance-to-new-page (
need_split,VPID_EQ(&advance_vpid, &new_page_vpid1)): unfixchild_page,log_sysop_commit,*advance_to_page = new_page1,is_crt_node_write_latched = true, return. - Split + impossible (
need_split, neither VPID matches):assert_release(false), unfix all three,error_code = ER_FAILED,goto error. - No-split fall-through (
!need_split && !need_update_max_key_len && nonleaf_latch_mode == READ): read-latched and untouched, sois_crt_node_write_latched = false,*advance_to_page = child_page, return. error:label (anygoto error): unfixnew_page1/new_page2first, thenlog_sysop_abortif a system op is open, then unfixchild_page.
The two promotion early returns are the only paths that restart from the root, and they restart on latch contention, never capacity. The split bracket is opened by log_sysop_start (is_system_op_started = true) inside the need_split block, closed by log_sysop_commit per advance branch or by the error: label:
// btree_split_node_and_advance -- src/storage/btree.c (error: label)if (new_page1 != NULL) pgbuf_unfix_and_init (thread_p, new_page1);if (new_page2 != NULL) pgbuf_unfix_and_init (thread_p, new_page2);if (is_system_op_started) log_sysop_abort (thread_p); /* <- unfix new pages FIRST so abort can deallocate them */if (child_page != NULL) pgbuf_unfix_and_init (thread_p, child_page);The ordering is load-bearing: new pages must be unfixed before log_sysop_abort, because abort deallocates them and the fix count must be zero.
6.9 Chapter summary — key takeaways
Section titled “6.9 Chapter summary — key takeaways”- Pessimistic split-on-descent splits any node that might not fit the worst-case entry before advancing into it, so the leaf is always guaranteed room and chain splits are structurally impossible.
btree_get_max_new_data_sizeis the estimator behind the guarantee, branching on node type and insert purpose; descent passesknown_to_be_found = falseto reserve the full new-key cost.btree_node_split_info{pivot, index} is a CMA accumulator;btree_split_find_pivotturns it into a target byte size (balanced in[0.20,0.80], clamped to[0.05,0.95]),btree_split_next_pivotupdates it from each separator’s landing slot.btree_find_split_pointis size-balanced: it reserves space for the new entry and fences on both sides, keeps ≥1 non-fence record per side, and returns a prefix-truncated separator.btree_split_nodeundo-logs Q whole (RVBT_COPYPAGE), moves the upper half to R, writes fences, redo-logs both, promotes the separator into P (RVBT_NDRECORD_INS), and relinks the leaf chain (§6.7).btree_split_rootgrows height in place: two children, a dummy neg-inf separator in slot 1 plus the real one in slot 2,node_levelbump, record-set redo.btree_split_node_and_advancehas eight live exit paths across nine branch entries (theerror:label is the shared abort sink): two latch-promotion root-restarts, max-key-len propagation, two split commits, theassert_release(false)impossible path, and the no-split read-latch fall-through — all in one system-op bracket whoseerror:label unfixes new pages beforelog_sysop_abort.
Chapter 7: Range Scan and LSA Aware Resume
Section titled “Chapter 7: Range Scan and LSA Aware Resume”This chapter answers the read-side question: once data is on disk, how
does a range scan walk leaves and overflow pages key-by-key, and resume
safely after releasing every latch across a client roundtrip? The answer
is one generic driver, btree_range_scan, parameterized by a key-processing
callback (BTREE_RANGE_SCAN_PROCESS_KEY_FUNC). The driver owns positioning,
latching, advancement, and page-LSA-aware resume; the callback owns
per-key work (visible OIDs for SELECT, any FK match for FK checks). All
persistent cursor state lives in one BTREE_SCAN (BTS), so a scan drops all
latches between iterations and resumes where it left off. For latch-coupling
descent see cubrid-btree.md §“Descent — latch-coupling on read”.
7.1 The cursor structs — btree_scan and btree_keyrange
Section titled “7.1 The cursor structs — btree_scan and btree_keyrange”btree_keyrange (src/storage/btree.h) is the “what to scan” descriptor;
btree_scan (BTS) is the large “where am I now” cursor. Fields tagged
/* TO BE REMOVED */ in the header are legacy-only and marked legacy.
btree_keyrange field | Role | Why it exists |
|---|---|---|
range | RANGE enum, each endpoint open/closed/infinite | Captures all bound combos; positioning branches on it alone |
lower_key | Lower-bound DB_VALUE, or NULL | NULL = no lower limit; btree_range_scan_start branches on this |
upper_key | Upper-bound DB_VALUE, or NULL | NULL = no upper limit; range test never trips on top |
num_index_term | Count of leading index columns bound | MIDXKEY can range on a prefix; tells comparator how many columns matter |
The table below is the authority for every btree_scan member:
| Field | Role | Why it exists |
|---|---|---|
btid_int | Resolved tree id + key domain | Avoids re-reading the root header per key (Ch 1) |
read_uncommitted | Legacy dirty-read flag | legacy — dies with btree_prepare_next_search |
P_vpid/P_page | Previous leaf vpid/ptr | legacy — dies with btree_find_next_index_record |
C_vpid | Current leaf VPID | Persists across unfix — anchor resume re-fixes |
O_vpid | Current overflow VPID | Park point when a key’s OIDs overflow one buffer (mid-key resume) |
C_page | Current leaf ptr | NULL when no latch held; set only while walking |
slot_id | Current slot in C_page | Re-validated after a latch re-acquire |
oid_pos | Legacy OID cursor | legacy — superseded by callback OID walk |
cur_key | Key the cursor is parked on | Resume anchor when slot_id is untrustworthy |
clear_cur_key | Ownership flag for cur_key | Tells btree_scan_clear_key whether to free |
cur_common_prefix_page_vpid/_lsa | VPID+LSA of page whose common prefix is cached (debug builds only) | Present only under CHECK_VERIFY_COMMON_PREFIX_PAGE_INFO; asserts the cached prefix still belongs to the current page |
common_prefix_size | Length of page’s common prefix | COMMON_PREFIX_UNKNOWN until computed |
common_prefix_key | Shared prefix bytes for the page | Combined with a compressed cur_key to rebuild the full key |
clear_common_prefix_key | Ownership flag for prefix | Frees the prefix copy when page changes |
is_cur_key_compressed | cur_key is prefix-stripped | true → btree_check_decompress_key re-attaches prefix before unfix |
attid_idxs | attr-id → column-position map | Filters reference base columns by attr-id |
is_cur_key_copied | cur_key owns a heap copy | Copy survives C_page unfix; set in §7.8 |
key_range/key_filter | What to scan / per-key predicate | Range positioning; filter rejects in-range failures |
key_filter_storage | Backing FILTER_INFO | Owns the struct so key_filter is stable across iterations |
use_desc_index | Descending scan flag | Selects prev_vpid + conditional prev-leaf fix path |
restart_scan | Legacy restart counter | legacy |
read_keys/qualified_keys | Query-trace counters | Bumped in §7.7; feed EXPLAIN/trace stats |
key_range_max_value_equal | Upper bound matched exactly | Lets next advance end the scan without reading another key |
cur_leaf_lsa | C_page LSA at last unfix | Decides reuse vs. re-locate on resume |
lock_mode | S_LOCK or X_LOCK | From the scan’s isolation needs |
key_record | Peeked RECDES for the key | Avoids re-reading the slot in OID-walking callbacks |
need_to_check_null | Range can match SQL NULL | Toggles the NULL branch in range qualification |
leaf_rec_info | Decoded LEAF_REC | .ovfl is the head of the key’s OID overflow chain |
node_type | Always BTREE_LEAF_NODE | Asserts cursor never parks on a non-leaf |
offset | Byte offset to first OID | Where OID processing starts in key_record |
key_status | NOT_VERIFIED/VERIFIED/CONSUMED | Per-key state machine (§7.2) |
end_scan | Whole range exhausted | Driver terminates; C_vpid reset, key cleared |
end_one_iteration | OID buffer full | Return partial results, keep C_vpid/cur_leaf_lsa |
is_interrupted | Callback returned mid-key | Breaks inner loop without advancing |
is_key_partially_processed | Key’s OIDs spilled across iterations | Resume re-enters callback at O_vpid, not the leaf |
n_oids_read/..._last_iteration | Cumulative / per-iter counts | Per-iter is btree_keyval_search’s return; debits key_limit_upper |
oid_ptr | Write head into caller’s buffer | Reset to oid_list->oidp per iteration |
match_class_oid | Class filter for hierarchical uniques | Skips OIDs from other classes in a partition hierarchy |
key_limit_lower/_upper | LIMIT/OFFSET push-down | _upper decremented by per-iter count at exit |
index_scan_idp | Owning index scan | Source of OID buffer, snapshot, opt mode (ISS/MRO/covering) |
is_btid_int_valid | btid_int resolved | Guards re-resolution on first use |
is_scan_started | Start vs. resume selector | false until start succeeds; then resume is used |
force_restart_from_root | Desc fast path failed | Forces full re-locate on next resume |
is_fk_remake | Deduplicate-mode FK retry | Retries FK existence with rebuilt key |
time_track | Timing | Feeds PSTAT_BT_* |
bts_other | Per-purpose extra state | FK: points at the BTREE_FIND_FK_OBJECT result |
7.2 The key_status state machine
Section titled “7.2 The key_status state machine”BTS_KEY_STATUS (bts_key_status, src/storage/btree.h) is three-valued:
BTS_KEY_IS_NOT_VERIFIED (slot reached, range/filter not yet checked),
BTS_KEY_IS_VERIFIED (passed range AND filter, ready to process),
BTS_KEY_IS_CONSUMED (fully processed, advance to next slot).
stateDiagram-v2 [*] --> NOT_VERIFIED : positioned on a slot NOT_VERIFIED --> VERIFIED : range ok and filter ok NOT_VERIFIED --> NOT_VERIFIED : fence or filtered, advance slot VERIFIED --> CONSUMED : key_func read all OIDs CONSUMED --> NOT_VERIFIED : advance to next slot VERIFIED --> VERIFIED : resume re-reads same record CONSUMED --> [*] : range exhausted, end_scan
Figure 7-1 — key_status transitions. advance_over_filtered_keys drives
NOT_VERIFIED to VERIFIED; the callback drives VERIFIED to CONSUMED.
Key invariant: the driver only calls key_func when
key_status == BTS_KEY_IS_VERIFIED. Asserted in btree_range_scan right
before the callback. Every path to the callback first routes through
btree_range_scan_advance_over_filtered_keys, which sets VERIFIED only
after range AND filter pass. Violate it and a callback could emit OIDs for
a fence or out-of-range key — rows that do not belong to the result.
7.3 The single-key wrapper — btree_keyval_search and btree_scan_update_range
Section titled “7.3 The single-key wrapper — btree_keyval_search and btree_scan_update_range”A point lookup (WHERE k = 5) is a degenerate range:
// btree_keyval_search — src/storage/btree.crc = btree_prepare_bts (thread_p, bts, btid, isidp, kv_range, filter, ...);rc = btree_range_scan (thread_p, bts, btree_range_scan_select_visible_oids);return bts->n_oids_read_last_iteration; /* <- OID count, iterative */It is “just a GE_LE range search with the same key.” Returning
n_oids_read_last_iteration lets a caller loop while !BTREE_END_OF_SCAN(bts)
to drain a key whose OIDs span buffer-fills. btree_scan_update_range
installs a new range into an existing BTS (multiple-range optimization): it
validates range, nulls SQL-NULL bounds, and swaps bounds for
descending scans:
// btree_scan_update_range — src/storage/btree.cif ((bts->use_desc_index && !BTREE_IS_PART_KEY_DESC (&bts->btid_int)) || (!bts->use_desc_index && BTREE_IS_PART_KEY_DESC (&bts->btid_int))) { /* XOR: scan-desc != column-desc */ range_reverse (bts->key_range.range); /* GE_LE becomes LE_GE etc. */ swap_key = bts->key_range.lower_key; /* <- swap bounds */ bts->key_range.lower_key = bts->key_range.upper_key; bts->key_range.upper_key = swap_key; }The swap fires only when exactly one of “scan descending” / “column stored descending” holds; if both hold the physical walk is already forward.
7.4 Starting a scan — btree_range_scan_start
Section titled “7.4 Starting a scan — btree_range_scan_start”Positions on the first eligible key; called only while
is_scan_started == false. The /* (X) */ markers are the branch map.
// btree_range_scan_start — src/storage/btree.cbts->key_status = BTS_KEY_IS_NOT_VERIFIED;if (bts->key_range.lower_key == NULL) { /* (A) no lower bound: jump to lowest leaf */ btree_find_lower_bound_leaf (thread_p, bts, NULL); if (bts->end_scan) { return NO_ERROR; } /* empty index */ }else { /* (B) lower bound: descend to leaf */ btree_locate_key (..., bts->key_range.lower_key, ..., &found); if (!found) { if (bts->use_desc_index) { bts->slot_id--; } } /* (B1) desc: first-smaller */ else if (bts->key_range.range == GT_LT || bts->key_range.range == GT_LE || bts->key_range.range == GT_INF) { bts->key_status = BTS_KEY_IS_CONSUMED; } /* (B2) strict GT_*: skip exact */ }btree_range_scan_advance_over_filtered_keys (thread_p, bts);if (bts->force_restart_from_root) { /* (C) desc couldn't position: restart */ assert (bts->use_desc_index); bts->key_status = BTS_KEY_IS_NOT_VERIFIED; return NO_ERROR; /* is_scan_started stays false -> driver loops */ }bts->is_scan_started = true; /* (D) success */return NO_ERROR;All four outcomes funnel into advance_over_filtered_keys. Only (C)
leaves is_scan_started false — a descending start that hit the prev-leaf
fix path without progress, which the driver retries.
7.5 The generic driver — btree_range_scan
Section titled “7.5 The generic driver — btree_range_scan”The outer loop runs until end_scan (range done) or end_one_iteration
(buffer full, return to client); the inner loop drains keys, calling the
callback for each VERIFIED key.
flowchart TB
START["reset end_scan/end_one_iteration\nn_oids_read_last_iteration=0\noid_ptr=oid_list.oidp"] --> OUTER{"!end_scan and\n!end_one_iteration?"}
OUTER -->|no| EXIT["save cur_leaf_lsa, unfix pages\ndebit key_limit_upper"]
OUTER -->|yes| POS{"is_scan_started?"}
POS -->|no| STARTF["btree_range_scan_start"]
POS -->|yes| RESUME["btree_range_scan_resume"]
STARTF --> CHK
RESUME --> CHK{"end_scan?"}
CHK -->|yes| EXIT
CHK -->|no| FRR{"force_restart_from_root?"}
FRR -->|yes| UNFIXC["unfix C_page, continue outer"]
UNFIXC --> OUTER
FRR -->|no| INNER["INNER: key_func(bts)"]
INNER --> ICHK{"interrupted /\nend_one_iteration /\nend_scan?"}
ICHK -->|yes| EXIT
ICHK -->|no| ADV["advance_over_filtered_keys"]
ADV --> ACHK{"end_scan?"}
ACHK -->|yes| EXIT
ACHK -->|no| AFRR{"force_restart_from_root?"}
AFRR -->|yes| UNFIXC
AFRR -->|no| INNER
Figure 7-2 — btree_range_scan control flow. Outer loop chooses start vs.
resume; inner loop calls key_func then advances. Both
force_restart_from_root arms unfix C_page and re-enter the outer loop.
// btree_range_scan — src/storage/btree.cwhile (true) { assert (bts->key_status == BTS_KEY_IS_VERIFIED); /* invariant 7.2 */ error_code = key_func (thread_p, bts); /* <- the callback */ if (bts->is_interrupted || bts->end_one_iteration || bts->end_scan) { break; } assert (bts->key_status == BTS_KEY_IS_CONSUMED); btree_range_scan_advance_over_filtered_keys (thread_p, bts); if (bts->force_restart_from_root) { /* unfix C_page, decompress cur_key */ break; } }At the end: label the driver always, in order: captures
cur_leaf_lsa = pgbuf_get_lsa(C_page) before unfixing; decompresses
cur_key if end_one_iteration; unfixes C_page/P_page; on end_scan
nulls C_vpid and clears is_scan_started/the key; debits key_limit_upper
by n_oids_read_last_iteration.
Key invariant: cur_leaf_lsa is captured at the same instant C_page
is unfixed. Unfix without first copying the LSA and a later resume would
compare a stale LSA, conclude “unchanged,” and reuse a slot_id now
pointing at a different key. The end: block guarantees copy-then-unfix;
the no-page branch nulls cur_leaf_lsa so resume must re-locate from root.
7.6 Resuming — btree_range_scan_resume and the LSA decision
Section titled “7.6 Resuming — btree_range_scan_resume and the LSA decision”Re-establish position without holding a latch the whole time. The
decision hinges on cur_leaf_lsa; /* (X) */ is the branch map.
// btree_range_scan_resume — src/storage/btree.cif (!bts->force_restart_from_root) { pgbuf_fix_if_not_deallocated (..., &bts->C_vpid, ..., &bts->C_page); if (bts->C_page != NULL) { if (LSA_EQ (&bts->cur_leaf_lsa, pgbuf_get_lsa (bts->C_page))) { return btree_range_scan_advance_over_filtered_keys (...); } /* (1) UNCHANGED */ if (BTREE_IS_PAGE_VALID_LEAF (thread_p, bts->C_page)) { /* (2) still a leaf but changed */ btree_leaf_is_key_between_min_max (..., &bts->cur_key, &search_key); if (search_key.result == BTREE_KEY_BETWEEN) { btree_search_leaf_page (..., &bts->cur_key, &search_key); } switch (search_key.result) { case BTREE_KEY_FOUND: /* (2a) key still here: reset slot */ bts->slot_id = search_key.slotid; return btree_range_scan_advance_over_filtered_keys (...); case BTREE_KEY_BETWEEN: /* (2b) key deleted, neighbor here */ bts->slot_id = bts->use_desc_index ? search_key.slotid - 1 : search_key.slotid; bts->key_status = BTS_KEY_IS_NOT_VERIFIED; return btree_range_scan_advance_over_filtered_keys (...); case BTREE_KEY_SMALLER: case BTREE_KEY_BIGGER: case BTREE_KEY_NOTFOUND: break; /* (2c) migrated off page: re-locate */ } } pgbuf_unfix_and_init (thread_p, bts->C_page); /* (3) reused as non-leaf */ } /* (4) C_page deallocated */ }bts->force_restart_from_root = false;btree_locate_key (..., &bts->cur_key, ..., &found);if (found) { return btree_range_scan_advance_over_filtered_keys (...); }if (btree_node_number_of_keys (thread_p, bts->C_page) < 1) { bts->end_scan = true; return NO_ERROR; } /* (5) index emptied under us */if (bts->use_desc_index) { bts->slot_id--; }bts->key_status = BTS_KEY_IS_NOT_VERIFIED;return btree_range_scan_advance_over_filtered_keys (...);(1) fast path — leaf byte-for-byte unchanged, slot_id/key_status
hold. (2a)/(2b) re-derive the slot by searching for cur_key on a
changed-but-still-leaf page. (2c)/(3)/(4)/(5) re-descend from root via
btree_locate_key, then advance, end, or position on the insertion point.
Key invariant: resume re-anchors on cur_key, never on slot_id
alone. slot_id is trusted only in branch (1), where the LSA proved no
change; elsewhere the slot is re-derived by searching for cur_key. A stale
slot would skip or duplicate keys whenever a concurrent split/merge shifted
slot numbers.
7.7 Advancing over fence and filtered keys — btree_range_scan_advance_over_filtered_keys
Section titled “7.7 Advancing over fence and filtered keys — btree_range_scan_advance_over_filtered_keys”Turns “positioned somewhere” into “positioned on the next VERIFIED key, or
end_scan.” Three phases.
Resumed-VERIFIED early return — if already VERIFIED (page reused, re-fixed without re-checking), re-peek and return; otherwise set direction and step a CONSUMED slot:
// btree_range_scan_advance_over_filtered_keys — src/storage/btree.cif (bts->key_status == BTS_KEY_IS_VERIFIED) { spage_get_record (..., bts->slot_id, &bts->key_record, PEEK); btree_range_scan_read_record (thread_p, bts); /* page may have moved bytes */ return NO_ERROR; /* keep current key */ }if (bts->key_range_max_value_equal) { bts->end_scan = true; return NO_ERROR; } /* exact upper bound returned */inc_slot = bts->use_desc_index ? -1 : 1;if (bts->key_status == BTS_KEY_IS_CONSUMED) { bts->slot_id += inc_slot; }next_vpid = bts->use_desc_index ? node_header->prev_vpid : node_header->next_vpid;Page-crossing loop — when slot_id runs off the page, fix the next
leaf; VPID_ISNULL ends the scan. Ascending latch-couples directly,
descending defers to §7.8:
while (bts->slot_id <= 0 || bts->slot_id > key_count || key_count == 0) { if (VPID_ISNULL (&next_vpid)) { bts->end_scan = true; return NO_ERROR; } /* consumed */ if (bts->use_desc_index) { btree_range_scan_descending_fix_prev_leaf (..., &key_count, &node_header, &next_vpid); if (bts->force_restart_from_root) { return NO_ERROR; } } else { next_node_page = pgbuf_fix (..., &next_vpid, ...); pgbuf_unfix (thread_p, bts->C_page); /* latch-couple to next */ bts->C_page = next_node_page; VPID_COPY (&bts->C_vpid, &next_vpid); bts->slot_id = 1; next_vpid = node_header->next_vpid; } }Per-key qualification — peek; skip fences; else read, count, apply range + filter:
spage_get_record (..., bts->slot_id, &bts->key_record, PEEK);if (!btree_leaf_is_flaged (&bts->key_record, BTREE_LEAF_RECORD_FENCE)) { btree_range_scan_read_record (thread_p, bts); bts->read_keys++; btree_apply_key_range_and_filter (..., &is_range_satisfied, &is_filter_satisfied); if (!is_range_satisfied) { bts->end_scan = true; return NO_ERROR; } /* past range */ if (is_filter_satisfied) { bts->qualified_keys++; bts->key_status = BTS_KEY_IS_VERIFIED; /* <- only success exit */ return NO_ERROR; } /* filter failed: fall through */ }bts->slot_id += inc_slot; /* fence/filter-fail: try next slot */Fence keys are page-edge sentinels (see cubrid-btree.md §“Node layout”),
skipped silently. A failed range test ends the scan; a failed filter
advances and loops. Descending page-crossing may set
force_restart_from_root via §7.8.
7.8 Reading the record and the descending prev-leaf fix
Section titled “7.8 Reading the record and the descending prev-leaf fix”btree_range_scan_read_record decodes the peeked slot into cur_key:
variable-length text/bit/midxkey use PEEK_KEY_VALUE (decompression may
still allocate, tracked by is_cur_key_copied); fixed types use
COPY_KEY_VALUE:
// btree_range_scan_read_record — src/storage/btree.cswitch (TP_DOMAIN_TYPE (bts->btid_int.key_type)) { case DB_TYPE_MIDXKEY: case DB_TYPE_VARCHAR: case DB_TYPE_CHAR: case DB_TYPE_BIT: case DB_TYPE_VARBIT: { int ret = btree_read_record_in_leafpage (thread_p, bts->C_page, PEEK_KEY_VALUE, bts); bts->is_cur_key_copied = bts->cur_key.need_clear; /* compressed -> copied */ return ret; } default: break; }bts->is_cur_key_copied = true;return btree_read_record_in_leafpage (thread_p, bts->C_page, COPY_KEY_VALUE, bts);A copied/decompressed key survives the C_page unfix — which is why the
driver runs btree_check_decompress_key before unfixing on
end_one_iteration: resume must search the page for a complete cur_key.
btree_range_scan_descending_fix_prev_leaf exists because descending scans
walk against the forward sibling links and would deadlock with forward
scans under unconditional latches. It tries a conditional latch on the
previous leaf, dropping the current latch and re-acquiring in safe order
only on failure:
// btree_range_scan_descending_fix_prev_leaf — src/storage/btree.cprev_leaf = pgbuf_fix (..., &prev_leaf_vpid, ..., PGBUF_CONDITIONAL_LATCH);if (prev_leaf != NULL) { /* (A) conditional latch won: common case */ pgbuf_unfix_and_init (thread_p, bts->C_page); bts->C_page = prev_leaf; bts->slot_id = *key_count = btree_node_number_of_keys (...); /* last key of prev */ VPID_COPY (next_vpid, &(*node_header_ptr)->prev_vpid); return NO_ERROR; }/* (B) conditional failed: drop current, re-fix prev unconditionally */btree_check_decompress_key (bts); /* cur_key must survive the unfix */pgbuf_unfix_and_init (thread_p, bts->C_page);pgbuf_fix_if_not_deallocated (..., &prev_leaf_vpid, ..., &prev_leaf);if (prev_leaf == NULL) { bts->force_restart_from_root = true; return NO_ERROR; } /* (B1) freed */if (!BTREE_IS_PAGE_VALID_LEAF (thread_p, prev_leaf)) { bts->force_restart_from_root = true; ...; return NO_ERROR; } /* (B2) reused non-leaf */*node_header_ptr = btree_get_node_header (thread_p, prev_leaf);if (!VPID_EQ (&(*node_header_ptr)->next_vpid, &bts->C_vpid)) { bts->force_restart_from_root = true; ...; return NO_ERROR; } /* (B3) relinked */(A) wins without waiting, positioning on the previous page’s last key.
(B) drops the current latch (after decompressing cur_key) and re-fixes
unconditionally; three latchless hazards each force a root restart — (B1)
freed, (B2) reused as non-leaf, (B3) leaves no longer linked.
Key invariant: descending advancement never holds two leaf latches in
the forward link direction. It grabs the previous page conditionally or
releases the current page before waiting — the rule that keeps a descending
scan from deadlocking against ascending scans and writers latching in
next_vpid order. When it cannot make progress safely it sets
force_restart_from_root and re-descends from root: slow but safe.
7.9 The SELECT callback — btree_range_scan_select_visible_oids
Section titled “7.9 The SELECT callback — btree_range_scan_select_visible_oids”The BTREE_RANGE_SCAN_PROCESS_KEY_FUNC for SELECTs: for the current
VERIFIED key, walk its OID list (leaf inline OIDs, then the overflow chain)
and copy the MVCC-visible ones into the caller’s buffer. Because a key’s
OID list can exceed the buffer, it carries partial-key resume via
O_vpid and is_key_partially_processed.
flowchart TB
ENTER["key_func entry (VERIFIED key)"] --> ISS{"ISS distinct-key op?"}
ISS -->|yes| EOUT["set key, n=1, end_scan"]
ISS -->|no| PART{"is_key_partially_processed?"}
PART -->|yes| RESOVF["re-fix O_vpid\nget next overflow vpid"]
PART -->|no| SOFT{"soft cap exceeded\nby leaf+1ovf?"}
SOFT -->|yes| ONEIT["end_one_iteration"]
SOFT -->|no| LEAF["process leaf OIDs"]
LEAF --> OVF["overflow loop"]
RESOVF --> OVF
OVF --> HARD{"hard cap exceeded\nby this ovf page?"}
HARD -->|yes| SAVE["O_vpid=last_visible\npartial=true, end_one_iteration"]
HARD -->|no| PROC["process page OIDs"]
PROC --> MORE{"more overflow pages?"}
MORE -->|yes| OVF
MORE -->|no| DONE["O_vpid=NULL\nkey_status=CONSUMED"]
Figure 7-3 — btree_range_scan_select_visible_oids. The soft cap stops
before a key when the buffer is nearly full (end_one_iteration); the
hard cap suspends inside a key’s overflow chain, saving O_vpid for
partial-key resume.
Leaf and overflow OIDs are processed by btree_record_process_objects with
btree_select_visible_object_for_range_scan, which builds an MVCC header,
checks the snapshot, applies match_class_oid and key-limit filters, and
copies visible OIDs via BTS_SAVE_OID_IN_BUFFER.
7.10 The FK callback — btree_range_scan_find_fk_any_object
Section titled “7.10 The FK callback — btree_range_scan_find_fk_any_object”An FK check only needs to know whether any referenced row exists, so the
FK callback stops at the first qualifying object, using bts->bts_other
(a BTREE_FIND_FK_OBJECT) as the result holder:
// btree_range_scan_find_fk_any_object — src/storage/btree.cfk_arg.bts = bts; fk_arg.ovfl_page = NULL;btree_record_process_objects (..., BTREE_LEAF_NODE, ..., &stop, btree_fk_object_does_exist, &fk_arg); /* leaf */if (stop == true) { return NO_ERROR; } /* found -> done, no overflow */VPID_COPY (&ovf_vpid, &bts->leaf_rec_info.ovfl);while (!VPID_ISNULL (&ovf_vpid)) { /* hand-over-hand chain walk */ fk_arg.ovfl_page = pgbuf_fix (..., &ovf_vpid, ...); btree_record_process_objects (..., BTREE_OVERFLOW_NODE, ..., btree_fk_object_does_exist, &fk_arg); if (stop) { break; } /* found -> done */ btree_get_next_overflow_vpid (thread_p, fk_arg.ovfl_page, &ovf_vpid); }if (bts->end_scan) { return NO_ERROR; }if (!bts->is_interrupted) { /* fully scanned, none found */ assert (OID_ISNULL (&((BTREE_FIND_FK_OBJECT *) bts->bts_other)->found_oid)); if (bts->is_fk_remake) { /* deduplicate-mode retry with rebuilt key */ } }A leaf match sets stop; the callback returns without touching the chain.
Otherwise it walks the chain (same hand-over-hand fix) and stops on the
first match. Fully scanned, none found, not interrupted → found_oid NULL;
under deduplicate-key mode (is_fk_remake) it retries with the rebuilt key.
The FK callback never sets end_one_iteration — an FK check fits one key’s
work and needs no partial-key resume.
7.11 Chapter summary — key takeaways
Section titled “7.11 Chapter summary — key takeaways”- One generic driver, two callbacks.
btree_range_scanowns positioning, latching, advancement, and resume;btree_range_scan_select_visible_oidsdoes per-key SELECT work andbtree_range_scan_find_fk_any_objectdoes FK existence.btree_keyval_searchis just a single-key range over the SELECT callback. cur_leaf_lsais the resume contract. Captured at the exact instantC_pageis unfixed; on resume,LSA_EQagainst the re-fixed page is the fast path, and any mismatch forces re-findingcur_keyor a root re-locate.key_statusis a strict three-state machine. VERIFIED is asserted before every callback and CONSUMED after; onlyadvance_over_filtered_keyspromotes NOT_VERIFIED to VERIFIED, after range AND filter pass.- Resume re-anchors on
cur_key, notslot_id. Slot numbers are trusted only when the LSA proves the page unchanged; elsewhere the slot is re-derived by searching for the key, so concurrent splits/merges never make the scan skip or duplicate rows. - Descending scans avoid deadlock by conditional prev-leaf latching.
btree_range_scan_descending_fix_prev_leaftries a conditional latch and, failing, releases its current latch before re-acquiring in safe order; a changed leaf chain forcesforce_restart_from_root. - Two capacity limits split a large key across iterations. The soft
limit stops before a key when the buffer is nearly full
(
end_one_iteration); the hard limit suspends inside a key’s overflow chain, savingO_vpidandis_key_partially_processedso the next iteration resumes mid-key — streaming millions of OIDs without holding a latch across roundtrips. - Fence keys are invisible to scans.
advance_over_filtered_keysskips anyBTREE_LEAF_RECORD_FENCEslot, so the page-edge sentinels from splits (Ch 6) never leak into a result set.
Chapter 8: Deleting an Object Logical and Physical
Section titled “Chapter 8: Deleting an Object Logical and Physical”How does CUBRID take a single OID out of a key? A logical delete only
stamps a delete MVCCID so older readers still see the object; a physical
delete splices the object bytes out, and when the last OID leaves it removes
the whole key. One orchestrator, btree_delete_internal, drives both plus
the undo (re-insert) and vacuum paths, dispatching on BTREE_OP_PURPOSE.
For the record layout (key‖OID, unique first-object rule, overflow chain)
see cubrid-btree.md § “CUBRID’s Approach”.
8.1 One orchestrator, seven purposes
Section titled “8.1 One orchestrator, seven purposes”btree_delete_internal builds a BTREE_DELETE_HELPER, switch (purpose)
picks the per-key worker, and hands it to
btree_search_key_and_apply_functions (the shared descent machine from
insert, Chapter 3).
// btree_delete_internal -- src/storage/btree.cswitch (purpose) { case BTREE_OP_DELETE_OBJECT_PHYSICAL_POSTPONED: case BTREE_OP_DELETE_UNDO_INSERT: LSA_COPY (&delete_helper.reference_lsa, ref_lsa); /* <- CLR anchor */ [[fallthrough]]; case BTREE_OP_DELETE_OBJECT_PHYSICAL: case BTREE_OP_DELETE_VACUUM_OBJECT: key_func = btree_key_delete_remove_object; break; /* <- physical path */ case BTREE_OP_DELETE_UNDO_INSERT_UNQ_MULTIUPD: LSA_COPY (&delete_helper.reference_lsa, ref_lsa); key_func = btree_key_remove_object_and_keep_visible_first; break; case BTREE_OP_DELETE_VACUUM_INSID: key_func = btree_key_remove_insert_mvccid; break; /* <- vacuum insid */ case BTREE_OP_DELETE_UNDO_INSERT_DELID: LSA_COPY (&delete_helper.reference_lsa, ref_lsa); key_func = btree_key_remove_delete_mvccid; break; /* <- logical undo */ default: assert_release (false); return ER_FAILED; }The seven cases collapse to four leaf workers:
key_func | Purposes | Effect |
|---|---|---|
btree_key_delete_remove_object | PHYSICAL, PHYSICAL_POSTPONED, UNDO_INSERT, VACUUM_OBJECT | splice OID out, remove key if last; PHYSICAL logs full undo, rest are CLR/vacuum |
btree_key_remove_object_and_keep_visible_first | UNDO_INSERT_UNQ_MULTIUPD | splice + restore visible first object |
btree_key_remove_delete_mvccid | UNDO_INSERT_DELID | logical: clear delete MVCCID |
btree_key_remove_insert_mvccid | VACUUM_INSID | logical: clear insert MVCCID |
The post-descent tail: free printed_key; on error log and return; else
bump PSTAT_BT_NUM_DELETES, publish *unique, and run the subtle revert
branch:
// btree_delete_internal -- src/storage/btree.cif (delete_helper.check_key_deleted && !delete_helper.is_key_deleted) { /* <- key still has a visible object: undo the optimistic decrement */ delete_helper.unique_stats_info->insert_key_and_row (); /* <- revert key */ delete_helper.unique_stats_info->delete_row (); /* <- keep row delta */ }flowchart TB
A["btree_delete_internal\nswitch(purpose)"] --> B["btree_search_key_and_apply_functions"]
B --> C["btree_fix_root_for_delete"]
C --> D["btree_merge_node_and_advance\ndescend, merge under-full -- Ch 9"]
D --> E{"key_func"}
E -->|physical| F["btree_key_delete_remove_object"]
E -->|undo delid| G["btree_key_remove_delete_mvccid"]
E -->|vacuum insid| H["btree_key_remove_insert_mvccid"]
E -->|unq multiupd| I["btree_key_remove_object_and_keep_visible_first"]
Figure 8-1 — Orchestration. The purpose chooses the leaf worker; the descent machinery is shared with insert.
8.2 Reaching the leaf
Section titled “8.2 Reaching the leaf”btree_fix_root_for_delete runs once per root fix. First search loads
B-tree info (btree_fix_root_with_info); a restart re-fixes and returns
early. It unpacks the key, sets the domain, computes is_null.
Several purposes stop here: both vacuum purposes return NO_ERROR before
stat work; undo/postpone purposes return after patching a NULL class OID to
topclass_oid (omitted from the undo image when it equals topclass). Only
PHYSICAL falls through to unique stats, where MULTI_ROW_UPDATE arms the
check_key_deleted/is_key_deleted pair (§8.1). A NULL key sets *stop.
btree_merge_node_and_advance is the advance callback (merge logic is
Ch 9). The leaf-reaching branch: at node_level == 1 it calls
btree_search_leaf_page, then promotes a root read latch via
pgbuf_promote_read_latch. On ER_PAGE_LATCH_PROMOTE_FAIL it sets
*restart and flips nonleaf_latch_mode to PGBUF_LATCH_WRITE so the retry
descends write-latched; else sets *is_leaf, leaving it write-latched for
key_func.
Invariant — the leaf is write-latched before any mutation. Every
key_func asserts pgbuf_get_latch_mode (*leaf_page) >= PGBUF_LATCH_WRITE;
the promote-or-restart branch guarantees it. A read-latched worker would
corrupt the page against a concurrent writer.
8.3 Physical delete dispatch — btree_key_delete_remove_object
Section titled “8.3 Physical delete dispatch — btree_key_delete_remove_object”Finds the object, prepares logging, removes it, recounts:
- Key found?
result == BTREE_KEY_FOUND→ copy record,btree_read_record,btree_find_oid_and_its_pagelocates the OID (possibly overflow) intofound_page/prev_found_page/offset_to_object; else the offset isNOT_FOUND. - Object not found: for
VACUUM_OBJECTbenign (already vacuumed) — warn andgoto exit; elseassert_release(false)+ER_BTREE_UNKNOWN_KEY. - Found: a
SERVER_MODEassert checks the lock is held (vacuum/recovery/ own-insert-into-non-unique are lock-free); a second enforces the unique first-object rule (§8.7). - Undo logging: only
PHYSICALsaves a key-value image viabtree_rv_save_keyval_for_undointorv_keyval_data(rollback re-inserts the OID); vacuum/undo-of-insert are compensation, so= NULL. - Remove:
node_type = (*leaf_page == found_page) ? LEAF : OVERFLOW, thenbtree_key_remove_object(§8.4); unfix other pages. - Recount: for
MULTI_ROW_UPDATE, re-read the leaf,btree_get_num_visible_from_leaf_and_ovf(dirty snapshot); ifnum_visible_oids > 0setis_key_deleted = false(§8.1 fixes the stat). exit: idempotent unfix of both pages, freerv_keyval_data.
8.4 btree_key_remove_object — leaf vs. overflow fork
Section titled “8.4 btree_key_remove_object — leaf vs. overflow fork”A two-line dispatcher (Figure 8-2): leaf → btree_leaf_remove_object,
overflow → btree_overflow_remove_object. A leaf record must never be left
empty (unique first-object rule); an overflow page may shrink to nothing.
flowchart TB
K["btree_key_remove_object\nnode_type?"] --> L{"BTREE_LEAF_NODE?"}
L -->|yes| LO["btree_leaf_remove_object"]
L -->|no| OO["btree_overflow_remove_object"]
LO --> LO0{"offset_to_object == 0?"}
LO0 -->|no| RR["btree_record_remove_object\nsplice non-first"]
LO0 -->|yes, only object, no ovfl| DK["btree_delete_key_from_leaf\nremove whole key"]
LO0 -->|yes, only object, has ovfl| RP["btree_replace_first_oid_with_ovfl_oid"]
LO0 -->|yes, more leaf objects| RL["btree_leaf_record_replace_first_with_last"]
OO --> OO1{"overflow has 1 object?"}
OO1 -->|yes| DEA["dealloc page + relink"]
OO1 -->|no| RR2["btree_record_remove_object"]
Figure 8-2 — Every physical-removal branch.
8.5 btree_leaf_remove_object — the four leaf branches
Section titled “8.5 btree_leaf_remove_object — the four leaf branches”This enforces the unique first-object invariant; removing the key clears
check_key_deleted. The four branches:
// btree_leaf_remove_object -- src/storage/btree.cif (offset_to_object == 0) { btree_record_get_last_object (..., &last_oid, ..., &offset_to_last_object); if (offset_to_last_object == 0) /* <- only object */ { if (VPID_ISNULL (&leaf_rec_info->ovfl)) btree_delete_key_from_leaf (...); /* <- remove the key (§8.8) */ else btree_replace_first_oid_with_ovfl_oid (...); /* <- pull from overflow (§8.6) */ } else btree_leaf_record_replace_first_with_last (...); /* <- first := last */ }else btree_record_remove_object (...); /* <- splice non-first */Invariant — a unique leaf record’s first object is the visible one, and the record is never empty while OIDs remain. The first slot has a distinct encoding (may carry the class OID and fixed-size MVCC info), so removing it always means replacement, never a raw splice. Violating it leaves a scan reading a malformed first object or a key with a live overflow link but no OID.
8.6 Pulling an OID up — btree_replace_first_oid_with_ovfl_oid
Section titled “8.6 Pulling an OID up — btree_replace_first_oid_with_ovfl_oid”When the leaf’s only object is removed but overflow OIDs remain, an overflow
object becomes the new leaf first object via a system-op sandwich:
btree_leaf_change_first_object swaps the overflow’s last object into the
leaf first slot,
btree_overflow_record_replace_object moves the object-to-delete into the
vacated overflow slot, commit, then remove it via btree_overflow_remove_object
(§8.7). exit_on_error aborts the system op; pure-undo purposes must not fail
here.
8.7 Overflow removal — btree_overflow_remove_object and friends
Section titled “8.7 Overflow removal — btree_overflow_remove_object and friends”Reads the overflow record (slot 1) and forks on whether this object is alone:
// btree_overflow_remove_object -- src/storage/btree.cif (overflow_record.length == BTREE_OBJECT_FIXED_SIZE (btid_int)) { assert (offset_to_object == 0); /* <- only first removable */ btree_get_next_overflow_vpid (thread_p, *overflow_page, &next_overflow_vpid); pgbuf_unfix_and_init (thread_p, *overflow_page); if (!delete_helper->is_system_op_started) { log_sysop_start (...); ... } file_dealloc (..., &overflow_vpid, FILE_BTREE); if (prev_page == leaf_page) btree_modify_leaf_ovfl_vpid (...); /* <- relink from leaf */ else btree_modify_overflow_link (...); /* <- relink from prev ovfl */ }else btree_record_remove_object (..., BTREE_OVERFLOW_NODE, offset_to_object, &ovf_addr);The error: label ends a self-started system op and asserts a non-no-undo
purpose.
btree_record_remove_object / _internal is the byte-level splice shared by
both paths: _internal computes the encoded size (OR_OID_SIZE + a second
OID for unique + MVCC bytes per flags), packs removed bytes as undo,
RECORD_MOVE_DATA closes the gap, packs redo; the wrapper spage_updates and
btree_rv_log_delete_objects.
Invariant — only the first object of an overflow page may be the
alone/last object removed by deallocation. _internal asserts
node_type == BTREE_OVERFLOW_NODE || offset_to_object > 0 — it refuses to
splice the leaf’s first object; that goes through replacement (§8.5/§8.6).
8.8 Removing the whole key — btree_delete_key_from_leaf
Section titled “8.8 Removing the whole key — btree_delete_key_from_leaf”When the last OID leaves, the key record is deleted. If
leafrec_pnt->key_len < 0 the key value lives in an overflow key page →
start a system op and btree_delete_overflow_key first; if a system op runs,
copy the record for undo before deleting. Then:
// btree_delete_key_from_leaf -- src/storage/btree.cif (spage_delete (thread_p, leaf_pg, search_key->slotid) != search_key->slotid) goto exit_on_error;key_cnt = btree_node_number_of_keys (thread_p, leaf_pg);if (key_cnt == 0) header->max_key_len = 0; /* <- no keys left */LOG_RV_RECORD_SET_MODIFY_MODE (&delete_helper->leaf_addr, LOG_RV_RECORD_DELETE);btree_rv_log_delete_object (thread_p, *delete_helper, delete_helper->leaf_addr, leaf_record.length, 0, leaf_record.data, NULL); /* <- undo = record */exit_on_error aborts a started system op. Only the page-local key_cnt
updates here; index-wide num_keys/num_oids deltas flow through the
unique-stats path (delete_key_and_row/delete_row, §8.2/§8.1).
8.9 Logical delete — btree_key_remove_delete_mvccid
Section titled “8.9 Logical delete — btree_key_remove_delete_mvccid”The logical half (BTREE_OP_DELETE_UNDO_INSERT_DELID) rolls back an
MVCC delete: the object stays, only its delete MVCCID clears. A miss is a
hard ER_BTREE_UNKNOWN_KEY (undo must find it). Forks on uniqueness:
- unique (
_unique→btree_remove_delete_mvccid_unique_internal): the visible object must be first, so undo may move it back — if already first, justbtree_record_remove_delid; else swap with the current first (fixed-size), under a system op with undoredo in an overflow page. Logged as compensate (log_append_compensate_with_undo_nxlsa, anchored toreference_lsa), or undoredo when a system op is open. - non-unique (
_non_unique): no ordering —btree_record_remove_delidin place,spage_update, thenlog_append_compensate_with_undo_nxlsa(RVBT_RECORD_MODIFY_COMPENSATE, anchored toreference_lsa).
btree_record_remove_delid branches on has_fixed_size: a fixed-size
object (overflow, or first-with-overflow) keeps its width and nulls the
MVCCID (btree_set_mvccid to MVCCID_NULL); a variable-size object removes
the MVCCID bytes and clears BTREE_OID_HAS_MVCC_DELID (btree_remove_mvccid).
8.10 Logical vacuum — btree_key_remove_insert_mvccid
Section titled “8.10 Logical vacuum — btree_key_remove_insert_mvccid”BTREE_OP_DELETE_VACUUM_INSID is the mirror: vacuum clears the insert
MVCCID once the object is visible to all — no undo at all. Branches:
object not found -> benign, warn and return NO_ERROR; found -> assert the
INSID is not all-visible, then btree_record_remove_insid strips the bytes,
spage_update, and log_append_redo_data with redo-only mode
RVBT_RECORD_MODIFY_NO_UNDO.
Invariant — vacuum never produces undo. Both vacuum workers
(_insert_mvccid here, VACUUM_OBJECT via the physical path) emit only
redo. If vacuum logged undo, a crash during recovery could “un-vacuum” and
re-expose a dead version, breaking visibility.
8.11 Chapter summary — key takeaways
Section titled “8.11 Chapter summary — key takeaways”- One orchestrator, seven purposes.
btree_delete_internalswitch (purpose)selects one of four leaf workers; the descent (btree_fix_root_for_delete→btree_merge_node_and_advance) is shared with insert and reaches the leaf write-latched. - Logical vs. physical are different paths. Logical
(
_remove_delete_mvccidundo,_remove_insert_mvccidvacuum) clears an MVCCID in place; physical (_delete_remove_object) splices OID bytes and may remove the key. - The unique first-object rule drives the leaf branches.
btree_leaf_remove_objectnever raw-splices the first object: it removes the key (btree_delete_key_from_leaf), pulls one up from overflow (btree_replace_first_oid_with_ovfl_oid), or swaps in the last object; overflow removal forks on alone-or-not (dealloc + relink, else splice). - Logging encodes intent. Physical delete carries a key-value undo image
for re-insert; vacuum is redo-only; undo-of-delete writes compensate
records anchored to
reference_lsa. Thecheck_key_deleted/is_key_deletedpair letsMULTI_ROW_UPDATErecount visible objects and reverse an over-eager stat decrement.
Chapter 9: Merge and Under Fill Rebalancing
Section titled “Chapter 9: Merge and Under Fill Rebalancing”This chapter is the mirror image of Chapter 6. A merge is driven down the descent before a delete: the descent walks with btree_merge_node_and_advance, and at every non-leaf hop asks “can the child I am about to enter be folded into a sibling?” If yes, it collapses the two children into the left one, deletes the between-key separator from the parent, and recycles the right page. A degenerate two-key root with height > 2 instead triggers btree_merge_root, which pulls both children up into the root frame and shrinks the tree by one level.
For merge theory (under-fill threshold, separator fold-down, height shrink) see cubrid-btree.md, “Deletion and rebalancing”. This chapter traces only the CUBRID mechanics. The split logging this mirrors is in Chapter 6; merge re-uses the same RVBT_COPYPAGE / RVBT_INS_PGRECORDS redo verbs and the RVBT_MARK_DEALLOC_PAGE undo verb in reverse.
9.1 The merge-on-descend driver: btree_merge_node_and_advance
Section titled “9.1 The merge-on-descend driver: btree_merge_node_and_advance”btree_merge_node_and_advance is the BTREE_ADVANCE_WITH_KEY_FUNCTION for delete, dispatched per-node by btree_search_key_and_apply_functions for btree_delete_internal. Its job is dual: advance one level toward the leaf, and opportunistically merge the child it enters. It fans into three regions, each ending in return: a leaf (position, no merge); a root with two keys and level > 2 (delegate to btree_merge_root); an ordinary non-leaf (pick the child following key, try to merge it with its right neighbor via btree_merge_node). Figure 9-1 is the skeleton; every labelled edge is a real branch.
flowchart TD
A["merge_node_and_advance"] --> B{"node_level == 1 ?"}
B -->|leaf| C{"is_root and latch not WRITE ?"}
C -->|yes| D["promote SHARED_READER"]
D -->|FAIL| E["restart, latch:=WRITE, return"]
D -->|ok| F["is_leaf=true, return"]
C -->|no| F
B -->|non-leaf| G{"is_root and level>2 and keys==2 ?"}
G -->|yes| H["fix L+R, compute need/force_root_merge"]
H --> I{"need_root_merge ?"}
I -->|no| J["search child, advance, fall to tail"]
I -->|yes| LM{"latch == READ ?"}
LM -->|already WRITE| M["sysop, merge_root, dealloc Q+R, commit"]
LM -->|READ| K{"promote all 3 ?"}
K -->|FAIL and force| L["restart, latch:=WRITE, return"]
K -->|FAIL not force| J
K -->|ok| M
M --> N["advance_to_page=root, crt_page=NULL"]
G -->|no| O["search child, fix child_page"]
J --> O
O --> P{"slotid<key_count and free>PAGESIZE/2 ?"}
P -->|no| Z["advance_to_page=child, return"]
P -->|yes| Q["fix right neighbor"]
Q --> R{"right fixed ?"}
R -->|not in buffer| Z
R -->|yes| S["btree_node_mergeable"]
S --> T{"status != NO ?"}
T -->|no| Z
T -->|yes| U{"latch READ ? promote all 3"}
U -->|already WRITE| W
U -->|FAIL and FORCE| V["restart, latch:=WRITE, return"]
U -->|FAIL and TRY| Z
U -->|ok| W["sysop, merge_node, dealloc right, commit"]
Figure 9-1. Control flow of btree_merge_node_and_advance, including the already-WRITE skip-promotion edges.
Leaf branch. At node_level == 1 it positions search_key with btree_search_leaf_page, then guarantees WRITE. If we descended READ and this leaf is also the root (single-page tree), it promotes; ER_PAGE_LATCH_PROMOTE_FAIL sets *restart and bumps nonleaf_latch_mode to WRITE so the descent restarts exclusively. Otherwise the leaf already holds WRITE and returns *is_leaf = true.
// btree_merge_node_and_advance -- src/storage/btree.cif (node_header->node_level == 1) { error_code = btree_search_leaf_page (thread_p, btid_int, *crt_page, key, search_key); if (delete_helper->is_root && delete_helper->nonleaf_latch_mode != PGBUF_LATCH_WRITE) { error_code = pgbuf_promote_read_latch (thread_p, crt_page, PGBUF_PROMOTE_SHARED_READER); if (error_code == ER_PAGE_LATCH_PROMOTE_FAIL) { *restart = true; delete_helper->nonleaf_latch_mode = PGBUF_LATCH_WRITE; return NO_ERROR; } } *is_leaf = true; return NO_ERROR; }Root-collapse precheck: is_root && node_level > 2 && btree_node_number_of_keys(...) == 2. The node_level > 2 guard means the tree never collapses to a single page. Both children are fixed, used bytes computed as DB_PAGESIZE - spage_get_free_space(...), and two thresholds derived with the root’s max_key_len as a defensive allowance for the folded-in separator:
// btree_merge_node_and_advance -- src/storage/btree.cneed_root_merge = (left_used + right_used + CAN_MERGE_WHEN_EMPTY + root_max_key_length) < DB_PAGESIZE;force_root_merge = (left_used + right_used + FORCE_MERGE_WHEN_EMPTY + root_max_key_length) < DB_PAGESIZE;CAN_MERGE_WHEN_EMPTY is MAX(DB_PAGESIZE*0.33, ...), FORCE_MERGE_WHEN_EMPTY MAX(*0.66, ...): need = merged page at most ~67% full, force at most ~34%. Promotion is attempted only when nonleaf_latch_mode == PGBUF_LATCH_READ; if the descent is already exclusive (restart path) promotion is skipped and the merge runs directly.
Branch matrix for the root-merge outcome (entered only after need_root_merge):
| Latch / promotion result | force_root_merge | Action |
|---|---|---|
nonleaf_latch_mode already WRITE | — | skip promotion, fall into merge (else: “Pages already latched exclusively”) |
| promote from READ -> all 3 latched | — | log_sysop_start, btree_merge_root, dealloc both children, log_sysop_commit, *advance_to_page = *crt_page, *crt_page = NULL |
promote -> PROMOTE_FAIL | true | *restart = true, latch:=WRITE, unfix children, return |
promote -> PROMOTE_FAIL | false | fall through, skip merge, advance to one child |
| other error | — | goto error (abort sysop if started) |
Invariant — root merge re-enters the root. A successful
btree_merge_rootsets*advance_to_page = *crt_pageand*crt_page = NULLrather than descending, re-running the callback on the same root so a second collapse fires if the shorter root again has two mergeable children. Drop it and a tree that should shrink two levels shrinks one, leaving a transiently under-filled root.
Ordinary non-leaf tail. btree_search_nonleaf_page chooses search_key->slotid and the child VPID. child_latch is WRITE with PGBUF_PROMOTE_ONLY_READER if the child is a leaf (node_level == 2), else inherits nonleaf_latch_mode with PGBUF_PROMOTE_SHARED_READER — the single-reader promote at level 2 mirrors btree_split_node_and_advance (Chapter 6). The right-neighbor precheck is the gate the reader question turns on:
// btree_merge_node_and_advance -- src/storage/btree.cif (search_key->slotid < key_count && spage_get_free_space (thread_p, child_page) > DB_PAGESIZE / 2)Both conditions required: slotid < key_count means a right neighbor exists; more than half free means plausibly under-filled. Under SERVER_MODE IO stress, neighbor_fetch_mode downgrades to OLD_PAGE_IF_IN_BUFFER so vacuum does not stall on cold siblings — a missing neighbor leaves right_page NULL, er_clear masks the benign miss, and the code falls through to a plain advance. Here too, an already-WRITE descent skips the promotion block and merges directly; promotion is attempted only on the READ path. There is no left-neighbor branch: only the right sibling is tried, relying on the descent visiting every node so each pair is checked from its left member.
9.2 The mergeability classifier: btree_node_mergeable
Section titled “9.2 The mergeability classifier: btree_node_mergeable”btree_node_mergeable returns three-valued BTREE_MERGE_STATUS: BTREE_MERGE_NO (skip), BTREE_MERGE_TRY (merge if latches promote cheaply), BTREE_MERGE_FORCE (mandatory; restart exclusively if promotion fails). Figure 9-2 enumerates every return.
flowchart TD
A["node_mergeable(L,R)"] --> B{"L_cnt == 0 ?"}
B -->|yes| F1["FORCE: left empty"]
B -->|no| C{"L non-fence == 0 ?"}
C -->|yes| D["R_used = uncompressed(R)"]
D --> D1{"R + FORCE_EMPTY < PAGESIZE"}
D1 -->|yes| F2["FORCE"]
D1 -->|no| D2{"R + CAN_EMPTY < PAGESIZE"}
D2 -->|yes| T1["TRY"]
D2 -->|no| N1["NO"]
C -->|no| E{"R_cnt == 0 ?"}
E -->|yes| F3["FORCE: right empty"]
E -->|no| G{"R non-fence == 0 ?"}
G -->|yes| H["symmetric on L_used"]
G -->|no| I{"L==1 and R==1 key ?"}
I -->|yes| F4["FORCE: both single-key"]
I -->|no| J["used bytes"]
J --> K{"L+R+CAN_EMPTY < PAGESIZE ?"}
K -->|no| N2["NO"]
K -->|yes| Lr["recompute uncompressed for LEAF"]
Lr --> Mm{"uncompressed still fits ?"}
Mm -->|no| N3["NO"]
Mm -->|yes| O{"L+R+FORCE_EMPTY < PAGESIZE ?"}
O -->|yes| F5["FORCE"]
O -->|no| T2["TRY"]
Figure 9-2. Return classification of btree_node_mergeable.
It counts keys with btree_node_number_of_keys, then a non-fence count by subtracting lower fence (slot 1) and upper fence (slot L_cnt) when btree_is_fence_key reports them. Three cases:
- One side empty / fence-only.
L_cnt == 0->FORCE. If L has only fences (L_non_fence_cnt == 0), R’s uncompressed size decides (btree_node_size_uncompressed, fences re-derived after merge): fits withFORCE_MERGE_WHEN_EMPTYslack ->FORCE; only withCAN_MERGE_WHEN_EMPTY->TRY; elseNO. R symmetric. - Both single-key.
L_non_fence_cnt == 1 && R_non_fence_cnt == 1->FORCE. - Size test. With used bytes, if
L_used + R_used + CAN_MERGE_WHEN_EMPTY < DB_PAGESIZEit is a candidate. Leaf nodes re-check with uncompressed sizes (merge discards the shared fence prefix, so post-merge bytes can exceed the naive sum):
// btree_node_mergeable -- src/storage/btree.cif (l_node_type == BTREE_LEAF_NODE) { L_used = btree_node_size_uncompressed (thread_p, btid, L_page); if (L_used < 0) { return BTREE_MERGE_NO; } if (L_used + R_used + CAN_MERGE_WHEN_EMPTY >= DB_PAGESIZE) { return BTREE_MERGE_NO; } R_used = btree_node_size_uncompressed (thread_p, btid, R_page); if (L_used + R_used + CAN_MERGE_WHEN_EMPTY >= DB_PAGESIZE) { return BTREE_MERGE_NO; } }if (L_used + R_used + FORCE_MERGE_WHEN_EMPTY < DB_PAGESIZE) { return BTREE_MERGE_FORCE; }return BTREE_MERGE_TRY;The TRY/FORCE asymmetry is the 0.33/0.66 split. The size-fails case reaches a final return BTREE_MERGE_NO.
9.3 The general merge: btree_merge_node
Section titled “9.3 The general merge: btree_merge_node”btree_merge_node(P, left_pg, right_pg, p_slot_id, child_vpid, status) folds right_pg into left_pg, deletes the separator at parent slot p_slot_id (the slot pointing at the right page), and recycles the right page, returning the surviving VPID in *child_vpid (always left_vpid). The 12 STEP comments map to the recovery story; Figure 9-3 is the flow.
flowchart TD A["btree_merge_node"] --> B["STEP 0-1: classify fences\nkeep left lower + right upper"] B --> C["STEP 2-3: gather rec[] in merge_buf\nrecompress if merged_prefix changed"] C --> D["STEP 4: undo-log left RVBT_COPYPAGE"] D --> E["STEP 5-7: remove upper fence,\nspage_insert rec[], append fence"] E --> F["STEP 8: RVBT_NDRECORD_DEL +\nspage_delete parent slot, child_vpid=left"] F --> G["STEP 9-10: left next_vpid past right,\nmax_key_len, redo-log left"] G --> H["STEP 11: right node_level=-1\nunder RVBT_MARK_DEALLOC_PAGE"] H --> I["STEP 12: btree_get_next_page +\nset_vpid_previous_vpid -> left"] E -->|spage failure| X["assert_release, goto exit_on_error"] F -->|delete != slot| X
Figure 9-3. STEP flow of btree_merge_node with its error exits.
STEP 0–3. Leaf pages carry a lower fence (slot 1) and upper fence (slot L_cnt) when flagged BTREE_LEAF_RECORD_FENCE. The merged page keeps the left lower fence and right upper fence; the left’s upper fence and right’s lower fence (the vanished boundary) are discarded. The right upper fence is copied out early (merged_upper_fence_record) since the right page will be recycled. Records are gathered into a scratch rec[] backed by merge_buf (grown via db_private_alloc if left_used + right_used + MAX_MERGE_ALIGN_WASTE overflows the stack buffer), re-compressing each with btree_recompress_record when the merged prefix differs from the source.
Invariant — prefix monotonicity. The merged prefix can never be longer than either source prefix. For MIDXKEY indexes with both fences,
merged_prefix = pr_midxkey_common_prefix(left_fence_key, right_fence_key); otherwise it stays0(no-recompress fast path). The code only ever re-compresses (expands) records; compressing against a prefix longer thanmerged_prefixwould corrupt the key bytes.
STEP 4 — undo image of left. Before mutating left, its full image is logged for undo:
// btree_merge_node -- src/storage/btree.c (STEP 4)log_append_undo_data2 (thread_p, RVBT_COPYPAGE, &btid->sys_btid->vfid, left_pg, -1, DB_PAGESIZE, left_pg);This physical undo half pairs with the STEP 10 redo so a rolled-back delete reconstructs the pre-merge left page byte-for-byte (9.5).
STEP 5–7 — rewrite left. Remove the left upper fence (if present); if the prefix changed, delete and re-insert all left non-fence records, else keep them; spage_insert the gathered rec[]; append the merged upper fence. Each spage_* failure is assert_release(false) + goto exit_on_error.
STEP 8 — fold the separator out of the parent. The structural heart. The parent record at p_slot_id is read; an overflow key (nleaf_pnt.key_len < 0) has its pages freed via btree_delete_overflow_key; then the separator is logged and deleted:
// btree_merge_node -- src/storage/btree.c (STEP 8)btree_rv_write_log_record (recset_data, &recset_length, &peek_rec, BTREE_NON_LEAF_NODE);log_append_undoredo_data2 (thread_p, RVBT_NDRECORD_DEL, &btid->sys_btid->vfid, P, p_slot_id, recset_length, sizeof (p_slot_id), recset_data, &p_slot_id);if (spage_delete (thread_p, P, p_slot_id) != p_slot_id) { /* ... goto exit_on_error ... */ }*child_vpid = *left_vpid; /* <- surviving child is always the left page */Deleting the right page’s slot leaves the left page’s slot still pointing at left_pg, now spanning both old ranges. RVBT_NDRECORD_DEL carries the old record (undo: re-insert) and the slot id (redo: delete).
STEP 9–11 — left header, redo, mark right dead. Left next_vpid is repointed past the right page (= right_header->next_vpid), max_key_len maxed, split-info reset, redo image logged via log_append_redo_data2(RVBT_COPYPAGE, left_pg, ...) before pgbuf_set_dirty. The right page is not freed here; its node_level is set to -1 under RVBT_MARK_DEALLOC_PAGE so a concurrent re-fix detects a dead page (BTREE_IS_PAGE_VALID_LEAF); file_dealloc is in the caller.
STEP 12 — relink the following node’s prev_vpid. The node after the right page still points its prev_vpid at the recycled right page. btree_get_next_page fetches it (NULL at end of chain) and btree_set_vpid_previous_vpid rewrites the back-link to left. This is the only branch touching a fourth page, so a merge can momentarily hold P, left, right, and next. The exit_on_error tail frees merge_buf_ptr if heap-allocated; the sysop abort (9.5) then undoes everything.
9.4 The root collapse: btree_merge_root
Section titled “9.4 The root collapse: btree_merge_root”btree_merge_root(P, Q, R) is simpler than btree_merge_node because it does not move P’s separator into a sibling — it pours both children’s records into P itself. Q and R are untouched beyond their dealloc mark, and the root’s page identity never changes, so the b-tree header pointer stays valid.
Q’s upper fence (if btree_leaf_is_flaged(..., BTREE_LEAF_RECORD_FENCE)) is excluded by decrementing Q_end; R’s lower fence by advancing R_start (root level > 2 implies non-leaf children, so fences are usually absent, but the code defends anyway). The two root separators are removed with btree_delete_meta_record — slot 2 first, then slot 1, so numbering does not shift. Q’s records are saved with btree_rv_util_save_page_records and spage_insert_at into P at slots 1..Q_end, logged once via RVBT_INS_PGRECORDS; the per-iteration error path logs only the records that did land:
// btree_merge_root -- src/storage/btree.cfor (i = 1; i <= Q_end; i++) { if (spage_get_record (...) != S_SUCCESS || spage_insert_at (...) != SP_SUCCESS) { if (i > 1) { recset_header.rec_cnt = i - 1; /* undo-log the i-1 records already inserted */ } goto exit_on_error; } }R’s records follow at slots left_cnt + 1 .., logged the same way. After both blocks land, the root header is undo-logged, prev_vpid/next_vpid nulled, split-info defaulted, and node_level decremented:
// btree_merge_root -- src/storage/btree.cbtree_node_header_undo_log (thread_p, &btid->sys_btid->vfid, P);VPID_SET_NULL (&root_header->node.prev_vpid);VPID_SET_NULL (&root_header->node.next_vpid);btree_write_default_split_info (&(root_header->node.split_info));root_header->node.node_level--;assert_release (root_header->node.node_level > 1);btree_node_header_redo_log (thread_p, &btid->sys_btid->vfid, P);Invariant — the root is the only level that shrinks the tree.
node_level--on the root frame is the single place height decreases.assert_release(node_level > 1)keeps the post-collapse root non-leaf — with thenode_level > 2precheck, the tree never collapses a level-3 root to a leaf in one step, nor to a single page. The#ifndef NDEBUGblock also asserts the root’smax_key_lenequalsMAX(q,r max_key_len). Decrementingnode_levelwithout first copying the children’s records would make descent index a separator-less level and read garbage slots.
Q and R are each marked dead (node_level = -1, undo-logged via RVBT_MARK_DEALLOC_PAGE) but not deallocated — the caller does file_dealloc after btree_merge_root returns.
9.5 Back-link repair and the system-op / logical-undo bracket
Section titled “9.5 Back-link repair and the system-op / logical-undo bracket”btree_set_vpid_previous_vpid keeps the doubly-linked leaf chain consistent after a merge removed the right page — the exact dual of the next-link rewrite a split performs.
// btree_set_vpid_previous_vpid -- src/storage/btree.cheader = btree_get_node_header (thread_p, page_p);if (header == NULL) { return ER_FAILED; }btree_node_header_undo_log (thread_p, &btid->sys_btid->vfid, page_p);header->prev_vpid = *prev; /* <- repoint back-link to surviving left page */btree_node_header_redo_log (thread_p, &btid->sys_btid->vfid, page_p);pgbuf_set_dirty (thread_p, page_p, DONT_FREE);An early if (page_p == NULL) return NO_ERROR; makes it a no-op when the merged left page was last in the chain. The same primitive serves split and merge.
The bracket. Each merge runs inside log_sysop_start / log_sysop_commit, enclosing all physical changes — RVBT_COPYPAGE on left, RVBT_INS_PGRECORDS on root, RVBT_NDRECORD_DEL on the parent, RVBT_MARK_DEALLOC_PAGE on the recycled page, and file_dealloc:
// btree_merge_node_and_advance -- src/storage/btree.clog_sysop_start (thread_p);is_system_op_started = true;/* ... btree_merge_node / btree_merge_root, then file_dealloc of the recycled page(s) ... */log_sysop_commit (thread_p);is_system_op_started = false;Invariant — a structural merge is atomic and self-undoing within the delete. On any
goto errorbetween start and commit, theerror:label runslog_sysop_abort, replaying the undo records in reverse: the recycled page un-deallocated, the parent separator re-inserted, the left page restored from itsRVBT_COPYPAGEimage — re-splitting to the pre-merge shape so the delete retries. The sysop commits before the leaf-level object removal, so the merge is a structural (nested top) action: even if the enclosing logical delete rolls back, the committed merge is never logically undone, leaving the merged shape. Leavefile_deallocoutside the sysop and a crash between commit and dealloc leaks the right page; commit the dealloc without the record moves and the parent points at a freed page.
Cross-reference Chapter 6: split logs RVBT_COPYPAGE redo on the new right page and RVBT_NDRECORD_INS on the parent; merge logs the symmetric RVBT_COPYPAGE undo+redo on the surviving left page and RVBT_NDRECORD_DEL on the parent — the undo of one is structurally the other.
9.6 Chapter summary — key takeaways
Section titled “9.6 Chapter summary — key takeaways”- Merge is driven on descent, right-sibling only.
btree_merge_node_and_advancechecks each child against its right neighbor as it walks down; there is no left-merge branch, and the precheck requires both a right neighbor (slotid < key_count) and an under-filled child (free > PAGESIZE/2). - Mergeability is three-valued.
btree_node_mergeablereturnsNO/TRY/FORCEkeyed onCAN_MERGE_WHEN_EMPTY(~33%) andFORCE_MERGE_WHEN_EMPTY(~66%); empty/fence-only and double-single-key pairs are alwaysFORCE, and leaf pairs are re-checked against uncompressed size.TRYis skipped on a failed promotion whileFORCErestarts the descent exclusively. - Promotion is skipped when the descent is already exclusive. Both merge blocks promote latches only when
nonleaf_latch_mode == PGBUF_LATCH_READ; on the restart (already-WRITE) path the merge runs directly. - General merge folds the separator out of the parent.
btree_merge_nodecopies right into left, deletes the parent slot pointing at the right page (RVBT_NDRECORD_DEL), repointsleft_header->next_vpid, marks the right page dead (node_level = -1), and relinks the following node’sprev_vpidviabtree_set_vpid_previous_vpid. - Root collapse is the only height shrink.
btree_merge_rootpours both children’s records into the unchanged root and decrementsnode_level; thenode_level > 2precheck andassert_release(node_level > 1)keep the tree off a single page and the root non-leaf. - The recycled page is marked, not freed, inside the merge.
node_level = -1(undo-logged withRVBT_MARK_DEALLOC_PAGE) lets concurrent re-fixers detect a dead page;file_deallochappens in the driver, still inside the system op. - The merge is self-undoing.
log_sysop_start/commitbrackets each merge so a mid-merge error aborts back to the exact pre-merge layout (re-split), while a committed merge survives even if the enclosing delete rolls back — mirroring Chapter 6’s split logging.
Chapter 10: Special and Edge Paths
Section titled “Chapter 10: Special and Edge Paths”What builds or maintains the tree outside the single-key insert/delete
lifecycle of Chapters 3-9: the two index-load paths, the leaf-record auxiliaries
(fence keys, FK probes), unique-statistics aggregation, and deduplicate-key mode.
For why a bulk build beats N inserts and what a fence key is for, see
cubrid-btree.md (“Bulk Loading”, “Prefix Compression and Fence Keys”).
10.1 Offline bulk load: xbtree_load_index
Section titled “10.1 Offline bulk load: xbtree_load_index”xbtree_load_index is the CREATE INDEX path used when no concurrent writer
sees the table. It builds the tree invisibly in one top system operation, then
commits atomically. Because nobody reaches the file until commit, latching is
bypassed — pages are fixed PGBUF_LATCH_WRITE only for the buffer manager,
never to coordinate transactions. Figure 10-1 traces every branch; the load is
all-or-nothing (each goto error aborts the sysop, returns NULL). The zero-key
branch recreates an empty index through the regular xbtree_add_index.
flowchart TD
A["xbtree_load_index"] --> B["init SORT_ARGS + LOAD_ARGS\nset deduplicate_key_idx"]
B --> C["btree_index_sort -> btree_construct_leafs callback"]
C --> D{"any keys loaded?\nleaf.pgptr != NULL"}
D -->|yes| E["btree_save_last_leafrec"]
E --> F{"has_fk?"}
F -->|yes| G["btree_load_check_fk"]
F -->|no| H["btree_build_nleafs"]
G --> H
D -->|no| I["log_sysop_abort\nxbtree_add_index empty"]
H --> J{"is_sysop_started?"}
I --> J
J -->|yes| K["attach to outer\nif unique: undo RVBT_REMOVE_UNIQUE_STATS"]
J -->|no| L["nothing to log"]
K --> M["return btid"]
L --> M
Figure 10-1. xbtree_load_index control flow with empty-index and FK branches.
10.2 Packing leaves left-to-right: btree_construct_leafs and btree_first_oid
Section titled “10.2 Packing leaves left-to-right: btree_construct_leafs and btree_first_oid”btree_construct_leafs is the sort callback; it folds each sorted (key, oid)
into the current leaf. The comments label all five branches:
// btree_construct_leafs -- src/storage/btree_load.c next = *(char **) recdes->data; /* <- save forward link before mutating recdes */ if (VPID_ISNULL (&(load_args->leaf.vpid))) /* <- A: very first record */ bt_load_get_first_leaf_page_and_init_args (...); else { int c = btree_compare_key (&sparam.this_key, &load_args->current_key, ...); if (c == DB_GT) bt_load_make_new_record_on_leaf_page (...); /* <- B: new key */ else if (c == DB_EQ) bt_load_add_same_key_to_record (...); /* <- C: append OID */ else { assert_release (false); goto error; } /* <- D: out-of-order = corruption */ } if (!next) break; /* <- E: chain exhausted */INVARIANT (sorted-input monotonicity).
btree_compare_key(this_key, current_key)is neverDB_LT; the sort guarantees non-decreasing keys, so theelsearm isassert_release(false). A buggy comparator would otherwise produce an unsorted leaf chain that scans (Ch 7) and searches (Ch 3) mis-navigate; the assert turns latent corruption into immediate failure.
bt_load_make_new_record_on_leaf_page applies the fill-factor: when
(cur_maxspace - leaf_nleaf_recdes.length) < LOAD_FIXED_EMPTY_FOR_LEAF and
spage_number_of_records > 1 it calls btree_proceed_leaf to start a fresh leaf
(chaining prev/next) before spage_insert, then btree_first_oid.
LOAD_FIXED_EMPTY_FOR_LEAF reserves slack for later in-place inserts; the > 1
guard prevents an infinite loop when one oversized key (already overflow-backed)
exceeds the threshold. btree_first_oid branches on key size (key_len < BTREE_MAX_KEYLEN_INPAGE → BTREE_NORMAL_KEY, else BTREE_OVERFLOW_KEY, creating
the overflow-key file via btree_create_overflow_key_file if ovfid is null) and
on the first object’s delete flag: MVCC_IS_HEADER_DELID_VALID → curr_non_del_obj_count = 0 (only object already MVCC-deleted, key not live), else = 1 and ++n_keys.
A key whose only object is already deleted does not count toward the distinct-key
statistic — the bulk-load analogue of the runtime bookkeeping in §10.9.
10.3 Bottom-up internal levels and btree_load_new_page
Section titled “10.3 Bottom-up internal levels and btree_load_new_page”btree_build_nleafs builds internal levels one at a time with two work lists
(push_list/pop_list) that ping-pong (Figure 10-2). Every page goes through
btree_load_new_page, whose node_level argument discriminates:
// btree_load_new_page -- src/storage/btree_load.c assert (log_check_system_op_is_started (thread_p)); /* <- caller must hold a sysop */ log_sysop_start (thread_p); /* <- nested sysop just for this alloc */ file_alloc (thread_p, &btid->vfid, btree_initialize_new_page, NULL, vpid_new, page_new); if (header) /* <- leaf/non-leaf (node_level >= 1) */ { header->node_level = node_level; header->common_prefix = 0; btree_init_node_header (...); } else /* <- overflow page (node_level == -1) */ { VPID_SET_NULL (&ovf_header_info.next_vpid); btree_init_overflow_header (...); }end: /* <- error: log_sysop_abort; success: log_sysop_commit (committed independently) */ if (error_code != NO_ERROR) log_sysop_abort (thread_p); else log_sysop_commit (thread_p);INVARIANT (per-page allocation durability). Each allocation is wrapped in its own nested system operation that commits independently. Source comment: “we need to commit page allocations. if loading index is aborted, the entire file is destroyed.” Because the whole file is dropped on outer abort, the error path is sloppy about freeing individual pages — it relies on
file_destroyat the top level. Theassertenforces that no caller allocates a load page outside the protective top operation.
btree_proceed_leaf (§10.2) calls this with node_level == 1 per new leaf and
threads prev_vpid/next_vpid so Chapter 7’s range scan chain is built right.
flowchart TD
A["btree_build_nleafs"] --> B["pop children level k"]
B --> C["btree_connect_page -> push parent level k+1"]
C --> D{"pop_list length > 0?"}
D -->|yes| B
D -->|no| E["swap push_list / pop_list\nlevel := k+1"]
E --> F{"more than one page\nat this level?"}
F -->|yes| B
F -->|no| G["write root header\nSET_DECOMPRESS_IDX_HEADER"]
Figure 10-2. Bottom-up internal-level construction in btree_build_nleafs.
10.4 Online (concurrent) index build: xbtree_load_online_index
Section titled “10.4 Online (concurrent) index build: xbtree_load_online_index”The online path lets DML run while the index is built, differing from §10.1 three
ways: an MVCC builder_snapshot gives a consistent table version; the schema lock
is demoted per class from SCH_M_LOCK to IX_LOCK (lock_demote_class_lock)
so writers are not blocked, then re-promoted at the end; and batched inserts go
to a worker pool via online_index_builder. It sets tdes->has_deadlock_priority = true so the builder wins deadlocks against DML. The re-promotion loop is the
most safety-critical edge path: the lock must return to SCH_M_LOCK before
commit/rollback, else the catalog mutation races concurrent readers.
// xbtree_load_online_index -- src/storage/btree_load.c xlogtb_reset_wait_msecs (thread_p, LK_INFINITE_WAIT); /* <- never give up */ while (true) { lock_ret = lock_object (..., SCH_M_LOCK, LK_UNCOND_LOCK); if (lock_ret == LK_GRANTED) break; else if (lock_ret == LK_NOTGRANTED_DUE_ERROR && er_errid () == ER_INTERRUPTED) { er_clear (); logtb_set_tran_index_interrupt (...); continue; } /* <- swallow interrupt, retry */ else if (lock_ret == LK_NOTGRANTED_DUE_TIMEOUT && css_is_shutdowning_server ()) { er_clear (); continue; } /* <- shutdown: still must promote */ assert (0); /* <- any other outcome unacceptable */ }Only after the lock is back does it honor the builder’s return code (if (ret != NO_ERROR) goto error;). For unique indexes it re-validates uniqueness across all
classes via btree_online_index_check_unique_constraint — a concurrent inserter
might have created a collision the snapshot builder could not see. The snapshot is
invalidated on both exits.
10.5 The insert-list dispatcher: online_index_builder
Section titled “10.5 The insert-list dispatcher: online_index_builder”online_index_builder is the producer. Its heap_next loop branches on the scan
result: S_END breaks (heap exhausted), S_ERROR sets the error and breaks, any
other non-S_SUCCESS trips assert (false). With a filter_pred (partial index)
a V_ERROR aborts and a non-V_TRUE continues to skip the row. Otherwise it
generates the key (heap_attrinfo_generate_key; NULL ⇒ ER_FAILED), lazily
allocates an index_builder_loader_task, and calls add_key (§10.6); a
BATCH_FULL return pushes the task to ib_workpool and bumps tasks_started. A
set load_context.m_has_error aborts with ER_IB_ERROR_ABORT. After the loop it
pushes the final partial task, spins until m_tasks_executed == tasks_started,
and for a unique index calls logtb_tran_update_btid_unique_stats with
(num_keys, num_oids, num_nulls) (§10.9).
Each task’s execute sorts its batch (prepare_list) then loops calling
btree_online_index_list_dispatcher once per key with the purpose
BTREE_OP_ONLINE_INDEX_IB_INSERT; on error it sets m_load_context.m_has_error
(atomic exchange) and breaks. The BTREE_OP_ONLINE_INDEX_* enum (in btree.h) is
the routing key the regular insert/delete machinery checks. Its seven values split
into the builder’s _IB_INSERT/_IB_DELETE; the concurrent-DML _TRAN_INSERT,
_TRAN_INSERT_DF (object pre-flagged DELETE so vacuum reclaims it without a second
descent — used when an UPDATE inserts a row it is about to delete), _TRAN_DELETE;
and the rollback _UNDO_TRAN_INSERT/_UNDO_TRAN_DELETE.
10.6 add_key: NULL handling and the batch boundary
Section titled “10.6 add_key: NULL handling and the batch boundary”index_builder_loader_task::add_key filters NULLs from the physical tree but
still counts them for stats:
// index_builder_loader_task::add_key -- src/storage/btree_load.c if (DB_IS_NULL (key) || btree_multicol_key_is_null (const_cast<DB_VALUE *>(key))) { if (BTREE_IS_UNIQUE (m_unique_pk)) ++m_insert_list.m_ignored_nulls_cnt; /* <- count, don't store */ return BATCH_CONTINUE; } m_memsize += m_insert_list.add_key (key, oid); return (m_memsize > prm_get_bigint_value (PRM_ID_IB_TASK_MEMSIZE)) ? BATCH_FULL : BATCH_CONTINUE;INVARIANT (NULLs are tracked, never stored). A NULL (or all-NULL multi-column) key is never placed in the tree, but for a unique index it increments
m_ignored_nulls_cnt, whichexecutefolds into bothm_num_oidsandm_num_nulls. This preservesrows == keys + nulls(§10.9): NULLs add a row and a null but no key, as the SQL standard requires (multiple NULLs do not violate uniqueness). Storing NULLs, or failing to count them, would make the count query and the uniqueness check diverge.
10.7 Fence keys: BTREE_LEAF_RECORD_FENCE
Section titled “10.7 Fence keys: BTREE_LEAF_RECORD_FENCE”A fence key is a sentinel leaf record (flagged BTREE_LEAF_RECORD_FENCE)
recording a leaf’s boundary without a real object. CUBRID uses fences as the
prefix-compression anchor: every key’s common prefix is computed against the
leaf’s fence, so records store only their suffix. Readers must skip them. The
decompressor in btree_read_record rebuilds the full key only for a non-fence
record on a page with positive prefix:
// btree_read_record (decompress branch) -- src/storage/btree.c if (node_type == BTREE_LEAF_NODE && !btree_leaf_is_flaged (rec, BTREE_LEAF_RECORD_OVERFLOW_KEY) && !btree_leaf_is_flaged (rec, BTREE_LEAF_RECORD_FENCE)) { /* <- never decompress a fence */ int n_prefix = btree_node_get_common_prefix (thread_p, btid, pgptr); if (n_prefix > 0) { spage_get_record (thread_p, pgptr, 1, &peek_rec, PEEK); /* <- slot 1 = fence */ assert (btree_leaf_is_flaged (&peek_rec, BTREE_LEAF_RECORD_FENCE)); pr_midxkey_add_prefix (&result, &lf_key, key, n_prefix); /* <- fence prefix + record suffix */ } }INVARIANT (slot 1 is the prefix anchor). When a leaf has a positive
common_prefix, slot 1 holds the fence key and the decompressorasserts it. A non-fence record at slot 1 would makepr_midxkey_add_prefixgraft the wrong prefix and produce corrupt keys.
Scan skipping is the mirror image. In btree_search_leaf_page’s binary search, an
exact DB_EQ match against a fence is never a hit:
// btree_search_leaf_page -- src/storage/btree.c if (c == DB_EQ) { if (btree_leaf_is_flaged (&rec, BTREE_LEAF_RECORD_FENCE)) { assert (middle == 1 || middle == key_cnt); if (middle == 1) c = DB_GT; /* <- left fence: fall through to next key */ else if (middle == key_cnt) { search_key->result = BTREE_KEY_BIGGER; search_key->slotid = key_cnt; return NO_ERROR; } } else { search_key->result = BTREE_KEY_FOUND; search_key->slotid = middle; return NO_ERROR; } }This differs from the first-key fast check btree_leaf_is_key_between_min_max,
which on a fence at slot 1 sets result = BTREE_KEY_BETWEEN, slotid = -1 to
force a full search. Either way a range scan (Ch 7) never returns a fence object.
10.8 Foreign-key probes: btree_find_foreign_key and btree_check_foreign_key
Section titled “10.8 Foreign-key probes: btree_find_foreign_key and btree_check_foreign_key”Both FK helpers reuse the range scan; they differ in direction.
btree_check_foreign_key runs on the referencing side — verifying a child
key exists in the parent’s PK index. The annotations label every branch:
// btree_check_foreign_key -- src/storage/btree.c if (has_null == true) return NO_ERROR; /* <- 1: NULL FK column trivially satisfied */ classrepr = heap_classrepr_get (...); if (classrepr == NULL) goto exit_on_error; /* <- 2: no repr -> cleanup + normalize errid */ if (classrepr->has_partition_info > 0) { /* <- 3: partitioned parent -> prune to local btid */ partition_init_pruning_context (&pcontext); partition_load_pruning_context (...); } BTID_COPY (&local_btid, pk_btid); if (classrepr->has_partition_info > 0 && pcontext.partitions != NULL) partition_prune_unique_btid (&pcontext, keyval, &part_oid, &class_hfid, &local_btid); ret_search = xbtree_find_unique (thread_p, &local_btid, S_SELECT_WITH_LOCK, keyval, ...); if (ret_search == BTREE_KEY_NOTFOUND) /* <- 4a: no parent row -> FK violation */ { er_set (...ER_FK_INVALID...); ret = ER_FK_INVALID; goto exit_on_error; } else if (ret_search == BTREE_ERROR_OCCURRED) { ASSERT_ERROR_AND_SET (ret); goto exit_on_error; } /* <- 4b */ assert (ret_search == BTREE_KEY_FOUND); /* <- 4c: parent exists -> ok */ // ... success and exit_on_error share cleanup: clear pcontext + heap_classrepr_free_and_init ...Success and exit_on_error share the cleanup; only exit_on_error forces a
non-NO_ERROR return via (ret == NO_ERROR && er_errid() == NO_ERROR) ? ER_FAILED : ret.
btree_find_foreign_key runs on the referenced side — before deleting/updating
a parent PK it must find any child row still referencing the key, locking it so
the check is stable. It calls btree_remake_foreign_key_with_PK (yielding
is_newly); when !is_newly it pr_share_values the key into both range bounds,
sets range = GE_LE, find_fk_object.lock_mode = S_LOCK, threads
btree_scan.is_fk_remake = is_newly, and drives btree_range_scan with the
btree_range_scan_find_fk_any_object visitor. On error or no match
(OID_ISNULL (&find_fk_object.found_oid)) it releases any lock taken
(lock_unlock_object_donot_move_to_non2pl if locked_object set). The is_newly
flag plumbs deduplicate-key mode: btree_remake_foreign_key_with_PK may rebuild
the probe key so range bounds match the on-disk dedup-augmented key; when it does
the original key is not shared and the rebuilt copy is cleared on exit (§10.10).
10.9 Unique statistics: btree_unique_stats and multi_index_unique_stats
Section titled “10.9 Unique statistics: btree_unique_stats and multi_index_unique_stats”A unique index keeps three counters so COUNT(*), COUNT(DISTINCT key), and the
uniqueness check answer without scanning. btree_unique_stats is the per-index
triple (stat_type m_rows, m_keys, m_nulls; in btree_unique.hpp).
| Field | Role | Why it exists |
|---|---|---|
m_rows | total objects (OIDs), including NULL-keyed ones | answers COUNT(*); left side of the uniqueness identity |
m_keys | number of non-NULL objects (one per row with a real key) | the “distinct enough” count; participates in is_unique |
m_nulls | number of objects whose indexed key is NULL | NULLs are exempt from uniqueness, so tallied separately |
The mutators move two counters together; no setter touches one alone:
// btree_unique_stats mutators -- src/storage/btree_unique.cpp void insert_key_and_row () { ++m_keys; ++m_rows; } void insert_null_and_row () { ++m_nulls; ++m_rows; } void add_row () { ++m_rows; } /* <- dup OID of existing key */ void delete_key_and_row () { --m_keys; --m_rows; } void delete_null_and_row () { --m_nulls; --m_rows; } void delete_row () { --m_rows; } bool is_unique () const { return m_rows == m_keys + m_nulls; }INVARIANT (
m_rows == m_keys + m_nulls). Enforced structurally: every mutator changingm_keysorm_nullschangesm_rowsby the same amount, andadd_row/delete_row(a dup OID of an existing key) move onlym_rows. A unique index is consistent exactly whenis_unique()is true; `m_rows > m_keys
- m_nulls
means some key has two live OIDs — a violation. The+=/-=` operators move all three counters componentwise, so aggregating partial stats from worker threads (§10.5) cannot break the identity.
multi_index_unique_stats aggregates several indexes touched by one statement
(multi-column, partitioned, or the online builder’s per-class loop). Its
btid_comparator orders by (root_pageid, vfid.volid).
| Field | Role | Why it exists |
|---|---|---|
m_stats_map | std::map<BTID, btree_unique_stats, btid_comparator> | one stats triple per index touched; the map shape means an index appears at most once |
Mutators delegate to operator+=: add_index_stats(index, us) does
m_stats_map[index] += us (asserting !BTID_IS_NULL) and operator+=(other)
folds every entry the same way, inheriting the per-index identity.
construct/destruct use placement new/delete because the object can live
inside a transaction descriptor whose memory is managed elsewhere.
Feeding GLOBAL_UNIQUE_STATS_TABLE. Per-transaction stats are local until
commit, then flushed into the server-global GLOBAL_UNIQUE_STATS_TABLE (an
LF_HASH_TABLE keyed by BTID). Two recovery hooks replay the logged flush.
btree_rv_redo_global_unique_stats_commit unpacks (btid, num_nulls, num_oids, num_keys) and calls logtb_rv_update_global_unique_stats_by_abs — replaying the
absolute totals (idempotent). btree_rv_undo_global_unique_stats_commit
unpacks the same fields but calls logtb_update_global_unique_stats_by_delta with
negated counts, reversing the aborted transaction’s contribution; during
LOG_RECOVERY_UNDO_PHASE it first checks disk_is_page_sector_reserved, and if
the index file is gone returns NO_ERROR — honoring the rule that broken
references are tolerated. This is the RVBT_REMOVE_UNIQUE_STATS lifecycle §10.1
hooks for the bulk-load undo.
10.10 Deduplicate-key mode: BTID_INT.deduplicate_key_idx
Section titled “10.10 Deduplicate-key mode: BTID_INT.deduplicate_key_idx”SUPPORT_DEDUPLICATE_KEY_MODE lets a non-unique index store duplicate keys
efficiently by appending a hidden discriminator column. The position is computed
once at load (dk_get_deduplicate_key_position) into btid_int.deduplicate_key_idx
and persisted into the root header at finalization via SET_DECOMPRESS_IDX_HEADER (root_header, load_args->btid->deduplicate_key_idx), read back on every open via
GET_DECOMPRESS_IDX_HEADER. Three places diverge from the regular path: (1)
range scan / prefix length sets env->same_prefix_len = ...deduplicate_key_idx
so the dedup column bounds user-key prefix from discriminator suffix; (2)
foreign-key remake (§10.8) rebuilds the probe key to include/strip the dedup
column via the is_newly/is_fk_remake flags; (3) the same header field doubles
as the compression-index marker (SET/GET_DECOMPRESS_IDX_HEADER), reusing the
slot rather than spending a new field. For a plain index deduplicate_key_idx is
the sentinel dk_get_deduplicate_key_position returns when no dedup column applies,
and all three branches collapse to the ordinary behavior of Chapters 3, 7, 8.
10.11 Chapter summary — key takeaways
Section titled “10.11 Chapter summary — key takeaways”- Offline bulk load builds invisibly, then commits atomically.
xbtree_load_indexsorts the heap, packs leaves left-to-right via thebtree_construct_leafscallback, builds internal levels bottom-up, skips runtime latching; any error or empty result aborts the whole file. - Sorted input is a load-time invariant.
btree_construct_leafstrustsbtree_compare_keyto never returnDB_LT; theassert_releaseconverts a comparator bug into immediate failure, not a silently unsorted chain. - Every load page allocation is its own committed sysop.
btree_load_new_pagebracketsfile_allocin a nested system operation that commits independently, relying on whole-file destruction for cleanup;node_leveldiscriminates leaf/non-leaf from overflow. - Online build trades a sort for a snapshot, lock demotion, and a worker
pool.
xbtree_load_online_indexdemotesSCH_M_LOCKtoIX_LOCK, drivesonline_index_builderto dispatch batchedBTREE_OP_ONLINE_INDEX_IB_INSERTtasks, then must re-promote the schema lock in an uninterruptible loop and re-check uniqueness for concurrent collisions. - NULL keys are tracked, never stored.
add_keyskips NULLs from the tree but counts them for unique stats, preservingrows == keys + nulls. - Fence keys anchor prefix compression and are skipped by readers. A
BTREE_LEAF_RECORD_FENCEat slot 1 supplies the common prefix; the decompressor never decompresses a fence, andbtree_search_leaf_pageon an equal-but-fence match either falls through (c = DB_GTat slot 1) or returnsBTREE_KEY_BIGGER(slotkey_cnt) instead ofBTREE_KEY_FOUND. - Unique stats are a structurally-enforced triple feeding a global hash.
btree_unique_statsmovesm_keys/m_nullsin lockstep withm_rows;multi_index_unique_statsaggregates perBTID; commit flushes intoGLOBAL_UNIQUE_STATS_TABLEwith redo writing absolute totals and undo subtracting a delta (tolerating an already-dropped index).deduplicate_key_idx, stamped into the root header at build time, bends the range-scan, FK-remake, and compression paths into dedup-key mode.
Position hints as of this revision
Section titled “Position hints as of this revision”The following are line numbers as observed on 2026-06-08; symbols are the canonical anchor and line numbers are hints that decay.
| Symbol | File | Line |
|---|---|---|
OR_OID_SIZE | src/base/object_representation_constants.h | 67 |
OR_OID_PAGEID | src/base/object_representation_constants.h | 68 |
OR_OID_SLOTID | src/base/object_representation_constants.h | 69 |
OR_OID_VOLID | src/base/object_representation_constants.h | 70 |
BTREE_SPLIT_LOWER_BOUND | src/storage/btree.c | 79 |
BTREE_SPLIT_MIN_PIVOT | src/storage/btree.c | 82 |
BTREE_SPLIT_DEFAULT_PIVOT | src/storage/btree.c | 85 |
BTREE_NODE_MAX_SPLIT_SIZE | src/storage/btree.c | 88 |
CAN_MERGE_WHEN_EMPTY | src/storage/btree.c | 101 |
FORCE_MERGE_WHEN_EMPTY | src/storage/btree.c | 104 |
BTREE_LEAF_RECORD_FENCE | src/storage/btree.c | 113 |
BTREE_LEAF_RECORD_OVERFLOW_OIDS | src/storage/btree.c | 115 |
BTREE_LEAF_RECORD_OVERFLOW_KEY | src/storage/btree.c | 117 |
BTREE_LEAF_RECORD_CLASS_OID | src/storage/btree.c | 119 |
BTREE_LEAF_RECORD_MASK | src/storage/btree.c | 121 |
BTREE_OID_HAS_MVCC_INSID | src/storage/btree.c | 124 |
BTREE_OID_HAS_MVCC_DELID | src/storage/btree.c | 126 |
BTREE_RECORD_OR_BUF_INIT | src/storage/btree.c | 260 |
BTREE_GET_MVCC_INFO_SIZE_FROM_FLAGS | src/storage/btree.c | 278 |
BTREE_MERGE_NO | src/storage/btree.c | 311 |
BTREE_MERGE_TRY | src/storage/btree.c | 312 |
BTREE_MERGE_FORCE | src/storage/btree.c | 313 |
btree_search_key_helper | src/storage/btree.c | 358 |
BTREE_FIND_UNIQUE_HELPER | src/storage/btree.c | 387 |
BTS_IS_SOFT_CAPACITY_ENOUGH | src/storage/btree.c | 568 |
BTS_IS_HARD_CAPACITY_ENOUGH | src/storage/btree.c | 582 |
BTS_SAVE_OID_IN_BUFFER | src/storage/btree.c | 592 |
btree_insert_helper | src/storage/btree.c | 662 |
BTREE_INSERT_HELPER | src/storage/btree.c | 662 |
btree_delete_helper | src/storage/btree.c | 766 |
BTREE_DELETE_HELPER | src/storage/btree.c | 766 |
btree_or_get_object | src/storage/btree.c | 1458 |
btree_fix_root_with_info | src/storage/btree.c | 1850 |
btree_create_overflow_key_file | src/storage/btree.c | 1975 |
btree_leaf_record_change_overflow_link | src/storage/btree.c | 2318 |
btree_leaf_get_first_object | src/storage/btree.c | 2451 |
btree_record_get_num_oids | src/storage/btree.c | 2757 |
btree_leaf_change_first_object | src/storage/btree.c | 2822 |
btree_leaf_set_flag | src/storage/btree.c | 3461 |
btree_record_object_set_mvcc_flags | src/storage/btree.c | 3480 |
btree_append_oid | src/storage/btree.c | 3637 |
btree_record_append_object | src/storage/btree.c | 3798 |
btree_start_overflow_page | src/storage/btree.c | 3973 |
btree_get_disk_size_of_key | src/storage/btree.c | 4065 |
btree_write_record | src/storage/btree.c | 4100 |
btree_read_record | src/storage/btree.c | 4257 |
btree_read_record_without_decompression | src/storage/btree.c | 4515 |
btree_search_nonleaf_page | src/storage/btree.c | 5189 |
btree_leaf_is_key_between_min_max | src/storage/btree.c | 5369 |
btree_search_leaf_page | src/storage/btree.c | 5537 |
btree_find_foreign_key | src/storage/btree.c | 6361 |
btree_delete_key_from_leaf | src/storage/btree.c | 9653 |
btree_replace_first_oid_with_ovfl_oid | src/storage/btree.c | 9781 |
btree_modify_leaf_ovfl_vpid | src/storage/btree.c | 9978 |
btree_modify_overflow_link | src/storage/btree.c | 10054 |
btree_delete_meta_record | src/storage/btree.c | 10148 |
btree_write_default_split_info | src/storage/btree.c | 10247 |
btree_merge_root | src/storage/btree.c | 10284 |
btree_merge_node | src/storage/btree.c | 10557 |
btree_node_size_uncompressed | src/storage/btree.c | 11097 |
btree_node_mergeable | src/storage/btree.c | 11172 |
btree_key_append_object_as_new_overflow | src/storage/btree.c | 11330 |
btree_find_oid_and_its_page | src/storage/btree.c | 11646 |
btree_find_oid_does_mvcc_info_match | src/storage/btree.c | 11794 |
btree_find_oid_from_leaf | src/storage/btree.c | 11946 |
btree_find_oid_from_ovfl | src/storage/btree.c | 12037 |
btree_find_split_point | src/storage/btree.c | 12419 |
btree_split_find_pivot | src/storage/btree.c | 12849 |
btree_split_next_pivot | src/storage/btree.c | 12874 |
btree_recompress_record | src/storage/btree.c | 13118 |
btree_split_node | src/storage/btree.c | 13324 |
btree_split_root | src/storage/btree.c | 14184 |
btree_locate_key | src/storage/btree.c | 14882 |
btree_keyval_search | src/storage/btree.c | 15458 |
btree_scan_update_range | src/storage/btree.c | 16055 |
btree_rv_util_save_page_records | src/storage/btree.c | 17242 |
btree_get_next_page | src/storage/btree.c | 19384 |
btree_set_vpid_previous_vpid | src/storage/btree.c | 19435 |
btree_compare_key | src/storage/btree.c | 19460 |
btree_check_foreign_key | src/storage/btree.c | 22655 |
btree_rv_undo_global_unique_stats_commit | src/storage/btree.c | 23050 |
btree_rv_redo_global_unique_stats_commit | src/storage/btree.c | 23120 |
btree_search_key_and_apply_functions | src/storage/btree.c | 23186 |
btree_get_root_with_key | src/storage/btree.c | 23390 |
btree_advance_and_find_key | src/storage/btree.c | 23455 |
btree_key_find_and_lock_unique | src/storage/btree.c | 23645 |
btree_key_find_and_lock_unique_of_unique | src/storage/btree.c | 23673 |
btree_key_find_and_lock_unique_of_non_unique | src/storage/btree.c | 23903 |
btree_range_scan_start | src/storage/btree.c | 24926 |
btree_range_scan_resume | src/storage/btree.c | 25024 |
btree_range_scan_read_record | src/storage/btree.c | 25185 |
btree_range_scan_advance_over_filtered_keys | src/storage/btree.c | 25230 |
btree_range_scan_descending_fix_prev_leaf | src/storage/btree.c | 25451 |
btree_range_scan | src/storage/btree.c | 25794 |
btree_range_scan_select_visible_oids | src/storage/btree.c | 26008 |
btree_select_visible_object_for_range_scan | src/storage/btree.c | 26330 |
btree_range_scan_find_fk_any_object | src/storage/btree.c | 26545 |
btree_insert_internal | src/storage/btree.c | 26976 |
btree_fix_root_for_insert | src/storage/btree.c | 27154 |
btree_get_max_new_data_size | src/storage/btree.c | 27429 |
btree_split_node_and_advance | src/storage/btree.c | 27495 |
btree_key_insert_new_object | src/storage/btree.c | 28120 |
btree_key_insert_new_key | src/storage/btree.c | 28302 |
btree_key_lock_and_append_object_unique | src/storage/btree.c | 28572 |
btree_key_append_object_non_unique | src/storage/btree.c | 28927 |
btree_key_append_object_unique | src/storage/btree.c | 29042 |
btree_key_relocate_last_into_ovf | src/storage/btree.c | 29156 |
btree_key_append_object_into_ovf | src/storage/btree.c | 29309 |
btree_delete_internal | src/storage/btree.c | 30525 |
btree_fix_root_for_delete | src/storage/btree.c | 30687 |
btree_merge_node_and_advance | src/storage/btree.c | 30895 |
btree_key_delete_remove_object | src/storage/btree.c | 31493 |
btree_key_remove_object_and_keep_visible_first | src/storage/btree.c | 31753 |
btree_leaf_record_replace_first_with_last | src/storage/btree.c | 32046 |
btree_record_remove_object | src/storage/btree.c | 32137 |
btree_record_remove_object_internal | src/storage/btree.c | 32230 |
btree_key_remove_object | src/storage/btree.c | 32298 |
btree_overflow_remove_object | src/storage/btree.c | 32344 |
btree_leaf_remove_object | src/storage/btree.c | 32517 |
btree_key_remove_insert_mvccid | src/storage/btree.c | 32637 |
btree_key_remove_delete_mvccid | src/storage/btree.c | 32818 |
btree_key_remove_delete_mvccid_unique | src/storage/btree.c | 32986 |
btree_remove_delete_mvccid_unique_internal | src/storage/btree.c | 33113 |
btree_key_remove_delete_mvccid_non_unique | src/storage/btree.c | 33257 |
btree_record_remove_insid | src/storage/btree.c | 33419 |
btree_record_remove_delid | src/storage/btree.c | 33479 |
btree_online_index_list_dispatcher | src/storage/btree.c | 34209 |
btree_rv_log_delete_object | src/storage/btree.c | 35823 |
btree_rv_log_insert_object | src/storage/btree.c | 35885 |
BTREE_NEED_UNIQUE_CHECK | src/storage/btree.h | 63 |
BTREE_NODE_TYPE | src/storage/btree.h | 81 |
non_leaf_rec | src/storage/btree.h | 103 |
leaf_rec | src/storage/btree.h | 111 |
btid_int | src/storage/btree.h | 119 |
BTID_INT.deduplicate_key_idx | src/storage/btree.h | 133 |
btree_keyrange | src/storage/btree.h | 139 |
bts_key_status | src/storage/btree.h | 151 |
btree_scan | src/storage/btree.h | 198 |
BTREE_INIT_SCAN | src/storage/btree.h | 297 |
BTREE_END_OF_SCAN | src/storage/btree.h | 368 |
btree_op_purpose | src/storage/btree.h | 535 |
BTREE_OP_DELETE_OBJECT_PHYSICAL | src/storage/btree.h | 549 |
BTREE_OP_ONLINE_INDEX_IB_INSERT | src/storage/btree.h | 565 |
btree_mvcc_info | src/storage/btree.h | 581 |
btree_object_info | src/storage/btree.h | 594 |
BTREE_RANGE_SCAN_PROCESS_KEY_FUNC | src/storage/btree.h | 710 |
btree_node_header_undo_log | src/storage/btree_load.c | 410 |
btree_node_header_redo_log | src/storage/btree_load.c | 432 |
xbtree_load_index | src/storage/btree_load.c | 856 |
dk_get_deduplicate_key_position | src/storage/btree_load.c | 925 |
btree_save_last_leafrec | src/storage/btree_load.c | 1275 |
SET_DECOMPRESS_IDX_HEADER | src/storage/btree_load.c | 1929 |
btree_load_new_page | src/storage/btree_load.c | 2030 |
btree_proceed_leaf | src/storage/btree_load.c | 2122 |
btree_first_oid | src/storage/btree_load.c | 2199 |
bt_load_make_new_record_on_leaf_page | src/storage/btree_load.c | 2582 |
LOAD_FIXED_EMPTY_FOR_LEAF | src/storage/btree_load.c | 2591 |
btree_construct_leafs | src/storage/btree_load.c | 3011 |
xbtree_load_online_index | src/storage/btree_load.c | 4992 |
online_index_builder | src/storage/btree_load.c | 5287 |
index_builder_loader_task::add_key | src/storage/btree_load.c | 5526 |
index_builder_loader_task::execute | src/storage/btree_load.c | 5562 |
LEAF_FENCE_MAX_SIZE | src/storage/btree_load.h | 89 |
NON_LEAF_ENTRY_MAX_SIZE | src/storage/btree_load.h | 96 |
BTREE_OBJECT_FIXED_SIZE | src/storage/btree_load.h | 130 |
BTREE_MAX_KEYLEN_INPAGE | src/storage/btree_load.h | 147 |
BTREE_MAX_OIDCOUNT_IN_LEAF_RECORD | src/storage/btree_load.h | 155 |
BTREE_GET_KEY_LEN_IN_PAGE | src/storage/btree_load.h | 167 |
btree_node_header | src/storage/btree_load.h | 206 |
btree_root_header | src/storage/btree_load.h | 218 |
btree_unique_stats::insert_key_and_row | src/storage/btree_unique.cpp | 72 |
btree_unique_stats::is_unique | src/storage/btree_unique.cpp | 118 |
multi_index_unique_stats::add_index_stats | src/storage/btree_unique.cpp | 168 |
btree_unique_stats | src/storage/btree_unique.hpp | 34 |
multi_index_unique_stats | src/storage/btree_unique.hpp | 69 |
spage_find_empty_slot_at | src/storage/slotted_page.c | 1674 |
spage_insert_at | src/storage/slotted_page.c | 1902 |
spage_header | src/storage/slotted_page.h | 64 |
spage_slot | src/storage/slotted_page.h | 88 |
NON_LEAF_RECORD_SIZE | src/storage/storage_common.h | 134 |
LEAF_RECORD_SIZE | src/storage/storage_common.h | 136 |
btree_node_split_info | src/storage/storage_common.h | 140 |
BTREE_SEARCH | src/storage/storage_common.h | 399 |
BTREE_CONSTRAINT_UNIQUE | src/storage/storage_common.h | 616 |
BTREE_CONSTRAINT_PRIMARY_KEY | src/storage/storage_common.h | 617 |
Sources
Section titled “Sources”cubrid-btree.md— the high-level companion (design intent, theory).- Raw analysis:
raw/code-analysis/cubrid/storage/index/. - Code:
src/storage/btree.{c,h},btree_load.{c,h},btree_unique.{cpp,hpp}; slotted-page layout insrc/storage/slotted_page.h. - Methodology:
knowledge/methodology/code-analysis-detail-doc.md.