Skip to content

CUBRID B+Tree — Code-Level Deep Dive

Where this document fits: The high-level analysis cubrid-btree.md covers design intent and theoretical background. This document traces every branch and field at the code level. Each chapter is self-contained, but reading in order follows the full lifecycle of a single index key inside the kernel.

Contents:

ChTitleStatus
1Data Structure Map
2Record Encoding Decoding and Module Initialization
3Descent and Latch Coupling
4Inserting a Brand New Key
5Appending an OID and Enforcing Uniqueness
6Node Split and Separator Promotion
7Range Scan and LSA Aware Resume
8Deleting an Object Logical and Physical
9Merge and Under Fill Rebalancing
10Special and Edge Paths

Question: what in-memory and on-disk structures describe a B+Tree node, record, and key/OID cell, and how do their fields relate? Theory lives in cubrid-btree.md (### Slotted-page nodes, ### Non-leaf vs. leaf record split, ### OID-list suffix in non-unique leaves); this chapter pins down the fields. Layering: slotted page -> node header (slot HEADER = 0) -> per-key records (NON_LEAF_REC/LEAF_REC fixed prefix + variable payload).

1.1 The slotted-page substrate: spage_header and spage_slot

Section titled “1.1 The slotted-page substrate: spage_header and spage_slot”
// spage_header -- src/storage/slotted_page.h
struct spage_header
{
PGNSLOTS num_slots;
PGNSLOTS num_records;
INT16 anchor_type; /* ANCHORED / ANCHORED_DONT_REUSE_SLOTS / UNANCHORED_* */
unsigned short alignment;
int total_free;
int cont_free;
int offset_to_free_area; /* records grow up to here */
int reserved1;
int flags; /* always SPAGE_HEADER_FLAG_NONE today */
unsigned int is_saving:1; /* space saving for recovery (undo) */
unsigned int need_update_best_hint:1;
unsigned int reserved_bits:30;
};
FieldRoleWhy it exists
num_slotsallocated slot entries (live + reusable)slot array length; PGSLOTID index space
num_recordsslots currently holding a recordseparates live from freed/reusable slots
anchor_typeslot-stability policy; enum ANCHORED, ANCHORED_DONT_REUSE_SLOTS, UNANCHORED_ANY_SEQUENCE, UNANCHORED_KEEP_SEQUENCEB+Tree uses UNANCHORED_KEEP_SEQUENCE; slot id = the key’s sorted position, later slots shift on insert/delete (Ch 3)
alignmentrecord start alignmentBTREE_MAX_ALIGN so MVCCID/OID reads align
total_freeall free bytes, may be fragmentedpre-insert feasibility check
cont_freecontiguous run at offset_to_free_areafits in place only if <= cont_free, else compact
offset_to_free_arearecord/free-space frontiernew bytes written here, then it advances
reserved1, flags, reserved_bitspadding / future use8-byte alignment, forward-compatible
is_savingrecovery space reservation activerollback can restore freed space
need_update_best_hintheap stats hintunused by B+Tree
// spage_slot -- src/storage/slotted_page.h
struct spage_slot
{
unsigned int offset_to_record:14;
unsigned int record_length:14;
unsigned int record_type:4; /* REC_HOME, REC_NEWHOME, ... */
};
FieldRoleWhy it exists
offset_to_recordwhere record bytes startindirection lets records move on compaction without changing slot ids
record_lengthrecord byte lengthbounds the RECDES handed to B+Tree code
record_typeREC_* discriminatorB+Tree records are REC_HOME; field exists for heap reuse

Invariant — slot/record accounting. num_records <= num_slots and offset_to_free_area + cont_free must not overrun the front-growing slot array. Enforced by spage_insert*; a violation returns SP_DOESNT_FIT -> split (Ch 6).

Invariant — slot order is key order. B+Tree pages are initialized with UNANCHORED_KEEP_SEQUENCE (the spage_initialize calls in btree.c): a slot id is the key’s ordinal position in the node, and spage_insert_at/spage_delete shift the later slots (spage_shift_slot_up/_down) to preserve that order — this is what lets descent binary-search directly over slot ids (Ch 3). The flip side: a PGSLOTID is positional, valid only while the page stays latched; a scan that unfixes the page must re-validate via the page LSA before reusing one (Ch 7). Compaction moves record bytes, never slot entries.

1.2 The B+Tree node header and root header

Section titled “1.2 The B+Tree node header and root header”

Slot HEADER (= 0) is the node header. Non-root nodes carry btree_node_header; the root carries btree_root_header whose first field is an embedded btree_node_header, so both share node-level accessors.

// btree_node_header -- src/storage/btree_load.h
struct btree_node_header
{
BTREE_NODE_SPLIT_INFO split_info;
VPID prev_vpid; /* leaf prev pointer */
VPID next_vpid; /* leaf next pointer */
short node_level; /* Leaf == 1, Non_leaf > 1 */
short max_key_len;
int common_prefix;
};
FieldRoleWhy it exists
split_info{float pivot; int index;}biases the next split point (Ch 6) toward observed access, not always 50/50
prev_vpidleft sibling page id (leaf)leaf doubly-linked chain range scans walk (Ch 7)
next_vpidright sibling page id (leaf)B-link “next”; a scan on a just-split page follows it rightward
node_levelnode height, leaves == 1leaf/non-leaf discriminator used everywhere
max_key_lenlargest key length in subtreesizing / split estimation without descending
common_prefixshared-prefix length on this nodeprefix compression (COMMON_PREFIX_UNKNOWN = -1 when uncomputed)
// btree_node_split_info -- src/storage/storage_common.h
struct btree_node_split_info { float pivot; int index; }; /* pivot = split_slot_id/num_keys; index = insert slot after split */
// btree_root_header -- src/storage/btree_load.h
struct btree_root_header
{
BTREE_NODE_HEADER node; /* root is also a node */
INT64 num_oids;
INT64 num_nulls;
INT64 num_keys;
OID topclass_oid; /* topclass oid, or NULL OID for non-unique */
int unique_pk;
struct { int rev_level:16; int deduplicate_key_idx:16; } _32;
VFID ovfid; /* overflow (key) file */
MVCCID creator_mvccid;
char packed_key_domain[1]; /* key type for the index — ALWAYS LAST */
};
// BTREE_CONSTRAINT_TYPE -- src/storage/storage_common.h
typedef enum { BTREE_CONSTRAINT_UNIQUE = 0x01, BTREE_CONSTRAINT_PRIMARY_KEY = 0x02 } BTREE_CONSTRAINT_TYPE;
FieldRoleWhy it exists
nodeembedded btree_node_headerroot is a real node; btree_get_node_header works on it too
num_oidstotal objects indexedunique stats / estimates; maintained by delta logging
num_nullsNULL keys (not stored)unique constraints ignore NULLs; counted apart
num_keysdistinct keysnum_oids == num_keys for a healthy unique index
topclass_oidowning class OID, or NULLunique single-class vs. non-unique multi-class
unique_pkbitmask BTREE_CONSTRAINT_UNIQUE (0x01), BTREE_CONSTRAINT_PRIMARY_KEY (0x02)uniqueness enforcement + fixed object sizing (Ch 5)
_32.rev_levelon-disk revisionformat gate (BTREE_CURRENT_REV_LEVEL)
_32.deduplicate_key_idxdedup column index, stored idx+1SUPPORT_DEDUPLICATE_KEY_MODE; +1 lets 0 mean unset
ovfidoverflow-key file idover-long keys (Ch 10)
creator_mvccidindex-creating transactionindex visibility during online load
packed_key_domainserialized key TP_DOMAINthe key type; must be last (variable-length)

Invariant — packed_key_domain is trailing. As a flexible array member, nothing may follow it. Enforced by btree_init_root_header (packs last) / btree_glean_root_header_info (unpacks). A field after it overwrites domain bytes and corrupts key comparison.

Invariant — unique counters. num_oids == num_keys must hold after every committed mutation on a unique index (NULLs in num_nulls). Enforced by btree_change_root_header_delta, which adjusts all three atomically with the structural change (Ch 5).

1.3 Per-key record fixed prefixes: NON_LEAF_REC and LEAF_REC

Section titled “1.3 Per-key record fixed prefixes: NON_LEAF_REC and LEAF_REC”
// non_leaf_rec -- src/storage/btree.h
struct non_leaf_rec { VPID pnt; short key_len; }; /* pnt = child page pointer */
// leaf_rec -- src/storage/btree.h
struct leaf_rec { VPID ovfl; short key_len; }; /* ovfl = overflow-OID page */
StructFieldRoleWhy it exists
non_leaf_recpntchild VPID for keys >= separatorthe descent (Ch 3) follows it down
non_leaf_reckey_lenseparator key length (-1 if overflowed)key bytes after pnt; on-disk prefix NON_LEAF_RECORD_SIZE = DISK_VPID_ALIGNED_SIZE
leaf_recovflfirst overflow-OID page, NULL_VPID if noneOID spillover beyond record capacity (Ch 5)
leaf_reckey_leninline key length (-1 if overflow key)bounds key bytes before the OID list; LEAF_RECORD_SIZE = 0 — no fixed prefix, the first OID is the start and leaf_rec is only the parse target

Left to right: first OID (inline) -> class OID (only if BTREE_LEAF_RECORD_CLASS_OID) -> insert MVCCID (only if first OID’s volid has BTREE_OID_HAS_MVCC_INSID) -> delete MVCCID (only if BTREE_OID_HAS_MVCC_DELID) -> key bytes (leaf_rec.key_len long, or a VPID if overflow key) -> trailing OID list, each OID possibly carrying its own MVCC fields -> aligned overflow VPID at the tail (only if BTREE_LEAF_RECORD_OVERFLOW_OIDS). The discriminating bits live in two reused sub-fields of the first OID: record flags in slotid, per-OID MVCC flags in volid.

// leaf record flags -- src/storage/btree.c (in first OID's slotid)
#define BTREE_LEAF_RECORD_FENCE ((short) 0x1000)
#define BTREE_LEAF_RECORD_OVERFLOW_OIDS ((short) 0x2000)
#define BTREE_LEAF_RECORD_OVERFLOW_KEY ((short) 0x4000)
#define BTREE_LEAF_RECORD_CLASS_OID ((short) 0x8000)
#define BTREE_LEAF_RECORD_MASK ((short) 0xF000)
FlagMeaning when setConsumer
BTREE_LEAF_RECORD_FENCErecord is a fence (split boundary), not an objectscans skip (Ch 7); splits create (Ch 6)
BTREE_LEAF_RECORD_OVERFLOW_OIDSleaf_rec.ovfl set; trailing aligned VPID existsBTREE_RECORD_OR_BUF_INIT shrinks parsed length by DB_ALIGN(DISK_VPID_SIZE, BTREE_MAX_ALIGN) so the tail VPID is not read as an OID
BTREE_LEAF_RECORD_OVERFLOW_KEYkey replaced by a VPID into ovfidkey read/compare fetches the overflow page (Ch 10)
BTREE_LEAF_RECORD_CLASS_OIDclass OID stored after first OIDobjects in one record span classes (non-unique multi-class)
// per-OID MVCC flags -- src/storage/btree.c (in each OID's volid)
#define BTREE_OID_HAS_MVCC_INSID ((short) 0x4000)
#define BTREE_OID_HAS_MVCC_DELID ((short) 0x8000)
#define BTREE_OID_MVCC_FLAGS_MASK ((short) 0xC000)

BTREE_GET_MVCC_INFO_SIZE_FROM_FLAGS derives the byte count: both set -> 2 * OR_MVCCID_SIZE, one -> OR_MVCCID_SIZE, neither -> 0.

Invariant — flag namespace separation. Record flags are slotid bits 0xF000; MVCC flags are volid bits 0xC000. OVERFLOW_KEY and HAS_MVCC_INSID are both 0x4000 in different sub-fields. Enforced by BTREE_OID_GET_RECORD_FLAGS (slotid) vs. BTREE_OID_GET_MVCC_FLAGS (volid); mixing them misparses.

Invariant — MVCC fields decode positionally. No length prefix: presence is known only from the OID’s volid flags, so exactly those bytes must be consumed before the next field. Enforced by BTREE_RECORD_OR_BUF_INIT + flag macros; ad-hoc arithmetic desyncs on the first variable-MVCC object.

flowchart TD
  A["start: first OID"] --> B{"slotid has<br/>CLASS_OID?"}
  B -->|yes| C["read class OID"]
  B -->|no| D
  C --> D{"volid has<br/>HAS_MVCC_INSID?"}
  D -->|yes| E["read insert_mvccid"]
  D -->|no| F
  E --> F{"volid has<br/>HAS_MVCC_DELID?"}
  F -->|yes| G["read delete_mvccid"]
  F -->|no| H
  G --> H{"slotid has<br/>OVERFLOW_KEY?"}
  H -->|yes| I["key field = VPID into ovfid"]
  H -->|no| J["read key_len key bytes"]
  I --> K["trailing OID list"]
  J --> K
  K --> L{"slotid has<br/>OVERFLOW_OIDS?"}
  L -->|yes| M["leaf_rec.ovfl = aligned tail VPID"]
  L -->|no| N["done"]
  M --> N

Figure 1-1: branch-complete parse of a single leaf record.

1.5 BTID_INT: the in-memory index descriptor

Section titled “1.5 BTID_INT: the in-memory index descriptor”

Gleaned from the root header by btree_glean_root_header_info, threaded through nearly every B+Tree function.

// btid_int -- src/storage/btree.h
struct btid_int
{
BTID *sys_btid;
int unique_pk;
int part_key_desc; /* last partial-key domain is descending */
TP_DOMAIN *key_type;
TP_DOMAIN *nonleaf_key_type; /* differs from key_type under prefix keys */
VFID ovfid;
char *copy_buf; /* key copy buffer, from INDX_SCAN_ID */
int copy_buf_len;
int rev_level;
int deduplicate_key_idx;
OID topclass_oid;
};
FieldRoleWhy it exists
sys_btidpersistent BTID (file + root page id)on-disk index identity
unique_pkuniqueness/PK bitmaskunique checks (Ch 5) + fixed object sizing
part_key_desclast partial-key column is DESCreverses comparison sign (BTREE_IS_PART_KEY_DESC)
key_typeleaf key TP_DOMAINfull-key comparison / serialization
nonleaf_key_typenon-leaf domain, differs only with prefix keysfixed-char separators stored as varying counterpart; non-leaf compare uses this
ovfidoverflow-key fileover-long keys; mirrors root ovfid
copy_buf / copy_buf_lenkey-materialization scratchavoids per-key malloc in scans; borrowed from scan id
rev_levelon-disk format revisionguards format-dependent paths
deduplicate_key_idxdedup column indexdedup-mode key handling
topclass_oidowning class OIDclass checks; NULL for non-unique multi-class

Invariant — key_type vs. nonleaf_key_type. Same domain for non-prefix indexes; for fixed-char prefix indexes nonleaf_key_type is the varying counterpart from btree_generate_prefix_domain, and separators must compare with it, never key_type. The wrong domain mis-orders descent (Ch 6).

// BTREE_NODE_TYPE -- src/storage/btree.h
typedef enum { BTREE_LEAF_NODE = 0, BTREE_NON_LEAF_NODE, BTREE_OVERFLOW_NODE } BTREE_NODE_TYPE;
ValuePage kindRecord layout
BTREE_LEAF_NODEleaf (node_level == 1)leaf-record layout of 1.4
BTREE_NON_LEAF_NODEinternal (node_level > 1)NON_LEAF_REC + separator key
BTREE_OVERFLOW_NODEoverflow-OID pagebtree_overflow_header + packed fixed-size OIDs

BTREE_OP_PURPOSE is the dispatch key for the one shared engine (btree_insert_internal/btree_delete_internal):

// btree_op_purpose -- src/storage/btree.h
enum btree_op_purpose
{
BTREE_OP_NO_OP,
BTREE_OP_INSERT_NEW_OBJECT,
BTREE_OP_INSERT_MVCC_DELID,
BTREE_OP_INSERT_MARK_DELETED,
BTREE_OP_INSERT_UNDO_PHYSICAL_DELETE,
BTREE_OP_DELETE_OBJECT_PHYSICAL,
BTREE_OP_DELETE_OBJECT_PHYSICAL_POSTPONED,
BTREE_OP_DELETE_UNDO_INSERT,
BTREE_OP_DELETE_UNDO_INSERT_UNQ_MULTIUPD,
BTREE_OP_DELETE_UNDO_INSERT_DELID,
BTREE_OP_DELETE_VACUUM_OBJECT,
BTREE_OP_DELETE_VACUUM_INSID,
BTREE_OP_NOTIFY_VACUUM,
BTREE_OP_ONLINE_INDEX_IB_INSERT,
BTREE_OP_ONLINE_INDEX_IB_DELETE,
BTREE_OP_ONLINE_INDEX_TRAN_INSERT,
BTREE_OP_ONLINE_INDEX_TRAN_INSERT_DF,
BTREE_OP_ONLINE_INDEX_UNDO_TRAN_INSERT,
BTREE_OP_ONLINE_INDEX_TRAN_DELETE,
BTREE_OP_ONLINE_INDEX_UNDO_TRAN_DELETE
};

Five families: insert, physical/logical delete, vacuum, notify-vacuum, online-index load. INSERT_MVCC_DELID (add delete MVCCID on MVCC delete) vs. INSERT_MARK_DELETED (non-MVCC unique, vacuum-exempt) matters for Ch 8; DELETE_VACUUM_OBJECT (full remove) vs. DELETE_VACUUM_INSID (insert-MVCCID-only remove) for vacuum.

1.7 BTREE_MVCC_INFO and BTREE_OBJECT_INFO: the in-memory cell

Section titled “1.7 BTREE_MVCC_INFO and BTREE_OBJECT_INFO: the in-memory cell”

Parsing the 1.4 layout yields one BTREE_OBJECT_INFO per object, whose MVCC sub-record is BTREE_MVCC_INFO:

// btree_mvcc_info -- src/storage/btree.h
struct btree_mvcc_info { short flags; MVCCID insert_mvccid; MVCCID delete_mvccid; };
#define BTREE_MVCC_INFO_INITIALIZER { 0, MVCCID_ALL_VISIBLE, MVCCID_NULL }
// btree_object_info -- src/storage/btree.h
struct btree_object_info { OID oid; OID class_oid; BTREE_MVCC_INFO mvcc_info; };
StructFieldRoleWhy it exists
btree_mvcc_infoflagswhich MVCCIDs presentin-memory mirror of per-OID volid flags
btree_mvcc_infoinsert_mvccidcreating tx id (or MVCCID_ALL_VISIBLE)read visibility; vacuum clears once all-visible
btree_mvcc_infodelete_mvcciddeleting tx id (or MVCCID_NULL)logical-delete marker; physical removal awaits vacuum
btree_object_infooidindexed instancethe lookup answer
btree_object_infoclass_oidowning classmulti-class non-unique indexes + lock targeting
btree_object_infomvcc_infoobject MVCC statescan visibility (Ch 7)

Invariant — flags mirrors stored bits. BTREE_MVCC_INFO.flags must equal the per-OID flags packed into the record. Enforced by the sanctioned mutators BTREE_MVCC_INFO_SET_FIXED_SIZE / _CLEAR_FIXED_SIZE; disagreement makes the computed size wrong and reads the next object from the wrong offset. The fixed-size rule (BTREE_OBJECT_FIXED_SIZE, btree_load.h) forces flags full — OID [+ class OID if unique] + both MVCCIDs — for an overflow-page object, a non-first leaf object, or the first object of a record that has overflow pages, so overflow pages address as fixed-stride arrays.

Containment: BTID_INT is gleaned from btree_root_header (embeds btree_node_header, embeds btree_node_split_info); node records are non_leaf_rec (pnt descends to a child) or leaf_rec (ovfl chains to a btree_overflow_header page of fixed-stride OIDs); each leaf record decodes to a list of btree_object_info, each holding a btree_mvcc_info.

  1. Three layers. Each page is a slotted page (spage_header + spage_slot[]) whose slot 0 (HEADER) holds btree_node_header (non-root) or btree_root_header (root, embedding a node header); slots 1..N hold per-key records.
  2. LEAF_RECORD_SIZE is 0. A leaf record has no fixed on-disk prefix — the first OID is the start; leaf_rec/non_leaf_rec are parse targets, and only non_leaf_rec has a real on-disk size (NON_LEAF_RECORD_SIZE).
  3. Flags live in reused OID sub-fields. Record flags (BTREE_LEAF_RECORD_*, mask 0xF000) in the first OID’s slotid; MVCC flags (BTREE_OID_HAS_MVCC_*, mask 0xC000) in each OID’s volid. 0x4000 differs per field — mask the correct sub-field.
  4. Leaf records decode positionally. Optional class OID, insert/delete MVCCIDs, key (inline or overflow VPID), trailing OID list, optional aligned overflow VPID — present by flag, no length prefixes; the only safe walk is BTREE_RECORD_OR_BUF_INIT plus flag macros.
  5. BTID_INT is the threaded descriptor. Gleaned from the root header, it carries unique_pk, separate key_type/nonleaf_key_type (diverge only under prefix keys), ovfid, and dedup metadata into nearly every operation.
  6. BTREE_OP_PURPOSE is the dispatch key. One insert/delete engine serves five families (insert, delete, vacuum, notify, online-load); the purpose distinguishes MVCC delete from non-MVCC mark-deleted and full-remove from insid-only-remove.
  7. Counters and flags stay coherent. Root-header num_oids/num_keys/num_nulls are kept consistent by delta logging, and BTREE_MVCC_INFO.flags must mirror the stored OID flags — so record sizes and unique arithmetic never desynchronize across a crash or concurrent reader.

Chapter 2: Record Encoding Decoding and Module Initialization

Section titled “Chapter 2: Record Encoding Decoding and Module Initialization”

Chapter 1 mapped the in-memory structs (BTID_INT, LEAF_REC, NON_LEAF_REC, BTREE_MVCC_INFO, BTREE_OBJECT_INFO). This chapter traces how a logical (key, OID, class_oid, MVCC info) tuple becomes the raw bytes of a slotted-page record and back, then covers the two per-operation scratch structs (btree_insert_helper, btree_delete_helper) every later chapter threads down the call stack, and the one-time overflow-key file bootstrap. For the why of overflow keys, the OID-list suffix, and unique-vs-non-unique layout see the companion’s Node layout and Unique-key handling; this chapter traces the bytes, not the theory.

A leaf record’s parsing-control bits are not a separate field — they ride in the high nibble of the first object’s slot-id half-word (OR_OID_SLOTID):

// BTREE_LEAF_RECORD_* — src/storage/btree.c
#define BTREE_LEAF_RECORD_FENCE ((short) 0x1000) /* fence key, not a real object */
#define BTREE_LEAF_RECORD_OVERFLOW_OIDS ((short) 0x2000) /* OID list spills to overflow page */
#define BTREE_LEAF_RECORD_OVERFLOW_KEY ((short) 0x4000) /* key spilled to overflow-key file */
#define BTREE_LEAF_RECORD_CLASS_OID ((short) 0x8000) /* subclass OID present after instance OID */
#define BTREE_LEAF_RECORD_MASK ((short) 0xF000)

A second, independent pair rides in the first object’s volid half-word (OR_OID_VOLID) and governs MVCC fields per object: BTREE_OID_HAS_MVCC_INSID (0x4000), BTREE_OID_HAS_MVCC_DELID (0x8000). An OID is 8 bytes (pageid(4) @0 | slotid(2) @4 | volid(2) @6); a disk address uses only the low bits of slotid/volid, so each high nibble is free. This is the central encoding trick: the first object’s OID doubles as the record header.

flowchart LR
  subgraph First["First object (also the record header)"]
    O1["instance OID 8B<br/>slotid hi-nibble = record flags<br/>volid hi-nibble = MVCC flags"]
    C1["class OID 8B<br/>if CLASS_OID flag"]
    I1["insert MVCCID 8B<br/>if HAS_MVCC_INSID"]
    D1["delete MVCCID 8B<br/>if HAS_MVCC_DELID"]
  end
  KEY["key bytes<br/>(or 6B overflow VPID if OVERFLOW_KEY)"]
  REST["object 2..n<br/>OID + optional MVCCIDs"]
  OVF["overflow-OIDs VPID 6B<br/>if OVERFLOW_OIDS"]
  O1 --> C1 --> I1 --> D1 --> KEY --> REST --> OVF

Figure 2-1. Leaf-record byte grammar. The first object precedes the key; all additional objects follow it after an align32 pad. The trailing overflow-OIDs VPID is excluded from object iteration by BTREE_RECORD_OR_BUF_INIT (§2.5). A non-leaf record has no objects: it is the NON_LEAF_REC fixed portion (child VPID + key_len) followed by the key, where a negative key_len is the non-leaf equivalent of OVERFLOW_KEY.

2.2 Measuring a key: btree_get_disk_size_of_key

Section titled “2.2 Measuring a key: btree_get_disk_size_of_key”

The caller needs the key’s packed length to size the buffer and decide whether it overflows. btree_get_disk_size_of_key is a thin wrapper:

// btree_get_disk_size_of_key — src/storage/btree.c
int btree_get_disk_size_of_key (DB_VALUE * key) {
if (key == NULL || DB_IS_NULL (key)) {
assert (key != NULL && !DB_IS_NULL (key)); /* <- NULL keys never reach here in release */
return 0;
}
return pr_index_writeval_disk_size (key); /* <- delegates to the type's packer */
}

Its only branch is the NULL guard: a NULL key is the absence of a record (counted in the root header’s num_nulls), never a zero-length cell, so reaching it is a bug. The result feeds or_init sizing in btree_write_record and the overflow threshold in §2.9.

Invariant — key size is computed once and reused. The insert path stashes the length in btree_insert_helper::key_len_in_page after BTREE_GET_KEY_LEN_IN_PAGE (which collapses keys >= BTREE_MAX_KEYLEN_INPAGE to DISK_VPID_SIZE). If this measurement and the actual index_writeval disagreed, the slot would be mis-sized and adjacent records corrupted. The code enforces agreement by routing both through the same pr_type packer.

2.3 Setting record flags: btree_leaf_set_flag

Section titled “2.3 Setting record flags: btree_leaf_set_flag”

Every record-flag write goes through one chokepoint:

// btree_leaf_set_flag — src/storage/btree.c
static void btree_leaf_set_flag (RECDES * recp, short record_flag) {
short slot_id;
assert ((short) (record_flag & ~BTREE_LEAF_RECORD_MASK) == 0); /* <- only the F000 nibble allowed */
slot_id = OR_GET_SHORT (recp->data + OR_OID_SLOTID); /* <- read first object's slotid+flags */
OR_PUT_SHORT (recp->data + OR_OID_SLOTID, slot_id | record_flag); /* <- OR in, preserve slotid */
}

It reads the word at OR_OID_SLOTID, ORs in the flag, writes it back. The assert catches a stray bit (or an MVCC 0x4000/0x8000 meant for the volid word) that would corrupt the first OID’s slot-id and silently redirect the object pointer. The write is non-destructive on the low bits because a real slot-id never uses the top nibble. MVCC flags use the sibling btree_record_object_set_mvcc_flags, the identical trick on OR_OID_VOLID.

2.4 Serializing a record: btree_write_record

Section titled “2.4 Serializing a record: btree_write_record”

btree_write_record builds the record for both node types into a caller-supplied RECDES, writing through an OR_BUF cursor and setting flags as it goes. Figure 2-2 traces every branch.

flowchart TD
  A["or_init buf over rec->data"] --> B{node_type == LEAF?}
  B -- "non-leaf" --> NL["write fixed NON_LEAF_REC"]
  B -- "leaf" --> L1["or_put_oid instance OID"]
  L1 --> L2{UNIQUE and class != topclass?}
  L2 -- yes --> L3["or_put_oid class OID<br/>set CLASS_OID flag"]
  L2 -- no --> L4
  L3 --> L4{mvcc_info != NULL?}
  L4 -- yes --> L7["or_put_mvccid insid if HAS_INSID<br/>or_put_mvccid delid if HAS_DELID<br/>set MVCC flags in volid"]
  L4 -- no --> KT
  L7 --> KT
  NL --> KT{NORMAL key?}
  KT -- normal --> K1["pr_type->index_writeval key"]
  KT -- overflow --> K3["leaf: set OVERFLOW_KEY flag<br/>btree_store_overflow_key, write 6B VPID"]
  K1 --> AL["or_put_align32"]
  K3 --> AL
  AL --> FIN["rec->length = ptr-buffer<br/>rec->type = REC_HOME"]

Figure 2-2. Control flow of btree_write_record, all branches.

Every or_put_* is followed by if (error_code != NO_ERROR) { assert_release (false); return error_code; } — an OR_BUF overrun is a sizing bug, not a runtime condition. Leaf-path branches:

  • First objector_put_oid writes the instance OID, which is also the record header (§2.1).
  • Class OIDor_put_oid + btree_leaf_set_flag (rec, BTREE_LEAF_RECORD_CLASS_OID) only when unique and the object’s class differs from btid->topclass_oid (a PK on the base table stores none, saving 8 bytes). Flag and bytes move in lockstep.
  • MVCC — guarded by mvcc_info != NULL; HAS_INSID/HAS_DELID are independent or_put_mvccid calls. btree_record_object_set_mvcc_flags (rec->data, mvcc_info->flags) commits the volid nibble after the bytes, so a partial record never claims fields it lacks. mvcc_info == NULL writes neither bytes nor flags (non-MVCC bulk paths).

The non-leaf branch emits btree_write_fixed_portion_of_non_leaf_record_to_orbuf and ignores oid/class_oid/mvcc_info. The shared key-write then runs: NORMAL calls pr_type->index_writeval; OVERFLOW sets OVERFLOW_KEY (leaf only — non-leaf signals overflow via negative key_len), btree_store_overflow_key chains the key, and the 6-byte VPID is written in its place. Finally or_put_align32 pads to 4 bytes (the parser relies on this — §2.6), rec->length is the cursor delta, rec->type = REC_HOME.

Invariant — flags written during serialization must exactly predict the fields the parser skips. btree_write_record is the only producer, btree_read_record_without_decompression the only consumer; they share no state but the flag bits. If the writer set OVERFLOW_KEY without storing the key (or vice versa), the reader’s or_advance math walks into the key region. Hence each flag is set immediately adjacent to its or_put, never batched.

2.5 BTREE_RECORD_OR_BUF_INIT — the iteration boundary

Section titled “2.5 BTREE_RECORD_OR_BUF_INIT — the iteration boundary”

Object loops never read the whole record blindly. BTREE_RECORD_OR_BUF_INIT sets an OR_BUF whose endptr excludes the trailing overflow-OIDs VPID:

// BTREE_RECORD_OR_BUF_INIT — src/storage/btree.c
#define BTREE_RECORD_OR_BUF_INIT(buf, btree_rec) \
do { int size = (btree_rec)->length; \
if (btree_leaf_is_flaged (btree_rec, BTREE_LEAF_RECORD_OVERFLOW_OIDS)) \
{ size -= DB_ALIGN (DISK_VPID_SIZE, BTREE_MAX_ALIGN); } /* <- hide ovfl pointer from loops */ \
or_init (&buf, (btree_rec)->data, size); } while (false)

This lets §2.6 and §2.7 loop while (buf.ptr < buf.endptr) without mistaking the 6-byte overflow VPID for an object.

2.6 Parsing a record: btree_read_record and the without_decompression core

Section titled “2.6 Parsing a record: btree_read_record and the without_decompression core”

btree_read_record is a two-step shell: it calls btree_read_record_without_decompression to parse fields and locate the key, then — for in-page leaf keys that are not fences — reattaches the node’s common prefix. The core walks the bytes, skipping exactly the fields the writer claimed to write, gated on the same conditions §2.4 used:

// btree_read_record_without_decompression — src/storage/btree.c (leaf field-skip, condensed)
rc = or_advance (&buf, OR_OID_SIZE); /* skip instance oid */
if (BTREE_IS_UNIQUE (btid->unique_pk)
&& btree_leaf_is_flaged (rec, BTREE_LEAF_RECORD_CLASS_OID))
{ rc = or_advance (&buf, OR_OID_SIZE); } /* skip class oid (UNIQUE && flag) */
if (btree_record_object_is_flagged (rec->data, BTREE_OID_HAS_MVCC_INSID))
{ rc = or_advance (&buf, OR_MVCCID_SIZE); } /* skip insert mvccid */
if (btree_record_object_is_flagged (rec->data, BTREE_OID_HAS_MVCC_DELID))
{ rc = or_advance (&buf, OR_MVCCID_SIZE); } /* skip delete mvccid */
if (btree_leaf_is_flaged (rec, BTREE_LEAF_RECORD_OVERFLOW_KEY))
{ key_type = BTREE_OVERFLOW_KEY; }
if (btree_leaf_is_flaged (rec, BTREE_LEAF_RECORD_OVERFLOW_OIDS))
{ btree_leaf_get_vpid_for_overflow_oids (rec, &leaf_rec->ovfl); }
else { VPID_SET_NULL (&leaf_rec->ovfl); }

Record flags come from the slot-id nibble (btree_leaf_is_flaged), per-object MVCC flags from the volid nibble (btree_record_object_is_flagged); the class-OID skip needs BTREE_IS_UNIQUE and the flag, matching §2.4. Each or_advance has its own if (rc != NO_ERROR) return rc; (omitted). The non-leaf branch reads the fixed NON_LEAF_REC portion and treats key_len < 0 as BTREE_OVERFLOW_KEY. Key-read branches (shared):

  • NORMAL key. clear_key is true only when key != NULL && copy_key == COPY_KEY_VALUE; copying a MIDXKEY/char/bit string routes through btid->copy_buf to avoid a malloc. index_readval advances the cursor and leaf_rec->key_len is set from the delta. Passing key == NULL makes index_readval skip the key without materializing it — the cheap path scans use to reach the OID list.
  • OVERFLOW key. Reads the 6-byte VPID; if key != NULL it follows the chain via btree_load_overflow_key and forces *clear_key = true, else *clear_key = false. A failed or_get_short asserts and nulls the key.

The tail computes after_key_offset identically for every branch: a key->need_clear override forces *clear_key = true for self-clearing types, then buf.ptr = PTR_ALIGN (buf.ptr, OR_INT_SIZE) skips the writer’s align32 and *offset = CAST_BUFLEN (buf.ptr - buf.buffer) records the byte after the key. That *offset is the module’s most-reused number — where the OID list begins, passed everywhere as after_key_offset. btree_read_record then decompresses only for a leaf, non-overflow, non-fence key when btree_node_get_common_prefix returns n_prefix > 0: it reads slot 1 (the fence) and pr_midxkey_add_prefix splices prefix + suffix into *key (n_prefix < 0 asserts). See the companion’s node-layout section for prefix mechanics.

2.7 Counting objects: btree_leaf_get_first_object and btree_record_get_num_oids

Section titled “2.7 Counting objects: btree_leaf_get_first_object and btree_record_get_num_oids”

btree_leaf_get_first_object returns the header-doubling object: it runs BTREE_RECORD_OR_BUF_INIT (record_buffer, recp) then btree_or_get_object (..., BTREE_LEAF_NODE, dummy_offset = 0, oidp, class_oid, mvcc_info), asserting NO_ERROR (the first object must always parse). btree_record_get_num_oids computes the total; its branches encode the storage regimes:

// btree_record_get_num_oids — src/storage/btree.c (condensed; the three branches)
if (node_type == BTREE_LEAF_NODE) {
rec_oid_cnt = 1; /* the object before the key */
BTREE_RECORD_OR_BUF_INIT (buf, rec);
or_seek (&buf, after_key_offset); /* jump past the key */
if (BTREE_IS_UNIQUE (btid_int->unique_pk)) { /* unique: fixed-size objects */
fixed_object_size = BTREE_OBJECT_FIXED_SIZE (btid_int);
assert ((CAST_BUFLEN (buf.endptr - buf.ptr) % fixed_object_size) == 0);
return rec_oid_cnt + CAST_BUFLEN (buf.endptr - buf.ptr) / fixed_object_size;
}
while (buf.ptr < buf.endptr) { /* non-unique: variable size, walk it */
mvcc_flag = btree_record_object_get_mvcc_flags (buf.ptr);
buf.ptr += OR_OID_SIZE + BTREE_GET_MVCC_INFO_SIZE_FROM_FLAGS (mvcc_flag);
rec_oid_cnt++;
}
return rec_oid_cnt;
}
return CEIL_PTVDIV (rec->length, BTREE_OBJECT_FIXED_SIZE (btid_int)); /* overflow node: all fixed */

Three regimes: unique leaf — non-first objects are fixed (BTREE_OBJECT_FIXED_SIZE = 2*OID + 2*MVCCID); one division of the post-key span gives the count, the modulo assert is the corruption tripwire. Non-unique leaf — variable objects (no class OID; MVCCIDs optional), so the loop advances by OR_OID_SIZE + BTREE_GET_MVCC_INFO_SIZE_FROM_FLAGS(...), the +1 first object added up front. Overflow node — no key, all fixed: rec->length / fixed_size via CEIL_PTVDIV.

Invariant — after_key_offset must point exactly at the byte after the key. All three leaf strategies depend on or_seek (&buf, after_key_offset) landing on the first post-key object; that offset is §2.6’s *offset verbatim. The unique-leaf modulo assert catches an off-by-one offset before it silently miscounts.

Every mutation flows through one of two heap-free scratch structs, constructed once per top-level call and threaded down the recursion, grouping what would otherwise be a dozen parameters.

FieldRoleWhy it exists
obj_infoBTREE_OBJECT_INFO being insertedthe payload (OID + class OID + MVCC info)
purposeBTREE_OP_PURPOSE discriminatorselects new-insert / MVCC-delid / online-index / undo behavior
op_typesingle vs multi insert/modifydrives unique_stats_info collection
unique_stats_infobtree_unique_stats * accumulatordefers unique-count maintenance to a batch flush
key_len_in_pagepacked key length (post BTREE_GET_KEY_LEN_IN_PAGE)sizes splits without re-measuring
nonleaf_latch_modePGBUF_LATCH_MODE for descentREAD optimistic, WRITE when a split is anticipated (Ch 3)
is_first_trytrue on first root fixloads B-tree info from root once
need_update_max_key_lenmax-key-len bump to propagate downchildren must follow a grown max
is_crt_node_write_latchedcurrent node already write-latchedskips latch promotion
is_rootcurrent node is the rootroot has special split/merge handling
is_unique_key_added_or_deletedkey count actually changeddistinguishes key-created vs OID-appended for stats
is_unique_multi_updatemulti-row update of unique indextolerates >1 visible object mid-statement
is_ha_enabledHA replication activeforbids any transient unique violation
log_operationser_log tracing toggledebug instrumentation
is_nullthe key is SQL NULLNULLs are counted, not stored
printed_key / printed_key_sha1readable key + its SHA1diagnostics for large keys
insert_listbtree_insert_list *batched multi-key insert driver
leaf_addrLOG_DATA_ADDR of target leafredo-logging anchor
rcvindexLOG_RCVINDEX idselects the redo handler
rv_keyval_data / _lengthpacked key+value undo imagelogical undo payload
rv_redo_data / _ptrredo buffer + write cursorphysical redo image, built incrementally
compensate_undo_nxlsanext-undo LSACLR generation during undo
is_system_op_starteda nested system op is opentracks pending log_sysop_* commit/abort
time_trackPERF_UTIME_TRACKERperformance statistics
saved_locked_oid / _class_oid (SERVER_MODE)object locked during unique checkcarries the lock across latch-drop-and-retry

Constructor descent-defaults worth memorizing: purpose (BTREE_OP_NO_OP), nonleaf_latch_mode (PGBUF_LATCH_READ) (start optimistic), is_first_try (true) (load btree info once), is_crt_node_write_latched (false), is_root (false), is_unique_key_added_or_deleted (true) (assume the key count changes until proven otherwise), is_system_op_started (false).

FieldRoleWhy it exists
object_infoBTREE_OBJECT_INFO to removethe deletion target
second_object_infoa second object infoundo of a unique-index insert needs the displaced object
purposeBTREE_OP_PURPOSE discriminatorsame multiplexing role as insert helper
nonleaf_latch_modedescent latch modeREAD-optimistic default
op_typeoperation type (default SINGLE_ROW_DELETE)gates unique_stats_info collection
unique_stats_infobtree_unique_stats *batch unique-count maintenance
match_mvccinfoBTREE_MVCC_INFO to matchfinds the specific object version in a key’s OID list
buffered_keyOR_BUF * of the keysearch by pre-packed key, avoiding re-encode
printed_key / printed_key_sha1readable key + SHA1diagnostics for large keys
log_operationstracing toggledebug logging
is_rootcurrent node is rootroot-specific merge handling
is_first_searchfirst traversaldistinguishes initial descent from post-restart retry
check_key_deletedpossible >1 visible object (MULTI_ROW_UPDATE)decides whether to verify the key is gone
is_key_deletedkey actually became emptycorrects collected stats if the key survived
leaf_addrLOG_DATA_ADDR of leafredo anchor
rv_keyval_data / _lengthundo imagelogical undo of the delete
rv_redo_data / _ptrredo buffer + cursorphysical redo image
reference_lsareference LSArecovery/compensation reference point
is_system_op_startednested system op opensame commit/abort bookkeeping
time_trackPERF_UTIME_TRACKERperformance stats

The two structs mirror each other on the recovery block (leaf_addr, rv_keyval_data*, rv_redo_data*, is_system_op_started); btree_insert_helper_to_delete_helper and its inverse copy these across when one operation switches modes (e.g. an MVCC update that deletes then inserts).

Invariant — is_system_op_started must be balanced on every exit path. If the code log_sysop_starts (as §2.9 does) and sets this flag, every return/goto error must log_sysop_commit or log_sysop_abort. A leaked system op holds the log-append fixed and stalls recovery; the flag lets error paths ask “did I open one?” before unwinding.

2.9 First-time overflow-key file bootstrap

Section titled “2.9 First-time overflow-key file bootstrap”

The overflow-key file is created lazily, the first time a key exceeds BTREE_MAX_KEYLEN_INPAGE (= DB_PAGESIZE / 8). The insert root prep guards creation with a double-checked pattern: it tests key_len >= BTREE_MAX_KEYLEN_INPAGE && VFID_ISNULL (&btid_int->ovfid) under a read latch; if true it pgbuf_promote_read_latch’s the root to WRITE (re-fixing on ER_PAGE_LATCH_PROMOTE_FAIL), sets insert_helper->is_crt_node_write_latched = true, then re-tests VFID_ISNULL because a concurrent inserter may have created the file in the latch window. Only then does it log_sysop_start and call btree_create_overflow_key_file (aborting the sysop and goto error on failure). The system operation makes the file persist even if the surrounding insert rolls back — it is permanent metadata, not transactional data.

// btree_create_overflow_key_file — src/storage/btree.c (condensed)
VFID_SET_NULL (&btid->ovfid);
des.btree_key_overflow.btid = *btid->sys_btid; /* structure copy */
des.btree_key_overflow.class_oid = btid->topclass_oid;
assert (!OID_ISNULL (&des.btree_key_overflow.class_oid));
error_code = file_create_with_npages (thread_p, FILE_BTREE_OVERFLOW_KEY, 3, &des, &btid->ovfid);
if (error_code != NO_ERROR) { return error_code; }
error_code = heap_get_class_tde_algorithm (thread_p, &btid->topclass_oid, &tde_algo);
if (error_code != NO_ERROR) { VFID_SET_NULL (&btid->ovfid); return error_code; } /* <- roll back VFID */
error_code = file_apply_tde_algorithm (thread_p, &btid->ovfid, tde_algo);
if (error_code != NO_ERROR) { VFID_SET_NULL (&btid->ovfid); return error_code; } /* <- roll back VFID */
return error_code;

Branch-complete: it nulls ovfid up front, creates a 3-page FILE_BTREE_OVERFLOW_KEY file tagged with the index’s BTID and top-class OID, then applies the class’s TDE (transparent data encryption) algorithm so encrypted tables encrypt their overflow keys. Every error path after the file is created re-nulls ovfid — a half-initialized file never leaves a dangling VFID in BTID_INT, keeping the caller’s VFID_ISNULL re-check honest on retry. The successful VFID is later persisted into the root header. The class_oid assert guards against an orphan overflow file with no owning class.

  1. The first object’s OID is the record header — record flags in its slotid nibble (btree_leaf_set_flag), MVCC flags in its volid nibble (btree_record_object_set_mvcc_flags); the low bits stay a valid disk address.
  2. btree_write_record and btree_read_record_without_decompression are a producer/consumer pair joined only by those flag bits; the writer sets each flag adjacent to its field and the reader skips exactly those fields, so flags are never batched and every or_put/or_advance has its own error exit.
  3. *offset / after_key_offset is the load-bearing number — computed once after PTR_ALIGN, then or_seek’d to by btree_record_get_num_oids and later iterators; BTREE_RECORD_OR_BUF_INIT hides the trailing overflow VPID from those loops.
  4. Object counting has three regimes: unique-leaf (one division, modulo-asserted), non-unique-leaf (per-object walk via BTREE_GET_MVCC_INFO_SIZE_FROM_FLAGS), overflow (rec->length / fixed_size); the +1 first object is added explicitly in the leaf cases.
  5. btree_insert_helper / btree_delete_helper are heap-free constructor-initialized scratch structs; their defaults (PGBUF_LATCH_READ, is_first_try = true) encode Ch 3’s optimistic descent and their mirrored recovery block feeds the logging chapters.
  6. The overflow-key file is created lazily and exactly once, behind a double-checked VFID_ISNULL inside a system operation so it survives rollback; every error path in btree_create_overflow_key_file re-nulls ovfid to keep the re-check honest.

This chapter answers: given a key, which leaf owns it, and how do we get there without corrupting the tree under concurrency? The descent is shared by every insert, delete, scan-seek, and unique probe. For latch-coupling theory and the “restart from root” claim see the companion’s ### Latch-coupling on descent; this chapter corrects it: writes split pessimistically on descent (Ch 6) and only restart when a latch promotion fails.

3.1 The descent driver and its three callbacks

Section titled “3.1 The descent driver and its three callbacks”

Every traversal flows through btree_search_key_and_apply_functions — a template where callers plug in three callbacks (fix-root / advance / process-leaf):

// btree_search_key_and_apply_functions -- src/storage/btree.c
start_btree_traversal:
restart = false; is_leaf = false;
root_function (..., &crt_page, &is_leaf, search_key, &stop, &restart, root_args); /* 1. fix root, load BTID_INT 1st try */
if (stop) goto end; if (restart) goto start_btree_traversal; /* stop: NULL/vacuum. restart: promote failed */
while (!is_leaf) /* 2. ADVANCE until leaf reached */
{
advance_function (..., &crt_page, &advance_page, &is_leaf, search_key, &stop, &restart, advance_args);
if (stop) goto end;
if (restart) { if (advance_page) pgbuf_unfix_and_init (thread_p, advance_page); goto start_btree_traversal; }
if (!is_leaf) /* latch-couple: release parent only AFTER child fixed */
{ if (crt_page) pgbuf_unfix (thread_p, crt_page); crt_page = advance_page; advance_page = NULL; }
}
if (key_function) { key_function (...); if (restart) goto start_btree_traversal; } /* 3. operate on leaf */

The callback triples (root / advance / key): read = btree_get_root_with_key / btree_advance_and_find_key / none; insert (Ch 4-6) and delete (Ch 8-9) swap in btree_fix_root_for_insert / _for_delete plus the btree_split_node_and_advance / btree_merge_node_and_advance advances and btree_key_insert_new_object / btree_key_delete_* key functions.

INVARIANT (no page leak across restart): on restart the engine unfixes advance_page before jumping to start_btree_traversal, and the loop top unfixes crt_page; every advance callback also unfixes both crt_page and child_page before setting restart = true. A forgotten unfix pins the buffer-fix counter forever.

Figure 3-1 — the descent skeleton.

stateDiagram-v2
    [*] --> FixRoot : start_btree_traversal
    FixRoot --> End : stop true
    FixRoot --> FixRoot : restart true
    FixRoot --> Advance : root fixed, not leaf
    FixRoot --> ProcessLeaf : root is also leaf
    Advance --> End : stop true
    Advance --> FixRoot : restart true, unfix child
    Advance --> Advance : child fixed, release parent, descend
    Advance --> ProcessLeaf : reached leaf
    ProcessLeaf --> FixRoot : restart true
    ProcessLeaf --> End : done
    End --> [*]

3.2 BTREE_SEARCH_KEY_HELPER — the result carrier

Section titled “3.2 BTREE_SEARCH_KEY_HELPER — the result carrier”

Every page-level search writes its verdict into one tiny struct, passed by pointer down the descent and back:

// struct btree_search_key_helper -- src/storage/btree.c
struct btree_search_key_helper
{
enum fence_key_presence { NO_FENCE_KEY = 0, HAS_FENCE_KEY };
BTREE_SEARCH result; /* Result of key search. */
PGSLOTID slotid; /* Slot ID of found key or slot ID of the biggest key smaller then key (if not found). */
fence_key_presence has_fence_key;
};
FieldRoleWhy it exists
resultA BTREE_SEARCH value (matrix below); the primary verdict.Callers branch on found vs absent.
slotidOn BTREE_KEY_FOUND, the slot holding the key; when absent, the biggest key smaller than the search key; NULL_SLOTID while uninitialized or on DB_UNK.One struct serves both “found, here it is” and “absent, here it belongs”.
has_fence_keyHAS_FENCE_KEY if the last record examined carried BTREE_LEAF_RECORD_FENCE.The landed-on slot is a synthetic boundary, not data — skip in scans (§3.6).

Role matrix for the values btree_search_*_page can set (subset of the 7-member BTREE_SEARCH enum in storage_common.h; BTREE_ERROR_OCCURRED and BTREE_ACTIVE_KEY_FOUND belong to OID/version search in Ch 5/8 and are never emitted here):

ValueSet when / slotid / reaction
BTREE_KEY_FOUNDexact non-fence match; slotid = the match; read/modify
BTREE_KEY_NOTFOUNDDB_UNK compare; slotid = NULL_SLOTID; propagate error
BTREE_KEY_SMALLERkey < all, slot 1; slotid = 1; go to prev sibling
BTREE_KEY_BIGGERkey > all, off right end; slotid = key_cnt + 1; go to next sibling
BTREE_KEY_BETWEENabsent, inside [min,max]; slotid = next-bigger slot; owning leaf, insert here

BTREE_SEARCH_KEY_HELPER_INITIALIZER seeds it to { BTREE_KEY_NOTFOUND, NULL_SLOTID, NO_FENCE_KEY }. btree_advance_and_find_key dispatches by node type (Figure 3-2): the leaf search fills all three fields; the non-leaf search fills only slotid, never the helper.

Figure 3-2 — who fills the helper.

flowchart LR
  A["btree_advance_and_find_key"] -->|leaf| LP["btree_search_leaf_page\nfills result+slotid+has_fence_key"]
  A -->|non-leaf| NP["btree_search_nonleaf_page\nfills only slotid"]
  LP --> H["BTREE_SEARCH_KEY_HELPER"]

All three root callbacks call btree_fix_root_with_info, which pgbuf_fixes the root at latch_mode, reads the header (btree_get_root_header), and — only when btid_int_p != NULL — gleans metadata via btree_glean_root_header_info. BTID_INT (key type, unique flag, overflow-key VFID, topclass OID, …) loads only on the first fix; on restart the callbacks pass btid_int_p = NULL to reuse it:

// btree_fix_root_for_insert -- src/storage/btree.c
if (insert_helper->is_first_try)
*root_page = btree_fix_root_with_info (..., insert_helper->nonleaf_latch_mode, NULL, &root_header, btid_int);
else
{ /* restart: re-fix, reset latch flags, skip all first-try-only work */
*root_page = btree_fix_root_with_info (..., insert_helper->nonleaf_latch_mode, NULL, NULL, NULL);
insert_helper->is_crt_node_write_latched = false; insert_helper->need_update_max_key_len = false;
return NO_ERROR;
}
insert_helper->is_first_try = false; /* <- never do the below twice */

First-try-only branches in btree_fix_root_for_insert (guarded so they never repeat on restart):

  1. purpose == UNDO_PHYSICAL_DELETE / ONLINE_INDEX_UNDO_TRAN_DELETE → fill class OID for unique, compute key_len_in_page, return.
  2. unique index → build btree_unique_stats incr (delete_* vs insert_* by purpose); accumulate into unique_stats_info (multi-row) or logtb_tran_update_unique_stats; error → goto error.
  3. is_null*stop = true, unfix root, return; INSERT_MVCC_DELID / INSERT_MARK_DELETED → return.
  4. else (real insert) → if key_len >= BTREE_MAX_KEYLEN_INPAGE and no overflow-key file exists, promote root to WRITE and create it (§3.4); set key_len_in_page, return.

btree_fix_root_for_delete mirrors this — first search loads BTID_INT, restart re-fixes and returns. Its first-fix if on delete_helper->purpose (prose summary, not a C excerpt): the two BTREE_OP_DELETE_VACUUM_* purposes return NO_ERROR immediately; the five undo purposes (_UNDO_INSERT*, _OBJECT_PHYSICAL_POSTPONED, _ONLINE_INDEX_UNDO_TRAN_INSERT) COPY_OID btid_int->topclass_oid into a NULL unique BTREE_DELETE_CLASS_OID, then return NO_ERROR; else (asserted btree_is_delete_object_purpose) update unique stats via incr.delete_* and set *stop = true if is_null.

Both fix the root with nonleaf_latch_mode, which defaults to PGBUF_LATCH_READ — writes start optimistic, escalating to WRITE only where a node must change.

3.4 Latch-coupling and where READ becomes WRITE

Section titled “3.4 Latch-coupling and where READ becomes WRITE”
  • Read path (btree_advance_and_find_key): root and each child fixed PGBUF_LATCH_READ; the driver releases the parent only after the child is fixed. No promotion.
  • Write paths (split/merge advance): also start READ; a latch is promoted to WRITE only where its node must be modified.

btree_split_node_and_advance (Ch 6) fixes the child WRITE eagerly when it is a leaf’s parent or max-key-len must grow, else keeps nonleaf_latch_mode. When a split is needed but the node is still READ-latched, it promotes in place and, on failure, restarts WRITE-mode:

// btree_split_node_and_advance -- src/storage/btree.c
if (need_split && insert_helper->nonleaf_latch_mode == PGBUF_LATCH_READ && !insert_helper->is_crt_node_write_latched)
{
error_code = pgbuf_promote_read_latch (thread_p, crt_page, PGBUF_PROMOTE_ONLY_READER); /* always ONLY_READER */
if (error_code == ER_PAGE_LATCH_PROMOTE_FAIL)
{ /* could not promote -> unfix both pages, switch to WRITE-mode, restart from root */
insert_helper->nonleaf_latch_mode = PGBUF_LATCH_WRITE; *restart = true;
pgbuf_unfix_and_init (thread_p, child_page); pgbuf_unfix_and_init (thread_p, *crt_page);
insert_helper->is_crt_node_write_latched = false; return NO_ERROR;
}
insert_helper->is_crt_node_write_latched = true; /* promoted in place: keep descending */
}

A later sibling promote targets the child node via pgbuf_promote_read_latch (..., &child_page, PGBUF_PROMOTE_SHARED_READER); both promotes fail-and-restart identically.

INVARIANT (no latch upgrade deadlock): the current-page promote uses PGBUF_PROMOTE_ONLY_READER and the child-page promote PGBUF_PROMOTE_SHARED_READER — both fail rather than wait when another reader holds the page, after which the thread releases all latches and restarts WRITE-mode. Blocking on a WRITE upgrade while holding a READ latch is the textbook latch-upgrade deadlock; fail-and-restart makes it impossible.

This corrects the companion’s “restart from root”: a write does not restart per split — it splits pessimistically on descent (Ch 6); a restart happens only on a promotion failure, after which the WRITE-mode descent cannot loop.

3.5 btree_search_nonleaf_page — binary search and child selection

Section titled “3.5 btree_search_nonleaf_page — binary search and child selection”

Slot 1 is a dummy negative-infinity key whose only payload is a child pointer:

// btree_search_nonleaf_page -- src/storage/btree.c
key_cnt = btree_node_number_of_keys (thread_p, page_ptr);
if (key_cnt <= 0) return ER_FAILED; /* underflow guard */
if (key_cnt == 1) /* only the dummy neg-inf key -> follow its pointer */
{ *slot_id = 1; *child_vpid = non_leaf_rec.pnt; return NO_ERROR; }
left = 2; right = key_cnt; /* <- start at 2: skip dummy slot 1, then narrow as §3.6 */
while (left <= right)
{
middle = CEIL_PTVDIV ((left + right), 2);
c = btree_compare_key (key, &temp_key, btid->key_type, 1, 1, &start_col);
if (c == DB_UNK) return ER_FAILED; /* incomparable -> error */
if (c == 0) { *slot_id = middle; *child_vpid = non_leaf_rec.pnt; return NO_ERROR; } /* exact separator hit */
else if (c < 0) right = middle - 1; else left = middle + 1; /* (+ remember start_col) */
}

Post-loop (left > right): c < 0 → child is left of middle (*slot_id = middle - 1, re-read its pnt); c >= 0 → child is middle (*child_vpid already holds non_leaf_rec.pnt). page_bounds (the update_boundary_* calls) is NULL for the read descent; start_col is a midxkey skip cursor. The non-leaf search emits only slot_id/child_vpid, never the helper (btree_advance_and_find_key lends search_key->slotid as scratch).

3.6 btree_search_leaf_page — found, not-found, and fence-key skipping

Section titled “3.6 btree_search_leaf_page — found, not-found, and fence-key skipping”

At the leaf the post-conditions are richer; fence keys complicate equality:

// btree_search_leaf_page -- src/storage/btree.c
left = 1; right = key_cnt; /* search_key zeroed to NOTFOUND/NULL_SLOTID/NO_FENCE first */
while (left <= right) /* middle = CEIL_PTVDIV((left+right),2); c = btree_compare_key(...) */
{
if (c == DB_UNK) { search_key->result = BTREE_KEY_NOTFOUND; ... return error; }
if (c == DB_EQ)
{
if (btree_leaf_is_flaged (&rec, BTREE_LEAF_RECORD_FENCE)) /* matched a FENCE record */
{
search_key->has_fence_key = HAS_FENCE_KEY;
if (middle == 1) c = DB_GT; /* lower fence -> "go right" */
else if (middle == key_cnt) /* upper fence -> key is past end */
{ search_key->result = BTREE_KEY_BIGGER; search_key->slotid = key_cnt; return NO_ERROR; }
}
else { search_key->result = BTREE_KEY_FOUND; search_key->slotid = middle; return NO_ERROR; }
}
if (c < 0) right = middle - 1; else left = middle + 1; /* (narrow + remember start_col, as §3.5) */
}

A fence key is a synthetic copy of a neighbor’s boundary that bounds prefix compression (companion’s ### Node layout); it must never be reported as a real match — hence the in-loop branches turn a lower-fence hit into “go right” (c = DB_GT) and an upper-fence hit into BTREE_KEY_BIGGER.

Post-loop (key absent) it flags has_fence_key if the last record was a fence, then sets the verdict per the §3.2 matrix: c < 0SMALLER if middle == 1 else BETWEEN, slotid = middle; c >= 0BIGGER if middle == key_cnt else BETWEEN, slotid = middle + 1. Per the header comment the key can equal the upper fence only on a scan resuming after an unfix/refix (Ch 7); an #ifndef NDEBUG block cross-checks via btree_leaf_is_key_between_min_max.

3.7 btree_locate_key — the single-key descent entry

Section titled “3.7 btree_locate_key — the single-key descent entry”

This one-shot entry calls the engine with the default root function and btree_advance_and_find_key, passing reuse_btid_int = true so BTID_INT is not re-gleaned. Two branches: an error early-return (*leaf_page_out = NULL); and a success path setting *found_p = (search_key.result == BTREE_KEY_FOUND) and *slot_id = search_key.slotid, returning the leaf still latched (READ) — the building block for unique probes (Ch 5) and delete (Ch 8). *found_p is true only for BTREE_KEY_FOUND; the others give found = false with slot_id at the insertion point.

3.8 btree_compare_key — the comparison primitive

Section titled “3.8 btree_compare_key — the comparison primitive”

Every binary search bottoms out here (returning DB_LT/DB_EQ/DB_GT/DB_UNK):

// btree_compare_key -- src/storage/btree.c
if (DB_IS_NULL (key1))
{ if (DB_IS_NULL (key2)) { assert (false); return DB_UNK; } return DB_LT; } /* NULL sorts low; both-NULL impossible */
if (DB_IS_NULL (key2)) return DB_GT;
if (dom_type == DB_TYPE_MIDXKEY)
{
c = pr_midxkey_compare (..., start_colp, &dummy_diff_column, dom_is_desc, NULL);
if (dom_is_desc[0]) c = ((c == DB_GT) ? DB_LT : (c == DB_LT) ? DB_GT : c); /* DESC flips leading col */
}

Beyond the NULL and DESC handling shown, incomparable types or a collation mismatch yield DB_UNK — a hard error for every search. start_colp is the midxkey column-skip cursor (unused for single-column keys).

  1. One engine, three callbacksbtree_search_key_and_apply_functions runs fix-root → advance → process-leaf; operations differ only in callbacks (read: btree_get_root_with_key + btree_advance_and_find_key; write: split/merge advance).
  2. BTREE_SEARCH_KEY_HELPER is the universal verdictresult, slotid (match or insertion point), has_fence_key.
  3. BTID_INT loads exactly once on the first root fix; restarts re-fix with btid_int_p = NULL, first-try-only blocks guarded by is_first_try.
  4. Descent is latch-coupled, optimistically READ — child fixed before parent released, escalating only where a node must change.
  5. Restart is the exception — the write path splits pessimistically on descent (Ch 6); a restart happens only on ER_PAGE_LATCH_PROMOTE_FAIL, after which it runs WRITE-mode, preventing the latch-upgrade deadlock.
  6. Non-leaf vs. leaf search differ in outputbtree_search_nonleaf_page skips dummy slot 1 and emits (slot_id, child_vpid); btree_search_leaf_page emits a full helper and special-cases fence keys.
  7. btree_compare_key is deterministic — NULL sorts low, DESC flips the leading column, incomparable types return DB_UNK (a hard error).

This chapter answers one question: when the descent reaches a leaf and the key is not yet present, how is a fresh (key, first-OID) record built and spliced into the leaf, and how are the header counters and recovery records updated? We trace btree_insert_internal, how btree_split_node_and_advance hands control to the leaf, the dispatch in btree_key_insert_new_object, and then btree_key_insert_new_key in full. OID-append to an existing key is Chapter 5; the split branch is Chapter 6. For record layout, latch-coupling descent, and the unique-statistics model in the abstract, see the companion cubrid-btree.md (“Record encoding”, “Latch coupling”); we do not re-derive them. Structs were dissected in Chapter 1; here we touch only the helper fields this path mutates.

4.1 The orchestration entry point: btree_insert_internal

Section titled “4.1 The orchestration entry point: btree_insert_internal”

All three public insert entry points funnel into btree_insert_internal, a thin setup-and-dispatch layer that never touches a page. It builds one stack-local BTREE_INSERT_HELPER, picks the per-key function by purpose, then delegates the descent to btree_search_key_and_apply_functions. The purpose-to-function dispatch is the first branch point:

// btree_insert_internal -- src/storage/btree.c
switch (purpose) {
case BTREE_OP_INSERT_UNDO_PHYSICAL_DELETE:
LSA_COPY (&insert_helper.compensate_undo_nxlsa, undo_nxlsa); [[fallthrough]];
case BTREE_OP_INSERT_NEW_OBJECT:
key_insert_func = btree_key_insert_new_object; break; /* <- our path */
case BTREE_OP_INSERT_MVCC_DELID:
case BTREE_OP_INSERT_MARK_DELETED:
key_insert_func = btree_key_find_and_insert_delete_mvccid; break; /* <- logical delete */
default: assert (false); return ER_FAILED;
}

For a brand-new key the purpose is BTREE_OP_INSERT_NEW_OBJECT. The NULL-key flag is computed before the switch and is decisive — a NULL key is never stored: insert_helper.is_null = key == NULL || DB_IS_NULL (key) || btree_multicol_key_is_null (key);. The work is one call threading three callbacks (root-fix, descend/split, leaf-process) through the generic walker btree_search_key_and_apply_functions. After it returns, three branches of post-processing run: (1) saved-lock release (SERVER_MODE) — any predecessor OID stashed in insert_helper.saved_locked_oid by unique-key locking is unlocked now, success or not; (2) error short-circuit, else bump PSTAT_BT_NUM_INSERTS and set *unique; (3) multi-update stats correction — for MULTI_ROW_UPDATE on a unique index with HA disabled, if the key was not added/deleted (!is_unique_key_added_or_deleted) the speculative descent delta is reverted (revert delete_key_and_row(), re-apply add_row()). That is the only place the speculative key-count delta is undone.

flowchart TD
  A["btree_insert public entry"] --> B["btree_insert_internal\nbuild BTREE_INSERT_HELPER"]
  B --> C{"purpose?"}
  C -->|INSERT_NEW_OBJECT / UNDO_PHYSICAL_DELETE| D["key_insert_func =\nbtree_key_insert_new_object"]
  C -->|MVCC_DELID / MARK_DELETED| E["key_insert_func =\nbtree_key_find_and_insert_delete_mvccid"]
  D --> F["btree_search_key_and_apply_functions"]
  E --> F
  F --> G["btree_fix_root_for_insert\nunique stats, NULL gate"]
  G --> H["btree_split_node_and_advance\ndescent + split"]
  H --> I["leaf reached -> key_insert_func"]
  I --> J["post: release lock, stats correction"]

Figure 4-1. btree_insert_internal orchestration and callback wiring.

4.2 The unique-check gate and the NULL-key early exit

Section titled “4.2 The unique-check gate and the NULL-key early exit”

btree_fix_root_for_insert runs once per descent (guarded by is_first_try). Beyond fixing the root it does unique-statistics accounting and computes key_len_in_page (the packed key length the Chapter 6 split math needs). Two gates here shape the new-key path.

The unique-stats gate keys on BTREE_IS_UNIQUE (btid_int->unique_pk) alone. A nearby macro looks like it should be this gate but is not:

// BTREE_NEED_UNIQUE_CHECK -- src/storage/btree.h (defined but never referenced)
#define BTREE_NEED_UNIQUE_CHECK(thread_p, op) \
(logtb_is_current_active (thread_p) \
&& (op == SINGLE_ROW_INSERT || op == MULTI_ROW_INSERT || op == SINGLE_ROW_UPDATE))

Its intended semantics — an active transaction (so rollback/recovery skip live checking) plus an insert-shaped op type — record the design intent, but no caller in src/ references it; it is dead code. The live stats bump is gated only on BTREE_IS_UNIQUE, with the increment chosen by NULL-ness:

// btree_fix_root_for_insert -- src/storage/btree.c
if (BTREE_IS_UNIQUE (btid_int->unique_pk))
{
btree_unique_stats incr; /* ... MVCC-delete branch omitted ... */
if (insert_helper->is_null)
incr.insert_null_and_row (); /* <- NULL: a null + a row, never a key */
else
incr.insert_key_and_row (); /* <- speculative: assumes a new key WILL be added */
if (BTREE_IS_MULTI_ROW_OP (insert_helper->op_type))
(*insert_helper->unique_stats_info) += incr; /* multi-row sink */
else if (!btree_is_online_index_loading (insert_helper->purpose))
logtb_tran_update_unique_stats (thread_p, *btid, incr, true); /* single-row tran stats */
}

Invariant — speculative key count is corrected, never left wrong. Violation: every duplicate-key insert reports a phantom key, corrupting optimizer cardinality. insert_key_and_row() is applied before the descent knows the key exists; if the leaf already holds it (Chapter 5) the 4.1 multi-row correction reverts it, and single-row rollback reverses the logtb_tran_update_unique_stats delta.

The NULL-key early exit — a NULL key never reaches a leaf:

// btree_fix_root_for_insert -- src/storage/btree.c
if (insert_helper->is_null)
{
*stop = true; /* <- halt walker at root */
pgbuf_unfix_and_init (thread_p, *root_page);
return NO_ERROR;
}

Stats were already counted, so the NULL is accounted but not stored. *stop makes the walker return without the leaf callback, so btree_key_insert_new_key is reached only for non-NULL keys — its own assert (key != NULL && ...) encodes that. Finally the function sets insert_helper->key_len_in_page = BTREE_GET_KEY_LEN_IN_PAGE (btree_get_disk_size_of_key (key)), possibly creating an overflow-key file when key_len >= BTREE_MAX_KEYLEN_INPAGE.

4.3 Reaching the leaf: btree_split_node_and_advance

Section titled “4.3 Reaching the leaf: btree_split_node_and_advance”

The recursive driver is dissected fully in Chapter 6. For this chapter only its leaf-reaching tail matters: after ensuring room for a child split (Chapter 6) it tests the node type:

// btree_split_node_and_advance -- src/storage/btree.c
if (node_type == BTREE_LEAF_NODE)
{
assert (pgbuf_get_latch_mode (*crt_page) == PGBUF_LATCH_WRITE); /* <- leaves always WRITE-latched */
/* No other child. Notify called this is a leaf node and return the slot of new key. */
error_code = btree_search_leaf_page (thread_p, btid_int, *crt_page, key, search_key);
if (error_code != NO_ERROR)
{ ASSERT_ERROR (); goto error; }
*is_leaf = true; /* <- walker now calls key_insert_func instead of recursing */
return NO_ERROR;
}

When *is_leaf is set true, the walker stops recursing and invokes the key function on *crt_page. The crucial outputs of btree_search_leaf_page are search_key->slotid (the slot where the key belongs, by binary search) and search_key->result (BTREE_KEY_FOUND if present, else BTREE_KEY_NOTFOUND / BTREE_KEY_BETWEEN) — the pivot the next function branches on. The leaf is always WRITE-latched on arrival, so the leaf path never promotes.

4.4 Leaf dispatch: btree_key_insert_new_object

Section titled “4.4 Leaf dispatch: btree_key_insert_new_object”

This leaf callback sets up recovery scaffolding then splits between new key (this chapter) and append (Chapter 5). First it stamps insert_helper->leaf_addr (offset = search_key->slotid, pgptr = *leaf_page, vfid) and prepares the undo keyval blob — the logical undo image (key + class OID + OID + MVCC info):

// btree_key_insert_new_object -- src/storage/btree.c
if (insert_helper->purpose == BTREE_OP_INSERT_NEW_OBJECT /* || ONLINE_INDEX_TRAN_INSERT ... */) {
insert_helper->rcvindex =
BTREE_MVCC_INFO_IS_INSID_NOT_ALL_VISIBLE (BTREE_INSERT_MVCC_INFO (insert_helper))
? RVBT_MVCC_INSERT_OBJECT : RVBT_NON_MVCC_INSERT_OBJECT; /* <- MVCC vs non-MVCC */
error_code = btree_rv_save_keyval_for_undo (btid_int, key, ..., &insert_helper->rv_keyval_data, ...);
}
else /* BTREE_OP_INSERT_UNDO_PHYSICAL_DELETE */
insert_helper->rcvindex = RVBT_RECORD_MODIFY_COMPENSATE; /* <- compensation, no undo keyval */

Then the central branch — result != BTREE_KEY_FOUND is the new-key path of this chapter; the else falls through to the append paths of Chapter 5:

// btree_key_insert_new_object -- src/storage/btree.c
if (search_key->result != BTREE_KEY_FOUND)
{ /* Key doesn't exist. Insert new key. */
error_code = btree_key_insert_new_key (thread_p, btid_int, key, *leaf_page, insert_helper, search_key);
/* ... free undo blob if reallocated, track time ... */ return NO_ERROR;
}
/* Key was found. Append new object to existing key. */
if (BTREE_IS_UNIQUE (btid_int->unique_pk) && insert_helper->purpose == BTREE_OP_INSERT_NEW_OBJECT)
{
error_code = btree_key_lock_and_append_object_unique (..., restart, ...); /* <- Chapter 5 */
if (error_code != NO_ERROR) { ASSERT_ERROR (); goto error; }
if (*restart == true) { goto exit; } /* <- re-descend: leaf changed under the lock wait */
}
else
{ /* btree_key_append_object_non_unique (...) -- Chapter 5 */ }

The found-key unique path may set *restart and goto exit to re-walk from the root: if btree_key_lock_and_append_object_unique waited on a lock and the leaf was reorganized meanwhile, it asks the walker to descend again — a fourth way to leave the dispatch, with mechanics deferred to Chapter 5. The append branches return through a shared exit:/error: epilogue that frees a heap-reallocated undo buffer and records the perf timer; the new-key branch does the equivalent cleanup inline and returns directly without reaching exit:. Only the result != BTREE_KEY_FOUND branch concerns us.

4.5 Building and splicing the cell: btree_key_insert_new_key

Section titled “4.5 Building and splicing the cell: btree_key_insert_new_key”

The heart of the chapter: build a leaf record holding the key and its single first OID, insert at search_key->slotid, update header counters, emit one undoredo log record.

4.5.1 — Mark added. insert_helper->is_unique_key_added_or_deleted = true; confirms the speculative +key from 4.2 was right (reached only when the key was not found), so no later correction fires.

4.5.2 — Overflow-key vs in-page key (three sub-branches).

// btree_key_insert_new_key -- src/storage/btree.c
key_len = btree_get_disk_size_of_key (key);
if (key_len >= BTREE_MAX_KEYLEN_INPAGE)
{
key_type = BTREE_OVERFLOW_KEY;
log_sysop_start (thread_p); /* <- (a) overflow key needs a nested system op */
insert_helper->is_system_op_started = true;
}
else
{
int diff_column = btree_node_get_common_prefix (thread_p, btid_int, leaf_page);
if (diff_column > 0) /* <- (b) prefix compression */
{ new_key = &local_key; pr_clone_value (key, new_key); pr_midxkey_remove_prefix (new_key, diff_column); }
else if (diff_column < 0)
{ ASSERT_ERROR (); return diff_column; } /* <- (c) error: bad prefix, early return before any sysop */
}

(a) oversized -> overflow key under a system op so the overflow-page allocation commits/aborts atomically; (b) common prefix > 0 -> clone into local_key and strip prefix columns; (c) diff_column < 0 -> propagate error (returns before any sysop, so no abort needed).

4.5.3 — Encode via btree_write_record into a REC_HOME descriptor:

// btree_key_insert_new_key -- src/storage/btree.c
record.type = REC_HOME; record.data = PTR_ALIGN (data_buffer, MAX_ALIGNMENT); record.area_size = DB_PAGESIZE;
error_code = btree_write_record (thread_p, btid_int, NULL /* no non-leaf node_rec */, new_key, BTREE_LEAF_NODE,
key_type, key_len, false, BTREE_INSERT_CLASS_OID (insert_helper),
BTREE_INSERT_OID (insert_helper), BTREE_INSERT_MVCC_INFO (insert_helper), &record);
if (new_key == &local_key) pr_clear_value (&local_key); /* <- free cloned key on BOTH paths, before err check */
if (error_code != NO_ERROR) { ASSERT_ERROR (); goto error; }

For a leaf this writes the first OID, optionally a subclass OID (unique index + class differs from topclass, sets BTREE_LEAF_RECORD_CLASS_OID), MVCC info, then the packed key body (or a VPID to the overflow page). local_key is cleared before the error check, so the clone never leaks.

4.5.4 — Splice into the slot: spage_insert_at.

// btree_key_insert_new_key -- src/storage/btree.c
node_header = btree_get_node_header (thread_p, leaf_page); /* ... NULL-check -> goto error ... */
FI_TEST (thread_p, FI_TEST_BTREE_MANAGER_RANDOM_EXIT, 0);
/* Nothing should fail after spage_insert_at! */
if (spage_insert_at (thread_p, leaf_page, search_key->slotid, &record) != SP_SUCCESS)
{ assert_release (false); error_code = ER_FAILED; goto error; }

Invariant — spage_insert_at is the point of no return. Violation: a half-applied insert with no clean rollback, since redo is emitted right after. The Chapter 6 split pass already reserved btree_get_max_new_data_size bytes, so SP_DOESNT_FIT cannot occur; the only remaining failure is a logic bug (assert_release).

Inside, spage_insert_at validates the slot id, then spage_find_empty_slot_at branches: at num_slots it appends via spage_add_new_slot; between existing keys it calls spage_take_slot_in_use, which slides the slot-directory entries from slot_id up by one to free it. Either branch bumps the page-header counters and spage_insert_data copies the bytes (Chapter 1). Inserting at search_key->slotid preserves sorted order.

4.5.5 — Update the node header: max_key_len and split_info.

// btree_key_insert_new_key -- src/storage/btree.c
key_cnt = btree_node_number_of_keys (thread_p, leaf_page);
key_len = BTREE_GET_KEY_LEN_IN_PAGE (key_len);
if (key_len > node_header->max_key_len)
{ update_max_key_length = true; node_header->max_key_len = key_len; } /* <- redo must know too */
btree_split_next_pivot (&node_header->split_info, (float) search_key->slotid / key_cnt, key_cnt);

Key count is not written here — it is derived from the slot count, which spage_insert_at already incremented. What this function maintains explicitly:

  • max_key_len — grown only if the new key is longer. The local boolean update_max_key_length tells the redo encoding (4.5.6) to pack the new key length so recovery rebuilds the header identically. This is leaf growth; the ancestor chain was bumped during descent (Chapter 6).
  • split_info.pivotbtree_split_next_pivot folds slotid / key_cnt into a running average so a future split favours where inserts land (Chapter 6).

4.5.6 — Emit the recovery record into a local buffer (the helper’s smaller redo buffer may not hold a full record plus debug info):

// btree_key_insert_new_key -- src/storage/btree.c
LOG_RV_RECORD_SET_MODIFY_MODE (&insert_helper->leaf_addr, LOG_RV_RECORD_INSERT); /* <- redo = "insert a slot" */
if (update_max_key_length)
{
rv_redo_data_ptr = or_pack_int (rv_redo_data_ptr, key_len); /* <- replay the header growth */
BTREE_RV_SET_UPDATE_MAX_KEY_LEN (&insert_helper->leaf_addr);
}
memcpy (rv_redo_data_ptr, record.data, record.length); rv_redo_data_ptr += record.length;
btree_rv_log_insert_object (thread_p, *insert_helper, insert_helper->leaf_addr, 0, rv_redo_data_length, NULL, rv_redo_data);

Redo data is an optional packed key_len (only when the header grew, gated by the BTREE_RV_SET_UPDATE_MAX_KEY_LEN address flag) followed by the raw record; the matching undo is the logical keyval blob from 4.4. btree_rv_log_insert_object chooses the log shape by purpose:

// btree_rv_log_insert_object -- src/storage/btree.c
if (insert_helper.is_system_op_started)
log_append_undoredo_data (thread_p, RVBT_RECORD_MODIFY_UNDOREDO, &addr, undo_length, redo_length, undo_data, redo_data);
else switch (insert_helper.purpose)
{
case BTREE_OP_INSERT_NEW_OBJECT: /* undo logical, redo physical */
log_append_undoredo_data (thread_p, insert_helper.rcvindex, &addr, insert_helper.rv_keyval_data_length,
redo_length, insert_helper.rv_keyval_data, redo_data);
break; /* ... online-index and compensate cases ... */
}

Invariant — undo is logical, redo is physical, for the normal new-key case. Violation: undo corrupts a reshaped page or leaks the overflow allocation. Rollback re-derives and physically removes the object rather than restoring raw bytes a concurrent thread may have reshaped. The exception: an overflow-key sysop (is_system_op_started) logs physical RVBT_RECORD_MODIFY_UNDOREDO bracketed by the sysop.

4.5.7 — Close the system op and finish.

// btree_key_insert_new_key -- src/storage/btree.c
if (insert_helper->is_system_op_started)
btree_insert_sysop_end (thread_p, insert_helper);
pgbuf_set_dirty (thread_p, leaf_page, DONT_FREE); /* <- keep latch, flag for flush */
return NO_ERROR;
error:
if (insert_helper->is_system_op_started)
{ log_sysop_abort (thread_p); insert_helper->is_system_op_started = false; } /* <- roll back overflow alloc */
return error_code;

The error: label is the single cleanup point; its only real work is aborting a started sysop (overflow-key case). The point-of-no-return invariant guarantees no slotted-page change reachable from error: needs undoing.

flowchart TD
  A["btree_key_insert_new_key"] --> B["is_unique_key_added_or_deleted = true"]
  B --> C{"key_len >= MAX_KEYLEN_INPAGE?"}
  C -->|yes| D["key_type = OVERFLOW\nlog_sysop_start"]
  C -->|no| E{"common prefix?"}
  E -->|diff_column > 0| F["clone + strip prefix into local_key"]
  E -->|diff_column < 0| G["return error"]
  E -->|0| H["use key as-is"]
  D --> I["btree_write_record"]
  F --> I
  H --> I
  I -->|error| Z["goto error"]
  I -->|ok| J["spage_insert_at at slotid"]
  J -->|not SP_SUCCESS| Z
  J -->|SP_SUCCESS| K["update max_key_len if grown\nbtree_split_next_pivot"]
  K --> L["pack redo: optional key_len + record"]
  L --> M["btree_rv_log_insert_object\nundo logical / redo physical"]
  M --> N{"sysop started?"}
  N -->|yes| O["btree_insert_sysop_end"]
  N -->|no| P["pgbuf_set_dirty; return NO_ERROR"]
  O --> P
  Z --> Q{"sysop started?"}
  Q -->|yes| R["log_sysop_abort"]
  Q -->|no| S["return error_code"]
  R --> S

Figure 4-2. btree_key_insert_new_key branch-complete control flow.

  1. btree_insert_internal is pure setup: build the helper, select btree_key_insert_new_object, compute is_null, delegate to the walker; its only post-work is saved-lock release and the no-op-multi-update correction.
  2. Unique stats are bumped optimistically in btree_fix_root_for_insert (insert_key_and_row) before the leaf knows if the key exists; truth is restored by the 4.1 correction or undo logs. NULL keys are counted (insert_null_and_row) but stop at root via *stop = true.
  3. btree_split_node_and_advance reaches the leaf by setting *is_leaf = true after btree_search_leaf_page; the leaf is always WRITE-latched, and search_key->result is the branch pivot.
  4. btree_key_insert_new_object has four exits: new-key insert, unique append (which may goto exit on *restart to re-descend, Chapter 5), non-unique append, and the error path.
  5. btree_key_insert_new_key encodes the cell via btree_write_record through three sub-branches: oversized -> overflow key under a sysop, prefix-compressible -> clone-and-strip, plain -> as-is.
  6. spage_insert_at is the point of no return (append vs spage_take_slot_in_use); the Chapter 6 split pass guarantees room, so SP_DOESNT_FIT would be a bug. Header maintenance is max_key_len (grown only if longer, echoed into redo via update_max_key_length) and split_info.pivot; key count is implicit in the slot count.
  7. Recovery for the normal new-key case is undo logical, redo physical; an overflow-key sysop instead forces a physical RVBT_RECORD_MODIFY_UNDOREDO pair bracketed by the sysop.

Chapter 5: Appending an OID and Enforcing Uniqueness

Section titled “Chapter 5: Appending an OID and Enforcing Uniqueness”

Chapter 4 covered the first object for a key; this chapter handles the case where the key already exists, so a second OID must be threaded onto it. For a non-unique index that is a pure append; for a unique index it is a locking protocol that may end in ER_BTREE_UNIQUE_FAILED. Three sub-problems interleave: where the new OID goes (inline until a cap, then an overflow chain off LEAF_REC::ovfl); whether a unique append is legal (lock the current visible object, re-check under it); and what “visible” means (purpose-dependent). For the slotted-page layout, the key || OID encoding, and overflow files, see cubrid-btree.md §“Node layout” and §“Unique-key handling”.

The leaf-key dispatcher routes on BTREE_IS_UNIQUE (btid_int->unique_pk): the non-unique path appends with no lock and no visibility check; the unique path locks the existing visible object, decides whether the insert violates uniqueness, then appends. Both end at btree_key_append_object_into_ovf when the inline list is full.

5.2 btree_key_append_object_non_unique — the bare append

Section titled “5.2 btree_key_append_object_non_unique — the bare append”
// btree_key_append_object_non_unique -- src/storage/btree.c
n_objects = btree_record_get_num_oids (..., leaf_record, offset_after_key, BTREE_LEAF_NODE);
if (n_objects < BTREE_MAX_OIDCOUNT_IN_LEAF_RECORD (btid_int)) /* <- branch A: room inline */
{
btree_record_append_object (..., leaf_record, BTREE_LEAF_NODE, btree_obj, NULL, &..rv_redo_data_ptr);
spage_update (...); return NO_ERROR; /* writes at tail; logical-undo + physical-redo log */
}
BTREE_MVCC_INFO_SET_FIXED_SIZE (&btree_obj->mvcc_info); /* <- branch B: overflow objects are fixed size */
error_code = btree_key_append_object_into_ovf (..., leaf_record, leaf_info, insert_helper, btree_obj);

The cap BTREE_MAX_OIDCOUNT_IN_LEAF_RECORD (btree_load.h) is BTREE_MAX_OIDLEN_INPAGE / BTREE_OBJECT_FIXED_SIZE. Branch A writes at the record tail (descent already guaranteed leaf room, so no split). Branch B forces fixed size — every overflow object carries both MVCCID slots so the page is binary-searchable (§5.9).

Invariant: an inline OID list never exceeds BTREE_MAX_OIDCOUNT_IN_LEAF_RECORD — enforced by the n_objects < n_objects_limit gate (and a symmetric assert (==) in §5.7). If violated, the record grows unbounded and breaks the split-point arithmetic that assumes a bounded record.

5.3 btree_append_oid and btree_record_append_object

Section titled “5.3 btree_append_oid and btree_record_append_object”

btree_append_oid is the mechanical writer (also used by the loader): a bare 8-byte OID (OR_PUT_OID at rec->data + rec->length, then rec->length += OR_OID_SIZE), no MVCC/class/recovery data. The richer btree_record_append_object packs the full object; when the leaf record already carries BTREE_LEAF_RECORD_OVERFLOW_OIDS it first steps append_at back by DISK_VPID_ALIGNED_SIZE, saves the trailing VPID, packs the object, then re-appends the VPID:

// btree_record_append_object -- src/storage/btree.c
if (node_type == BTREE_LEAF_NODE && record->length > 0
&& btree_leaf_is_flaged (record, BTREE_LEAF_RECORD_OVERFLOW_OIDS))
{ append_at -= DISK_VPID_ALIGNED_SIZE; OR_GET_VPID (append_at, &ovf_vpid); } /* <- preserve ovfl link */

So a new inline object is spliced in front of the trailing LEAF_REC::ovfl link, keeping the link at a fixed position (the tail).

btree_key_append_object_into_ovf reuses a partially-full chain page or starts a new one:

// btree_key_append_object_into_ovf -- src/storage/btree.c
btree_find_free_overflow_oids_page (..., &leaf_record_info->ovfl, &overflow_page);
if (overflow_page == NULL) /* <- no free page in chain */
error_code = btree_key_append_object_as_new_overflow (...); /* allocate + prepend new page */
else /* <- a chain page has room */
{ btree_key_append_object_to_overflow (..., overflow_page, append_object, insert_helper);
pgbuf_unfix_and_init (thread_p, overflow_page); }

The new-page branch calls btree_start_overflow_page to allocate and seed the page, then btree_leaf_record_change_overflow_link to splice it onto the leaf link and set BTREE_LEAF_RECORD_OVERFLOW_OIDS:

// btree_start_overflow_page -- src/storage/btree.c
*new_page_ptr = btree_get_new_page (thread_p, btid_int, new_vpid, near_vpid);
VPID_COPY (&ovf_header_info.next_vpid, first_overflow_vpid); /* <- prepend: new page -> old head */
btree_init_overflow_header (...); spage_insert_at (..., *new_page_ptr, 1, &rec); /* one object, slot 1 */
// ... redo-only log (RVBT_RECORD_MODIFY_NO_UNDO); undo handled logically by caller's sysop ...

It prepends (leaf -> new -> old head -> ...): the new page’s next_vpid points to the current head before the leaf link is rewritten.

Invariant: the overflow link and the BTREE_LEAF_RECORD_OVERFLOW_OIDS flag are set together, under a system operation. btree_key_append_object_as_new_overflow brackets allocation + link rewrite in log_sysop_start / btree_insert_sysop_end, so a crash between them is undone logically. If flag and link disagreed, btree_record_append_object would read a garbage VPID at the tail.

5.5 The unique protocol — btree_key_lock_and_append_object_unique

Section titled “5.5 The unique protocol — btree_key_lock_and_append_object_unique”

A unique index cannot just append: under MVCC a “duplicate” key may have its visible version already logically deleted, in which case a new version is legal. The protocol locks the current first object, then re-decides. It saves the leaf LSA, calls btree_key_find_and_lock_unique (§5.6) with the prior saved_locked_oid copied into find_unique_helper.locked_oid at X_LOCK, then walks these branches:

flowchart TB
  FAL["find_and_lock_unique"] --> E1{"error / *restart?"}
  E1 -- yes --> RET["return / restart"]
  E1 -- no --> NF{"result != KEY_FOUND\nor SERVER LSA split?"}
  NF -- yes --> NEWK["restart, or insert_new_key\nkey vacuumed away"]
  NF -- no --> FO{"found_object and\nINSERT_NEW_OBJECT?"}
  FO -- no --> AP["append_object_unique"]
  FO -- yes --> MU{"!multi_update or\nnum_visible > 1?"}
  MU -- yes --> UV["unique violation"]
  MU -- no --> HA{"is_ha_enabled?"}
  HA -- yes --> UV
  HA -- no --> AP

Figure 5-1 — Branch-complete flow of btree_key_lock_and_append_object_unique (the !found_object RR-isolation branch also reaches unique violation).

The branch worth quoting is the violation decision:

// btree_key_lock_and_append_object_unique -- src/storage/btree.c
if (find_unique_helper.found_object && insert_helper->purpose == BTREE_OP_INSERT_NEW_OBJECT) {
if (insert_helper->is_unique_multi_update)
// ... btree_get_num_visible_from_leaf_and_ovf with dirty snapshot -> num_visible ...
if (!insert_helper->is_unique_multi_update || num_visible > 1) { /* <- real violation */
if (prm_get_bool_value (PRM_ID_UNIQUE_ERROR_KEY_VALUE)) return ER_UNIQUE_VIOLATION_WITHKEY;
BTREE_SET_UNIQUE_VIOLATION_ERROR (...); return ER_BTREE_UNIQUE_FAILED; }
else if (insert_helper->is_ha_enabled) return ER_REPL_MULTI_UPDATE_UNIQUE_VIOLATION; /* HA strictness */
insert_helper->is_unique_key_added_or_deleted = false; /* <- multi-update allowed */
OID_SET_NULL (&insert_helper->saved_locked_oid); /* keep lock on relocated object */
} else
insert_helper->is_unique_key_added_or_deleted = true; /* all deleted -> a new key */

A live visible object plus a normal insert is a violationER_BTREE_UNIQUE_FAILED, or ER_UNIQUE_VIOLATION_WITHKEY when PRM_ID_UNIQUE_ERROR_KEY_VALUE wants the key echoed. The multi-update exception tolerates a transient second version only when num_visible <= 1 under a dirty snapshot; under HA even that becomes ER_REPL_MULTI_UPDATE_UNIQUE_VIOLATION. The RR branch (a !found_object whose repeatable-read snapshot still counts a visible object) raises the same error pair. When cleared, the path saves two objects for undo — but only for an MVCC insert (rcvindex == RVBT_MVCC_INSERT_OBJECT) whose key either gained no new visible object (!is_unique_key_added_or_deleted) or whose first object was deleted by this same transaction (its delete MVCCID equals our insert MVCCID). Then btree_rv_save_keyval_for_undo_two_objects packs both versions and rcvindex switches to RVBT_MVCC_INSERT_OBJECT_UNQ, so undo restores the displaced object.

Invariant: at most one visible object per key in a unique index (two transiently during multi-row update, never under HA) — enforced by the num_visible > 1 and is_ha_enabled branches. Violating it lets two committed rows share a unique key.

5.6 btree_key_find_and_lock_unique and the lock loop

Section titled “5.6 btree_key_find_and_lock_unique and the lock loop”

btree_key_find_and_lock_unique dispatches on BTREE_IS_UNIQUE: _of_unique inspects only the first object (btree_leaf_get_first_object); _of_non_unique serves system-catalog indexes declared unique but physically storing an OID list. The hot path:

// btree_key_find_and_lock_unique_of_unique -- src/storage/btree.c
if (search_key->result != BTREE_KEY_FOUND) goto error_or_not_found; /* key gone */
while (true) {
btree_leaf_get_first_object (btid_int, &record, &unique_oid, &unique_class_oid, &mvcc_info);
if (class filter mismatch) goto error_or_not_found; /* partition-scoped lookup */
if (locked_oid set and != unique_oid) /* first object changed */
{ lock_unlock_object_donot_move_to_non2pl (...); OID_SET_NULL (&locked_oid); }
switch (mvcc_satisfies_delete (thread_p, &mvcc_header)) {
case INSERT_IN_PROGRESS: case DELETE_IN_PROGRESS: /* SA: assert_release(false); SERVER: fallthrough */
case CAN_DELETE: /* already locked -> found; else cond-lock; on fail unfix+uncond, *restart/break */
case DELETED: case SELF_DELETED: goto error_or_not_found; /* gone -> caller may insert */
default: assert_release (false); error_code = ER_FAILED; goto error_or_not_found;
} }

_of_non_unique differs materially: the record holds a list, so it scans the whole leaf record then every overflow page rather than stopping at the first object —

// btree_key_find_and_lock_unique_of_non_unique -- src/storage/btree.c
while (true) {
if (start_reading_leaf_record) { /* read leaf record; first object via btree_or_get_object */ }
else if (buf.ptr == buf.endptr) { /* <- exhausted current record */
if (VPID_ISNULL (&next_overflow_vpid)) goto error_or_not_found; /* end of chain: not found */
overflow_page = pgbuf_fix (..., &next_overflow_vpid, OLD_PAGE, PGBUF_LATCH_READ, ...);
if (prev_overflow_page) pgbuf_unfix_and_init (thread_p, prev_overflow_page);
btree_get_next_overflow_vpid (..., &next_overflow_vpid); /* step the chain */
}
btree_or_get_object (&buf, ..., &unique_oid, ...); /* next object */
if (class mismatch) continue; /* <- advance, NOT not-found */
switch (mvcc_satisfies_delete (...)) { /* same 5 cases as _of_unique */ } }

Unlike _of_unique, a class mismatch does continue (advance to the next list object) and a deleted object keeps scanning; not-found is reported only when buf.ptr == buf.endptr and VPID_ISNULL(next_overflow_vpid). Pages are fixed PGBUF_LATCH_READ, with the prev/current pair unfixed as the scan steps.

Invariant: the first object of a unique leaf record is the newest visible version, and it is the one that gets locked. Maintained by §5.5 plus btree_leaf_change_first_object (§5.8); _of_unique depends on it. A stale first version would let an inserter lock the wrong object and miss a real conflict.

BTREE_FIND_UNIQUE_HELPER threads through these functions:

FieldRoleWhy it exists
oidOID of the found object (output)Which object was visible
match_class_oidOnly this class; NULL = anyPartition-scoped lookup; mismatch = not-found
lock_modeLock to take; insert = X_LOCK, FK = S_LOCKSeparates read-only find from insert
snapshotMVCC filter; NULL = noneFK/find set it; locking insert leaves NULL
found_objectTrue iff found and lockedSuccess flag (§5.5)
time_trackPERF_UTIME_TRACKER for PSTAT_BT_FIND_UNIQUE*Per-phase counters
locked_oid (SERVER_MODE)Object currently locked hereSurvives unfix/refix; detects a changed first object
locked_class_oid (SERVER_MODE)Class OID of locked_oidNeeded by lock_unlock_object_*

locked_oid/locked_class_oid are absent in SA_MODE. Insert (unique) uses X_LOCK + NULL snapshot (a live object drives the violation decision); FK/find uses S_LOCK + a snapshot (a visible object exists, no exclusion).

5.7 btree_key_append_object_unique — keeping the newest first

Section titled “5.7 btree_key_append_object_unique — keeping the newest first”

The new object must become the first object, so the current first is relocated to the tail:

// btree_key_append_object_unique -- src/storage/btree.c
if (btree_record_get_num_oids (...) >= BTREE_MAX_OIDCOUNT_IN_LEAF_RECORD (btid_int))
error_code = btree_key_relocate_last_into_ovf (...); /* full: evict LAST inline object first */
BTREE_MVCC_INFO_SET_FIXED_SIZE (&first_object->mvcc_info);
btree_record_append_object (..., first_object, NULL, &..rv_redo_data_ptr); /* append OLD first at tail */
btree_leaf_change_first_object (..., BTREE_INSERT_OID (insert_helper), ...); /* overwrite slot zero */
spage_update (...);
// log_append_undoredo_data with insert_helper->rcvindex (MVCC / NON_MVCC / *_UNQ); logical undo.

When full, the last inline object is evicted to overflow by btree_key_relocate_last_into_ovf under its own sysop; with room, relocation is skipped. The two-step split is deliberate — the logging system cannot mix a sysop with logical undo, so step 1 uses a sysop with physical undo/redo and step 2 uses logical undo.

btree_key_relocate_last_into_ovf opens a log_sysop_start, pushes the last object via btree_key_append_object_into_ovf, removes it from the leaf with btree_record_remove_last_object, and logs RVBT_RECORD_MODIFY_UNDOREDO. Its subtle branch is a re-read:

// btree_key_relocate_last_into_ovf -- src/storage/btree.c
if (VPID_ISNULL (&leaf_record_info->ovfl) && btree_leaf_is_flaged (leaf_record, BTREE_LEAF_RECORD_OVERFLOW_OIDS))
{ btree_read_record (...); btree_record_get_last_object (...); } /* <- first ovf page just added: re-read */

The overflow append may itself have created the first overflow page, so btree_leaf_record_change_overflow_link rewrote the leaf record and the cached offset_to_last_object is stale — recompute it before btree_record_remove_last_object or the wrong bytes get removed.

5.8 btree_leaf_change_first_object — overwriting slot zero

Section titled “5.8 btree_leaf_change_first_object — overwriting slot zero”

This primitive replaces the in-record first object, growing or shrinking the record as the object’s size class changes:

// btree_leaf_change_first_object -- src/storage/btree.c
if (old_rec_flag & BTREE_LEAF_RECORD_OVERFLOW_OIDS) { /* <- overflow -> FORCE fixed size */
old_object_size += 2 * OR_MVCCID_SIZE; new_object_size += 2 * OR_MVCCID_SIZE;
BTREE_MVCC_INFO_SET_FIXED_SIZE (mvcc_info);
if (BTREE_IS_UNIQUE (...) && !new_has_class_oid) /* unique + overflow forces a class OID too */
{ new_rec_flag |= BTREE_LEAF_RECORD_CLASS_OID; new_object_size += OR_OID_SIZE; } }
// else: no overflow -> variable size. Then RECORD_MOVE_DATA(recp, new_object_size, old_object_size).

With no overflow the first object is variable-size; with overflow it is forced fixed-size plus (for unique) a class OID, so btree_leaf_get_first_object and the overflow binary search agree on a uniform width. RECORD_MOVE_DATA slides the key and inline OIDs by (new - old), reported through *key_offset.

Invariant: when a leaf record has an overflow chain, its first object is fixed-width (and, if unique, includes a class OID) — enforced here and by BTREE_MVCC_INFO_SET_FIXED_SIZE in the appenders. If violated, btree_find_oid_from_ovfl’s binary search (dividing by BTREE_OBJECT_FIXED_SIZE) reads mid-object garbage.

5.9 Locating an exact (key, OID) — the find-OID family

Section titled “5.9 Locating an exact (key, OID) — the find-OID family”

Both the unique check and the delete path ask “which page holds this exact (key, OID) under this purpose’s visibility?” btree_find_oid_and_its_page scans the leaf linearly (btree_find_oid_from_leaf, variable-width, skipping the packed key after object 0), then binary-searches each overflow page (btree_find_oid_from_ovfl, fixed-width, OID-sorted):

// btree_find_oid_and_its_page -- src/storage/btree.c
btree_find_oid_from_leaf (..., oid, match_mvccinfo, purpose, offset_to_object, ...);
if (*offset_to_object != NOT_FOUND) { *found_page = leaf_page; return NO_ERROR; } /* in leaf */
if (VPID_ISNULL (&leaf_rec_info->ovfl)) return NO_ERROR; /* no chain: not found */
do { overflow_page = pgbuf_fix (..., PGBUF_LATCH_WRITE, ...); /* walk chain, tracking *prev_page */
btree_find_oid_from_ovfl (..., overflow_page, oid, purpose, match_mvccinfo, offset_to_object, ...);
} while (*offset_to_object == NOT_FOUND && !VPID_ISNULL (&overflow_vpid)); /* step via next_overflow_vpid */
// found: *found_page = overflow_page; else unfix all + not found

Pages are fixed PGBUF_LATCH_WRITE (mutating callers); prev_page is tracked for relinking after removal; all pages are unfixed on the not-found and error paths. The MVCC gate in both finders is btree_find_oid_does_mvcc_info_match, purpose-dependent:

BTREE_OP_PURPOSEMatch rule
DELETE_VACUUM_INSIDinsert MVCCID equals match_mvccinfo->insert_mvccid
DELETE_VACUUM_OBJECT, DELETE_OBJECT_PHYSICAL_POSTPONEDdelete MVCCID equals match_mvccinfo->delete_mvccid (avoids a reused OID)
DELETE_UNDO_INSERT, DELETE_UNDO_INSERT_UNQ_MULTIUPDinsert MVCCID matches the one being rolled back
DELETE_UNDO_INSERT_DELIDdelete MVCCID matches; insert MVCCID cross-checked only if also set
DELETE_OBJECT_PHYSICAL, INSERT_MVCC_DELID, INSERT_MARK_DELETED, defaultobject has no valid delete MVCCID (alive)

The purpose selects which MVCC field decides a match, so one finder serves insert, delete, and vacuum. The default branch (“no valid delete MVCCID”) drives unique-insert and physical-delete, so a reused OID whose old version is deleted-but-unvacuumed is correctly skipped.

  1. Two entry points on BTREE_IS_UNIQUE: btree_key_append_object_non_unique appends blindly; btree_key_lock_and_append_object_unique locks-then-verifies.
  2. Inline until BTREE_MAX_OIDCOUNT_IN_LEAF_RECORD, then overflow via btree_key_append_object_into_ovf — reuse a chain page or prepend a new one (btree_start_overflow_page; flag + link set under a sysop).
  3. Unique = lock first, then decide. _of_unique X-locks the first object; _of_non_unique scans the whole list. A live visible object means ER_BTREE_UNIQUE_FAILED, except a MULTI-ROW-UPDATE tolerates one transient extra — never under HA, never two.
  4. Newest visible stays firstbtree_key_append_object_unique relocates the old first object to the tail (evicting the last inline object to overflow first if full), then btree_leaf_change_first_object.
  5. Overflow forces fixed-width objects (both MVCCID slots, class OID for unique) — the precondition for btree_find_oid_from_ovfl’s binary search.
  6. One finder, many purposesbtree_find_oid_does_mvcc_info_match makes the same (key, OID) match or miss by BTREE_OP_PURPOSE, so insert/delete/undo/vacuum reuse btree_find_oid_and_its_page.
  7. Restart and re-read are first-class: *restart bounces to root, the LSA recheck forces a split restart, and the “first overflow page just created” re-read keeps btree_key_relocate_last_into_ovf honest.

Chapter 6: Node Split and Separator Promotion

Section titled “Chapter 6: Node Split and Separator Promotion”

When the target leaf — or any ancestor on the descent path — has no room, CUBRID grows the tree via pessimistic split-on-descent, making growth safe without ever restarting from the root for capacity reasons. The companion (cubrid-btree.md, §“Descent” and Figures 2–3) covers why CUBRID splits eagerly under a system op; this chapter dissects how, branch by branch. Chapter 3 covered the descent calling the function below; Chapter 4 the leaf insert once room is guaranteed.

6.1 The pessimistic contract: split before you advance

Section titled “6.1 The pessimistic contract: split before you advance”

The descent driver btree_search_key_and_apply_functions calls one advance function per level; for inserts it is btree_split_node_and_advance.

INVARIANT — “the child you advance into always has room.” Before returning a child page, btree_split_node_and_advance verifies the child can absorb the worst-case new entry, splitting it first if not. A chain split (a split forcing its parent to split, which forces its…) can never happen: when control reaches a node, its parent was guaranteed one free slot on the previous iteration. The leaf, visited last, is thus guaranteed room by induction.

It enforces this by measuring, at every level, the largest entry the operation could add against free space:

// btree_split_node_and_advance -- src/storage/btree.c
max_new_data_size = btree_get_max_new_data_size (..., node_type, max_key_len, insert_helper, false /*known*/);
need_split = max_new_data_size > spage_get_free_space_without_saving (thread_p, child_page, NULL);

max_key_len is insert_helper->key_len_in_page when need_update_max_key_len, else node_header->max_key_len. known_to_be_found = false: descent does not yet know whether the key exists, so it always reserves for a brand-new entry — the pessimistic estimate.

6.2 btree_get_max_new_data_size — the worst-case estimator

Section titled “6.2 btree_get_max_new_data_size — the worst-case estimator”

An under-estimate would defeat the §6.1 invariant. The body is a node-type check then a switch (helper->purpose):

BranchReturned sizeWhy
Non-leaf nodeNON_LEAF_ENTRY_MAX_SIZE(key_len) + SPAGE_SLOT_SIZEA child split promotes one separator here; slot directory grows by one.
Leaf, key foundBTREE_OBJECT_FIXED_SIZEAppend an OID / upgrade first object / add an overflow link (Ch 5). No new slot.
Leaf, key not foundLEAF_ENTRY_MAX_SIZE(key_len) + SPAGE_SLOT_SIZEBrand-new key record plus its slot (Ch 4).
INSERT_MVCC_DELID / MARK_DELETEDOR_MVCCID_SIZELogical delete stamps a delete MVCCID into an existing object (Ch 8).
defaultassert_release(false), ER_FAILEDUnhandled purpose — a programming error.

Descent passes known_to_be_found = false for leaves, so the leaf estimate is the full new-key cost — largest of the insert cases.

6.3 btree_node_split_info — the skew accumulator

Section titled “6.3 btree_node_split_info — the skew accumulator”

Every node header carries a BTREE_NODE_SPLIT_INFO that biases consecutive splits toward the hot side — the optimization for monotonic (auto-increment) key streams, where a 50/50 split would leave the left half permanently half-empty.

// btree_node_split_info -- src/storage/storage_common.h
struct btree_node_split_info
{
float pivot; /* pivot = split_slot_id / num_keys */
int index; /* number of key insert after node split */
};
FieldRoleWhy it exists
pivotRunning average in [0,1] of where past splits landed. 0.5 balanced; near 1.0 right-skewed (ascending); near 0.0 left-skewed.Lets btree_split_find_pivot aim the next split at the hot side instead of the geometric middle, so the cold half is not wasted.
indexCount of splits folded into the average (capped at key count); the CMA divisor.Damps the average so one outlier insert cannot swing the pivot wildly.

btree_write_default_split_info seeds a fresh node with pivot = BTREE_SPLIT_DEFAULT_PIVOT (0.5), index = 1 — start balanced.

btree_split_find_pivot turns the average into a target byte size, two branches:

  • Balancedpivot == 0 (uninit) or BTREE_SPLIT_LOWER_BOUND < pivot < BTREE_SPLIT_UPPER_BOUND: aim for CEIL_PTVDIV(total, 2); the [0.20,0.80] dead-band stops mild noise from skewing.
  • Skewedpivot ≤ 0.20 or ≥ 0.80: aim for (int)(total * pivot), clamped to [0.05,0.95] (MIN_PIVOT/MAX_PIVOT) so ≥5% always lands each side.

btree_split_next_pivot folds the new split in: index = MIN(index+1, max_index); if pivot == 0 take new_value directly, else new_pivot = pivot + (new_value - pivot)/index (CMA), clamped to [0,1]. new_value = p_slot_id / key_cnt is where the just-promoted separator landed; ascending inserts feed values near 1.0, dragging pivot upward so future splits keep the dense right side and ship the sparse left half to the sibling.

6.4 btree_find_split_point — the size-balanced split point

Section titled “6.4 btree_find_split_point — the size-balanced split point”

btree_split_node/btree_split_root delegate the where to btree_find_split_point, which returns the separator key and writes *mid_slot = last slot kept on the left. It is size-driven (spage_get_space_for_record bytes, not key counts, so fat keys split correctly). Three rules from its source comment: (1) get close to mid_size, (2) leave both sides room for the new entry and fences, (3) both sides keep ≥1 non-fence record. Three budgets back them: left_max_size/right_max_size = BTREE_NODE_MAX_SPLIT_SIZE - new_ent_size - (leaf only) new_fence_size; mid_size = btree_split_find_pivot(tot_rec, …); left_min_size = tot_rec - right_max_size.

The loop walks slots left to right accumulating left_size, charging each real record’s bytes — except at the leaf insert slot, where it charges new_ent_size (absent key) or existing + new_ent_size (present key, append grows in place). Per record:

  • while left_size < left_min_size (right still over budget): continue;
  • if (left_size > MIN(left_max_size, mid_size)) { *mid_slot = i-1; break; } — cut one slot before overflow;
  • if (i == stop_at) { *mid_slot = i; break; } — whole page minus fences stays left.

Two post-loop guards enforce rule #3: *mid_slot == start_with - 1 (left empty) → ++; *mid_slot == stop_at (right empty) → --; each gated by node-type / is_key_added_to_left / found.

Fence-key cost. For leaves the budget pre-subtracts new_fence_size = LEAF_FENCE_MAX_SIZE(max_key_len)+SPAGE_SLOT_SIZE from both sides: the split adds an upper fence to the left leaf and a lower fence to the right (a separator copy marked BTREE_LEAF_RECORD_FENCE letting range scans detect boundaries cheaply — Ch 7). The estimate stops the node overflowing when the fence materializes.

Separator selection. With *mid_slot fixed, it reads mid_key (last left) and next_key (first right). A leaf computes the shortest distinguishing prefix via btree_get_prefix_separatorunless either is an overflow key, where it clones next_key whole (a prefix could exceed the in-page max). A non-leaf clones next_key.

INVARIANT — “the separator strictly separates.” Every left-node key is < separator ≤ every right-node key. btree_get_prefix_separator(mid_key, next_key) returns the smallest value > mid_key and ≤ next_key. Violating it routes searches to the wrong subtree.

6.5 btree_split_node — splitting a non-root node into Q and R

Section titled “6.5 btree_split_node — splitting a non-root node into Q and R”

btree_split_node takes parent P, the full node Q, a freshly allocated sibling R, the P-slot pointing at Q, and the descent key (all three pages already write-latched, asserted up front). Q keeps [1..leftcnt] + upper fence, R gets lower fence + [leftcnt+1..key_cnt], P gains a sep → R slot, and (leaf) Q.next → R.

Figure 6-1 — btree_split_node leaf-chain relink

flowchart LR
  Q["Q: [1..leftcnt]"]
  R["R: [leftcnt+1..]"]
  N["old Q.next"]
  Q -->|"next_vpid"| R
  R -->|"next_vpid"| N
  R -->|"prev_vpid"| Q
  N -->|"prev_vpid (step 8)"| R

Branch-complete walkthrough:

  1. Find the cut. key_cnt = btree_node_number_of_keys(Q); ≤ 0goto exit_on_error. btree_find_split_pointsep_key, leftcnt; NULL/DB_NULL sep_key → log and error out. rightcnt = key_cnt - leftcnt.
  2. Fence record (leaf only). If leaf and sep_key_len < BTREE_MAX_KEYLEN_INPAGE && ≤ qheader->max_key_len, write a leaf record from sep_key, set BTREE_LEAF_RECORD_FENCE, flag_fence_insert = true. An overflow-key separator, or PRM_ID_USE_BTREE_FENCE_KEY = false, forces flag_fence_insert = false.
  3. Undo-log Q, set headers. log_append_undo_data2(RVBT_COPYPAGE, …) snapshots the entire old Q for byte-for-byte rollback. Save right_next_vpid = qheader->next_vpid. Sibling links exist only at the leaf level: leaves set qheader->next_vpid = *R_vpid, rheader->prev_vpid = *Q_vpid; non-leaf Q next_vpid is nulled (VPID_SET_NULL) and R’s link fields keep btree_init_node_header defaults (which redo-logs R’s header). R always inherits rheader->next_vpid = right_next_vpid.
  4. Move the upper half, fence Q. Optional lower fence at R slot 1, then for i = 1..rightcnt the moving target stays at leftcnt+1: peek slot leftcnt+1 of Q → spage_insert_at(R, j, …)spage_delete(Q, leftcnt+1). If flag_fence_insert, Q’s upper fence goes in at leftcnt+1. btree_compress_node(Q), then redo-log Q (RVBT_COPYPAGE).
  5. Redo-log R. btree_compress_node(R), then redo-log R (RVBT_COPYPAGE). R has no undo log — newly allocated in the system op, abort just deallocates it.
  6. Promote separator into P. nleaf_rec.pnt = *R_vpid; key_type = BTREE_OVERFLOW_KEY/key_len = -1 when too long, else BTREE_NORMAL_KEY. p_slot_id++ (insert after the slot pointing at Q), spage_insert_at(P, p_slot_id, &rec), undo/redo RVBT_NDRECORD_INS. P’s header is undo-logged, btree_split_next_pivot(&pheader->split_info, (float) p_slot_id / key_cnt, key_cnt) folds the landing slot in, max_key_len bumped, header redo-logged.
  7. Pick the child. btree_compare_key(key, sep_key, …): DB_UNK → assertion failure; < 0*advance_vpid = *Q_vpid; else *R_vpid.
  8. Relink the leaf neighbor. Only when R is a leaf: the page that used to follow Q gets its prev_vpid re-pointed at R via btree_get_next_page(R) + btree_set_vpid_previous_vpid (§6.7).
  9. exit_on_error frees sep_key and returns; inside the caller’s system op, log_sysop_abort reverses every page change.

INVARIANT — “the doubly-linked leaf chain stays consistent.” After a leaf split all four links hold: Q.next = R, R.prev = Q, R.next = old-next, old-next.prev = R (step 3 sets the first three, step 8 the fourth). Skipping step 8 leaves a leaf whose prev_vpid points at the now-too-full Q, corrupting backward range scans.

6.6 btree_split_root — the height-growth special case

Section titled “6.6 btree_split_root — the height-growth special case”

The root must keep its page identity (it is named in the index catalog), so it cannot spawn one sibling. btree_split_root allocates two children Q and R, moves the halves into them, and rewrites root P to two separators (level L → L+1, slot 1 = neg-inf → Q, slot 2 = sep → R). Contrasting with §6.5:

  1. Undo-log the whole root (RVBT_COPYPAGE); read key_cnt (≤0 → error); snapshot split_info, set split_info.index = 1.
  2. Find split point & fence as in §6.5. neg_inf_key = sep_key — the separator is reused as the dummy negative-infinity key for the left pointer.
  3. Grow the level. pheader->node.node_level++, bump max_key_len, reset root split-info to default. Children get node_level = new_level - 1 and inherit max_key_len/split_info. Same leaf-only sibling-link rule as §6.5 step 3.
  4. Fill R (upper half). Optional lower fence, then move rightcnt records (peek/insert/delete leftcnt+1 of P). Redo-log R RVBT_INS_PGRECORDS (R is new).
  5. Fill Q (lower half). Move leftcnt records by peeking slot 1 of P and deleting it (P shrinks from the front). If a fence is wanted, append Q’s upper fence; else i--. Redo-log Q RVBT_INS_PGRECORDS.
  6. Rebuild root P. Redo-log a delete-all (RVBT_DEL_PGRECORDS, rec_cnt = key_cnt), then write slot 1 = neg_inf_key → Q_vpid, slot 2 = sep_key → R_vpid (each RVBT_NDRECORD_INS; BTREE_OVERFLOW_KEY/key_len = -1 when too long).
  7. Pick the child. Same btree_compare_key three-way branch: DB_UNK → error; < 0 → Q; else R.
  8. exit_on_error frees sep_key; the system op aborts, deallocates Q and R, and the RVBT_COPYPAGE undo restores P.

INVARIANT — “the root never changes page identity, and the leftmost separator is negative infinity.” btree_split_root rewrites P in place (no new VPID) and always installs a dummy neg_inf_key in slot 1. btree_compare_key never compares against slot 1 — searches < sep_key route to Q unconditionally — so the leftmost pointer covers all keys below the real separator. Delete-all + two inserts (not a page copy) keeps P’s identity stable across redo.

Logging asymmetry: btree_split_node uses whole-page RVBT_COPYPAGE for Q and R; btree_split_root uses record-set logging (RVBT_INS_PGRECORDS / RVBT_DEL_PGRECORDS / RVBT_NDRECORD_INS). Both share one system-op bracket, so abort is uniform.

Section titled “6.7 btree_set_vpid_previous_vpid — the back-link rewrite”

The smallest piece carries its own undo/redo so a split rollback restores the neighbor’s prev_vpid:

// btree_set_vpid_previous_vpid -- src/storage/btree.c
if (page_p == NULL) return NO_ERROR; /* <- chain tail, no-op */
header = btree_get_node_header (thread_p, page_p);
if (header == NULL) return ER_FAILED;
btree_node_header_undo_log (thread_p, &btid->sys_btid->vfid, page_p);
header->prev_vpid = *prev; /* <- repoint backward link, then redo-log */

Three branches: NULL page (chain tail) → no-op success; header read failure → ER_FAILED; else back-link rewritten with full undo/redo and pgbuf_set_dirty. Reused by leaf merge (Ch 9), hence standalone.

6.8 btree_split_node_and_advance — the full branch map

Section titled “6.8 btree_split_node_and_advance — the full branch map”

The advance function is more than the need_split block. With node_type, need_split, and need_update_max_key_len computed (§6.1), it threads through these branches before returning. A root restart below means: on ER_PAGE_LATCH_PROMOTE_FAIL, set nonleaf_latch_mode = WRITE, *restart = true, unfix child+parent, is_crt_node_write_latched = false, return NO_ERROR (other promote errors → goto error). The nine branch entries, in source order:

  1. Root dispatch (insert_helper->is_root): the root path mirrors the non-root flow in a parallel block (its own sysop, promotion, advance branches), but splits via btree_split_root (two new pages, §6.6) not btree_split_node.
  2. *crt_page promotion (need_split && nonleaf_latch_mode == READ && !is_crt_node_write_latched): promote the parent read→write (PGBUF_PROMOTE_ONLY_READER); on fail, root restart.
  3. child promotion ((need_split || need_update_max_key_len) && nonleaf_latch_mode == READ && !need_update_max_key_len && !is_child_leaf): promote the child (PGBUF_PROMOTE_SHARED_READER); on fail, root restart.
  4. need_update_max_key_len (rule parent->max_key_len >= child->max_key_len violated): set child max_key_len = key_len_in_page, btree_node_header_redo_log + pgbuf_set_dirty, propagate need_update_max_key_len = true, is_crt_node_write_latched = true so every descendant also updates.
  5. Split + advance-to-child (need_split, VPID_EQ(&advance_vpid, &child_vpid)): unfix new_page1, log_sysop_commit, *advance_to_page = child_page, is_crt_node_write_latched = true, return.
  6. Split + advance-to-new-page (need_split, VPID_EQ(&advance_vpid, &new_page_vpid1)): unfix child_page, log_sysop_commit, *advance_to_page = new_page1, is_crt_node_write_latched = true, return.
  7. Split + impossible (need_split, neither VPID matches): assert_release(false), unfix all three, error_code = ER_FAILED, goto error.
  8. No-split fall-through (!need_split && !need_update_max_key_len && nonleaf_latch_mode == READ): read-latched and untouched, so is_crt_node_write_latched = false, *advance_to_page = child_page, return.
  9. error: label (any goto error): unfix new_page1/new_page2 first, then log_sysop_abort if a system op is open, then unfix child_page.

The two promotion early returns are the only paths that restart from the root, and they restart on latch contention, never capacity. The split bracket is opened by log_sysop_start (is_system_op_started = true) inside the need_split block, closed by log_sysop_commit per advance branch or by the error: label:

// btree_split_node_and_advance -- src/storage/btree.c (error: label)
if (new_page1 != NULL) pgbuf_unfix_and_init (thread_p, new_page1);
if (new_page2 != NULL) pgbuf_unfix_and_init (thread_p, new_page2);
if (is_system_op_started)
log_sysop_abort (thread_p); /* <- unfix new pages FIRST so abort can deallocate them */
if (child_page != NULL) pgbuf_unfix_and_init (thread_p, child_page);

The ordering is load-bearing: new pages must be unfixed before log_sysop_abort, because abort deallocates them and the fix count must be zero.

  1. Pessimistic split-on-descent splits any node that might not fit the worst-case entry before advancing into it, so the leaf is always guaranteed room and chain splits are structurally impossible.
  2. btree_get_max_new_data_size is the estimator behind the guarantee, branching on node type and insert purpose; descent passes known_to_be_found = false to reserve the full new-key cost.
  3. btree_node_split_info {pivot, index} is a CMA accumulator; btree_split_find_pivot turns it into a target byte size (balanced in [0.20,0.80], clamped to [0.05,0.95]), btree_split_next_pivot updates it from each separator’s landing slot.
  4. btree_find_split_point is size-balanced: it reserves space for the new entry and fences on both sides, keeps ≥1 non-fence record per side, and returns a prefix-truncated separator.
  5. btree_split_node undo-logs Q whole (RVBT_COPYPAGE), moves the upper half to R, writes fences, redo-logs both, promotes the separator into P (RVBT_NDRECORD_INS), and relinks the leaf chain (§6.7).
  6. btree_split_root grows height in place: two children, a dummy neg-inf separator in slot 1 plus the real one in slot 2, node_level bump, record-set redo.
  7. btree_split_node_and_advance has eight live exit paths across nine branch entries (the error: label is the shared abort sink): two latch-promotion root-restarts, max-key-len propagation, two split commits, the assert_release(false) impossible path, and the no-split read-latch fall-through — all in one system-op bracket whose error: label unfixes new pages before log_sysop_abort.

Chapter 7: Range Scan and LSA Aware Resume

Section titled “Chapter 7: Range Scan and LSA Aware Resume”

This chapter answers the read-side question: once data is on disk, how does a range scan walk leaves and overflow pages key-by-key, and resume safely after releasing every latch across a client roundtrip? The answer is one generic driver, btree_range_scan, parameterized by a key-processing callback (BTREE_RANGE_SCAN_PROCESS_KEY_FUNC). The driver owns positioning, latching, advancement, and page-LSA-aware resume; the callback owns per-key work (visible OIDs for SELECT, any FK match for FK checks). All persistent cursor state lives in one BTREE_SCAN (BTS), so a scan drops all latches between iterations and resumes where it left off. For latch-coupling descent see cubrid-btree.md §“Descent — latch-coupling on read”.

7.1 The cursor structs — btree_scan and btree_keyrange

Section titled “7.1 The cursor structs — btree_scan and btree_keyrange”

btree_keyrange (src/storage/btree.h) is the “what to scan” descriptor; btree_scan (BTS) is the large “where am I now” cursor. Fields tagged /* TO BE REMOVED */ in the header are legacy-only and marked legacy.

btree_keyrange fieldRoleWhy it exists
rangeRANGE enum, each endpoint open/closed/infiniteCaptures all bound combos; positioning branches on it alone
lower_keyLower-bound DB_VALUE, or NULLNULL = no lower limit; btree_range_scan_start branches on this
upper_keyUpper-bound DB_VALUE, or NULLNULL = no upper limit; range test never trips on top
num_index_termCount of leading index columns boundMIDXKEY can range on a prefix; tells comparator how many columns matter

The table below is the authority for every btree_scan member:

FieldRoleWhy it exists
btid_intResolved tree id + key domainAvoids re-reading the root header per key (Ch 1)
read_uncommittedLegacy dirty-read flaglegacy — dies with btree_prepare_next_search
P_vpid/P_pagePrevious leaf vpid/ptrlegacy — dies with btree_find_next_index_record
C_vpidCurrent leaf VPIDPersists across unfix — anchor resume re-fixes
O_vpidCurrent overflow VPIDPark point when a key’s OIDs overflow one buffer (mid-key resume)
C_pageCurrent leaf ptrNULL when no latch held; set only while walking
slot_idCurrent slot in C_pageRe-validated after a latch re-acquire
oid_posLegacy OID cursorlegacy — superseded by callback OID walk
cur_keyKey the cursor is parked onResume anchor when slot_id is untrustworthy
clear_cur_keyOwnership flag for cur_keyTells btree_scan_clear_key whether to free
cur_common_prefix_page_vpid/_lsaVPID+LSA of page whose common prefix is cached (debug builds only)Present only under CHECK_VERIFY_COMMON_PREFIX_PAGE_INFO; asserts the cached prefix still belongs to the current page
common_prefix_sizeLength of page’s common prefixCOMMON_PREFIX_UNKNOWN until computed
common_prefix_keyShared prefix bytes for the pageCombined with a compressed cur_key to rebuild the full key
clear_common_prefix_keyOwnership flag for prefixFrees the prefix copy when page changes
is_cur_key_compressedcur_key is prefix-strippedtruebtree_check_decompress_key re-attaches prefix before unfix
attid_idxsattr-id → column-position mapFilters reference base columns by attr-id
is_cur_key_copiedcur_key owns a heap copyCopy survives C_page unfix; set in §7.8
key_range/key_filterWhat to scan / per-key predicateRange positioning; filter rejects in-range failures
key_filter_storageBacking FILTER_INFOOwns the struct so key_filter is stable across iterations
use_desc_indexDescending scan flagSelects prev_vpid + conditional prev-leaf fix path
restart_scanLegacy restart counterlegacy
read_keys/qualified_keysQuery-trace countersBumped in §7.7; feed EXPLAIN/trace stats
key_range_max_value_equalUpper bound matched exactlyLets next advance end the scan without reading another key
cur_leaf_lsaC_page LSA at last unfixDecides reuse vs. re-locate on resume
lock_modeS_LOCK or X_LOCKFrom the scan’s isolation needs
key_recordPeeked RECDES for the keyAvoids re-reading the slot in OID-walking callbacks
need_to_check_nullRange can match SQL NULLToggles the NULL branch in range qualification
leaf_rec_infoDecoded LEAF_REC.ovfl is the head of the key’s OID overflow chain
node_typeAlways BTREE_LEAF_NODEAsserts cursor never parks on a non-leaf
offsetByte offset to first OIDWhere OID processing starts in key_record
key_statusNOT_VERIFIED/VERIFIED/CONSUMEDPer-key state machine (§7.2)
end_scanWhole range exhaustedDriver terminates; C_vpid reset, key cleared
end_one_iterationOID buffer fullReturn partial results, keep C_vpid/cur_leaf_lsa
is_interruptedCallback returned mid-keyBreaks inner loop without advancing
is_key_partially_processedKey’s OIDs spilled across iterationsResume re-enters callback at O_vpid, not the leaf
n_oids_read/..._last_iterationCumulative / per-iter countsPer-iter is btree_keyval_search’s return; debits key_limit_upper
oid_ptrWrite head into caller’s bufferReset to oid_list->oidp per iteration
match_class_oidClass filter for hierarchical uniquesSkips OIDs from other classes in a partition hierarchy
key_limit_lower/_upperLIMIT/OFFSET push-down_upper decremented by per-iter count at exit
index_scan_idpOwning index scanSource of OID buffer, snapshot, opt mode (ISS/MRO/covering)
is_btid_int_validbtid_int resolvedGuards re-resolution on first use
is_scan_startedStart vs. resume selectorfalse until start succeeds; then resume is used
force_restart_from_rootDesc fast path failedForces full re-locate on next resume
is_fk_remakeDeduplicate-mode FK retryRetries FK existence with rebuilt key
time_trackTimingFeeds PSTAT_BT_*
bts_otherPer-purpose extra stateFK: points at the BTREE_FIND_FK_OBJECT result

BTS_KEY_STATUS (bts_key_status, src/storage/btree.h) is three-valued: BTS_KEY_IS_NOT_VERIFIED (slot reached, range/filter not yet checked), BTS_KEY_IS_VERIFIED (passed range AND filter, ready to process), BTS_KEY_IS_CONSUMED (fully processed, advance to next slot).

stateDiagram-v2
  [*] --> NOT_VERIFIED : positioned on a slot
  NOT_VERIFIED --> VERIFIED : range ok and filter ok
  NOT_VERIFIED --> NOT_VERIFIED : fence or filtered, advance slot
  VERIFIED --> CONSUMED : key_func read all OIDs
  CONSUMED --> NOT_VERIFIED : advance to next slot
  VERIFIED --> VERIFIED : resume re-reads same record
  CONSUMED --> [*] : range exhausted, end_scan

Figure 7-1 — key_status transitions. advance_over_filtered_keys drives NOT_VERIFIED to VERIFIED; the callback drives VERIFIED to CONSUMED.

Key invariant: the driver only calls key_func when key_status == BTS_KEY_IS_VERIFIED. Asserted in btree_range_scan right before the callback. Every path to the callback first routes through btree_range_scan_advance_over_filtered_keys, which sets VERIFIED only after range AND filter pass. Violate it and a callback could emit OIDs for a fence or out-of-range key — rows that do not belong to the result.

7.3 The single-key wrapper — btree_keyval_search and btree_scan_update_range

Section titled “7.3 The single-key wrapper — btree_keyval_search and btree_scan_update_range”

A point lookup (WHERE k = 5) is a degenerate range:

// btree_keyval_search — src/storage/btree.c
rc = btree_prepare_bts (thread_p, bts, btid, isidp, kv_range, filter, ...);
rc = btree_range_scan (thread_p, bts, btree_range_scan_select_visible_oids);
return bts->n_oids_read_last_iteration; /* <- OID count, iterative */

It is “just a GE_LE range search with the same key.” Returning n_oids_read_last_iteration lets a caller loop while !BTREE_END_OF_SCAN(bts) to drain a key whose OIDs span buffer-fills. btree_scan_update_range installs a new range into an existing BTS (multiple-range optimization): it validates range, nulls SQL-NULL bounds, and swaps bounds for descending scans:

// btree_scan_update_range — src/storage/btree.c
if ((bts->use_desc_index && !BTREE_IS_PART_KEY_DESC (&bts->btid_int))
|| (!bts->use_desc_index && BTREE_IS_PART_KEY_DESC (&bts->btid_int)))
{ /* XOR: scan-desc != column-desc */
range_reverse (bts->key_range.range); /* GE_LE becomes LE_GE etc. */
swap_key = bts->key_range.lower_key; /* <- swap bounds */
bts->key_range.lower_key = bts->key_range.upper_key;
bts->key_range.upper_key = swap_key;
}

The swap fires only when exactly one of “scan descending” / “column stored descending” holds; if both hold the physical walk is already forward.

7.4 Starting a scan — btree_range_scan_start

Section titled “7.4 Starting a scan — btree_range_scan_start”

Positions on the first eligible key; called only while is_scan_started == false. The /* (X) */ markers are the branch map.

// btree_range_scan_start — src/storage/btree.c
bts->key_status = BTS_KEY_IS_NOT_VERIFIED;
if (bts->key_range.lower_key == NULL)
{ /* (A) no lower bound: jump to lowest leaf */
btree_find_lower_bound_leaf (thread_p, bts, NULL);
if (bts->end_scan) { return NO_ERROR; } /* empty index */
}
else
{ /* (B) lower bound: descend to leaf */
btree_locate_key (..., bts->key_range.lower_key, ..., &found);
if (!found)
{ if (bts->use_desc_index) { bts->slot_id--; } } /* (B1) desc: first-smaller */
else if (bts->key_range.range == GT_LT || bts->key_range.range == GT_LE
|| bts->key_range.range == GT_INF)
{ bts->key_status = BTS_KEY_IS_CONSUMED; } /* (B2) strict GT_*: skip exact */
}
btree_range_scan_advance_over_filtered_keys (thread_p, bts);
if (bts->force_restart_from_root)
{ /* (C) desc couldn't position: restart */
assert (bts->use_desc_index);
bts->key_status = BTS_KEY_IS_NOT_VERIFIED;
return NO_ERROR; /* is_scan_started stays false -> driver loops */
}
bts->is_scan_started = true; /* (D) success */
return NO_ERROR;

All four outcomes funnel into advance_over_filtered_keys. Only (C) leaves is_scan_started false — a descending start that hit the prev-leaf fix path without progress, which the driver retries.

7.5 The generic driver — btree_range_scan

Section titled “7.5 The generic driver — btree_range_scan”

The outer loop runs until end_scan (range done) or end_one_iteration (buffer full, return to client); the inner loop drains keys, calling the callback for each VERIFIED key.

flowchart TB
  START["reset end_scan/end_one_iteration\nn_oids_read_last_iteration=0\noid_ptr=oid_list.oidp"] --> OUTER{"!end_scan and\n!end_one_iteration?"}
  OUTER -->|no| EXIT["save cur_leaf_lsa, unfix pages\ndebit key_limit_upper"]
  OUTER -->|yes| POS{"is_scan_started?"}
  POS -->|no| STARTF["btree_range_scan_start"]
  POS -->|yes| RESUME["btree_range_scan_resume"]
  STARTF --> CHK
  RESUME --> CHK{"end_scan?"}
  CHK -->|yes| EXIT
  CHK -->|no| FRR{"force_restart_from_root?"}
  FRR -->|yes| UNFIXC["unfix C_page, continue outer"]
  UNFIXC --> OUTER
  FRR -->|no| INNER["INNER: key_func(bts)"]
  INNER --> ICHK{"interrupted /\nend_one_iteration /\nend_scan?"}
  ICHK -->|yes| EXIT
  ICHK -->|no| ADV["advance_over_filtered_keys"]
  ADV --> ACHK{"end_scan?"}
  ACHK -->|yes| EXIT
  ACHK -->|no| AFRR{"force_restart_from_root?"}
  AFRR -->|yes| UNFIXC
  AFRR -->|no| INNER

Figure 7-2 — btree_range_scan control flow. Outer loop chooses start vs. resume; inner loop calls key_func then advances. Both force_restart_from_root arms unfix C_page and re-enter the outer loop.

// btree_range_scan — src/storage/btree.c
while (true)
{
assert (bts->key_status == BTS_KEY_IS_VERIFIED); /* invariant 7.2 */
error_code = key_func (thread_p, bts); /* <- the callback */
if (bts->is_interrupted || bts->end_one_iteration || bts->end_scan)
{ break; }
assert (bts->key_status == BTS_KEY_IS_CONSUMED);
btree_range_scan_advance_over_filtered_keys (thread_p, bts);
if (bts->force_restart_from_root)
{ /* unfix C_page, decompress cur_key */ break; }
}

At the end: label the driver always, in order: captures cur_leaf_lsa = pgbuf_get_lsa(C_page) before unfixing; decompresses cur_key if end_one_iteration; unfixes C_page/P_page; on end_scan nulls C_vpid and clears is_scan_started/the key; debits key_limit_upper by n_oids_read_last_iteration.

Key invariant: cur_leaf_lsa is captured at the same instant C_page is unfixed. Unfix without first copying the LSA and a later resume would compare a stale LSA, conclude “unchanged,” and reuse a slot_id now pointing at a different key. The end: block guarantees copy-then-unfix; the no-page branch nulls cur_leaf_lsa so resume must re-locate from root.

7.6 Resuming — btree_range_scan_resume and the LSA decision

Section titled “7.6 Resuming — btree_range_scan_resume and the LSA decision”

Re-establish position without holding a latch the whole time. The decision hinges on cur_leaf_lsa; /* (X) */ is the branch map.

// btree_range_scan_resume — src/storage/btree.c
if (!bts->force_restart_from_root)
{
pgbuf_fix_if_not_deallocated (..., &bts->C_vpid, ..., &bts->C_page);
if (bts->C_page != NULL)
{
if (LSA_EQ (&bts->cur_leaf_lsa, pgbuf_get_lsa (bts->C_page)))
{ return btree_range_scan_advance_over_filtered_keys (...); } /* (1) UNCHANGED */
if (BTREE_IS_PAGE_VALID_LEAF (thread_p, bts->C_page))
{ /* (2) still a leaf but changed */
btree_leaf_is_key_between_min_max (..., &bts->cur_key, &search_key);
if (search_key.result == BTREE_KEY_BETWEEN)
{ btree_search_leaf_page (..., &bts->cur_key, &search_key); }
switch (search_key.result)
{
case BTREE_KEY_FOUND: /* (2a) key still here: reset slot */
bts->slot_id = search_key.slotid;
return btree_range_scan_advance_over_filtered_keys (...);
case BTREE_KEY_BETWEEN: /* (2b) key deleted, neighbor here */
bts->slot_id = bts->use_desc_index ? search_key.slotid - 1 : search_key.slotid;
bts->key_status = BTS_KEY_IS_NOT_VERIFIED;
return btree_range_scan_advance_over_filtered_keys (...);
case BTREE_KEY_SMALLER: case BTREE_KEY_BIGGER: case BTREE_KEY_NOTFOUND:
break; /* (2c) migrated off page: re-locate */
}
}
pgbuf_unfix_and_init (thread_p, bts->C_page); /* (3) reused as non-leaf */
}
/* (4) C_page deallocated */
}
bts->force_restart_from_root = false;
btree_locate_key (..., &bts->cur_key, ..., &found);
if (found) { return btree_range_scan_advance_over_filtered_keys (...); }
if (btree_node_number_of_keys (thread_p, bts->C_page) < 1)
{ bts->end_scan = true; return NO_ERROR; } /* (5) index emptied under us */
if (bts->use_desc_index) { bts->slot_id--; }
bts->key_status = BTS_KEY_IS_NOT_VERIFIED;
return btree_range_scan_advance_over_filtered_keys (...);

(1) fast path — leaf byte-for-byte unchanged, slot_id/key_status hold. (2a)/(2b) re-derive the slot by searching for cur_key on a changed-but-still-leaf page. (2c)/(3)/(4)/(5) re-descend from root via btree_locate_key, then advance, end, or position on the insertion point.

Key invariant: resume re-anchors on cur_key, never on slot_id alone. slot_id is trusted only in branch (1), where the LSA proved no change; elsewhere the slot is re-derived by searching for cur_key. A stale slot would skip or duplicate keys whenever a concurrent split/merge shifted slot numbers.

7.7 Advancing over fence and filtered keys — btree_range_scan_advance_over_filtered_keys

Section titled “7.7 Advancing over fence and filtered keys — btree_range_scan_advance_over_filtered_keys”

Turns “positioned somewhere” into “positioned on the next VERIFIED key, or end_scan.” Three phases.

Resumed-VERIFIED early return — if already VERIFIED (page reused, re-fixed without re-checking), re-peek and return; otherwise set direction and step a CONSUMED slot:

// btree_range_scan_advance_over_filtered_keys — src/storage/btree.c
if (bts->key_status == BTS_KEY_IS_VERIFIED)
{
spage_get_record (..., bts->slot_id, &bts->key_record, PEEK);
btree_range_scan_read_record (thread_p, bts); /* page may have moved bytes */
return NO_ERROR; /* keep current key */
}
if (bts->key_range_max_value_equal)
{ bts->end_scan = true; return NO_ERROR; } /* exact upper bound returned */
inc_slot = bts->use_desc_index ? -1 : 1;
if (bts->key_status == BTS_KEY_IS_CONSUMED) { bts->slot_id += inc_slot; }
next_vpid = bts->use_desc_index ? node_header->prev_vpid : node_header->next_vpid;

Page-crossing loop — when slot_id runs off the page, fix the next leaf; VPID_ISNULL ends the scan. Ascending latch-couples directly, descending defers to §7.8:

while (bts->slot_id <= 0 || bts->slot_id > key_count || key_count == 0)
{
if (VPID_ISNULL (&next_vpid)) { bts->end_scan = true; return NO_ERROR; } /* consumed */
if (bts->use_desc_index)
{
btree_range_scan_descending_fix_prev_leaf (..., &key_count, &node_header, &next_vpid);
if (bts->force_restart_from_root) { return NO_ERROR; }
}
else
{
next_node_page = pgbuf_fix (..., &next_vpid, ...);
pgbuf_unfix (thread_p, bts->C_page); /* latch-couple to next */
bts->C_page = next_node_page; VPID_COPY (&bts->C_vpid, &next_vpid);
bts->slot_id = 1; next_vpid = node_header->next_vpid;
}
}

Per-key qualification — peek; skip fences; else read, count, apply range + filter:

spage_get_record (..., bts->slot_id, &bts->key_record, PEEK);
if (!btree_leaf_is_flaged (&bts->key_record, BTREE_LEAF_RECORD_FENCE))
{
btree_range_scan_read_record (thread_p, bts);
bts->read_keys++;
btree_apply_key_range_and_filter (..., &is_range_satisfied, &is_filter_satisfied);
if (!is_range_satisfied) { bts->end_scan = true; return NO_ERROR; } /* past range */
if (is_filter_satisfied)
{
bts->qualified_keys++;
bts->key_status = BTS_KEY_IS_VERIFIED; /* <- only success exit */
return NO_ERROR;
}
/* filter failed: fall through */
}
bts->slot_id += inc_slot; /* fence/filter-fail: try next slot */

Fence keys are page-edge sentinels (see cubrid-btree.md §“Node layout”), skipped silently. A failed range test ends the scan; a failed filter advances and loops. Descending page-crossing may set force_restart_from_root via §7.8.

7.8 Reading the record and the descending prev-leaf fix

Section titled “7.8 Reading the record and the descending prev-leaf fix”

btree_range_scan_read_record decodes the peeked slot into cur_key: variable-length text/bit/midxkey use PEEK_KEY_VALUE (decompression may still allocate, tracked by is_cur_key_copied); fixed types use COPY_KEY_VALUE:

// btree_range_scan_read_record — src/storage/btree.c
switch (TP_DOMAIN_TYPE (bts->btid_int.key_type))
{
case DB_TYPE_MIDXKEY: case DB_TYPE_VARCHAR: case DB_TYPE_CHAR:
case DB_TYPE_BIT: case DB_TYPE_VARBIT:
{
int ret = btree_read_record_in_leafpage (thread_p, bts->C_page, PEEK_KEY_VALUE, bts);
bts->is_cur_key_copied = bts->cur_key.need_clear; /* compressed -> copied */
return ret;
}
default: break;
}
bts->is_cur_key_copied = true;
return btree_read_record_in_leafpage (thread_p, bts->C_page, COPY_KEY_VALUE, bts);

A copied/decompressed key survives the C_page unfix — which is why the driver runs btree_check_decompress_key before unfixing on end_one_iteration: resume must search the page for a complete cur_key.

btree_range_scan_descending_fix_prev_leaf exists because descending scans walk against the forward sibling links and would deadlock with forward scans under unconditional latches. It tries a conditional latch on the previous leaf, dropping the current latch and re-acquiring in safe order only on failure:

// btree_range_scan_descending_fix_prev_leaf — src/storage/btree.c
prev_leaf = pgbuf_fix (..., &prev_leaf_vpid, ..., PGBUF_CONDITIONAL_LATCH);
if (prev_leaf != NULL)
{ /* (A) conditional latch won: common case */
pgbuf_unfix_and_init (thread_p, bts->C_page);
bts->C_page = prev_leaf;
bts->slot_id = *key_count = btree_node_number_of_keys (...); /* last key of prev */
VPID_COPY (next_vpid, &(*node_header_ptr)->prev_vpid);
return NO_ERROR;
}
/* (B) conditional failed: drop current, re-fix prev unconditionally */
btree_check_decompress_key (bts); /* cur_key must survive the unfix */
pgbuf_unfix_and_init (thread_p, bts->C_page);
pgbuf_fix_if_not_deallocated (..., &prev_leaf_vpid, ..., &prev_leaf);
if (prev_leaf == NULL)
{ bts->force_restart_from_root = true; return NO_ERROR; } /* (B1) freed */
if (!BTREE_IS_PAGE_VALID_LEAF (thread_p, prev_leaf))
{ bts->force_restart_from_root = true; ...; return NO_ERROR; } /* (B2) reused non-leaf */
*node_header_ptr = btree_get_node_header (thread_p, prev_leaf);
if (!VPID_EQ (&(*node_header_ptr)->next_vpid, &bts->C_vpid))
{ bts->force_restart_from_root = true; ...; return NO_ERROR; } /* (B3) relinked */

(A) wins without waiting, positioning on the previous page’s last key. (B) drops the current latch (after decompressing cur_key) and re-fixes unconditionally; three latchless hazards each force a root restart — (B1) freed, (B2) reused as non-leaf, (B3) leaves no longer linked.

Key invariant: descending advancement never holds two leaf latches in the forward link direction. It grabs the previous page conditionally or releases the current page before waiting — the rule that keeps a descending scan from deadlocking against ascending scans and writers latching in next_vpid order. When it cannot make progress safely it sets force_restart_from_root and re-descends from root: slow but safe.

7.9 The SELECT callback — btree_range_scan_select_visible_oids

Section titled “7.9 The SELECT callback — btree_range_scan_select_visible_oids”

The BTREE_RANGE_SCAN_PROCESS_KEY_FUNC for SELECTs: for the current VERIFIED key, walk its OID list (leaf inline OIDs, then the overflow chain) and copy the MVCC-visible ones into the caller’s buffer. Because a key’s OID list can exceed the buffer, it carries partial-key resume via O_vpid and is_key_partially_processed.

flowchart TB
  ENTER["key_func entry (VERIFIED key)"] --> ISS{"ISS distinct-key op?"}
  ISS -->|yes| EOUT["set key, n=1, end_scan"]
  ISS -->|no| PART{"is_key_partially_processed?"}
  PART -->|yes| RESOVF["re-fix O_vpid\nget next overflow vpid"]
  PART -->|no| SOFT{"soft cap exceeded\nby leaf+1ovf?"}
  SOFT -->|yes| ONEIT["end_one_iteration"]
  SOFT -->|no| LEAF["process leaf OIDs"]
  LEAF --> OVF["overflow loop"]
  RESOVF --> OVF
  OVF --> HARD{"hard cap exceeded\nby this ovf page?"}
  HARD -->|yes| SAVE["O_vpid=last_visible\npartial=true, end_one_iteration"]
  HARD -->|no| PROC["process page OIDs"]
  PROC --> MORE{"more overflow pages?"}
  MORE -->|yes| OVF
  MORE -->|no| DONE["O_vpid=NULL\nkey_status=CONSUMED"]

Figure 7-3 — btree_range_scan_select_visible_oids. The soft cap stops before a key when the buffer is nearly full (end_one_iteration); the hard cap suspends inside a key’s overflow chain, saving O_vpid for partial-key resume.

Leaf and overflow OIDs are processed by btree_record_process_objects with btree_select_visible_object_for_range_scan, which builds an MVCC header, checks the snapshot, applies match_class_oid and key-limit filters, and copies visible OIDs via BTS_SAVE_OID_IN_BUFFER.

7.10 The FK callback — btree_range_scan_find_fk_any_object

Section titled “7.10 The FK callback — btree_range_scan_find_fk_any_object”

An FK check only needs to know whether any referenced row exists, so the FK callback stops at the first qualifying object, using bts->bts_other (a BTREE_FIND_FK_OBJECT) as the result holder:

// btree_range_scan_find_fk_any_object — src/storage/btree.c
fk_arg.bts = bts; fk_arg.ovfl_page = NULL;
btree_record_process_objects (..., BTREE_LEAF_NODE, ..., &stop,
btree_fk_object_does_exist, &fk_arg); /* leaf */
if (stop == true) { return NO_ERROR; } /* found -> done, no overflow */
VPID_COPY (&ovf_vpid, &bts->leaf_rec_info.ovfl);
while (!VPID_ISNULL (&ovf_vpid))
{ /* hand-over-hand chain walk */
fk_arg.ovfl_page = pgbuf_fix (..., &ovf_vpid, ...);
btree_record_process_objects (..., BTREE_OVERFLOW_NODE, ..., btree_fk_object_does_exist, &fk_arg);
if (stop) { break; } /* found -> done */
btree_get_next_overflow_vpid (thread_p, fk_arg.ovfl_page, &ovf_vpid);
}
if (bts->end_scan) { return NO_ERROR; }
if (!bts->is_interrupted)
{ /* fully scanned, none found */
assert (OID_ISNULL (&((BTREE_FIND_FK_OBJECT *) bts->bts_other)->found_oid));
if (bts->is_fk_remake) { /* deduplicate-mode retry with rebuilt key */ }
}

A leaf match sets stop; the callback returns without touching the chain. Otherwise it walks the chain (same hand-over-hand fix) and stops on the first match. Fully scanned, none found, not interrupted → found_oid NULL; under deduplicate-key mode (is_fk_remake) it retries with the rebuilt key. The FK callback never sets end_one_iteration — an FK check fits one key’s work and needs no partial-key resume.

  1. One generic driver, two callbacks. btree_range_scan owns positioning, latching, advancement, and resume; btree_range_scan_select_visible_oids does per-key SELECT work and btree_range_scan_find_fk_any_object does FK existence. btree_keyval_search is just a single-key range over the SELECT callback.
  2. cur_leaf_lsa is the resume contract. Captured at the exact instant C_page is unfixed; on resume, LSA_EQ against the re-fixed page is the fast path, and any mismatch forces re-finding cur_key or a root re-locate.
  3. key_status is a strict three-state machine. VERIFIED is asserted before every callback and CONSUMED after; only advance_over_filtered_keys promotes NOT_VERIFIED to VERIFIED, after range AND filter pass.
  4. Resume re-anchors on cur_key, not slot_id. Slot numbers are trusted only when the LSA proves the page unchanged; elsewhere the slot is re-derived by searching for the key, so concurrent splits/merges never make the scan skip or duplicate rows.
  5. Descending scans avoid deadlock by conditional prev-leaf latching. btree_range_scan_descending_fix_prev_leaf tries a conditional latch and, failing, releases its current latch before re-acquiring in safe order; a changed leaf chain forces force_restart_from_root.
  6. Two capacity limits split a large key across iterations. The soft limit stops before a key when the buffer is nearly full (end_one_iteration); the hard limit suspends inside a key’s overflow chain, saving O_vpid and is_key_partially_processed so the next iteration resumes mid-key — streaming millions of OIDs without holding a latch across roundtrips.
  7. Fence keys are invisible to scans. advance_over_filtered_keys skips any BTREE_LEAF_RECORD_FENCE slot, so the page-edge sentinels from splits (Ch 6) never leak into a result set.

Chapter 8: Deleting an Object Logical and Physical

Section titled “Chapter 8: Deleting an Object Logical and Physical”

How does CUBRID take a single OID out of a key? A logical delete only stamps a delete MVCCID so older readers still see the object; a physical delete splices the object bytes out, and when the last OID leaves it removes the whole key. One orchestrator, btree_delete_internal, drives both plus the undo (re-insert) and vacuum paths, dispatching on BTREE_OP_PURPOSE. For the record layout (key‖OID, unique first-object rule, overflow chain) see cubrid-btree.md § “CUBRID’s Approach”.

btree_delete_internal builds a BTREE_DELETE_HELPER, switch (purpose) picks the per-key worker, and hands it to btree_search_key_and_apply_functions (the shared descent machine from insert, Chapter 3).

// btree_delete_internal -- src/storage/btree.c
switch (purpose)
{
case BTREE_OP_DELETE_OBJECT_PHYSICAL_POSTPONED:
case BTREE_OP_DELETE_UNDO_INSERT:
LSA_COPY (&delete_helper.reference_lsa, ref_lsa); /* <- CLR anchor */
[[fallthrough]];
case BTREE_OP_DELETE_OBJECT_PHYSICAL:
case BTREE_OP_DELETE_VACUUM_OBJECT:
key_func = btree_key_delete_remove_object; break; /* <- physical path */
case BTREE_OP_DELETE_UNDO_INSERT_UNQ_MULTIUPD:
LSA_COPY (&delete_helper.reference_lsa, ref_lsa);
key_func = btree_key_remove_object_and_keep_visible_first; break;
case BTREE_OP_DELETE_VACUUM_INSID:
key_func = btree_key_remove_insert_mvccid; break; /* <- vacuum insid */
case BTREE_OP_DELETE_UNDO_INSERT_DELID:
LSA_COPY (&delete_helper.reference_lsa, ref_lsa);
key_func = btree_key_remove_delete_mvccid; break; /* <- logical undo */
default:
assert_release (false); return ER_FAILED;
}

The seven cases collapse to four leaf workers:

key_funcPurposesEffect
btree_key_delete_remove_objectPHYSICAL, PHYSICAL_POSTPONED, UNDO_INSERT, VACUUM_OBJECTsplice OID out, remove key if last; PHYSICAL logs full undo, rest are CLR/vacuum
btree_key_remove_object_and_keep_visible_firstUNDO_INSERT_UNQ_MULTIUPDsplice + restore visible first object
btree_key_remove_delete_mvccidUNDO_INSERT_DELIDlogical: clear delete MVCCID
btree_key_remove_insert_mvccidVACUUM_INSIDlogical: clear insert MVCCID

The post-descent tail: free printed_key; on error log and return; else bump PSTAT_BT_NUM_DELETES, publish *unique, and run the subtle revert branch:

// btree_delete_internal -- src/storage/btree.c
if (delete_helper.check_key_deleted && !delete_helper.is_key_deleted)
{ /* <- key still has a visible object: undo the optimistic decrement */
delete_helper.unique_stats_info->insert_key_and_row (); /* <- revert key */
delete_helper.unique_stats_info->delete_row (); /* <- keep row delta */
}
flowchart TB
  A["btree_delete_internal\nswitch(purpose)"] --> B["btree_search_key_and_apply_functions"]
  B --> C["btree_fix_root_for_delete"]
  C --> D["btree_merge_node_and_advance\ndescend, merge under-full -- Ch 9"]
  D --> E{"key_func"}
  E -->|physical| F["btree_key_delete_remove_object"]
  E -->|undo delid| G["btree_key_remove_delete_mvccid"]
  E -->|vacuum insid| H["btree_key_remove_insert_mvccid"]
  E -->|unq multiupd| I["btree_key_remove_object_and_keep_visible_first"]

Figure 8-1 — Orchestration. The purpose chooses the leaf worker; the descent machinery is shared with insert.

btree_fix_root_for_delete runs once per root fix. First search loads B-tree info (btree_fix_root_with_info); a restart re-fixes and returns early. It unpacks the key, sets the domain, computes is_null. Several purposes stop here: both vacuum purposes return NO_ERROR before stat work; undo/postpone purposes return after patching a NULL class OID to topclass_oid (omitted from the undo image when it equals topclass). Only PHYSICAL falls through to unique stats, where MULTI_ROW_UPDATE arms the check_key_deleted/is_key_deleted pair (§8.1). A NULL key sets *stop.

btree_merge_node_and_advance is the advance callback (merge logic is Ch 9). The leaf-reaching branch: at node_level == 1 it calls btree_search_leaf_page, then promotes a root read latch via pgbuf_promote_read_latch. On ER_PAGE_LATCH_PROMOTE_FAIL it sets *restart and flips nonleaf_latch_mode to PGBUF_LATCH_WRITE so the retry descends write-latched; else sets *is_leaf, leaving it write-latched for key_func.

Invariant — the leaf is write-latched before any mutation. Every key_func asserts pgbuf_get_latch_mode (*leaf_page) >= PGBUF_LATCH_WRITE; the promote-or-restart branch guarantees it. A read-latched worker would corrupt the page against a concurrent writer.

8.3 Physical delete dispatch — btree_key_delete_remove_object

Section titled “8.3 Physical delete dispatch — btree_key_delete_remove_object”

Finds the object, prepares logging, removes it, recounts:

  1. Key found? result == BTREE_KEY_FOUND → copy record, btree_read_record, btree_find_oid_and_its_page locates the OID (possibly overflow) into found_page/prev_found_page/offset_to_object; else the offset is NOT_FOUND.
  2. Object not found: for VACUUM_OBJECT benign (already vacuumed) — warn and goto exit; else assert_release(false) + ER_BTREE_UNKNOWN_KEY.
  3. Found: a SERVER_MODE assert checks the lock is held (vacuum/recovery/ own-insert-into-non-unique are lock-free); a second enforces the unique first-object rule (§8.7).
  4. Undo logging: only PHYSICAL saves a key-value image via btree_rv_save_keyval_for_undo into rv_keyval_data (rollback re-inserts the OID); vacuum/undo-of-insert are compensation, so = NULL.
  5. Remove: node_type = (*leaf_page == found_page) ? LEAF : OVERFLOW, then btree_key_remove_object (§8.4); unfix other pages.
  6. Recount: for MULTI_ROW_UPDATE, re-read the leaf, btree_get_num_visible_from_leaf_and_ovf (dirty snapshot); if num_visible_oids > 0 set is_key_deleted = false (§8.1 fixes the stat).
  7. exit: idempotent unfix of both pages, free rv_keyval_data.

8.4 btree_key_remove_object — leaf vs. overflow fork

Section titled “8.4 btree_key_remove_object — leaf vs. overflow fork”

A two-line dispatcher (Figure 8-2): leaf → btree_leaf_remove_object, overflow → btree_overflow_remove_object. A leaf record must never be left empty (unique first-object rule); an overflow page may shrink to nothing.

flowchart TB
  K["btree_key_remove_object\nnode_type?"] --> L{"BTREE_LEAF_NODE?"}
  L -->|yes| LO["btree_leaf_remove_object"]
  L -->|no| OO["btree_overflow_remove_object"]
  LO --> LO0{"offset_to_object == 0?"}
  LO0 -->|no| RR["btree_record_remove_object\nsplice non-first"]
  LO0 -->|yes, only object, no ovfl| DK["btree_delete_key_from_leaf\nremove whole key"]
  LO0 -->|yes, only object, has ovfl| RP["btree_replace_first_oid_with_ovfl_oid"]
  LO0 -->|yes, more leaf objects| RL["btree_leaf_record_replace_first_with_last"]
  OO --> OO1{"overflow has 1 object?"}
  OO1 -->|yes| DEA["dealloc page + relink"]
  OO1 -->|no| RR2["btree_record_remove_object"]

Figure 8-2 — Every physical-removal branch.

8.5 btree_leaf_remove_object — the four leaf branches

Section titled “8.5 btree_leaf_remove_object — the four leaf branches”

This enforces the unique first-object invariant; removing the key clears check_key_deleted. The four branches:

// btree_leaf_remove_object -- src/storage/btree.c
if (offset_to_object == 0)
{
btree_record_get_last_object (..., &last_oid, ..., &offset_to_last_object);
if (offset_to_last_object == 0) /* <- only object */
{
if (VPID_ISNULL (&leaf_rec_info->ovfl))
btree_delete_key_from_leaf (...); /* <- remove the key (§8.8) */
else
btree_replace_first_oid_with_ovfl_oid (...); /* <- pull from overflow (§8.6) */
}
else
btree_leaf_record_replace_first_with_last (...); /* <- first := last */
}
else
btree_record_remove_object (...); /* <- splice non-first */

Invariant — a unique leaf record’s first object is the visible one, and the record is never empty while OIDs remain. The first slot has a distinct encoding (may carry the class OID and fixed-size MVCC info), so removing it always means replacement, never a raw splice. Violating it leaves a scan reading a malformed first object or a key with a live overflow link but no OID.

8.6 Pulling an OID up — btree_replace_first_oid_with_ovfl_oid

Section titled “8.6 Pulling an OID up — btree_replace_first_oid_with_ovfl_oid”

When the leaf’s only object is removed but overflow OIDs remain, an overflow object becomes the new leaf first object via a system-op sandwich: btree_leaf_change_first_object swaps the overflow’s last object into the leaf first slot, btree_overflow_record_replace_object moves the object-to-delete into the vacated overflow slot, commit, then remove it via btree_overflow_remove_object (§8.7). exit_on_error aborts the system op; pure-undo purposes must not fail here.

8.7 Overflow removal — btree_overflow_remove_object and friends

Section titled “8.7 Overflow removal — btree_overflow_remove_object and friends”

Reads the overflow record (slot 1) and forks on whether this object is alone:

// btree_overflow_remove_object -- src/storage/btree.c
if (overflow_record.length == BTREE_OBJECT_FIXED_SIZE (btid_int))
{
assert (offset_to_object == 0); /* <- only first removable */
btree_get_next_overflow_vpid (thread_p, *overflow_page, &next_overflow_vpid);
pgbuf_unfix_and_init (thread_p, *overflow_page);
if (!delete_helper->is_system_op_started) { log_sysop_start (...); ... }
file_dealloc (..., &overflow_vpid, FILE_BTREE);
if (prev_page == leaf_page)
btree_modify_leaf_ovfl_vpid (...); /* <- relink from leaf */
else
btree_modify_overflow_link (...); /* <- relink from prev ovfl */
}
else
btree_record_remove_object (..., BTREE_OVERFLOW_NODE, offset_to_object, &ovf_addr);

The error: label ends a self-started system op and asserts a non-no-undo purpose.

btree_record_remove_object / _internal is the byte-level splice shared by both paths: _internal computes the encoded size (OR_OID_SIZE + a second OID for unique + MVCC bytes per flags), packs removed bytes as undo, RECORD_MOVE_DATA closes the gap, packs redo; the wrapper spage_updates and btree_rv_log_delete_objects.

Invariant — only the first object of an overflow page may be the alone/last object removed by deallocation. _internal asserts node_type == BTREE_OVERFLOW_NODE || offset_to_object > 0 — it refuses to splice the leaf’s first object; that goes through replacement (§8.5/§8.6).

8.8 Removing the whole key — btree_delete_key_from_leaf

Section titled “8.8 Removing the whole key — btree_delete_key_from_leaf”

When the last OID leaves, the key record is deleted. If leafrec_pnt->key_len < 0 the key value lives in an overflow key page → start a system op and btree_delete_overflow_key first; if a system op runs, copy the record for undo before deleting. Then:

// btree_delete_key_from_leaf -- src/storage/btree.c
if (spage_delete (thread_p, leaf_pg, search_key->slotid) != search_key->slotid) goto exit_on_error;
key_cnt = btree_node_number_of_keys (thread_p, leaf_pg);
if (key_cnt == 0) header->max_key_len = 0; /* <- no keys left */
LOG_RV_RECORD_SET_MODIFY_MODE (&delete_helper->leaf_addr, LOG_RV_RECORD_DELETE);
btree_rv_log_delete_object (thread_p, *delete_helper, delete_helper->leaf_addr,
leaf_record.length, 0, leaf_record.data, NULL); /* <- undo = record */

exit_on_error aborts a started system op. Only the page-local key_cnt updates here; index-wide num_keys/num_oids deltas flow through the unique-stats path (delete_key_and_row/delete_row, §8.2/§8.1).

8.9 Logical delete — btree_key_remove_delete_mvccid

Section titled “8.9 Logical delete — btree_key_remove_delete_mvccid”

The logical half (BTREE_OP_DELETE_UNDO_INSERT_DELID) rolls back an MVCC delete: the object stays, only its delete MVCCID clears. A miss is a hard ER_BTREE_UNKNOWN_KEY (undo must find it). Forks on uniqueness:

  • unique (_uniquebtree_remove_delete_mvccid_unique_internal): the visible object must be first, so undo may move it back — if already first, just btree_record_remove_delid; else swap with the current first (fixed-size), under a system op with undoredo in an overflow page. Logged as compensate (log_append_compensate_with_undo_nxlsa, anchored to reference_lsa), or undoredo when a system op is open.
  • non-unique (_non_unique): no ordering — btree_record_remove_delid in place, spage_update, then log_append_compensate_with_undo_nxlsa (RVBT_RECORD_MODIFY_COMPENSATE, anchored to reference_lsa).

btree_record_remove_delid branches on has_fixed_size: a fixed-size object (overflow, or first-with-overflow) keeps its width and nulls the MVCCID (btree_set_mvccid to MVCCID_NULL); a variable-size object removes the MVCCID bytes and clears BTREE_OID_HAS_MVCC_DELID (btree_remove_mvccid).

8.10 Logical vacuum — btree_key_remove_insert_mvccid

Section titled “8.10 Logical vacuum — btree_key_remove_insert_mvccid”

BTREE_OP_DELETE_VACUUM_INSID is the mirror: vacuum clears the insert MVCCID once the object is visible to all — no undo at all. Branches: object not found -> benign, warn and return NO_ERROR; found -> assert the INSID is not all-visible, then btree_record_remove_insid strips the bytes, spage_update, and log_append_redo_data with redo-only mode RVBT_RECORD_MODIFY_NO_UNDO.

Invariant — vacuum never produces undo. Both vacuum workers (_insert_mvccid here, VACUUM_OBJECT via the physical path) emit only redo. If vacuum logged undo, a crash during recovery could “un-vacuum” and re-expose a dead version, breaking visibility.

  1. One orchestrator, seven purposes. btree_delete_internal switch (purpose) selects one of four leaf workers; the descent (btree_fix_root_for_deletebtree_merge_node_and_advance) is shared with insert and reaches the leaf write-latched.
  2. Logical vs. physical are different paths. Logical (_remove_delete_mvccid undo, _remove_insert_mvccid vacuum) clears an MVCCID in place; physical (_delete_remove_object) splices OID bytes and may remove the key.
  3. The unique first-object rule drives the leaf branches. btree_leaf_remove_object never raw-splices the first object: it removes the key (btree_delete_key_from_leaf), pulls one up from overflow (btree_replace_first_oid_with_ovfl_oid), or swaps in the last object; overflow removal forks on alone-or-not (dealloc + relink, else splice).
  4. Logging encodes intent. Physical delete carries a key-value undo image for re-insert; vacuum is redo-only; undo-of-delete writes compensate records anchored to reference_lsa. The check_key_deleted/is_key_deleted pair lets MULTI_ROW_UPDATE recount visible objects and reverse an over-eager stat decrement.

Chapter 9: Merge and Under Fill Rebalancing

Section titled “Chapter 9: Merge and Under Fill Rebalancing”

This chapter is the mirror image of Chapter 6. A merge is driven down the descent before a delete: the descent walks with btree_merge_node_and_advance, and at every non-leaf hop asks “can the child I am about to enter be folded into a sibling?” If yes, it collapses the two children into the left one, deletes the between-key separator from the parent, and recycles the right page. A degenerate two-key root with height > 2 instead triggers btree_merge_root, which pulls both children up into the root frame and shrinks the tree by one level.

For merge theory (under-fill threshold, separator fold-down, height shrink) see cubrid-btree.md, “Deletion and rebalancing”. This chapter traces only the CUBRID mechanics. The split logging this mirrors is in Chapter 6; merge re-uses the same RVBT_COPYPAGE / RVBT_INS_PGRECORDS redo verbs and the RVBT_MARK_DEALLOC_PAGE undo verb in reverse.

9.1 The merge-on-descend driver: btree_merge_node_and_advance

Section titled “9.1 The merge-on-descend driver: btree_merge_node_and_advance”

btree_merge_node_and_advance is the BTREE_ADVANCE_WITH_KEY_FUNCTION for delete, dispatched per-node by btree_search_key_and_apply_functions for btree_delete_internal. Its job is dual: advance one level toward the leaf, and opportunistically merge the child it enters. It fans into three regions, each ending in return: a leaf (position, no merge); a root with two keys and level > 2 (delegate to btree_merge_root); an ordinary non-leaf (pick the child following key, try to merge it with its right neighbor via btree_merge_node). Figure 9-1 is the skeleton; every labelled edge is a real branch.

flowchart TD
  A["merge_node_and_advance"] --> B{"node_level == 1 ?"}
  B -->|leaf| C{"is_root and latch not WRITE ?"}
  C -->|yes| D["promote SHARED_READER"]
  D -->|FAIL| E["restart, latch:=WRITE, return"]
  D -->|ok| F["is_leaf=true, return"]
  C -->|no| F
  B -->|non-leaf| G{"is_root and level>2 and keys==2 ?"}
  G -->|yes| H["fix L+R, compute need/force_root_merge"]
  H --> I{"need_root_merge ?"}
  I -->|no| J["search child, advance, fall to tail"]
  I -->|yes| LM{"latch == READ ?"}
  LM -->|already WRITE| M["sysop, merge_root, dealloc Q+R, commit"]
  LM -->|READ| K{"promote all 3 ?"}
  K -->|FAIL and force| L["restart, latch:=WRITE, return"]
  K -->|FAIL not force| J
  K -->|ok| M
  M --> N["advance_to_page=root, crt_page=NULL"]
  G -->|no| O["search child, fix child_page"]
  J --> O
  O --> P{"slotid<key_count and free>PAGESIZE/2 ?"}
  P -->|no| Z["advance_to_page=child, return"]
  P -->|yes| Q["fix right neighbor"]
  Q --> R{"right fixed ?"}
  R -->|not in buffer| Z
  R -->|yes| S["btree_node_mergeable"]
  S --> T{"status != NO ?"}
  T -->|no| Z
  T -->|yes| U{"latch READ ? promote all 3"}
  U -->|already WRITE| W
  U -->|FAIL and FORCE| V["restart, latch:=WRITE, return"]
  U -->|FAIL and TRY| Z
  U -->|ok| W["sysop, merge_node, dealloc right, commit"]

Figure 9-1. Control flow of btree_merge_node_and_advance, including the already-WRITE skip-promotion edges.

Leaf branch. At node_level == 1 it positions search_key with btree_search_leaf_page, then guarantees WRITE. If we descended READ and this leaf is also the root (single-page tree), it promotes; ER_PAGE_LATCH_PROMOTE_FAIL sets *restart and bumps nonleaf_latch_mode to WRITE so the descent restarts exclusively. Otherwise the leaf already holds WRITE and returns *is_leaf = true.

// btree_merge_node_and_advance -- src/storage/btree.c
if (node_header->node_level == 1)
{
error_code = btree_search_leaf_page (thread_p, btid_int, *crt_page, key, search_key);
if (delete_helper->is_root && delete_helper->nonleaf_latch_mode != PGBUF_LATCH_WRITE)
{
error_code = pgbuf_promote_read_latch (thread_p, crt_page, PGBUF_PROMOTE_SHARED_READER);
if (error_code == ER_PAGE_LATCH_PROMOTE_FAIL)
{ *restart = true; delete_helper->nonleaf_latch_mode = PGBUF_LATCH_WRITE; return NO_ERROR; }
}
*is_leaf = true;
return NO_ERROR;
}

Root-collapse precheck: is_root && node_level > 2 && btree_node_number_of_keys(...) == 2. The node_level > 2 guard means the tree never collapses to a single page. Both children are fixed, used bytes computed as DB_PAGESIZE - spage_get_free_space(...), and two thresholds derived with the root’s max_key_len as a defensive allowance for the folded-in separator:

// btree_merge_node_and_advance -- src/storage/btree.c
need_root_merge = (left_used + right_used + CAN_MERGE_WHEN_EMPTY + root_max_key_length) < DB_PAGESIZE;
force_root_merge = (left_used + right_used + FORCE_MERGE_WHEN_EMPTY + root_max_key_length) < DB_PAGESIZE;

CAN_MERGE_WHEN_EMPTY is MAX(DB_PAGESIZE*0.33, ...), FORCE_MERGE_WHEN_EMPTY MAX(*0.66, ...): need = merged page at most ~67% full, force at most ~34%. Promotion is attempted only when nonleaf_latch_mode == PGBUF_LATCH_READ; if the descent is already exclusive (restart path) promotion is skipped and the merge runs directly.

Branch matrix for the root-merge outcome (entered only after need_root_merge):

Latch / promotion resultforce_root_mergeAction
nonleaf_latch_mode already WRITEskip promotion, fall into merge (else: “Pages already latched exclusively”)
promote from READ -> all 3 latchedlog_sysop_start, btree_merge_root, dealloc both children, log_sysop_commit, *advance_to_page = *crt_page, *crt_page = NULL
promote -> PROMOTE_FAILtrue*restart = true, latch:=WRITE, unfix children, return
promote -> PROMOTE_FAILfalsefall through, skip merge, advance to one child
other errorgoto error (abort sysop if started)

Invariant — root merge re-enters the root. A successful btree_merge_root sets *advance_to_page = *crt_page and *crt_page = NULL rather than descending, re-running the callback on the same root so a second collapse fires if the shorter root again has two mergeable children. Drop it and a tree that should shrink two levels shrinks one, leaving a transiently under-filled root.

Ordinary non-leaf tail. btree_search_nonleaf_page chooses search_key->slotid and the child VPID. child_latch is WRITE with PGBUF_PROMOTE_ONLY_READER if the child is a leaf (node_level == 2), else inherits nonleaf_latch_mode with PGBUF_PROMOTE_SHARED_READER — the single-reader promote at level 2 mirrors btree_split_node_and_advance (Chapter 6). The right-neighbor precheck is the gate the reader question turns on:

// btree_merge_node_and_advance -- src/storage/btree.c
if (search_key->slotid < key_count && spage_get_free_space (thread_p, child_page) > DB_PAGESIZE / 2)

Both conditions required: slotid < key_count means a right neighbor exists; more than half free means plausibly under-filled. Under SERVER_MODE IO stress, neighbor_fetch_mode downgrades to OLD_PAGE_IF_IN_BUFFER so vacuum does not stall on cold siblings — a missing neighbor leaves right_page NULL, er_clear masks the benign miss, and the code falls through to a plain advance. Here too, an already-WRITE descent skips the promotion block and merges directly; promotion is attempted only on the READ path. There is no left-neighbor branch: only the right sibling is tried, relying on the descent visiting every node so each pair is checked from its left member.

9.2 The mergeability classifier: btree_node_mergeable

Section titled “9.2 The mergeability classifier: btree_node_mergeable”

btree_node_mergeable returns three-valued BTREE_MERGE_STATUS: BTREE_MERGE_NO (skip), BTREE_MERGE_TRY (merge if latches promote cheaply), BTREE_MERGE_FORCE (mandatory; restart exclusively if promotion fails). Figure 9-2 enumerates every return.

flowchart TD
  A["node_mergeable(L,R)"] --> B{"L_cnt == 0 ?"}
  B -->|yes| F1["FORCE: left empty"]
  B -->|no| C{"L non-fence == 0 ?"}
  C -->|yes| D["R_used = uncompressed(R)"]
  D --> D1{"R + FORCE_EMPTY < PAGESIZE"}
  D1 -->|yes| F2["FORCE"]
  D1 -->|no| D2{"R + CAN_EMPTY < PAGESIZE"}
  D2 -->|yes| T1["TRY"]
  D2 -->|no| N1["NO"]
  C -->|no| E{"R_cnt == 0 ?"}
  E -->|yes| F3["FORCE: right empty"]
  E -->|no| G{"R non-fence == 0 ?"}
  G -->|yes| H["symmetric on L_used"]
  G -->|no| I{"L==1 and R==1 key ?"}
  I -->|yes| F4["FORCE: both single-key"]
  I -->|no| J["used bytes"]
  J --> K{"L+R+CAN_EMPTY < PAGESIZE ?"}
  K -->|no| N2["NO"]
  K -->|yes| Lr["recompute uncompressed for LEAF"]
  Lr --> Mm{"uncompressed still fits ?"}
  Mm -->|no| N3["NO"]
  Mm -->|yes| O{"L+R+FORCE_EMPTY < PAGESIZE ?"}
  O -->|yes| F5["FORCE"]
  O -->|no| T2["TRY"]

Figure 9-2. Return classification of btree_node_mergeable.

It counts keys with btree_node_number_of_keys, then a non-fence count by subtracting lower fence (slot 1) and upper fence (slot L_cnt) when btree_is_fence_key reports them. Three cases:

  • One side empty / fence-only. L_cnt == 0 -> FORCE. If L has only fences (L_non_fence_cnt == 0), R’s uncompressed size decides (btree_node_size_uncompressed, fences re-derived after merge): fits with FORCE_MERGE_WHEN_EMPTY slack -> FORCE; only with CAN_MERGE_WHEN_EMPTY -> TRY; else NO. R symmetric.
  • Both single-key. L_non_fence_cnt == 1 && R_non_fence_cnt == 1 -> FORCE.
  • Size test. With used bytes, if L_used + R_used + CAN_MERGE_WHEN_EMPTY < DB_PAGESIZE it is a candidate. Leaf nodes re-check with uncompressed sizes (merge discards the shared fence prefix, so post-merge bytes can exceed the naive sum):
// btree_node_mergeable -- src/storage/btree.c
if (l_node_type == BTREE_LEAF_NODE)
{
L_used = btree_node_size_uncompressed (thread_p, btid, L_page);
if (L_used < 0) { return BTREE_MERGE_NO; }
if (L_used + R_used + CAN_MERGE_WHEN_EMPTY >= DB_PAGESIZE) { return BTREE_MERGE_NO; }
R_used = btree_node_size_uncompressed (thread_p, btid, R_page);
if (L_used + R_used + CAN_MERGE_WHEN_EMPTY >= DB_PAGESIZE) { return BTREE_MERGE_NO; }
}
if (L_used + R_used + FORCE_MERGE_WHEN_EMPTY < DB_PAGESIZE) { return BTREE_MERGE_FORCE; }
return BTREE_MERGE_TRY;

The TRY/FORCE asymmetry is the 0.33/0.66 split. The size-fails case reaches a final return BTREE_MERGE_NO.

btree_merge_node(P, left_pg, right_pg, p_slot_id, child_vpid, status) folds right_pg into left_pg, deletes the separator at parent slot p_slot_id (the slot pointing at the right page), and recycles the right page, returning the surviving VPID in *child_vpid (always left_vpid). The 12 STEP comments map to the recovery story; Figure 9-3 is the flow.

flowchart TD
  A["btree_merge_node"] --> B["STEP 0-1: classify fences\nkeep left lower + right upper"]
  B --> C["STEP 2-3: gather rec[] in merge_buf\nrecompress if merged_prefix changed"]
  C --> D["STEP 4: undo-log left RVBT_COPYPAGE"]
  D --> E["STEP 5-7: remove upper fence,\nspage_insert rec[], append fence"]
  E --> F["STEP 8: RVBT_NDRECORD_DEL +\nspage_delete parent slot, child_vpid=left"]
  F --> G["STEP 9-10: left next_vpid past right,\nmax_key_len, redo-log left"]
  G --> H["STEP 11: right node_level=-1\nunder RVBT_MARK_DEALLOC_PAGE"]
  H --> I["STEP 12: btree_get_next_page +\nset_vpid_previous_vpid -> left"]
  E -->|spage failure| X["assert_release, goto exit_on_error"]
  F -->|delete != slot| X

Figure 9-3. STEP flow of btree_merge_node with its error exits.

STEP 0–3. Leaf pages carry a lower fence (slot 1) and upper fence (slot L_cnt) when flagged BTREE_LEAF_RECORD_FENCE. The merged page keeps the left lower fence and right upper fence; the left’s upper fence and right’s lower fence (the vanished boundary) are discarded. The right upper fence is copied out early (merged_upper_fence_record) since the right page will be recycled. Records are gathered into a scratch rec[] backed by merge_buf (grown via db_private_alloc if left_used + right_used + MAX_MERGE_ALIGN_WASTE overflows the stack buffer), re-compressing each with btree_recompress_record when the merged prefix differs from the source.

Invariant — prefix monotonicity. The merged prefix can never be longer than either source prefix. For MIDXKEY indexes with both fences, merged_prefix = pr_midxkey_common_prefix(left_fence_key, right_fence_key); otherwise it stays 0 (no-recompress fast path). The code only ever re-compresses (expands) records; compressing against a prefix longer than merged_prefix would corrupt the key bytes.

STEP 4 — undo image of left. Before mutating left, its full image is logged for undo:

// btree_merge_node -- src/storage/btree.c (STEP 4)
log_append_undo_data2 (thread_p, RVBT_COPYPAGE, &btid->sys_btid->vfid, left_pg, -1, DB_PAGESIZE, left_pg);

This physical undo half pairs with the STEP 10 redo so a rolled-back delete reconstructs the pre-merge left page byte-for-byte (9.5).

STEP 5–7 — rewrite left. Remove the left upper fence (if present); if the prefix changed, delete and re-insert all left non-fence records, else keep them; spage_insert the gathered rec[]; append the merged upper fence. Each spage_* failure is assert_release(false) + goto exit_on_error.

STEP 8 — fold the separator out of the parent. The structural heart. The parent record at p_slot_id is read; an overflow key (nleaf_pnt.key_len < 0) has its pages freed via btree_delete_overflow_key; then the separator is logged and deleted:

// btree_merge_node -- src/storage/btree.c (STEP 8)
btree_rv_write_log_record (recset_data, &recset_length, &peek_rec, BTREE_NON_LEAF_NODE);
log_append_undoredo_data2 (thread_p, RVBT_NDRECORD_DEL, &btid->sys_btid->vfid, P, p_slot_id,
recset_length, sizeof (p_slot_id), recset_data, &p_slot_id);
if (spage_delete (thread_p, P, p_slot_id) != p_slot_id) { /* ... goto exit_on_error ... */ }
*child_vpid = *left_vpid; /* <- surviving child is always the left page */

Deleting the right page’s slot leaves the left page’s slot still pointing at left_pg, now spanning both old ranges. RVBT_NDRECORD_DEL carries the old record (undo: re-insert) and the slot id (redo: delete).

STEP 9–11 — left header, redo, mark right dead. Left next_vpid is repointed past the right page (= right_header->next_vpid), max_key_len maxed, split-info reset, redo image logged via log_append_redo_data2(RVBT_COPYPAGE, left_pg, ...) before pgbuf_set_dirty. The right page is not freed here; its node_level is set to -1 under RVBT_MARK_DEALLOC_PAGE so a concurrent re-fix detects a dead page (BTREE_IS_PAGE_VALID_LEAF); file_dealloc is in the caller.

STEP 12 — relink the following node’s prev_vpid. The node after the right page still points its prev_vpid at the recycled right page. btree_get_next_page fetches it (NULL at end of chain) and btree_set_vpid_previous_vpid rewrites the back-link to left. This is the only branch touching a fourth page, so a merge can momentarily hold P, left, right, and next. The exit_on_error tail frees merge_buf_ptr if heap-allocated; the sysop abort (9.5) then undoes everything.

btree_merge_root(P, Q, R) is simpler than btree_merge_node because it does not move P’s separator into a sibling — it pours both children’s records into P itself. Q and R are untouched beyond their dealloc mark, and the root’s page identity never changes, so the b-tree header pointer stays valid.

Q’s upper fence (if btree_leaf_is_flaged(..., BTREE_LEAF_RECORD_FENCE)) is excluded by decrementing Q_end; R’s lower fence by advancing R_start (root level > 2 implies non-leaf children, so fences are usually absent, but the code defends anyway). The two root separators are removed with btree_delete_meta_record — slot 2 first, then slot 1, so numbering does not shift. Q’s records are saved with btree_rv_util_save_page_records and spage_insert_at into P at slots 1..Q_end, logged once via RVBT_INS_PGRECORDS; the per-iteration error path logs only the records that did land:

// btree_merge_root -- src/storage/btree.c
for (i = 1; i <= Q_end; i++)
{
if (spage_get_record (...) != S_SUCCESS || spage_insert_at (...) != SP_SUCCESS)
{
if (i > 1) { recset_header.rec_cnt = i - 1; /* undo-log the i-1 records already inserted */ }
goto exit_on_error;
}
}

R’s records follow at slots left_cnt + 1 .., logged the same way. After both blocks land, the root header is undo-logged, prev_vpid/next_vpid nulled, split-info defaulted, and node_level decremented:

// btree_merge_root -- src/storage/btree.c
btree_node_header_undo_log (thread_p, &btid->sys_btid->vfid, P);
VPID_SET_NULL (&root_header->node.prev_vpid);
VPID_SET_NULL (&root_header->node.next_vpid);
btree_write_default_split_info (&(root_header->node.split_info));
root_header->node.node_level--;
assert_release (root_header->node.node_level > 1);
btree_node_header_redo_log (thread_p, &btid->sys_btid->vfid, P);

Invariant — the root is the only level that shrinks the tree. node_level-- on the root frame is the single place height decreases. assert_release(node_level > 1) keeps the post-collapse root non-leaf — with the node_level > 2 precheck, the tree never collapses a level-3 root to a leaf in one step, nor to a single page. The #ifndef NDEBUG block also asserts the root’s max_key_len equals MAX(q,r max_key_len). Decrementing node_level without first copying the children’s records would make descent index a separator-less level and read garbage slots.

Q and R are each marked dead (node_level = -1, undo-logged via RVBT_MARK_DEALLOC_PAGE) but not deallocated — the caller does file_dealloc after btree_merge_root returns.

Section titled “9.5 Back-link repair and the system-op / logical-undo bracket”

btree_set_vpid_previous_vpid keeps the doubly-linked leaf chain consistent after a merge removed the right page — the exact dual of the next-link rewrite a split performs.

// btree_set_vpid_previous_vpid -- src/storage/btree.c
header = btree_get_node_header (thread_p, page_p);
if (header == NULL) { return ER_FAILED; }
btree_node_header_undo_log (thread_p, &btid->sys_btid->vfid, page_p);
header->prev_vpid = *prev; /* <- repoint back-link to surviving left page */
btree_node_header_redo_log (thread_p, &btid->sys_btid->vfid, page_p);
pgbuf_set_dirty (thread_p, page_p, DONT_FREE);

An early if (page_p == NULL) return NO_ERROR; makes it a no-op when the merged left page was last in the chain. The same primitive serves split and merge.

The bracket. Each merge runs inside log_sysop_start / log_sysop_commit, enclosing all physical changes — RVBT_COPYPAGE on left, RVBT_INS_PGRECORDS on root, RVBT_NDRECORD_DEL on the parent, RVBT_MARK_DEALLOC_PAGE on the recycled page, and file_dealloc:

// btree_merge_node_and_advance -- src/storage/btree.c
log_sysop_start (thread_p);
is_system_op_started = true;
/* ... btree_merge_node / btree_merge_root, then file_dealloc of the recycled page(s) ... */
log_sysop_commit (thread_p);
is_system_op_started = false;

Invariant — a structural merge is atomic and self-undoing within the delete. On any goto error between start and commit, the error: label runs log_sysop_abort, replaying the undo records in reverse: the recycled page un-deallocated, the parent separator re-inserted, the left page restored from its RVBT_COPYPAGE image — re-splitting to the pre-merge shape so the delete retries. The sysop commits before the leaf-level object removal, so the merge is a structural (nested top) action: even if the enclosing logical delete rolls back, the committed merge is never logically undone, leaving the merged shape. Leave file_dealloc outside the sysop and a crash between commit and dealloc leaks the right page; commit the dealloc without the record moves and the parent points at a freed page.

Cross-reference Chapter 6: split logs RVBT_COPYPAGE redo on the new right page and RVBT_NDRECORD_INS on the parent; merge logs the symmetric RVBT_COPYPAGE undo+redo on the surviving left page and RVBT_NDRECORD_DEL on the parent — the undo of one is structurally the other.

  1. Merge is driven on descent, right-sibling only. btree_merge_node_and_advance checks each child against its right neighbor as it walks down; there is no left-merge branch, and the precheck requires both a right neighbor (slotid < key_count) and an under-filled child (free > PAGESIZE/2).
  2. Mergeability is three-valued. btree_node_mergeable returns NO/TRY/FORCE keyed on CAN_MERGE_WHEN_EMPTY (~33%) and FORCE_MERGE_WHEN_EMPTY (~66%); empty/fence-only and double-single-key pairs are always FORCE, and leaf pairs are re-checked against uncompressed size. TRY is skipped on a failed promotion while FORCE restarts the descent exclusively.
  3. Promotion is skipped when the descent is already exclusive. Both merge blocks promote latches only when nonleaf_latch_mode == PGBUF_LATCH_READ; on the restart (already-WRITE) path the merge runs directly.
  4. General merge folds the separator out of the parent. btree_merge_node copies right into left, deletes the parent slot pointing at the right page (RVBT_NDRECORD_DEL), repoints left_header->next_vpid, marks the right page dead (node_level = -1), and relinks the following node’s prev_vpid via btree_set_vpid_previous_vpid.
  5. Root collapse is the only height shrink. btree_merge_root pours both children’s records into the unchanged root and decrements node_level; the node_level > 2 precheck and assert_release(node_level > 1) keep the tree off a single page and the root non-leaf.
  6. The recycled page is marked, not freed, inside the merge. node_level = -1 (undo-logged with RVBT_MARK_DEALLOC_PAGE) lets concurrent re-fixers detect a dead page; file_dealloc happens in the driver, still inside the system op.
  7. The merge is self-undoing. log_sysop_start/commit brackets each merge so a mid-merge error aborts back to the exact pre-merge layout (re-split), while a committed merge survives even if the enclosing delete rolls back — mirroring Chapter 6’s split logging.

What builds or maintains the tree outside the single-key insert/delete lifecycle of Chapters 3-9: the two index-load paths, the leaf-record auxiliaries (fence keys, FK probes), unique-statistics aggregation, and deduplicate-key mode. For why a bulk build beats N inserts and what a fence key is for, see cubrid-btree.md (“Bulk Loading”, “Prefix Compression and Fence Keys”).

xbtree_load_index is the CREATE INDEX path used when no concurrent writer sees the table. It builds the tree invisibly in one top system operation, then commits atomically. Because nobody reaches the file until commit, latching is bypassed — pages are fixed PGBUF_LATCH_WRITE only for the buffer manager, never to coordinate transactions. Figure 10-1 traces every branch; the load is all-or-nothing (each goto error aborts the sysop, returns NULL). The zero-key branch recreates an empty index through the regular xbtree_add_index.

flowchart TD
  A["xbtree_load_index"] --> B["init SORT_ARGS + LOAD_ARGS\nset deduplicate_key_idx"]
  B --> C["btree_index_sort -> btree_construct_leafs callback"]
  C --> D{"any keys loaded?\nleaf.pgptr != NULL"}
  D -->|yes| E["btree_save_last_leafrec"]
  E --> F{"has_fk?"}
  F -->|yes| G["btree_load_check_fk"]
  F -->|no| H["btree_build_nleafs"]
  G --> H
  D -->|no| I["log_sysop_abort\nxbtree_add_index empty"]
  H --> J{"is_sysop_started?"}
  I --> J
  J -->|yes| K["attach to outer\nif unique: undo RVBT_REMOVE_UNIQUE_STATS"]
  J -->|no| L["nothing to log"]
  K --> M["return btid"]
  L --> M

Figure 10-1. xbtree_load_index control flow with empty-index and FK branches.

10.2 Packing leaves left-to-right: btree_construct_leafs and btree_first_oid

Section titled “10.2 Packing leaves left-to-right: btree_construct_leafs and btree_first_oid”

btree_construct_leafs is the sort callback; it folds each sorted (key, oid) into the current leaf. The comments label all five branches:

// btree_construct_leafs -- src/storage/btree_load.c
next = *(char **) recdes->data; /* <- save forward link before mutating recdes */
if (VPID_ISNULL (&(load_args->leaf.vpid))) /* <- A: very first record */
bt_load_get_first_leaf_page_and_init_args (...);
else {
int c = btree_compare_key (&sparam.this_key, &load_args->current_key, ...);
if (c == DB_GT) bt_load_make_new_record_on_leaf_page (...); /* <- B: new key */
else if (c == DB_EQ) bt_load_add_same_key_to_record (...); /* <- C: append OID */
else { assert_release (false); goto error; } /* <- D: out-of-order = corruption */
}
if (!next) break; /* <- E: chain exhausted */

INVARIANT (sorted-input monotonicity). btree_compare_key(this_key, current_key) is never DB_LT; the sort guarantees non-decreasing keys, so the else arm is assert_release(false). A buggy comparator would otherwise produce an unsorted leaf chain that scans (Ch 7) and searches (Ch 3) mis-navigate; the assert turns latent corruption into immediate failure.

bt_load_make_new_record_on_leaf_page applies the fill-factor: when (cur_maxspace - leaf_nleaf_recdes.length) < LOAD_FIXED_EMPTY_FOR_LEAF and spage_number_of_records > 1 it calls btree_proceed_leaf to start a fresh leaf (chaining prev/next) before spage_insert, then btree_first_oid. LOAD_FIXED_EMPTY_FOR_LEAF reserves slack for later in-place inserts; the > 1 guard prevents an infinite loop when one oversized key (already overflow-backed) exceeds the threshold. btree_first_oid branches on key size (key_len < BTREE_MAX_KEYLEN_INPAGEBTREE_NORMAL_KEY, else BTREE_OVERFLOW_KEY, creating the overflow-key file via btree_create_overflow_key_file if ovfid is null) and on the first object’s delete flag: MVCC_IS_HEADER_DELID_VALIDcurr_non_del_obj_count = 0 (only object already MVCC-deleted, key not live), else = 1 and ++n_keys. A key whose only object is already deleted does not count toward the distinct-key statistic — the bulk-load analogue of the runtime bookkeeping in §10.9.

10.3 Bottom-up internal levels and btree_load_new_page

Section titled “10.3 Bottom-up internal levels and btree_load_new_page”

btree_build_nleafs builds internal levels one at a time with two work lists (push_list/pop_list) that ping-pong (Figure 10-2). Every page goes through btree_load_new_page, whose node_level argument discriminates:

// btree_load_new_page -- src/storage/btree_load.c
assert (log_check_system_op_is_started (thread_p)); /* <- caller must hold a sysop */
log_sysop_start (thread_p); /* <- nested sysop just for this alloc */
file_alloc (thread_p, &btid->vfid, btree_initialize_new_page, NULL, vpid_new, page_new);
if (header) /* <- leaf/non-leaf (node_level >= 1) */
{ header->node_level = node_level; header->common_prefix = 0; btree_init_node_header (...); }
else /* <- overflow page (node_level == -1) */
{ VPID_SET_NULL (&ovf_header_info.next_vpid); btree_init_overflow_header (...); }
end: /* <- error: log_sysop_abort; success: log_sysop_commit (committed independently) */
if (error_code != NO_ERROR) log_sysop_abort (thread_p); else log_sysop_commit (thread_p);

INVARIANT (per-page allocation durability). Each allocation is wrapped in its own nested system operation that commits independently. Source comment: “we need to commit page allocations. if loading index is aborted, the entire file is destroyed.” Because the whole file is dropped on outer abort, the error path is sloppy about freeing individual pages — it relies on file_destroy at the top level. The assert enforces that no caller allocates a load page outside the protective top operation.

btree_proceed_leaf (§10.2) calls this with node_level == 1 per new leaf and threads prev_vpid/next_vpid so Chapter 7’s range scan chain is built right.

flowchart TD
  A["btree_build_nleafs"] --> B["pop children level k"]
  B --> C["btree_connect_page -> push parent level k+1"]
  C --> D{"pop_list length > 0?"}
  D -->|yes| B
  D -->|no| E["swap push_list / pop_list\nlevel := k+1"]
  E --> F{"more than one page\nat this level?"}
  F -->|yes| B
  F -->|no| G["write root header\nSET_DECOMPRESS_IDX_HEADER"]

Figure 10-2. Bottom-up internal-level construction in btree_build_nleafs.

10.4 Online (concurrent) index build: xbtree_load_online_index

Section titled “10.4 Online (concurrent) index build: xbtree_load_online_index”

The online path lets DML run while the index is built, differing from §10.1 three ways: an MVCC builder_snapshot gives a consistent table version; the schema lock is demoted per class from SCH_M_LOCK to IX_LOCK (lock_demote_class_lock) so writers are not blocked, then re-promoted at the end; and batched inserts go to a worker pool via online_index_builder. It sets tdes->has_deadlock_priority = true so the builder wins deadlocks against DML. The re-promotion loop is the most safety-critical edge path: the lock must return to SCH_M_LOCK before commit/rollback, else the catalog mutation races concurrent readers.

// xbtree_load_online_index -- src/storage/btree_load.c
xlogtb_reset_wait_msecs (thread_p, LK_INFINITE_WAIT); /* <- never give up */
while (true) {
lock_ret = lock_object (..., SCH_M_LOCK, LK_UNCOND_LOCK);
if (lock_ret == LK_GRANTED) break;
else if (lock_ret == LK_NOTGRANTED_DUE_ERROR && er_errid () == ER_INTERRUPTED)
{ er_clear (); logtb_set_tran_index_interrupt (...); continue; } /* <- swallow interrupt, retry */
else if (lock_ret == LK_NOTGRANTED_DUE_TIMEOUT && css_is_shutdowning_server ())
{ er_clear (); continue; } /* <- shutdown: still must promote */
assert (0); /* <- any other outcome unacceptable */
}

Only after the lock is back does it honor the builder’s return code (if (ret != NO_ERROR) goto error;). For unique indexes it re-validates uniqueness across all classes via btree_online_index_check_unique_constraint — a concurrent inserter might have created a collision the snapshot builder could not see. The snapshot is invalidated on both exits.

10.5 The insert-list dispatcher: online_index_builder

Section titled “10.5 The insert-list dispatcher: online_index_builder”

online_index_builder is the producer. Its heap_next loop branches on the scan result: S_END breaks (heap exhausted), S_ERROR sets the error and breaks, any other non-S_SUCCESS trips assert (false). With a filter_pred (partial index) a V_ERROR aborts and a non-V_TRUE continues to skip the row. Otherwise it generates the key (heap_attrinfo_generate_key; NULLER_FAILED), lazily allocates an index_builder_loader_task, and calls add_key (§10.6); a BATCH_FULL return pushes the task to ib_workpool and bumps tasks_started. A set load_context.m_has_error aborts with ER_IB_ERROR_ABORT. After the loop it pushes the final partial task, spins until m_tasks_executed == tasks_started, and for a unique index calls logtb_tran_update_btid_unique_stats with (num_keys, num_oids, num_nulls) (§10.9).

Each task’s execute sorts its batch (prepare_list) then loops calling btree_online_index_list_dispatcher once per key with the purpose BTREE_OP_ONLINE_INDEX_IB_INSERT; on error it sets m_load_context.m_has_error (atomic exchange) and breaks. The BTREE_OP_ONLINE_INDEX_* enum (in btree.h) is the routing key the regular insert/delete machinery checks. Its seven values split into the builder’s _IB_INSERT/_IB_DELETE; the concurrent-DML _TRAN_INSERT, _TRAN_INSERT_DF (object pre-flagged DELETE so vacuum reclaims it without a second descent — used when an UPDATE inserts a row it is about to delete), _TRAN_DELETE; and the rollback _UNDO_TRAN_INSERT/_UNDO_TRAN_DELETE.

10.6 add_key: NULL handling and the batch boundary

Section titled “10.6 add_key: NULL handling and the batch boundary”

index_builder_loader_task::add_key filters NULLs from the physical tree but still counts them for stats:

// index_builder_loader_task::add_key -- src/storage/btree_load.c
if (DB_IS_NULL (key) || btree_multicol_key_is_null (const_cast<DB_VALUE *>(key)))
{ if (BTREE_IS_UNIQUE (m_unique_pk)) ++m_insert_list.m_ignored_nulls_cnt; /* <- count, don't store */
return BATCH_CONTINUE; }
m_memsize += m_insert_list.add_key (key, oid);
return (m_memsize > prm_get_bigint_value (PRM_ID_IB_TASK_MEMSIZE)) ? BATCH_FULL : BATCH_CONTINUE;

INVARIANT (NULLs are tracked, never stored). A NULL (or all-NULL multi-column) key is never placed in the tree, but for a unique index it increments m_ignored_nulls_cnt, which execute folds into both m_num_oids and m_num_nulls. This preserves rows == keys + nulls (§10.9): NULLs add a row and a null but no key, as the SQL standard requires (multiple NULLs do not violate uniqueness). Storing NULLs, or failing to count them, would make the count query and the uniqueness check diverge.

A fence key is a sentinel leaf record (flagged BTREE_LEAF_RECORD_FENCE) recording a leaf’s boundary without a real object. CUBRID uses fences as the prefix-compression anchor: every key’s common prefix is computed against the leaf’s fence, so records store only their suffix. Readers must skip them. The decompressor in btree_read_record rebuilds the full key only for a non-fence record on a page with positive prefix:

// btree_read_record (decompress branch) -- src/storage/btree.c
if (node_type == BTREE_LEAF_NODE && !btree_leaf_is_flaged (rec, BTREE_LEAF_RECORD_OVERFLOW_KEY)
&& !btree_leaf_is_flaged (rec, BTREE_LEAF_RECORD_FENCE)) { /* <- never decompress a fence */
int n_prefix = btree_node_get_common_prefix (thread_p, btid, pgptr);
if (n_prefix > 0) {
spage_get_record (thread_p, pgptr, 1, &peek_rec, PEEK); /* <- slot 1 = fence */
assert (btree_leaf_is_flaged (&peek_rec, BTREE_LEAF_RECORD_FENCE));
pr_midxkey_add_prefix (&result, &lf_key, key, n_prefix); /* <- fence prefix + record suffix */
}
}

INVARIANT (slot 1 is the prefix anchor). When a leaf has a positive common_prefix, slot 1 holds the fence key and the decompressor asserts it. A non-fence record at slot 1 would make pr_midxkey_add_prefix graft the wrong prefix and produce corrupt keys.

Scan skipping is the mirror image. In btree_search_leaf_page’s binary search, an exact DB_EQ match against a fence is never a hit:

// btree_search_leaf_page -- src/storage/btree.c
if (c == DB_EQ) {
if (btree_leaf_is_flaged (&rec, BTREE_LEAF_RECORD_FENCE)) {
assert (middle == 1 || middle == key_cnt);
if (middle == 1) c = DB_GT; /* <- left fence: fall through to next key */
else if (middle == key_cnt)
{ search_key->result = BTREE_KEY_BIGGER; search_key->slotid = key_cnt; return NO_ERROR; }
}
else { search_key->result = BTREE_KEY_FOUND; search_key->slotid = middle; return NO_ERROR; }
}

This differs from the first-key fast check btree_leaf_is_key_between_min_max, which on a fence at slot 1 sets result = BTREE_KEY_BETWEEN, slotid = -1 to force a full search. Either way a range scan (Ch 7) never returns a fence object.

10.8 Foreign-key probes: btree_find_foreign_key and btree_check_foreign_key

Section titled “10.8 Foreign-key probes: btree_find_foreign_key and btree_check_foreign_key”

Both FK helpers reuse the range scan; they differ in direction. btree_check_foreign_key runs on the referencing side — verifying a child key exists in the parent’s PK index. The annotations label every branch:

// btree_check_foreign_key -- src/storage/btree.c
if (has_null == true) return NO_ERROR; /* <- 1: NULL FK column trivially satisfied */
classrepr = heap_classrepr_get (...);
if (classrepr == NULL) goto exit_on_error; /* <- 2: no repr -> cleanup + normalize errid */
if (classrepr->has_partition_info > 0) { /* <- 3: partitioned parent -> prune to local btid */
partition_init_pruning_context (&pcontext); partition_load_pruning_context (...); }
BTID_COPY (&local_btid, pk_btid);
if (classrepr->has_partition_info > 0 && pcontext.partitions != NULL)
partition_prune_unique_btid (&pcontext, keyval, &part_oid, &class_hfid, &local_btid);
ret_search = xbtree_find_unique (thread_p, &local_btid, S_SELECT_WITH_LOCK, keyval, ...);
if (ret_search == BTREE_KEY_NOTFOUND) /* <- 4a: no parent row -> FK violation */
{ er_set (...ER_FK_INVALID...); ret = ER_FK_INVALID; goto exit_on_error; }
else if (ret_search == BTREE_ERROR_OCCURRED) { ASSERT_ERROR_AND_SET (ret); goto exit_on_error; } /* <- 4b */
assert (ret_search == BTREE_KEY_FOUND); /* <- 4c: parent exists -> ok */
// ... success and exit_on_error share cleanup: clear pcontext + heap_classrepr_free_and_init ...

Success and exit_on_error share the cleanup; only exit_on_error forces a non-NO_ERROR return via (ret == NO_ERROR && er_errid() == NO_ERROR) ? ER_FAILED : ret.

btree_find_foreign_key runs on the referenced side — before deleting/updating a parent PK it must find any child row still referencing the key, locking it so the check is stable. It calls btree_remake_foreign_key_with_PK (yielding is_newly); when !is_newly it pr_share_values the key into both range bounds, sets range = GE_LE, find_fk_object.lock_mode = S_LOCK, threads btree_scan.is_fk_remake = is_newly, and drives btree_range_scan with the btree_range_scan_find_fk_any_object visitor. On error or no match (OID_ISNULL (&find_fk_object.found_oid)) it releases any lock taken (lock_unlock_object_donot_move_to_non2pl if locked_object set). The is_newly flag plumbs deduplicate-key mode: btree_remake_foreign_key_with_PK may rebuild the probe key so range bounds match the on-disk dedup-augmented key; when it does the original key is not shared and the rebuilt copy is cleared on exit (§10.10).

10.9 Unique statistics: btree_unique_stats and multi_index_unique_stats

Section titled “10.9 Unique statistics: btree_unique_stats and multi_index_unique_stats”

A unique index keeps three counters so COUNT(*), COUNT(DISTINCT key), and the uniqueness check answer without scanning. btree_unique_stats is the per-index triple (stat_type m_rows, m_keys, m_nulls; in btree_unique.hpp).

FieldRoleWhy it exists
m_rowstotal objects (OIDs), including NULL-keyed onesanswers COUNT(*); left side of the uniqueness identity
m_keysnumber of non-NULL objects (one per row with a real key)the “distinct enough” count; participates in is_unique
m_nullsnumber of objects whose indexed key is NULLNULLs are exempt from uniqueness, so tallied separately

The mutators move two counters together; no setter touches one alone:

// btree_unique_stats mutators -- src/storage/btree_unique.cpp
void insert_key_and_row () { ++m_keys; ++m_rows; }
void insert_null_and_row () { ++m_nulls; ++m_rows; }
void add_row () { ++m_rows; } /* <- dup OID of existing key */
void delete_key_and_row () { --m_keys; --m_rows; }
void delete_null_and_row () { --m_nulls; --m_rows; }
void delete_row () { --m_rows; }
bool is_unique () const { return m_rows == m_keys + m_nulls; }

INVARIANT (m_rows == m_keys + m_nulls). Enforced structurally: every mutator changing m_keys or m_nulls changes m_rows by the same amount, and add_row/delete_row (a dup OID of an existing key) move only m_rows. A unique index is consistent exactly when is_unique() is true; `m_rows > m_keys

  • m_nullsmeans some key has two live OIDs — a violation. The+=/-=` operators move all three counters componentwise, so aggregating partial stats from worker threads (§10.5) cannot break the identity.

multi_index_unique_stats aggregates several indexes touched by one statement (multi-column, partitioned, or the online builder’s per-class loop). Its btid_comparator orders by (root_pageid, vfid.volid).

FieldRoleWhy it exists
m_stats_mapstd::map<BTID, btree_unique_stats, btid_comparator>one stats triple per index touched; the map shape means an index appears at most once

Mutators delegate to operator+=: add_index_stats(index, us) does m_stats_map[index] += us (asserting !BTID_IS_NULL) and operator+=(other) folds every entry the same way, inheriting the per-index identity. construct/destruct use placement new/delete because the object can live inside a transaction descriptor whose memory is managed elsewhere.

Feeding GLOBAL_UNIQUE_STATS_TABLE. Per-transaction stats are local until commit, then flushed into the server-global GLOBAL_UNIQUE_STATS_TABLE (an LF_HASH_TABLE keyed by BTID). Two recovery hooks replay the logged flush. btree_rv_redo_global_unique_stats_commit unpacks (btid, num_nulls, num_oids, num_keys) and calls logtb_rv_update_global_unique_stats_by_abs — replaying the absolute totals (idempotent). btree_rv_undo_global_unique_stats_commit unpacks the same fields but calls logtb_update_global_unique_stats_by_delta with negated counts, reversing the aborted transaction’s contribution; during LOG_RECOVERY_UNDO_PHASE it first checks disk_is_page_sector_reserved, and if the index file is gone returns NO_ERROR — honoring the rule that broken references are tolerated. This is the RVBT_REMOVE_UNIQUE_STATS lifecycle §10.1 hooks for the bulk-load undo.

10.10 Deduplicate-key mode: BTID_INT.deduplicate_key_idx

Section titled “10.10 Deduplicate-key mode: BTID_INT.deduplicate_key_idx”

SUPPORT_DEDUPLICATE_KEY_MODE lets a non-unique index store duplicate keys efficiently by appending a hidden discriminator column. The position is computed once at load (dk_get_deduplicate_key_position) into btid_int.deduplicate_key_idx and persisted into the root header at finalization via SET_DECOMPRESS_IDX_HEADER (root_header, load_args->btid->deduplicate_key_idx), read back on every open via GET_DECOMPRESS_IDX_HEADER. Three places diverge from the regular path: (1) range scan / prefix length sets env->same_prefix_len = ...deduplicate_key_idx so the dedup column bounds user-key prefix from discriminator suffix; (2) foreign-key remake (§10.8) rebuilds the probe key to include/strip the dedup column via the is_newly/is_fk_remake flags; (3) the same header field doubles as the compression-index marker (SET/GET_DECOMPRESS_IDX_HEADER), reusing the slot rather than spending a new field. For a plain index deduplicate_key_idx is the sentinel dk_get_deduplicate_key_position returns when no dedup column applies, and all three branches collapse to the ordinary behavior of Chapters 3, 7, 8.

  1. Offline bulk load builds invisibly, then commits atomically. xbtree_load_index sorts the heap, packs leaves left-to-right via the btree_construct_leafs callback, builds internal levels bottom-up, skips runtime latching; any error or empty result aborts the whole file.
  2. Sorted input is a load-time invariant. btree_construct_leafs trusts btree_compare_key to never return DB_LT; the assert_release converts a comparator bug into immediate failure, not a silently unsorted chain.
  3. Every load page allocation is its own committed sysop. btree_load_new_page brackets file_alloc in a nested system operation that commits independently, relying on whole-file destruction for cleanup; node_level discriminates leaf/non-leaf from overflow.
  4. Online build trades a sort for a snapshot, lock demotion, and a worker pool. xbtree_load_online_index demotes SCH_M_LOCK to IX_LOCK, drives online_index_builder to dispatch batched BTREE_OP_ONLINE_INDEX_IB_INSERT tasks, then must re-promote the schema lock in an uninterruptible loop and re-check uniqueness for concurrent collisions.
  5. NULL keys are tracked, never stored. add_key skips NULLs from the tree but counts them for unique stats, preserving rows == keys + nulls.
  6. Fence keys anchor prefix compression and are skipped by readers. A BTREE_LEAF_RECORD_FENCE at slot 1 supplies the common prefix; the decompressor never decompresses a fence, and btree_search_leaf_page on an equal-but-fence match either falls through (c = DB_GT at slot 1) or returns BTREE_KEY_BIGGER (slot key_cnt) instead of BTREE_KEY_FOUND.
  7. Unique stats are a structurally-enforced triple feeding a global hash. btree_unique_stats moves m_keys/m_nulls in lockstep with m_rows; multi_index_unique_stats aggregates per BTID; commit flushes into GLOBAL_UNIQUE_STATS_TABLE with redo writing absolute totals and undo subtracting a delta (tolerating an already-dropped index). deduplicate_key_idx, stamped into the root header at build time, bends the range-scan, FK-remake, and compression paths into dedup-key mode.

The following are line numbers as observed on 2026-06-08; symbols are the canonical anchor and line numbers are hints that decay.

SymbolFileLine
OR_OID_SIZEsrc/base/object_representation_constants.h67
OR_OID_PAGEIDsrc/base/object_representation_constants.h68
OR_OID_SLOTIDsrc/base/object_representation_constants.h69
OR_OID_VOLIDsrc/base/object_representation_constants.h70
BTREE_SPLIT_LOWER_BOUNDsrc/storage/btree.c79
BTREE_SPLIT_MIN_PIVOTsrc/storage/btree.c82
BTREE_SPLIT_DEFAULT_PIVOTsrc/storage/btree.c85
BTREE_NODE_MAX_SPLIT_SIZEsrc/storage/btree.c88
CAN_MERGE_WHEN_EMPTYsrc/storage/btree.c101
FORCE_MERGE_WHEN_EMPTYsrc/storage/btree.c104
BTREE_LEAF_RECORD_FENCEsrc/storage/btree.c113
BTREE_LEAF_RECORD_OVERFLOW_OIDSsrc/storage/btree.c115
BTREE_LEAF_RECORD_OVERFLOW_KEYsrc/storage/btree.c117
BTREE_LEAF_RECORD_CLASS_OIDsrc/storage/btree.c119
BTREE_LEAF_RECORD_MASKsrc/storage/btree.c121
BTREE_OID_HAS_MVCC_INSIDsrc/storage/btree.c124
BTREE_OID_HAS_MVCC_DELIDsrc/storage/btree.c126
BTREE_RECORD_OR_BUF_INITsrc/storage/btree.c260
BTREE_GET_MVCC_INFO_SIZE_FROM_FLAGSsrc/storage/btree.c278
BTREE_MERGE_NOsrc/storage/btree.c311
BTREE_MERGE_TRYsrc/storage/btree.c312
BTREE_MERGE_FORCEsrc/storage/btree.c313
btree_search_key_helpersrc/storage/btree.c358
BTREE_FIND_UNIQUE_HELPERsrc/storage/btree.c387
BTS_IS_SOFT_CAPACITY_ENOUGHsrc/storage/btree.c568
BTS_IS_HARD_CAPACITY_ENOUGHsrc/storage/btree.c582
BTS_SAVE_OID_IN_BUFFERsrc/storage/btree.c592
btree_insert_helpersrc/storage/btree.c662
BTREE_INSERT_HELPERsrc/storage/btree.c662
btree_delete_helpersrc/storage/btree.c766
BTREE_DELETE_HELPERsrc/storage/btree.c766
btree_or_get_objectsrc/storage/btree.c1458
btree_fix_root_with_infosrc/storage/btree.c1850
btree_create_overflow_key_filesrc/storage/btree.c1975
btree_leaf_record_change_overflow_linksrc/storage/btree.c2318
btree_leaf_get_first_objectsrc/storage/btree.c2451
btree_record_get_num_oidssrc/storage/btree.c2757
btree_leaf_change_first_objectsrc/storage/btree.c2822
btree_leaf_set_flagsrc/storage/btree.c3461
btree_record_object_set_mvcc_flagssrc/storage/btree.c3480
btree_append_oidsrc/storage/btree.c3637
btree_record_append_objectsrc/storage/btree.c3798
btree_start_overflow_pagesrc/storage/btree.c3973
btree_get_disk_size_of_keysrc/storage/btree.c4065
btree_write_recordsrc/storage/btree.c4100
btree_read_recordsrc/storage/btree.c4257
btree_read_record_without_decompressionsrc/storage/btree.c4515
btree_search_nonleaf_pagesrc/storage/btree.c5189
btree_leaf_is_key_between_min_maxsrc/storage/btree.c5369
btree_search_leaf_pagesrc/storage/btree.c5537
btree_find_foreign_keysrc/storage/btree.c6361
btree_delete_key_from_leafsrc/storage/btree.c9653
btree_replace_first_oid_with_ovfl_oidsrc/storage/btree.c9781
btree_modify_leaf_ovfl_vpidsrc/storage/btree.c9978
btree_modify_overflow_linksrc/storage/btree.c10054
btree_delete_meta_recordsrc/storage/btree.c10148
btree_write_default_split_infosrc/storage/btree.c10247
btree_merge_rootsrc/storage/btree.c10284
btree_merge_nodesrc/storage/btree.c10557
btree_node_size_uncompressedsrc/storage/btree.c11097
btree_node_mergeablesrc/storage/btree.c11172
btree_key_append_object_as_new_overflowsrc/storage/btree.c11330
btree_find_oid_and_its_pagesrc/storage/btree.c11646
btree_find_oid_does_mvcc_info_matchsrc/storage/btree.c11794
btree_find_oid_from_leafsrc/storage/btree.c11946
btree_find_oid_from_ovflsrc/storage/btree.c12037
btree_find_split_pointsrc/storage/btree.c12419
btree_split_find_pivotsrc/storage/btree.c12849
btree_split_next_pivotsrc/storage/btree.c12874
btree_recompress_recordsrc/storage/btree.c13118
btree_split_nodesrc/storage/btree.c13324
btree_split_rootsrc/storage/btree.c14184
btree_locate_keysrc/storage/btree.c14882
btree_keyval_searchsrc/storage/btree.c15458
btree_scan_update_rangesrc/storage/btree.c16055
btree_rv_util_save_page_recordssrc/storage/btree.c17242
btree_get_next_pagesrc/storage/btree.c19384
btree_set_vpid_previous_vpidsrc/storage/btree.c19435
btree_compare_keysrc/storage/btree.c19460
btree_check_foreign_keysrc/storage/btree.c22655
btree_rv_undo_global_unique_stats_commitsrc/storage/btree.c23050
btree_rv_redo_global_unique_stats_commitsrc/storage/btree.c23120
btree_search_key_and_apply_functionssrc/storage/btree.c23186
btree_get_root_with_keysrc/storage/btree.c23390
btree_advance_and_find_keysrc/storage/btree.c23455
btree_key_find_and_lock_uniquesrc/storage/btree.c23645
btree_key_find_and_lock_unique_of_uniquesrc/storage/btree.c23673
btree_key_find_and_lock_unique_of_non_uniquesrc/storage/btree.c23903
btree_range_scan_startsrc/storage/btree.c24926
btree_range_scan_resumesrc/storage/btree.c25024
btree_range_scan_read_recordsrc/storage/btree.c25185
btree_range_scan_advance_over_filtered_keyssrc/storage/btree.c25230
btree_range_scan_descending_fix_prev_leafsrc/storage/btree.c25451
btree_range_scansrc/storage/btree.c25794
btree_range_scan_select_visible_oidssrc/storage/btree.c26008
btree_select_visible_object_for_range_scansrc/storage/btree.c26330
btree_range_scan_find_fk_any_objectsrc/storage/btree.c26545
btree_insert_internalsrc/storage/btree.c26976
btree_fix_root_for_insertsrc/storage/btree.c27154
btree_get_max_new_data_sizesrc/storage/btree.c27429
btree_split_node_and_advancesrc/storage/btree.c27495
btree_key_insert_new_objectsrc/storage/btree.c28120
btree_key_insert_new_keysrc/storage/btree.c28302
btree_key_lock_and_append_object_uniquesrc/storage/btree.c28572
btree_key_append_object_non_uniquesrc/storage/btree.c28927
btree_key_append_object_uniquesrc/storage/btree.c29042
btree_key_relocate_last_into_ovfsrc/storage/btree.c29156
btree_key_append_object_into_ovfsrc/storage/btree.c29309
btree_delete_internalsrc/storage/btree.c30525
btree_fix_root_for_deletesrc/storage/btree.c30687
btree_merge_node_and_advancesrc/storage/btree.c30895
btree_key_delete_remove_objectsrc/storage/btree.c31493
btree_key_remove_object_and_keep_visible_firstsrc/storage/btree.c31753
btree_leaf_record_replace_first_with_lastsrc/storage/btree.c32046
btree_record_remove_objectsrc/storage/btree.c32137
btree_record_remove_object_internalsrc/storage/btree.c32230
btree_key_remove_objectsrc/storage/btree.c32298
btree_overflow_remove_objectsrc/storage/btree.c32344
btree_leaf_remove_objectsrc/storage/btree.c32517
btree_key_remove_insert_mvccidsrc/storage/btree.c32637
btree_key_remove_delete_mvccidsrc/storage/btree.c32818
btree_key_remove_delete_mvccid_uniquesrc/storage/btree.c32986
btree_remove_delete_mvccid_unique_internalsrc/storage/btree.c33113
btree_key_remove_delete_mvccid_non_uniquesrc/storage/btree.c33257
btree_record_remove_insidsrc/storage/btree.c33419
btree_record_remove_delidsrc/storage/btree.c33479
btree_online_index_list_dispatchersrc/storage/btree.c34209
btree_rv_log_delete_objectsrc/storage/btree.c35823
btree_rv_log_insert_objectsrc/storage/btree.c35885
BTREE_NEED_UNIQUE_CHECKsrc/storage/btree.h63
BTREE_NODE_TYPEsrc/storage/btree.h81
non_leaf_recsrc/storage/btree.h103
leaf_recsrc/storage/btree.h111
btid_intsrc/storage/btree.h119
BTID_INT.deduplicate_key_idxsrc/storage/btree.h133
btree_keyrangesrc/storage/btree.h139
bts_key_statussrc/storage/btree.h151
btree_scansrc/storage/btree.h198
BTREE_INIT_SCANsrc/storage/btree.h297
BTREE_END_OF_SCANsrc/storage/btree.h368
btree_op_purposesrc/storage/btree.h535
BTREE_OP_DELETE_OBJECT_PHYSICALsrc/storage/btree.h549
BTREE_OP_ONLINE_INDEX_IB_INSERTsrc/storage/btree.h565
btree_mvcc_infosrc/storage/btree.h581
btree_object_infosrc/storage/btree.h594
BTREE_RANGE_SCAN_PROCESS_KEY_FUNCsrc/storage/btree.h710
btree_node_header_undo_logsrc/storage/btree_load.c410
btree_node_header_redo_logsrc/storage/btree_load.c432
xbtree_load_indexsrc/storage/btree_load.c856
dk_get_deduplicate_key_positionsrc/storage/btree_load.c925
btree_save_last_leafrecsrc/storage/btree_load.c1275
SET_DECOMPRESS_IDX_HEADERsrc/storage/btree_load.c1929
btree_load_new_pagesrc/storage/btree_load.c2030
btree_proceed_leafsrc/storage/btree_load.c2122
btree_first_oidsrc/storage/btree_load.c2199
bt_load_make_new_record_on_leaf_pagesrc/storage/btree_load.c2582
LOAD_FIXED_EMPTY_FOR_LEAFsrc/storage/btree_load.c2591
btree_construct_leafssrc/storage/btree_load.c3011
xbtree_load_online_indexsrc/storage/btree_load.c4992
online_index_buildersrc/storage/btree_load.c5287
index_builder_loader_task::add_keysrc/storage/btree_load.c5526
index_builder_loader_task::executesrc/storage/btree_load.c5562
LEAF_FENCE_MAX_SIZEsrc/storage/btree_load.h89
NON_LEAF_ENTRY_MAX_SIZEsrc/storage/btree_load.h96
BTREE_OBJECT_FIXED_SIZEsrc/storage/btree_load.h130
BTREE_MAX_KEYLEN_INPAGEsrc/storage/btree_load.h147
BTREE_MAX_OIDCOUNT_IN_LEAF_RECORDsrc/storage/btree_load.h155
BTREE_GET_KEY_LEN_IN_PAGEsrc/storage/btree_load.h167
btree_node_headersrc/storage/btree_load.h206
btree_root_headersrc/storage/btree_load.h218
btree_unique_stats::insert_key_and_rowsrc/storage/btree_unique.cpp72
btree_unique_stats::is_uniquesrc/storage/btree_unique.cpp118
multi_index_unique_stats::add_index_statssrc/storage/btree_unique.cpp168
btree_unique_statssrc/storage/btree_unique.hpp34
multi_index_unique_statssrc/storage/btree_unique.hpp69
spage_find_empty_slot_atsrc/storage/slotted_page.c1674
spage_insert_atsrc/storage/slotted_page.c1902
spage_headersrc/storage/slotted_page.h64
spage_slotsrc/storage/slotted_page.h88
NON_LEAF_RECORD_SIZEsrc/storage/storage_common.h134
LEAF_RECORD_SIZEsrc/storage/storage_common.h136
btree_node_split_infosrc/storage/storage_common.h140
BTREE_SEARCHsrc/storage/storage_common.h399
BTREE_CONSTRAINT_UNIQUEsrc/storage/storage_common.h616
BTREE_CONSTRAINT_PRIMARY_KEYsrc/storage/storage_common.h617
  • cubrid-btree.md — the high-level companion (design intent, theory).
  • Raw analysis: raw/code-analysis/cubrid/storage/index/.
  • Code: src/storage/btree.{c,h}, btree_load.{c,h}, btree_unique.{cpp,hpp}; slotted-page layout in src/storage/slotted_page.h.
  • Methodology: knowledge/methodology/code-analysis-detail-doc.md.