CUBRID B+Tree — Code-Level Deep Dive

Where this document fits: The high-level analysis cubrid-btree.md covers design intent and theoretical background. This document traces every branch and field at the code level. Each chapter is self-contained, but reading in order follows the full lifecycle of a single index key inside the kernel.

Contents:

Ch	Title	Status
1	Data Structure Map	✅
2	Record Encoding Decoding and Module Initialization	✅
3	Descent and Latch Coupling	✅
4	Inserting a Brand New Key	✅
5	Appending an OID and Enforcing Uniqueness	✅
6	Node Split and Separator Promotion	✅
7	Range Scan and LSA Aware Resume	✅
8	Deleting an Object Logical and Physical	✅
9	Merge and Under Fill Rebalancing	✅
10	Special and Edge Paths	✅

Chapter 1: Data Structure Map

Question: what in-memory and on-disk structures describe a B+Tree node, record, and key/OID cell, and how do their fields relate? Theory lives in cubrid-btree.md (### Slotted-page nodes, ### Non-leaf vs. leaf record split, ### OID-list suffix in non-unique leaves); this chapter pins down the fields. Layering: slotted page -> node header (slot HEADER = 0) -> per-key records (NON_LEAF_REC/LEAF_REC fixed prefix + variable payload).

1.1 The slotted-page substrate: `spage_header` and `spage_slot`

// spage_header -- src/storage/slotted_page.h
struct spage_header
{
  PGNSLOTS num_slots;
  PGNSLOTS num_records;
  INT16 anchor_type;            /* ANCHORED / ANCHORED_DONT_REUSE_SLOTS / UNANCHORED_* */
  unsigned short alignment;
  int total_free;
  int cont_free;
  int offset_to_free_area;      /* records grow up to here */
  int reserved1;
  int flags;                    /* always SPAGE_HEADER_FLAG_NONE today */
  unsigned int is_saving:1;     /* space saving for recovery (undo) */
  unsigned int need_update_best_hint:1;
  unsigned int reserved_bits:30;
};

Field	Role	Why it exists
`num_slots`	allocated slot entries (live + reusable)	slot array length; `PGSLOTID` index space
`num_records`	slots currently holding a record	separates live from freed/reusable slots
`anchor_type`	slot-stability policy; enum `ANCHORED`, `ANCHORED_DONT_REUSE_SLOTS`, `UNANCHORED_ANY_SEQUENCE`, `UNANCHORED_KEEP_SEQUENCE`	B+Tree uses `UNANCHORED_KEEP_SEQUENCE`; slot id = the key’s sorted position, later slots shift on insert/delete (Ch 3)
`alignment`	record start alignment	`BTREE_MAX_ALIGN` so MVCCID/OID reads align
`total_free`	all free bytes, may be fragmented	pre-insert feasibility check
`cont_free`	contiguous run at `offset_to_free_area`	fits in place only if `<= cont_free`, else compact
`offset_to_free_area`	record/free-space frontier	new bytes written here, then it advances
`reserved1`, `flags`, `reserved_bits`	padding / future use	8-byte alignment, forward-compatible
`is_saving`	recovery space reservation active	rollback can restore freed space
`need_update_best_hint`	heap stats hint	unused by B+Tree

// spage_slot -- src/storage/slotted_page.h
struct spage_slot
{
  unsigned int offset_to_record:14;
  unsigned int record_length:14;
  unsigned int record_type:4;        /* REC_HOME, REC_NEWHOME, ... */
};

Field	Role	Why it exists
`offset_to_record`	where record bytes start	indirection lets records move on compaction without changing slot ids
`record_length`	record byte length	bounds the `RECDES` handed to B+Tree code
`record_type`	`REC_*` discriminator	B+Tree records are `REC_HOME`; field exists for heap reuse

Invariant — slot/record accounting. num_records <= num_slots and offset_to_free_area + cont_free must not overrun the front-growing slot array. Enforced by spage_insert*; a violation returns SP_DOESNT_FIT -> split (Ch 6).

Invariant — slot order is key order. B+Tree pages are initialized with UNANCHORED_KEEP_SEQUENCE (the spage_initialize calls in btree.c): a slot id is the key’s ordinal position in the node, and spage_insert_at/spage_delete shift the later slots (spage_shift_slot_up/_down) to preserve that order — this is what lets descent binary-search directly over slot ids (Ch 3). The flip side: a PGSLOTID is positional, valid only while the page stays latched; a scan that unfixes the page must re-validate via the page LSA before reusing one (Ch 7). Compaction moves record bytes, never slot entries.

1.2 The B+Tree node header and root header

Slot HEADER (= 0) is the node header. Non-root nodes carry btree_node_header; the root carries btree_root_header whose first field is an embedded btree_node_header, so both share node-level accessors.

// btree_node_header -- src/storage/btree_load.h
struct btree_node_header
{
  BTREE_NODE_SPLIT_INFO split_info;
  VPID prev_vpid;                    /* leaf prev pointer */
  VPID next_vpid;                    /* leaf next pointer */
  short node_level;                  /* Leaf == 1, Non_leaf > 1 */
  short max_key_len;
  int common_prefix;
};

Field	Role	Why it exists
`split_info`	`{float pivot; int index;}`	biases the next split point (Ch 6) toward observed access, not always 50/50
`prev_vpid`	left sibling page id (leaf)	leaf doubly-linked chain range scans walk (Ch 7)
`next_vpid`	right sibling page id (leaf)	B-link “next”; a scan on a just-split page follows it rightward
`node_level`	node height, leaves == 1	leaf/non-leaf discriminator used everywhere
`max_key_len`	largest key length in subtree	sizing / split estimation without descending
`common_prefix`	shared-prefix length on this node	prefix compression (`COMMON_PREFIX_UNKNOWN` = -1 when uncomputed)

// btree_node_split_info -- src/storage/storage_common.h
struct btree_node_split_info { float pivot; int index; };  /* pivot = split_slot_id/num_keys; index = insert slot after split */

// btree_root_header -- src/storage/btree_load.h
struct btree_root_header
{
  BTREE_NODE_HEADER node;       /* root is also a node */
  INT64 num_oids;
  INT64 num_nulls;
  INT64 num_keys;
  OID topclass_oid;             /* topclass oid, or NULL OID for non-unique */
  int unique_pk;
  struct { int rev_level:16; int deduplicate_key_idx:16; } _32;
  VFID ovfid;                   /* overflow (key) file */
  MVCCID creator_mvccid;
  char packed_key_domain[1];    /* key type for the index — ALWAYS LAST */
};

// BTREE_CONSTRAINT_TYPE -- src/storage/storage_common.h
typedef enum { BTREE_CONSTRAINT_UNIQUE = 0x01, BTREE_CONSTRAINT_PRIMARY_KEY = 0x02 } BTREE_CONSTRAINT_TYPE;

Field	Role	Why it exists
`node`	embedded `btree_node_header`	root is a real node; `btree_get_node_header` works on it too
`num_oids`	total objects indexed	unique stats / estimates; maintained by delta logging
`num_nulls`	NULL keys (not stored)	unique constraints ignore NULLs; counted apart
`num_keys`	distinct keys	`num_oids == num_keys` for a healthy unique index
`topclass_oid`	owning class OID, or NULL	unique single-class vs. non-unique multi-class
`unique_pk`	bitmask `BTREE_CONSTRAINT_UNIQUE` (0x01), `BTREE_CONSTRAINT_PRIMARY_KEY` (0x02)	uniqueness enforcement + fixed object sizing (Ch 5)
`_32.rev_level`	on-disk revision	format gate (`BTREE_CURRENT_REV_LEVEL`)
`_32.deduplicate_key_idx`	dedup column index, stored `idx+1`	`SUPPORT_DEDUPLICATE_KEY_MODE`; `+1` lets 0 mean unset
`ovfid`	overflow-key file id	over-long keys (Ch 10)
`creator_mvccid`	index-creating transaction	index visibility during online load
`packed_key_domain`	serialized key `TP_DOMAIN`	the key type; must be last (variable-length)

Invariant — packed_key_domain is trailing. As a flexible array member, nothing may follow it. Enforced by btree_init_root_header (packs last) / btree_glean_root_header_info (unpacks). A field after it overwrites domain bytes and corrupts key comparison.

Invariant — unique counters. num_oids == num_keys must hold after every committed mutation on a unique index (NULLs in num_nulls). Enforced by btree_change_root_header_delta, which adjusts all three atomically with the structural change (Ch 5).

1.3 Per-key record fixed prefixes: `NON_LEAF_REC` and `LEAF_REC`

// non_leaf_rec -- src/storage/btree.h
struct non_leaf_rec { VPID pnt; short key_len; };   /* pnt = child page pointer */
// leaf_rec -- src/storage/btree.h
struct leaf_rec { VPID ovfl; short key_len; };       /* ovfl = overflow-OID page */

Struct	Field	Role	Why it exists
`non_leaf_rec`	`pnt`	child `VPID` for keys `>=` separator	the descent (Ch 3) follows it down
`non_leaf_rec`	`key_len`	separator key length (`-1` if overflowed)	key bytes after `pnt`; on-disk prefix `NON_LEAF_RECORD_SIZE = DISK_VPID_ALIGNED_SIZE`
`leaf_rec`	`ovfl`	first overflow-OID page, `NULL_VPID` if none	OID spillover beyond record capacity (Ch 5)
`leaf_rec`	`key_len`	inline key length (`-1` if overflow key)	bounds key bytes before the OID list; `LEAF_RECORD_SIZE = 0` — no fixed prefix, the first OID is the start and `leaf_rec` is only the parse target

1.4 The full leaf-record byte layout

Left to right: first OID (inline) -> class OID (only if BTREE_LEAF_RECORD_CLASS_OID) -> insert MVCCID (only if first OID’s volid has BTREE_OID_HAS_MVCC_INSID) -> delete MVCCID (only if BTREE_OID_HAS_MVCC_DELID) -> key bytes (leaf_rec.key_len long, or a VPID if overflow key) -> trailing OID list, each OID possibly carrying its own MVCC fields -> aligned overflow VPID at the tail (only if BTREE_LEAF_RECORD_OVERFLOW_OIDS). The discriminating bits live in two reused sub-fields of the first OID: record flags in slotid, per-OID MVCC flags in volid.

// leaf record flags -- src/storage/btree.c   (in first OID's slotid)
#define BTREE_LEAF_RECORD_FENCE         ((short) 0x1000)
#define BTREE_LEAF_RECORD_OVERFLOW_OIDS ((short) 0x2000)
#define BTREE_LEAF_RECORD_OVERFLOW_KEY  ((short) 0x4000)
#define BTREE_LEAF_RECORD_CLASS_OID     ((short) 0x8000)
#define BTREE_LEAF_RECORD_MASK          ((short) 0xF000)

Flag	Meaning when set	Consumer
`BTREE_LEAF_RECORD_FENCE`	record is a fence (split boundary), not an object	scans skip (Ch 7); splits create (Ch 6)
`BTREE_LEAF_RECORD_OVERFLOW_OIDS`	`leaf_rec.ovfl` set; trailing aligned VPID exists	`BTREE_RECORD_OR_BUF_INIT` shrinks parsed length by `DB_ALIGN(DISK_VPID_SIZE, BTREE_MAX_ALIGN)` so the tail VPID is not read as an OID
`BTREE_LEAF_RECORD_OVERFLOW_KEY`	key replaced by a VPID into `ovfid`	key read/compare fetches the overflow page (Ch 10)
`BTREE_LEAF_RECORD_CLASS_OID`	class OID stored after first OID	objects in one record span classes (non-unique multi-class)

// per-OID MVCC flags -- src/storage/btree.c   (in each OID's volid)
#define BTREE_OID_HAS_MVCC_INSID  ((short) 0x4000)
#define BTREE_OID_HAS_MVCC_DELID  ((short) 0x8000)
#define BTREE_OID_MVCC_FLAGS_MASK ((short) 0xC000)

BTREE_GET_MVCC_INFO_SIZE_FROM_FLAGS derives the byte count: both set -> 2 * OR_MVCCID_SIZE, one -> OR_MVCCID_SIZE, neither -> 0.

Invariant — flag namespace separation. Record flags are slotid bits 0xF000; MVCC flags are volid bits 0xC000. OVERFLOW_KEY and HAS_MVCC_INSID are both 0x4000 in different sub-fields. Enforced by BTREE_OID_GET_RECORD_FLAGS (slotid) vs. BTREE_OID_GET_MVCC_FLAGS (volid); mixing them misparses.

Invariant — MVCC fields decode positionally. No length prefix: presence is known only from the OID’s volid flags, so exactly those bytes must be consumed before the next field. Enforced by BTREE_RECORD_OR_BUF_INIT + flag macros; ad-hoc arithmetic desyncs on the first variable-MVCC object.

flowchart TD
  A["start: first OID"] --> B{"slotid has<br/>CLASS_OID?"}
  B -->|yes| C["read class OID"]
  B -->|no| D
  C --> D{"volid has<br/>HAS_MVCC_INSID?"}
  D -->|yes| E["read insert_mvccid"]
  D -->|no| F
  E --> F{"volid has<br/>HAS_MVCC_DELID?"}
  F -->|yes| G["read delete_mvccid"]
  F -->|no| H
  G --> H{"slotid has<br/>OVERFLOW_KEY?"}
  H -->|yes| I["key field = VPID into ovfid"]
  H -->|no| J["read key_len key bytes"]
  I --> K["trailing OID list"]
  J --> K
  K --> L{"slotid has<br/>OVERFLOW_OIDS?"}
  L -->|yes| M["leaf_rec.ovfl = aligned tail VPID"]
  L -->|no| N["done"]
  M --> N

Figure 1-1: branch-complete parse of a single leaf record.

1.5 `BTID_INT`: the in-memory index descriptor

Gleaned from the root header by btree_glean_root_header_info, threaded through nearly every B+Tree function.

// btid_int -- src/storage/btree.h
struct btid_int
{
  BTID *sys_btid;
  int unique_pk;
  int part_key_desc;             /* last partial-key domain is descending */
  TP_DOMAIN *key_type;
  TP_DOMAIN *nonleaf_key_type;   /* differs from key_type under prefix keys */
  VFID ovfid;
  char *copy_buf;                /* key copy buffer, from INDX_SCAN_ID */
  int copy_buf_len;
  int rev_level;
  int deduplicate_key_idx;
  OID topclass_oid;
};

Field	Role	Why it exists
`sys_btid`	persistent `BTID` (file + root page id)	on-disk index identity
`unique_pk`	uniqueness/PK bitmask	unique checks (Ch 5) + fixed object sizing
`part_key_desc`	last partial-key column is `DESC`	reverses comparison sign (`BTREE_IS_PART_KEY_DESC`)
`key_type`	leaf key `TP_DOMAIN`	full-key comparison / serialization
`nonleaf_key_type`	non-leaf domain, differs only with prefix keys	fixed-char separators stored as varying counterpart; non-leaf compare uses this
`ovfid`	overflow-key file	over-long keys; mirrors root `ovfid`
`copy_buf` / `copy_buf_len`	key-materialization scratch	avoids per-key malloc in scans; borrowed from scan id
`rev_level`	on-disk format revision	guards format-dependent paths
`deduplicate_key_idx`	dedup column index	dedup-mode key handling
`topclass_oid`	owning class OID	class checks; NULL for non-unique multi-class

Invariant — key_type vs. nonleaf_key_type. Same domain for non-prefix indexes; for fixed-char prefix indexes nonleaf_key_type is the varying counterpart from btree_generate_prefix_domain, and separators must compare with it, never key_type. The wrong domain mis-orders descent (Ch 6).

1.6 `BTREE_NODE_TYPE` and `BTREE_OP_PURPOSE`

// BTREE_NODE_TYPE -- src/storage/btree.h
typedef enum { BTREE_LEAF_NODE = 0, BTREE_NON_LEAF_NODE, BTREE_OVERFLOW_NODE } BTREE_NODE_TYPE;

Value	Page kind	Record layout
`BTREE_LEAF_NODE`	leaf (`node_level == 1`)	leaf-record layout of 1.4
`BTREE_NON_LEAF_NODE`	internal (`node_level > 1`)	`NON_LEAF_REC` + separator key
`BTREE_OVERFLOW_NODE`	overflow-OID page	`btree_overflow_header` + packed fixed-size OIDs

BTREE_OP_PURPOSE is the dispatch key for the one shared engine (btree_insert_internal/btree_delete_internal):

// btree_op_purpose -- src/storage/btree.h
enum btree_op_purpose
{
  BTREE_OP_NO_OP,
  BTREE_OP_INSERT_NEW_OBJECT,
  BTREE_OP_INSERT_MVCC_DELID,
  BTREE_OP_INSERT_MARK_DELETED,
  BTREE_OP_INSERT_UNDO_PHYSICAL_DELETE,
  BTREE_OP_DELETE_OBJECT_PHYSICAL,
  BTREE_OP_DELETE_OBJECT_PHYSICAL_POSTPONED,
  BTREE_OP_DELETE_UNDO_INSERT,
  BTREE_OP_DELETE_UNDO_INSERT_UNQ_MULTIUPD,
  BTREE_OP_DELETE_UNDO_INSERT_DELID,
  BTREE_OP_DELETE_VACUUM_OBJECT,
  BTREE_OP_DELETE_VACUUM_INSID,
  BTREE_OP_NOTIFY_VACUUM,
  BTREE_OP_ONLINE_INDEX_IB_INSERT,
  BTREE_OP_ONLINE_INDEX_IB_DELETE,
  BTREE_OP_ONLINE_INDEX_TRAN_INSERT,
  BTREE_OP_ONLINE_INDEX_TRAN_INSERT_DF,
  BTREE_OP_ONLINE_INDEX_UNDO_TRAN_INSERT,
  BTREE_OP_ONLINE_INDEX_TRAN_DELETE,
  BTREE_OP_ONLINE_INDEX_UNDO_TRAN_DELETE
};

Five families: insert, physical/logical delete, vacuum, notify-vacuum, online-index load. INSERT_MVCC_DELID (add delete MVCCID on MVCC delete) vs. INSERT_MARK_DELETED (non-MVCC unique, vacuum-exempt) matters for Ch 8; DELETE_VACUUM_OBJECT (full remove) vs. DELETE_VACUUM_INSID (insert-MVCCID-only remove) for vacuum.

1.7 `BTREE_MVCC_INFO` and `BTREE_OBJECT_INFO`: the in-memory cell

Parsing the 1.4 layout yields one BTREE_OBJECT_INFO per object, whose MVCC sub-record is BTREE_MVCC_INFO:

// btree_mvcc_info -- src/storage/btree.h
struct btree_mvcc_info { short flags; MVCCID insert_mvccid; MVCCID delete_mvccid; };
#define BTREE_MVCC_INFO_INITIALIZER { 0, MVCCID_ALL_VISIBLE, MVCCID_NULL }
// btree_object_info -- src/storage/btree.h
struct btree_object_info { OID oid; OID class_oid; BTREE_MVCC_INFO mvcc_info; };

Struct	Field	Role	Why it exists
`btree_mvcc_info`	`flags`	which MVCCIDs present	in-memory mirror of per-OID `volid` flags
`btree_mvcc_info`	`insert_mvccid`	creating tx id (or `MVCCID_ALL_VISIBLE`)	read visibility; vacuum clears once all-visible
`btree_mvcc_info`	`delete_mvccid`	deleting tx id (or `MVCCID_NULL`)	logical-delete marker; physical removal awaits vacuum
`btree_object_info`	`oid`	indexed instance	the lookup answer
`btree_object_info`	`class_oid`	owning class	multi-class non-unique indexes + lock targeting
`btree_object_info`	`mvcc_info`	object MVCC state	scan visibility (Ch 7)

Invariant — flags mirrors stored bits. BTREE_MVCC_INFO.flags must equal the per-OID flags packed into the record. Enforced by the sanctioned mutators BTREE_MVCC_INFO_SET_FIXED_SIZE / _CLEAR_FIXED_SIZE; disagreement makes the computed size wrong and reads the next object from the wrong offset. The fixed-size rule (BTREE_OBJECT_FIXED_SIZE, btree_load.h) forces flags full — OID [+ class OID if unique] + both MVCCIDs — for an overflow-page object, a non-first leaf object, or the first object of a record that has overflow pages, so overflow pages address as fixed-stride arrays.

Containment: BTID_INT is gleaned from btree_root_header (embeds btree_node_header, embeds btree_node_split_info); node records are non_leaf_rec (pnt descends to a child) or leaf_rec (ovfl chains to a btree_overflow_header page of fixed-stride OIDs); each leaf record decodes to a list of btree_object_info, each holding a btree_mvcc_info.

1.8 Chapter summary — key takeaways

Three layers. Each page is a slotted page (spage_header + spage_slot[]) whose slot 0 (HEADER) holds btree_node_header (non-root) or btree_root_header (root, embedding a node header); slots 1..N hold per-key records.
LEAF_RECORD_SIZE is 0. A leaf record has no fixed on-disk prefix — the first OID is the start; leaf_rec/non_leaf_rec are parse targets, and only non_leaf_rec has a real on-disk size (NON_LEAF_RECORD_SIZE).
Flags live in reused OID sub-fields. Record flags (BTREE_LEAF_RECORD_*, mask 0xF000) in the first OID’s slotid; MVCC flags (BTREE_OID_HAS_MVCC_*, mask 0xC000) in each OID’s volid. 0x4000 differs per field — mask the correct sub-field.
Leaf records decode positionally. Optional class OID, insert/delete MVCCIDs, key (inline or overflow VPID), trailing OID list, optional aligned overflow VPID — present by flag, no length prefixes; the only safe walk is BTREE_RECORD_OR_BUF_INIT plus flag macros.
BTID_INT is the threaded descriptor. Gleaned from the root header, it carries unique_pk, separate key_type/nonleaf_key_type (diverge only under prefix keys), ovfid, and dedup metadata into nearly every operation.
BTREE_OP_PURPOSE is the dispatch key. One insert/delete engine serves five families (insert, delete, vacuum, notify, online-load); the purpose distinguishes MVCC delete from non-MVCC mark-deleted and full-remove from insid-only-remove.
Counters and flags stay coherent. Root-header num_oids/num_keys/num_nulls are kept consistent by delta logging, and BTREE_MVCC_INFO.flags must mirror the stored OID flags — so record sizes and unique arithmetic never desynchronize across a crash or concurrent reader.

Chapter 2: Record Encoding Decoding and Module Initialization

Chapter 1 mapped the in-memory structs (BTID_INT, LEAF_REC, NON_LEAF_REC, BTREE_MVCC_INFO, BTREE_OBJECT_INFO). This chapter traces how a logical (key, OID, class_oid, MVCC info) tuple becomes the raw bytes of a slotted-page record and back, then covers the two per-operation scratch structs (btree_insert_helper, btree_delete_helper) every later chapter threads down the call stack, and the one-time overflow-key file bootstrap. For the why of overflow keys, the OID-list suffix, and unique-vs-non-unique layout see the companion’s Node layout and Unique-key handling; this chapter traces the bytes, not the theory.

2.1 The on-disk record grammar

A leaf record’s parsing-control bits are not a separate field — they ride in the high nibble of the first object’s slot-id half-word (OR_OID_SLOTID):

// BTREE_LEAF_RECORD_* — src/storage/btree.c
#define BTREE_LEAF_RECORD_FENCE         ((short) 0x1000)  /* fence key, not a real object */
#define BTREE_LEAF_RECORD_OVERFLOW_OIDS ((short) 0x2000)  /* OID list spills to overflow page */
#define BTREE_LEAF_RECORD_OVERFLOW_KEY  ((short) 0x4000)  /* key spilled to overflow-key file  */
#define BTREE_LEAF_RECORD_CLASS_OID     ((short) 0x8000)  /* subclass OID present after instance OID */
#define BTREE_LEAF_RECORD_MASK          ((short) 0xF000)

A second, independent pair rides in the first object’s volid half-word (OR_OID_VOLID) and governs MVCC fields per object: BTREE_OID_HAS_MVCC_INSID (0x4000), BTREE_OID_HAS_MVCC_DELID (0x8000). An OID is 8 bytes (pageid(4) @0 | slotid(2) @4 | volid(2) @6); a disk address uses only the low bits of slotid/volid, so each high nibble is free. This is the central encoding trick: the first object’s OID doubles as the record header.

flowchart LR
  subgraph First["First object (also the record header)"]
    O1["instance OID 8B<br/>slotid hi-nibble = record flags<br/>volid hi-nibble = MVCC flags"]
    C1["class OID 8B<br/>if CLASS_OID flag"]
    I1["insert MVCCID 8B<br/>if HAS_MVCC_INSID"]
    D1["delete MVCCID 8B<br/>if HAS_MVCC_DELID"]
  end
  KEY["key bytes<br/>(or 6B overflow VPID if OVERFLOW_KEY)"]
  REST["object 2..n<br/>OID + optional MVCCIDs"]
  OVF["overflow-OIDs VPID 6B<br/>if OVERFLOW_OIDS"]
  O1 --> C1 --> I1 --> D1 --> KEY --> REST --> OVF

Figure 2-1. Leaf-record byte grammar. The first object precedes the key; all additional objects follow it after an align32 pad. The trailing overflow-OIDs VPID is excluded from object iteration by BTREE_RECORD_OR_BUF_INIT (§2.5). A non-leaf record has no objects: it is the NON_LEAF_REC fixed portion (child VPID + key_len) followed by the key, where a negative key_len is the non-leaf equivalent of OVERFLOW_KEY.

2.2 Measuring a key: btree_get_disk_size_of_key

The caller needs the key’s packed length to size the buffer and decide whether it overflows. btree_get_disk_size_of_key is a thin wrapper:

// btree_get_disk_size_of_key — src/storage/btree.c
int btree_get_disk_size_of_key (DB_VALUE * key) {
  if (key == NULL || DB_IS_NULL (key)) {
      assert (key != NULL && !DB_IS_NULL (key));  /* <- NULL keys never reach here in release */
      return 0;
  }
  return pr_index_writeval_disk_size (key);       /* <- delegates to the type's packer */
}

Its only branch is the NULL guard: a NULL key is the absence of a record (counted in the root header’s num_nulls), never a zero-length cell, so reaching it is a bug. The result feeds or_init sizing in btree_write_record and the overflow threshold in §2.9.

Invariant — key size is computed once and reused. The insert path stashes the length in btree_insert_helper::key_len_in_page after BTREE_GET_KEY_LEN_IN_PAGE (which collapses keys >= BTREE_MAX_KEYLEN_INPAGE to DISK_VPID_SIZE). If this measurement and the actual index_writeval disagreed, the slot would be mis-sized and adjacent records corrupted. The code enforces agreement by routing both through the same pr_type packer.

2.3 Setting record flags: btree_leaf_set_flag

Every record-flag write goes through one chokepoint:

// btree_leaf_set_flag — src/storage/btree.c
static void btree_leaf_set_flag (RECDES * recp, short record_flag) {
  short slot_id;
  assert ((short) (record_flag & ~BTREE_LEAF_RECORD_MASK) == 0);    /* <- only the F000 nibble allowed */
  slot_id = OR_GET_SHORT (recp->data + OR_OID_SLOTID);             /* <- read first object's slotid+flags */
  OR_PUT_SHORT (recp->data + OR_OID_SLOTID, slot_id | record_flag); /* <- OR in, preserve slotid */
}

It reads the word at OR_OID_SLOTID, ORs in the flag, writes it back. The assert catches a stray bit (or an MVCC 0x4000/0x8000 meant for the volid word) that would corrupt the first OID’s slot-id and silently redirect the object pointer. The write is non-destructive on the low bits because a real slot-id never uses the top nibble. MVCC flags use the sibling btree_record_object_set_mvcc_flags, the identical trick on OR_OID_VOLID.

2.4 Serializing a record: btree_write_record

btree_write_record builds the record for both node types into a caller-supplied RECDES, writing through an OR_BUF cursor and setting flags as it goes. Figure 2-2 traces every branch.

flowchart TD
  A["or_init buf over rec->data"] --> B{node_type == LEAF?}
  B -- "non-leaf" --> NL["write fixed NON_LEAF_REC"]
  B -- "leaf" --> L1["or_put_oid instance OID"]
  L1 --> L2{UNIQUE and class != topclass?}
  L2 -- yes --> L3["or_put_oid class OID<br/>set CLASS_OID flag"]
  L2 -- no --> L4
  L3 --> L4{mvcc_info != NULL?}
  L4 -- yes --> L7["or_put_mvccid insid if HAS_INSID<br/>or_put_mvccid delid if HAS_DELID<br/>set MVCC flags in volid"]
  L4 -- no --> KT
  L7 --> KT
  NL --> KT{NORMAL key?}
  KT -- normal --> K1["pr_type->index_writeval key"]
  KT -- overflow --> K3["leaf: set OVERFLOW_KEY flag<br/>btree_store_overflow_key, write 6B VPID"]
  K1 --> AL["or_put_align32"]
  K3 --> AL
  AL --> FIN["rec->length = ptr-buffer<br/>rec->type = REC_HOME"]

Figure 2-2. Control flow of btree_write_record, all branches.

Every or_put_* is followed by if (error_code != NO_ERROR) { assert_release (false); return error_code; } — an OR_BUF overrun is a sizing bug, not a runtime condition. Leaf-path branches:

First object — or_put_oid writes the instance OID, which is also the record header (§2.1).
Class OID — or_put_oid + btree_leaf_set_flag (rec, BTREE_LEAF_RECORD_CLASS_OID) only when unique and the object’s class differs from btid->topclass_oid (a PK on the base table stores none, saving 8 bytes). Flag and bytes move in lockstep.
MVCC — guarded by mvcc_info != NULL; HAS_INSID/HAS_DELID are independent or_put_mvccid calls. btree_record_object_set_mvcc_flags (rec->data, mvcc_info->flags) commits the volid nibble after the bytes, so a partial record never claims fields it lacks. mvcc_info == NULL writes neither bytes nor flags (non-MVCC bulk paths).

The non-leaf branch emits btree_write_fixed_portion_of_non_leaf_record_to_orbuf and ignores oid/class_oid/mvcc_info. The shared key-write then runs: NORMAL calls pr_type->index_writeval; OVERFLOW sets OVERFLOW_KEY (leaf only — non-leaf signals overflow via negative key_len), btree_store_overflow_key chains the key, and the 6-byte VPID is written in its place. Finally or_put_align32 pads to 4 bytes (the parser relies on this — §2.6), rec->length is the cursor delta, rec->type = REC_HOME.

Invariant — flags written during serialization must exactly predict the fields the parser skips. btree_write_record is the only producer, btree_read_record_without_decompression the only consumer; they share no state but the flag bits. If the writer set OVERFLOW_KEY without storing the key (or vice versa), the reader’s or_advance math walks into the key region. Hence each flag is set immediately adjacent to its or_put, never batched.

2.5 BTREE_RECORD_OR_BUF_INIT — the iteration boundary

Object loops never read the whole record blindly. BTREE_RECORD_OR_BUF_INIT sets an OR_BUF whose endptr excludes the trailing overflow-OIDs VPID:

// BTREE_RECORD_OR_BUF_INIT — src/storage/btree.c
#define BTREE_RECORD_OR_BUF_INIT(buf, btree_rec) \
  do { int size = (btree_rec)->length; \
      if (btree_leaf_is_flaged (btree_rec, BTREE_LEAF_RECORD_OVERFLOW_OIDS)) \
        { size -= DB_ALIGN (DISK_VPID_SIZE, BTREE_MAX_ALIGN); } /* <- hide ovfl pointer from loops */ \
      or_init (&buf, (btree_rec)->data, size); } while (false)

This lets §2.6 and §2.7 loop while (buf.ptr < buf.endptr) without mistaking the 6-byte overflow VPID for an object.

2.6 Parsing a record: btree_read_record and the without_decompression core

btree_read_record is a two-step shell: it calls btree_read_record_without_decompression to parse fields and locate the key, then — for in-page leaf keys that are not fences — reattaches the node’s common prefix. The core walks the bytes, skipping exactly the fields the writer claimed to write, gated on the same conditions §2.4 used:

// btree_read_record_without_decompression — src/storage/btree.c (leaf field-skip, condensed)
rc = or_advance (&buf, OR_OID_SIZE);                 /* skip instance oid */
if (BTREE_IS_UNIQUE (btid->unique_pk)
    && btree_leaf_is_flaged (rec, BTREE_LEAF_RECORD_CLASS_OID))
  { rc = or_advance (&buf, OR_OID_SIZE); }           /* skip class oid (UNIQUE && flag) */
if (btree_record_object_is_flagged (rec->data, BTREE_OID_HAS_MVCC_INSID))
  { rc = or_advance (&buf, OR_MVCCID_SIZE); }        /* skip insert mvccid */
if (btree_record_object_is_flagged (rec->data, BTREE_OID_HAS_MVCC_DELID))
  { rc = or_advance (&buf, OR_MVCCID_SIZE); }        /* skip delete mvccid */
if (btree_leaf_is_flaged (rec, BTREE_LEAF_RECORD_OVERFLOW_KEY))
  { key_type = BTREE_OVERFLOW_KEY; }
if (btree_leaf_is_flaged (rec, BTREE_LEAF_RECORD_OVERFLOW_OIDS))
  { btree_leaf_get_vpid_for_overflow_oids (rec, &leaf_rec->ovfl); }
else { VPID_SET_NULL (&leaf_rec->ovfl); }

Record flags come from the slot-id nibble (btree_leaf_is_flaged), per-object MVCC flags from the volid nibble (btree_record_object_is_flagged); the class-OID skip needs BTREE_IS_UNIQUE and the flag, matching §2.4. Each or_advance has its own if (rc != NO_ERROR) return rc; (omitted). The non-leaf branch reads the fixed NON_LEAF_REC portion and treats key_len < 0 as BTREE_OVERFLOW_KEY. Key-read branches (shared):

NORMAL key. clear_key is true only when key != NULL && copy_key == COPY_KEY_VALUE; copying a MIDXKEY/char/bit string routes through btid->copy_buf to avoid a malloc. index_readval advances the cursor and leaf_rec->key_len is set from the delta. Passing key == NULL makes index_readval skip the key without materializing it — the cheap path scans use to reach the OID list.
OVERFLOW key. Reads the 6-byte VPID; if key != NULL it follows the chain via btree_load_overflow_key and forces *clear_key = true, else *clear_key = false. A failed or_get_short asserts and nulls the key.

The tail computes after_key_offset identically for every branch: a key->need_clear override forces *clear_key = true for self-clearing types, then buf.ptr = PTR_ALIGN (buf.ptr, OR_INT_SIZE) skips the writer’s align32 and *offset = CAST_BUFLEN (buf.ptr - buf.buffer) records the byte after the key. That *offset is the module’s most-reused number — where the OID list begins, passed everywhere as after_key_offset. btree_read_record then decompresses only for a leaf, non-overflow, non-fence key when btree_node_get_common_prefix returns n_prefix > 0: it reads slot 1 (the fence) and pr_midxkey_add_prefix splices prefix + suffix into *key (n_prefix < 0 asserts). See the companion’s node-layout section for prefix mechanics.

2.7 Counting objects: btree_leaf_get_first_object and btree_record_get_num_oids

btree_leaf_get_first_object returns the header-doubling object: it runs BTREE_RECORD_OR_BUF_INIT (record_buffer, recp) then btree_or_get_object (..., BTREE_LEAF_NODE, dummy_offset = 0, oidp, class_oid, mvcc_info), asserting NO_ERROR (the first object must always parse). btree_record_get_num_oids computes the total; its branches encode the storage regimes:

// btree_record_get_num_oids — src/storage/btree.c (condensed; the three branches)
if (node_type == BTREE_LEAF_NODE) {
    rec_oid_cnt = 1;                                /* the object before the key */
    BTREE_RECORD_OR_BUF_INIT (buf, rec);
    or_seek (&buf, after_key_offset);               /* jump past the key */
    if (BTREE_IS_UNIQUE (btid_int->unique_pk)) {    /* unique: fixed-size objects */
        fixed_object_size = BTREE_OBJECT_FIXED_SIZE (btid_int);
        assert ((CAST_BUFLEN (buf.endptr - buf.ptr) % fixed_object_size) == 0);
        return rec_oid_cnt + CAST_BUFLEN (buf.endptr - buf.ptr) / fixed_object_size;
    }
    while (buf.ptr < buf.endptr) {                  /* non-unique: variable size, walk it */
        mvcc_flag = btree_record_object_get_mvcc_flags (buf.ptr);
        buf.ptr += OR_OID_SIZE + BTREE_GET_MVCC_INFO_SIZE_FROM_FLAGS (mvcc_flag);
        rec_oid_cnt++;
    }
    return rec_oid_cnt;
}
return CEIL_PTVDIV (rec->length, BTREE_OBJECT_FIXED_SIZE (btid_int));  /* overflow node: all fixed */

Three regimes: unique leaf — non-first objects are fixed (BTREE_OBJECT_FIXED_SIZE = 2*OID + 2*MVCCID); one division of the post-key span gives the count, the modulo assert is the corruption tripwire. Non-unique leaf — variable objects (no class OID; MVCCIDs optional), so the loop advances by OR_OID_SIZE + BTREE_GET_MVCC_INFO_SIZE_FROM_FLAGS(...), the +1 first object added up front. Overflow node — no key, all fixed: rec->length / fixed_size via CEIL_PTVDIV.

Invariant — after_key_offset must point exactly at the byte after the key. All three leaf strategies depend on or_seek (&buf, after_key_offset) landing on the first post-key object; that offset is §2.6’s *offset verbatim. The unique-leaf modulo assert catches an off-by-one offset before it silently miscounts.

2.8 The per-operation helper structs

Every mutation flows through one of two heap-free scratch structs, constructed once per top-level call and threaded down the recursion, grouping what would otherwise be a dozen parameters.

btree_insert_helper — every field

Field	Role	Why it exists
`obj_info`	`BTREE_OBJECT_INFO` being inserted	the payload (OID + class OID + MVCC info)
`purpose`	`BTREE_OP_PURPOSE` discriminator	selects new-insert / MVCC-delid / online-index / undo behavior
`op_type`	single vs multi insert/modify	drives `unique_stats_info` collection
`unique_stats_info`	`btree_unique_stats *` accumulator	defers unique-count maintenance to a batch flush
`key_len_in_page`	packed key length (post `BTREE_GET_KEY_LEN_IN_PAGE`)	sizes splits without re-measuring
`nonleaf_latch_mode`	`PGBUF_LATCH_MODE` for descent	READ optimistic, WRITE when a split is anticipated (Ch 3)
`is_first_try`	true on first root fix	loads B-tree info from root once
`need_update_max_key_len`	max-key-len bump to propagate down	children must follow a grown max
`is_crt_node_write_latched`	current node already write-latched	skips latch promotion
`is_root`	current node is the root	root has special split/merge handling
`is_unique_key_added_or_deleted`	key count actually changed	distinguishes key-created vs OID-appended for stats
`is_unique_multi_update`	multi-row update of unique index	tolerates >1 visible object mid-statement
`is_ha_enabled`	HA replication active	forbids any transient unique violation
`log_operations`	`er_log` tracing toggle	debug instrumentation
`is_null`	the key is SQL NULL	NULLs are counted, not stored
`printed_key` / `printed_key_sha1`	readable key + its SHA1	diagnostics for large keys
`insert_list`	`btree_insert_list *`	batched multi-key insert driver
`leaf_addr`	`LOG_DATA_ADDR` of target leaf	redo-logging anchor
`rcvindex`	`LOG_RCVINDEX` id	selects the redo handler
`rv_keyval_data` / `_length`	packed key+value undo image	logical undo payload
`rv_redo_data` / `_ptr`	redo buffer + write cursor	physical redo image, built incrementally
`compensate_undo_nxlsa`	next-undo LSA	CLR generation during undo
`is_system_op_started`	a nested system op is open	tracks pending `log_sysop_*` commit/abort
`time_track`	`PERF_UTIME_TRACKER`	performance statistics
`saved_locked_oid` / `_class_oid` (SERVER_MODE)	object locked during unique check	carries the lock across latch-drop-and-retry

Constructor descent-defaults worth memorizing: purpose (BTREE_OP_NO_OP), nonleaf_latch_mode (PGBUF_LATCH_READ) (start optimistic), is_first_try (true) (load btree info once), is_crt_node_write_latched (false), is_root (false), is_unique_key_added_or_deleted (true) (assume the key count changes until proven otherwise), is_system_op_started (false).

btree_delete_helper — every field

Field	Role	Why it exists
`object_info`	`BTREE_OBJECT_INFO` to remove	the deletion target
`second_object_info`	a second object info	undo of a unique-index insert needs the displaced object
`purpose`	`BTREE_OP_PURPOSE` discriminator	same multiplexing role as insert helper
`nonleaf_latch_mode`	descent latch mode	READ-optimistic default
`op_type`	operation type (default `SINGLE_ROW_DELETE`)	gates `unique_stats_info` collection
`unique_stats_info`	`btree_unique_stats *`	batch unique-count maintenance
`match_mvccinfo`	`BTREE_MVCC_INFO` to match	finds the specific object version in a key’s OID list
`buffered_key`	`OR_BUF *` of the key	search by pre-packed key, avoiding re-encode
`printed_key` / `printed_key_sha1`	readable key + SHA1	diagnostics for large keys
`log_operations`	tracing toggle	debug logging
`is_root`	current node is root	root-specific merge handling
`is_first_search`	first traversal	distinguishes initial descent from post-restart retry
`check_key_deleted`	possible >1 visible object (MULTI_ROW_UPDATE)	decides whether to verify the key is gone
`is_key_deleted`	key actually became empty	corrects collected stats if the key survived
`leaf_addr`	`LOG_DATA_ADDR` of leaf	redo anchor
`rv_keyval_data` / `_length`	undo image	logical undo of the delete
`rv_redo_data` / `_ptr`	redo buffer + cursor	physical redo image
`reference_lsa`	reference LSA	recovery/compensation reference point
`is_system_op_started`	nested system op open	same commit/abort bookkeeping
`time_track`	`PERF_UTIME_TRACKER`	performance stats

The two structs mirror each other on the recovery block (leaf_addr, rv_keyval_data*, rv_redo_data*, is_system_op_started); btree_insert_helper_to_delete_helper and its inverse copy these across when one operation switches modes (e.g. an MVCC update that deletes then inserts).

Invariant — is_system_op_started must be balanced on every exit path. If the code log_sysop_starts (as §2.9 does) and sets this flag, every return/goto error must log_sysop_commit or log_sysop_abort. A leaked system op holds the log-append fixed and stalls recovery; the flag lets error paths ask “did I open one?” before unwinding.

2.9 First-time overflow-key file bootstrap

The overflow-key file is created lazily, the first time a key exceeds BTREE_MAX_KEYLEN_INPAGE (= DB_PAGESIZE / 8). The insert root prep guards creation with a double-checked pattern: it tests key_len >= BTREE_MAX_KEYLEN_INPAGE && VFID_ISNULL (&btid_int->ovfid) under a read latch; if true it pgbuf_promote_read_latch’s the root to WRITE (re-fixing on ER_PAGE_LATCH_PROMOTE_FAIL), sets insert_helper->is_crt_node_write_latched = true, then re-tests VFID_ISNULL because a concurrent inserter may have created the file in the latch window. Only then does it log_sysop_start and call btree_create_overflow_key_file (aborting the sysop and goto error on failure). The system operation makes the file persist even if the surrounding insert rolls back — it is permanent metadata, not transactional data.

// btree_create_overflow_key_file — src/storage/btree.c (condensed)
VFID_SET_NULL (&btid->ovfid);
des.btree_key_overflow.btid = *btid->sys_btid;          /* structure copy */
des.btree_key_overflow.class_oid = btid->topclass_oid;
assert (!OID_ISNULL (&des.btree_key_overflow.class_oid));
error_code = file_create_with_npages (thread_p, FILE_BTREE_OVERFLOW_KEY, 3, &des, &btid->ovfid);
if (error_code != NO_ERROR) { return error_code; }
error_code = heap_get_class_tde_algorithm (thread_p, &btid->topclass_oid, &tde_algo);
if (error_code != NO_ERROR) { VFID_SET_NULL (&btid->ovfid); return error_code; }  /* <- roll back VFID */
error_code = file_apply_tde_algorithm (thread_p, &btid->ovfid, tde_algo);
if (error_code != NO_ERROR) { VFID_SET_NULL (&btid->ovfid); return error_code; }  /* <- roll back VFID */
return error_code;

Branch-complete: it nulls ovfid up front, creates a 3-page FILE_BTREE_OVERFLOW_KEY file tagged with the index’s BTID and top-class OID, then applies the class’s TDE (transparent data encryption) algorithm so encrypted tables encrypt their overflow keys. Every error path after the file is created re-nulls ovfid — a half-initialized file never leaves a dangling VFID in BTID_INT, keeping the caller’s VFID_ISNULL re-check honest on retry. The successful VFID is later persisted into the root header. The class_oid assert guards against an orphan overflow file with no owning class.

2.10 Chapter summary — key takeaways

The first object’s OID is the record header — record flags in its slotid nibble (btree_leaf_set_flag), MVCC flags in its volid nibble (btree_record_object_set_mvcc_flags); the low bits stay a valid disk address.
btree_write_record and btree_read_record_without_decompression are a producer/consumer pair joined only by those flag bits; the writer sets each flag adjacent to its field and the reader skips exactly those fields, so flags are never batched and every or_put/or_advance has its own error exit.
*offset / after_key_offset is the load-bearing number — computed once after PTR_ALIGN, then or_seek’d to by btree_record_get_num_oids and later iterators; BTREE_RECORD_OR_BUF_INIT hides the trailing overflow VPID from those loops.
Object counting has three regimes: unique-leaf (one division, modulo-asserted), non-unique-leaf (per-object walk via BTREE_GET_MVCC_INFO_SIZE_FROM_FLAGS), overflow (rec->length / fixed_size); the +1 first object is added explicitly in the leaf cases.
btree_insert_helper / btree_delete_helper are heap-free constructor-initialized scratch structs; their defaults (PGBUF_LATCH_READ, is_first_try = true) encode Ch 3’s optimistic descent and their mirrored recovery block feeds the logging chapters.
The overflow-key file is created lazily and exactly once, behind a double-checked VFID_ISNULL inside a system operation so it survives rollback; every error path in btree_create_overflow_key_file re-nulls ovfid to keep the re-check honest.

Chapter 3: Descent and Latch Coupling

This chapter answers: given a key, which leaf owns it, and how do we get there without corrupting the tree under concurrency? The descent is shared by every insert, delete, scan-seek, and unique probe. For latch-coupling theory and the “restart from root” claim see the companion’s ### Latch-coupling on descent; this chapter corrects it: writes split pessimistically on descent (Ch 6) and only restart when a latch promotion fails.

3.1 The descent driver and its three callbacks

Every traversal flows through btree_search_key_and_apply_functions — a template where callers plug in three callbacks (fix-root / advance / process-leaf):

// btree_search_key_and_apply_functions -- src/storage/btree.c
start_btree_traversal:
  restart = false;  is_leaf = false;
  root_function (..., &crt_page, &is_leaf, search_key, &stop, &restart, root_args);  /* 1. fix root, load BTID_INT 1st try */
  if (stop) goto end;  if (restart) goto start_btree_traversal;   /* stop: NULL/vacuum. restart: promote failed */
  while (!is_leaf)                        /* 2. ADVANCE until leaf reached */
    {
      advance_function (..., &crt_page, &advance_page, &is_leaf, search_key, &stop, &restart, advance_args);
      if (stop) goto end;
      if (restart) { if (advance_page) pgbuf_unfix_and_init (thread_p, advance_page); goto start_btree_traversal; }
      if (!is_leaf)                       /* latch-couple: release parent only AFTER child fixed */
        { if (crt_page) pgbuf_unfix (thread_p, crt_page);  crt_page = advance_page;  advance_page = NULL; }
    }
  if (key_function) { key_function (...); if (restart) goto start_btree_traversal; }  /* 3. operate on leaf */

The callback triples (root / advance / key): read = btree_get_root_with_key / btree_advance_and_find_key / none; insert (Ch 4-6) and delete (Ch 8-9) swap in btree_fix_root_for_insert / _for_delete plus the btree_split_node_and_advance / btree_merge_node_and_advance advances and btree_key_insert_new_object / btree_key_delete_* key functions.

INVARIANT (no page leak across restart): on restart the engine unfixes advance_page before jumping to start_btree_traversal, and the loop top unfixes crt_page; every advance callback also unfixes both crt_page and child_page before setting restart = true. A forgotten unfix pins the buffer-fix counter forever.

Figure 3-1 — the descent skeleton.

stateDiagram-v2
    [*] --> FixRoot : start_btree_traversal
    FixRoot --> End : stop true
    FixRoot --> FixRoot : restart true
    FixRoot --> Advance : root fixed, not leaf
    FixRoot --> ProcessLeaf : root is also leaf
    Advance --> End : stop true
    Advance --> FixRoot : restart true, unfix child
    Advance --> Advance : child fixed, release parent, descend
    Advance --> ProcessLeaf : reached leaf
    ProcessLeaf --> FixRoot : restart true
    ProcessLeaf --> End : done
    End --> [*]

3.2 `BTREE_SEARCH_KEY_HELPER` — the result carrier

Every page-level search writes its verdict into one tiny struct, passed by pointer down the descent and back:

// struct btree_search_key_helper -- src/storage/btree.c
struct btree_search_key_helper
{
  enum fence_key_presence { NO_FENCE_KEY = 0, HAS_FENCE_KEY };
  BTREE_SEARCH result;    /* Result of key search. */
  PGSLOTID slotid;    /* Slot ID of found key or slot ID of the biggest key smaller then key (if not found). */
  fence_key_presence has_fence_key;
};

Field	Role	Why it exists
`result`	A `BTREE_SEARCH` value (matrix below); the primary verdict.	Callers branch on found vs absent.
`slotid`	On `BTREE_KEY_FOUND`, the slot holding the key; when absent, the biggest key smaller than the search key; `NULL_SLOTID` while uninitialized or on `DB_UNK`.	One struct serves both “found, here it is” and “absent, here it belongs”.
`has_fence_key`	`HAS_FENCE_KEY` if the last record examined carried `BTREE_LEAF_RECORD_FENCE`.	The landed-on slot is a synthetic boundary, not data — skip in scans (§3.6).

Role matrix for the values btree_search_*_page can set (subset of the 7-member BTREE_SEARCH enum in storage_common.h; BTREE_ERROR_OCCURRED and BTREE_ACTIVE_KEY_FOUND belong to OID/version search in Ch 5/8 and are never emitted here):

Value	Set when / `slotid` / reaction
`BTREE_KEY_FOUND`	exact non-fence match; `slotid` = the match; read/modify
`BTREE_KEY_NOTFOUND`	`DB_UNK` compare; `slotid = NULL_SLOTID`; propagate error
`BTREE_KEY_SMALLER`	key < all, slot 1; `slotid = 1`; go to prev sibling
`BTREE_KEY_BIGGER`	key > all, off right end; `slotid = key_cnt + 1`; go to next sibling
`BTREE_KEY_BETWEEN`	absent, inside [min,max]; `slotid` = next-bigger slot; owning leaf, insert here

BTREE_SEARCH_KEY_HELPER_INITIALIZER seeds it to { BTREE_KEY_NOTFOUND, NULL_SLOTID, NO_FENCE_KEY }. btree_advance_and_find_key dispatches by node type (Figure 3-2): the leaf search fills all three fields; the non-leaf search fills only slotid, never the helper.

Figure 3-2 — who fills the helper.

flowchart LR
  A["btree_advance_and_find_key"] -->|leaf| LP["btree_search_leaf_page\nfills result+slotid+has_fence_key"]
  A -->|non-leaf| NP["btree_search_nonleaf_page\nfills only slotid"]
  LP --> H["BTREE_SEARCH_KEY_HELPER"]

3.3 Fixing the root

All three root callbacks call btree_fix_root_with_info, which pgbuf_fixes the root at latch_mode, reads the header (btree_get_root_header), and — only when btid_int_p != NULL — gleans metadata via btree_glean_root_header_info. BTID_INT (key type, unique flag, overflow-key VFID, topclass OID, …) loads only on the first fix; on restart the callbacks pass btid_int_p = NULL to reuse it:

// btree_fix_root_for_insert -- src/storage/btree.c
if (insert_helper->is_first_try)
  *root_page = btree_fix_root_with_info (..., insert_helper->nonleaf_latch_mode, NULL, &root_header, btid_int);
else
  { /* restart: re-fix, reset latch flags, skip all first-try-only work */
    *root_page = btree_fix_root_with_info (..., insert_helper->nonleaf_latch_mode, NULL, NULL, NULL);
    insert_helper->is_crt_node_write_latched = false;  insert_helper->need_update_max_key_len = false;
    return NO_ERROR;
  }
insert_helper->is_first_try = false;                     /* <- never do the below twice */

First-try-only branches in btree_fix_root_for_insert (guarded so they never repeat on restart):

purpose == UNDO_PHYSICAL_DELETE / ONLINE_INDEX_UNDO_TRAN_DELETE → fill class OID for unique, compute key_len_in_page, return.
unique index → build btree_unique_stats incr (delete_* vs insert_* by purpose); accumulate into unique_stats_info (multi-row) or logtb_tran_update_unique_stats; error → goto error.
is_null → *stop = true, unfix root, return; INSERT_MVCC_DELID / INSERT_MARK_DELETED → return.
else (real insert) → if key_len >= BTREE_MAX_KEYLEN_INPAGE and no overflow-key file exists, promote root to WRITE and create it (§3.4); set key_len_in_page, return.

btree_fix_root_for_delete mirrors this — first search loads BTID_INT, restart re-fixes and returns. Its first-fix if on delete_helper->purpose (prose summary, not a C excerpt): the two BTREE_OP_DELETE_VACUUM_* purposes return NO_ERROR immediately; the five undo purposes (_UNDO_INSERT*, _OBJECT_PHYSICAL_POSTPONED, _ONLINE_INDEX_UNDO_TRAN_INSERT) COPY_OID btid_int->topclass_oid into a NULL unique BTREE_DELETE_CLASS_OID, then return NO_ERROR; else (asserted btree_is_delete_object_purpose) update unique stats via incr.delete_* and set *stop = true if is_null.

Both fix the root with nonleaf_latch_mode, which defaults to PGBUF_LATCH_READ — writes start optimistic, escalating to WRITE only where a node must change.

3.4 Latch-coupling and where READ becomes WRITE

Read path (btree_advance_and_find_key): root and each child fixed PGBUF_LATCH_READ; the driver releases the parent only after the child is fixed. No promotion.
Write paths (split/merge advance): also start READ; a latch is promoted to WRITE only where its node must be modified.

btree_split_node_and_advance (Ch 6) fixes the child WRITE eagerly when it is a leaf’s parent or max-key-len must grow, else keeps nonleaf_latch_mode. When a split is needed but the node is still READ-latched, it promotes in place and, on failure, restarts WRITE-mode:

// btree_split_node_and_advance -- src/storage/btree.c
if (need_split && insert_helper->nonleaf_latch_mode == PGBUF_LATCH_READ && !insert_helper->is_crt_node_write_latched)
  {
    error_code = pgbuf_promote_read_latch (thread_p, crt_page, PGBUF_PROMOTE_ONLY_READER);  /* always ONLY_READER */
    if (error_code == ER_PAGE_LATCH_PROMOTE_FAIL)
      { /* could not promote -> unfix both pages, switch to WRITE-mode, restart from root */
        insert_helper->nonleaf_latch_mode = PGBUF_LATCH_WRITE;  *restart = true;
        pgbuf_unfix_and_init (thread_p, child_page);  pgbuf_unfix_and_init (thread_p, *crt_page);
        insert_helper->is_crt_node_write_latched = false;  return NO_ERROR;
      }
    insert_helper->is_crt_node_write_latched = true;             /* promoted in place: keep descending */
  }

A later sibling promote targets the child node via pgbuf_promote_read_latch (..., &child_page, PGBUF_PROMOTE_SHARED_READER); both promotes fail-and-restart identically.

INVARIANT (no latch upgrade deadlock): the current-page promote uses PGBUF_PROMOTE_ONLY_READER and the child-page promote PGBUF_PROMOTE_SHARED_READER — both fail rather than wait when another reader holds the page, after which the thread releases all latches and restarts WRITE-mode. Blocking on a WRITE upgrade while holding a READ latch is the textbook latch-upgrade deadlock; fail-and-restart makes it impossible.

This corrects the companion’s “restart from root”: a write does not restart per split — it splits pessimistically on descent (Ch 6); a restart happens only on a promotion failure, after which the WRITE-mode descent cannot loop.

3.5 `btree_search_nonleaf_page` — binary search and child selection

Slot 1 is a dummy negative-infinity key whose only payload is a child pointer:

// btree_search_nonleaf_page -- src/storage/btree.c
key_cnt = btree_node_number_of_keys (thread_p, page_ptr);
if (key_cnt <= 0)  return ER_FAILED;        /* underflow guard */
if (key_cnt == 1)                            /* only the dummy neg-inf key -> follow its pointer */
  { *slot_id = 1;  *child_vpid = non_leaf_rec.pnt;  return NO_ERROR; }
left = 2;  right = key_cnt;                   /* <- start at 2: skip dummy slot 1, then narrow as §3.6 */
while (left <= right)
  {
    middle = CEIL_PTVDIV ((left + right), 2);
    c = btree_compare_key (key, &temp_key, btid->key_type, 1, 1, &start_col);
    if (c == DB_UNK) return ER_FAILED;        /* incomparable -> error */
    if (c == 0) { *slot_id = middle; *child_vpid = non_leaf_rec.pnt; return NO_ERROR; } /* exact separator hit */
    else if (c < 0) right = middle - 1; else left = middle + 1;   /* (+ remember start_col) */
  }

Post-loop (left > right): c < 0 → child is left of middle (*slot_id = middle - 1, re-read its pnt); c >= 0 → child is middle (*child_vpid already holds non_leaf_rec.pnt). page_bounds (the update_boundary_* calls) is NULL for the read descent; start_col is a midxkey skip cursor. The non-leaf search emits only slot_id/child_vpid, never the helper (btree_advance_and_find_key lends search_key->slotid as scratch).

3.6 `btree_search_leaf_page` — found, not-found, and fence-key skipping

At the leaf the post-conditions are richer; fence keys complicate equality:

// btree_search_leaf_page -- src/storage/btree.c
left = 1;  right = key_cnt;                  /* search_key zeroed to NOTFOUND/NULL_SLOTID/NO_FENCE first */
while (left <= right)                         /* middle = CEIL_PTVDIV((left+right),2); c = btree_compare_key(...) */
  {
    if (c == DB_UNK) { search_key->result = BTREE_KEY_NOTFOUND; ... return error; }
    if (c == DB_EQ)
      {
        if (btree_leaf_is_flaged (&rec, BTREE_LEAF_RECORD_FENCE))      /* matched a FENCE record */
          {
            search_key->has_fence_key = HAS_FENCE_KEY;
            if (middle == 1)        c = DB_GT;                          /* lower fence -> "go right" */
            else if (middle == key_cnt)                                /* upper fence -> key is past end */
              { search_key->result = BTREE_KEY_BIGGER; search_key->slotid = key_cnt; return NO_ERROR; }
          }
        else { search_key->result = BTREE_KEY_FOUND; search_key->slotid = middle; return NO_ERROR; }
      }
    if (c < 0) right = middle - 1; else left = middle + 1;      /* (narrow + remember start_col, as §3.5) */
  }

A fence key is a synthetic copy of a neighbor’s boundary that bounds prefix compression (companion’s ### Node layout); it must never be reported as a real match — hence the in-loop branches turn a lower-fence hit into “go right” (c = DB_GT) and an upper-fence hit into BTREE_KEY_BIGGER.

Post-loop (key absent) it flags has_fence_key if the last record was a fence, then sets the verdict per the §3.2 matrix: c < 0 → SMALLER if middle == 1 else BETWEEN, slotid = middle; c >= 0 → BIGGER if middle == key_cnt else BETWEEN, slotid = middle + 1. Per the header comment the key can equal the upper fence only on a scan resuming after an unfix/refix (Ch 7); an #ifndef NDEBUG block cross-checks via btree_leaf_is_key_between_min_max.

3.7 `btree_locate_key` — the single-key descent entry

This one-shot entry calls the engine with the default root function and btree_advance_and_find_key, passing reuse_btid_int = true so BTID_INT is not re-gleaned. Two branches: an error early-return (*leaf_page_out = NULL); and a success path setting *found_p = (search_key.result == BTREE_KEY_FOUND) and *slot_id = search_key.slotid, returning the leaf still latched (READ) — the building block for unique probes (Ch 5) and delete (Ch 8). *found_p is true only for BTREE_KEY_FOUND; the others give found = false with slot_id at the insertion point.

3.8 `btree_compare_key` — the comparison primitive

Every binary search bottoms out here (returning DB_LT/DB_EQ/DB_GT/DB_UNK):

// btree_compare_key -- src/storage/btree.c
if (DB_IS_NULL (key1))
  { if (DB_IS_NULL (key2)) { assert (false); return DB_UNK; } return DB_LT; }  /* NULL sorts low; both-NULL impossible */
if (DB_IS_NULL (key2)) return DB_GT;
if (dom_type == DB_TYPE_MIDXKEY)
  {
    c = pr_midxkey_compare (..., start_colp, &dummy_diff_column, dom_is_desc, NULL);
    if (dom_is_desc[0]) c = ((c == DB_GT) ? DB_LT : (c == DB_LT) ? DB_GT : c);   /* DESC flips leading col */
  }

Beyond the NULL and DESC handling shown, incomparable types or a collation mismatch yield DB_UNK — a hard error for every search. start_colp is the midxkey column-skip cursor (unused for single-column keys).

3.9 Chapter summary — key takeaways

One engine, three callbacks — btree_search_key_and_apply_functions runs fix-root → advance → process-leaf; operations differ only in callbacks (read: btree_get_root_with_key + btree_advance_and_find_key; write: split/merge advance).
BTREE_SEARCH_KEY_HELPER is the universal verdict — result, slotid (match or insertion point), has_fence_key.
BTID_INT loads exactly once on the first root fix; restarts re-fix with btid_int_p = NULL, first-try-only blocks guarded by is_first_try.
Descent is latch-coupled, optimistically READ — child fixed before parent released, escalating only where a node must change.
Restart is the exception — the write path splits pessimistically on descent (Ch 6); a restart happens only on ER_PAGE_LATCH_PROMOTE_FAIL, after which it runs WRITE-mode, preventing the latch-upgrade deadlock.
Non-leaf vs. leaf search differ in output — btree_search_nonleaf_page skips dummy slot 1 and emits (slot_id, child_vpid); btree_search_leaf_page emits a full helper and special-cases fence keys.
btree_compare_key is deterministic — NULL sorts low, DESC flips the leading column, incomparable types return DB_UNK (a hard error).

Chapter 4: Inserting a Brand New Key

This chapter answers one question: when the descent reaches a leaf and the key is not yet present, how is a fresh (key, first-OID) record built and spliced into the leaf, and how are the header counters and recovery records updated? We trace btree_insert_internal, how btree_split_node_and_advance hands control to the leaf, the dispatch in btree_key_insert_new_object, and then btree_key_insert_new_key in full. OID-append to an existing key is Chapter 5; the split branch is Chapter 6. For record layout, latch-coupling descent, and the unique-statistics model in the abstract, see the companion cubrid-btree.md (“Record encoding”, “Latch coupling”); we do not re-derive them. Structs were dissected in Chapter 1; here we touch only the helper fields this path mutates.

4.1 The orchestration entry point: `btree_insert_internal`

All three public insert entry points funnel into btree_insert_internal, a thin setup-and-dispatch layer that never touches a page. It builds one stack-local BTREE_INSERT_HELPER, picks the per-key function by purpose, then delegates the descent to btree_search_key_and_apply_functions. The purpose-to-function dispatch is the first branch point:

// btree_insert_internal -- src/storage/btree.c
switch (purpose) {
  case BTREE_OP_INSERT_UNDO_PHYSICAL_DELETE:
    LSA_COPY (&insert_helper.compensate_undo_nxlsa, undo_nxlsa);  [[fallthrough]];
  case BTREE_OP_INSERT_NEW_OBJECT:
    key_insert_func = btree_key_insert_new_object;  break;        /* <- our path */
  case BTREE_OP_INSERT_MVCC_DELID:
  case BTREE_OP_INSERT_MARK_DELETED:
    key_insert_func = btree_key_find_and_insert_delete_mvccid;  break;  /* <- logical delete */
  default: assert (false); return ER_FAILED;
}

For a brand-new key the purpose is BTREE_OP_INSERT_NEW_OBJECT. The NULL-key flag is computed before the switch and is decisive — a NULL key is never stored: insert_helper.is_null = key == NULL || DB_IS_NULL (key) || btree_multicol_key_is_null (key);. The work is one call threading three callbacks (root-fix, descend/split, leaf-process) through the generic walker btree_search_key_and_apply_functions. After it returns, three branches of post-processing run: (1) saved-lock release (SERVER_MODE) — any predecessor OID stashed in insert_helper.saved_locked_oid by unique-key locking is unlocked now, success or not; (2) error short-circuit, else bump PSTAT_BT_NUM_INSERTS and set *unique; (3) multi-update stats correction — for MULTI_ROW_UPDATE on a unique index with HA disabled, if the key was not added/deleted (!is_unique_key_added_or_deleted) the speculative descent delta is reverted (revert delete_key_and_row(), re-apply add_row()). That is the only place the speculative key-count delta is undone.

flowchart TD
  A["btree_insert public entry"] --> B["btree_insert_internal\nbuild BTREE_INSERT_HELPER"]
  B --> C{"purpose?"}
  C -->|INSERT_NEW_OBJECT / UNDO_PHYSICAL_DELETE| D["key_insert_func =\nbtree_key_insert_new_object"]
  C -->|MVCC_DELID / MARK_DELETED| E["key_insert_func =\nbtree_key_find_and_insert_delete_mvccid"]
  D --> F["btree_search_key_and_apply_functions"]
  E --> F
  F --> G["btree_fix_root_for_insert\nunique stats, NULL gate"]
  G --> H["btree_split_node_and_advance\ndescent + split"]
  H --> I["leaf reached -> key_insert_func"]
  I --> J["post: release lock, stats correction"]

Figure 4-1. btree_insert_internal orchestration and callback wiring.

4.2 The unique-check gate and the NULL-key early exit

btree_fix_root_for_insert runs once per descent (guarded by is_first_try). Beyond fixing the root it does unique-statistics accounting and computes key_len_in_page (the packed key length the Chapter 6 split math needs). Two gates here shape the new-key path.

The unique-stats gate keys on BTREE_IS_UNIQUE (btid_int->unique_pk) alone. A nearby macro looks like it should be this gate but is not:

// BTREE_NEED_UNIQUE_CHECK -- src/storage/btree.h  (defined but never referenced)
#define BTREE_NEED_UNIQUE_CHECK(thread_p, op) \
  (logtb_is_current_active (thread_p) \
   && (op == SINGLE_ROW_INSERT || op == MULTI_ROW_INSERT || op == SINGLE_ROW_UPDATE))

Its intended semantics — an active transaction (so rollback/recovery skip live checking) plus an insert-shaped op type — record the design intent, but no caller in src/ references it; it is dead code. The live stats bump is gated only on BTREE_IS_UNIQUE, with the increment chosen by NULL-ness:

// btree_fix_root_for_insert -- src/storage/btree.c
if (BTREE_IS_UNIQUE (btid_int->unique_pk))
  {
    btree_unique_stats incr;          /* ... MVCC-delete branch omitted ... */
    if (insert_helper->is_null)
      incr.insert_null_and_row ();     /* <- NULL: a null + a row, never a key */
    else
      incr.insert_key_and_row ();      /* <- speculative: assumes a new key WILL be added */
    if (BTREE_IS_MULTI_ROW_OP (insert_helper->op_type))
      (*insert_helper->unique_stats_info) += incr;                    /* multi-row sink */
    else if (!btree_is_online_index_loading (insert_helper->purpose))
      logtb_tran_update_unique_stats (thread_p, *btid, incr, true);   /* single-row tran stats */
  }

Invariant — speculative key count is corrected, never left wrong. Violation: every duplicate-key insert reports a phantom key, corrupting optimizer cardinality. insert_key_and_row() is applied before the descent knows the key exists; if the leaf already holds it (Chapter 5) the 4.1 multi-row correction reverts it, and single-row rollback reverses the logtb_tran_update_unique_stats delta.

The NULL-key early exit — a NULL key never reaches a leaf:

// btree_fix_root_for_insert -- src/storage/btree.c
if (insert_helper->is_null)
  {
    *stop = true;                            /* <- halt walker at root */
    pgbuf_unfix_and_init (thread_p, *root_page);
    return NO_ERROR;
  }

Stats were already counted, so the NULL is accounted but not stored. *stop makes the walker return without the leaf callback, so btree_key_insert_new_key is reached only for non-NULL keys — its own assert (key != NULL && ...) encodes that. Finally the function sets insert_helper->key_len_in_page = BTREE_GET_KEY_LEN_IN_PAGE (btree_get_disk_size_of_key (key)), possibly creating an overflow-key file when key_len >= BTREE_MAX_KEYLEN_INPAGE.

4.3 Reaching the leaf: `btree_split_node_and_advance`

The recursive driver is dissected fully in Chapter 6. For this chapter only its leaf-reaching tail matters: after ensuring room for a child split (Chapter 6) it tests the node type:

// btree_split_node_and_advance -- src/storage/btree.c
if (node_type == BTREE_LEAF_NODE)
  {
    assert (pgbuf_get_latch_mode (*crt_page) == PGBUF_LATCH_WRITE);   /* <- leaves always WRITE-latched */
    /* No other child. Notify called this is a leaf node and return the slot of new key. */
    error_code = btree_search_leaf_page (thread_p, btid_int, *crt_page, key, search_key);
    if (error_code != NO_ERROR)
      { ASSERT_ERROR (); goto error; }
    *is_leaf = true;     /* <- walker now calls key_insert_func instead of recursing */
    return NO_ERROR;
  }

When *is_leaf is set true, the walker stops recursing and invokes the key function on *crt_page. The crucial outputs of btree_search_leaf_page are search_key->slotid (the slot where the key belongs, by binary search) and search_key->result (BTREE_KEY_FOUND if present, else BTREE_KEY_NOTFOUND / BTREE_KEY_BETWEEN) — the pivot the next function branches on. The leaf is always WRITE-latched on arrival, so the leaf path never promotes.

4.4 Leaf dispatch: `btree_key_insert_new_object`

This leaf callback sets up recovery scaffolding then splits between new key (this chapter) and append (Chapter 5). First it stamps insert_helper->leaf_addr (offset = search_key->slotid, pgptr = *leaf_page, vfid) and prepares the undo keyval blob — the logical undo image (key + class OID + OID + MVCC info):

// btree_key_insert_new_object -- src/storage/btree.c
if (insert_helper->purpose == BTREE_OP_INSERT_NEW_OBJECT /* || ONLINE_INDEX_TRAN_INSERT ... */) {
    insert_helper->rcvindex =
      BTREE_MVCC_INFO_IS_INSID_NOT_ALL_VISIBLE (BTREE_INSERT_MVCC_INFO (insert_helper))
        ? RVBT_MVCC_INSERT_OBJECT : RVBT_NON_MVCC_INSERT_OBJECT;     /* <- MVCC vs non-MVCC */
    error_code = btree_rv_save_keyval_for_undo (btid_int, key, ..., &insert_helper->rv_keyval_data, ...);
}
else   /* BTREE_OP_INSERT_UNDO_PHYSICAL_DELETE */
  insert_helper->rcvindex = RVBT_RECORD_MODIFY_COMPENSATE;   /* <- compensation, no undo keyval */

Then the central branch — result != BTREE_KEY_FOUND is the new-key path of this chapter; the else falls through to the append paths of Chapter 5:

// btree_key_insert_new_object -- src/storage/btree.c
if (search_key->result != BTREE_KEY_FOUND)
  {                                          /* Key doesn't exist. Insert new key. */
    error_code = btree_key_insert_new_key (thread_p, btid_int, key, *leaf_page, insert_helper, search_key);
    /* ... free undo blob if reallocated, track time ... */ return NO_ERROR;
  }
/* Key was found. Append new object to existing key. */
if (BTREE_IS_UNIQUE (btid_int->unique_pk) && insert_helper->purpose == BTREE_OP_INSERT_NEW_OBJECT)
  {
    error_code = btree_key_lock_and_append_object_unique (..., restart, ...);   /* <- Chapter 5 */
    if (error_code != NO_ERROR) { ASSERT_ERROR (); goto error; }
    if (*restart == true) { goto exit; }   /* <- re-descend: leaf changed under the lock wait */
  }
else
  { /* btree_key_append_object_non_unique (...) -- Chapter 5 */ }

The found-key unique path may set *restart and goto exit to re-walk from the root: if btree_key_lock_and_append_object_unique waited on a lock and the leaf was reorganized meanwhile, it asks the walker to descend again — a fourth way to leave the dispatch, with mechanics deferred to Chapter 5. The append branches return through a shared exit:/error: epilogue that frees a heap-reallocated undo buffer and records the perf timer; the new-key branch does the equivalent cleanup inline and returns directly without reaching exit:. Only the result != BTREE_KEY_FOUND branch concerns us.

4.5 Building and splicing the cell: `btree_key_insert_new_key`

The heart of the chapter: build a leaf record holding the key and its single first OID, insert at search_key->slotid, update header counters, emit one undoredo log record.

4.5.1 — Mark added. insert_helper->is_unique_key_added_or_deleted = true; confirms the speculative +key from 4.2 was right (reached only when the key was not found), so no later correction fires.

4.5.2 — Overflow-key vs in-page key (three sub-branches).

// btree_key_insert_new_key -- src/storage/btree.c
key_len = btree_get_disk_size_of_key (key);
if (key_len >= BTREE_MAX_KEYLEN_INPAGE)
  {
    key_type = BTREE_OVERFLOW_KEY;
    log_sysop_start (thread_p);              /* <- (a) overflow key needs a nested system op */
    insert_helper->is_system_op_started = true;
  }
else
  {
    int diff_column = btree_node_get_common_prefix (thread_p, btid_int, leaf_page);
    if (diff_column > 0)                     /* <- (b) prefix compression */
      { new_key = &local_key; pr_clone_value (key, new_key); pr_midxkey_remove_prefix (new_key, diff_column); }
    else if (diff_column < 0)
      { ASSERT_ERROR (); return diff_column; }   /* <- (c) error: bad prefix, early return before any sysop */
  }

(a) oversized -> overflow key under a system op so the overflow-page allocation commits/aborts atomically; (b) common prefix > 0 -> clone into local_key and strip prefix columns; (c) diff_column < 0 -> propagate error (returns before any sysop, so no abort needed).

4.5.3 — Encode via btree_write_record into a REC_HOME descriptor:

// btree_key_insert_new_key -- src/storage/btree.c
record.type = REC_HOME;  record.data = PTR_ALIGN (data_buffer, MAX_ALIGNMENT);  record.area_size = DB_PAGESIZE;
error_code = btree_write_record (thread_p, btid_int, NULL /* no non-leaf node_rec */, new_key, BTREE_LEAF_NODE,
                                 key_type, key_len, false, BTREE_INSERT_CLASS_OID (insert_helper),
                                 BTREE_INSERT_OID (insert_helper), BTREE_INSERT_MVCC_INFO (insert_helper), &record);
if (new_key == &local_key) pr_clear_value (&local_key);    /* <- free cloned key on BOTH paths, before err check */
if (error_code != NO_ERROR) { ASSERT_ERROR (); goto error; }

For a leaf this writes the first OID, optionally a subclass OID (unique index + class differs from topclass, sets BTREE_LEAF_RECORD_CLASS_OID), MVCC info, then the packed key body (or a VPID to the overflow page). local_key is cleared before the error check, so the clone never leaks.

4.5.4 — Splice into the slot: spage_insert_at.

// btree_key_insert_new_key -- src/storage/btree.c
node_header = btree_get_node_header (thread_p, leaf_page);   /* ... NULL-check -> goto error ... */
FI_TEST (thread_p, FI_TEST_BTREE_MANAGER_RANDOM_EXIT, 0);
/* Nothing should fail after spage_insert_at! */
if (spage_insert_at (thread_p, leaf_page, search_key->slotid, &record) != SP_SUCCESS)
  { assert_release (false); error_code = ER_FAILED; goto error; }

Invariant — spage_insert_at is the point of no return. Violation: a half-applied insert with no clean rollback, since redo is emitted right after. The Chapter 6 split pass already reserved btree_get_max_new_data_size bytes, so SP_DOESNT_FIT cannot occur; the only remaining failure is a logic bug (assert_release).

Inside, spage_insert_at validates the slot id, then spage_find_empty_slot_at branches: at num_slots it appends via spage_add_new_slot; between existing keys it calls spage_take_slot_in_use, which slides the slot-directory entries from slot_id up by one to free it. Either branch bumps the page-header counters and spage_insert_data copies the bytes (Chapter 1). Inserting at search_key->slotid preserves sorted order.

4.5.5 — Update the node header: max_key_len and split_info.

// btree_key_insert_new_key -- src/storage/btree.c
key_cnt = btree_node_number_of_keys (thread_p, leaf_page);
key_len = BTREE_GET_KEY_LEN_IN_PAGE (key_len);
if (key_len > node_header->max_key_len)
  { update_max_key_length = true; node_header->max_key_len = key_len; }   /* <- redo must know too */
btree_split_next_pivot (&node_header->split_info, (float) search_key->slotid / key_cnt, key_cnt);

Key count is not written here — it is derived from the slot count, which spage_insert_at already incremented. What this function maintains explicitly:

max_key_len — grown only if the new key is longer. The local boolean update_max_key_length tells the redo encoding (4.5.6) to pack the new key length so recovery rebuilds the header identically. This is leaf growth; the ancestor chain was bumped during descent (Chapter 6).
split_info.pivot — btree_split_next_pivot folds slotid / key_cnt into a running average so a future split favours where inserts land (Chapter 6).

4.5.6 — Emit the recovery record into a local buffer (the helper’s smaller redo buffer may not hold a full record plus debug info):

// btree_key_insert_new_key -- src/storage/btree.c
LOG_RV_RECORD_SET_MODIFY_MODE (&insert_helper->leaf_addr, LOG_RV_RECORD_INSERT);   /* <- redo = "insert a slot" */
if (update_max_key_length)
  {
    rv_redo_data_ptr = or_pack_int (rv_redo_data_ptr, key_len);     /* <- replay the header growth */
    BTREE_RV_SET_UPDATE_MAX_KEY_LEN (&insert_helper->leaf_addr);
  }
memcpy (rv_redo_data_ptr, record.data, record.length);  rv_redo_data_ptr += record.length;
btree_rv_log_insert_object (thread_p, *insert_helper, insert_helper->leaf_addr, 0, rv_redo_data_length, NULL, rv_redo_data);

Redo data is an optional packed key_len (only when the header grew, gated by the BTREE_RV_SET_UPDATE_MAX_KEY_LEN address flag) followed by the raw record; the matching undo is the logical keyval blob from 4.4. btree_rv_log_insert_object chooses the log shape by purpose:

// btree_rv_log_insert_object -- src/storage/btree.c
if (insert_helper.is_system_op_started)
  log_append_undoredo_data (thread_p, RVBT_RECORD_MODIFY_UNDOREDO, &addr, undo_length, redo_length, undo_data, redo_data);
else switch (insert_helper.purpose)
  {
  case BTREE_OP_INSERT_NEW_OBJECT:    /* undo logical, redo physical */
    log_append_undoredo_data (thread_p, insert_helper.rcvindex, &addr, insert_helper.rv_keyval_data_length,
                              redo_length, insert_helper.rv_keyval_data, redo_data);
    break;    /* ... online-index and compensate cases ... */
  }

Invariant — undo is logical, redo is physical, for the normal new-key case. Violation: undo corrupts a reshaped page or leaks the overflow allocation. Rollback re-derives and physically removes the object rather than restoring raw bytes a concurrent thread may have reshaped. The exception: an overflow-key sysop (is_system_op_started) logs physical RVBT_RECORD_MODIFY_UNDOREDO bracketed by the sysop.

4.5.7 — Close the system op and finish.

// btree_key_insert_new_key -- src/storage/btree.c
if (insert_helper->is_system_op_started)
  btree_insert_sysop_end (thread_p, insert_helper);
pgbuf_set_dirty (thread_p, leaf_page, DONT_FREE);   /* <- keep latch, flag for flush */
return NO_ERROR;

error:
  if (insert_helper->is_system_op_started)
    { log_sysop_abort (thread_p); insert_helper->is_system_op_started = false; }   /* <- roll back overflow alloc */
  return error_code;

The error: label is the single cleanup point; its only real work is aborting a started sysop (overflow-key case). The point-of-no-return invariant guarantees no slotted-page change reachable from error: needs undoing.

flowchart TD
  A["btree_key_insert_new_key"] --> B["is_unique_key_added_or_deleted = true"]
  B --> C{"key_len >= MAX_KEYLEN_INPAGE?"}
  C -->|yes| D["key_type = OVERFLOW\nlog_sysop_start"]
  C -->|no| E{"common prefix?"}
  E -->|diff_column > 0| F["clone + strip prefix into local_key"]
  E -->|diff_column < 0| G["return error"]
  E -->|0| H["use key as-is"]
  D --> I["btree_write_record"]
  F --> I
  H --> I
  I -->|error| Z["goto error"]
  I -->|ok| J["spage_insert_at at slotid"]
  J -->|not SP_SUCCESS| Z
  J -->|SP_SUCCESS| K["update max_key_len if grown\nbtree_split_next_pivot"]
  K --> L["pack redo: optional key_len + record"]
  L --> M["btree_rv_log_insert_object\nundo logical / redo physical"]
  M --> N{"sysop started?"}
  N -->|yes| O["btree_insert_sysop_end"]
  N -->|no| P["pgbuf_set_dirty; return NO_ERROR"]
  O --> P
  Z --> Q{"sysop started?"}
  Q -->|yes| R["log_sysop_abort"]
  Q -->|no| S["return error_code"]
  R --> S

Figure 4-2. btree_key_insert_new_key branch-complete control flow.

4.6 Chapter summary — key takeaways

btree_insert_internal is pure setup: build the helper, select btree_key_insert_new_object, compute is_null, delegate to the walker; its only post-work is saved-lock release and the no-op-multi-update correction.
Unique stats are bumped optimistically in btree_fix_root_for_insert (insert_key_and_row) before the leaf knows if the key exists; truth is restored by the 4.1 correction or undo logs. NULL keys are counted (insert_null_and_row) but stop at root via *stop = true.
btree_split_node_and_advance reaches the leaf by setting *is_leaf = true after btree_search_leaf_page; the leaf is always WRITE-latched, and search_key->result is the branch pivot.
btree_key_insert_new_object has four exits: new-key insert, unique append (which may goto exit on *restart to re-descend, Chapter 5), non-unique append, and the error path.
btree_key_insert_new_key encodes the cell via btree_write_record through three sub-branches: oversized -> overflow key under a sysop, prefix-compressible -> clone-and-strip, plain -> as-is.
spage_insert_at is the point of no return (append vs spage_take_slot_in_use); the Chapter 6 split pass guarantees room, so SP_DOESNT_FIT would be a bug. Header maintenance is max_key_len (grown only if longer, echoed into redo via update_max_key_length) and split_info.pivot; key count is implicit in the slot count.
Recovery for the normal new-key case is undo logical, redo physical; an overflow-key sysop instead forces a physical RVBT_RECORD_MODIFY_UNDOREDO pair bracketed by the sysop.

Chapter 5: Appending an OID and Enforcing Uniqueness

Chapter 4 covered the first object for a key; this chapter handles the case where the key already exists, so a second OID must be threaded onto it. For a non-unique index that is a pure append; for a unique index it is a locking protocol that may end in ER_BTREE_UNIQUE_FAILED. Three sub-problems interleave: where the new OID goes (inline until a cap, then an overflow chain off LEAF_REC::ovfl); whether a unique append is legal (lock the current visible object, re-check under it); and what “visible” means (purpose-dependent). For the slotted-page layout, the key || OID encoding, and overflow files, see cubrid-btree.md §“Node layout” and §“Unique-key handling”.

5.1 The two append entry points

The leaf-key dispatcher routes on BTREE_IS_UNIQUE (btid_int->unique_pk): the non-unique path appends with no lock and no visibility check; the unique path locks the existing visible object, decides whether the insert violates uniqueness, then appends. Both end at btree_key_append_object_into_ovf when the inline list is full.

5.2 `btree_key_append_object_non_unique` — the bare append

// btree_key_append_object_non_unique -- src/storage/btree.c
n_objects = btree_record_get_num_oids (..., leaf_record, offset_after_key, BTREE_LEAF_NODE);
if (n_objects < BTREE_MAX_OIDCOUNT_IN_LEAF_RECORD (btid_int))   /* <- branch A: room inline */
  {
    btree_record_append_object (..., leaf_record, BTREE_LEAF_NODE, btree_obj, NULL, &..rv_redo_data_ptr);
    spage_update (...); return NO_ERROR;                  /* writes at tail; logical-undo + physical-redo log */
  }
BTREE_MVCC_INFO_SET_FIXED_SIZE (&btree_obj->mvcc_info);   /* <- branch B: overflow objects are fixed size */
error_code = btree_key_append_object_into_ovf (..., leaf_record, leaf_info, insert_helper, btree_obj);

The cap BTREE_MAX_OIDCOUNT_IN_LEAF_RECORD (btree_load.h) is BTREE_MAX_OIDLEN_INPAGE / BTREE_OBJECT_FIXED_SIZE. Branch A writes at the record tail (descent already guaranteed leaf room, so no split). Branch B forces fixed size — every overflow object carries both MVCCID slots so the page is binary-searchable (§5.9).

Invariant: an inline OID list never exceeds BTREE_MAX_OIDCOUNT_IN_LEAF_RECORD — enforced by the n_objects < n_objects_limit gate (and a symmetric assert (==) in §5.7). If violated, the record grows unbounded and breaks the split-point arithmetic that assumes a bounded record.

5.3 `btree_append_oid` and `btree_record_append_object`

btree_append_oid is the mechanical writer (also used by the loader): a bare 8-byte OID (OR_PUT_OID at rec->data + rec->length, then rec->length += OR_OID_SIZE), no MVCC/class/recovery data. The richer btree_record_append_object packs the full object; when the leaf record already carries BTREE_LEAF_RECORD_OVERFLOW_OIDS it first steps append_at back by DISK_VPID_ALIGNED_SIZE, saves the trailing VPID, packs the object, then re-appends the VPID:

// btree_record_append_object -- src/storage/btree.c
if (node_type == BTREE_LEAF_NODE && record->length > 0
    && btree_leaf_is_flaged (record, BTREE_LEAF_RECORD_OVERFLOW_OIDS))
  { append_at -= DISK_VPID_ALIGNED_SIZE; OR_GET_VPID (append_at, &ovf_vpid); }   /* <- preserve ovfl link */

So a new inline object is spliced in front of the trailing LEAF_REC::ovfl link, keeping the link at a fixed position (the tail).

5.4 The inline-to-overflow transition

btree_key_append_object_into_ovf reuses a partially-full chain page or starts a new one:

// btree_key_append_object_into_ovf -- src/storage/btree.c
btree_find_free_overflow_oids_page (..., &leaf_record_info->ovfl, &overflow_page);
if (overflow_page == NULL)                                  /* <- no free page in chain */
  error_code = btree_key_append_object_as_new_overflow (...);   /* allocate + prepend new page */
else                                                        /* <- a chain page has room */
  { btree_key_append_object_to_overflow (..., overflow_page, append_object, insert_helper);
    pgbuf_unfix_and_init (thread_p, overflow_page); }

The new-page branch calls btree_start_overflow_page to allocate and seed the page, then btree_leaf_record_change_overflow_link to splice it onto the leaf link and set BTREE_LEAF_RECORD_OVERFLOW_OIDS:

// btree_start_overflow_page -- src/storage/btree.c
*new_page_ptr = btree_get_new_page (thread_p, btid_int, new_vpid, near_vpid);
VPID_COPY (&ovf_header_info.next_vpid, first_overflow_vpid);   /* <- prepend: new page -> old head */
btree_init_overflow_header (...); spage_insert_at (..., *new_page_ptr, 1, &rec);   /* one object, slot 1 */
// ... redo-only log (RVBT_RECORD_MODIFY_NO_UNDO); undo handled logically by caller's sysop ...

It prepends (leaf -> new -> old head -> ...): the new page’s next_vpid points to the current head before the leaf link is rewritten.

Invariant: the overflow link and the BTREE_LEAF_RECORD_OVERFLOW_OIDS flag are set together, under a system operation. btree_key_append_object_as_new_overflow brackets allocation + link rewrite in log_sysop_start / btree_insert_sysop_end, so a crash between them is undone logically. If flag and link disagreed, btree_record_append_object would read a garbage VPID at the tail.

5.5 The unique protocol — `btree_key_lock_and_append_object_unique`

A unique index cannot just append: under MVCC a “duplicate” key may have its visible version already logically deleted, in which case a new version is legal. The protocol locks the current first object, then re-decides. It saves the leaf LSA, calls btree_key_find_and_lock_unique (§5.6) with the prior saved_locked_oid copied into find_unique_helper.locked_oid at X_LOCK, then walks these branches:

flowchart TB
  FAL["find_and_lock_unique"] --> E1{"error / *restart?"}
  E1 -- yes --> RET["return / restart"]
  E1 -- no --> NF{"result != KEY_FOUND\nor SERVER LSA split?"}
  NF -- yes --> NEWK["restart, or insert_new_key\nkey vacuumed away"]
  NF -- no --> FO{"found_object and\nINSERT_NEW_OBJECT?"}
  FO -- no --> AP["append_object_unique"]
  FO -- yes --> MU{"!multi_update or\nnum_visible > 1?"}
  MU -- yes --> UV["unique violation"]
  MU -- no --> HA{"is_ha_enabled?"}
  HA -- yes --> UV
  HA -- no --> AP

Figure 5-1 — Branch-complete flow of btree_key_lock_and_append_object_unique (the !found_object RR-isolation branch also reaches unique violation).

The branch worth quoting is the violation decision:

// btree_key_lock_and_append_object_unique -- src/storage/btree.c
if (find_unique_helper.found_object && insert_helper->purpose == BTREE_OP_INSERT_NEW_OBJECT) {
    if (insert_helper->is_unique_multi_update)
      // ... btree_get_num_visible_from_leaf_and_ovf with dirty snapshot -> num_visible ...
    if (!insert_helper->is_unique_multi_update || num_visible > 1) {   /* <- real violation */
        if (prm_get_bool_value (PRM_ID_UNIQUE_ERROR_KEY_VALUE)) return ER_UNIQUE_VIOLATION_WITHKEY;
        BTREE_SET_UNIQUE_VIOLATION_ERROR (...); return ER_BTREE_UNIQUE_FAILED; }
    else if (insert_helper->is_ha_enabled) return ER_REPL_MULTI_UPDATE_UNIQUE_VIOLATION;  /* HA strictness */
    insert_helper->is_unique_key_added_or_deleted = false;          /* <- multi-update allowed */
    OID_SET_NULL (&insert_helper->saved_locked_oid);                /* keep lock on relocated object */
} else
  insert_helper->is_unique_key_added_or_deleted = true;             /* all deleted -> a new key */

A live visible object plus a normal insert is a violation — ER_BTREE_UNIQUE_FAILED, or ER_UNIQUE_VIOLATION_WITHKEY when PRM_ID_UNIQUE_ERROR_KEY_VALUE wants the key echoed. The multi-update exception tolerates a transient second version only when num_visible <= 1 under a dirty snapshot; under HA even that becomes ER_REPL_MULTI_UPDATE_UNIQUE_VIOLATION. The RR branch (a !found_object whose repeatable-read snapshot still counts a visible object) raises the same error pair. When cleared, the path saves two objects for undo — but only for an MVCC insert (rcvindex == RVBT_MVCC_INSERT_OBJECT) whose key either gained no new visible object (!is_unique_key_added_or_deleted) or whose first object was deleted by this same transaction (its delete MVCCID equals our insert MVCCID). Then btree_rv_save_keyval_for_undo_two_objects packs both versions and rcvindex switches to RVBT_MVCC_INSERT_OBJECT_UNQ, so undo restores the displaced object.

Invariant: at most one visible object per key in a unique index (two transiently during multi-row update, never under HA) — enforced by the num_visible > 1 and is_ha_enabled branches. Violating it lets two committed rows share a unique key.

5.6 `btree_key_find_and_lock_unique` and the lock loop

btree_key_find_and_lock_unique dispatches on BTREE_IS_UNIQUE: _of_unique inspects only the first object (btree_leaf_get_first_object); _of_non_unique serves system-catalog indexes declared unique but physically storing an OID list. The hot path:

// btree_key_find_and_lock_unique_of_unique -- src/storage/btree.c
if (search_key->result != BTREE_KEY_FOUND) goto error_or_not_found;   /* key gone */
while (true) {
    btree_leaf_get_first_object (btid_int, &record, &unique_oid, &unique_class_oid, &mvcc_info);
    if (class filter mismatch) goto error_or_not_found;               /* partition-scoped lookup */
    if (locked_oid set and != unique_oid)                             /* first object changed */
      { lock_unlock_object_donot_move_to_non2pl (...); OID_SET_NULL (&locked_oid); }
    switch (mvcc_satisfies_delete (thread_p, &mvcc_header)) {
      case INSERT_IN_PROGRESS: case DELETE_IN_PROGRESS:   /* SA: assert_release(false); SERVER: fallthrough */
      case CAN_DELETE:         /* already locked -> found; else cond-lock; on fail unfix+uncond, *restart/break */
      case DELETED: case SELF_DELETED: goto error_or_not_found;       /* gone -> caller may insert */
      default: assert_release (false); error_code = ER_FAILED; goto error_or_not_found;
    } }

_of_non_unique differs materially: the record holds a list, so it scans the whole leaf record then every overflow page rather than stopping at the first object —

// btree_key_find_and_lock_unique_of_non_unique -- src/storage/btree.c
while (true) {
    if (start_reading_leaf_record) { /* read leaf record; first object via btree_or_get_object */ }
    else if (buf.ptr == buf.endptr) {                     /* <- exhausted current record */
        if (VPID_ISNULL (&next_overflow_vpid)) goto error_or_not_found;   /* end of chain: not found */
        overflow_page = pgbuf_fix (..., &next_overflow_vpid, OLD_PAGE, PGBUF_LATCH_READ, ...);
        if (prev_overflow_page) pgbuf_unfix_and_init (thread_p, prev_overflow_page);
        btree_get_next_overflow_vpid (..., &next_overflow_vpid);          /* step the chain */
    }
    btree_or_get_object (&buf, ..., &unique_oid, ...);    /* next object */
    if (class mismatch) continue;                         /* <- advance, NOT not-found */
    switch (mvcc_satisfies_delete (...)) { /* same 5 cases as _of_unique */ } }

Unlike _of_unique, a class mismatch does continue (advance to the next list object) and a deleted object keeps scanning; not-found is reported only when buf.ptr == buf.endptr and VPID_ISNULL(next_overflow_vpid). Pages are fixed PGBUF_LATCH_READ, with the prev/current pair unfixed as the scan steps.

Invariant: the first object of a unique leaf record is the newest visible version, and it is the one that gets locked. Maintained by §5.5 plus btree_leaf_change_first_object (§5.8); _of_unique depends on it. A stale first version would let an inserter lock the wrong object and miss a real conflict.

BTREE_FIND_UNIQUE_HELPER threads through these functions:

Field	Role	Why it exists
`oid`	OID of the found object (output)	Which object was visible
`match_class_oid`	Only this class; NULL = any	Partition-scoped lookup; mismatch = not-found
`lock_mode`	Lock to take; insert = `X_LOCK`, FK = `S_LOCK`	Separates read-only find from insert
`snapshot`	MVCC filter; NULL = none	FK/find set it; locking insert leaves NULL
`found_object`	True iff found and locked	Success flag (§5.5)
`time_track`	`PERF_UTIME_TRACKER` for `PSTAT_BT_FIND_UNIQUE*`	Per-phase counters
`locked_oid` (SERVER_MODE)	Object currently locked here	Survives unfix/refix; detects a changed first object
`locked_class_oid` (SERVER_MODE)	Class OID of `locked_oid`	Needed by `lock_unlock_object_*`

locked_oid/locked_class_oid are absent in SA_MODE. Insert (unique) uses X_LOCK + NULL snapshot (a live object drives the violation decision); FK/find uses S_LOCK + a snapshot (a visible object exists, no exclusion).

5.7 `btree_key_append_object_unique` — keeping the newest first

The new object must become the first object, so the current first is relocated to the tail:

// btree_key_append_object_unique -- src/storage/btree.c
if (btree_record_get_num_oids (...) >= BTREE_MAX_OIDCOUNT_IN_LEAF_RECORD (btid_int))
  error_code = btree_key_relocate_last_into_ovf (...);   /* full: evict LAST inline object first */
BTREE_MVCC_INFO_SET_FIXED_SIZE (&first_object->mvcc_info);
btree_record_append_object (..., first_object, NULL, &..rv_redo_data_ptr);  /* append OLD first at tail */
btree_leaf_change_first_object (..., BTREE_INSERT_OID (insert_helper), ...);  /* overwrite slot zero */
spage_update (...);
// log_append_undoredo_data with insert_helper->rcvindex (MVCC / NON_MVCC / *_UNQ); logical undo.

When full, the last inline object is evicted to overflow by btree_key_relocate_last_into_ovf under its own sysop; with room, relocation is skipped. The two-step split is deliberate — the logging system cannot mix a sysop with logical undo, so step 1 uses a sysop with physical undo/redo and step 2 uses logical undo.

btree_key_relocate_last_into_ovf opens a log_sysop_start, pushes the last object via btree_key_append_object_into_ovf, removes it from the leaf with btree_record_remove_last_object, and logs RVBT_RECORD_MODIFY_UNDOREDO. Its subtle branch is a re-read:

// btree_key_relocate_last_into_ovf -- src/storage/btree.c
if (VPID_ISNULL (&leaf_record_info->ovfl) && btree_leaf_is_flaged (leaf_record, BTREE_LEAF_RECORD_OVERFLOW_OIDS))
  { btree_read_record (...); btree_record_get_last_object (...); }   /* <- first ovf page just added: re-read */

The overflow append may itself have created the first overflow page, so btree_leaf_record_change_overflow_link rewrote the leaf record and the cached offset_to_last_object is stale — recompute it before btree_record_remove_last_object or the wrong bytes get removed.

5.8 `btree_leaf_change_first_object` — overwriting slot zero

This primitive replaces the in-record first object, growing or shrinking the record as the object’s size class changes:

// btree_leaf_change_first_object -- src/storage/btree.c
if (old_rec_flag & BTREE_LEAF_RECORD_OVERFLOW_OIDS) {     /* <- overflow -> FORCE fixed size */
    old_object_size += 2 * OR_MVCCID_SIZE; new_object_size += 2 * OR_MVCCID_SIZE;
    BTREE_MVCC_INFO_SET_FIXED_SIZE (mvcc_info);
    if (BTREE_IS_UNIQUE (...) && !new_has_class_oid)       /* unique + overflow forces a class OID too */
      { new_rec_flag |= BTREE_LEAF_RECORD_CLASS_OID; new_object_size += OR_OID_SIZE; } }
// else: no overflow -> variable size. Then RECORD_MOVE_DATA(recp, new_object_size, old_object_size).

With no overflow the first object is variable-size; with overflow it is forced fixed-size plus (for unique) a class OID, so btree_leaf_get_first_object and the overflow binary search agree on a uniform width. RECORD_MOVE_DATA slides the key and inline OIDs by (new - old), reported through *key_offset.

Invariant: when a leaf record has an overflow chain, its first object is fixed-width (and, if unique, includes a class OID) — enforced here and by BTREE_MVCC_INFO_SET_FIXED_SIZE in the appenders. If violated, btree_find_oid_from_ovfl’s binary search (dividing by BTREE_OBJECT_FIXED_SIZE) reads mid-object garbage.

5.9 Locating an exact `(key, OID)` — the find-OID family

Both the unique check and the delete path ask “which page holds this exact (key, OID) under this purpose’s visibility?” btree_find_oid_and_its_page scans the leaf linearly (btree_find_oid_from_leaf, variable-width, skipping the packed key after object 0), then binary-searches each overflow page (btree_find_oid_from_ovfl, fixed-width, OID-sorted):

// btree_find_oid_and_its_page -- src/storage/btree.c
btree_find_oid_from_leaf (..., oid, match_mvccinfo, purpose, offset_to_object, ...);
if (*offset_to_object != NOT_FOUND) { *found_page = leaf_page; return NO_ERROR; }   /* in leaf */
if (VPID_ISNULL (&leaf_rec_info->ovfl)) return NO_ERROR;                            /* no chain: not found */
do { overflow_page = pgbuf_fix (..., PGBUF_LATCH_WRITE, ...);   /* walk chain, tracking *prev_page */
     btree_find_oid_from_ovfl (..., overflow_page, oid, purpose, match_mvccinfo, offset_to_object, ...);
} while (*offset_to_object == NOT_FOUND && !VPID_ISNULL (&overflow_vpid));   /* step via next_overflow_vpid */
// found: *found_page = overflow_page; else unfix all + not found

Pages are fixed PGBUF_LATCH_WRITE (mutating callers); prev_page is tracked for relinking after removal; all pages are unfixed on the not-found and error paths. The MVCC gate in both finders is btree_find_oid_does_mvcc_info_match, purpose-dependent:

`BTREE_OP_PURPOSE`	Match rule
`DELETE_VACUUM_INSID`	insert MVCCID equals `match_mvccinfo->insert_mvccid`
`DELETE_VACUUM_OBJECT`, `DELETE_OBJECT_PHYSICAL_POSTPONED`	delete MVCCID equals `match_mvccinfo->delete_mvccid` (avoids a reused OID)
`DELETE_UNDO_INSERT`, `DELETE_UNDO_INSERT_UNQ_MULTIUPD`	insert MVCCID matches the one being rolled back
`DELETE_UNDO_INSERT_DELID`	delete MVCCID matches; insert MVCCID cross-checked only if also set
`DELETE_OBJECT_PHYSICAL`, `INSERT_MVCC_DELID`, `INSERT_MARK_DELETED`, default	object has no valid delete MVCCID (alive)

The purpose selects which MVCC field decides a match, so one finder serves insert, delete, and vacuum. The default branch (“no valid delete MVCCID”) drives unique-insert and physical-delete, so a reused OID whose old version is deleted-but-unvacuumed is correctly skipped.

5.10 Chapter summary — key takeaways

Two entry points on BTREE_IS_UNIQUE: btree_key_append_object_non_unique appends blindly; btree_key_lock_and_append_object_unique locks-then-verifies.
Inline until BTREE_MAX_OIDCOUNT_IN_LEAF_RECORD, then overflow via btree_key_append_object_into_ovf — reuse a chain page or prepend a new one (btree_start_overflow_page; flag + link set under a sysop).
Unique = lock first, then decide. _of_unique X-locks the first object; _of_non_unique scans the whole list. A live visible object means ER_BTREE_UNIQUE_FAILED, except a MULTI-ROW-UPDATE tolerates one transient extra — never under HA, never two.
Newest visible stays first — btree_key_append_object_unique relocates the old first object to the tail (evicting the last inline object to overflow first if full), then btree_leaf_change_first_object.
Overflow forces fixed-width objects (both MVCCID slots, class OID for unique) — the precondition for btree_find_oid_from_ovfl’s binary search.
One finder, many purposes — btree_find_oid_does_mvcc_info_match makes the same (key, OID) match or miss by BTREE_OP_PURPOSE, so insert/delete/undo/vacuum reuse btree_find_oid_and_its_page.
Restart and re-read are first-class: *restart bounces to root, the LSA recheck forces a split restart, and the “first overflow page just created” re-read keeps btree_key_relocate_last_into_ovf honest.

Chapter 6: Node Split and Separator Promotion

When the target leaf — or any ancestor on the descent path — has no room, CUBRID grows the tree via pessimistic split-on-descent, making growth safe without ever restarting from the root for capacity reasons. The companion (cubrid-btree.md, §“Descent” and Figures 2–3) covers why CUBRID splits eagerly under a system op; this chapter dissects how, branch by branch. Chapter 3 covered the descent calling the function below; Chapter 4 the leaf insert once room is guaranteed.

6.1 The pessimistic contract: split before you advance

The descent driver btree_search_key_and_apply_functions calls one advance function per level; for inserts it is btree_split_node_and_advance.

INVARIANT — “the child you advance into always has room.” Before returning a child page, btree_split_node_and_advance verifies the child can absorb the worst-case new entry, splitting it first if not. A chain split (a split forcing its parent to split, which forces its…) can never happen: when control reaches a node, its parent was guaranteed one free slot on the previous iteration. The leaf, visited last, is thus guaranteed room by induction.

It enforces this by measuring, at every level, the largest entry the operation could add against free space:

// btree_split_node_and_advance -- src/storage/btree.c
max_new_data_size = btree_get_max_new_data_size (..., node_type, max_key_len, insert_helper, false /*known*/);
need_split = max_new_data_size > spage_get_free_space_without_saving (thread_p, child_page, NULL);

max_key_len is insert_helper->key_len_in_page when need_update_max_key_len, else node_header->max_key_len. known_to_be_found = false: descent does not yet know whether the key exists, so it always reserves for a brand-new entry — the pessimistic estimate.

6.2 `btree_get_max_new_data_size` — the worst-case estimator

An under-estimate would defeat the §6.1 invariant. The body is a node-type check then a switch (helper->purpose):

Branch	Returned size	Why
Non-leaf node	`NON_LEAF_ENTRY_MAX_SIZE(key_len) + SPAGE_SLOT_SIZE`	A child split promotes one separator here; slot directory grows by one.
Leaf, key found	`BTREE_OBJECT_FIXED_SIZE`	Append an OID / upgrade first object / add an overflow link (Ch 5). No new slot.
Leaf, key not found	`LEAF_ENTRY_MAX_SIZE(key_len) + SPAGE_SLOT_SIZE`	Brand-new key record plus its slot (Ch 4).
`INSERT_MVCC_DELID` / `MARK_DELETED`	`OR_MVCCID_SIZE`	Logical delete stamps a delete MVCCID into an existing object (Ch 8).
default	`assert_release(false)`, `ER_FAILED`	Unhandled purpose — a programming error.

Descent passes known_to_be_found = false for leaves, so the leaf estimate is the full new-key cost — largest of the insert cases.

6.3 `btree_node_split_info` — the skew accumulator

Every node header carries a BTREE_NODE_SPLIT_INFO that biases consecutive splits toward the hot side — the optimization for monotonic (auto-increment) key streams, where a 50/50 split would leave the left half permanently half-empty.

// btree_node_split_info -- src/storage/storage_common.h
struct btree_node_split_info
{
  float pivot;   /* pivot = split_slot_id / num_keys */
  int index;     /* number of key insert after node split */
};

Field	Role	Why it exists
`pivot`	Running average in `[0,1]` of where past splits landed. `0.5` balanced; near `1.0` right-skewed (ascending); near `0.0` left-skewed.	Lets `btree_split_find_pivot` aim the next split at the hot side instead of the geometric middle, so the cold half is not wasted.
`index`	Count of splits folded into the average (capped at key count); the CMA divisor.	Damps the average so one outlier insert cannot swing the pivot wildly.

btree_write_default_split_info seeds a fresh node with pivot = BTREE_SPLIT_DEFAULT_PIVOT (0.5), index = 1 — start balanced.

btree_split_find_pivot turns the average into a target byte size, two branches:

Balanced — pivot == 0 (uninit) or BTREE_SPLIT_LOWER_BOUND < pivot < BTREE_SPLIT_UPPER_BOUND: aim for CEIL_PTVDIV(total, 2); the [0.20,0.80] dead-band stops mild noise from skewing.
Skewed — pivot ≤ 0.20 or ≥ 0.80: aim for (int)(total * pivot), clamped to [0.05,0.95] (MIN_PIVOT/MAX_PIVOT) so ≥5% always lands each side.

btree_split_next_pivot folds the new split in: index = MIN(index+1, max_index); if pivot == 0 take new_value directly, else new_pivot = pivot + (new_value - pivot)/index (CMA), clamped to [0,1]. new_value = p_slot_id / key_cnt is where the just-promoted separator landed; ascending inserts feed values near 1.0, dragging pivot upward so future splits keep the dense right side and ship the sparse left half to the sibling.

6.4 `btree_find_split_point` — the size-balanced split point

btree_split_node/btree_split_root delegate the where to btree_find_split_point, which returns the separator key and writes *mid_slot = last slot kept on the left. It is size-driven (spage_get_space_for_record bytes, not key counts, so fat keys split correctly). Three rules from its source comment: (1) get close to mid_size, (2) leave both sides room for the new entry and fences, (3) both sides keep ≥1 non-fence record. Three budgets back them: left_max_size/right_max_size = BTREE_NODE_MAX_SPLIT_SIZE - new_ent_size - (leaf only) new_fence_size; mid_size = btree_split_find_pivot(tot_rec, …); left_min_size = tot_rec - right_max_size.

The loop walks slots left to right accumulating left_size, charging each real record’s bytes — except at the leaf insert slot, where it charges new_ent_size (absent key) or existing + new_ent_size (present key, append grows in place). Per record:

while left_size < left_min_size (right still over budget): continue;
if (left_size > MIN(left_max_size, mid_size)) { *mid_slot = i-1; break; } — cut one slot before overflow;
if (i == stop_at) { *mid_slot = i; break; } — whole page minus fences stays left.

Two post-loop guards enforce rule #3: *mid_slot == start_with - 1 (left empty) → ++; *mid_slot == stop_at (right empty) → --; each gated by node-type / is_key_added_to_left / found.

Fence-key cost. For leaves the budget pre-subtracts new_fence_size = LEAF_FENCE_MAX_SIZE(max_key_len)+SPAGE_SLOT_SIZE from both sides: the split adds an upper fence to the left leaf and a lower fence to the right (a separator copy marked BTREE_LEAF_RECORD_FENCE letting range scans detect boundaries cheaply — Ch 7). The estimate stops the node overflowing when the fence materializes.

Separator selection. With *mid_slot fixed, it reads mid_key (last left) and next_key (first right). A leaf computes the shortest distinguishing prefix via btree_get_prefix_separator — unless either is an overflow key, where it clones next_key whole (a prefix could exceed the in-page max). A non-leaf clones next_key.

INVARIANT — “the separator strictly separates.” Every left-node key is < separator ≤ every right-node key. btree_get_prefix_separator(mid_key, next_key) returns the smallest value > mid_key and ≤ next_key. Violating it routes searches to the wrong subtree.

6.5 `btree_split_node` — splitting a non-root node into Q and R

btree_split_node takes parent P, the full node Q, a freshly allocated sibling R, the P-slot pointing at Q, and the descent key (all three pages already write-latched, asserted up front). Q keeps [1..leftcnt] + upper fence, R gets lower fence + [leftcnt+1..key_cnt], P gains a sep → R slot, and (leaf) Q.next → R.

Figure 6-1 — btree_split_node leaf-chain relink

flowchart LR
  Q["Q: [1..leftcnt]"]
  R["R: [leftcnt+1..]"]
  N["old Q.next"]
  Q -->|"next_vpid"| R
  R -->|"next_vpid"| N
  R -->|"prev_vpid"| Q
  N -->|"prev_vpid (step 8)"| R

Branch-complete walkthrough:

Find the cut. key_cnt = btree_node_number_of_keys(Q); ≤ 0 → goto exit_on_error. btree_find_split_point → sep_key, leftcnt; NULL/DB_NULL sep_key → log and error out. rightcnt = key_cnt - leftcnt.
Fence record (leaf only). If leaf and sep_key_len < BTREE_MAX_KEYLEN_INPAGE && ≤ qheader->max_key_len, write a leaf record from sep_key, set BTREE_LEAF_RECORD_FENCE, flag_fence_insert = true. An overflow-key separator, or PRM_ID_USE_BTREE_FENCE_KEY = false, forces flag_fence_insert = false.
Undo-log Q, set headers. log_append_undo_data2(RVBT_COPYPAGE, …) snapshots the entire old Q for byte-for-byte rollback. Save right_next_vpid = qheader->next_vpid. Sibling links exist only at the leaf level: leaves set qheader->next_vpid = *R_vpid, rheader->prev_vpid = *Q_vpid; non-leaf Q next_vpid is nulled (VPID_SET_NULL) and R’s link fields keep btree_init_node_header defaults (which redo-logs R’s header). R always inherits rheader->next_vpid = right_next_vpid.
Move the upper half, fence Q. Optional lower fence at R slot 1, then for i = 1..rightcnt the moving target stays at leftcnt+1: peek slot leftcnt+1 of Q → spage_insert_at(R, j, …) → spage_delete(Q, leftcnt+1). If flag_fence_insert, Q’s upper fence goes in at leftcnt+1. btree_compress_node(Q), then redo-log Q (RVBT_COPYPAGE).
Redo-log R. btree_compress_node(R), then redo-log R (RVBT_COPYPAGE). R has no undo log — newly allocated in the system op, abort just deallocates it.
Promote separator into P. nleaf_rec.pnt = *R_vpid; key_type = BTREE_OVERFLOW_KEY/key_len = -1 when too long, else BTREE_NORMAL_KEY. p_slot_id++ (insert after the slot pointing at Q), spage_insert_at(P, p_slot_id, &rec), undo/redo RVBT_NDRECORD_INS. P’s header is undo-logged, btree_split_next_pivot(&pheader->split_info, (float) p_slot_id / key_cnt, key_cnt) folds the landing slot in, max_key_len bumped, header redo-logged.
Pick the child. btree_compare_key(key, sep_key, …): DB_UNK → assertion failure; < 0 → *advance_vpid = *Q_vpid; else *R_vpid.
Relink the leaf neighbor. Only when R is a leaf: the page that used to follow Q gets its prev_vpid re-pointed at R via btree_get_next_page(R) + btree_set_vpid_previous_vpid (§6.7).
exit_on_error frees sep_key and returns; inside the caller’s system op, log_sysop_abort reverses every page change.

INVARIANT — “the doubly-linked leaf chain stays consistent.” After a leaf split all four links hold: Q.next = R, R.prev = Q, R.next = old-next, old-next.prev = R (step 3 sets the first three, step 8 the fourth). Skipping step 8 leaves a leaf whose prev_vpid points at the now-too-full Q, corrupting backward range scans.

6.6 `btree_split_root` — the height-growth special case

The root must keep its page identity (it is named in the index catalog), so it cannot spawn one sibling. btree_split_root allocates two children Q and R, moves the halves into them, and rewrites root P to two separators (level L → L+1, slot 1 = neg-inf → Q, slot 2 = sep → R). Contrasting with §6.5:

Undo-log the whole root (RVBT_COPYPAGE); read key_cnt (≤0 → error); snapshot split_info, set split_info.index = 1.
Find split point & fence as in §6.5. neg_inf_key = sep_key — the separator is reused as the dummy negative-infinity key for the left pointer.
Grow the level. pheader->node.node_level++, bump max_key_len, reset root split-info to default. Children get node_level = new_level - 1 and inherit max_key_len/split_info. Same leaf-only sibling-link rule as §6.5 step 3.
Fill R (upper half). Optional lower fence, then move rightcnt records (peek/insert/delete leftcnt+1 of P). Redo-log R RVBT_INS_PGRECORDS (R is new).
Fill Q (lower half). Move leftcnt records by peeking slot 1 of P and deleting it (P shrinks from the front). If a fence is wanted, append Q’s upper fence; else i--. Redo-log Q RVBT_INS_PGRECORDS.
Rebuild root P. Redo-log a delete-all (RVBT_DEL_PGRECORDS, rec_cnt = key_cnt), then write slot 1 = neg_inf_key → Q_vpid, slot 2 = sep_key → R_vpid (each RVBT_NDRECORD_INS; BTREE_OVERFLOW_KEY/key_len = -1 when too long).
Pick the child. Same btree_compare_key three-way branch: DB_UNK → error; < 0 → Q; else R.
exit_on_error frees sep_key; the system op aborts, deallocates Q and R, and the RVBT_COPYPAGE undo restores P.

INVARIANT — “the root never changes page identity, and the leftmost separator is negative infinity.” btree_split_root rewrites P in place (no new VPID) and always installs a dummy neg_inf_key in slot 1. btree_compare_key never compares against slot 1 — searches < sep_key route to Q unconditionally — so the leftmost pointer covers all keys below the real separator. Delete-all + two inserts (not a page copy) keeps P’s identity stable across redo.

Logging asymmetry: btree_split_node uses whole-page RVBT_COPYPAGE for Q and R; btree_split_root uses record-set logging (RVBT_INS_PGRECORDS / RVBT_DEL_PGRECORDS / RVBT_NDRECORD_INS). Both share one system-op bracket, so abort is uniform.

6.7 `btree_set_vpid_previous_vpid` — the back-link rewrite

The smallest piece carries its own undo/redo so a split rollback restores the neighbor’s prev_vpid:

// btree_set_vpid_previous_vpid -- src/storage/btree.c
if (page_p == NULL) return NO_ERROR;                 /* <- chain tail, no-op */
header = btree_get_node_header (thread_p, page_p);
if (header == NULL) return ER_FAILED;
btree_node_header_undo_log (thread_p, &btid->sys_btid->vfid, page_p);
header->prev_vpid = *prev;                            /* <- repoint backward link, then redo-log */

Three branches: NULL page (chain tail) → no-op success; header read failure → ER_FAILED; else back-link rewritten with full undo/redo and pgbuf_set_dirty. Reused by leaf merge (Ch 9), hence standalone.

6.8 `btree_split_node_and_advance` — the full branch map

The advance function is more than the need_split block. With node_type, need_split, and need_update_max_key_len computed (§6.1), it threads through these branches before returning. A root restart below means: on ER_PAGE_LATCH_PROMOTE_FAIL, set nonleaf_latch_mode = WRITE, *restart = true, unfix child+parent, is_crt_node_write_latched = false, return NO_ERROR (other promote errors → goto error). The nine branch entries, in source order:

Root dispatch (insert_helper->is_root): the root path mirrors the non-root flow in a parallel block (its own sysop, promotion, advance branches), but splits via btree_split_root (two new pages, §6.6) not btree_split_node.
*crt_page promotion (need_split && nonleaf_latch_mode == READ && !is_crt_node_write_latched): promote the parent read→write (PGBUF_PROMOTE_ONLY_READER); on fail, root restart.
child promotion ((need_split || need_update_max_key_len) && nonleaf_latch_mode == READ && !need_update_max_key_len && !is_child_leaf): promote the child (PGBUF_PROMOTE_SHARED_READER); on fail, root restart.
need_update_max_key_len (rule parent->max_key_len >= child->max_key_len violated): set child max_key_len = key_len_in_page, btree_node_header_redo_log + pgbuf_set_dirty, propagate need_update_max_key_len = true, is_crt_node_write_latched = true so every descendant also updates.
Split + advance-to-child (need_split, VPID_EQ(&advance_vpid, &child_vpid)): unfix new_page1, log_sysop_commit, *advance_to_page = child_page, is_crt_node_write_latched = true, return.
Split + advance-to-new-page (need_split, VPID_EQ(&advance_vpid, &new_page_vpid1)): unfix child_page, log_sysop_commit, *advance_to_page = new_page1, is_crt_node_write_latched = true, return.
Split + impossible (need_split, neither VPID matches): assert_release(false), unfix all three, error_code = ER_FAILED, goto error.
No-split fall-through (!need_split && !need_update_max_key_len && nonleaf_latch_mode == READ): read-latched and untouched, so is_crt_node_write_latched = false, *advance_to_page = child_page, return.
error: label (any goto error): unfix new_page1/new_page2 first, then log_sysop_abort if a system op is open, then unfix child_page.

The two promotion early returns are the only paths that restart from the root, and they restart on latch contention, never capacity. The split bracket is opened by log_sysop_start (is_system_op_started = true) inside the need_split block, closed by log_sysop_commit per advance branch or by the error: label:

// btree_split_node_and_advance -- src/storage/btree.c (error: label)
if (new_page1 != NULL) pgbuf_unfix_and_init (thread_p, new_page1);
if (new_page2 != NULL) pgbuf_unfix_and_init (thread_p, new_page2);
if (is_system_op_started)
  log_sysop_abort (thread_p);           /* <- unfix new pages FIRST so abort can deallocate them */
if (child_page != NULL) pgbuf_unfix_and_init (thread_p, child_page);

The ordering is load-bearing: new pages must be unfixed before log_sysop_abort, because abort deallocates them and the fix count must be zero.

6.9 Chapter summary — key takeaways

Pessimistic split-on-descent splits any node that might not fit the worst-case entry before advancing into it, so the leaf is always guaranteed room and chain splits are structurally impossible.
btree_get_max_new_data_size is the estimator behind the guarantee, branching on node type and insert purpose; descent passes known_to_be_found = false to reserve the full new-key cost.
btree_node_split_info {pivot, index} is a CMA accumulator; btree_split_find_pivot turns it into a target byte size (balanced in [0.20,0.80], clamped to [0.05,0.95]), btree_split_next_pivot updates it from each separator’s landing slot.
btree_find_split_point is size-balanced: it reserves space for the new entry and fences on both sides, keeps ≥1 non-fence record per side, and returns a prefix-truncated separator.
btree_split_node undo-logs Q whole (RVBT_COPYPAGE), moves the upper half to R, writes fences, redo-logs both, promotes the separator into P (RVBT_NDRECORD_INS), and relinks the leaf chain (§6.7).
btree_split_root grows height in place: two children, a dummy neg-inf separator in slot 1 plus the real one in slot 2, node_level bump, record-set redo.
btree_split_node_and_advance has eight live exit paths across nine branch entries (the error: label is the shared abort sink): two latch-promotion root-restarts, max-key-len propagation, two split commits, the assert_release(false) impossible path, and the no-split read-latch fall-through — all in one system-op bracket whose error: label unfixes new pages before log_sysop_abort.

Chapter 7: Range Scan and LSA Aware Resume

This chapter answers the read-side question: once data is on disk, how does a range scan walk leaves and overflow pages key-by-key, and resume safely after releasing every latch across a client roundtrip? The answer is one generic driver, btree_range_scan, parameterized by a key-processing callback (BTREE_RANGE_SCAN_PROCESS_KEY_FUNC). The driver owns positioning, latching, advancement, and page-LSA-aware resume; the callback owns per-key work (visible OIDs for SELECT, any FK match for FK checks). All persistent cursor state lives in one BTREE_SCAN (BTS), so a scan drops all latches between iterations and resumes where it left off. For latch-coupling descent see cubrid-btree.md §“Descent — latch-coupling on read”.

7.1 The cursor structs — `btree_scan` and `btree_keyrange`

btree_keyrange (src/storage/btree.h) is the “what to scan” descriptor; btree_scan (BTS) is the large “where am I now” cursor. Fields tagged /* TO BE REMOVED */ in the header are legacy-only and marked legacy.

`btree_keyrange` field	Role	Why it exists
`range`	`RANGE` enum, each endpoint open/closed/infinite	Captures all bound combos; positioning branches on it alone
`lower_key`	Lower-bound `DB_VALUE`, or `NULL`	`NULL` = no lower limit; `btree_range_scan_start` branches on this
`upper_key`	Upper-bound `DB_VALUE`, or `NULL`	`NULL` = no upper limit; range test never trips on top
`num_index_term`	Count of leading index columns bound	`MIDXKEY` can range on a prefix; tells comparator how many columns matter

The table below is the authority for every btree_scan member:

Field	Role	Why it exists
`btid_int`	Resolved tree id + key domain	Avoids re-reading the root header per key (Ch 1)
`read_uncommitted`	Legacy dirty-read flag	legacy — dies with `btree_prepare_next_search`
`P_vpid`/`P_page`	Previous leaf vpid/ptr	legacy — dies with `btree_find_next_index_record`
`C_vpid`	Current leaf VPID	Persists across unfix — anchor resume re-fixes
`O_vpid`	Current overflow VPID	Park point when a key’s OIDs overflow one buffer (mid-key resume)
`C_page`	Current leaf ptr	`NULL` when no latch held; set only while walking
`slot_id`	Current slot in `C_page`	Re-validated after a latch re-acquire
`oid_pos`	Legacy OID cursor	legacy — superseded by callback OID walk
`cur_key`	Key the cursor is parked on	Resume anchor when `slot_id` is untrustworthy
`clear_cur_key`	Ownership flag for `cur_key`	Tells `btree_scan_clear_key` whether to free
`cur_common_prefix_page_vpid`/`_lsa`	VPID+LSA of page whose common prefix is cached (debug builds only)	Present only under `CHECK_VERIFY_COMMON_PREFIX_PAGE_INFO`; asserts the cached prefix still belongs to the current page
`common_prefix_size`	Length of page’s common prefix	`COMMON_PREFIX_UNKNOWN` until computed
`common_prefix_key`	Shared prefix bytes for the page	Combined with a compressed `cur_key` to rebuild the full key
`clear_common_prefix_key`	Ownership flag for prefix	Frees the prefix copy when page changes
`is_cur_key_compressed`	`cur_key` is prefix-stripped	`true` → `btree_check_decompress_key` re-attaches prefix before unfix
`attid_idxs`	attr-id → column-position map	Filters reference base columns by attr-id
`is_cur_key_copied`	`cur_key` owns a heap copy	Copy survives `C_page` unfix; set in §7.8
`key_range`/`key_filter`	What to scan / per-key predicate	Range positioning; filter rejects in-range failures
`key_filter_storage`	Backing `FILTER_INFO`	Owns the struct so `key_filter` is stable across iterations
`use_desc_index`	Descending scan flag	Selects `prev_vpid` + conditional prev-leaf fix path
`restart_scan`	Legacy restart counter	legacy
`read_keys`/`qualified_keys`	Query-trace counters	Bumped in §7.7; feed EXPLAIN/trace stats
`key_range_max_value_equal`	Upper bound matched exactly	Lets next advance end the scan without reading another key
`cur_leaf_lsa`	`C_page` LSA at last unfix	Decides reuse vs. re-locate on resume
`lock_mode`	`S_LOCK` or `X_LOCK`	From the scan’s isolation needs
`key_record`	Peeked `RECDES` for the key	Avoids re-reading the slot in OID-walking callbacks
`need_to_check_null`	Range can match SQL `NULL`	Toggles the NULL branch in range qualification
`leaf_rec_info`	Decoded `LEAF_REC`	`.ovfl` is the head of the key’s OID overflow chain
`node_type`	Always `BTREE_LEAF_NODE`	Asserts cursor never parks on a non-leaf
`offset`	Byte offset to first OID	Where OID processing starts in `key_record`
`key_status`	NOT_VERIFIED/VERIFIED/CONSUMED	Per-key state machine (§7.2)
`end_scan`	Whole range exhausted	Driver terminates; `C_vpid` reset, key cleared
`end_one_iteration`	OID buffer full	Return partial results, keep `C_vpid`/`cur_leaf_lsa`
`is_interrupted`	Callback returned mid-key	Breaks inner loop without advancing
`is_key_partially_processed`	Key’s OIDs spilled across iterations	Resume re-enters callback at `O_vpid`, not the leaf
`n_oids_read`/`..._last_iteration`	Cumulative / per-iter counts	Per-iter is `btree_keyval_search`’s return; debits `key_limit_upper`
`oid_ptr`	Write head into caller’s buffer	Reset to `oid_list->oidp` per iteration
`match_class_oid`	Class filter for hierarchical uniques	Skips OIDs from other classes in a partition hierarchy
`key_limit_lower`/`_upper`	LIMIT/OFFSET push-down	`_upper` decremented by per-iter count at exit
`index_scan_idp`	Owning index scan	Source of OID buffer, snapshot, opt mode (ISS/MRO/covering)
`is_btid_int_valid`	`btid_int` resolved	Guards re-resolution on first use
`is_scan_started`	Start vs. resume selector	`false` until start succeeds; then resume is used
`force_restart_from_root`	Desc fast path failed	Forces full re-locate on next resume
`is_fk_remake`	Deduplicate-mode FK retry	Retries FK existence with rebuilt key
`time_track`	Timing	Feeds `PSTAT_BT_*`
`bts_other`	Per-purpose extra state	FK: points at the `BTREE_FIND_FK_OBJECT` result

7.2 The `key_status` state machine

BTS_KEY_STATUS (bts_key_status, src/storage/btree.h) is three-valued: BTS_KEY_IS_NOT_VERIFIED (slot reached, range/filter not yet checked), BTS_KEY_IS_VERIFIED (passed range AND filter, ready to process), BTS_KEY_IS_CONSUMED (fully processed, advance to next slot).

stateDiagram-v2
  [*] --> NOT_VERIFIED : positioned on a slot
  NOT_VERIFIED --> VERIFIED : range ok and filter ok
  NOT_VERIFIED --> NOT_VERIFIED : fence or filtered, advance slot
  VERIFIED --> CONSUMED : key_func read all OIDs
  CONSUMED --> NOT_VERIFIED : advance to next slot
  VERIFIED --> VERIFIED : resume re-reads same record
  CONSUMED --> [*] : range exhausted, end_scan

Figure 7-1 — key_status transitions. advance_over_filtered_keys drives NOT_VERIFIED to VERIFIED; the callback drives VERIFIED to CONSUMED.

Key invariant: the driver only calls key_func when key_status == BTS_KEY_IS_VERIFIED. Asserted in btree_range_scan right before the callback. Every path to the callback first routes through btree_range_scan_advance_over_filtered_keys, which sets VERIFIED only after range AND filter pass. Violate it and a callback could emit OIDs for a fence or out-of-range key — rows that do not belong to the result.

7.3 The single-key wrapper — `btree_keyval_search` and `btree_scan_update_range`

A point lookup (WHERE k = 5) is a degenerate range:

// btree_keyval_search — src/storage/btree.c
rc = btree_prepare_bts (thread_p, bts, btid, isidp, kv_range, filter, ...);
rc = btree_range_scan (thread_p, bts, btree_range_scan_select_visible_oids);
return bts->n_oids_read_last_iteration;   /* <- OID count, iterative */

It is “just a GE_LE range search with the same key.” Returning n_oids_read_last_iteration lets a caller loop while !BTREE_END_OF_SCAN(bts) to drain a key whose OIDs span buffer-fills. btree_scan_update_range installs a new range into an existing BTS (multiple-range optimization): it validates range, nulls SQL-NULL bounds, and swaps bounds for descending scans:

// btree_scan_update_range — src/storage/btree.c
if ((bts->use_desc_index && !BTREE_IS_PART_KEY_DESC (&bts->btid_int))
    || (!bts->use_desc_index && BTREE_IS_PART_KEY_DESC (&bts->btid_int)))
  {                                       /* XOR: scan-desc != column-desc  */
    range_reverse (bts->key_range.range); /* GE_LE becomes LE_GE etc.       */
    swap_key = bts->key_range.lower_key;  /* <- swap bounds                 */
    bts->key_range.lower_key = bts->key_range.upper_key;
    bts->key_range.upper_key = swap_key;
  }

The swap fires only when exactly one of “scan descending” / “column stored descending” holds; if both hold the physical walk is already forward.

7.4 Starting a scan — `btree_range_scan_start`

Positions on the first eligible key; called only while is_scan_started == false. The /* (X) */ markers are the branch map.

// btree_range_scan_start — src/storage/btree.c
bts->key_status = BTS_KEY_IS_NOT_VERIFIED;
if (bts->key_range.lower_key == NULL)
  {                                  /* (A) no lower bound: jump to lowest leaf */
    btree_find_lower_bound_leaf (thread_p, bts, NULL);
    if (bts->end_scan) { return NO_ERROR; }   /* empty index */
  }
else
  {                                  /* (B) lower bound: descend to leaf */
    btree_locate_key (..., bts->key_range.lower_key, ..., &found);
    if (!found)
      { if (bts->use_desc_index) { bts->slot_id--; } }  /* (B1) desc: first-smaller */
    else if (bts->key_range.range == GT_LT || bts->key_range.range == GT_LE
             || bts->key_range.range == GT_INF)
      { bts->key_status = BTS_KEY_IS_CONSUMED; }        /* (B2) strict GT_*: skip exact */
  }
btree_range_scan_advance_over_filtered_keys (thread_p, bts);
if (bts->force_restart_from_root)
  {                                  /* (C) desc couldn't position: restart */
    assert (bts->use_desc_index);
    bts->key_status = BTS_KEY_IS_NOT_VERIFIED;
    return NO_ERROR;                 /* is_scan_started stays false -> driver loops */
  }
bts->is_scan_started = true;         /* (D) success */
return NO_ERROR;

All four outcomes funnel into advance_over_filtered_keys. Only (C) leaves is_scan_started false — a descending start that hit the prev-leaf fix path without progress, which the driver retries.

7.5 The generic driver — `btree_range_scan`

The outer loop runs until end_scan (range done) or end_one_iteration (buffer full, return to client); the inner loop drains keys, calling the callback for each VERIFIED key.

flowchart TB
  START["reset end_scan/end_one_iteration\nn_oids_read_last_iteration=0\noid_ptr=oid_list.oidp"] --> OUTER{"!end_scan and\n!end_one_iteration?"}
  OUTER -->|no| EXIT["save cur_leaf_lsa, unfix pages\ndebit key_limit_upper"]
  OUTER -->|yes| POS{"is_scan_started?"}
  POS -->|no| STARTF["btree_range_scan_start"]
  POS -->|yes| RESUME["btree_range_scan_resume"]
  STARTF --> CHK
  RESUME --> CHK{"end_scan?"}
  CHK -->|yes| EXIT
  CHK -->|no| FRR{"force_restart_from_root?"}
  FRR -->|yes| UNFIXC["unfix C_page, continue outer"]
  UNFIXC --> OUTER
  FRR -->|no| INNER["INNER: key_func(bts)"]
  INNER --> ICHK{"interrupted /\nend_one_iteration /\nend_scan?"}
  ICHK -->|yes| EXIT
  ICHK -->|no| ADV["advance_over_filtered_keys"]
  ADV --> ACHK{"end_scan?"}
  ACHK -->|yes| EXIT
  ACHK -->|no| AFRR{"force_restart_from_root?"}
  AFRR -->|yes| UNFIXC
  AFRR -->|no| INNER

Figure 7-2 — btree_range_scan control flow. Outer loop chooses start vs. resume; inner loop calls key_func then advances. Both force_restart_from_root arms unfix C_page and re-enter the outer loop.

// btree_range_scan — src/storage/btree.c
while (true)
  {
    assert (bts->key_status == BTS_KEY_IS_VERIFIED);  /* invariant 7.2 */
    error_code = key_func (thread_p, bts);            /* <- the callback */
    if (bts->is_interrupted || bts->end_one_iteration || bts->end_scan)
      { break; }
    assert (bts->key_status == BTS_KEY_IS_CONSUMED);
    btree_range_scan_advance_over_filtered_keys (thread_p, bts);
    if (bts->force_restart_from_root)
      { /* unfix C_page, decompress cur_key */ break; }
  }

At the end: label the driver always, in order: captures cur_leaf_lsa = pgbuf_get_lsa(C_page) before unfixing; decompresses cur_key if end_one_iteration; unfixes C_page/P_page; on end_scan nulls C_vpid and clears is_scan_started/the key; debits key_limit_upper by n_oids_read_last_iteration.

Key invariant: cur_leaf_lsa is captured at the same instant C_page is unfixed. Unfix without first copying the LSA and a later resume would compare a stale LSA, conclude “unchanged,” and reuse a slot_id now pointing at a different key. The end: block guarantees copy-then-unfix; the no-page branch nulls cur_leaf_lsa so resume must re-locate from root.

7.6 Resuming — `btree_range_scan_resume` and the LSA decision

Re-establish position without holding a latch the whole time. The decision hinges on cur_leaf_lsa; /* (X) */ is the branch map.

// btree_range_scan_resume — src/storage/btree.c
if (!bts->force_restart_from_root)
  {
    pgbuf_fix_if_not_deallocated (..., &bts->C_vpid, ..., &bts->C_page);
    if (bts->C_page != NULL)
      {
        if (LSA_EQ (&bts->cur_leaf_lsa, pgbuf_get_lsa (bts->C_page)))
          { return btree_range_scan_advance_over_filtered_keys (...); }  /* (1) UNCHANGED */
        if (BTREE_IS_PAGE_VALID_LEAF (thread_p, bts->C_page))
          {                                  /* (2) still a leaf but changed */
            btree_leaf_is_key_between_min_max (..., &bts->cur_key, &search_key);
            if (search_key.result == BTREE_KEY_BETWEEN)
              { btree_search_leaf_page (..., &bts->cur_key, &search_key); }
            switch (search_key.result)
              {
              case BTREE_KEY_FOUND:          /* (2a) key still here: reset slot */
                bts->slot_id = search_key.slotid;
                return btree_range_scan_advance_over_filtered_keys (...);
              case BTREE_KEY_BETWEEN:        /* (2b) key deleted, neighbor here */
                bts->slot_id = bts->use_desc_index ? search_key.slotid - 1 : search_key.slotid;
                bts->key_status = BTS_KEY_IS_NOT_VERIFIED;
                return btree_range_scan_advance_over_filtered_keys (...);
              case BTREE_KEY_SMALLER: case BTREE_KEY_BIGGER: case BTREE_KEY_NOTFOUND:
                break;                        /* (2c) migrated off page: re-locate */
              }
          }
        pgbuf_unfix_and_init (thread_p, bts->C_page);   /* (3) reused as non-leaf */
      }
    /* (4) C_page deallocated */
  }
bts->force_restart_from_root = false;
btree_locate_key (..., &bts->cur_key, ..., &found);
if (found) { return btree_range_scan_advance_over_filtered_keys (...); }
if (btree_node_number_of_keys (thread_p, bts->C_page) < 1)
  { bts->end_scan = true; return NO_ERROR; }     /* (5) index emptied under us */
if (bts->use_desc_index) { bts->slot_id--; }
bts->key_status = BTS_KEY_IS_NOT_VERIFIED;
return btree_range_scan_advance_over_filtered_keys (...);

(1) fast path — leaf byte-for-byte unchanged, slot_id/key_status hold. (2a)/(2b) re-derive the slot by searching for cur_key on a changed-but-still-leaf page. (2c)/(3)/(4)/(5) re-descend from root via btree_locate_key, then advance, end, or position on the insertion point.

Key invariant: resume re-anchors on cur_key, never on slot_id alone. slot_id is trusted only in branch (1), where the LSA proved no change; elsewhere the slot is re-derived by searching for cur_key. A stale slot would skip or duplicate keys whenever a concurrent split/merge shifted slot numbers.

7.7 Advancing over fence and filtered keys — `btree_range_scan_advance_over_filtered_keys`

Turns “positioned somewhere” into “positioned on the next VERIFIED key, or end_scan.” Three phases.

Resumed-VERIFIED early return — if already VERIFIED (page reused, re-fixed without re-checking), re-peek and return; otherwise set direction and step a CONSUMED slot:

// btree_range_scan_advance_over_filtered_keys — src/storage/btree.c
if (bts->key_status == BTS_KEY_IS_VERIFIED)
  {
    spage_get_record (..., bts->slot_id, &bts->key_record, PEEK);
    btree_range_scan_read_record (thread_p, bts);  /* page may have moved bytes */
    return NO_ERROR;                               /* keep current key */
  }
if (bts->key_range_max_value_equal)
  { bts->end_scan = true; return NO_ERROR; }       /* exact upper bound returned */
inc_slot = bts->use_desc_index ? -1 : 1;
if (bts->key_status == BTS_KEY_IS_CONSUMED) { bts->slot_id += inc_slot; }
next_vpid = bts->use_desc_index ? node_header->prev_vpid : node_header->next_vpid;

Page-crossing loop — when slot_id runs off the page, fix the next leaf; VPID_ISNULL ends the scan. Ascending latch-couples directly, descending defers to §7.8:

while (bts->slot_id <= 0 || bts->slot_id > key_count || key_count == 0)
  {
    if (VPID_ISNULL (&next_vpid)) { bts->end_scan = true; return NO_ERROR; }  /* consumed */
    if (bts->use_desc_index)
      {
        btree_range_scan_descending_fix_prev_leaf (..., &key_count, &node_header, &next_vpid);
        if (bts->force_restart_from_root) { return NO_ERROR; }
      }
    else
      {
        next_node_page = pgbuf_fix (..., &next_vpid, ...);
        pgbuf_unfix (thread_p, bts->C_page);   /* latch-couple to next */
        bts->C_page = next_node_page; VPID_COPY (&bts->C_vpid, &next_vpid);
        bts->slot_id = 1; next_vpid = node_header->next_vpid;
      }
  }

Per-key qualification — peek; skip fences; else read, count, apply range + filter:

spage_get_record (..., bts->slot_id, &bts->key_record, PEEK);
if (!btree_leaf_is_flaged (&bts->key_record, BTREE_LEAF_RECORD_FENCE))
  {
    btree_range_scan_read_record (thread_p, bts);
    bts->read_keys++;
    btree_apply_key_range_and_filter (..., &is_range_satisfied, &is_filter_satisfied);
    if (!is_range_satisfied) { bts->end_scan = true; return NO_ERROR; }  /* past range */
    if (is_filter_satisfied)
      {
        bts->qualified_keys++;
        bts->key_status = BTS_KEY_IS_VERIFIED;   /* <- only success exit */
        return NO_ERROR;
      }
    /* filter failed: fall through */
  }
bts->slot_id += inc_slot;                        /* fence/filter-fail: try next slot */

Fence keys are page-edge sentinels (see cubrid-btree.md §“Node layout”), skipped silently. A failed range test ends the scan; a failed filter advances and loops. Descending page-crossing may set force_restart_from_root via §7.8.

7.8 Reading the record and the descending prev-leaf fix

btree_range_scan_read_record decodes the peeked slot into cur_key: variable-length text/bit/midxkey use PEEK_KEY_VALUE (decompression may still allocate, tracked by is_cur_key_copied); fixed types use COPY_KEY_VALUE:

// btree_range_scan_read_record — src/storage/btree.c
switch (TP_DOMAIN_TYPE (bts->btid_int.key_type))
  {
  case DB_TYPE_MIDXKEY: case DB_TYPE_VARCHAR: case DB_TYPE_CHAR:
  case DB_TYPE_BIT:     case DB_TYPE_VARBIT:
    {
      int ret = btree_read_record_in_leafpage (thread_p, bts->C_page, PEEK_KEY_VALUE, bts);
      bts->is_cur_key_copied = bts->cur_key.need_clear;  /* compressed -> copied */
      return ret;
    }
  default: break;
  }
bts->is_cur_key_copied = true;
return btree_read_record_in_leafpage (thread_p, bts->C_page, COPY_KEY_VALUE, bts);

A copied/decompressed key survives the C_page unfix — which is why the driver runs btree_check_decompress_key before unfixing on end_one_iteration: resume must search the page for a complete cur_key.

btree_range_scan_descending_fix_prev_leaf exists because descending scans walk against the forward sibling links and would deadlock with forward scans under unconditional latches. It tries a conditional latch on the previous leaf, dropping the current latch and re-acquiring in safe order only on failure:

// btree_range_scan_descending_fix_prev_leaf — src/storage/btree.c
prev_leaf = pgbuf_fix (..., &prev_leaf_vpid, ..., PGBUF_CONDITIONAL_LATCH);
if (prev_leaf != NULL)
  {                                  /* (A) conditional latch won: common case */
    pgbuf_unfix_and_init (thread_p, bts->C_page);
    bts->C_page = prev_leaf;
    bts->slot_id = *key_count = btree_node_number_of_keys (...);  /* last key of prev */
    VPID_COPY (next_vpid, &(*node_header_ptr)->prev_vpid);
    return NO_ERROR;
  }
/* (B) conditional failed: drop current, re-fix prev unconditionally */
btree_check_decompress_key (bts);    /* cur_key must survive the unfix */
pgbuf_unfix_and_init (thread_p, bts->C_page);
pgbuf_fix_if_not_deallocated (..., &prev_leaf_vpid, ..., &prev_leaf);
if (prev_leaf == NULL)
  { bts->force_restart_from_root = true; return NO_ERROR; }       /* (B1) freed */
if (!BTREE_IS_PAGE_VALID_LEAF (thread_p, prev_leaf))
  { bts->force_restart_from_root = true; ...; return NO_ERROR; }  /* (B2) reused non-leaf */
*node_header_ptr = btree_get_node_header (thread_p, prev_leaf);
if (!VPID_EQ (&(*node_header_ptr)->next_vpid, &bts->C_vpid))
  { bts->force_restart_from_root = true; ...; return NO_ERROR; }  /* (B3) relinked */

(A) wins without waiting, positioning on the previous page’s last key. (B) drops the current latch (after decompressing cur_key) and re-fixes unconditionally; three latchless hazards each force a root restart — (B1) freed, (B2) reused as non-leaf, (B3) leaves no longer linked.

Key invariant: descending advancement never holds two leaf latches in the forward link direction. It grabs the previous page conditionally or releases the current page before waiting — the rule that keeps a descending scan from deadlocking against ascending scans and writers latching in next_vpid order. When it cannot make progress safely it sets force_restart_from_root and re-descends from root: slow but safe.

7.9 The SELECT callback — `btree_range_scan_select_visible_oids`

The BTREE_RANGE_SCAN_PROCESS_KEY_FUNC for SELECTs: for the current VERIFIED key, walk its OID list (leaf inline OIDs, then the overflow chain) and copy the MVCC-visible ones into the caller’s buffer. Because a key’s OID list can exceed the buffer, it carries partial-key resume via O_vpid and is_key_partially_processed.

flowchart TB
  ENTER["key_func entry (VERIFIED key)"] --> ISS{"ISS distinct-key op?"}
  ISS -->|yes| EOUT["set key, n=1, end_scan"]
  ISS -->|no| PART{"is_key_partially_processed?"}
  PART -->|yes| RESOVF["re-fix O_vpid\nget next overflow vpid"]
  PART -->|no| SOFT{"soft cap exceeded\nby leaf+1ovf?"}
  SOFT -->|yes| ONEIT["end_one_iteration"]
  SOFT -->|no| LEAF["process leaf OIDs"]
  LEAF --> OVF["overflow loop"]
  RESOVF --> OVF
  OVF --> HARD{"hard cap exceeded\nby this ovf page?"}
  HARD -->|yes| SAVE["O_vpid=last_visible\npartial=true, end_one_iteration"]
  HARD -->|no| PROC["process page OIDs"]
  PROC --> MORE{"more overflow pages?"}
  MORE -->|yes| OVF
  MORE -->|no| DONE["O_vpid=NULL\nkey_status=CONSUMED"]

Figure 7-3 — btree_range_scan_select_visible_oids. The soft cap stops before a key when the buffer is nearly full (end_one_iteration); the hard cap suspends inside a key’s overflow chain, saving O_vpid for partial-key resume.

Leaf and overflow OIDs are processed by btree_record_process_objects with btree_select_visible_object_for_range_scan, which builds an MVCC header, checks the snapshot, applies match_class_oid and key-limit filters, and copies visible OIDs via BTS_SAVE_OID_IN_BUFFER.

7.10 The FK callback — `btree_range_scan_find_fk_any_object`

An FK check only needs to know whether any referenced row exists, so the FK callback stops at the first qualifying object, using bts->bts_other (a BTREE_FIND_FK_OBJECT) as the result holder:

// btree_range_scan_find_fk_any_object — src/storage/btree.c
fk_arg.bts = bts; fk_arg.ovfl_page = NULL;
btree_record_process_objects (..., BTREE_LEAF_NODE, ..., &stop,
                              btree_fk_object_does_exist, &fk_arg);  /* leaf */
if (stop == true) { return NO_ERROR; }     /* found -> done, no overflow */
VPID_COPY (&ovf_vpid, &bts->leaf_rec_info.ovfl);
while (!VPID_ISNULL (&ovf_vpid))
  {                                         /* hand-over-hand chain walk */
    fk_arg.ovfl_page = pgbuf_fix (..., &ovf_vpid, ...);
    btree_record_process_objects (..., BTREE_OVERFLOW_NODE, ..., btree_fk_object_does_exist, &fk_arg);
    if (stop) { break; }                    /* found -> done */
    btree_get_next_overflow_vpid (thread_p, fk_arg.ovfl_page, &ovf_vpid);
  }
if (bts->end_scan) { return NO_ERROR; }
if (!bts->is_interrupted)
  {                                         /* fully scanned, none found */
    assert (OID_ISNULL (&((BTREE_FIND_FK_OBJECT *) bts->bts_other)->found_oid));
    if (bts->is_fk_remake) { /* deduplicate-mode retry with rebuilt key */ }
  }

A leaf match sets stop; the callback returns without touching the chain. Otherwise it walks the chain (same hand-over-hand fix) and stops on the first match. Fully scanned, none found, not interrupted → found_oid NULL; under deduplicate-key mode (is_fk_remake) it retries with the rebuilt key. The FK callback never sets end_one_iteration — an FK check fits one key’s work and needs no partial-key resume.

7.11 Chapter summary — key takeaways

One generic driver, two callbacks. btree_range_scan owns positioning, latching, advancement, and resume; btree_range_scan_select_visible_oids does per-key SELECT work and btree_range_scan_find_fk_any_object does FK existence. btree_keyval_search is just a single-key range over the SELECT callback.
cur_leaf_lsa is the resume contract. Captured at the exact instant C_page is unfixed; on resume, LSA_EQ against the re-fixed page is the fast path, and any mismatch forces re-finding cur_key or a root re-locate.
key_status is a strict three-state machine. VERIFIED is asserted before every callback and CONSUMED after; only advance_over_filtered_keys promotes NOT_VERIFIED to VERIFIED, after range AND filter pass.
Resume re-anchors on cur_key, not slot_id. Slot numbers are trusted only when the LSA proves the page unchanged; elsewhere the slot is re-derived by searching for the key, so concurrent splits/merges never make the scan skip or duplicate rows.
Descending scans avoid deadlock by conditional prev-leaf latching. btree_range_scan_descending_fix_prev_leaf tries a conditional latch and, failing, releases its current latch before re-acquiring in safe order; a changed leaf chain forces force_restart_from_root.
Two capacity limits split a large key across iterations. The soft limit stops before a key when the buffer is nearly full (end_one_iteration); the hard limit suspends inside a key’s overflow chain, saving O_vpid and is_key_partially_processed so the next iteration resumes mid-key — streaming millions of OIDs without holding a latch across roundtrips.
Fence keys are invisible to scans. advance_over_filtered_keys skips any BTREE_LEAF_RECORD_FENCE slot, so the page-edge sentinels from splits (Ch 6) never leak into a result set.

Chapter 8: Deleting an Object Logical and Physical

How does CUBRID take a single OID out of a key? A logical delete only stamps a delete MVCCID so older readers still see the object; a physical delete splices the object bytes out, and when the last OID leaves it removes the whole key. One orchestrator, btree_delete_internal, drives both plus the undo (re-insert) and vacuum paths, dispatching on BTREE_OP_PURPOSE. For the record layout (key‖OID, unique first-object rule, overflow chain) see cubrid-btree.md § “CUBRID’s Approach”.

8.1 One orchestrator, seven purposes

btree_delete_internal builds a BTREE_DELETE_HELPER, switch (purpose) picks the per-key worker, and hands it to btree_search_key_and_apply_functions (the shared descent machine from insert, Chapter 3).

// btree_delete_internal -- src/storage/btree.c
switch (purpose)
  {
  case BTREE_OP_DELETE_OBJECT_PHYSICAL_POSTPONED:
  case BTREE_OP_DELETE_UNDO_INSERT:
    LSA_COPY (&delete_helper.reference_lsa, ref_lsa);  /* <- CLR anchor */
    [[fallthrough]];
  case BTREE_OP_DELETE_OBJECT_PHYSICAL:
  case BTREE_OP_DELETE_VACUUM_OBJECT:
    key_func = btree_key_delete_remove_object; break;  /* <- physical path */
  case BTREE_OP_DELETE_UNDO_INSERT_UNQ_MULTIUPD:
    LSA_COPY (&delete_helper.reference_lsa, ref_lsa);
    key_func = btree_key_remove_object_and_keep_visible_first; break;
  case BTREE_OP_DELETE_VACUUM_INSID:
    key_func = btree_key_remove_insert_mvccid; break;  /* <- vacuum insid */
  case BTREE_OP_DELETE_UNDO_INSERT_DELID:
    LSA_COPY (&delete_helper.reference_lsa, ref_lsa);
    key_func = btree_key_remove_delete_mvccid; break;  /* <- logical undo */
  default:
    assert_release (false); return ER_FAILED;
  }

The seven cases collapse to four leaf workers:

`key_func`	Purposes	Effect
`btree_key_delete_remove_object`	`PHYSICAL`, `PHYSICAL_POSTPONED`, `UNDO_INSERT`, `VACUUM_OBJECT`	splice OID out, remove key if last; `PHYSICAL` logs full undo, rest are CLR/vacuum
`btree_key_remove_object_and_keep_visible_first`	`UNDO_INSERT_UNQ_MULTIUPD`	splice + restore visible first object
`btree_key_remove_delete_mvccid`	`UNDO_INSERT_DELID`	logical: clear delete MVCCID
`btree_key_remove_insert_mvccid`	`VACUUM_INSID`	logical: clear insert MVCCID

The post-descent tail: free printed_key; on error log and return; else bump PSTAT_BT_NUM_DELETES, publish *unique, and run the subtle revert branch:

// btree_delete_internal -- src/storage/btree.c
if (delete_helper.check_key_deleted && !delete_helper.is_key_deleted)
  { /* <- key still has a visible object: undo the optimistic decrement */
    delete_helper.unique_stats_info->insert_key_and_row ();  /* <- revert key */
    delete_helper.unique_stats_info->delete_row ();          /* <- keep row delta */
  }

flowchart TB
  A["btree_delete_internal\nswitch(purpose)"] --> B["btree_search_key_and_apply_functions"]
  B --> C["btree_fix_root_for_delete"]
  C --> D["btree_merge_node_and_advance\ndescend, merge under-full -- Ch 9"]
  D --> E{"key_func"}
  E -->|physical| F["btree_key_delete_remove_object"]
  E -->|undo delid| G["btree_key_remove_delete_mvccid"]
  E -->|vacuum insid| H["btree_key_remove_insert_mvccid"]
  E -->|unq multiupd| I["btree_key_remove_object_and_keep_visible_first"]

Figure 8-1 — Orchestration. The purpose chooses the leaf worker; the descent machinery is shared with insert.

8.2 Reaching the leaf

btree_fix_root_for_delete runs once per root fix. First search loads B-tree info (btree_fix_root_with_info); a restart re-fixes and returns early. It unpacks the key, sets the domain, computes is_null. Several purposes stop here: both vacuum purposes return NO_ERROR before stat work; undo/postpone purposes return after patching a NULL class OID to topclass_oid (omitted from the undo image when it equals topclass). Only PHYSICAL falls through to unique stats, where MULTI_ROW_UPDATE arms the check_key_deleted/is_key_deleted pair (§8.1). A NULL key sets *stop.

btree_merge_node_and_advance is the advance callback (merge logic is Ch 9). The leaf-reaching branch: at node_level == 1 it calls btree_search_leaf_page, then promotes a root read latch via pgbuf_promote_read_latch. On ER_PAGE_LATCH_PROMOTE_FAIL it sets *restart and flips nonleaf_latch_mode to PGBUF_LATCH_WRITE so the retry descends write-latched; else sets *is_leaf, leaving it write-latched for key_func.

Invariant — the leaf is write-latched before any mutation. Every key_func asserts pgbuf_get_latch_mode (*leaf_page) >= PGBUF_LATCH_WRITE; the promote-or-restart branch guarantees it. A read-latched worker would corrupt the page against a concurrent writer.

8.3 Physical delete dispatch — `btree_key_delete_remove_object`

Finds the object, prepares logging, removes it, recounts:

Key found? result == BTREE_KEY_FOUND → copy record, btree_read_record, btree_find_oid_and_its_page locates the OID (possibly overflow) into found_page/prev_found_page/offset_to_object; else the offset is NOT_FOUND.
Object not found: for VACUUM_OBJECT benign (already vacuumed) — warn and goto exit; else assert_release(false) + ER_BTREE_UNKNOWN_KEY.
Found: a SERVER_MODE assert checks the lock is held (vacuum/recovery/ own-insert-into-non-unique are lock-free); a second enforces the unique first-object rule (§8.7).
Undo logging: only PHYSICAL saves a key-value image via btree_rv_save_keyval_for_undo into rv_keyval_data (rollback re-inserts the OID); vacuum/undo-of-insert are compensation, so = NULL.
Remove: node_type = (*leaf_page == found_page) ? LEAF : OVERFLOW, then btree_key_remove_object (§8.4); unfix other pages.
Recount: for MULTI_ROW_UPDATE, re-read the leaf, btree_get_num_visible_from_leaf_and_ovf (dirty snapshot); if num_visible_oids > 0 set is_key_deleted = false (§8.1 fixes the stat).
exit: idempotent unfix of both pages, free rv_keyval_data.

8.4 `btree_key_remove_object` — leaf vs. overflow fork

A two-line dispatcher (Figure 8-2): leaf → btree_leaf_remove_object, overflow → btree_overflow_remove_object. A leaf record must never be left empty (unique first-object rule); an overflow page may shrink to nothing.

flowchart TB
  K["btree_key_remove_object\nnode_type?"] --> L{"BTREE_LEAF_NODE?"}
  L -->|yes| LO["btree_leaf_remove_object"]
  L -->|no| OO["btree_overflow_remove_object"]
  LO --> LO0{"offset_to_object == 0?"}
  LO0 -->|no| RR["btree_record_remove_object\nsplice non-first"]
  LO0 -->|yes, only object, no ovfl| DK["btree_delete_key_from_leaf\nremove whole key"]
  LO0 -->|yes, only object, has ovfl| RP["btree_replace_first_oid_with_ovfl_oid"]
  LO0 -->|yes, more leaf objects| RL["btree_leaf_record_replace_first_with_last"]
  OO --> OO1{"overflow has 1 object?"}
  OO1 -->|yes| DEA["dealloc page + relink"]
  OO1 -->|no| RR2["btree_record_remove_object"]

Figure 8-2 — Every physical-removal branch.

8.5 `btree_leaf_remove_object` — the four leaf branches

This enforces the unique first-object invariant; removing the key clears check_key_deleted. The four branches:

// btree_leaf_remove_object -- src/storage/btree.c
if (offset_to_object == 0)
  {
    btree_record_get_last_object (..., &last_oid, ..., &offset_to_last_object);
    if (offset_to_last_object == 0)               /* <- only object */
      {
        if (VPID_ISNULL (&leaf_rec_info->ovfl))
          btree_delete_key_from_leaf (...);        /* <- remove the key (§8.8) */
        else
          btree_replace_first_oid_with_ovfl_oid (...); /* <- pull from overflow (§8.6) */
      }
    else
      btree_leaf_record_replace_first_with_last (...); /* <- first := last */
  }
else
  btree_record_remove_object (...);                /* <- splice non-first */

Invariant — a unique leaf record’s first object is the visible one, and the record is never empty while OIDs remain. The first slot has a distinct encoding (may carry the class OID and fixed-size MVCC info), so removing it always means replacement, never a raw splice. Violating it leaves a scan reading a malformed first object or a key with a live overflow link but no OID.

8.6 Pulling an OID up — `btree_replace_first_oid_with_ovfl_oid`

When the leaf’s only object is removed but overflow OIDs remain, an overflow object becomes the new leaf first object via a system-op sandwich: btree_leaf_change_first_object swaps the overflow’s last object into the leaf first slot, btree_overflow_record_replace_object moves the object-to-delete into the vacated overflow slot, commit, then remove it via btree_overflow_remove_object (§8.7). exit_on_error aborts the system op; pure-undo purposes must not fail here.

8.7 Overflow removal — `btree_overflow_remove_object` and friends

Reads the overflow record (slot 1) and forks on whether this object is alone:

// btree_overflow_remove_object -- src/storage/btree.c
if (overflow_record.length == BTREE_OBJECT_FIXED_SIZE (btid_int))
  {
    assert (offset_to_object == 0);                  /* <- only first removable */
    btree_get_next_overflow_vpid (thread_p, *overflow_page, &next_overflow_vpid);
    pgbuf_unfix_and_init (thread_p, *overflow_page);
    if (!delete_helper->is_system_op_started) { log_sysop_start (...); ... }
    file_dealloc (..., &overflow_vpid, FILE_BTREE);
    if (prev_page == leaf_page)
      btree_modify_leaf_ovfl_vpid (...);             /* <- relink from leaf */
    else
      btree_modify_overflow_link (...);              /* <- relink from prev ovfl */
  }
else
  btree_record_remove_object (..., BTREE_OVERFLOW_NODE, offset_to_object, &ovf_addr);

The error: label ends a self-started system op and asserts a non-no-undo purpose.

btree_record_remove_object / _internal is the byte-level splice shared by both paths: _internal computes the encoded size (OR_OID_SIZE + a second OID for unique + MVCC bytes per flags), packs removed bytes as undo, RECORD_MOVE_DATA closes the gap, packs redo; the wrapper spage_updates and btree_rv_log_delete_objects.

Invariant — only the first object of an overflow page may be the alone/last object removed by deallocation. _internal asserts node_type == BTREE_OVERFLOW_NODE || offset_to_object > 0 — it refuses to splice the leaf’s first object; that goes through replacement (§8.5/§8.6).

8.8 Removing the whole key — `btree_delete_key_from_leaf`

When the last OID leaves, the key record is deleted. If leafrec_pnt->key_len < 0 the key value lives in an overflow key page → start a system op and btree_delete_overflow_key first; if a system op runs, copy the record for undo before deleting. Then:

// btree_delete_key_from_leaf -- src/storage/btree.c
if (spage_delete (thread_p, leaf_pg, search_key->slotid) != search_key->slotid) goto exit_on_error;
key_cnt = btree_node_number_of_keys (thread_p, leaf_pg);
if (key_cnt == 0) header->max_key_len = 0;          /* <- no keys left */
LOG_RV_RECORD_SET_MODIFY_MODE (&delete_helper->leaf_addr, LOG_RV_RECORD_DELETE);
btree_rv_log_delete_object (thread_p, *delete_helper, delete_helper->leaf_addr,
                            leaf_record.length, 0, leaf_record.data, NULL); /* <- undo = record */

exit_on_error aborts a started system op. Only the page-local key_cnt updates here; index-wide num_keys/num_oids deltas flow through the unique-stats path (delete_key_and_row/delete_row, §8.2/§8.1).

8.9 Logical delete — `btree_key_remove_delete_mvccid`

The logical half (BTREE_OP_DELETE_UNDO_INSERT_DELID) rolls back an MVCC delete: the object stays, only its delete MVCCID clears. A miss is a hard ER_BTREE_UNKNOWN_KEY (undo must find it). Forks on uniqueness:

unique (_unique → btree_remove_delete_mvccid_unique_internal): the visible object must be first, so undo may move it back — if already first, just btree_record_remove_delid; else swap with the current first (fixed-size), under a system op with undoredo in an overflow page. Logged as compensate (log_append_compensate_with_undo_nxlsa, anchored to reference_lsa), or undoredo when a system op is open.
non-unique (_non_unique): no ordering — btree_record_remove_delid in place, spage_update, then log_append_compensate_with_undo_nxlsa (RVBT_RECORD_MODIFY_COMPENSATE, anchored to reference_lsa).

btree_record_remove_delid branches on has_fixed_size: a fixed-size object (overflow, or first-with-overflow) keeps its width and nulls the MVCCID (btree_set_mvccid to MVCCID_NULL); a variable-size object removes the MVCCID bytes and clears BTREE_OID_HAS_MVCC_DELID (btree_remove_mvccid).

8.10 Logical vacuum — `btree_key_remove_insert_mvccid`

BTREE_OP_DELETE_VACUUM_INSID is the mirror: vacuum clears the insert MVCCID once the object is visible to all — no undo at all. Branches: object not found -> benign, warn and return NO_ERROR; found -> assert the INSID is not all-visible, then btree_record_remove_insid strips the bytes, spage_update, and log_append_redo_data with redo-only mode RVBT_RECORD_MODIFY_NO_UNDO.

Invariant — vacuum never produces undo. Both vacuum workers (_insert_mvccid here, VACUUM_OBJECT via the physical path) emit only redo. If vacuum logged undo, a crash during recovery could “un-vacuum” and re-expose a dead version, breaking visibility.

8.11 Chapter summary — key takeaways

One orchestrator, seven purposes. btree_delete_internal switch (purpose) selects one of four leaf workers; the descent (btree_fix_root_for_delete → btree_merge_node_and_advance) is shared with insert and reaches the leaf write-latched.
Logical vs. physical are different paths. Logical (_remove_delete_mvccid undo, _remove_insert_mvccid vacuum) clears an MVCCID in place; physical (_delete_remove_object) splices OID bytes and may remove the key.
The unique first-object rule drives the leaf branches. btree_leaf_remove_object never raw-splices the first object: it removes the key (btree_delete_key_from_leaf), pulls one up from overflow (btree_replace_first_oid_with_ovfl_oid), or swaps in the last object; overflow removal forks on alone-or-not (dealloc + relink, else splice).
Logging encodes intent. Physical delete carries a key-value undo image for re-insert; vacuum is redo-only; undo-of-delete writes compensate records anchored to reference_lsa. The check_key_deleted/is_key_deleted pair lets MULTI_ROW_UPDATE recount visible objects and reverse an over-eager stat decrement.

Chapter 9: Merge and Under Fill Rebalancing

This chapter is the mirror image of Chapter 6. A merge is driven down the descent before a delete: the descent walks with btree_merge_node_and_advance, and at every non-leaf hop asks “can the child I am about to enter be folded into a sibling?” If yes, it collapses the two children into the left one, deletes the between-key separator from the parent, and recycles the right page. A degenerate two-key root with height > 2 instead triggers btree_merge_root, which pulls both children up into the root frame and shrinks the tree by one level.

For merge theory (under-fill threshold, separator fold-down, height shrink) see cubrid-btree.md, “Deletion and rebalancing”. This chapter traces only the CUBRID mechanics. The split logging this mirrors is in Chapter 6; merge re-uses the same RVBT_COPYPAGE / RVBT_INS_PGRECORDS redo verbs and the RVBT_MARK_DEALLOC_PAGE undo verb in reverse.

9.1 The merge-on-descend driver: btree_merge_node_and_advance

btree_merge_node_and_advance is the BTREE_ADVANCE_WITH_KEY_FUNCTION for delete, dispatched per-node by btree_search_key_and_apply_functions for btree_delete_internal. Its job is dual: advance one level toward the leaf, and opportunistically merge the child it enters. It fans into three regions, each ending in return: a leaf (position, no merge); a root with two keys and level > 2 (delegate to btree_merge_root); an ordinary non-leaf (pick the child following key, try to merge it with its right neighbor via btree_merge_node). Figure 9-1 is the skeleton; every labelled edge is a real branch.

flowchart TD
  A["merge_node_and_advance"] --> B{"node_level == 1 ?"}
  B -->|leaf| C{"is_root and latch not WRITE ?"}
  C -->|yes| D["promote SHARED_READER"]
  D -->|FAIL| E["restart, latch:=WRITE, return"]
  D -->|ok| F["is_leaf=true, return"]
  C -->|no| F
  B -->|non-leaf| G{"is_root and level>2 and keys==2 ?"}
  G -->|yes| H["fix L+R, compute need/force_root_merge"]
  H --> I{"need_root_merge ?"}
  I -->|no| J["search child, advance, fall to tail"]
  I -->|yes| LM{"latch == READ ?"}
  LM -->|already WRITE| M["sysop, merge_root, dealloc Q+R, commit"]
  LM -->|READ| K{"promote all 3 ?"}
  K -->|FAIL and force| L["restart, latch:=WRITE, return"]
  K -->|FAIL not force| J
  K -->|ok| M
  M --> N["advance_to_page=root, crt_page=NULL"]
  G -->|no| O["search child, fix child_page"]
  J --> O
  O --> P{"slotid<key_count and free>PAGESIZE/2 ?"}
  P -->|no| Z["advance_to_page=child, return"]
  P -->|yes| Q["fix right neighbor"]
  Q --> R{"right fixed ?"}
  R -->|not in buffer| Z
  R -->|yes| S["btree_node_mergeable"]
  S --> T{"status != NO ?"}
  T -->|no| Z
  T -->|yes| U{"latch READ ? promote all 3"}
  U -->|already WRITE| W
  U -->|FAIL and FORCE| V["restart, latch:=WRITE, return"]
  U -->|FAIL and TRY| Z
  U -->|ok| W["sysop, merge_node, dealloc right, commit"]

Figure 9-1. Control flow of btree_merge_node_and_advance, including the already-WRITE skip-promotion edges.

Leaf branch. At node_level == 1 it positions search_key with btree_search_leaf_page, then guarantees WRITE. If we descended READ and this leaf is also the root (single-page tree), it promotes; ER_PAGE_LATCH_PROMOTE_FAIL sets *restart and bumps nonleaf_latch_mode to WRITE so the descent restarts exclusively. Otherwise the leaf already holds WRITE and returns *is_leaf = true.

// btree_merge_node_and_advance -- src/storage/btree.c
if (node_header->node_level == 1)
  {
    error_code = btree_search_leaf_page (thread_p, btid_int, *crt_page, key, search_key);
    if (delete_helper->is_root && delete_helper->nonleaf_latch_mode != PGBUF_LATCH_WRITE)
      {
        error_code = pgbuf_promote_read_latch (thread_p, crt_page, PGBUF_PROMOTE_SHARED_READER);
        if (error_code == ER_PAGE_LATCH_PROMOTE_FAIL)
          { *restart = true; delete_helper->nonleaf_latch_mode = PGBUF_LATCH_WRITE; return NO_ERROR; }
      }
    *is_leaf = true;
    return NO_ERROR;
  }

Root-collapse precheck: is_root && node_level > 2 && btree_node_number_of_keys(...) == 2. The node_level > 2 guard means the tree never collapses to a single page. Both children are fixed, used bytes computed as DB_PAGESIZE - spage_get_free_space(...), and two thresholds derived with the root’s max_key_len as a defensive allowance for the folded-in separator:

// btree_merge_node_and_advance -- src/storage/btree.c
need_root_merge  = (left_used + right_used + CAN_MERGE_WHEN_EMPTY   + root_max_key_length) < DB_PAGESIZE;
force_root_merge = (left_used + right_used + FORCE_MERGE_WHEN_EMPTY + root_max_key_length) < DB_PAGESIZE;

CAN_MERGE_WHEN_EMPTY is MAX(DB_PAGESIZE*0.33, ...), FORCE_MERGE_WHEN_EMPTY MAX(*0.66, ...): need = merged page at most ~67% full, force at most ~34%. Promotion is attempted only when nonleaf_latch_mode == PGBUF_LATCH_READ; if the descent is already exclusive (restart path) promotion is skipped and the merge runs directly.

Branch matrix for the root-merge outcome (entered only after need_root_merge):

Latch / promotion result	`force_root_merge`	Action
`nonleaf_latch_mode` already WRITE	—	skip promotion, fall into merge (`else`: “Pages already latched exclusively”)
promote from READ -> all 3 latched	—	`log_sysop_start`, `btree_merge_root`, dealloc both children, `log_sysop_commit`, `advance_to_page = crt_page`, `*crt_page = NULL`
promote -> `PROMOTE_FAIL`	true	`*restart = true`, latch:=WRITE, unfix children, return
promote -> `PROMOTE_FAIL`	false	fall through, skip merge, advance to one child
other error	—	`goto error` (abort sysop if started)

Invariant — root merge re-enters the root. A successful btree_merge_root sets *advance_to_page = *crt_page and *crt_page = NULL rather than descending, re-running the callback on the same root so a second collapse fires if the shorter root again has two mergeable children. Drop it and a tree that should shrink two levels shrinks one, leaving a transiently under-filled root.

Ordinary non-leaf tail. btree_search_nonleaf_page chooses search_key->slotid and the child VPID. child_latch is WRITE with PGBUF_PROMOTE_ONLY_READER if the child is a leaf (node_level == 2), else inherits nonleaf_latch_mode with PGBUF_PROMOTE_SHARED_READER — the single-reader promote at level 2 mirrors btree_split_node_and_advance (Chapter 6). The right-neighbor precheck is the gate the reader question turns on:

// btree_merge_node_and_advance -- src/storage/btree.c
if (search_key->slotid < key_count && spage_get_free_space (thread_p, child_page) > DB_PAGESIZE / 2)

Both conditions required: slotid < key_count means a right neighbor exists; more than half free means plausibly under-filled. Under SERVER_MODE IO stress, neighbor_fetch_mode downgrades to OLD_PAGE_IF_IN_BUFFER so vacuum does not stall on cold siblings — a missing neighbor leaves right_page NULL, er_clear masks the benign miss, and the code falls through to a plain advance. Here too, an already-WRITE descent skips the promotion block and merges directly; promotion is attempted only on the READ path. There is no left-neighbor branch: only the right sibling is tried, relying on the descent visiting every node so each pair is checked from its left member.

9.2 The mergeability classifier: btree_node_mergeable

btree_node_mergeable returns three-valued BTREE_MERGE_STATUS: BTREE_MERGE_NO (skip), BTREE_MERGE_TRY (merge if latches promote cheaply), BTREE_MERGE_FORCE (mandatory; restart exclusively if promotion fails). Figure 9-2 enumerates every return.

flowchart TD
  A["node_mergeable(L,R)"] --> B{"L_cnt == 0 ?"}
  B -->|yes| F1["FORCE: left empty"]
  B -->|no| C{"L non-fence == 0 ?"}
  C -->|yes| D["R_used = uncompressed(R)"]
  D --> D1{"R + FORCE_EMPTY < PAGESIZE"}
  D1 -->|yes| F2["FORCE"]
  D1 -->|no| D2{"R + CAN_EMPTY < PAGESIZE"}
  D2 -->|yes| T1["TRY"]
  D2 -->|no| N1["NO"]
  C -->|no| E{"R_cnt == 0 ?"}
  E -->|yes| F3["FORCE: right empty"]
  E -->|no| G{"R non-fence == 0 ?"}
  G -->|yes| H["symmetric on L_used"]
  G -->|no| I{"L==1 and R==1 key ?"}
  I -->|yes| F4["FORCE: both single-key"]
  I -->|no| J["used bytes"]
  J --> K{"L+R+CAN_EMPTY < PAGESIZE ?"}
  K -->|no| N2["NO"]
  K -->|yes| Lr["recompute uncompressed for LEAF"]
  Lr --> Mm{"uncompressed still fits ?"}
  Mm -->|no| N3["NO"]
  Mm -->|yes| O{"L+R+FORCE_EMPTY < PAGESIZE ?"}
  O -->|yes| F5["FORCE"]
  O -->|no| T2["TRY"]

Figure 9-2. Return classification of btree_node_mergeable.

It counts keys with btree_node_number_of_keys, then a non-fence count by subtracting lower fence (slot 1) and upper fence (slot L_cnt) when btree_is_fence_key reports them. Three cases:

One side empty / fence-only. L_cnt == 0 -> FORCE. If L has only fences (L_non_fence_cnt == 0), R’s uncompressed size decides (btree_node_size_uncompressed, fences re-derived after merge): fits with FORCE_MERGE_WHEN_EMPTY slack -> FORCE; only with CAN_MERGE_WHEN_EMPTY -> TRY; else NO. R symmetric.
Both single-key. L_non_fence_cnt == 1 && R_non_fence_cnt == 1 -> FORCE.
Size test. With used bytes, if L_used + R_used + CAN_MERGE_WHEN_EMPTY < DB_PAGESIZE it is a candidate. Leaf nodes re-check with uncompressed sizes (merge discards the shared fence prefix, so post-merge bytes can exceed the naive sum):

// btree_node_mergeable -- src/storage/btree.c
if (l_node_type == BTREE_LEAF_NODE)
  {
    L_used = btree_node_size_uncompressed (thread_p, btid, L_page);
    if (L_used < 0) { return BTREE_MERGE_NO; }
    if (L_used + R_used + CAN_MERGE_WHEN_EMPTY >= DB_PAGESIZE) { return BTREE_MERGE_NO; }
    R_used = btree_node_size_uncompressed (thread_p, btid, R_page);
    if (L_used + R_used + CAN_MERGE_WHEN_EMPTY >= DB_PAGESIZE) { return BTREE_MERGE_NO; }
  }
if (L_used + R_used + FORCE_MERGE_WHEN_EMPTY < DB_PAGESIZE) { return BTREE_MERGE_FORCE; }
return BTREE_MERGE_TRY;

The TRY/FORCE asymmetry is the 0.33/0.66 split. The size-fails case reaches a final return BTREE_MERGE_NO.

9.3 The general merge: btree_merge_node

btree_merge_node(P, left_pg, right_pg, p_slot_id, child_vpid, status) folds right_pg into left_pg, deletes the separator at parent slot p_slot_id (the slot pointing at the right page), and recycles the right page, returning the surviving VPID in *child_vpid (always left_vpid). The 12 STEP comments map to the recovery story; Figure 9-3 is the flow.

flowchart TD
  A["btree_merge_node"] --> B["STEP 0-1: classify fences\nkeep left lower + right upper"]
  B --> C["STEP 2-3: gather rec[] in merge_buf\nrecompress if merged_prefix changed"]
  C --> D["STEP 4: undo-log left RVBT_COPYPAGE"]
  D --> E["STEP 5-7: remove upper fence,\nspage_insert rec[], append fence"]
  E --> F["STEP 8: RVBT_NDRECORD_DEL +\nspage_delete parent slot, child_vpid=left"]
  F --> G["STEP 9-10: left next_vpid past right,\nmax_key_len, redo-log left"]
  G --> H["STEP 11: right node_level=-1\nunder RVBT_MARK_DEALLOC_PAGE"]
  H --> I["STEP 12: btree_get_next_page +\nset_vpid_previous_vpid -> left"]
  E -->|spage failure| X["assert_release, goto exit_on_error"]
  F -->|delete != slot| X

Figure 9-3. STEP flow of btree_merge_node with its error exits.

STEP 0–3. Leaf pages carry a lower fence (slot 1) and upper fence (slot L_cnt) when flagged BTREE_LEAF_RECORD_FENCE. The merged page keeps the left lower fence and right upper fence; the left’s upper fence and right’s lower fence (the vanished boundary) are discarded. The right upper fence is copied out early (merged_upper_fence_record) since the right page will be recycled. Records are gathered into a scratch rec[] backed by merge_buf (grown via db_private_alloc if left_used + right_used + MAX_MERGE_ALIGN_WASTE overflows the stack buffer), re-compressing each with btree_recompress_record when the merged prefix differs from the source.

Invariant — prefix monotonicity. The merged prefix can never be longer than either source prefix. For MIDXKEY indexes with both fences, merged_prefix = pr_midxkey_common_prefix(left_fence_key, right_fence_key); otherwise it stays 0 (no-recompress fast path). The code only ever re-compresses (expands) records; compressing against a prefix longer than merged_prefix would corrupt the key bytes.

STEP 4 — undo image of left. Before mutating left, its full image is logged for undo:

// btree_merge_node -- src/storage/btree.c  (STEP 4)
log_append_undo_data2 (thread_p, RVBT_COPYPAGE, &btid->sys_btid->vfid, left_pg, -1, DB_PAGESIZE, left_pg);

This physical undo half pairs with the STEP 10 redo so a rolled-back delete reconstructs the pre-merge left page byte-for-byte (9.5).

STEP 5–7 — rewrite left. Remove the left upper fence (if present); if the prefix changed, delete and re-insert all left non-fence records, else keep them; spage_insert the gathered rec[]; append the merged upper fence. Each spage_* failure is assert_release(false) + goto exit_on_error.

STEP 8 — fold the separator out of the parent. The structural heart. The parent record at p_slot_id is read; an overflow key (nleaf_pnt.key_len < 0) has its pages freed via btree_delete_overflow_key; then the separator is logged and deleted:

// btree_merge_node -- src/storage/btree.c  (STEP 8)
btree_rv_write_log_record (recset_data, &recset_length, &peek_rec, BTREE_NON_LEAF_NODE);
log_append_undoredo_data2 (thread_p, RVBT_NDRECORD_DEL, &btid->sys_btid->vfid, P, p_slot_id,
                           recset_length, sizeof (p_slot_id), recset_data, &p_slot_id);
if (spage_delete (thread_p, P, p_slot_id) != p_slot_id) { /* ... goto exit_on_error ... */ }
*child_vpid = *left_vpid;          /* <- surviving child is always the left page */

Deleting the right page’s slot leaves the left page’s slot still pointing at left_pg, now spanning both old ranges. RVBT_NDRECORD_DEL carries the old record (undo: re-insert) and the slot id (redo: delete).

STEP 9–11 — left header, redo, mark right dead. Left next_vpid is repointed past the right page (= right_header->next_vpid), max_key_len maxed, split-info reset, redo image logged via log_append_redo_data2(RVBT_COPYPAGE, left_pg, ...) before pgbuf_set_dirty. The right page is not freed here; its node_level is set to -1 under RVBT_MARK_DEALLOC_PAGE so a concurrent re-fix detects a dead page (BTREE_IS_PAGE_VALID_LEAF); file_dealloc is in the caller.

STEP 12 — relink the following node’s prev_vpid. The node after the right page still points its prev_vpid at the recycled right page. btree_get_next_page fetches it (NULL at end of chain) and btree_set_vpid_previous_vpid rewrites the back-link to left. This is the only branch touching a fourth page, so a merge can momentarily hold P, left, right, and next. The exit_on_error tail frees merge_buf_ptr if heap-allocated; the sysop abort (9.5) then undoes everything.

9.4 The root collapse: btree_merge_root

btree_merge_root(P, Q, R) is simpler than btree_merge_node because it does not move P’s separator into a sibling — it pours both children’s records into P itself. Q and R are untouched beyond their dealloc mark, and the root’s page identity never changes, so the b-tree header pointer stays valid.

Q’s upper fence (if btree_leaf_is_flaged(..., BTREE_LEAF_RECORD_FENCE)) is excluded by decrementing Q_end; R’s lower fence by advancing R_start (root level > 2 implies non-leaf children, so fences are usually absent, but the code defends anyway). The two root separators are removed with btree_delete_meta_record — slot 2 first, then slot 1, so numbering does not shift. Q’s records are saved with btree_rv_util_save_page_records and spage_insert_at into P at slots 1..Q_end, logged once via RVBT_INS_PGRECORDS; the per-iteration error path logs only the records that did land:

// btree_merge_root -- src/storage/btree.c
for (i = 1; i <= Q_end; i++)
  {
    if (spage_get_record (...) != S_SUCCESS || spage_insert_at (...) != SP_SUCCESS)
      {
        if (i > 1) { recset_header.rec_cnt = i - 1; /* undo-log the i-1 records already inserted */ }
        goto exit_on_error;
      }
  }

R’s records follow at slots left_cnt + 1 .., logged the same way. After both blocks land, the root header is undo-logged, prev_vpid/next_vpid nulled, split-info defaulted, and node_level decremented:

// btree_merge_root -- src/storage/btree.c
btree_node_header_undo_log (thread_p, &btid->sys_btid->vfid, P);
VPID_SET_NULL (&root_header->node.prev_vpid);
VPID_SET_NULL (&root_header->node.next_vpid);
btree_write_default_split_info (&(root_header->node.split_info));
root_header->node.node_level--;
assert_release (root_header->node.node_level > 1);
btree_node_header_redo_log (thread_p, &btid->sys_btid->vfid, P);

Invariant — the root is the only level that shrinks the tree. node_level-- on the root frame is the single place height decreases. assert_release(node_level > 1) keeps the post-collapse root non-leaf — with the node_level > 2 precheck, the tree never collapses a level-3 root to a leaf in one step, nor to a single page. The #ifndef NDEBUG block also asserts the root’s max_key_len equals MAX(q,r max_key_len). Decrementing node_level without first copying the children’s records would make descent index a separator-less level and read garbage slots.

Q and R are each marked dead (node_level = -1, undo-logged via RVBT_MARK_DEALLOC_PAGE) but not deallocated — the caller does file_dealloc after btree_merge_root returns.

9.5 Back-link repair and the system-op / logical-undo bracket

btree_set_vpid_previous_vpid keeps the doubly-linked leaf chain consistent after a merge removed the right page — the exact dual of the next-link rewrite a split performs.

// btree_set_vpid_previous_vpid -- src/storage/btree.c
header = btree_get_node_header (thread_p, page_p);
if (header == NULL) { return ER_FAILED; }
btree_node_header_undo_log (thread_p, &btid->sys_btid->vfid, page_p);
header->prev_vpid = *prev;                          /* <- repoint back-link to surviving left page */
btree_node_header_redo_log (thread_p, &btid->sys_btid->vfid, page_p);
pgbuf_set_dirty (thread_p, page_p, DONT_FREE);

An early if (page_p == NULL) return NO_ERROR; makes it a no-op when the merged left page was last in the chain. The same primitive serves split and merge.

The bracket. Each merge runs inside log_sysop_start / log_sysop_commit, enclosing all physical changes — RVBT_COPYPAGE on left, RVBT_INS_PGRECORDS on root, RVBT_NDRECORD_DEL on the parent, RVBT_MARK_DEALLOC_PAGE on the recycled page, and file_dealloc:

// btree_merge_node_and_advance -- src/storage/btree.c
log_sysop_start (thread_p);
is_system_op_started = true;
/* ... btree_merge_node / btree_merge_root, then file_dealloc of the recycled page(s) ... */
log_sysop_commit (thread_p);
is_system_op_started = false;

Invariant — a structural merge is atomic and self-undoing within the delete. On any goto error between start and commit, the error: label runs log_sysop_abort, replaying the undo records in reverse: the recycled page un-deallocated, the parent separator re-inserted, the left page restored from its RVBT_COPYPAGE image — re-splitting to the pre-merge shape so the delete retries. The sysop commits before the leaf-level object removal, so the merge is a structural (nested top) action: even if the enclosing logical delete rolls back, the committed merge is never logically undone, leaving the merged shape. Leave file_dealloc outside the sysop and a crash between commit and dealloc leaks the right page; commit the dealloc without the record moves and the parent points at a freed page.

Cross-reference Chapter 6: split logs RVBT_COPYPAGE redo on the new right page and RVBT_NDRECORD_INS on the parent; merge logs the symmetric RVBT_COPYPAGE undo+redo on the surviving left page and RVBT_NDRECORD_DEL on the parent — the undo of one is structurally the other.

9.6 Chapter summary — key takeaways

Merge is driven on descent, right-sibling only. btree_merge_node_and_advance checks each child against its right neighbor as it walks down; there is no left-merge branch, and the precheck requires both a right neighbor (slotid < key_count) and an under-filled child (free > PAGESIZE/2).
Mergeability is three-valued. btree_node_mergeable returns NO/TRY/FORCE keyed on CAN_MERGE_WHEN_EMPTY (~33%) and FORCE_MERGE_WHEN_EMPTY (~66%); empty/fence-only and double-single-key pairs are always FORCE, and leaf pairs are re-checked against uncompressed size. TRY is skipped on a failed promotion while FORCE restarts the descent exclusively.
Promotion is skipped when the descent is already exclusive. Both merge blocks promote latches only when nonleaf_latch_mode == PGBUF_LATCH_READ; on the restart (already-WRITE) path the merge runs directly.
General merge folds the separator out of the parent. btree_merge_node copies right into left, deletes the parent slot pointing at the right page (RVBT_NDRECORD_DEL), repoints left_header->next_vpid, marks the right page dead (node_level = -1), and relinks the following node’s prev_vpid via btree_set_vpid_previous_vpid.
Root collapse is the only height shrink. btree_merge_root pours both children’s records into the unchanged root and decrements node_level; the node_level > 2 precheck and assert_release(node_level > 1) keep the tree off a single page and the root non-leaf.
The recycled page is marked, not freed, inside the merge. node_level = -1 (undo-logged with RVBT_MARK_DEALLOC_PAGE) lets concurrent re-fixers detect a dead page; file_dealloc happens in the driver, still inside the system op.
The merge is self-undoing. log_sysop_start/commit brackets each merge so a mid-merge error aborts back to the exact pre-merge layout (re-split), while a committed merge survives even if the enclosing delete rolls back — mirroring Chapter 6’s split logging.

Chapter 10: Special and Edge Paths

What builds or maintains the tree outside the single-key insert/delete lifecycle of Chapters 3-9: the two index-load paths, the leaf-record auxiliaries (fence keys, FK probes), unique-statistics aggregation, and deduplicate-key mode. For why a bulk build beats N inserts and what a fence key is for, see cubrid-btree.md (“Bulk Loading”, “Prefix Compression and Fence Keys”).

10.1 Offline bulk load: `xbtree_load_index`

xbtree_load_index is the CREATE INDEX path used when no concurrent writer sees the table. It builds the tree invisibly in one top system operation, then commits atomically. Because nobody reaches the file until commit, latching is bypassed — pages are fixed PGBUF_LATCH_WRITE only for the buffer manager, never to coordinate transactions. Figure 10-1 traces every branch; the load is all-or-nothing (each goto error aborts the sysop, returns NULL). The zero-key branch recreates an empty index through the regular xbtree_add_index.

flowchart TD
  A["xbtree_load_index"] --> B["init SORT_ARGS + LOAD_ARGS\nset deduplicate_key_idx"]
  B --> C["btree_index_sort -> btree_construct_leafs callback"]
  C --> D{"any keys loaded?\nleaf.pgptr != NULL"}
  D -->|yes| E["btree_save_last_leafrec"]
  E --> F{"has_fk?"}
  F -->|yes| G["btree_load_check_fk"]
  F -->|no| H["btree_build_nleafs"]
  G --> H
  D -->|no| I["log_sysop_abort\nxbtree_add_index empty"]
  H --> J{"is_sysop_started?"}
  I --> J
  J -->|yes| K["attach to outer\nif unique: undo RVBT_REMOVE_UNIQUE_STATS"]
  J -->|no| L["nothing to log"]
  K --> M["return btid"]
  L --> M

Figure 10-1. xbtree_load_index control flow with empty-index and FK branches.

10.2 Packing leaves left-to-right: `btree_construct_leafs` and `btree_first_oid`

btree_construct_leafs is the sort callback; it folds each sorted (key, oid) into the current leaf. The comments label all five branches:

// btree_construct_leafs -- src/storage/btree_load.c
  next = *(char **) recdes->data;                 /* <- save forward link before mutating recdes */
  if (VPID_ISNULL (&(load_args->leaf.vpid)))      /* <- A: very first record */
    bt_load_get_first_leaf_page_and_init_args (...);
  else {
      int c = btree_compare_key (&sparam.this_key, &load_args->current_key, ...);
      if (c == DB_GT)      bt_load_make_new_record_on_leaf_page (...);   /* <- B: new key */
      else if (c == DB_EQ) bt_load_add_same_key_to_record (...);        /* <- C: append OID */
      else { assert_release (false); goto error; }                      /* <- D: out-of-order = corruption */
    }
  if (!next) break;                               /* <- E: chain exhausted */

INVARIANT (sorted-input monotonicity). btree_compare_key(this_key, current_key) is never DB_LT; the sort guarantees non-decreasing keys, so the else arm is assert_release(false). A buggy comparator would otherwise produce an unsorted leaf chain that scans (Ch 7) and searches (Ch 3) mis-navigate; the assert turns latent corruption into immediate failure.

bt_load_make_new_record_on_leaf_page applies the fill-factor: when (cur_maxspace - leaf_nleaf_recdes.length) < LOAD_FIXED_EMPTY_FOR_LEAF and spage_number_of_records > 1 it calls btree_proceed_leaf to start a fresh leaf (chaining prev/next) before spage_insert, then btree_first_oid. LOAD_FIXED_EMPTY_FOR_LEAF reserves slack for later in-place inserts; the > 1 guard prevents an infinite loop when one oversized key (already overflow-backed) exceeds the threshold. btree_first_oid branches on key size (key_len < BTREE_MAX_KEYLEN_INPAGE → BTREE_NORMAL_KEY, else BTREE_OVERFLOW_KEY, creating the overflow-key file via btree_create_overflow_key_file if ovfid is null) and on the first object’s delete flag: MVCC_IS_HEADER_DELID_VALID → curr_non_del_obj_count = 0 (only object already MVCC-deleted, key not live), else = 1 and ++n_keys. A key whose only object is already deleted does not count toward the distinct-key statistic — the bulk-load analogue of the runtime bookkeeping in §10.9.

10.3 Bottom-up internal levels and `btree_load_new_page`

btree_build_nleafs builds internal levels one at a time with two work lists (push_list/pop_list) that ping-pong (Figure 10-2). Every page goes through btree_load_new_page, whose node_level argument discriminates:

// btree_load_new_page -- src/storage/btree_load.c
  assert (log_check_system_op_is_started (thread_p));   /* <- caller must hold a sysop */
  log_sysop_start (thread_p);                            /* <- nested sysop just for this alloc */
  file_alloc (thread_p, &btid->vfid, btree_initialize_new_page, NULL, vpid_new, page_new);
  if (header)        /* <- leaf/non-leaf (node_level >= 1) */
    { header->node_level = node_level; header->common_prefix = 0; btree_init_node_header (...); }
  else               /* <- overflow page (node_level == -1) */
    { VPID_SET_NULL (&ovf_header_info.next_vpid); btree_init_overflow_header (...); }
end:                 /* <- error: log_sysop_abort; success: log_sysop_commit (committed independently) */
  if (error_code != NO_ERROR) log_sysop_abort (thread_p); else log_sysop_commit (thread_p);

INVARIANT (per-page allocation durability). Each allocation is wrapped in its own nested system operation that commits independently. Source comment: “we need to commit page allocations. if loading index is aborted, the entire file is destroyed.” Because the whole file is dropped on outer abort, the error path is sloppy about freeing individual pages — it relies on file_destroy at the top level. The assert enforces that no caller allocates a load page outside the protective top operation.

btree_proceed_leaf (§10.2) calls this with node_level == 1 per new leaf and threads prev_vpid/next_vpid so Chapter 7’s range scan chain is built right.

flowchart TD
  A["btree_build_nleafs"] --> B["pop children level k"]
  B --> C["btree_connect_page -> push parent level k+1"]
  C --> D{"pop_list length > 0?"}
  D -->|yes| B
  D -->|no| E["swap push_list / pop_list\nlevel := k+1"]
  E --> F{"more than one page\nat this level?"}
  F -->|yes| B
  F -->|no| G["write root header\nSET_DECOMPRESS_IDX_HEADER"]

Figure 10-2. Bottom-up internal-level construction in btree_build_nleafs.

10.4 Online (concurrent) index build: `xbtree_load_online_index`

The online path lets DML run while the index is built, differing from §10.1 three ways: an MVCC builder_snapshot gives a consistent table version; the schema lock is demoted per class from SCH_M_LOCK to IX_LOCK (lock_demote_class_lock) so writers are not blocked, then re-promoted at the end; and batched inserts go to a worker pool via online_index_builder. It sets tdes->has_deadlock_priority = true so the builder wins deadlocks against DML. The re-promotion loop is the most safety-critical edge path: the lock must return to SCH_M_LOCK before commit/rollback, else the catalog mutation races concurrent readers.

// xbtree_load_online_index -- src/storage/btree_load.c
  xlogtb_reset_wait_msecs (thread_p, LK_INFINITE_WAIT);   /* <- never give up */
  while (true) {
      lock_ret = lock_object (..., SCH_M_LOCK, LK_UNCOND_LOCK);
      if (lock_ret == LK_GRANTED) break;
      else if (lock_ret == LK_NOTGRANTED_DUE_ERROR && er_errid () == ER_INTERRUPTED)
        { er_clear (); logtb_set_tran_index_interrupt (...); continue; }   /* <- swallow interrupt, retry */
      else if (lock_ret == LK_NOTGRANTED_DUE_TIMEOUT && css_is_shutdowning_server ())
        { er_clear (); continue; }                       /* <- shutdown: still must promote */
      assert (0);                                         /* <- any other outcome unacceptable */
    }

Only after the lock is back does it honor the builder’s return code (if (ret != NO_ERROR) goto error;). For unique indexes it re-validates uniqueness across all classes via btree_online_index_check_unique_constraint — a concurrent inserter might have created a collision the snapshot builder could not see. The snapshot is invalidated on both exits.

10.5 The insert-list dispatcher: `online_index_builder`

online_index_builder is the producer. Its heap_next loop branches on the scan result: S_END breaks (heap exhausted), S_ERROR sets the error and breaks, any other non-S_SUCCESS trips assert (false). With a filter_pred (partial index) a V_ERROR aborts and a non-V_TRUE continues to skip the row. Otherwise it generates the key (heap_attrinfo_generate_key; NULL ⇒ ER_FAILED), lazily allocates an index_builder_loader_task, and calls add_key (§10.6); a BATCH_FULL return pushes the task to ib_workpool and bumps tasks_started. A set load_context.m_has_error aborts with ER_IB_ERROR_ABORT. After the loop it pushes the final partial task, spins until m_tasks_executed == tasks_started, and for a unique index calls logtb_tran_update_btid_unique_stats with (num_keys, num_oids, num_nulls) (§10.9).

Each task’s execute sorts its batch (prepare_list) then loops calling btree_online_index_list_dispatcher once per key with the purpose BTREE_OP_ONLINE_INDEX_IB_INSERT; on error it sets m_load_context.m_has_error (atomic exchange) and breaks. The BTREE_OP_ONLINE_INDEX_* enum (in btree.h) is the routing key the regular insert/delete machinery checks. Its seven values split into the builder’s _IB_INSERT/_IB_DELETE; the concurrent-DML _TRAN_INSERT, _TRAN_INSERT_DF (object pre-flagged DELETE so vacuum reclaims it without a second descent — used when an UPDATE inserts a row it is about to delete), _TRAN_DELETE; and the rollback _UNDO_TRAN_INSERT/_UNDO_TRAN_DELETE.

10.6 `add_key`: NULL handling and the batch boundary

index_builder_loader_task::add_key filters NULLs from the physical tree but still counts them for stats:

// index_builder_loader_task::add_key -- src/storage/btree_load.c
  if (DB_IS_NULL (key) || btree_multicol_key_is_null (const_cast<DB_VALUE *>(key)))
    { if (BTREE_IS_UNIQUE (m_unique_pk)) ++m_insert_list.m_ignored_nulls_cnt;  /* <- count, don't store */
      return BATCH_CONTINUE; }
  m_memsize += m_insert_list.add_key (key, oid);
  return (m_memsize > prm_get_bigint_value (PRM_ID_IB_TASK_MEMSIZE)) ? BATCH_FULL : BATCH_CONTINUE;

INVARIANT (NULLs are tracked, never stored). A NULL (or all-NULL multi-column) key is never placed in the tree, but for a unique index it increments m_ignored_nulls_cnt, which execute folds into both m_num_oids and m_num_nulls. This preserves rows == keys + nulls (§10.9): NULLs add a row and a null but no key, as the SQL standard requires (multiple NULLs do not violate uniqueness). Storing NULLs, or failing to count them, would make the count query and the uniqueness check diverge.

10.7 Fence keys: `BTREE_LEAF_RECORD_FENCE`

A fence key is a sentinel leaf record (flagged BTREE_LEAF_RECORD_FENCE) recording a leaf’s boundary without a real object. CUBRID uses fences as the prefix-compression anchor: every key’s common prefix is computed against the leaf’s fence, so records store only their suffix. Readers must skip them. The decompressor in btree_read_record rebuilds the full key only for a non-fence record on a page with positive prefix:

// btree_read_record (decompress branch) -- src/storage/btree.c
  if (node_type == BTREE_LEAF_NODE && !btree_leaf_is_flaged (rec, BTREE_LEAF_RECORD_OVERFLOW_KEY)
      && !btree_leaf_is_flaged (rec, BTREE_LEAF_RECORD_FENCE)) {     /* <- never decompress a fence */
      int n_prefix = btree_node_get_common_prefix (thread_p, btid, pgptr);
      if (n_prefix > 0) {
          spage_get_record (thread_p, pgptr, 1, &peek_rec, PEEK);    /* <- slot 1 = fence */
          assert (btree_leaf_is_flaged (&peek_rec, BTREE_LEAF_RECORD_FENCE));
          pr_midxkey_add_prefix (&result, &lf_key, key, n_prefix);   /* <- fence prefix + record suffix */
        }
    }

INVARIANT (slot 1 is the prefix anchor). When a leaf has a positive common_prefix, slot 1 holds the fence key and the decompressor asserts it. A non-fence record at slot 1 would make pr_midxkey_add_prefix graft the wrong prefix and produce corrupt keys.

Scan skipping is the mirror image. In btree_search_leaf_page’s binary search, an exact DB_EQ match against a fence is never a hit:

// btree_search_leaf_page -- src/storage/btree.c
  if (c == DB_EQ) {
      if (btree_leaf_is_flaged (&rec, BTREE_LEAF_RECORD_FENCE)) {
          assert (middle == 1 || middle == key_cnt);
          if (middle == 1) c = DB_GT;                     /* <- left fence: fall through to next key */
          else if (middle == key_cnt)
            { search_key->result = BTREE_KEY_BIGGER; search_key->slotid = key_cnt; return NO_ERROR; }
        }
      else { search_key->result = BTREE_KEY_FOUND; search_key->slotid = middle; return NO_ERROR; }
    }

This differs from the first-key fast check btree_leaf_is_key_between_min_max, which on a fence at slot 1 sets result = BTREE_KEY_BETWEEN, slotid = -1 to force a full search. Either way a range scan (Ch 7) never returns a fence object.

10.8 Foreign-key probes: `btree_find_foreign_key` and `btree_check_foreign_key`

Both FK helpers reuse the range scan; they differ in direction. btree_check_foreign_key runs on the referencing side — verifying a child key exists in the parent’s PK index. The annotations label every branch:

// btree_check_foreign_key -- src/storage/btree.c
  if (has_null == true) return NO_ERROR;              /* <- 1: NULL FK column trivially satisfied */
  classrepr = heap_classrepr_get (...);
  if (classrepr == NULL) goto exit_on_error;          /* <- 2: no repr -> cleanup + normalize errid */
  if (classrepr->has_partition_info > 0) {            /* <- 3: partitioned parent -> prune to local btid */
      partition_init_pruning_context (&pcontext); partition_load_pruning_context (...); }
  BTID_COPY (&local_btid, pk_btid);
  if (classrepr->has_partition_info > 0 && pcontext.partitions != NULL)
      partition_prune_unique_btid (&pcontext, keyval, &part_oid, &class_hfid, &local_btid);
  ret_search = xbtree_find_unique (thread_p, &local_btid, S_SELECT_WITH_LOCK, keyval, ...);
  if (ret_search == BTREE_KEY_NOTFOUND)               /* <- 4a: no parent row -> FK violation */
    { er_set (...ER_FK_INVALID...); ret = ER_FK_INVALID; goto exit_on_error; }
  else if (ret_search == BTREE_ERROR_OCCURRED) { ASSERT_ERROR_AND_SET (ret); goto exit_on_error; } /* <- 4b */
  assert (ret_search == BTREE_KEY_FOUND);             /* <- 4c: parent exists -> ok */
  // ... success and exit_on_error share cleanup: clear pcontext + heap_classrepr_free_and_init ...

Success and exit_on_error share the cleanup; only exit_on_error forces a non-NO_ERROR return via (ret == NO_ERROR && er_errid() == NO_ERROR) ? ER_FAILED : ret.

btree_find_foreign_key runs on the referenced side — before deleting/updating a parent PK it must find any child row still referencing the key, locking it so the check is stable. It calls btree_remake_foreign_key_with_PK (yielding is_newly); when !is_newly it pr_share_values the key into both range bounds, sets range = GE_LE, find_fk_object.lock_mode = S_LOCK, threads btree_scan.is_fk_remake = is_newly, and drives btree_range_scan with the btree_range_scan_find_fk_any_object visitor. On error or no match (OID_ISNULL (&find_fk_object.found_oid)) it releases any lock taken (lock_unlock_object_donot_move_to_non2pl if locked_object set). The is_newly flag plumbs deduplicate-key mode: btree_remake_foreign_key_with_PK may rebuild the probe key so range bounds match the on-disk dedup-augmented key; when it does the original key is not shared and the rebuilt copy is cleared on exit (§10.10).

10.9 Unique statistics: `btree_unique_stats` and `multi_index_unique_stats`

A unique index keeps three counters so COUNT(*), COUNT(DISTINCT key), and the uniqueness check answer without scanning. btree_unique_stats is the per-index triple (stat_type m_rows, m_keys, m_nulls; in btree_unique.hpp).

Field	Role	Why it exists
`m_rows`	total objects (OIDs), including NULL-keyed ones	answers `COUNT(*)`; left side of the uniqueness identity
`m_keys`	number of non-NULL objects (one per row with a real key)	the “distinct enough” count; participates in `is_unique`
`m_nulls`	number of objects whose indexed key is NULL	NULLs are exempt from uniqueness, so tallied separately

The mutators move two counters together; no setter touches one alone:

// btree_unique_stats mutators -- src/storage/btree_unique.cpp
  void insert_key_and_row ()  { ++m_keys;  ++m_rows; }
  void insert_null_and_row () { ++m_nulls; ++m_rows; }
  void add_row ()             {            ++m_rows; }   /* <- dup OID of existing key */
  void delete_key_and_row ()  { --m_keys;  --m_rows; }
  void delete_null_and_row () { --m_nulls; --m_rows; }
  void delete_row ()          {            --m_rows; }
  bool is_unique () const     { return m_rows == m_keys + m_nulls; }

INVARIANT (m_rows == m_keys + m_nulls). Enforced structurally: every mutator changing m_keys or m_nulls changes m_rows by the same amount, and add_row/delete_row (a dup OID of an existing key) move only m_rows. A unique index is consistent exactly when is_unique() is true; `m_rows > m_keys

m_nullsmeans some key has two live OIDs — a violation. The+=/-=` operators move all three counters componentwise, so aggregating partial stats from worker threads (§10.5) cannot break the identity.

multi_index_unique_stats aggregates several indexes touched by one statement (multi-column, partitioned, or the online builder’s per-class loop). Its btid_comparator orders by (root_pageid, vfid.volid).

Field	Role	Why it exists
`m_stats_map`	`std::map<BTID, btree_unique_stats, btid_comparator>`	one stats triple per index touched; the map shape means an index appears at most once

Mutators delegate to operator+=: add_index_stats(index, us) does m_stats_map[index] += us (asserting !BTID_IS_NULL) and operator+=(other) folds every entry the same way, inheriting the per-index identity. construct/destruct use placement new/delete because the object can live inside a transaction descriptor whose memory is managed elsewhere.

Feeding GLOBAL_UNIQUE_STATS_TABLE. Per-transaction stats are local until commit, then flushed into the server-global GLOBAL_UNIQUE_STATS_TABLE (an LF_HASH_TABLE keyed by BTID). Two recovery hooks replay the logged flush. btree_rv_redo_global_unique_stats_commit unpacks (btid, num_nulls, num_oids, num_keys) and calls logtb_rv_update_global_unique_stats_by_abs — replaying the absolute totals (idempotent). btree_rv_undo_global_unique_stats_commit unpacks the same fields but calls logtb_update_global_unique_stats_by_delta with negated counts, reversing the aborted transaction’s contribution; during LOG_RECOVERY_UNDO_PHASE it first checks disk_is_page_sector_reserved, and if the index file is gone returns NO_ERROR — honoring the rule that broken references are tolerated. This is the RVBT_REMOVE_UNIQUE_STATS lifecycle §10.1 hooks for the bulk-load undo.

10.10 Deduplicate-key mode: `BTID_INT.deduplicate_key_idx`

SUPPORT_DEDUPLICATE_KEY_MODE lets a non-unique index store duplicate keys efficiently by appending a hidden discriminator column. The position is computed once at load (dk_get_deduplicate_key_position) into btid_int.deduplicate_key_idx and persisted into the root header at finalization via SET_DECOMPRESS_IDX_HEADER (root_header, load_args->btid->deduplicate_key_idx), read back on every open via GET_DECOMPRESS_IDX_HEADER. Three places diverge from the regular path: (1) range scan / prefix length sets env->same_prefix_len = ...deduplicate_key_idx so the dedup column bounds user-key prefix from discriminator suffix; (2) foreign-key remake (§10.8) rebuilds the probe key to include/strip the dedup column via the is_newly/is_fk_remake flags; (3) the same header field doubles as the compression-index marker (SET/GET_DECOMPRESS_IDX_HEADER), reusing the slot rather than spending a new field. For a plain index deduplicate_key_idx is the sentinel dk_get_deduplicate_key_position returns when no dedup column applies, and all three branches collapse to the ordinary behavior of Chapters 3, 7, 8.

10.11 Chapter summary — key takeaways

Offline bulk load builds invisibly, then commits atomically. xbtree_load_index sorts the heap, packs leaves left-to-right via the btree_construct_leafs callback, builds internal levels bottom-up, skips runtime latching; any error or empty result aborts the whole file.
Sorted input is a load-time invariant. btree_construct_leafs trusts btree_compare_key to never return DB_LT; the assert_release converts a comparator bug into immediate failure, not a silently unsorted chain.
Every load page allocation is its own committed sysop. btree_load_new_page brackets file_alloc in a nested system operation that commits independently, relying on whole-file destruction for cleanup; node_level discriminates leaf/non-leaf from overflow.
Online build trades a sort for a snapshot, lock demotion, and a worker pool. xbtree_load_online_index demotes SCH_M_LOCK to IX_LOCK, drives online_index_builder to dispatch batched BTREE_OP_ONLINE_INDEX_IB_INSERT tasks, then must re-promote the schema lock in an uninterruptible loop and re-check uniqueness for concurrent collisions.
NULL keys are tracked, never stored. add_key skips NULLs from the tree but counts them for unique stats, preserving rows == keys + nulls.
Fence keys anchor prefix compression and are skipped by readers. A BTREE_LEAF_RECORD_FENCE at slot 1 supplies the common prefix; the decompressor never decompresses a fence, and btree_search_leaf_page on an equal-but-fence match either falls through (c = DB_GT at slot 1) or returns BTREE_KEY_BIGGER (slot key_cnt) instead of BTREE_KEY_FOUND.
Unique stats are a structurally-enforced triple feeding a global hash. btree_unique_stats moves m_keys/m_nulls in lockstep with m_rows; multi_index_unique_stats aggregates per BTID; commit flushes into GLOBAL_UNIQUE_STATS_TABLE with redo writing absolute totals and undo subtracting a delta (tolerating an already-dropped index). deduplicate_key_idx, stamped into the root header at build time, bends the range-scan, FK-remake, and compression paths into dedup-key mode.

Position hints as of this revision

The following are line numbers as observed on 2026-06-08; symbols are the canonical anchor and line numbers are hints that decay.

Symbol	File	Line
`OR_OID_SIZE`	`src/base/object_representation_constants.h`	67
`OR_OID_PAGEID`	`src/base/object_representation_constants.h`	68
`OR_OID_SLOTID`	`src/base/object_representation_constants.h`	69
`OR_OID_VOLID`	`src/base/object_representation_constants.h`	70
`BTREE_SPLIT_LOWER_BOUND`	`src/storage/btree.c`	79
`BTREE_SPLIT_MIN_PIVOT`	`src/storage/btree.c`	82
`BTREE_SPLIT_DEFAULT_PIVOT`	`src/storage/btree.c`	85
`BTREE_NODE_MAX_SPLIT_SIZE`	`src/storage/btree.c`	88
`CAN_MERGE_WHEN_EMPTY`	`src/storage/btree.c`	101
`FORCE_MERGE_WHEN_EMPTY`	`src/storage/btree.c`	104
`BTREE_LEAF_RECORD_FENCE`	`src/storage/btree.c`	113
`BTREE_LEAF_RECORD_OVERFLOW_OIDS`	`src/storage/btree.c`	115
`BTREE_LEAF_RECORD_OVERFLOW_KEY`	`src/storage/btree.c`	117
`BTREE_LEAF_RECORD_CLASS_OID`	`src/storage/btree.c`	119
`BTREE_LEAF_RECORD_MASK`	`src/storage/btree.c`	121
`BTREE_OID_HAS_MVCC_INSID`	`src/storage/btree.c`	124
`BTREE_OID_HAS_MVCC_DELID`	`src/storage/btree.c`	126
`BTREE_RECORD_OR_BUF_INIT`	`src/storage/btree.c`	260
`BTREE_GET_MVCC_INFO_SIZE_FROM_FLAGS`	`src/storage/btree.c`	278
`BTREE_MERGE_NO`	`src/storage/btree.c`	311
`BTREE_MERGE_TRY`	`src/storage/btree.c`	312
`BTREE_MERGE_FORCE`	`src/storage/btree.c`	313
`btree_search_key_helper`	`src/storage/btree.c`	358
`BTREE_FIND_UNIQUE_HELPER`	`src/storage/btree.c`	387
`BTS_IS_SOFT_CAPACITY_ENOUGH`	`src/storage/btree.c`	568
`BTS_IS_HARD_CAPACITY_ENOUGH`	`src/storage/btree.c`	582
`BTS_SAVE_OID_IN_BUFFER`	`src/storage/btree.c`	592
`btree_insert_helper`	`src/storage/btree.c`	662
`BTREE_INSERT_HELPER`	`src/storage/btree.c`	662
`btree_delete_helper`	`src/storage/btree.c`	766
`BTREE_DELETE_HELPER`	`src/storage/btree.c`	766
`btree_or_get_object`	`src/storage/btree.c`	1458
`btree_fix_root_with_info`	`src/storage/btree.c`	1850
`btree_create_overflow_key_file`	`src/storage/btree.c`	1975
`btree_leaf_record_change_overflow_link`	`src/storage/btree.c`	2318
`btree_leaf_get_first_object`	`src/storage/btree.c`	2451
`btree_record_get_num_oids`	`src/storage/btree.c`	2757
`btree_leaf_change_first_object`	`src/storage/btree.c`	2822
`btree_leaf_set_flag`	`src/storage/btree.c`	3461
`btree_record_object_set_mvcc_flags`	`src/storage/btree.c`	3480
`btree_append_oid`	`src/storage/btree.c`	3637
`btree_record_append_object`	`src/storage/btree.c`	3798
`btree_start_overflow_page`	`src/storage/btree.c`	3973
`btree_get_disk_size_of_key`	`src/storage/btree.c`	4065
`btree_write_record`	`src/storage/btree.c`	4100
`btree_read_record`	`src/storage/btree.c`	4257
`btree_read_record_without_decompression`	`src/storage/btree.c`	4515
`btree_search_nonleaf_page`	`src/storage/btree.c`	5189
`btree_leaf_is_key_between_min_max`	`src/storage/btree.c`	5369
`btree_search_leaf_page`	`src/storage/btree.c`	5537
`btree_find_foreign_key`	`src/storage/btree.c`	6361
`btree_delete_key_from_leaf`	`src/storage/btree.c`	9653
`btree_replace_first_oid_with_ovfl_oid`	`src/storage/btree.c`	9781
`btree_modify_leaf_ovfl_vpid`	`src/storage/btree.c`	9978
`btree_modify_overflow_link`	`src/storage/btree.c`	10054
`btree_delete_meta_record`	`src/storage/btree.c`	10148
`btree_write_default_split_info`	`src/storage/btree.c`	10247
`btree_merge_root`	`src/storage/btree.c`	10284
`btree_merge_node`	`src/storage/btree.c`	10557
`btree_node_size_uncompressed`	`src/storage/btree.c`	11097
`btree_node_mergeable`	`src/storage/btree.c`	11172
`btree_key_append_object_as_new_overflow`	`src/storage/btree.c`	11330
`btree_find_oid_and_its_page`	`src/storage/btree.c`	11646
`btree_find_oid_does_mvcc_info_match`	`src/storage/btree.c`	11794
`btree_find_oid_from_leaf`	`src/storage/btree.c`	11946
`btree_find_oid_from_ovfl`	`src/storage/btree.c`	12037
`btree_find_split_point`	`src/storage/btree.c`	12419
`btree_split_find_pivot`	`src/storage/btree.c`	12849
`btree_split_next_pivot`	`src/storage/btree.c`	12874
`btree_recompress_record`	`src/storage/btree.c`	13118
`btree_split_node`	`src/storage/btree.c`	13324
`btree_split_root`	`src/storage/btree.c`	14184
`btree_locate_key`	`src/storage/btree.c`	14882
`btree_keyval_search`	`src/storage/btree.c`	15458
`btree_scan_update_range`	`src/storage/btree.c`	16055
`btree_rv_util_save_page_records`	`src/storage/btree.c`	17242
`btree_get_next_page`	`src/storage/btree.c`	19384
`btree_set_vpid_previous_vpid`	`src/storage/btree.c`	19435
`btree_compare_key`	`src/storage/btree.c`	19460
`btree_check_foreign_key`	`src/storage/btree.c`	22655
`btree_rv_undo_global_unique_stats_commit`	`src/storage/btree.c`	23050
`btree_rv_redo_global_unique_stats_commit`	`src/storage/btree.c`	23120
`btree_search_key_and_apply_functions`	`src/storage/btree.c`	23186
`btree_get_root_with_key`	`src/storage/btree.c`	23390
`btree_advance_and_find_key`	`src/storage/btree.c`	23455
`btree_key_find_and_lock_unique`	`src/storage/btree.c`	23645
`btree_key_find_and_lock_unique_of_unique`	`src/storage/btree.c`	23673
`btree_key_find_and_lock_unique_of_non_unique`	`src/storage/btree.c`	23903
`btree_range_scan_start`	`src/storage/btree.c`	24926
`btree_range_scan_resume`	`src/storage/btree.c`	25024
`btree_range_scan_read_record`	`src/storage/btree.c`	25185
`btree_range_scan_advance_over_filtered_keys`	`src/storage/btree.c`	25230
`btree_range_scan_descending_fix_prev_leaf`	`src/storage/btree.c`	25451
`btree_range_scan`	`src/storage/btree.c`	25794
`btree_range_scan_select_visible_oids`	`src/storage/btree.c`	26008
`btree_select_visible_object_for_range_scan`	`src/storage/btree.c`	26330
`btree_range_scan_find_fk_any_object`	`src/storage/btree.c`	26545
`btree_insert_internal`	`src/storage/btree.c`	26976
`btree_fix_root_for_insert`	`src/storage/btree.c`	27154
`btree_get_max_new_data_size`	`src/storage/btree.c`	27429
`btree_split_node_and_advance`	`src/storage/btree.c`	27495
`btree_key_insert_new_object`	`src/storage/btree.c`	28120
`btree_key_insert_new_key`	`src/storage/btree.c`	28302
`btree_key_lock_and_append_object_unique`	`src/storage/btree.c`	28572
`btree_key_append_object_non_unique`	`src/storage/btree.c`	28927
`btree_key_append_object_unique`	`src/storage/btree.c`	29042
`btree_key_relocate_last_into_ovf`	`src/storage/btree.c`	29156
`btree_key_append_object_into_ovf`	`src/storage/btree.c`	29309
`btree_delete_internal`	`src/storage/btree.c`	30525
`btree_fix_root_for_delete`	`src/storage/btree.c`	30687
`btree_merge_node_and_advance`	`src/storage/btree.c`	30895
`btree_key_delete_remove_object`	`src/storage/btree.c`	31493
`btree_key_remove_object_and_keep_visible_first`	`src/storage/btree.c`	31753
`btree_leaf_record_replace_first_with_last`	`src/storage/btree.c`	32046
`btree_record_remove_object`	`src/storage/btree.c`	32137
`btree_record_remove_object_internal`	`src/storage/btree.c`	32230
`btree_key_remove_object`	`src/storage/btree.c`	32298
`btree_overflow_remove_object`	`src/storage/btree.c`	32344
`btree_leaf_remove_object`	`src/storage/btree.c`	32517
`btree_key_remove_insert_mvccid`	`src/storage/btree.c`	32637
`btree_key_remove_delete_mvccid`	`src/storage/btree.c`	32818
`btree_key_remove_delete_mvccid_unique`	`src/storage/btree.c`	32986
`btree_remove_delete_mvccid_unique_internal`	`src/storage/btree.c`	33113
`btree_key_remove_delete_mvccid_non_unique`	`src/storage/btree.c`	33257
`btree_record_remove_insid`	`src/storage/btree.c`	33419
`btree_record_remove_delid`	`src/storage/btree.c`	33479
`btree_online_index_list_dispatcher`	`src/storage/btree.c`	34209
`btree_rv_log_delete_object`	`src/storage/btree.c`	35823
`btree_rv_log_insert_object`	`src/storage/btree.c`	35885
`BTREE_NEED_UNIQUE_CHECK`	`src/storage/btree.h`	63
`BTREE_NODE_TYPE`	`src/storage/btree.h`	81
`non_leaf_rec`	`src/storage/btree.h`	103
`leaf_rec`	`src/storage/btree.h`	111
`btid_int`	`src/storage/btree.h`	119
`BTID_INT.deduplicate_key_idx`	`src/storage/btree.h`	133
`btree_keyrange`	`src/storage/btree.h`	139
`bts_key_status`	`src/storage/btree.h`	151
`btree_scan`	`src/storage/btree.h`	198
`BTREE_INIT_SCAN`	`src/storage/btree.h`	297
`BTREE_END_OF_SCAN`	`src/storage/btree.h`	368
`btree_op_purpose`	`src/storage/btree.h`	535
`BTREE_OP_DELETE_OBJECT_PHYSICAL`	`src/storage/btree.h`	549
`BTREE_OP_ONLINE_INDEX_IB_INSERT`	`src/storage/btree.h`	565
`btree_mvcc_info`	`src/storage/btree.h`	581
`btree_object_info`	`src/storage/btree.h`	594
`BTREE_RANGE_SCAN_PROCESS_KEY_FUNC`	`src/storage/btree.h`	710
`btree_node_header_undo_log`	`src/storage/btree_load.c`	410
`btree_node_header_redo_log`	`src/storage/btree_load.c`	432
`xbtree_load_index`	`src/storage/btree_load.c`	856
`dk_get_deduplicate_key_position`	`src/storage/btree_load.c`	925
`btree_save_last_leafrec`	`src/storage/btree_load.c`	1275
`SET_DECOMPRESS_IDX_HEADER`	`src/storage/btree_load.c`	1929
`btree_load_new_page`	`src/storage/btree_load.c`	2030
`btree_proceed_leaf`	`src/storage/btree_load.c`	2122
`btree_first_oid`	`src/storage/btree_load.c`	2199
`bt_load_make_new_record_on_leaf_page`	`src/storage/btree_load.c`	2582
`LOAD_FIXED_EMPTY_FOR_LEAF`	`src/storage/btree_load.c`	2591
`btree_construct_leafs`	`src/storage/btree_load.c`	3011
`xbtree_load_online_index`	`src/storage/btree_load.c`	4992
`online_index_builder`	`src/storage/btree_load.c`	5287
`index_builder_loader_task::add_key`	`src/storage/btree_load.c`	5526
`index_builder_loader_task::execute`	`src/storage/btree_load.c`	5562
`LEAF_FENCE_MAX_SIZE`	`src/storage/btree_load.h`	89
`NON_LEAF_ENTRY_MAX_SIZE`	`src/storage/btree_load.h`	96
`BTREE_OBJECT_FIXED_SIZE`	`src/storage/btree_load.h`	130
`BTREE_MAX_KEYLEN_INPAGE`	`src/storage/btree_load.h`	147
`BTREE_MAX_OIDCOUNT_IN_LEAF_RECORD`	`src/storage/btree_load.h`	155
`BTREE_GET_KEY_LEN_IN_PAGE`	`src/storage/btree_load.h`	167
`btree_node_header`	`src/storage/btree_load.h`	206
`btree_root_header`	`src/storage/btree_load.h`	218
`btree_unique_stats::insert_key_and_row`	`src/storage/btree_unique.cpp`	72
`btree_unique_stats::is_unique`	`src/storage/btree_unique.cpp`	118
`multi_index_unique_stats::add_index_stats`	`src/storage/btree_unique.cpp`	168
`btree_unique_stats`	`src/storage/btree_unique.hpp`	34
`multi_index_unique_stats`	`src/storage/btree_unique.hpp`	69
`spage_find_empty_slot_at`	`src/storage/slotted_page.c`	1674
`spage_insert_at`	`src/storage/slotted_page.c`	1902
`spage_header`	`src/storage/slotted_page.h`	64
`spage_slot`	`src/storage/slotted_page.h`	88
`NON_LEAF_RECORD_SIZE`	`src/storage/storage_common.h`	134
`LEAF_RECORD_SIZE`	`src/storage/storage_common.h`	136
`btree_node_split_info`	`src/storage/storage_common.h`	140
`BTREE_SEARCH`	`src/storage/storage_common.h`	399
`BTREE_CONSTRAINT_UNIQUE`	`src/storage/storage_common.h`	616
`BTREE_CONSTRAINT_PRIMARY_KEY`	`src/storage/storage_common.h`	617

Sources

cubrid-btree.md — the high-level companion (design intent, theory).
Raw analysis: raw/code-analysis/cubrid/storage/index/.
Code: src/storage/btree.{c,h}, btree_load.{c,h}, btree_unique.{cpp,hpp}; slotted-page layout in src/storage/slotted_page.h.
Methodology: knowledge/methodology/code-analysis-detail-doc.md.

CUBRID B+Tree — Code-Level Deep Dive

Chapter 1: Data Structure Map

1.1 The slotted-page substrate: spage_header and spage_slot

1.2 The B+Tree node header and root header

1.3 Per-key record fixed prefixes: NON_LEAF_REC and LEAF_REC