CUBRID Heap Manager — Code-Level Deep Dive

Where this document fits: The high-level analysis cubrid-heap-manager.md covers design intent and theoretical background. This document traces every branch and field at the code level. Each chapter is self-contained, but reading in order follows the full lifecycle of a single heap record inside the kernel.

Contents:

| Ch | Title |
| --- | --- |
| 1 | Data Structure Map |
| 2 | Slotted Page Primitives and Page Initialization |
| 3 | Record Types and the MVCC Record Header |
| 4 | Insert Flow and OID Assignment |
| 5 | Read Path Visibility and Following the Forwarding Chain |
| 6 | Update Flow and Record Type Transitions |
| 7 | Delete Flow and Tombstoning |
| 8 | Vacuum Reclamation and Page Vacuum Status |
| 9 | Best Space and Free Space Management |
| 10 | Crash Recovery and the Redo Undo Log Paths |

Chapter 1: Data Structure Map

The field-by-field reference the rest of the document leans on. MVCC visibility, slotted-page theory, and the forwarding/overflow rationale are not re-derived here; see the high-level companion (cubrid-heap-manager.md). Three layers are stitched together by the record OID: the page layer (slotted_page.h), the heap-file layer (heap_file.c), and the transient operation/read bundles (heap_file.h).

graph TB
  subgraph disk["On-disk heap page"]
    HDR["SPAGE_HEADER"]
    SLOT0["SPAGE_SLOT[0]"]
    SLOTN["SPAGE_SLOT[1..n]"]
    HDR --> SLOT0
    HDR --> SLOTN
  end
  STATS["HEAP_HDR_STATS<br/>header page slot 0"]
  CHAIN["HEAP_CHAIN<br/>data page slot 0"]
  HFID["HFID<br/>vfid + hpgid"] -->|hpgid| STATS
  STATS -->|next_vpid| CHAIN
  CHAIN -->|prev/next_vpid| CHAIN
  subgraph mem["In-memory bundles"]
    OPCTX["HEAP_OPERATION_CONTEXT<br/>4x PGBUF_WATCHER"]
    GETCTX["HEAP_GET_CONTEXT<br/>2x PGBUF_WATCHER"]
    SCAN["HEAP_SCANCACHE<br/>snapshot / page_latch"]
    NODE["HEAP_SCANCACHE_NODE"]
    SCAN --> NODE
    OPCTX -->|scan_cache_p| SCAN
    GETCTX -->|scan_cache| SCAN
  end
  OPCTX -->|hfid| HFID
  NODE -->|hfid| HFID

Figure 1-1. Disk structures (top), the heap-file spine (HFID to header-page HEAP_HDR_STATS to the HEAP_CHAIN doubly-linked list), and the in-memory bundles that latch into those pages during one operation.

Invariant: HEAP_HEADER_AND_CHAIN_SLOTID == 0. Slot 0 of every heap page is reserved metadata: HEAP_HDR_STATS on the header page, HEAP_CHAIN on every other page. Enforced twice — slot 0 is allocated at page-init so a normal spage_insert never returns it, and HEAP_ISJUNK_OID rejects any OID with slotid == 0. A user record in slot 0 would make the chain walk read record bytes as a chain/stats struct, corrupting the heap.

// HEAP_HEADER_AND_CHAIN_SLOTID -- src/storage/heap_file.h
#define HEAP_HEADER_AND_CHAIN_SLOTID 0 /* Slot for chain and header */
// HEAP_ISJUNK_OID -- src/storage/heap_file.h
#define HEAP_ISJUNK_OID(oid) \
((oid)->slotid == HEAP_HEADER_AND_CHAIN_SLOTID \
|| (oid)->slotid < 0 || (oid)->volid < 0 || (oid)->pageid < 0)
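
The guard can be exercised in isolation. The following is a minimal sketch, not CUBRID code: `demo_oid` and `demo_is_junk_oid` are hypothetical names, and the field widths are assumptions chosen only to mirror the shape of the macro above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for CUBRID's OID; field widths are assumed. */
struct demo_oid
{
  int32_t pageid;
  int16_t slotid;
  int16_t volid;
};

#define DEMO_HEADER_AND_CHAIN_SLOTID 0

/* Same shape as HEAP_ISJUNK_OID: slot 0 and any negative part are junk. */
static bool
demo_is_junk_oid (const struct demo_oid *oid)
{
  return oid->slotid == DEMO_HEADER_AND_CHAIN_SLOTID
    || oid->slotid < 0 || oid->volid < 0 || oid->pageid < 0;
}
```

An OID that names slot 0 is rejected even when its volume and page ids are valid, which is the second of the two enforcement points described above.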

1.2 Page layer — SPAGE_HEADER and SPAGE_SLOT

A fixed header at offset 0, records growing forward from just past it, and the slot array growing backward from the page end (slot 0 occupies the last 4 bytes of the page). Mechanics are in Chapter 2.

// spage_header -- src/storage/slotted_page.h (comments in table)
struct spage_header
{
PGNSLOTS num_slots, num_records;
INT16 anchor_type; /* ANCHORED / ANCHORED_DONT_REUSE_SLOTS / UNANCHORED_* */
unsigned short alignment; /* char, short, int, double */
int total_free, cont_free, offset_to_free_area;
int reserved1;
int flags; /* always SPAGE_HEADER_FLAG_NONE */
unsigned int is_saving:1; /* save-for-undo */
unsigned int need_update_best_hint:1; /* best-hint refresh */
unsigned int reserved_bits:30;
};

| Field | Role / why it exists |
| --- | --- |
| num_slots, num_records | Slot-array length and live count; slots persist after delete, so num_slots >= num_records. |
| anchor_type | ANCHORED/ANCHORED_DONT_REUSE_SLOTS/UNANCHORED_*. Heap uses anchored → stable slot ids → OIDs never move. |
| alignment | Record alignment in bytes (heap uses INT_ALIGNMENT). |
| total_free | All free bytes; tested vs record+slot size to see if it fits after compaction. |
| cont_free | Bytes in the single contiguous gap; if short but total_free suffices, page is compacted first. |
| offset_to_free_area | Bump pointer where a new record is written. |
| reserved1 | Reserved int; keeps header 8-byte aligned. |
| flags | Page flags; only SPAGE_HEADER_FLAG_NONE used. |
| is_saving | 1 bit: save-space-for-undo (Ch 10). |
| need_update_best_hint | 1 bit: best hint stale, refresh estimates (Ch 9). |
| reserved_bits | 30-bit padding; pins the bitfield-word layout. |

// spage_slot -- src/storage/slotted_page.h (4-byte disk slot)
struct spage_slot
{
unsigned int offset_to_record:14; /* Byte offset to start of record */
unsigned int record_length:14; /* Length of record */
unsigned int record_type:4; /* REC_HOME, REC_NEWHOME, ... */
};

| Field | Role / why it exists |
| --- | --- |
| offset_to_record (14b) | Offset to record bytes; the indirection making slot ids stable (compaction rewrites only this). Caps at 16383. |
| record_length (14b) | Record byte length; bounds copy/peek. Caps an on-page record at 16383 bytes (larger → overflow). |
| record_type (4b) | REC_HOME/NEWHOME/RELOCATION/BIGONE/DELETED_WILL_REUSE/MARKDELETED/ASSIGN_ADDRESS/…; branch key for every flow (Ch 3). |

The whole slot is exactly 4 bytes (14+14+4 = 32 bits).
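
The 14/14/4 split can be checked with explicit shifts and masks. This is a sketch under assumptions: CUBRID declares the slot as a C bit-field, whose in-memory bit order is compiler-defined, so the bit positions below are chosen for the demo only, and every `demo_` name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_OFFSET_BITS 14
#define DEMO_LENGTH_BITS 14
#define DEMO_FIELD14_MAX ((1u << 14) - 1)   /* 16383 */
#define DEMO_TYPE_MAX ((1u << 4) - 1)       /* 15 */

/* Pack offset/length/type into one 32-bit word (demo layout, not CUBRID's). */
static uint32_t
demo_slot_pack (uint32_t offset, uint32_t length, uint32_t type)
{
  assert (offset <= DEMO_FIELD14_MAX && length <= DEMO_FIELD14_MAX);
  assert (type <= DEMO_TYPE_MAX);
  return offset | (length << DEMO_OFFSET_BITS)
    | (type << (DEMO_OFFSET_BITS + DEMO_LENGTH_BITS));
}

static uint32_t
demo_slot_offset (uint32_t slot)
{
  return slot & DEMO_FIELD14_MAX;
}

static uint32_t
demo_slot_length (uint32_t slot)
{
  return (slot >> DEMO_OFFSET_BITS) & DEMO_FIELD14_MAX;
}

static uint32_t
demo_slot_type (uint32_t slot)
{
  return slot >> (DEMO_OFFSET_BITS + DEMO_LENGTH_BITS);
}
```

The asserts in `demo_slot_pack` make the 16383 cap explicit: an offset or length of 16384 does not fit in 14 bits, which is why larger records must go to overflow.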

1.3 Heap-file layer — HFID, HEAP_HDR_STATS, HEAP_CHAIN

HFID is the heap’s name: a file id plus the header page’s id.

// hfid -- src/storage/storage_common.h
struct hfid
{
VFID vfid; /* Volume and file identifier */
INT32 hpgid; /* First page identifier (the header page) */
};

| Field | Role / why it exists |
| --- | --- |
| vfid | (volid, fileid) of the heap’s file; used for every page alloc/dealloc. |
| hpgid | Page id of the header page, the heap entry point; with vfid.volid gives the header VPID that next_vpid walks from. |

HEAP_HDR_STATS is the heap’s global control block, in slot 0 of the header page. Its nested estimates block is the best-space hint cache (Chapter 9), not logged.

// heap_hdr_stats -- src/storage/heap_file.c (comments elided; see table)
struct heap_hdr_stats
{
OID class_oid; /* the first field MUST be class_oid */
VFID ovf_vfid;
VPID next_vpid; /* the 2nd page of the heap file */
int unfill_space;
struct
{
int num_pages, num_recs;
float recs_sumlen;
int num_other_high_best, num_high_best, num_substitutions;
int num_second_best, head_second_best, tail_second_best, head;
VPID last_vpid; /* todo: move out of estimates */
VPID full_search_vpid;
VPID second_best[HEAP_NUM_BEST_SPACESTATS]; /* 10 hints */
HEAP_BESTSPACE best[HEAP_NUM_BEST_SPACESTATS]; /* 10 hints */
} estimates;
int reserve0_for_future, reserve1_for_future, reserve2_for_future;
};

| Field | Role / why it exists |
| --- | --- |
| class_oid | OID of the stored class. Must be first: slot 0’s leading OID is the class id, shared with HEAP_CHAIN for validation. |
| ovf_vfid | Overflow file id (holds REC_BIGONE bodies); null until first overflow. |
| next_vpid | VPID of the 2nd page, the head of the page chain. |
| unfill_space | Free-space floor for inserts; headroom so updates grow in place. |
| estimates.num_pages, num_recs, recs_sumlen | Estimated page/record count and total record bytes; recs_sumlen derives average length. |
| estimates.num_other_high_best | Good pages not in best[]; triggers fuller search before growing the file. |
| estimates.num_high_best | best[] entries >= HEAP_DROP_FREE_SPACE; at zero, rescans. |
| estimates.num_substitutions | Substitution count; feeds second-best promotion. |
| estimates.num_second_best, head_second_best, tail_second_best | Count, read index (oldest), write index of the second_best[] ring. |
| estimates.head | Head index of best[] ring; where alloc/scan starts. |
| estimates.last_vpid | Last/append page; todo to relocate out of estimates. |
| estimates.full_search_vpid | Resume point for an incremental scan across calls. |
| estimates.second_best[10] | Ring of 10 decent VPIDs; used when best[] runs dry. |
| estimates.best[10] | Ring of 10 HEAP_BESTSPACE; primary hint set. HEAP_NUM_BEST_SPACESTATS == 10. |
| reserve0/1/2_for_future | Padding ints; reserved. |

Invariant: estimates is a hint, never the truth. Changes are not logged — “only used for hints,” “may not be accurate,” “may contain duplicated pages.” Consumers re-validate a candidate’s real total_free before use; trusting best[] blindly could target a full page (Chapter 9).
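
The "hint, then re-validate" rule can be sketched as follows. This is illustrative only: the callback type, `demo_fake_real_free`, and all `demo_` names are hypothetical, and the real code would latch the candidate page and read its SPAGE_HEADER.total_free instead of calling a stub.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_NUM_BEST 10

struct demo_vpid { int pageid; short volid; };
struct demo_bestspace { struct demo_vpid vpid; int freespace; };

/* Hypothetical callback: returns the page's real total_free. */
typedef int (*demo_real_free_fn) (const struct demo_vpid *vpid);

/* Stub for the demo: page 7 is nearly full, every other page has 400 bytes. */
static int
demo_fake_real_free (const struct demo_vpid *vpid)
{
  return vpid->pageid == 7 ? 50 : 400;
}

/* Return the index of the first hint whose REAL free space fits `needed`,
   or -1. The cached freespace only filters; it is never trusted alone. */
static int
demo_pick_best (const struct demo_bestspace best[DEMO_NUM_BEST], int needed,
                demo_real_free_fn real_free)
{
  for (int i = 0; i < DEMO_NUM_BEST; i++)
    {
      if (best[i].freespace < needed)
        continue;               /* hint says too small: skip cheaply */
      if (real_free (&best[i].vpid) >= needed)
        return i;               /* hint confirmed against the page */
    }
  return -1;
}
```

With a stale entry that claims 300 free bytes for page 7, the sketch skips it once the stub reports only 50 and moves on to the next candidate, so a wrong hint costs a page visit but never a failed insert.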

HEAP_CHAIN lives in slot 0 of every non-header page; the doubly-linked list.

// heap_chain -- src/storage/heap_file.c (Double-linked)
struct heap_chain
{
OID class_oid; /* the first must be class_oid */
VPID prev_vpid; /* Previous page */
VPID next_vpid; /* Next page */
MVCCID max_mvccid; /* Max MVCCID of any MVCC operation on this page */
INT32 flags; /* 2 high bits encode vacuum state */
};

| Field | Role / why it exists |
| --- | --- |
| class_oid | Class OID (same leading-field convention); page-validate without distinguishing header vs data page. |
| prev_vpid | Previous page VPID; backward traversal and chain repair. |
| next_vpid | Next page VPID; forward traversal, which is what the spine HEAP_HDR_STATS.next_vpid feeds. |
| max_mvccid | Largest MVCCID of any op here. Vacuum predictor: when it precedes vacuum’s oldest MVCCID, the page is fully vacuumed. Init MVCCID_NULL (Ch 8). |
| flags | INT32; top 2 bits = vacuum status (HEAP_PAGE_VACUUM_NONE/ONCE/UNKNOWN, mask 0xC0000000). Rest reserved. |

Vacuum status is packed into the two high bits of flags, accessed only through HEAP_PAGE_SET_VACUUM_STATUS / HEAP_PAGE_GET_VACUUM_STATUS:

// HEAP_PAGE_FLAG_VACUUM_STATUS_MASK -- src/storage/heap_file.c
#define HEAP_PAGE_FLAG_VACUUM_STATUS_MASK 0xC0000000
#define HEAP_PAGE_FLAG_VACUUM_ONCE 0x80000000
#define HEAP_PAGE_FLAG_VACUUM_UNKNOWN 0x40000000
// status == NONE => both bits clear /* <- the HEAP_PAGE_VACUUM_NONE encoding */

Invariant: max_mvccid is monotonically non-decreasing per page. Every MVCC op does if (MVCC_ID_PRECEDES (chain->max_mvccid, mvccid)) chain->max_mvccid = mvccid; (debug-asserted). If it moved backward, vacuum could deallocate a page still holding pending versions.
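
Both HEAP_CHAIN rules (status packed into the two high bits, max_mvccid only moving forward) fit in a short sketch. Assumptions are loud here: the `demo_` names and the status enum are hypothetical, flags is modeled as unsigned to keep the bit operations well defined, and MVCCIDs are assumed to be 64-bit values compared with a plain `<`.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_VACUUM_STATUS_MASK 0xC0000000u
#define DEMO_VACUUM_ONCE        0x80000000u
#define DEMO_VACUUM_UNKNOWN     0x40000000u

enum demo_vacuum_status
{ DEMO_ST_NONE, DEMO_ST_ONCE, DEMO_ST_UNKNOWN };

struct demo_chain
{
  uint64_t max_mvccid;          /* assumed 64-bit id */
  uint32_t flags;               /* top 2 bits: vacuum status */
};

static void
demo_set_vacuum_status (struct demo_chain *chain, enum demo_vacuum_status st)
{
  chain->flags &= ~DEMO_VACUUM_STATUS_MASK;     /* clear status, keep the rest */
  if (st == DEMO_ST_ONCE)
    chain->flags |= DEMO_VACUUM_ONCE;
  else if (st == DEMO_ST_UNKNOWN)
    chain->flags |= DEMO_VACUUM_UNKNOWN;
}

static enum demo_vacuum_status
demo_get_vacuum_status (const struct demo_chain *chain)
{
  uint32_t bits = chain->flags & DEMO_VACUUM_STATUS_MASK;

  if (bits == DEMO_VACUUM_ONCE)
    return DEMO_ST_ONCE;
  if (bits == DEMO_VACUUM_UNKNOWN)
    return DEMO_ST_UNKNOWN;
  assert (bits == 0);           /* the setter never produces both bits */
  return DEMO_ST_NONE;
}

/* Record an MVCC operation: the page maximum never moves backward. */
static void
demo_note_mvcc_op (struct demo_chain *chain, uint64_t mvccid)
{
  if (chain->max_mvccid < mvccid)
    chain->max_mvccid = mvccid;
}
```

Clearing the mask before setting a bit is what keeps the 30 reserved low bits untouched when the status changes.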

1.4 Operation bundle — HEAP_OPERATION_CONTEXT and its enums

The single argument threaded through every write: type and update_in_place drive dispatch; up to four watchers, a home-record stack buffer, and the output OID ride along.

// HEAP_OPERATION_TYPE -- src/storage/heap_file.h
typedef enum
{
HEAP_OPERATION_NONE = 0,
HEAP_OPERATION_INSERT,
HEAP_OPERATION_DELETE,
HEAP_OPERATION_UPDATE
} HEAP_OPERATION_TYPE;
// update_inplace_style -- src/storage/heap_file.h
enum update_inplace_style
{
UPDATE_INPLACE_NONE = 0, /* None */
UPDATE_INPLACE_CURRENT_MVCCID = 1, /* non-MVCC in-place update with current MVCC ID */
UPDATE_INPLACE_OLD_MVCCID = 2 /* non-MVCC in-place update, preserves old MVCC ID */
};
typedef enum update_inplace_style UPDATE_INPLACE_STYLE;
#define HEAP_IS_UPDATE_INPLACE(update_inplace_style) \
((update_inplace_style) != UPDATE_INPLACE_NONE)

HEAP_OPERATION_TYPE tags the write path. UPDATE_INPLACE_STYLE is orthogonal: an MVCC update runs physically in place yet maps to UPDATE_INPLACE_NONE (source: “mvcc update is also executed inplace, but coresponds to UPDATE_INPLACE_NONE”) — so UPDATE_INPLACE_NONE means new logical version, while HEAP_IS_UPDATE_INPLACE (styles 1/2) is a true non-MVCC overwrite. Chapter 6 walks the branches.

// heap_operation_context -- src/storage/heap_file.h (condensed)
struct heap_operation_context
{
HEAP_OPERATION_TYPE type;
UPDATE_INPLACE_STYLE update_in_place;
HFID hfid; OID oid; OID class_oid; RECDES *recdes_p; HEAP_SCANCACHE *scan_cache_p;
RECDES map_recdes; OID ovf_oid; /* overflow transient */
RECDES home_recdes;
char home_recdes_buffer[IO_MAX_PAGE_SIZE + MAX_ALIGNMENT];
INT16 record_type; FILE_TYPE file_type;
PGBUF_WATCHER home_page_watcher, overflow_page_watcher, header_page_watcher, forward_page_watcher;
PGBUF_WATCHER *home_page_watcher_p, *overflow_page_watcher_p, *header_page_watcher_p, *forward_page_watcher_p; /* the handles */
OID res_oid; bool is_logical_old; /* logical output */
bool is_redistribute_insert_with_delid;
bool is_bulk_op; bool use_bulk_logging;
bool do_supplemental_log; LOG_LSA supp_undo_lsa, supp_redo_lsa;
PERF_UTIME_TRACKER *time_track; /* perf stat dump */
};

| Field | Role / why it exists |
| --- | --- |
| type | Which write op (INSERT/DELETE/UPDATE/NONE); drives dispatch. |
| update_in_place | Update style; mutate vs new-version relocation (Ch 6). |
| hfid | Target heap; names file/header page for alloc + best-space. |
| oid | Input OID (delete/update target; ignored for insert); the home address. |
| class_oid | Class; locking, MVCC header build, index maintenance. |
| recdes_p | Caller’s record descriptor: the new bytes. |
| scan_cache_p | Optional reuse of latched pages + snapshot. |
| map_recdes | Map record built during overflow insert (points at the overflow object). |
| ovf_oid | Overflow object location; set for REC_BIGONE. |
| home_recdes | Descriptor for the fetched home record (read before mutation). |
| home_recdes_buffer | Inline stack buffer backing home_recdes; avoids heap alloc. |
| record_type | Type of the original record before mutation. |
| file_type | FILE_HEAP/FILE_HEAP_REUSE_SLOTS; slot reuse (Ch 7). |
| home/overflow/header/forward_page_watcher | Four embedded PGBUF_WATCHER storage slots (overflow → REC_BIGONE, header → best-space update, forward → REC_RELOCATION/REC_NEWHOME). |
| *_watcher_p (×4) | Pointers to the four watchers, i.e. the handles; null = page not involved. |
| res_oid | Output OID; for insert, the assigned address. |
| is_logical_old | Output: initial record was not REC_ASSIGN_ADDRESS; for logging. |
| is_redistribute_insert_with_delid | Insert from a partition redistribute carrying a valid delid. |
| is_bulk_op, use_bulk_logging, do_supplemental_log, supp_undo_lsa, supp_redo_lsa | Logging/bulk control: bulk-insert flag (also disables MVCC ops), bulk log path, supplemental-log enable, and the supplemental undo/redo image LSAs. |
| time_track | PERF_UTIME_TRACKER *; perf-stat dump. |

Invariant: pages are touched only through the _p watcher pointers. The four embedded PGBUF_WATCHERs are storage; the *_watcher_p pointers are the handles (“should not be referenced directly”). Code points one at its embedded watcher to latch that page, NULL otherwise; cleanup unfixes exactly the non-NULL ones. Touching an embedded watcher directly risks a double-unfix or leaked latch.
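
The storage-versus-handle split can be sketched with two watchers instead of four. Everything here is hypothetical (`demo_watcher` only tracks a boolean, and the helper names are invented); the point is the discipline that cleanup walks the handles, not the embedded storage.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for PGBUF_WATCHER: tracks only "page is fixed". */
struct demo_watcher { bool fixed; };

struct demo_op_context
{
  struct demo_watcher home_watcher, forward_watcher;         /* storage only */
  struct demo_watcher *home_watcher_p, *forward_watcher_p;   /* the handles */
  int unfix_count;
};

static void
demo_context_init (struct demo_op_context *ctx)
{
  ctx->home_watcher.fixed = false;
  ctx->forward_watcher.fixed = false;
  ctx->home_watcher_p = NULL;
  ctx->forward_watcher_p = NULL;
  ctx->unfix_count = 0;
}

/* Latch the home page: point the handle at its storage, fix via the handle. */
static void
demo_latch_home (struct demo_op_context *ctx)
{
  ctx->home_watcher_p = &ctx->home_watcher;
  assert (!ctx->home_watcher_p->fixed);
  ctx->home_watcher_p->fixed = true;
}

static void
demo_unfix_if_used (struct demo_op_context *ctx, struct demo_watcher **handle)
{
  if (*handle == NULL)
    return;                     /* page was never involved */
  assert ((*handle)->fixed);
  (*handle)->fixed = false;
  *handle = NULL;               /* a second cleanup becomes a no-op */
  ctx->unfix_count++;
}

static void
demo_context_cleanup (struct demo_op_context *ctx)
{
  demo_unfix_if_used (ctx, &ctx->home_watcher_p);
  demo_unfix_if_used (ctx, &ctx->forward_watcher_p);
}
```

Because cleanup resets each handle to NULL after unfixing, calling it twice unfixes each page exactly once, which is the property the invariant protects.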

The read-side counterpart. It needs only two watchers — home and forward — since a read never touches the header or overflow page the way a write does.

// heap_get_context -- src/storage/heap_file.h (comments in table)
struct heap_get_context
{
INT16 record_type;
const OID *oid_p;
OID forward_oid; /* of REC_RELOCATION or REC_BIGONE */
OID *class_oid_p;
RECDES *recdes_p;
HEAP_SCANCACHE *scan_cache;
PGBUF_WATCHER home_page_watcher, fwd_page_watcher;
bool ispeeking; /* PEEK or COPY */
int old_chn;
PGBUF_LATCH_MODE latch_mode;
};

| Field | Role / why it exists |
| --- | --- |
| record_type | Type at oid_p; branch key for forwarding (Ch 5): REC_HOME in place, REC_RELOCATION/REC_BIGONE chase forward_oid. |
| oid_p | Requested OID (input, const); where the read starts. |
| forward_oid | OID the home slot forwards to; second hop, filled when record_type demands. |
| class_oid_p | Class OID (in/out); needed for snapshot/CHN. |
| recdes_p | Where bytes are returned; into the page (PEEK) or a copy buffer (COPY). |
| scan_cache | Governing HEAP_SCANCACHE; supplies mvcc_snapshot. |
| home_page_watcher, fwd_page_watcher | Home and forward page watchers; forward latched on the second hop, null until forwarding. |
| ispeeking | PEEK (zero-copy, holds latch) vs COPY (frees latch sooner). |
| old_chn | Caller’s cached CHN; matching the record’s CHN skips the copy (see companion). |
| latch_mode | READ normally, WRITE for e.g. serial increment. |

1.6 Scan bundle — HEAP_SCANCACHE and HEAP_SCANCACHE_NODE

The longest-lived bundle: it survives many get/next calls, caching a fixed page, the snapshot, and the latch/lock policy. It embeds one HEAP_SCANCACHE_NODE for the current heap plus a list of more for partitioned scans.

// heap_scancache_node -- src/storage/heap_file.h
struct heap_scancache_node
{
HFID hfid; /* Heap file of scan */
OID class_oid; /* Class oid of scanned instances */
const char *classname;
};
// heap_scancache -- src/storage/heap_file.h (condensed C++ class; comments in table)
struct heap_scancache
{
int debug_initpattern;
HEAP_SCANCACHE_NODE node;
LOCK page_latch; /* NULL_LOCK to skip per-page lock */
bool cache_last_fix_page, mvcc_disabled_class;
PGBUF_WATCHER page_watcher;
int num_btids;
multi_index_unique_stats *m_index_stats;
FILE_TYPE file_type;
MVCC_SNAPSHOT *mvcc_snapshot;
HEAP_SCANCACHE_NODE_LIST *partition_list;
private:
cubmem::single_block_allocator *m_area; /* the one private member */
};

| Field | Role / why it exists |
| --- | --- |
| node.hfid | Heap being scanned: the file whose pages the scan walks. |
| node.class_oid | Class of scanned instances; locking and visibility. |
| node.classname | Cached class name; logging/diagnostics, avoids re-lookup per record. |
| debug_initpattern | Init sentinel; catches use of an uninitialized scancache in debug builds. |
| node | Current HEAP_SCANCACHE_NODE; set via HEAP_SCANCACHE_SET_NODE. |
| page_latch | LOCK for heap pages, or NULL_LOCK when the class is already locked S/SIX/X. |
| cache_last_fix_page | Keep last fixed page + area memory; avoids re-fixing when many records sit on one page. |
| mvcc_disabled_class | Class is non-MVCC; skips snapshot visibility for catalog/non-MVCC classes. |
| page_watcher | Watcher holding the cached fixed page; the handle for cache_last_fix_page. |
| num_btids | Index count on the class; sizes index-maintenance work for scan-driven updates. |
| m_index_stats | Per-index unique stats (a source comment questions if it belongs here). |
| file_type | FILE_HEAP/FILE_HEAP_REUSE_SLOTS; same slot-reuse decision, available to the scan. |
| mvcc_snapshot | Governing MVCC snapshot; single source of visibility, passed to the get context. |
| partition_list | List of HEAP_SCANCACHE_NODEs for sub-heaps; a partitioned scan crosses several heaps. |
| (private) m_area | The cubmem::single_block_allocator * backing the COPY-area methods. |

Invariant: page_latch may be NULL_LOCK only when the class lock already covers the page. The source says it “may be NULL_LOCK when it is secure to skip lock on heap pages”, i.e. when the class is held S/SIX/X. Leaving page_latch at NULL_LOCK without that covering lock lets two transactions touch the same page without serialization.

The unit the best-space machinery (Chapter 9) trades in. Each estimates.best[] entry is one, and the global best-space hash caches them too.

// heap_bestspace -- src/storage/heap_file.h
struct heap_bestspace
{
VPID vpid; /* Vpid of one of the best pages */
int freespace; /* Estimated free space in this page */
};

| Field | Role / why it exists |
| --- | --- |
| vpid | (volid, pageid) of a good-free-space page; the page an insert tries first. |
| freespace | Estimated free bytes; ranks candidates, re-validated against real SPAGE_HEADER.total_free before use (per §1.3). |

  1. Three layers, one OID thread. Slotted page stores bytes; the heap-file layer (HFID → HEAP_HDR_STATS → chained HEAP_CHAINs) gives identity and order; the bundles carry latches and policy through one call.
  2. Slot 0 is sacred. HEAP_HEADER_AND_CHAIN_SLOTID == 0 reserves it for HEAP_HDR_STATS (header) or HEAP_CHAIN (data); HEAP_ISJUNK_OID keeps user OIDs out.
  3. SPAGE_SLOT is a 4-byte 14/14/4 bitfield. offset_to_record is the stable-address indirection, record_length bounds the span, 4-bit record_type is the branch key every flow switches on.
  4. HEAP_HDR_STATS.estimates and HEAP_BESTSPACE are advisory and unlogged. Best/second-best rings (10 each) are hints, re-validated against real free space.
  5. HEAP_CHAIN packs vacuum state into flags. Top two bits encode VACUUM_NONE/ONCE/UNKNOWN; max_mvccid (monotonic) predicts a clean page.
  6. HEAP_OPERATION_CONTEXT owns four watchers via indirection. *_watcher_p pointers are the only legal handles; type + update_in_place select the path. HEAP_GET_CONTEXT is the lighter read twin (home + forward only), and HEAP_SCANCACHE (with HEAP_SCANCACHE_NODE) holds the snapshot, cached page, and page_latch policy across many calls.

Chapter 2: Slotted Page Primitives and Page Initialization

Every heap record — REC_HOME, REC_NEWHOME, REC_BIGONE, REC_RELOCATION — lives inside a slotted page. The heap layer never touches payload directly; it asks slotted_page.c to carve out, resize, and reclaim variable-length areas and hands back a stable slot id. This chapter dissects that substrate: layout, the free-space invariant, the four anchor types, and the slot primitives Chapters 4–8 call down into. The high-level companion (cubrid-heap-manager.md) explains why CUBRID splits a stable OID from a moving byte offset; here we trace how the slot id stays fixed while the offset moves.

A slotted page is a fixed buffer (SPAGE_DB_PAGESIZE) with the SPAGE_HEADER at offset 0. Record payloads start just past the header and grow toward higher offsets (downward in a top-to-bottom picture of the page); the slot array starts at the page end and grows toward lower offsets (upward); the gap between them is the contiguous free area. SPAGE_SLOT is the 4-byte indirection unit, made of three packed bit-fields:

// spage_slot -- src/storage/slotted_page.h
struct spage_slot
{
unsigned int offset_to_record:14; /* byte offset of record start */
unsigned int record_length:14; /* current record length */
unsigned int record_type:4; /* REC_HOME, REC_NEWHOME, ... */
};

The 14-bit fields cap a page at 16 KB — CUBRID’s max page size. record_type is what Chapter 3 reads to dispatch interpretation. Slot N is found by counting backward from the last 4 bytes (spage_find_slot: slot_p = page_p + SPAGE_DB_PAGESIZE - sizeof(SPAGE_SLOT); slot_p -= slot_id;) — the geometry that lets the slot array (growing up, slotN..slot0) and record area (growing down) meet at the free gap, whose first byte is offset_to_free_area.
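
The backward slot addressing can be checked with plain arithmetic. This is a sketch assuming a 16 KB page and the 4-byte slot stated above; `demo_slot_byte_offset` and `demo_find_slot` are hypothetical names, not the CUBRID functions.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_PAGESIZE 16384     /* assumed page size (the 16 KB maximum) */
#define DEMO_SLOT_SIZE 4        /* sizeof (SPAGE_SLOT) per the text */

/* Byte offset of slot `slot_id` inside the page: slot 0 occupies the last
   4 bytes, slot 1 the 4 bytes before it, and so on. */
static size_t
demo_slot_byte_offset (int slot_id)
{
  assert (slot_id >= 0);
  return (size_t) DEMO_PAGESIZE - DEMO_SLOT_SIZE
    - (size_t) slot_id * DEMO_SLOT_SIZE;
}

/* Same computation expressed as a pointer into a page buffer. */
static unsigned char *
demo_find_slot (unsigned char *page, int slot_id)
{
  return page + demo_slot_byte_offset (slot_id);
}
```

The address depends only on the page base and the slot id, never on where the record bytes currently sit, which is why compaction can move records without invalidating an OID.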

The SPAGE_HEADER carries the bookkeeping this chapter manipulates: num_slots (array length / iteration bound) and num_records (live count, so num_slots - num_records is the reuse pool); anchor_type (§2.3) and alignment; the three free-space counters below; and is_saving, arming the undo-reserve of §2.7.

INVARIANT (header consistency). At every primitive’s entry/exit: total_free >= 0, 0 <= cont_free <= total_free, offset_to_free_area < SPAGE_DB_PAGESIZE and alignment-aligned, 0 <= num_records <= num_slots. Enforced by spage_verify_header / SPAGE_VERIFY_HEADER. If violated, free-space arithmetic is corrupt and a later insert can overwrite a live record.

2.2 Page bring-up: spage_initialize and spage_verify_header

spage_initialize is the only function that establishes the starting invariant from scratch, with no branches (a debug-only assert (spage_is_valid_anchor_type (slot_type)) guards the 1..4 anchor range). It zeroes num_slots/num_records, stores is_saving and anchor_type, then sets total_free = DB_ALIGN (SPAGE_DB_PAGESIZE - sizeof (SPAGE_HEADER), alignment), cont_free = total_free, and offset_to_free_area = DB_ALIGN (sizeof (SPAGE_HEADER), alignment). The canonical empty page: cont_free == total_free, both counts zero, cursor just past the aligned header. total_free excludes the header but does not yet subtract slot-array space — slot bytes are charged lazily per allocation.

spage_verify_header is the runtime auditor behind §2.1: it ANDs all bound checks and on failure formats the header, raises ER_SP_INVALID_HEADER, and assert (false). SPAGE_VERIFY_HEADER is its macro form, at nearly every primitive boundary.
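
A compact model of the bring-up and the audit, for illustration only. The header is reduced to the counters discussed here, the 32-byte header size is an assumption, and `DEMO_ALIGN_UP` is a hypothetical helper standing in for DB_ALIGN (it requires a power-of-two alignment).

```c
#include <assert.h>
#include <stdbool.h>

#define DEMO_PAGESIZE 16384
#define DEMO_HEADER_SIZE 32     /* assumed sizeof (SPAGE_HEADER) */
#define DEMO_ALIGN_UP(v, a) (((v) + (a) - 1) & ~((a) - 1))

struct demo_header
{
  int num_slots, num_records;
  int alignment;
  int total_free, cont_free, offset_to_free_area;
};

/* Establish the canonical empty page described in the text. */
static void
demo_initialize (struct demo_header *h, int alignment)
{
  h->num_slots = 0;
  h->num_records = 0;
  h->alignment = alignment;
  h->total_free = DEMO_ALIGN_UP (DEMO_PAGESIZE - DEMO_HEADER_SIZE, alignment);
  h->cont_free = h->total_free;
  h->offset_to_free_area = DEMO_ALIGN_UP (DEMO_HEADER_SIZE, alignment);
}

/* AND of all bound checks from the header-consistency invariant. */
static bool
demo_verify_header (const struct demo_header *h)
{
  return h->total_free >= 0
    && h->cont_free >= 0 && h->cont_free <= h->total_free
    && h->offset_to_free_area < DEMO_PAGESIZE
    && h->offset_to_free_area % h->alignment == 0
    && h->num_records >= 0 && h->num_records <= h->num_slots;
}
```

A freshly initialized header passes the audit; pushing cont_free above total_free by a single byte makes it fail, which is the kind of corruption the real check is there to catch early.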

| Anchor type | On delete, slot id is… | Reuse |
| --- | --- | --- |
| ANCHORED (1) | Kept; marked REC_DELETED_WILL_REUSE | Same id reusable by later insert |
| ANCHORED_DONT_REUSE_SLOTS (2) | Kept; marked REC_MARKDELETED | Never reused until spage_reclaim; heap’s choice so an OID is never silently reassigned |
| UNANCHORED_ANY_SEQUENCE (3) | Removed; last slot moved into hole | Ids unstable, order not preserved |
| UNANCHORED_KEEP_SEQUENCE (4) | Removed; higher slots memmove down | Ids unstable, order preserved |

Heap data pages are ANCHORED_DONT_REUSE_SLOTS: an OID is (volid, pageid, slotid), so the slotid must outlive the record. spage_is_valid_anchor_type rejects anything outside 1..4.

2.4 Free-space checks: spage_has_enough_total_space, spage_has_enough_contiguous_space, spage_check_space

spage_has_enough_total_space answers “is there room at all”: it returns true for space <= 0; otherwise it tests space <= total_free, first subtracting spage_get_total_saved_spaces(...) on an is_saving page (bytes reserved for undo). spage_has_enough_contiguous_space is space <= cont_free || spage_compact(...) == NO_ERROR, so it triggers compaction as a side effect. spage_check_space composes both: total fails → SP_DOESNT_FIT; else contiguous fails (compaction errored) → SP_ERROR; else SP_SUCCESS.

Return-code contract. SP_SUCCESS (1) = done; SP_DOESNT_FIT (3) = too full even after compaction, caller must find another page; SP_ERROR (-1) = hard internal failure (compaction error, corrupt slot, illegal op for the anchor), not “try elsewhere.” Confusing them is a correctness bug: an SP_DOESNT_FIT mistaken for SP_ERROR aborts a transaction that should simply have moved to another page.
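
The composition of the two checks and the three outcomes can be modeled as below. This is a sketch under assumptions: compaction is simulated by setting cont_free to total_free, `compact_fails` is a test knob that does not exist in CUBRID, and all `demo_`/`DEMO_` names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

enum
{ DEMO_SP_ERROR = -1, DEMO_SP_SUCCESS = 1, DEMO_SP_DOESNT_FIT = 3 };

struct demo_page
{
  int total_free, cont_free;
  int saved;                    /* bytes reserved for undo (is_saving pages) */
  bool compact_fails;           /* test knob: simulate a compaction error */
};

static bool
demo_has_total_space (const struct demo_page *p, int space)
{
  return space <= 0 || space <= p->total_free - p->saved;
}

/* Side effect: may "compact", after which all free bytes are contiguous. */
static bool
demo_has_contiguous_space (struct demo_page *p, int space)
{
  if (space <= p->cont_free)
    return true;
  if (p->compact_fails)
    return false;
  p->cont_free = p->total_free;
  return true;
}

static int
demo_check_space (struct demo_page *p, int space)
{
  if (!demo_has_total_space (p, space))
    return DEMO_SP_DOESNT_FIT;  /* caller should try another page */
  if (!demo_has_contiguous_space (p, space))
    return DEMO_SP_ERROR;       /* compaction itself failed */
  return DEMO_SP_SUCCESS;
}
```

Note that a page with 100 free bytes of which 30 are saved for undo reports "doesn't fit" for an 80-byte request even though total_free alone would suffice.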

2.5 Allocating a slot: spage_find_free_slot and spage_find_empty_slot

spage_find_free_slot picks an id: if every slot is live (num_slots == num_records) there is no hole, so slot_id = num_slots (append); otherwise it scans forward from start_slot for the first slot with record_type == REC_DELETED_WILL_REUSE and reuses that id. Only REC_DELETED_WILL_REUSE is reusable — REC_MARKDELETED (the ANCHORED_DONT_REUSE_SLOTS tombstone) is deliberately not matched, so heap OIDs avoid reassignment until vacuum reclaims them.
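
The id-selection rule can be sketched over a plain array of record types. The enum values and names are hypothetical, and falling back to append when the scan finds no reusable slot is an assumption of this sketch.

```c
#include <assert.h>

enum demo_rec_type
{ DEMO_REC_HOME, DEMO_REC_MARKDELETED, DEMO_REC_DELETED_WILL_REUSE };

/* Returns num_slots (append) when every slot is live, else the first
   DELETED_WILL_REUSE slot at or after start_slot. MARKDELETED tombstones
   are deliberately not matched. */
static int
demo_find_free_slot (const enum demo_rec_type *types, int num_slots,
                     int num_records, int start_slot)
{
  if (num_slots == num_records)
    return num_slots;           /* no hole can exist */
  for (int id = start_slot; id < num_slots; id++)
    {
      if (types[id] == DEMO_REC_DELETED_WILL_REUSE)
        return id;
    }
  return num_slots;             /* assumed fallback: append */
}
```

On a page whose only dead slots are REC_MARKDELETED, the scan finds nothing and the new record gets a fresh id, so an OID that still refers to the tombstone is never handed to a different row.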

spage_find_empty_slot is the allocate-and-reserve primitive, branch by branch: (1) spage_has_enough_total_space fails → SP_DOESNT_FIT; (2) spage_find_free_slot returning SP_ERROR or an id > num_slots → SP_ERROR; (3) a new id (slot_id == num_slots) adds sizeof(SPAGE_SLOT) to space and runs spage_check_space (the array eats the gap), returning that status if not SP_SUCCESS, else num_slots++; (4) a reused hole already verified total space, so only cont_free is re-checked (fail → SP_ERROR); (5) both converge on the reserve block: set the slot, num_records++, debit both counters by space, advance offset_to_free_area → SP_SUCCESS.

2.6 Insert, in-place update, and the two-tier update path

spage_insert composes spage_find_slot_for_insert(...) and then, on SP_SUCCESS, spage_insert_data(...). spage_find_slot_for_insert calls spage_check_record_for_insert, which rejects an oversize record with SP_DOESNT_FIT and rewrites a REC_MARKDELETED/REC_DELETED_WILL_REUSE descriptor type to REC_HOME (a tombstone is never inserted as such). spage_insert_data branches on REC_ASSIGN_ADDRESS (a TRANID placeholder) versus a normal memcpy, each overflow-checked.

spage_insert_at is the explicit-id variant for UNANCHORED pages and recovery: validate slot_id <= num_slots (SP_ERROR/ER_SP_UNKNOWN_SLOTID on overflow), then via spage_find_empty_slot_at dispatch to spage_add_new_slot (append) or spage_take_slot_in_use (rejects re-targeting an in-use slot on an ANCHORED* page with ER_SP_BAD_INSERTION_SLOT, else shifts the array up).

Update. spage_update saves the current total_free in total_free_save, checks fit (spage_check_updatable → SP_DOESNT_FIT if no room), then branches on size: length <= slot_p->record_length takes spage_update_record_in_place (fast), else spage_update_record_after_compact (grow). When is_saving is set it reserves the net change via spage_save_space(..., total_free - total_free_save), returning SP_ERROR on failure.

The fast path (spage_update_record_in_place) sets slot_p->record_length, memcpys into the existing offset_to_record, and does total_free -= space. Because space <= 0 for a non-growing update, that subtraction increases free space. It branches on is_located_end = spage_is_record_located_at_end(...): only a last-in-area record also does cont_free -= space and offset_to_free_area += space (cursor pulls back); otherwise the freed bytes stay fragmented. The grow path (spage_update_record_after_compact, Chapter 6) does a tail-compaction, full spage_compact, or rolls back.
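
The fast-path bookkeeping can be traced with a small model. Alignment padding is ignored, the "located at end" test is assumed to be a comparison against the write cursor, and the struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

struct demo_page { int total_free, cont_free, offset_to_free_area; };
struct demo_slot { int offset_to_record, record_length; };

/* Non-growing update of one record (the payload memcpy is omitted). */
static void
demo_update_in_place (struct demo_page *p, struct demo_slot *s, int new_length)
{
  int space = new_length - s->record_length;    /* <= 0 on this path */
  bool is_located_end =
    s->offset_to_record + s->record_length == p->offset_to_free_area;

  assert (space <= 0);
  s->record_length = new_length;
  p->total_free -= space;       /* subtracting a non-positive value adds bytes */
  if (is_located_end)
    {
      p->cont_free -= space;
      p->offset_to_free_area += space;  /* cursor pulls back */
    }
}
```

Shrinking a record in the middle of the area raises only total_free, so the gap it leaves is fragmented; shrinking the last record also raises cont_free and moves the cursor back.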

INVARIANT (savings symmetry). When is_saving, the net free-space change is handed to spage_save_space as total_free - total_free_save. If skipped, a concurrent transaction could consume bytes this one needs to roll back into, breaking undo.

2.7 Delete, tombstone, and the savings mechanism

spage_delete returns the slot id on success, NULL_SLOTID on failure, branching on anchor_type. Figure 2-1 is branch-complete:

Figure 2-1 — spage_delete control flow

flowchart TD
  B{"slot NULL?"} -->|yes| R0["UNKNOWN_SLOTID; NULL_SLOTID"]
  B -->|no| C["num_records--; total_free += freed"]
  C --> G{"anchor?"}
  G -->|ANCHORED| H["EMPTY; WILL_REUSE"]
  G -->|DONT_REUSE| I["EMPTY; MARKDELETED"]
  G -->|UNANCHORED| J["shift_down; freed += slot sz"]
  G -->|default| R1["assert; NULL_SLOTID"]
  H --> K{"is_saving?"}
  I --> K
  J --> K
  K -->|yes| L{"save_space ok?"}
  L -->|no| R2["NULL_SLOTID"]
  L -->|yes| N["dirty; return slot_id"]
  K -->|no| N

The total_free += free_space credit is unconditional; the cont_free/cursor adjustment applies only when the deleted record was last in the area (spage_is_record_located_at_end, else a hole is left for compaction). The UNANCHORED cases also reclaim the slot (free_space += sizeof(SPAGE_SLOT)) and assert is_saving == false.
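
The two anchored branches of Figure 2-1 can be sketched as follows. The UNANCHORED and savings branches are left out, the empty-offset sentinel of 0 is an assumption, and all names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

enum demo_anchor
{ DEMO_ANCHORED = 1, DEMO_ANCHORED_DONT_REUSE = 2 };
enum demo_rec_type
{ DEMO_REC_HOME, DEMO_REC_MARKDELETED, DEMO_REC_DELETED_WILL_REUSE };

#define DEMO_EMPTY_OFFSET 0     /* assumed sentinel: slot has no record bytes */
#define DEMO_NULL_SLOTID (-1)

struct demo_slot
{
  int offset_to_record, record_length;
  enum demo_rec_type type;
};

struct demo_page
{
  enum demo_anchor anchor;
  int num_records, total_free, cont_free, offset_to_free_area;
};

static int
demo_delete (struct demo_page *p, struct demo_slot *slots, int num_slots,
             int slot_id)
{
  if (slot_id < 0 || slot_id >= num_slots
      || slots[slot_id].offset_to_record == DEMO_EMPTY_OFFSET)
    return DEMO_NULL_SLOTID;    /* unknown or already-deleted slot */

  struct demo_slot *s = &slots[slot_id];
  int freed = s->record_length;
  bool at_end = s->offset_to_record + freed == p->offset_to_free_area;

  p->num_records--;
  p->total_free += freed;       /* unconditional credit */
  if (at_end)
    {
      p->cont_free += freed;    /* only the last record grows the gap */
      p->offset_to_free_area -= freed;
    }
  s->offset_to_record = DEMO_EMPTY_OFFSET;
  s->type = (p->anchor == DEMO_ANCHORED)
    ? DEMO_REC_DELETED_WILL_REUSE : DEMO_REC_MARKDELETED;
  return slot_id;
}
```

On a DONT_REUSE page the slot stays in the array as a REC_MARKDELETED tombstone, and deleting the same id twice returns the null slot id instead of crediting the free space again.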

spage_mark_deleted_slot_as_reusable is the vacuum-side flip (Chapter 8): it downgrades an empty tombstone to REC_DELETED_WILL_REUSE so spage_find_free_slot hands the id back out. Its two failing branches raise distinct error codes: an out-of-range slot_id (< 0 or >= num_slots) → SP_ERROR/ER_SP_UNKNOWN_SLOTID; a slot that is not an empty tombstone (must have offset_to_record == SPAGE_EMPTY_OFFSET and a REC_MARKDELETED/REC_DELETED_WILL_REUSE type), i.e. a live record → SP_ERROR/ER_SP_BAD_INSERTION_SLOT (both assert (false) first). Only the tombstone path sets record_type = REC_DELETED_WILL_REUSE and returns SP_SUCCESS.

The savings mechanism. On an is_saving page freed bytes are not immediately spendable — a rollback may reinsert the larger old record. spage_save_space reserves them in the lock-free spage_Saving_hashmap (keyed by VPID; per-transaction SPAGE_SAVE_ENTRY under a SPAGE_SAVE_HEAD). It early-returns NO_ERROR for space == 0, crash recovery, vacuum workers, or space < 0 / inactive transaction; only positive savings by an active transaction allocate an entry, bumping head->total_saved. spage_free_saved_spaces walks the tran_next_save chain at commit/abort, erasing the head when first becomes NULL — hence §2.4 subtracts total_saved.

2.8 Compaction and reclamation: identity fixed, offset moves

Compaction slides records together without changing any slot id. spage_compact builds an array of live slots (skipping SPAGE_EMPTY_OFFSET holes), sorts by offset_to_record, then walks in offset order memmove-ing each record down to the next aligned to_offset and rewriting that slot’s offset:

// spage_compact -- src/storage/slotted_page.c
memmove ((char *) page_p + to_offset,
(char *) page_p + slot_array[i]->offset_to_record,
slot_array[i]->record_length);
slot_array[i]->offset_to_record = to_offset; /* <- offset moves, slot id stays */
to_offset += slot_array[i]->record_length;
...
page_header_p->total_free = SPAGE_DB_PAGESIZE - to_offset - (num_slots * sizeof (SPAGE_SLOT));
page_header_p->cont_free = page_header_p->total_free; /* <- all free now contiguous */

Branches: num_records == 0 skips the array and resets to_offset to header size; calloc failure → ER_FAILED; a record_type > REC_4BIT_USED_TYPE_MAX or slot-count mismatch (num_records != j) is fatal — the latter raises ER_SP_WRONG_NUM_SLOTS and calls logpb_fatal_error_exit_immediately_wo_flush.

INVARIANT (slot identity under compaction). Compaction may rewrite offset_to_record for any slot but must never change a slot’s index. A reader holding OID (…, slotid) re-reads through spage_find_slot trusting the index; reordering indices would point an OID at a different row.
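
A self-contained miniature of the same idea: sort pointers to the live slots by offset, slide each record down, and rewrite only the offset. Alignment and header bookkeeping are ignored, the empty-offset sentinel of 0 is an assumption, and the names are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define DEMO_EMPTY_OFFSET 0     /* assumed sentinel for an empty slot */

struct demo_slot { int offset_to_record, record_length; };

static int
demo_cmp_offset (const void *a, const void *b)
{
  const struct demo_slot *sa = *(const struct demo_slot *const *) a;
  const struct demo_slot *sb = *(const struct demo_slot *const *) b;

  return (sa->offset_to_record > sb->offset_to_record)
    - (sa->offset_to_record < sb->offset_to_record);
}

/* Slide live records toward first_offset. Returns the new free-area offset,
   or -1 on allocation failure. Slot indices are never touched. */
static int
demo_compact (unsigned char *page, struct demo_slot *slots, int num_slots,
              int first_offset)
{
  struct demo_slot **live = calloc ((size_t) num_slots + 1, sizeof (*live));
  int n = 0, to_offset = first_offset;

  if (live == NULL)
    return -1;
  for (int i = 0; i < num_slots; i++)
    {
      if (slots[i].offset_to_record != DEMO_EMPTY_OFFSET)
        live[n++] = &slots[i];
    }
  qsort (live, (size_t) n, sizeof (*live), demo_cmp_offset);
  for (int i = 0; i < n; i++)
    {
      memmove (page + to_offset, page + live[i]->offset_to_record,
               (size_t) live[i]->record_length);
      live[i]->offset_to_record = to_offset;    /* offset moves, id stays */
      to_offset += live[i]->record_length;
    }
  free (live);
  return to_offset;
}
```

Sorting pointers rather than the slots themselves is what keeps the slot array order intact: after the pass, slot 0 still names the same record even if that record now sits behind the others in the page.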

spage_need_compact is the policy gate: true only when fragmented free space is at least 5% of the page (total_free - cont_free >= SPAGE_DB_PAGESIZE / 20).
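The compaction walk and its policy gate can be sketched on a toy page. This is a minimal illustration, not CUBRID's code: `toy_page`, `toy_slot`, `toy_compact`, and `toy_need_compact` are hypothetical names, alignment of `to_offset` is omitted, and the page holds only a record area.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical toy page for illustration; none of these names are CUBRID's. */
#define TOY_PAGE_SIZE 256
#define TOY_MAX_SLOTS 8
#define TOY_EMPTY_OFFSET (-1)

typedef struct
{
  int offset;                   /* byte offset of the record, or TOY_EMPTY_OFFSET */
  int length;                   /* record length in bytes */
} toy_slot;

typedef struct
{
  char data[TOY_PAGE_SIZE];     /* record area only; header and slot array omitted */
  toy_slot slots[TOY_MAX_SLOTS];
  int num_slots;
} toy_page;

/* qsort comparator over an array of slot pointers, ordered by record offset. */
static int
toy_cmp_offset (const void *a, const void *b)
{
  const toy_slot *sa = *(toy_slot * const *) a;
  const toy_slot *sb = *(toy_slot * const *) b;
  return (sa->offset > sb->offset) - (sa->offset < sb->offset);
}

/* Slide live records toward offset 0; returns the new write cursor. */
int
toy_compact (toy_page * page)
{
  toy_slot *live[TOY_MAX_SLOTS];
  int n = 0, to_offset = 0;

  for (int i = 0; i < page->num_slots; i++)
    {
      if (page->slots[i].offset != TOY_EMPTY_OFFSET)
        {
          live[n++] = &page->slots[i];  /* skip empty holes */
        }
    }
  qsort (live, (size_t) n, sizeof (live[0]), toy_cmp_offset);

  for (int i = 0; i < n; i++)
    {
      memmove (page->data + to_offset, page->data + live[i]->offset, (size_t) live[i]->length);
      live[i]->offset = to_offset;      /* offset changes; the slot index does not */
      to_offset += live[i]->length;
    }
  return to_offset;
}

/* Policy gate from the text: fragmented free space is at least 1/20 of the page. */
int
toy_need_compact (int total_free, int cont_free, int page_size)
{
  return (total_free - cont_free) >= page_size / 20;
}
```

Sorting pointers to slots rather than the slots themselves is what keeps every slot at its original index while its record moves.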

spage_reclaim shrinks the slot array on ANCHORED_DONT_REUSE_SLOTS pages. It iterates slot_id backward from num_slots - 1 (so trailing empties collapse cleanly); for each empty tombstone (offset_to_record == SPAGE_EMPTY_OFFSET with a REC_MARKDELETED/REC_DELETED_WILL_REUSE type) it branches — a current-last slot (slot_id + 1 == num_slots) is dropped via spage_reduce_a_slot, else downgraded to REC_DELETED_WILL_REUSE — setting is_reclaim = true. After the loop, if anything was reclaimed and num_slots == 0 it re-runs spage_initialize. Returns true iff something was reclaimed.
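A rough sketch of the backward walk follows, assuming a simplified slot that records only whether it is empty and which tombstone type it carries. All names (`toy_slot`, `toy_reclaim`, the `TOY_*` constants) are hypothetical, and the final re-initialization of a page that ends up with zero slots is left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical toy types; names and values are illustrative, not CUBRID's. */
enum toy_type
{
  TOY_LIVE,
  TOY_MARKDELETED,
  TOY_DELETED_WILL_REUSE
};

typedef struct
{
  bool empty;                   /* true when the slot holds no record bytes */
  enum toy_type type;
} toy_slot;

/* Walk slots from the end: drop trailing tombstones, downgrade interior ones. */
bool
toy_reclaim (toy_slot * slots, int *num_slots)
{
  bool is_reclaim = false;

  for (int slot_id = *num_slots - 1; slot_id >= 0; slot_id--)
    {
      toy_slot *s = &slots[slot_id];

      if (!s->empty || s->type == TOY_LIVE)
        {
          continue;             /* not an empty tombstone */
        }
      if (slot_id + 1 == *num_slots)
        {
          (*num_slots)--;       /* current last slot: shrink the array */
        }
      else
        {
          s->type = TOY_DELETED_WILL_REUSE;     /* interior slot: mark reusable */
        }
      is_reclaim = true;
    }
  return is_reclaim;
}
```

Iterating backward is what lets a run of trailing tombstones collapse in one pass: each dropped slot makes its predecessor the new last slot.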

  1. A slot is a 4-byte indirection. SPAGE_SLOT packs offset_to_record/record_length/record_type into 14/14/4 bits, addressed backward from the page end so record area and slot array grow toward each other.
  2. Three counters encode free-space state. total_free (any), cont_free (without compacting, <= total_free), offset_to_free_area (write cursor) — spage_verify_header enforces their invariant at every boundary.
  3. anchor_type is the slot-identity knob. Heap pages use ANCHORED_DONT_REUSE_SLOTS, so a delete becomes a REC_MARKDELETED tombstone whose id is never reassigned until vacuum.
  4. Return codes are not interchangeable. SP_DOESNT_FIT = try another page, SP_ERROR = hard failure, SP_SUCCESS = done; spage_has_enough_contiguous_space attempts compaction before reporting failure.
  5. Compaction moves bytes, never slot ids. spage_compact rewrites offset_to_record and ends with cont_free == total_free, but slot indices stay immutable, which is the basis of stable OIDs.
  6. Deletes and shrinking updates may owe undo space. On is_saving pages freed bytes are reserved via spage_save_space, subtracted by the total-space check, released by spage_free_saved_spaces at transaction end.
  7. Reclamation is anchored-page housekeeping. spage_reclaim collapses trailing tombstones and downgrades the rest to reusable; spage_need_compact gates compaction at 5% fragmentation.

Chapter 3: Record Types and the MVCC Record Header

Chapter 2 gave a slot that holds an opaque blob. What is a heap record physically? Two vocabularies do the work. The 4-bit record_type in the slot (Chapter 2’s spage_slot) says how to read the bytes — record, forwarding pointer, overflow pointer, or tombstone. The variable-size MVCC record header inside the body carries versioning metadata (insert/delete ids, prev-version pointer) consumed by the read path (Chapter 5) and vacuum (Chapter 8). For why MVCC needs per-record stamps, see the companion cubrid-heap-manager.md, “MVCC and the heap”.

The slot’s record_type:4 bit-field (Chapter 2) is an enum in storage_common.h. Eight values (0 through 7) are meaningful; the rest of the 4-bit space (REC_RESERVED_TYPE_8.._15) is reserved.

// enum record_type — src/storage/storage_common.h
enum
{
/* Unknown record type */
REC_UNKNOWN = 0,
/* Record without content, just the address */
REC_ASSIGN_ADDRESS = 1,
/* Home of record */
REC_HOME = 2,
/* No the original home of record. part of relocation process */
REC_NEWHOME = 3,
/* Record describe new home of record */
REC_RELOCATION = 4,
/* Record describe location of big record */
REC_BIGONE = 5,
// ... condensed: REC_MARKDELETED = 6, REC_DELETED_WILL_REUSE = 7 ...
// ... condensed: REC_RESERVED_TYPE_8 .. 15 ...
REC_4BIT_USED_TYPE_MAX = REC_DELETED_WILL_REUSE, // highest live value is 7
REC_4BIT_TYPE_MAX = REC_RESERVED_TYPE_15
};

(Enum comments are verbatim source text, original grammar and all.) Classified by what they encode (column 2 = does the slot hold a body):

| Type | Body? | Encodes |
| --- | --- | --- |
| REC_HOME | yes | record lives entirely in this slot (the common case) |
| REC_NEWHOME | yes | relocated body pointed to by a REC_RELOCATION; not OID-addressable |
| REC_RELOCATION | forwarding | body outgrew its home page; slot holds a forward OID to a REC_NEWHOME |
| REC_BIGONE | overflow | record exceeds one page; slot holds an overflow OID into the overflow file |
| REC_ASSIGN_ADDRESS | no | OID reserved, content not yet written; bypasses MVCC stamping (3.4) |
| REC_MARKDELETED | no | tombstone whose slot cannot be reused |
| REC_DELETED_WILL_REUSE | no | tombstone whose slot will be reused by a future insert |
| REC_UNKNOWN | no | sentinel / uninitialized (RECDES_INITIALIZER); never a live slot type |

Four types are “live” to a public OID lookup: REC_HOME, REC_RELOCATION, REC_BIGONE, REC_ASSIGN_ADDRESS. REC_NEWHOME is live data reached only by dereferencing a REC_RELOCATION forward OID; a direct OID hit on it is a bug. The REC_*DELETED* types are tombstones (Chapter 7); REC_UNKNOWN is a sentinel never written to a live slot.

flowchart TB
  OID["public OID lands on slot"] --> T{record_type}
  T -->|REC_HOME| H["body here, header inline"]
  T -->|REC_RELOCATION| R["forward OID"] --> NH["REC_NEWHOME"]
  T -->|REC_BIGONE| B["overflow OID"]
  T -->|REC_ASSIGN_ADDRESS| A["address only"]
  T -->|MARKDELETED / WILL_REUSE| D["tombstone"]

Figure 3-1 — Record-type dispatch after fetching a slot; the traversal is Chapter 5. Forwarding (REC_RELOCATION/REC_NEWHOME) and overflow (REC_BIGONE) differ in header storage: a relocated REC_NEWHOME carries an inline variable-size header like any home record; a REC_BIGONE carries a fixed maximum header (3.5).
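The classification above can be written as a small dispatch function. This is a sketch under the assumption that the numeric values match the enum shown earlier; the `TOY_*` names and `toy_classify` are hypothetical and exist only for illustration.

```c
#include <assert.h>

/* Values mirror the enum quoted in the text; the TOY_ names are hypothetical. */
enum toy_rec_type
{
  TOY_REC_UNKNOWN = 0,
  TOY_REC_ASSIGN_ADDRESS = 1,
  TOY_REC_HOME = 2,
  TOY_REC_NEWHOME = 3,
  TOY_REC_RELOCATION = 4,
  TOY_REC_BIGONE = 5,
  TOY_REC_MARKDELETED = 6,
  TOY_REC_DELETED_WILL_REUSE = 7
};

enum toy_slot_class
{
  TOY_PUBLIC_LIVE,              /* a public OID may land here */
  TOY_FORWARD_ONLY,             /* reachable only through a forward OID */
  TOY_TOMBSTONE,                /* deleted slot */
  TOY_INVALID                   /* sentinel or reserved value */
};

/* Classify the 4-bit record type read from a slot. */
enum toy_slot_class
toy_classify (int record_type)
{
  switch (record_type)
    {
    case TOY_REC_HOME:
    case TOY_REC_RELOCATION:
    case TOY_REC_BIGONE:
    case TOY_REC_ASSIGN_ADDRESS:
      return TOY_PUBLIC_LIVE;
    case TOY_REC_NEWHOME:
      return TOY_FORWARD_ONLY;
    case TOY_REC_MARKDELETED:
    case TOY_REC_DELETED_WILL_REUSE:
      return TOY_TOMBSTONE;
    default:
      return TOY_INVALID;       /* REC_UNKNOWN and the reserved values 8..15 */
    }
}
```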

The in-memory decoded form is a struct in mvcc.h. On disk it is packed by or_mvcc_set_header / unpacked by or_mvcc_get_header (both in src/base/object_representation_sr.c) per the flag byte (3.3); the on-disk bytes are not laid out as this struct.

// struct mvcc_rec_header — src/transaction/mvcc.h
struct mvcc_rec_header
{
INT32 mvcc_flag:8; /* MVCC flags */
INT32 repid:24; /* representation id */
int chn; /* cache coherency number */
MVCCID mvcc_ins_id; /* MVCC insert id */
MVCCID mvcc_del_id; /* MVCC delete id */
LOG_LSA prev_version_lsa; /* log address of previous version */
};
// MVCC_REC_HEADER_INITIALIZER zeroes flag/repid and sets NULL_CHN, MVCCID_NULL x2, LSA_INITIALIZER

| Field | Role | Why it exists |
| --- | --- | --- |
| mvcc_flag:8 | Bit set of OR_MVCC_FLAG_VALID_INSID (0x01), _VALID_DELID (0x02), _VALID_PREV_VERSION (0x04); low 5 bits usable (OR_MVCC_FLAG_MASK = 0x1f). | Self-describing: decides which optional fields are present, hence total size. Shares word 0 with repid. |
| repid:24 | Representation (schema version) id. | An old-schema row keeps its repid so the engine reads the right layout. Packed via OR_MVCC_REPID_MASK = 0x00FFFFFF. |
| chn | Cache coherency number. | For non-MVCC classes the only versioning info, bumped each update so a client cache detects staleness; present but not the visibility key for MVCC classes (3.6). Always present in the decoded struct. |
| mvcc_ins_id | MVCCID of the inserter. | Visibility lower bound (Chapter 5). On disk present only if _VALID_INSID set; else MVCCID_ALL_VISIBLE. |
| mvcc_del_id | MVCCID of the deleter. | Visibility upper bound / tombstone marker (Chapter 7). On disk present only if _VALID_DELID set; else MVCCID_NULL. |
| prev_version_lsa | Log LSA of the previous version. | Back-pointer the read path follows on TOO_NEW_FOR_SNAPSHOT, and vacuum prunes. On disk present only if _VALID_PREV_VERSION set. |

On disk, the word after chn holds either the delete id or nothing, discriminated by OR_MVCC_FLAG_VALID_DELID — per the object_representation_constants.h:168 comment on that flag: “The record contains MVCC delete id. If not set, the record contains chn”. This chn/del_id overlap is purely the on-disk encoding (CUBRID’s MVCC_* macros speak of a delid_chn view, e.g. MVCC_IS_REC_DELETED_BY in mvcc.h); it is distinct from the decoded in-memory struct above, which carries chn and mvcc_del_id as two separate always-present fields.

Invariant: the flag byte alone determines header size. or_header_size is mvcc_header_size_lookup[OR_GET_MVCC_FLAG(ptr)] — no length field or terminator, so reader and writer must agree on the flag→size mapping. A writer that sets mvcc_ins_id but forgets OR_MVCC_FLAG_VALID_INSID makes the next reader compute a header 8 bytes too short and read attribute data as the insert id.

3.3 Flag-driven variable-size encoding: mvcc_header_size_lookup

With optional ids present only when flagged, the header has eight sizes; the table makes the flag→size map O(1):

// mvcc_header_size_lookup — src/object/object_representation.c
int mvcc_header_size_lookup[8] = {
OR_MVCC_REP_SIZE + OR_CHN_SIZE, // index 0
OR_MVCC_REP_SIZE + OR_CHN_SIZE + OR_MVCCID_SIZE, // index 1 (+INSID)
// ... condensed: indices 2..6 sum REP+CHN with the flagged MVCCID/LSA terms ...
OR_MVCC_REP_SIZE + OR_CHN_SIZE + OR_MVCCID_SIZE + OR_MVCCID_SIZE + OR_MVCC_PREV_VERSION_LSA_SIZE
};

The mandatory rep+chn prefix is 8 bytes; each MVCCID and the prev-version LSA add 8. So the eight entries, by flag index 0..7, evaluate to 8, 16, 16, 24, 16, 24, 24, 32 bytes. The table indexes a bitmask, not a count: indices 1, 2, and 4 are all 16 (exactly one optional field), and indices 3, 5, and 6 are all 24 (exactly two). Offsets are positional: OR_MVCC_DELETE_ID_OFFSET(flags) adds OR_MVCC_INSERT_ID_SIZE only when VALID_INSID is set, and every header size lies between OR_MVCC_MIN_HEADER_SIZE = 8 and OR_MVCC_MAX_HEADER_SIZE = 32. On disk the 8-byte prefix (repid+flags, then chn) is followed by mvcc_ins_id, mvcc_del_id, prev_version_lsa (8B each) in that order, each only when flagged; attribute data begins at or_header_size(ptr).
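The flag-to-size rule can be checked by computing it instead of looking it up. The sketch below assumes the flag values and the 8-byte sizes stated in the text; `toy_header_size`, `toy_delete_id_offset`, and the `TOY_*` macros are hypothetical names, not CUBRID's.

```c
#include <assert.h>

/* Flag bits as given in the text; the TOY_ names are hypothetical. */
#define TOY_FLAG_VALID_INSID        0x01
#define TOY_FLAG_VALID_DELID        0x02
#define TOY_FLAG_VALID_PREV_VERSION 0x04

#define TOY_PREFIX_SIZE 8       /* repid+flags word, then chn */
#define TOY_FIELD_SIZE  8       /* each MVCCID and the prev-version LSA */

/* Header size as a pure function of the three flag bits. */
int
toy_header_size (int flags)
{
  int size = TOY_PREFIX_SIZE;

  if (flags & TOY_FLAG_VALID_INSID)
    size += TOY_FIELD_SIZE;
  if (flags & TOY_FLAG_VALID_DELID)
    size += TOY_FIELD_SIZE;
  if (flags & TOY_FLAG_VALID_PREV_VERSION)
    size += TOY_FIELD_SIZE;
  return size;
}

/* The delete id sits after the prefix, and after the insert id when one is present. */
int
toy_delete_id_offset (int flags)
{
  return TOY_PREFIX_SIZE + ((flags & TOY_FLAG_VALID_INSID) ? TOY_FIELD_SIZE : 0);
}
```

Evaluating `toy_header_size` for flags 0 through 7 reproduces the eight table entries, which is a cheap consistency check for any hand-written lookup table of this kind.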

3.4 heap_insert_adjust_recdes_header — stamping the header before placement

heap_insert_adjust_recdes_header turns the client-supplied record (a bare header) into a stamped one: insert-id added for MVCC classes, flags stripped for non-MVCC classes, prev-version cleared, length adjusted. There are two paths. The fast path is gated on use_optimization, which is is_mvcc_class && update_in_place == UPDATE_INPLACE_NONE && !VALID_PREV_VERSION && !heap_is_big_length(record_size + OR_MVCCID_SIZE) && !is_bulk_op (SERVER_MODE only).

Branch A — use_optimization. A plain MVCC insert that skips the unpack/repack round-trip and writes the INSID directly into the body.

// heap_insert_adjust_recdes_header (Branch A) — src/storage/heap_file.c
assert (!(mvcc_flags & OR_MVCC_FLAG_VALID_DELID)); /* <- a fresh insert is never pre-deleted */
mvcc_id = logtb_get_current_mvccid (thread_p);
new_ins_mvccid_pos_p = start_p + OR_MVCC_INSERT_ID_OFFSET;
if (!(mvcc_flags & OR_MVCC_FLAG_VALID_INSID))
{
repid_and_flag_bits |= (OR_MVCC_FLAG_VALID_INSID << OR_MVCC_FLAG_SHIFT_BITS);
OR_PUT_INT (start_p, repid_and_flag_bits); /* <- set flag in word 0 */
memmove (new_ins_mvccid_pos_p + OR_MVCCID_SIZE, new_ins_mvccid_pos_p,
insert_context->recdes_p->length - OR_MVCC_INSERT_ID_OFFSET); /* <- open 8-byte gap */
insert_context->recdes_p->length += OR_MVCCID_SIZE;
}
OR_PUT_BIGINT (new_ins_mvccid_pos_p, &mvcc_id); /* <- write INSID into the gap */
return NO_ERROR;

Sub-branch: if the INSID flag is already set the gap exists, so the memmove and length bump are skipped and only the id is overwritten (Chapter 4 pre-sizes the buffer for the extra 8 bytes).
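The gap-opening step can be tried on a plain byte buffer. This is a simplified sketch, not the real record layout: it keeps the flag bits in byte 0 rather than in the top bits of a 4-byte word, stores the id in host byte order, and uses hypothetical names (`toy_stamp_insid`, `TOY_*`).

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TOY_FLAG_VALID_INSID 0x01
#define TOY_PREFIX_SIZE 8       /* flags/repid word + chn; the insert id starts here */
#define TOY_MVCCID_SIZE 8

/* Stamp an insert id in place. Returns 0 on success, -1 if the buffer is too small. */
int
toy_stamp_insid (unsigned char *buf, int *length, int area_size, uint64_t mvcc_id)
{
  unsigned char *ins_pos = buf + TOY_PREFIX_SIZE;

  if (!(buf[0] & TOY_FLAG_VALID_INSID))
    {
      if (area_size < *length + TOY_MVCCID_SIZE)
        {
          return -1;            /* caller did not reserve room for the extra 8 bytes */
        }
      buf[0] |= TOY_FLAG_VALID_INSID;   /* mark the field as present */
      /* shift everything after the prefix up by 8 bytes to open the gap */
      memmove (ins_pos + TOY_MVCCID_SIZE, ins_pos, (size_t) (*length - TOY_PREFIX_SIZE));
      *length += TOY_MVCCID_SIZE;
    }
  memcpy (ins_pos, &mvcc_id, sizeof (mvcc_id));  /* write (or overwrite) the id */
  return 0;
}
```

Calling it twice on the same buffer exercises both cases: the first call grows the record by 8 bytes, the second only overwrites the id.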

Branch B — general path. Reached when any optimization precondition fails (non-MVCC class, in-place update, prev-version present, would go big, bulk op, client mode); it decodes the header fully first.

// heap_insert_adjust_recdes_header (Branch B) — src/storage/heap_file.c
or_mvcc_get_header (insert_context->recdes_p, &mvcc_rec_header); // ... err check condensed ...
if (insert_context->update_in_place != UPDATE_INPLACE_OLD_MVCCID)
{
if (is_mvcc_class && !insert_context->is_bulk_op) /* B1: MVCC class */
{
mvcc_id = logtb_get_current_mvccid (thread_p);
if (!MVCC_IS_FLAG_SET (&mvcc_rec_header, OR_MVCC_FLAG_VALID_INSID))
{ MVCC_SET_FLAG (&mvcc_rec_header, OR_MVCC_FLAG_VALID_INSID); record_size += OR_MVCCID_SIZE; }
MVCC_SET_INSID (&mvcc_rec_header, mvcc_id);
}
else /* B2: non-MVCC / client */
{
curr_header_size = mvcc_header_size_lookup[mvcc_rec_header.mvcc_flag];
MVCC_CLEAR_ALL_FLAG_BITS (&mvcc_rec_header); /* <- strip all optional ids */
new_header_size = mvcc_header_size_lookup[mvcc_rec_header.mvcc_flag];
record_size -= (curr_header_size - new_header_size); /* <- shrink length to match */
}
}
else if (MVCC_IS_HEADER_DELID_VALID (&mvcc_rec_header)) /* B3: redistribute keeps DELID */
insert_context->is_redistribute_insert_with_delid = true;
MVCC_CLEAR_FLAG_BITS (&mvcc_rec_header, OR_MVCC_FLAG_VALID_PREV_VERSION); /* always: new row has no prev */
if (is_mvcc_class && heap_is_big_length (record_size))
HEAP_MVCC_SET_HEADER_MAXIMUM_SIZE (&mvcc_rec_header); /* <- big record -> full 32B header */
or_mvcc_set_header (insert_context->recdes_p, &mvcc_rec_header); // ... err check condensed ...

Four mutually exclusive outcomes: B1 stamps INSID and grows by one MVCCID; B2 strips all optional ids and shrinks by the exact lookup-table delta; B3 (partition redistribute: UPDATE_INPLACE_OLD_MVCCID with a valid DELID) only records that the existing DELID must be preserved; UPDATE_INPLACE_OLD_MVCCID without a DELID passes through with its ids untouched. Then, branch-independent, prev-version is cleared and a big record of an MVCC class is promoted to max size (3.5).

SM_CLASS root-class vs ATTRINFO normal-table path. The function runs only for normal table rows. The insert driver calls it under the gate !OID_ISNULL(class_oid) && !OID_IS_ROOTOID(class_oid) && recdes_p->type != REC_ASSIGN_ADDRESS, so two kinds skip adjustment: a root-class row (OID_IS_ROOTOID — a raw serialized SM_CLASS record with its own object header, not MVCC-versioned, so an insert-id would corrupt it) and a REC_ASSIGN_ADDRESS placeholder. The ATTRINFO path (table rows from heap_attrinfo-built descriptors) is the non-root, non-placeholder branch that does call the adjuster.

flowchart TB
  IN["insert driver"] --> Q{"root class?\nor REC_ASSIGN_ADDRESS?"}
  Q -->|yes: SM_CLASS raw record| SKIP["skip adjust"]
  Q -->|no: ATTRINFO table row| OPT{use_optimization?}
  OPT -->|yes| A["Branch A: fast INSID stamp"]
  OPT -->|no| B["Branch B: full get/set header"]

Figure 3-2 — Caller gate plus internal branching of header adjustment.

3.5 Overflow records keep a fixed-size header

A REC_BIGONE body lives in the overflow file, its MVCC header on the first overflow page, always written at maximum size so it updates in place. Two helpers enforce this. heap_get_mvcc_rec_header_from_overflow reads it back as a recdes of OR_MVCC_MAX_HEADER_SIZE:

// heap_get_mvcc_rec_header_from_overflow — src/storage/heap_file.c
peek_recdes->data = overflow_get_first_page_data (ovf_page);
peek_recdes->length = OR_MVCC_MAX_HEADER_SIZE; /* <- always read 32B */
return or_mvcc_get_header (peek_recdes, mvcc_header);

The getter’s only branch: peek_recdes == NULL uses a local ovf_recdes; otherwise the caller’s recdes is populated so it can reach the overflow body too. heap_set_mvcc_rec_header_on_overflow writes it back, forcing maximum size so the overwritten slot never changes length:

// heap_set_mvcc_rec_header_on_overflow — src/storage/heap_file.c
ovf_recdes.area_size = ovf_recdes.length = OR_HEADER_SIZE (ovf_recdes.data);
assert (ovf_recdes.length == OR_MVCC_MAX_HEADER_SIZE); /* <- existing header must already be 32B */
if (!MVCC_IS_FLAG_SET (mvcc_header, OR_MVCC_FLAG_VALID_INSID))
{ MVCC_SET_FLAG_BITS (mvcc_header, OR_MVCC_FLAG_VALID_INSID);
MVCC_SET_INSID (mvcc_header, MVCCID_ALL_VISIBLE); } /* <- force INSID present */
if (!MVCC_IS_FLAG_SET (mvcc_header, OR_MVCC_FLAG_VALID_DELID))
{ MVCC_SET_FLAG_BITS (mvcc_header, OR_MVCC_FLAG_VALID_DELID);
MVCC_SET_DELID (mvcc_header, MVCCID_NULL); } /* <- force DELID present */
assert (mvcc_header_size_lookup[MVCC_GET_FLAG (mvcc_header)] == OR_MVCC_MAX_HEADER_SIZE);
return or_mvcc_set_header (&ovf_recdes, mvcc_header);

The two if blocks are the only branches: a set flag is left alone; a missing INSID/DELID flag is added with a neutral value (MVCCID_ALL_VISIBLE, MVCCID_NULL). HEAP_MVCC_SET_HEADER_MAXIMUM_SIZE (3.4) does the same at insert time.

Invariant: an overflow header is exactly OR_MVCC_MAX_HEADER_SIZE (32 bytes), always. Both helpers assert it. An in-place header update (e.g. stamping DELID at delete time) must not relocate the body; the fixed 32-byte header keeps record length constant. A variable-size header would shift every byte of a possibly multi-megabyte record on each DELID stamp.

Already traced above (the chn row in 3.2, Branch B2 in 3.4): non-MVCC classes strip all MVCC flags to the 8-byte rep+chn form, where chn alone carries coherency (checked by MVCC_IS_CHN_UPTODATE); MVCC classes stamp mvcc_ins_id as the visibility key while chn stays present but inert. One decoder (or_mvcc_get_header) serves both.

  1. The 4-bit record_type is the first dispatch: four types (REC_HOME, REC_RELOCATION, REC_BIGONE, REC_ASSIGN_ADDRESS) are what a public OID may land on; REC_NEWHOME is reachable only via a REC_RELOCATION forward; REC_*DELETED* are tombstones; REC_UNKNOWN is a sentinel.
  2. Forwarding (REC_NEWHOME) carries an inline variable-size header; overflow (REC_BIGONE) carries a fixed 32-byte header.
  3. mvcc_rec_header packs flags+repid into one word, an always-present chn, and three optional on-disk fields (mvcc_ins_id, mvcc_del_id, prev_version_lsa) gated by the 8-bit mvcc_flag.
  4. Header size is a pure function of the flag byte. or_header_size indexes mvcc_header_size_lookup[flag]; there is no length field, so writers and readers must agree on flags or mis-parse.
  5. heap_insert_adjust_recdes_header has a fast path (Branch A: stamp INSID via memmove) and a general path (Branch B: B1 MVCC-stamp, B2 strip-for-non-MVCC, B3 redistribute-keep-DELID), always clearing prev-version and promoting big records to max size. Root-class SM_CLASS records and REC_ASSIGN_ADDRESS placeholders skip adjustment; only non-root ATTRINFO rows are stamped.
  6. chn is the coherency key for non-MVCC classes (8-byte header); mvcc_ins_id is the visibility key for MVCC classes.

This chapter follows the birth of a single record — from the caller’s RECDES to a minted, locked OID and bytes resident in a slotted page: how is the OID minted, how is a home page chosen, and how does an oversized record spill to overflow at insert time? The high-level companion’s ### Insert flow section gives the placement algorithm; we trace every branch instead. Record types are in Chapter 3; HEAP_OPERATION_CONTEXT in Chapter 1; best-space selection in Chapter 9; logging in Chapter 10.

4.1 Context build-up — heap_create_insert_context

Every logical heap operation funnels through a HEAP_OPERATION_CONTEXT. The insert constructor clears it, then stamps in the inputs:

// heap_create_insert_context -- src/storage/heap_file.c
heap_clear_operation_context (context, hfid_p); /* <- reset ALL fields; flag defaults below */
if (class_oid_p != NULL) { COPY_OID (&context->class_oid, class_oid_p); } /* <- may stay NULL */
context->recdes_p = recdes_p; /* <- caller's record bytes + type */
context->scan_cache_p = scancache_p; /* <- optional page-caching hint */
context->type = HEAP_OPERATION_INSERT;

heap_clear_operation_context nulls res_oid, ovf_oid, map_recdes and the three behavior flags below (each = false):

| Flag | Set where | Effect on insert |
| --- | --- | --- |
| is_logical_old | false for insert; true only in heap_update_logical for the relocated source | false = genuinely new logical object; REC_NEWHOME relocation uses heap_insert_newhome, not this path, so it never flips. |
| is_redistribute_insert_with_delid | heap_insert_adjust_recdes_header, when the incoming header already has a valid DELID | Routes logging to heap_mvcc_log_redistribute (§4.7). |
| is_bulk_op | bulk-load caller, before heap_insert_logical | Suppresses INSID stamping, takes NULL_LOCK not X_LOCK, asserts a pre-held BU_LOCK. |

Invariant — a context is single-use and fully reset. Because heap_create_insert_context always clears first, no field survives a prior operation; a reused-without-reset context would leak a stale ovf_oid/res_oid and corrupt the forwarding map.

4.2 The whole flow — heap_insert_logical

heap_insert_logical is the single entry point (Figure 4-1, every branch to error:). One decision is not in the figure, the MVCC-op classification: is_mvcc_op is true only under SERVER_MODE with an MVCC-enabled class, a non-REC_ASSIGN_ADDRESS record, and not bulk; it is consumed only at logging (§4.7), never at placement. The header-adjust gate (§4.3) is skipped for the root class, a NULL class OID, and REC_ASSIGN_ADDRESS.

flowchart TB
  A["heap_insert_logical"] --> B{"scancache_check OK?"}
  B -- "no" --> RF1["return ER_FAILED (no page fixed)"]
  B -- "yes" --> G{"class not NULL, not ROOT,\ntype != REC_ASSIGN_ADDRESS?"}
  G -- "yes, fail" --> RF1
  G -- "yes, ok / no" --> J["heap_insert_handle_multipage_record"]
  J -- "ER_FAILED" --> ERR["goto error"]
  J -- "ok" --> M["class lock: bulk asserts BU_LOCK,\nelse IX_LOCK UNCOND"]
  M -- "not granted" --> RF1
  M -- "granted" --> Q["heap_get_insert_location_with_lock"]
  Q -- "fail" --> RF1
  Q -- "ok, res_oid set + locked" --> O["heap_insert_physical"]
  O -- "fail" --> ERR
  O -- "ok" --> U["log unless bulk; set_dirty;\ncache or pgbuf_ordered_unfix; perfmon"]
  U --> ERR
  ERR["error:"] --> RET["return rc"]

Figure 4-1 — Branch-complete control flow. Early failures return directly (no page fixed); post-fix failures goto error (the shared SystemTap exit). error: is also the normal (success) exit.

4.3 Stamping the MVCC INSID — heap_insert_adjust_recdes_header

For an MVCC class the row gets its insert MVCCID here, via a fast path (guard predicate below) and a general path:

// heap_insert_adjust_recdes_header -- src/storage/heap_file.c
use_optimization = (is_mvcc_class && update_in_place == UPDATE_INPLACE_NONE
&& !(mvcc_flags & OR_MVCC_FLAG_VALID_PREV_VERSION)
&& !heap_is_big_length (record_size + OR_MVCCID_SIZE) && !is_bulk_op);
if (use_optimization) { /* <- in-place: OR_PUT_INT INSID flag; memmove 8-byte gap; */
return NO_ERROR; /* length += OR_MVCCID_SIZE; OR_PUT_BIGINT INSID; one memmove, no pack/unpack */
}

The fast path fires for the common case (a fresh, non-big, no-prev-version insert). The general path (or_mvcc_get_header → mutate → or_mvcc_set_header) has three sub-branches: (1) MVCC, not bulk: set OR_MVCC_FLAG_VALID_INSID if absent, grow by OR_MVCCID_SIZE, MVCC_SET_INSID; (2) non-MVCC or bulk: strip all flags (MVCC_CLEAR_ALL_FLAG_BITS) and shrink; (3) UPDATE_INPLACE_OLD_MVCCID: keep the existing MVCCID, setting is_redistribute_insert_with_delid if a DELID is present (a partition redistribute). A now-big MVCC record then has its header forced to maximum size (HEAP_MVCC_SET_HEADER_MAXIMUM_SIZE) so that a later in-place header update, such as a DELID stamp at delete time, never changes the length of the overflow record (§3.5).

Invariant — INSID width is reserved before placement. The §4.5 home page is sized against recdes_p->length after OR_MVCCID_SIZE is added; the fast path’s area_size >= length + OR_MVCCID_SIZE assert keeps the memmove in bounds.

4.4 Oversized spill — heap_insert_handle_multipage_record

This is the gate between “fits in a page” and “goes to overflow”:

// heap_insert_handle_multipage_record -- src/storage/heap_file.c
if (!heap_is_big_length (context->recdes_p->length)) { return NO_ERROR; } /* <- normal: untouched */
if (heap_ovf_insert (thread_p, &context->hfid, &context->ovf_oid, context->recdes_p) == NULL)
{ return ER_FAILED; } /* <- overflow insert failed */
heap_build_forwarding_recdes (&context->map_recdes, REC_BIGONE, &context->ovf_oid);
context->recdes_p = &context->map_recdes; /* <- home page now receives the 8-byte map */

heap_is_big_length (length > heap_Maxslotted_reclength) is the single source of truth for “oversized”; when not big, the function is a no-op. When big, heap_build_forwarding_recdes builds the 8-byte REC_BIGONE map and repoints context->recdes_p at it, so the rest of heap_insert_logical places that tiny record like a normal one.
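The "swap the record for a tiny map" move can be sketched as follows. Everything here is a stand-in: `toy_context`, `toy_recdes`, `toy_oid`, and `toy_handle_multipage` are hypothetical, the threshold is arbitrary, and the overflow write itself is replaced by a caller-supplied page id.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_MAX_SLOTTED_LENGTH 64       /* arbitrary stand-in for heap_Maxslotted_reclength */
#define TOY_NULL_SLOTID (-1)

typedef struct
{
  int32_t pageid;
  int16_t slotid;
  int16_t volid;
} toy_oid;                      /* 8 bytes, the size of the forwarding map */

typedef enum { TOY_REC_HOME, TOY_REC_BIGONE } toy_rec_type;

typedef struct
{
  const void *data;
  int length;
  toy_rec_type type;
} toy_recdes;

typedef struct
{
  toy_recdes *recdes_p;         /* record the home page will receive */
  toy_oid ovf_oid;              /* first overflow page, when spilled */
  toy_recdes map_recdes;        /* 8-byte map built on spill */
} toy_context;

int
toy_is_big_length (int length)
{
  return length > TOY_MAX_SLOTTED_LENGTH;
}

/* ovf_pageid stands in for the page a real overflow insert would return. */
int
toy_handle_multipage (toy_context * ctx, int32_t ovf_pageid)
{
  if (!toy_is_big_length (ctx->recdes_p->length))
    {
      return 0;                 /* fits in a page: context untouched */
    }
  ctx->ovf_oid.pageid = ovf_pageid;
  ctx->ovf_oid.volid = 0;
  ctx->ovf_oid.slotid = TOY_NULL_SLOTID;        /* overflow pages have no slots */

  ctx->map_recdes.data = &ctx->ovf_oid;
  ctx->map_recdes.length = (int) sizeof (toy_oid);
  ctx->map_recdes.type = TOY_REC_BIGONE;
  ctx->recdes_p = &ctx->map_recdes;     /* the home page now receives only the map */
  return 0;
}
```

After the call, downstream code reads `ctx->recdes_p` without caring whether it points at the caller's record or at the map, which is the point of repointing rather than branching later.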

Invariant — the MVCC header lives in the overflow record, not the map. The source comment is explicit (“MVCC information is held in overflow record”) — which is why §4.3 forced the overflow record’s header to maximum size. The home-page REC_BIGONE map carries no MVCCID; visibility (Chapter 5) follows the OID into overflow.

// heap_ovf_insert -- src/storage/heap_file.c
if (heap_ovf_find_vfid (thread_p, hfid, &ovf_vfid, true, PGBUF_UNCONDITIONAL_LATCH) == NULL
|| overflow_insert (thread_p, &ovf_vfid, &ovf_vpid, recdes, FILE_MULTIPAGE_OBJECT_HEAP) != NO_ERROR)
{ return NULL; } /* <- either branch fails -> NULL */
ovf_oid->pageid = ovf_vpid.pageid; ovf_oid->volid = ovf_vpid.volid;
ovf_oid->slotid = NULL_SLOTID; /* <- overflow has no slot */

The true argument creates the overflow file if absent; ovf_oid identifies the first overflow page (its slot field is meaningless — overflow pages are page-chained, not slotted).

4.5 Choosing the home page and minting the OID

With recdes_p ready and the class IX lock held, heap_insert_logical calls heap_get_insert_location_with_lock — where the OID is born:

  1. Page select. home_hint == NULL → heap_stats_find_best_page (isnew_rec = true; Chapter 9’s black box, which yields a fixed page or an error via the §4.6 fallback, never “no space”; a NULL result → return error_code). A non-NULL hint → ER_SP_NOSPACE_IN_PAGE if too small, else adopt it. Either way res_oid.volid/pageid are set.
  2. Lock mode: SCH_M_LOCK (root), NULL_LOCK (bulk), else X_LOCK.
  3. Slot loop (slot 0..slot_count): spage_find_free_slot returns a slot_id (also reclaims REC_DELETED_WILL_REUSE slots, Chapter 7; == slot_count means “append”) or SP_ERROR → break. Set res_oid.slotid; NULL_LOCK returns immediately, else lock_object conditionally: LK_GRANTED → return; LK_NOTGRANTED_DUE_TIMEOUT → next slot; any other error → break.
  4. Break path: null res_oid, unfix the page, assert(false), return ER_FAILED.

Invariant — the OID is fully determined and X-locked before any byte is written. res_oid is complete only after a successful lock_object (or immediately for NULL_LOCK bulk), with the lock on the not-yet-written slot — so the row lock is taken inside INSERT, and step 3’s conditional retry means two inserters never deadlock on a tentative slot.
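The slot loop in step 3 can be modelled with a toy page and a try-lock that never blocks. This is a sketch under assumptions: the way the loop advances past a contended slot is a guess at the shape, not a transcription, and `toy_page`, `toy_find_free_slot`, `toy_try_lock`, and `toy_get_insert_slot` are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_SLOTS 8

enum toy_lock_result { TOY_GRANTED, TOY_TIMEOUT, TOY_LOCK_ERROR };

typedef struct
{
  bool in_use[TOY_SLOTS];               /* slot already holds a record */
  bool locked_by_other[TOY_SLOTS];      /* another inserter holds this tentative slot */
  int slot_count;
} toy_page;

/* First free slot at or after start; slot_count itself means "append". */
int
toy_find_free_slot (const toy_page * page, int start)
{
  for (int s = start; s < page->slot_count; s++)
    {
      if (!page->in_use[s])
        return s;
    }
  return page->slot_count;
}

/* Conditional lock: report contention instead of waiting. */
enum toy_lock_result
toy_try_lock (const toy_page * page, int slot)
{
  if (slot < page->slot_count && page->locked_by_other[slot])
    return TOY_TIMEOUT;
  return TOY_GRANTED;
}

/* Returns the locked slot id, or -1 on the break path. */
int
toy_get_insert_slot (const toy_page * page)
{
  for (int start = 0; start <= page->slot_count; start++)
    {
      int slot = toy_find_free_slot (page, start);
      enum toy_lock_result r = toy_try_lock (page, slot);

      if (r == TOY_GRANTED)
        return slot;            /* OID is complete and locked before any write */
      if (r != TOY_TIMEOUT)
        break;                  /* any other lock error ends the search */
      start = slot;             /* contended: the loop increment moves past it */
    }
  return -1;
}
```

Because the lock attempt never waits, an inserter that loses a slot simply moves on, so two inserters cannot end up waiting on each other's tentative slots.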

4.6 Physical placement — heap_insert_physical and heap_alloc_new_page

heap_insert_physical writes at the exact reserved slot — it does not choose one:

// heap_insert_physical -- src/storage/heap_file.c
assert (context->res_oid.slotid != NULL_SLOTID); /* <- slot chosen in §4.5 */
if (spage_insert_at (thread_p, context->home_page_watcher_p->pgptr, context->res_oid.slotid,
context->recdes_p) != SP_SUCCESS)
{ er_set (ER_FATAL_ERROR_SEVERITY, ...); OID_SET_NULL (&context->res_oid); return ER_FAILED; }

spage_insert_at is Chapter 2’s primitive; failure is fatal (the slot was just proven free and lockable, so SP_DOESNT_FIT means page corruption).

heap_alloc_new_page is the new-page fallback behind best-space — called not by heap_insert_logical but by heap_stats_find_best_page (and the bulk loader) when no page has room:

// heap_alloc_new_page -- src/storage/heap_file.c
HEAP_PAGE_SET_VACUUM_STATUS (&new_page_chain, HEAP_PAGE_VACUUM_NONE); /* <- clean page, links nulled */
error_code = file_alloc (thread_p, &hfid->vfid, heap_vpid_init_new, &new_page_chain, new_page_vpid, &page_ptr);
if (error_code != NO_ERROR) { ASSERT_ERROR (); return error_code; } /* <- no watcher attached yet */
pgbuf_attach_watcher (thread_p, page_ptr, PGBUF_LATCH_WRITE, hfid, home_hint_p);

It initializes a fresh HEAP_CHAIN header (Chapter 8) then attaches a write-latched watcher, fixing the page before return. The error path mirrors §4.2’s asymmetry: when file_alloc fails it returns the error via ASSERT_ERROR() before any watcher is attached, leaving no page fixed. (HEAP_CHAIN fields: Chapter 1.)

After a successful insert (unless use_bulk_logging), heap_log_insert_physical dispatches by record type and op kind (using is_mvcc_op and is_redistribute_insert_with_delid):

  • MVCC + redistribute → heap_mvcc_log_redistribute.
  • MVCC, normal → heap_mvcc_log_insert.
  • non-MVCC, REC_ASSIGN_ADDRESS → RVHF_INSERT undo/redo, 2-byte reserved-length payload (no body yet).
  • non-MVCC, REC_NEWHOME → RVHF_INSERT_NEWHOME.
  • non-MVCC, else → plain RVHF_INSERT.

RVHF_* semantics are Chapter 10’s. Perfmon counters (PSTAT_HEAP_HOME_INSERTS / BIG_INSERTS / ASSIGN_INSERTS) are keyed by the final recdes_p->type (REC_BIGONE if §4.4 spilled).
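The dispatch in the list above reduces to a short decision function. The sketch below uses hypothetical names throughout (`toy_pick_insert_log`, `toy_log_kind`, `toy_rec_type`) and returns a label instead of writing a log record.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical labels for the five logging outcomes, plus "no log" for bulk. */
enum toy_log_kind
{
  TOY_LOG_NONE,
  TOY_LOG_MVCC_REDISTRIBUTE,
  TOY_LOG_MVCC_INSERT,
  TOY_LOG_INSERT_ASSIGN_ADDRESS,        /* insert log carrying only a reserved length */
  TOY_LOG_INSERT_NEWHOME,
  TOY_LOG_INSERT
};

enum toy_rec_type { TOY_REC_HOME, TOY_REC_NEWHOME, TOY_REC_ASSIGN_ADDRESS, TOY_REC_BIGONE };

/* Choose the log record kind from the op classification and the record type. */
enum toy_log_kind
toy_pick_insert_log (bool bulk_logging, bool is_mvcc_op, bool is_redistribute_with_delid,
                     enum toy_rec_type type)
{
  if (bulk_logging)
    return TOY_LOG_NONE;
  if (is_mvcc_op)
    return is_redistribute_with_delid ? TOY_LOG_MVCC_REDISTRIBUTE : TOY_LOG_MVCC_INSERT;
  if (type == TOY_REC_ASSIGN_ADDRESS)
    return TOY_LOG_INSERT_ASSIGN_ADDRESS;
  if (type == TOY_REC_NEWHOME)
    return TOY_LOG_INSERT_NEWHOME;
  return TOY_LOG_INSERT;
}
```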

4.8 The pre-mint path — heap_assign_address

heap_assign_address reserves an OID before the row’s bytes exist (the transformer uses it to break a circular reference).

// heap_assign_address -- src/storage/heap_file.c
if (expected_length <= 0) { /* ... heap_estimate_avg_length ... */ }
recdes.length = /* <- clamp to [OID_SIZE, non-big] */
((expected_length > SSIZEOF (OID) && !heap_is_big_length (expected_length)) ? expected_length : SSIZEOF (OID));
recdes.data = NULL; recdes.type = REC_ASSIGN_ADDRESS; /* <- placeholder type */
heap_create_insert_context (&insert_context, (HFID *) hfid, class_oid, &recdes, NULL);
rc = heap_insert_logical (thread_p, &insert_context, NULL);
COPY_OID (oid, &insert_context.res_oid); /* <- hand back the minted OID */

The clamp falls back to the heap’s average object length when the length is unknown (a big reservation reserves just OID_SIZE, the content going to overflow later). The REC_ASSIGN_ADDRESS type makes the insert skip header adjustment (§4.2) and log non-MVCC (§4.7); a later UPDATE (Chapter 6) fills the slot.
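The length clamp is easy to isolate. In this sketch the average-length estimate is passed in as a parameter instead of being computed, the size limits are arbitrary constants, and `toy_reserved_length` is a hypothetical name.

```c
#include <assert.h>

#define TOY_OID_SIZE 8                  /* minimum reservation: room for a forward OID */
#define TOY_MAX_SLOTTED_LENGTH 1000     /* arbitrary stand-in for the "big" threshold */

/* Reserve expected_length bytes when that is sensible, otherwise just an OID. */
int
toy_reserved_length (int expected_length, int avg_length)
{
  if (expected_length <= 0)
    {
      expected_length = avg_length;     /* unknown: fall back to the heap's average */
    }
  if (expected_length > TOY_OID_SIZE && expected_length <= TOY_MAX_SLOTTED_LENGTH)
    {
      return expected_length;
    }
  return TOY_OID_SIZE;                  /* too small or too big: reserve only an OID */
}
```

Reserving at least an OID's worth of bytes means the slot can later hold a forwarding pointer even if the real content ends up elsewhere.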

4.9 The shared relocation helper — heap_insert_newhome

heap_insert_newhome places a REC_NEWHOME body — the relocated copy of a record that outgrew its home page. It is never reached from a logical insert; update (Chapter 6) and delete relocation reuse it.

// heap_insert_newhome -- src/storage/heap_file.c
assert (parent_context->type == HEAP_OPERATION_DELETE || parent_context->type == HEAP_OPERATION_UPDATE);
heap_create_insert_context (&ins_context, &parent_context->hfid, &parent_context->class_oid, recdes_p, NULL);
error_code = heap_find_location_and_insert_rec_newhome (thread_p, &ins_context); /* <- find page + spage_insert */
heap_log_insert_physical (thread_p, ..., &ins_context.res_oid, ins_context.recdes_p, false, false); /* <- always non-MVCC */
if (out_oid_p != NULL) { COPY_OID (out_oid_p, &ins_context.res_oid); } /* <- give caller the OID */
if (newhome_pg_watcher != NULL) /* <- optional: hand page back */
pgbuf_replace_watcher (thread_p, ins_context.home_page_watcher_p, newhome_pg_watcher);

Differences that matter to a modifier: it builds its own child ins_context (§4.1’s single-use invariant); it takes no row lock (REC_NEWHOME is reached only via its REC_RELOCATION pointer, so the visible OID and X-lock live there, and placement uses heap_stats_find_best_page with isnew_rec = false then spage_insert — a page-chosen, not pre-locked, slot); logging is always non-MVCC (RVHF_INSERT_NEWHOME) since vacuum never inspects REC_NEWHOME; and a non-NULL newhome_pg_watcher keeps the page fixed via pgbuf_replace_watcher so the caller sets the prev-version LSA (Chapter 6) without re-fixing.

  1. Insert is one funnel. heap_insert_logical drives a single-use context through header-adjust → spill → page-select+lock → physical-write → log; early failures return directly, post-fix failures goto error.
  2. The MVCCID is stamped before placement. heap_insert_adjust_recdes_header adds an 8-byte INSID (fast in-place path) so the page is sized against the final length; non-MVCC/bulk strip all flags.
  3. Oversized means a tiny home record. heap_insert_handle_multipage_record writes the bytes to overflow via heap_ovf_insert, then leaves an 8-byte REC_BIGONE map whose MVCC header lives in the overflow record.
  4. The OID is minted and X-locked before any byte is written. heap_get_insert_location_with_lock fixes a page, completes res_oid, and conditionally X-locks the tentative slot, advancing on contention.
  5. Physical write trusts the reservation. heap_insert_physical calls spage_insert_at at the pre-locked slot (failure is fatal); heap_alloc_new_page is the fallback, leaving no page fixed when file_alloc fails.
  6. Two specialized births. heap_assign_address mints a body-less OID (REC_ASSIGN_ADDRESS, filled later by UPDATE); heap_insert_newhome is the relocation-only twin — own context, no row lock, always non-MVCC RVHF_INSERT_NEWHOME, optional page-watcher hand-back.

Chapter 5: Read Path Visibility and Following the Forwarding Chain

The read path answers: given an OID (or sequential cursor), what bytes may this transaction see? The record at the OID may not hold its own data — a REC_RELOCATION pointer to a REC_NEWHOME body or a REC_BIGONE pointer to an overflow page — and even once located, the MVCC snapshot may rule the current version invisible, forcing a walk back into the log. This chapter traces both, branch by branch.

The companion’s ### Read flow and ### MVCC integration — record header give the concepts; the snapshot predicate (mvcc_satisfies_snapshot) is theory we do not re-derive — we assume SNAPSHOT_SATISFIED / TOO_OLD_FOR_SNAPSHOT / TOO_NEW_FOR_SNAPSHOT are understood.

Every read funnels through a HEAP_GET_CONTEXT (struct heap_get_context in heap_file.h) — the scratchpad carrying the OID, latched pages, and record type between helpers. Every field:

| Field | Role | Why it exists |
|---|---|---|
| record_type | Dispatcher output; REC_HOME/REC_RELOCATION/REC_BIGONE/… | Drives per-type dispatch; slot read once. |
| oid_p | Home OID (input, const). | The home page+slot; never REC_NEWHOME. |
| forward_oid | Filled for REC_RELOCATION/REC_BIGONE. | The REC_NEWHOME body slot, or first overflow page. |
| class_oid_p | Class OID; may be NULL, filled from page chain. | Decides whether the class is MVCC-disabled. |
| recdes_p | Output bytes; NULL for header-only. | Visibility runs without copying data. |
| scan_cache | Owning HEAP_SCANCACHE, or NULL. | Copy area + cache_last_fix_page latch retention. |
| home_page_watcher | Ordered watcher on the home page. | Held across the get; handed back on cleanup. |
| fwd_page_watcher | Ordered watcher on forward / overflow page. | Fixed only for relocation/bigone. |
| ispeeking | PEEK (alias) vs COPY (into an area). | Whether data outlives the latch; see 5.7. |
| old_chn | Caller’s cached CHN, or NULL_CHN. | The “client already has this version” short-circuit. |
| latch_mode | READ, or WRITE when scan cache demands X. | Serial increment reads under X to avoid a re-fix. |

Invariant (forward consistency). On heap_prepare_get_context returning S_SUCCESS, every caller asserts record_type == REC_HOME or (!OID_ISNULL(&forward_oid) and fwd_page_watcher.pgptr != NULL) — enforced by the assert pair atop heap_get_visible_version_internal/heap_get_last_version. If violated, heap_get_mvcc_header would dereference a NULL forward page for a relocation/bigone record — a crash.
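The invariant reads naturally as a predicate over the context. The sketch below is illustrative only: `toy_get_context`, `toy_rec_type`, and `toy_forward_consistent` are hypothetical stand-ins rather than CUBRID definitions, and the forward OID and page watcher are reduced to the two fields the check needs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins; not the CUBRID definitions. */
typedef enum { TOY_REC_HOME, TOY_REC_RELOCATION, TOY_REC_BIGONE } toy_rec_type;

typedef struct
{
  toy_rec_type record_type;
  int forward_pageid;        /* -1 models a NULL forward OID */
  const void *fwd_page_ptr;  /* NULL models an unfixed forward page */
} toy_get_context;

/* True when a header reader may dereference whatever the type requires. */
static bool
toy_forward_consistent (const toy_get_context *ctx)
{
  if (ctx->record_type == TOY_REC_HOME)
    {
      return true;   /* home-only read: the forward page is never touched */
    }
  /* relocation / bigone: both the forward OID and its page must be present */
  return ctx->forward_pageid != -1 && ctx->fwd_page_ptr != NULL;
}
```

A relocation context whose forward page pointer is still NULL fails the predicate, which is the case the real asserts are described as guarding against.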

5.2 Fixing the home page — heap_prepare_object_page

On success the watcher holds the page the OID lives on.

// heap_prepare_object_page -- src/storage/heap_file.c
VPID_GET_FROM_OID (&object_vpid, oid);
if (page_watcher_p->pgptr != NULL && !VPID_EQ (pgbuf_get_vpid_ptr (page_watcher_p->pgptr), &object_vpid))
pgbuf_ordered_unfix (thread_p, page_watcher_p); /* <- wrong page latched; drop it */
if (page_watcher_p->pgptr == NULL)
{
ret = pgbuf_ordered_fix (thread_p, &object_vpid, OLD_PAGE, latch_mode, page_watcher_p);
if (ret == ER_PB_BAD_PAGEID) ret = ER_HEAP_UNKNOWN_OBJECT; /* <- bad page = "no object" */
if (ret == ER_LK_PAGE_TIMEOUT && er_errid () == NO_ERROR) ret = ER_PAGE_LATCH_ABORTED;
}
return ret;

Three branches: right page held → no fix; wrong page → unfix first; no page → fix. Two error codes are remapped: ER_PB_BAD_PAGEID → ER_HEAP_UNKNOWN_OBJECT (the caller maps this to S_DOESNT_EXIST, since a dangling OID is normal), and ER_LK_PAGE_TIMEOUT → ER_PAGE_LATCH_ABORTED, the ordered-fix retry signal.

5.3 The dispatcher — heap_prepare_get_context

This fixes the home page (5.2), reads the record type, and for indirect types fixes the forward page too — the branching heart of the read path:

flowchart TD
    A[heap_prepare_get_context] --> B[heap_prepare_object_page home]
    B -->|ER_HEAP_UNKNOWN_OBJECT| Z1[S_DOESNT_EXIST]
    B -->|other error| ERR[goto error: clean + S_ERROR]
    B -->|ok| E{slot record_type}
    E -->|NULL slot / ASSIGN_ADDRESS / MARKDELETED / DELETED_WILL_REUSE| Z2[S_DOESNT_EXIST or err]
    E -->|REC_HOME| H[S_SUCCESS, home only]
    E -->|REC_RELOCATION| R[peek forward_oid, fix fwd page]
    E -->|REC_BIGONE| G[peek forward_oid, fix overflow page]
    E -->|REC_NEWHOME direct read| ERR2[ER_HEAP_BAD_OBJECT_TYPE]
    R -->|home unfixed| RT[retry once: try_again]
    R -->|stable| H
    G -->|fix ok| H
    G -->|page_was_unfixed| AS[assert false]

Figure 5-2. Every branch of heap_prepare_get_context.

Non-obvious branches:

  • REC_RELOCATION retry. After the forward fix, if home_page_watcher.page_was_unfixed shows ordered-fix re-grabbed home, the relocation link may have changed, so it loops to try_again once (try_max == 1); a second unfix raises ER_PAGE_LATCH_ABORTED. For REC_BIGONE the forward watcher re-ranks to PGBUF_ORDERED_HEAP_OVERFLOW and a home page_was_unfixed is not expected (overflow is immutable) → assert(false).
  • Edge cases. A supplied-but-NULL class_oid_p is filled from the chain record; REC_NEWHOME direct read is illegal → ER_HEAP_BAD_OBJECT_TYPE. The error: label runs heap_clean_get_context + S_ERROR; the S_DOESNT_EXIST/S_SUCCESS paths do not clean up — the caller owns the latches.

5.4 Reading the header — heap_get_mvcc_header

With pages latched and type known, heap_get_mvcc_header is a pure 3-way switch (context->record_type): REC_HOME peeks the home slot and decodes it with or_mvcc_get_header; REC_RELOCATION does the same on forward_oid; REC_BIGONE calls heap_get_mvcc_rec_header_from_overflow; default is assert(false) → S_ERROR. The pre-condition asserts (home matches the OID, forward page matches forward_oid) are the teeth of the forward-consistency invariant of 5.1.

5.5 The visibility decision — heap_get_visible_version_internal

heap_get_visible_version is a thin wrapper. The internal function prepares the context (5.3; any result other than S_SUCCESS → goto exit), reads the header (5.4) when a snapshot or old_chn is present, then maps the verdict:

// heap_get_visible_version_internal -- src/storage/heap_file.c
snapshot_res = mvcc_snapshot->snapshot_fnc (thread_p, &mvcc_header, mvcc_snapshot);
if (snapshot_res == TOO_NEW_FOR_SNAPSHOT) /* wanted version is older, in the log */
{ scan = heap_get_visible_version_from_log (..., &MVCC_GET_PREV_VERSION_LSA (&mvcc_header), ...); goto exit; }
else if (snapshot_res == TOO_OLD_FOR_SNAPSHOT) /* dead to us; a miss */
{ scan = S_SNAPSHOT_NOT_SATISFIED; goto exit; }
/* else SNAPSHOT_SATISFIED falls through to the CHN check, else copy (5.7) */
if (MVCC_IS_CHN_UPTODATE (&mvcc_header, context->old_chn)) /* <- runs even with no snapshot */
{ scan = S_SUCCESS_CHN_UPTODATE; goto exit; }

The asymmetry is load-bearing: TOO_OLD is a miss, TOO_NEW is not — the wanted version is older in the undo log, so we follow MVCC_GET_PREV_VERSION_LSA (5.6). The CHN short-circuit runs after the snapshot block unconditionally, firing even with no snapshot when an old_chn was passed (### CHN). heap_get_last_version is this skeleton minus the snapshot block, the snapshot-free sibling for updaters/lockers.

5.6 Walking into the log — heap_get_visible_version_from_log

On TOO_NEW, older versions are reconstructed from undo records chained by prev_version_lsa:

// heap_get_visible_version_from_log -- src/storage/heap_file.c
for (LSA_COPY (&process_lsa, previous_version_lsa); !LSA_ISNULL (&process_lsa);)
{ /* fetch page + log_get_undo_record elided */
if (scan_code == S_DOESNT_FIT && scan_cache->is_recdes_assigned_to_area (*recdes))
{ scan_cache->assign_recdes_to_area (*recdes, (size_t) (-recdes->length)); continue; } /* grow + retry */
or_mvcc_get_header (recdes, &mvcc_header);
snapshot_res = scan_cache->mvcc_snapshot->snapshot_fnc (...);
if (snapshot_res == SNAPSHOT_SATISFIED)
return MVCC_IS_CHN_UPTODATE (&mvcc_header, has_chn) ? S_SUCCESS_CHN_UPTODATE : S_SUCCESS;
else if (snapshot_res == TOO_OLD_FOR_SNAPSHOT) { assert (false); return S_ERROR; } /* <- impossible: only older here */
else /* TOO_NEW */ LSA_COPY (&process_lsa, &MVCC_GET_PREV_VERSION_LSA (&mvcc_header)); /* step back */
}
return S_DOESNT_EXIST; /* chain exhausted, nothing visible */

TOO_OLD is impossible here (only older versions live in the log) and asserts; an exhausted chain (LSA_ISNULL) → S_DOESNT_EXIST.
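The backward walk can be modelled on a toy chain. This is a minimal sketch under loud assumptions: the visibility rule ("too new when the inserter id exceeds a bound") is a deliberate simplification of the real snapshot function, array indexes stand in for LSAs, and every `toy_` name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical toy model: each version records who inserted it and the
   index of the previous version (-1 plays the role of a NULL LSA). */
typedef struct
{
  unsigned long insid;
  int prev;
  const char *payload;
} toy_version;

typedef enum { TOY_S_SUCCESS, TOY_S_DOESNT_EXIST } toy_scan_code;

/* Assumed rule for this sketch only: a version is too new when its
   inserter id is above the snapshot bound. */
static int
toy_is_too_new (const toy_version *v, unsigned long snapshot_bound)
{
  return v->insid > snapshot_bound;
}

/* Step back through prev links until a visible version is found
   or the chain is exhausted. */
static toy_scan_code
toy_get_visible_from_chain (const toy_version *chain, int start,
                            unsigned long snapshot_bound, const char **out)
{
  int cur;

  for (cur = start; cur != -1; cur = chain[cur].prev)
    {
      if (!toy_is_too_new (&chain[cur], snapshot_bound))
        {
          *out = chain[cur].payload;
          return TOY_S_SUCCESS;
        }
    }
  *out = NULL;
  return TOY_S_DOESNT_EXIST;   /* chain exhausted, nothing visible */
}
```

With versions inserted by ids 10, 20, and 30 and a bound of 25, the walk starting at the newest entry skips one version and stops at the second; with a bound of 5 it runs off the end of the chain.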

5.7 Copying the bytes — heap_get_record_data_when_all_ready and PEEK vs COPY

This helper maps type to a spage_get_record — the only place honoring ispeeking:

// heap_get_record_data_when_all_ready -- src/storage/heap_file.c
switch (context->record_type)
{
case REC_RELOCATION: /* never aliased -- forced COPY */
return spage_get_record (..., fwd_page_watcher.pgptr, forward_oid.slotid, recdes_p, COPY);
case REC_BIGONE: return heap_get_bigone_content (..., ispeeking, &forward_oid, recdes_p);
case REC_HOME: /* honors context->ispeeking */
return spage_get_record (..., home_page_watcher.pgptr, oid_p->slotid, recdes_p, context->ispeeking);
default: break;
}
return S_ERROR;

PEEK vs COPY (the latch lifetime contract). PEEK points recdes->data into the latched page: zero copy, valid only while the latch is held. COPY memcpys into a preallocated area. The entry assertion accepts PEEK, or COPY with somewhere to copy to (a scan_cache or caller-provided recdes->data).

Invariant (REC_RELOCATION is never peeked). The relocation case hard-codes COPY; using context->ispeeking would let a PEEK scan alias forward-page memory and read garbage after heap_clean_get_context unfixes the forward watcher.
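The lifetime difference is easy to demonstrate outside the kernel. The following sketch assumes a page is just a byte array and models "the latch was released and the page changed" by overwriting that array; `toy_page`, `toy_recdes`, `toy_peek`, and `toy_copy` are hypothetical names, not CUBRID APIs.

```c
#include <assert.h>
#include <string.h>

#define TOY_PAGE_SIZE 32

/* Hypothetical stand-ins for a latched page and a record descriptor. */
typedef struct { char bytes[TOY_PAGE_SIZE]; } toy_page;
typedef struct { const char *data; int length; } toy_recdes;

/* PEEK: alias the page memory; valid only while the page stays latched. */
static void
toy_peek (const toy_page *page, int offset, int length, toy_recdes *rec)
{
  rec->data = page->bytes + offset;
  rec->length = length;
}

/* COPY: duplicate into a caller-owned area; the result outlives the latch. */
static int
toy_copy (const toy_page *page, int offset, int length,
          char *area, int area_size, toy_recdes *rec)
{
  if (length > area_size)
    {
      return -1;   /* area too small for the record */
    }
  memcpy (area, page->bytes + offset, (size_t) length);
  rec->data = area;
  rec->length = length;
  return 0;
}
```

If the page bytes are rewritten after both calls, the peeked descriptor observes the new bytes while the copied one still holds the original record, which is why a forward-page record that will be unfixed must be copied.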

5.8 The overflow fetch — heap_get_bigone_content

REC_BIGONE data lives in an overflow file (### Overflow file); the fetch grows its area on S_DOESNT_FIT:

// heap_get_bigone_content -- src/storage/heap_file.c
if (scan_cache != NULL && (ispeeking == PEEK || recdes->data == NULL || scan_cache->is_recdes_assigned_to_area (*recdes)))
{
scan_cache->assign_recdes_to_area (*recdes);
while ((scan = heap_ovf_get (thread_p, forward_oid, recdes, NULL_CHN, NULL)) == S_DOESNT_FIT)
{ assert (recdes->length < 0); scan_cache->assign_recdes_to_area (*recdes, (size_t) (-recdes->length)); } /* grow + retry */
if (scan != S_SUCCESS) recdes->data = NULL; /* <- no dangling pointer */
}
else scan = heap_ovf_get (thread_p, forward_oid, recdes, NULL_CHN, NULL);

The snapshot was already validated by the caller — no re-check. The retry reallocates to -recdes->length (the negative size heap_ovf_get returns); non-success nulls recdes->data.
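The grow-and-retry convention (report the needed size as a negative length, then reallocate and try again) can be sketched in isolation. Everything here is an assumption made for illustration: `toy_fetch` stands in for the overflow read, `realloc` stands in for the scan cache's area management, and none of the `toy_` names exist in CUBRID.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef enum { TOY_S_SUCCESS, TOY_S_DOESNT_FIT, TOY_S_ERROR } toy_scan_code;

typedef struct
{
  char *data;
  int area_size;   /* capacity of data */
  int length;      /* bytes stored, or -needed after TOY_S_DOESNT_FIT */
} toy_recdes;

/* Hypothetical fetch: reports the required size as a negative length. */
static toy_scan_code
toy_fetch (const char *src, int src_len, toy_recdes *rec)
{
  if (src_len > rec->area_size)
    {
      rec->length = -src_len;
      return TOY_S_DOESNT_FIT;
    }
  memcpy (rec->data, src, (size_t) src_len);
  rec->length = src_len;
  return TOY_S_SUCCESS;
}

/* Grow the area to the reported size and retry until the record fits. */
static toy_scan_code
toy_fetch_growing (const char *src, int src_len, toy_recdes *rec)
{
  toy_scan_code scan;

  while ((scan = toy_fetch (src, src_len, rec)) == TOY_S_DOESNT_FIT)
    {
      char *bigger = (char *) realloc (rec->data, (size_t) (-rec->length));
      if (bigger == NULL)
        {
          free (rec->data);
          rec->data = NULL;   /* no dangling pointer on failure */
          rec->area_size = 0;
          return TOY_S_ERROR;
        }
      rec->data = bigger;
      rec->area_size = -rec->length;
    }
  return scan;
}
```

Starting from an empty descriptor, one failed attempt is enough to learn the exact size, so the loop normally runs at most twice.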

5.9 The legacy CHN fast path — heap_get_if_diff_chn

heap_get_if_diff_chn is the pre-context-refactor primitive, guarded by #if defined(ENABLE_UNUSED_FUNCTION) and not compiled into production builds. Its logic — peek only the header, skip the data COPY when scan == S_SUCCESS_CHN_UPTODATE — now lives as the CHN short-circuit in heap_get_visible_version_internal (5.5), which is the source of truth.

5.10 The scan variant — heap_next and heap_next_internal

A scan walks the page chain, iterates slots, filters non-object slots, and qualifies survivors through the same machinery (heap_scan_get_visible_version). heap_next, heap_prev, heap_next_sampling, and the *_record_info variants are one-line wrappers over heap_next_internal.

flowchart TD
    A[heap_next_internal: start at hpgid/slot0, heap_get_last_vpid, or resume next_oid] --> P[outer loop: per page]
    P --> PG{cache page right VPID?}
    PG -->|stale| SW[stash to old_page_watcher then fetch]
    PG -->|missing| FX[heap_scan_pb_lock_and_fetch]
    PG -->|hit| IT
    SW --> IT
    FX --> IT[inner loop: spage_next_record PEEK]
    IT --> T{slot type}
    T -->|slot0 / REC_NEWHOME / REC_ASSIGN_ADDRESS / REC_UNKNOWN| IT
    T -->|S_END| NX[heap_vpid_next]
    T -->|object| QV[heap_scan_get_visible_version]
    NX -->|NULL_PAGEID| END[return S_END]
    NX -->|more| P
    QV -->|S_SUCCESS, right class| RET[set next_oid, return S_SUCCESS]
    QV -->|S_SUCCESS wrong class / NOT_SATISFIED / DOESNT_EXIST| IT
    QV -->|S_ERROR| ERR[return error]

Figure 5-4. heap_next_internal page-chain + slot iteration.

Beyond the flowchart:

  • Latch-retention stash. The previous page stays fixed in old_page_watcher until the next fix succeeds (avoiding thrash); when the right page is already cached, no fix is needed.
  • Filtering. The scan skips REC_NEWHOME (reached only via its relocation), REC_ASSIGN_ADDRESS/REC_UNKNOWN, and slot HEAP_HEADER_AND_CHAIN_SLOTID. The exception is the get_rec_info variant (spage_next_record_dont_skip_empty), which wants every slot.
  • Qualification. cache_last_fix_page is forced true around heap_scan_get_visible_version so the home page stays cached for the next call; that qualifier has a REC_HOME+PEEK shortcut (MVCC_IS_HEADER_ALL_VISIBLE or snapshot satisfied → peeked recdes, no full context).

5.11 Stepping the page chain — heap_vpid_next

heap_vpid_next reads slot HEAP_HEADER_AND_CHAIN_SLOTID and returns the successor VPID:

// heap_vpid_next -- src/storage/heap_file.c
if (spage_get_record (thread_p, pgptr, HEAP_HEADER_AND_CHAIN_SLOTID, &recdes, PEEK) != S_SUCCESS)
{ VPID_SET_NULL (next_vpid); ret = ER_FAILED; }
else
{
pgbuf_get_vpid (pgptr, next_vpid);
if (next_vpid->pageid == hfid->hpgid && next_vpid->volid == hfid->vfid.volid)
*next_vpid = ((HEAP_HDR_STATS *) recdes.data)->next_vpid; /* <- header page */
else
*next_vpid = ((HEAP_CHAIN *) recdes.data)->next_vpid; /* <- normal page */
}

The single non-obvious branch: the first page stores its link inside HEAP_HDR_STATS (which embeds a chain), every other page in a bare HEAP_CHAIN — picked by comparing the current VPID against hfid->hpgid (both structs in Chapter 1). A NULL_PAGEID terminates the walk.
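A toy version of the chain walk makes the header-page special case concrete. The layouts below are invented for the sketch (the real HEAP_HDR_STATS and HEAP_CHAIN carry many more fields), page ids double as array indexes, and all `toy_` names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_NULL_PAGEID (-1)

/* Hypothetical layouts: the header page keeps its link inside a stats
   record, every other page inside a bare chain record. The link sits at
   a different offset in each, so the right struct must be chosen. */
typedef struct { int num_pages; int num_recs; int next_pageid; } toy_hdr_stats;
typedef struct { int prev_pageid; int next_pageid; } toy_chain;

typedef struct
{
  int pageid;
  union { toy_hdr_stats stats; toy_chain chain; } slot0;
} toy_page;

/* Pick the struct to read by comparing against the header page id. */
static int
toy_vpid_next (const toy_page *page, int header_pageid)
{
  if (page->pageid == header_pageid)
    {
      return page->slot0.stats.next_pageid;   /* header page */
    }
  return page->slot0.chain.next_pageid;       /* normal page */
}

/* Count pages by following links until the null page id. */
static int
toy_count_pages (const toy_page *pages, size_t n_pages, int header_pageid)
{
  int count = 0;
  int cur = header_pageid;

  while (cur != TOY_NULL_PAGEID && (size_t) cur < n_pages)
    {
      count++;
      cur = toy_vpid_next (&pages[cur], header_pageid);
    }
  return count;
}
```

Reading the header page through the chain layout would return the wrong field in this model, which mirrors why the real function compares against hfid->hpgid before casting.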

  1. One context, three data-bearing types. Only REC_HOME/REC_RELOCATION/REC_BIGONE hold data; every other type → S_DOESNT_EXIST (scan) or an error (point read).
  2. heap_prepare_get_context is the branch hub — fixes home, reads the type, and for relocation/bigone fixes the forward page, with a single try_again retry on a mid-flight home unfix and assert(false) for the immutable BIGONE case.
  3. The forward-consistency invariant (REC_HOME xor non-null forward OID + forward page) is asserted at every consumer, letting heap_get_mvcc_header and the copier dereference the forward page blindly.
  4. Snapshot verdicts are asymmetric. TOO_OLD → S_SNAPSHOT_NOT_SATISFIED; SNAPSHOT_SATISFIED → copy; TOO_NEW → walk prev_version_lsa into the undo log until satisfied or exhausted (S_DOESNT_EXIST).
  5. PEEK aliases the page, COPY duplicates into an area. REC_RELOCATION is always COPY; the CHN short-circuit runs unconditionally after the snapshot block.
  6. The scan path reuses the point-read machinery plus heap_vpid_next chain walking, slot filtering, and cache_last_fix_page latch retention. heap_get_if_diff_chn is legacy; heap_get_last_version is the snapshot-free sibling for updaters/lockers.

Chapter 6: Update Flow and Record Type Transitions

A heap update is the most branch-heavy operation in the manager: the new image may shrink, stay, or grow, and the source slot may be REC_HOME, a REC_RELOCATION + REC_NEWHOME pair, or a REC_BIGONE overflow. CUBRID’s contract is the OID never moves (res_oid is reset to oid), so growth is absorbed by changing the physical representation behind the stable home slot. This chapter traces how heap_update_logical dispatches on the current type and how each worker chooses among in-place / relocate / overflow. For the record-type taxonomy and MVCC header layout see Chapter 3; for insert-side placement see Chapter 4; for the rationale see cubrid-heap-manager.md.

6.1 Entry: context creation and the MVCC/in-place fork

heap_create_update_context is a pure initializer — no I/O — recording the OID, class OID, the new image (recdes_p), and the update_in_place style:

// heap_create_update_context -- src/storage/heap_file.c
COPY_OID (&context->oid, oid_p);
// ... condensed: COPY_OID (&context->class_oid, ...); context->scan_cache_p = scancache_p; ...
context->recdes_p = recdes_p; /* the new image */
context->type = HEAP_OPERATION_UPDATE;
context->update_in_place = in_place; /* <- in-place vs MVCC switch */

UPDATE_INPLACE_STYLE (in heap_file.h) has three values: UPDATE_INPLACE_NONE (default MVCC path — new version with fresh INSID + a prev_version_lsa chain entry); UPDATE_INPLACE_CURRENT_MVCCID (destructive in-place rewrite stamping the current MVCCID, no new version); UPDATE_INPLACE_OLD_MVCCID (destructive rewrite preserving old MVCC IDs — replication / redistribution). The whole-operation toggle is computed once in heap_update_logical via is_mvcc_op = HEAP_UPDATE_IS_MVCC_OP (is_mvcc_class, update_in_place), where is_mvcc_class = !mvcc_is_mvcc_disabled_class (&class_oid) and the macro is is_mvcc_class && !HEAP_IS_UPDATE_INPLACE(style).

INVARIANT — MVCC and in-place are mutually exclusive. is_mvcc_op is true iff the class is MVCC-enabled and the style is UPDATE_INPLACE_NONE. On non-SERVER_MODE builds HEAP_UPDATE_IS_MVCC_OP is hard-coded false, so standalone tools always take the in-place arm.
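The fork reduces to a two-input truth table. The sketch below mirrors the described macro with hypothetical names (`toy_inplace_style`, `toy_update_is_mvcc_op`); it models only the SERVER_MODE behaviour and is not the CUBRID macro itself.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical mirror of the three update styles described above. */
typedef enum
{
  TOY_INPLACE_NONE,
  TOY_INPLACE_CURRENT_MVCCID,
  TOY_INPLACE_OLD_MVCCID
} toy_inplace_style;

/* Any style other than NONE is a destructive in-place rewrite. */
static bool
toy_is_update_inplace (toy_inplace_style style)
{
  return style != TOY_INPLACE_NONE;
}

/* MVCC path only for an MVCC-enabled class with no in-place style. */
static bool
toy_update_is_mvcc_op (bool is_mvcc_class, toy_inplace_style style)
{
  return is_mvcc_class && !toy_is_update_inplace (style);
}
```

Of the six input combinations, only (MVCC-enabled class, no in-place style) selects the versioning path; every other combination takes the in-place arm.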

6.2 heap_update_logical: locate, fetch, adjust, dispatch

After validating the scancache, the file type (FILE_HEAP / FILE_HEAP_REUSE_SLOTS only), and the OID (heap_is_valid_oid), the function fixes the home page (heap_get_record_location) and reads record_type = spage_get_record_type; a REC_UNKNOWN slot is ER_HEAP_UNKNOWN_OBJECT. It then copies the home record into home_recdes (the undo source, so COPY rather than PEEK), calls heap_update_adjust_recdes_header (Section 6.7) on the new image for a real user class (!OID_IS_ROOTOID), decides do_supplemental_log (CDC), and dispatches:

// heap_update_logical -- src/storage/heap_file.c
switch (context->record_type)
{
case REC_RELOCATION: rc = heap_update_relocation (thread_p, context, is_mvcc_op); break;
case REC_BIGONE: rc = heap_update_bigone (thread_p, context, is_mvcc_op); break;
case REC_ASSIGN_ADDRESS:
context->is_logical_old = false; /* <- inserted this tran, not an old version */
[[fallthrough]];
case REC_HOME: rc = heap_update_home (thread_p, context, is_mvcc_op); break;
default:
rc = ER_HEAP_BAD_OBJECT_TYPE; goto exit;
}

The REC_ASSIGN_ADDRESS fallthrough routes a just-reserved address slot through heap_update_home flagged not logically-old. On exit the home watcher is handed to the scancache (if cache_last_fix_page) or unfixed, and heap_unfix_watchers releases the rest.

6.3 heap_update_home: REC_HOME as the source type

heap_update_home handles REC_HOME (and REC_ASSIGN_ADDRESS), picking one of three destinations in strict priority — overflow, in-place, else relocation:

// heap_update_home -- src/storage/heap_file.c
if (heap_is_big_length (context->recdes_p->length)) { /* 1. overflow */
heap_ovf_insert (..., &forward_oid, context->recdes_p);
heap_build_forwarding_recdes (&forwarding_recdes, REC_BIGONE, &forward_oid);
home_page_updated_recdes_p = &forwarding_recdes;
}
else if (!spage_is_updatable (..., context->recdes_p->length)) { /* 3. relocate */
context->recdes_p->type = REC_NEWHOME;
heap_insert_newhome (..., context->recdes_p, &forward_oid, newhome_pg_watcher_p);
heap_build_forwarding_recdes (&forwarding_recdes, REC_RELOCATION, &forward_oid);
home_page_updated_recdes_p = &forwarding_recdes;
}
else { /* 2. in place: stays REC_HOME */
context->recdes_p->type = REC_HOME;
home_page_updated_recdes_p = context->recdes_p;
}

These are rows 1–3 of Section 6.8. After choosing, a re-peek guard fires when the destination is a forwarder and home_page_watcher_p->page_was_unfixed (the page was released during the ordered second-page fix, possibly vacuumed/compacted): it re-reads home_recdes so the logged undo image matches current bytes. It then logs (heap_log_update_physical, RVHF_UPDATE_NOTIFY_VACUUM for MVCC else RVHF_UPDATE; a live REC_ASSIGN_ADDRESS also uses NOTIFY_VACUUM), captures prev_version_lsa, calls heap_update_physical, and for MVCC ops calls heap_update_set_prev_version.

INVARIANT — REC_ASSIGN_ADDRESS is never MVCC-updated. An early guard (!HEAP_IS_UPDATE_INPLACE && home_recdes.type == REC_ASSIGN_ADDRESS) returns ER_FAILED with assert(false): a reservation has no MVCC header to version, so only the non-MVCC in-place arm fills it.

flowchart TD
  A["heap_update_home"] --> Z{REC_ASSIGN_ADDRESS\n and MVCC op?}
  Z -->|yes| ZF["assert false; ER_FAILED"]
  Z -->|no| B{heap_is_big_length?}
  B -->|yes| C["REC_BIGONE forwarder"]
  B -->|no| D{spage_is_updatable home?}
  D -->|no| E["REC_RELOCATION + REC_NEWHOME"]
  D -->|yes| F["REC_HOME in place"]
  C --> H["re-peek if unfixed; log; physical; set_prev_version if mvcc"]
  E --> H
  F --> H

Figure 6-1: heap_update_home branch tree.

6.4 heap_update_relocation: the REC_RELOCATION + REC_NEWHOME source

The densest worker. The home slot holds a REC_RELOCATION OID pointing at a REC_NEWHOME on the forward page. It reads forward_oid, fixes the forward page (heap_fix_forward_page), and computes two predicates:

// heap_update_relocation -- src/storage/heap_file.c
fits_in_home = spage_is_updatable (... home slot ..., context->recdes_p->length);
fits_in_forward = spage_is_updatable (... forward slot ..., context->recdes_p->length);
if (heap_is_big_length (context->recdes_p->length) || (!fits_in_forward && !fits_in_home))
heap_fix_header_page (thread_p, context); /* header page needed for overflow or a new newhome */

A four-way decision then sets three booleans (update_old_home, update_old_forward, remove_old_forward) and the new home image — rows 4–7 of Section 6.8, plus an impossible else (assert(false); ER_FAILED). fits_in_home beats fits_in_forward, but only after the fits-nowhere case has spawned a new relocation. Before any I/O, two asserts encode the bookkeeping invariant: assert (remove_old_forward != update_old_forward) and assert (remove_old_forward == update_old_home).

INVARIANT — the stale REC_NEWHOME is reconciled exactly once. The old REC_NEWHOME is either deleted (remove_old_forward, when the home slot changes meaning) or overwritten (update_old_forward, stay-relocated) — never both, never neither. The two asserts above crash on violation in debug builds: removed XOR updated, and removed iff the home is rewritten.
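The bookkeeping can be checked exhaustively in a few lines. This sketch assumes the decision order given above (big, fits neither, fits home, fits forward) and uses hypothetical names (`toy_reloc_plan`, `toy_plan_relocation_update`); it models only the three booleans, not the page I/O or logging.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical container for the three booleans named in the text. */
typedef struct
{
  bool update_old_home;
  bool update_old_forward;
  bool remove_old_forward;
} toy_reloc_plan;

static toy_reloc_plan
toy_plan_relocation_update (bool is_big, bool fits_in_home, bool fits_in_forward)
{
  toy_reloc_plan plan = { false, false, false };

  if (is_big)
    {
      /* home becomes an overflow forwarder; old newhome is freed */
      plan.update_old_home = true;
      plan.remove_old_forward = true;
    }
  else if (!fits_in_forward && !fits_in_home)
    {
      /* home points at a new newhome; old newhome is freed */
      plan.update_old_home = true;
      plan.remove_old_forward = true;
    }
  else if (fits_in_home)
    {
      /* image moves back into the home slot; old newhome is freed */
      plan.update_old_home = true;
      plan.remove_old_forward = true;
    }
  else
    {
      /* fits only in the forward slot: stay relocated, overwrite in place */
      plan.update_old_forward = true;
    }

  /* The stale newhome is reconciled exactly once. */
  assert (plan.remove_old_forward != plan.update_old_forward);
  assert (plan.remove_old_forward == plan.update_old_home);
  return plan;
}
```

Enumerating all eight input combinations shows that both asserts hold on every path, and that only the "fits forward but not home, not big" case keeps the old newhome.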

Up to three conditional blocks then run (home rewrite via heap_log_update_physical; old-newhome free via heap_log_delete_physical + heap_delete_physical; or in-place forward rewrite logging heap_mvcc_log_home_no_change). The stay-relocated branch sets prev_version_lsa to the forward-update LSA; the others use the heap_log_delete_physical undo LSA of the deleted old newhome — that record is the prior version. MVCC ops then call heap_update_set_prev_version with newhome_pg_watcher_p (fresh newhome) or the existing forward watcher.

6.5 heap_update_bigone: the REC_BIGONE source

The home slot holds a REC_BIGONE forwarder; the body lives in the overflow file. ovf_oid is read from home_recdes.data, the header page fixed, and — for MVCC ops — the old overflow content is read and logged under RVHF_MVCC_UPDATE_OVERFLOW before anything changes:

// heap_update_bigone -- src/storage/heap_file.c
context->ovf_oid = *((OID *) context->home_recdes.data);
heap_fix_header_page (thread_p, context);
if (is_mvcc_op) {
heap_get_bigone_content (... &context->ovf_oid, &ovf_recdes); /* old version image */
log_append_undo_recdes2 (thread_p, RVHF_MVCC_UPDATE_OVERFLOW, &ovf_vfid, first_pgptr, -1, &ovf_recdes);
or_mvcc_set_log_lsa_to_record (context->recdes_p, logtb_find_current_tran_lsa (thread_p)); /* prev_version_lsa */
}

INVARIANT — the overflow path stamps prev_version before the body changes. For REC_BIGONE the prev-version LSA is the RVHF_MVCC_UPDATE_OVERFLOW undo-record LSA, wired into the new image’s header here because heap_update_set_prev_version (6.7) is not called by this worker.

The body update then forks three ways — rows 8–10 of Section 6.8:

// heap_update_bigone -- src/storage/heap_file.c
if (heap_is_big_length (context->recdes_p->length)) { /* overflow -> overflow */
is_old_home_updated = false;
heap_ovf_update (..., &context->ovf_oid, context->recdes_p);
if (is_mvcc_op) heap_mvcc_log_home_no_change (...); /* vacuum must still reach new overflow */
}
else if (spage_update (..., context->recdes_p) == SP_SUCCESS) { /* overflow -> home */
is_old_home_updated = true;
context->record_type = context->recdes_p->type = REC_HOME;
spage_update_record_type (..., REC_HOME); new_home_recdes = *context->recdes_p;
}
else { /* overflow -> relocation */
context->recdes_p->type = REC_NEWHOME;
heap_insert_newhome (..., context->recdes_p, &newhome_oid, NULL); /* + REC_RELOCATION forwarder */
is_old_home_updated = true;
}

Cleanup is keyed by is_old_home_updated. It is false in the stay-overflow branch, where the home slot is untouched and only heap_mvcc_log_home_no_change runs under MVCC. It is true in both shrinking branches (overflow → home and overflow → relocation), where a trailing block logs the home change (RVHF_UPDATE_NOTIFY_VACUUM / RVHF_UPDATE) and heap_ovf_delete reclaims the orphaned overflow chain. A NULL return from heap_ovf_update propagates as ASSERT_ERROR_AND_SET; goto exit.

6.6 heap_insert_newhome and heap_ovf_update — the placement helpers

heap_insert_newhome is shared by all three workers (and the delete path) to materialize a relocated REC_NEWHOME: it builds a fresh insert context on the parent’s HFID/class, places the record via best-space search (Chapter 9), logs a non-MVCC RVHF_INSERT_NEWHOME record as described in Chapter 4 (vacuum is not notified, since it never scans REC_NEWHOME), and copies the OID out:

// heap_insert_newhome -- src/storage/heap_file.c
heap_create_insert_context (&ins_context, &parent_context->hfid, &parent_context->class_oid, recdes_p, NULL);
heap_find_location_and_insert_rec_newhome (thread_p, &ins_context);
heap_log_insert_physical (... ins_context.recdes_p, false, false); /* non-MVCC; logged as RVHF_INSERT_NEWHOME */
if (out_oid_p != NULL) COPY_OID (out_oid_p, &ins_context.res_oid);
if (newhome_pg_watcher != NULL) /* hand fixed page back */
pgbuf_replace_watcher (thread_p, ins_context.home_page_watcher_p, newhome_pg_watcher);

When newhome_pg_watcher is non-NULL (the MVCC relocate branches), the newly-fixed page is handed back via pgbuf_replace_watcher rather than unfixed, so heap_update_set_prev_version can later patch the prev-version field in place. heap_ovf_update is a thin wrapper resolving the overflow VFID (heap_ovf_find_vfid) and delegating to overflow_update, returning ovf_oid or NULL.

6.7 Header stamping: heap_update_adjust_recdes_header and heap_update_set_prev_version

These two split the MVCC version-chain work across the operation’s timeline. heap_update_adjust_recdes_header runs up front (from heap_update_logical), rewriting the new image’s header with a NULL prev-version LSA (the undo LSA is unknown until logging). Its optimized path (MVCC op, image not big, source header has no DELID) reserves room for INSID + prev-version LSA in one memmove, stamps the fresh MVCCID, and writes a placeholder NULL LSA:

// heap_update_adjust_recdes_header -- src/storage/heap_file.c
update_mvcc_flags = OR_MVCC_FLAG_VALID_INSID | OR_MVCC_FLAG_VALID_PREV_VERSION;
mvcc_id = logtb_get_current_mvccid (thread_p);
if ((mvcc_flags & update_mvcc_flags) != update_mvcc_flags) {
repid_and_flag_bits |= (update_mvcc_flags << OR_MVCC_FLAG_SHIFT_BITS);
memmove (new_data_p, existing_data_p, ...); /* room for INSID + LSA; ... condensed ... */
}
OR_PUT_BIGINT (new_ins_mvccid_pos_p, &mvcc_id); /* fresh INSID */
memcpy (new_ins_mvccid_pos_p + OR_MVCCID_SIZE, &null_lsa, ...); /* placeholder LSA */

The slow path (or_mvcc_get_header + per-flag editing) handles the rest: UPDATE_INPLACE_OLD_MVCCID keeps the old INSID; the non-MVCC arm strips all MVCC flags (MVCC_CLEAR_ALL_FLAG_BITS); big records force OR_MVCC_MAX_HEADER_SIZE. Every MVCC arm leaves prev_version_lsa NULL.

heap_update_set_prev_version runs after the physical update and logging, patching the now-known undo LSA into the record via PEEK (no spage_update), dispatching on the home slot’s current type:

// heap_update_set_prev_version -- src/storage/heap_file.c
spage_get_record (... home_pg_watcher->pgptr, oid->slotid, &recdes, PEEK);
if (recdes.type == REC_HOME) { /* patch home record */
or_mvcc_set_log_lsa_to_record (&recdes, prev_version_lsa);
} else if (recdes.type == REC_RELOCATION) { /* follow to REC_NEWHOME, patch it */
forward_oid = *((OID *) recdes.data);
spage_get_record (... fwd_pg_watcher->pgptr, forward_oid.slotid, &forward_recdes, PEEK);
or_mvcc_set_log_lsa_to_record (&forward_recdes, prev_version_lsa);
} else if (recdes.type == REC_BIGONE) { /* patch overflow OR header */
forward_recdes.data = overflow_get_first_page_data (overflow_pg_watcher.pgptr);
or_mvcc_set_log_lsa_to_record (&forward_recdes, prev_version_lsa);
} else { assert (false); error_code = ER_FAILED; } /* each arm pgbuf_set_dirty's its page */

INVARIANT — prev_version_lsa points at the undo image of the immediately prior version. heap_update_adjust_recdes_header reserves the slot; heap_update_set_prev_version (home/relocation) or the inline stamp in heap_update_bigone fills it with the logging-time LSA. The REC_RELOCATION arm needs fwd_pg_watcher on the forward page — hence the pgbuf_replace_watcher handoff in 6.6.

6.8 The full old-type × new-size transition matrix

Combining the three workers, every reachable transition is:

| Source type | Size condition | Resulting home type | Body action |
|---|---|---|---|
| REC_HOME | big | REC_BIGONE | new overflow inserted |
| REC_HOME | not big, no home fit | REC_RELOCATION | new REC_NEWHOME |
| REC_HOME | not big, fits home | REC_HOME | in place |
| REC_RELOCATION | big | REC_BIGONE | new overflow; old newhome deleted |
| REC_RELOCATION | not big, fits neither | REC_RELOCATION | new newhome; old newhome deleted |
| REC_RELOCATION | not big, fits home | REC_HOME | image into home; old newhome deleted |
| REC_RELOCATION | not big, fits fwd | REC_RELOCATION (unchanged) | old newhome updated in place |
| REC_BIGONE | big | REC_BIGONE (unchanged) | heap_ovf_update in place |
| REC_BIGONE | not big, spage_update ok | REC_HOME | overflow deleted |
| REC_BIGONE | not big, spage_update fails | REC_RELOCATION | new newhome; overflow deleted |

The unifying priority across all three sources is overflow > home > relocate, with heap_update_relocation’s one specialization that reusing the already-fixed forward slot (stay-relocated) beats allocating a new newhome.
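The matrix can be condensed into one function, which also makes the priority rule testable. This is a sketch derived only from the ten rows above: `toy_transition` and `toy_update_transition` are hypothetical, `fits_home` stands in for both spage_is_updatable on the home slot and a successful spage_update in the REC_BIGONE case, and `fits_forward` is meaningful only for a REC_RELOCATION source.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum { TOY_REC_HOME, TOY_REC_RELOCATION, TOY_REC_BIGONE } toy_rec_type;

typedef struct
{
  toy_rec_type home_type;   /* type left in the home slot */
  bool reuses_old_body;     /* old newhome / overflow rewritten in place */
  bool frees_old_body;      /* old newhome / overflow deleted */
} toy_transition;

static toy_transition
toy_update_transition (toy_rec_type src, bool is_big, bool fits_home, bool fits_forward)
{
  toy_transition t;

  /* Resulting home type: overflow first, then home, else relocation. */
  if (is_big)
    {
      t.home_type = TOY_REC_BIGONE;
    }
  else if (fits_home)
    {
      t.home_type = TOY_REC_HOME;
    }
  else
    {
      t.home_type = TOY_REC_RELOCATION;
    }

  /* The old out-of-home body is reused only when the record stays put:
     overflow -> overflow, or relocation that fits the forward slot only. */
  t.reuses_old_body = (src == TOY_REC_BIGONE && is_big)
    || (src == TOY_REC_RELOCATION && !is_big && !fits_home && fits_forward);

  /* Any other pre-existing out-of-home body is freed. */
  t.frees_old_body = (src != TOY_REC_HOME) && !t.reuses_old_body;
  return t;
}
```

One consequence visible in the sketch: in this table the resulting home type depends only on the size conditions, while the source type decides what happens to the old body.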

  1. heap_update_logical dispatches on the current home-slot type into one of three workers; the OID never moves (res_oid = oid), only the physical representation behind it changes.
  2. The MVCC-vs-in-place fork is one boolean (is_mvcc_op = is_mvcc_class && update_in_place == UPDATE_INPLACE_NONE), asserted mutually exclusive; the in-place styles stamp the current or preserve the old MVCCID.
  3. Each worker chooses its destination in the priority overflow > home > relocate; heap_update_relocation adds a stay-in-forward specialization preferring the already-fixed forward slot.
  4. heap_update_relocation’s three booleans are guarded by two asserts that reconcile the stale REC_NEWHOME exactly once — deleted when the home meaning changes, updated when it stays relocated, never both or neither.
  5. Header stamping is split in time: heap_update_adjust_recdes_header reserves INSID + a NULL prev-version LSA up front; the LSA is patched later (by heap_update_set_prev_version, or inline for bigone under RVHF_MVCC_UPDATE_OVERFLOW), keeping the read-path version chain (Chapter 5) intact.

Chapter 7: Delete Flow and Tombstoning

A DELETE in CUBRID is not “remove the bytes.” Under MVCC the record body must survive so a snapshot that began before the delete committed can still see the old row. An MVCC delete is therefore almost identical to the in-place update of Chapter 6 — it stamps a delete MVCCID on the header and leaves the payload intact — and only physical deletes (non-MVCC tables, plus the eventual freeing of forwarded bodies) call spage_delete. This chapter answers: how does DELETE leave the record in place for readers, where does it diverge from UPDATE, and when is a slot physically torn out instead? (Reading the stamped mvcc_del_id back is Chapter 5.)

7.1 Entry point: heap_create_delete_context and the delete dispatcher


DELETE enters through a HEAP_OPERATION_CONTEXT from heap_create_delete_context. Unlike the update context-builder (Chapter 6) it is deliberately bare — there is no recdes_p: the new record is synthesized during the delete from the record already on the page, so the caller supplies nothing but the OID.

// heap_create_delete_context -- src/storage/heap_file.c
heap_clear_operation_context (context, hfid_p);
COPY_OID (&context->oid, oid_p); COPY_OID (&context->class_oid, class_oid_p);
context->scan_cache_p = scancache_p;
context->type = HEAP_OPERATION_DELETE; /* <- no recdes_p */

heap_delete_logical drives the branches mapped in Figure 7-1: input validation (heap_is_valid_oid / heap_scancache_check_with_hfid, failing to ER_FAILED before any page is touched); a file-type guard (anything but FILE_HEAP/FILE_HEAP_REUSE_SLOTS is fatal); the root-class case (heap_mark_class_as_modified); the MVCC decision (invariant below); locate under X_LOCK via heap_get_record_location + spage_get_record_type (a REC_UNKNOWN slot raises ER_HEAP_UNKNOWN_OBJECT); a COPY snapshot into context->home_recdes; and the record_type dispatch:

// heap_delete_logical -- src/storage/heap_file.c
#if defined (SERVER_MODE)
if (mvcc_is_mvcc_disabled_class (&context->class_oid))
is_mvcc_op = false; /* <- catalog/system class */
else
is_mvcc_op = true; /* <- ordinary user table */
#else
is_mvcc_op = false; /* <- standalone (SA) mode */
#endif
// ... condensed ...
switch (context->record_type)
{
case REC_BIGONE: rc = heap_delete_bigone (thread_p, context, is_mvcc_op); break;
case REC_RELOCATION: rc = heap_delete_relocation (thread_p, context, is_mvcc_op); break;
case REC_HOME:
case REC_ASSIGN_ADDRESS: rc = heap_delete_home (thread_p, context, is_mvcc_op); break;
default: /* REC_NEWHOME, REC_MARKDELETED, ... reached directly => bug */
er_set (..., ER_HEAP_BAD_OBJECT_TYPE, ...); rc = ER_FAILED; goto error;
}
flowchart TD
  A["heap_delete_logical"] --> B{"valid oid &\nFILE_HEAP?"}
  B -- no --> ERR["ER_FAILED / fatal"]
  B -- yes --> D{"mvcc disabled\nclass / SA?"}
  D -- yes --> E["is_mvcc_op=false"]
  D -- no --> F["is_mvcc_op=true"]
  E --> H{"record_type"}
  F --> H
  H -- REC_HOME / REC_ASSIGN_ADDRESS --> I["heap_delete_home"]
  H -- REC_RELOCATION --> J["heap_delete_relocation"]
  H -- REC_BIGONE --> K["heap_delete_bigone"]
  H -- other --> L["ER_HEAP_BAD_OBJECT_TYPE"]

Figure 7-1. heap_delete_logical branch map (rootclass + X_LOCK locate elided). The three-way worker split mirrors the update dispatch of Chapter 6.

Invariant 7-A — the MVCC/physical decision is made once, from the class, and threaded down unchanged. is_mvcc_op is computed once (!mvcc_is_mvcc_disabled_class on the class OID, false standalone) and passed to every worker; they never re-derive it. Wrong here, a user row would be physically deleted and vanish from snapshots that should still see it.

7.2 heap_delete_home — the REC_HOME / REC_ASSIGN_ADDRESS path


heap_delete_home shares the spine of the update worker heap_update_home (Chapter 6) — read flags, build via a fast/slow path, classify, relocate if it no longer fits — but diverges precisely:

| Aspect | UPDATE worker (Ch 6) | DELETE worker (this chapter) |
|---|---|---|
| New payload | caller’s context->recdes_p | synthesized from the existing record |
| Header stamp | new repid/data, prev_version_lsa rewritten | mvcc_del_id set; prev_version_lsa NOT written |
| Size delta | arbitrary (shrink or grow a lot) | grows by at most OR_MVCCID_SIZE (8B) if DELID was absent |
| Body bytes | replaced | copied through verbatim |
| Forwarder/overflow | may be created or freed | touched only when it must be (REC_BIGONE: edit in place) |

Invariant 7-B — an MVCC delete never writes prev_version_lsa. A delete creates no new version, so heap_delete_adjust_header sets only OR_MVCC_FLAG_VALID_DELID + the DELID, and the fast path preserves existing prev-version bytes by copying from delid_offset onward. Break this and a reader following the chain from a later version loops back into a dead record.

After re-fetching the record if the page was unfixed (the vacuum-might-have-shrunk-it guard), heap_delete_home splits on is_mvcc_op. The MVCC branch builds the death-stamped record by one of two paths gated by use_optimization:

// heap_delete_home -- src/storage/heap_file.c
repid_and_flag_bits = OR_GET_MVCC_REPID_AND_FLAG (context->home_recdes.data);
mvcc_flags = (repid_and_flag_bits >> OR_MVCC_FLAG_SHIFT_BITS) & OR_MVCC_FLAG_MASK;
adjusted_size = context->home_recdes.length;
use_optimization = true;
if (!(mvcc_flags & OR_MVCC_FLAG_VALID_DELID))
{
adjusted_size += OR_MVCCID_SIZE; /* <- 8 more bytes for DELID */
is_adjusted_size_big = heap_is_big_length (adjusted_size);
if (is_adjusted_size_big) use_optimization = false; /* rare: spills to overflow */
}
else
{ /* DELID already set: re-delete of vacuum-pending row */
is_adjusted_size_big = false;
use_optimization = false;
}

The fast path (use_optimization; DELID absent, result small) is pure byte surgery — copy [0, delid_offset), OR OR_MVCC_FLAG_VALID_DELID into the leading word, splice in the 8-byte MVCCID, copy the remainder, no header parse. The slow path (DELID present, or the record would become big) calls or_mvcc_get_header, heap_delete_adjust_header, re-serializes with or_mvcc_add_header, then memcpys the body.
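
The fast-path byte surgery can be illustrated with a short sketch. This is not the CUBRID code: the flag value, the shift, and the `toy_*` names are placeholders chosen for illustration, and the leading word is read and written in native byte order rather than through the real OR_* accessors.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TOY_FLAG_VALID_DELID 0x02u   /* hypothetical flag value */
#define TOY_FLAG_SHIFT       24      /* hypothetical position of the flag bits */
#define TOY_MVCCID_SIZE      8

/* Copies src into dst while splicing an 8-byte delete id at delid_offset.
   Returns the new record length, or -1 if dst is too small. */
static int
toy_splice_delid (const unsigned char *src, int src_len, int delid_offset,
                  uint64_t delid, unsigned char *dst, int dst_cap)
{
  uint32_t lead;
  int new_len = src_len + TOY_MVCCID_SIZE;

  assert (delid_offset >= (int) sizeof (lead) && delid_offset <= src_len);
  if (dst_cap < new_len)
    return -1;

  /* 1. Bytes before the DELID slot are copied unchanged. */
  memcpy (dst, src, (size_t) delid_offset);

  /* 2. OR the "DELID present" flag into the leading word. */
  memcpy (&lead, dst, sizeof (lead));
  lead |= (uint32_t) TOY_FLAG_VALID_DELID << TOY_FLAG_SHIFT;
  memcpy (dst, &lead, sizeof (lead));

  /* 3. Splice in the delete id. */
  memcpy (dst + delid_offset, &delid, TOY_MVCCID_SIZE);

  /* 4. Everything from delid_offset onward follows verbatim. */
  memcpy (dst + delid_offset + TOY_MVCCID_SIZE, src + delid_offset,
          (size_t) (src_len - delid_offset));
  return new_len;
}
```

No header is parsed at any point: the only inputs are the offset where the DELID belongs and the id itself. That is why this path applies only when the DELID slot is absent and the result stays small.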

// heap_delete_adjust_header -- src/storage/heap_file.c
MVCC_SET_FLAG_BITS (header_p, OR_MVCC_FLAG_VALID_DELID);
MVCC_SET_DELID (header_p, mvcc_id); /* <- the death stamp, nothing else */
if (need_mvcc_header_max_size)
HEAP_MVCC_SET_HEADER_MAXIMUM_SIZE (header_p); /* <- only when spilling to REC_BIGONE */

The built record is then classified: REC_BIGONE if is_adjusted_size_big, else REC_NEWHOME if !spage_is_updatable (... built_recdes.length), else REC_HOME. Each class is acted on differently (Figure 7-2). REC_HOME (the common case) does no relocation: heap_mvcc_log_delete (..., RVHF_MVCC_DELETE_REC_HOME), then heap_update_physical overwrites the slot in place, OID and body unchanged. REC_NEWHOME (the record no longer fits) calls heap_insert_newhome, builds a REC_RELOCATION forwarder via heap_build_forwarding_recdes, logs heap_mvcc_log_home_change_on_delete, then heap_update_physical writes the home slot. REC_BIGONE (the record crossed the big-length threshold) follows the same steps, except that the body is inserted with heap_ovf_insert into overflow.

Non-MVCC branch (is_mvcc_op == false): no header games — call heap_log_delete_physical (..., is_reusable, ...) with is_reusable = heap_is_reusable_oid (context->file_type), then heap_delete_physical. That flag, not the page, drives slot recyclability (§7.5).

flowchart TD
  A["heap_delete_home"] --> B{"is_mvcc_op?"}
  B -- no --> P["log + heap_delete_physical"]
  B -- yes --> C{"DELID present\nor size big?"}
  C -- no, small --> D["fast: set DELID flag + MVCCID"]
  C -- yes/rare --> E["slow: get/adjust/add header"]
  D --> F{"classify"}
  E --> F
  F -- REC_HOME --> G["RVHF_MVCC_DELETE_REC_HOME\nupdate in place"]
  F -- REC_NEWHOME --> H["heap_insert_newhome\nREC_RELOCATION in home"]
  F -- REC_BIGONE --> I["heap_ovf_insert\nREC_BIGONE in home"]

Figure 7-2 — heap_delete_home. The MVCC branch can promote a home record to relocated/big when the 8-byte DELID overflows its slot.

7.3 heap_delete_relocation — the already-forwarded path


When the home slot is a REC_RELOCATION forwarder, the body lives in a REC_NEWHOME slot on the forward page. heap_delete_relocation fixes the forward page (heap_fix_forward_page), peeks the forward record, and (MVCC branch) builds the death-stamped record with the §7.2 fast/slow split. Three booleans decide where it lands:

| Branch | remove_old_forward | update_old_forward | update_old_home | New home record |
|---|---|---|---|---|
| is_adjusted_size_big | yes | no | yes | REC_BIGONE forwarder |
| fits_in_home | yes | no | yes | REC_HOME (body folded back into home) |
| fits_in_forward | no | yes | no | unchanged (forward updated in place) |
| else | yes | no | yes | REC_RELOCATION to a fresh NEWHOME |
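
The branch table above can be sketched as a function that derives the three booleans from the three conditions. This is an illustrative model only: `toy_reloc_plan` and `toy_plan_relocation_delete` are hypothetical names, and the branches are tested in the order the table lists them.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical bundle of the three decision booleans. */
typedef struct
{
  bool remove_old_forward;
  bool update_old_forward;
  bool update_old_home;
} toy_reloc_plan;

static toy_reloc_plan
toy_plan_relocation_delete (bool is_adjusted_size_big, bool fits_in_home, bool fits_in_forward)
{
  toy_reloc_plan plan = { false, false, false };

  /* Only one row keeps the old forward slot: not big, not fitting home, fitting forward. */
  bool stays_in_forward = !is_adjusted_size_big && !fits_in_home && fits_in_forward;

  if (stays_in_forward)
    {
      plan.update_old_forward = true;   /* forward slot rewritten in place */
    }
  else
    {
      plan.remove_old_forward = true;   /* body moves: old forward slot is freed */
      plan.update_old_home = true;      /* home slot changes type or target */
    }

  /* The stale forward record is reconciled exactly once: removed or updated. */
  assert (plan.remove_old_forward != plan.update_old_forward);
  return plan;
}
```

The closing assertion captures the property every row of the table shares: the old forward record is either removed or updated, never both and never neither.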

Epilogues: update_old_home rewrites the home slot (heap_mvcc_log_home_change_on_delete + heap_update_physical); when update_old_home is false (the fits_in_forward case) the worker instead logs heap_mvcc_log_home_no_change to advance the home page's vacuum status; update_old_forward rewrites the forward slot in place (heap_mvcc_log_delete (..., RVHF_MVCC_DELETE_REC_NEWHOME)); remove_old_forward frees the old forward slot (log_append_undoredo_recdes (RVHF_DELETE, ...) + heap_delete_physical) and, for reusable-OID heaps, appends a RVHF_MARK_REUSABLE_SLOT postpone, since the relocated body is unreferenced.

Non-MVCC branch: physically delete both slots (home then forward, each heap_log_delete_physical + heap_delete_physical); the forward slot is always logged mark_reusable = true regardless of heap type, since a relocated record is never referenced by an index.

Invariant 7-C — heap_mvcc_log_home_no_change runs on every MVCC delete that does not rewrite the home slot. Vacuum walks the page’s vacuum chain (Chapter 8), keyed off its max MVCCID; a delete touching only the forward/overflow page would leave the home chain unaware and vacuum would never revisit. The call updates the chain via heap_page_update_chain_after_mvcc_op and, on a status flip, ORs HEAP_RV_FLAG_VACUUM_STATUS_CHANGE into the log offset so recovery rebuilds it.

7.4 heap_delete_bigone — editing the fixed-size overflow header


A REC_BIGONE home slot holds only an overflow OID; the real record (with its full-size MVCC header) sits on the overflow page. Rather than move the body, the MVCC branch edits the header in place on the overflow page:

// heap_delete_bigone -- src/storage/heap_file.c
overflow_oid = *((OID *) context->home_recdes.data); /* home holds only the OID */
// ... fix overflow page WRITE, check PAGE_OVERFLOW ptype ...
heap_get_mvcc_rec_header_from_overflow (... &overflow_header ...);
heap_mvcc_log_delete (thread_p, &log_addr, RVHF_MVCC_DELETE_OVERFLOW);
heap_delete_adjust_header (&overflow_header, mvcc_id, false); /* <- false: already max size */
rc = heap_set_mvcc_rec_header_on_overflow (context->overflow_page_watcher_p->pgptr, &overflow_header);

heap_set_mvcc_rec_header_on_overflow is the dedicated mutator. Because overflow headers are always stored at maximum size (every flag present, slots pre-reserved), the DELID is written without moving a payload byte:

// heap_set_mvcc_rec_header_on_overflow -- src/storage/heap_file.c
ovf_recdes.data = overflow_get_first_page_data (ovf_page);
ovf_recdes.area_size = ovf_recdes.length = OR_HEADER_SIZE (ovf_recdes.data);
// force INSID slot present (MVCCID_ALL_VISIBLE) if absent ...
if (!MVCC_IS_FLAG_SET (mvcc_header, OR_MVCC_FLAG_VALID_DELID)) { /* force DELID slot present */
MVCC_SET_FLAG_BITS (mvcc_header, OR_MVCC_FLAG_VALID_DELID);
MVCC_SET_DELID (mvcc_header, MVCCID_NULL); }
return or_mvcc_set_header (&ovf_recdes, mvcc_header); /* <- overwrites only the header region */

After stamping overflow the home slot is not rewritten, but heap_mvcc_log_home_no_change is still logged against it so the home page’s vacuum chain advances (Invariant 7-C); the forwarder keeps pointing at the overflow record and is freed only on the non-MVCC branch.

Non-MVCC branch: free both — heap_log_delete_physical + heap_delete_physical removes the home forwarder slot, then heap_ovf_delete releases the overflow chain (the only delete path that calls it).

Invariant 7-D — the overflow header is fixed at OR_MVCC_MAX_HEADER_SIZE, which is why REC_BIGONE deletes never relocate. A compact home/relocation header can grow past a slot/big-length boundary when the 8-byte DELID is added (§7.2, §7.3); an overflow header always carries every slot (the mutator forces INSID and DELID present), so the DELID writes into pre-reserved space — hence need_mvcc_header_max_size is false. Stored compact, the in-place stamp would overrun into the body.

7.5 heap_delete_physical, spage_delete, and how OID reuse is decided


Every physical removal funnels through heap_delete_physical, a thin wrapper that snapshots spage_get_free_space_without_saving, calls spage_delete (ER_FAILED on NULL_SLOTID), then heap_stats_update (best-space cache, Ch 9) and pgbuf_set_dirty. spage_delete chooses the tombstone shape from the page header’s anchor_type:

// spage_delete -- src/storage/slotted_page.c
switch (page_header_p->anchor_type) /* ANCHORED arms set offset_to_record = SPAGE_EMPTY_OFFSET */
{
case ANCHORED: /* non-heap callers only */
slot_p->record_type = REC_DELETED_WILL_REUSE; break;
case ANCHORED_DONT_REUSE_SLOTS: /* <- ALWAYS taken by heap deletes */
slot_p->record_type = REC_MARKDELETED; break;
// UNANCHORED_* (not used by heap files) remove the slot entry; default asserts
}

The anchor cannot by itself tell heaps apart: every heap page is ANCHORED_DONT_REUSE_SLOTS, because heap_get_spage_type returns that type unconditionally (for both FILE_HEAP and FILE_HEAP_REUSE_SLOTS) and every heap spage_initialize passes it. So a heap delete always takes that arm and stamps REC_MARKDELETED, reclaiming the body bytes but keeping the SPAGE_SLOT entry; the ANCHORED arm is non-heap only.

OID reuse is therefore decided by the caller: for a FILE_HEAP_REUSE_SLOTS table heap_is_reusable_oid (file_type) is true, so the workers append a RVHF_MARK_REUSABLE_SLOT postpone after the physical delete (in heap_log_delete_physical when mark_reusable, and on the relocation forward-slot path), whose redo handler heap_rv_redo_mark_reusable_slot runs the upgrader at commit and on recovery replay:

// spage_mark_deleted_slot_as_reusable -- src/storage/slotted_page.c
// ... asserts slot is empty and REC_MARKDELETED/REC_DELETED_WILL_REUSE ...
slot_p->record_type = REC_DELETED_WILL_REUSE; /* <- REC_MARKDELETED -> reusable */

Invariant 7-E — recyclability is decided by the caller (heap_is_reusable_oid + a RVHF_MARK_REUSABLE_SLOT postpone), not by the page anchor type. All heap pages are ANCHORED_DONT_REUSE_SLOTS, so spage_delete always tombstones to REC_MARKDELETED; only a FILE_HEAP_REUSE_SLOTS delete appends the postpone whose handler runs spage_mark_deleted_slot_as_reusable to upgrade it to REC_DELETED_WILL_REUSE. If a non-reusable heap appended that postpone, its OIDs would be recycled under stale index keys. Corroboration: spage_delete_for_recovery only special-cases ANCHORED_DONT_REUSE_SLOTS (downgrading a fresh REC_MARKDELETED to REC_DELETED_WILL_REUSE while undoing an insert — a never-committed record has no external OID reference) — pointless unless heap pages are that type.
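
Invariant 7-E describes a two-phase tombstone, which the following toy model makes explicit. It is a sketch under stated assumptions: the `toy_*` names are hypothetical, and the postpone is modeled as a plain function call instead of a log record replayed at commit or recovery.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy slot states; names mirror the record types in the text. */
typedef enum { TOY_LIVE, TOY_MARKDELETED, TOY_DELETED_WILL_REUSE } toy_slot_type;

/* Phase 1: a physical delete on a heap page always tombstones to
   "mark deleted", because every heap page uses the don't-reuse anchor. */
static toy_slot_type
toy_physical_delete (toy_slot_type current)
{
  assert (current == TOY_LIVE);
  return TOY_MARKDELETED;
}

/* Phase 2: the postpone exists only when the caller decided the heap is
   reusable; it upgrades the tombstone so the slot id can be recycled. */
static toy_slot_type
toy_run_postpone (toy_slot_type current, bool heap_is_reusable)
{
  if (!heap_is_reusable)
    return current;             /* no postpone was appended */
  assert (current == TOY_MARKDELETED || current == TOY_DELETED_WILL_REUSE);
  return TOY_DELETED_WILL_REUSE;
}
```

The page layer never sees the heap's file type in this model; the only input that changes the outcome is the caller-supplied `heap_is_reusable` flag.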

  1. DELETE mirrors the update worker structure: heap_delete_logical dispatches to heap_delete_home/heap_delete_relocation/heap_delete_bigone exactly as update does (Chapter 6); the difference is per-worker, not structural.
  2. MVCC delete = stamp, not remove — the worker sets OR_MVCC_FLAG_VALID_DELID + the MVCCID via heap_delete_adjust_header and writes in place; the body survives and no prev_version_lsa is written (Invariant 7-B). The 8-byte stamp can push a compact REC_HOME to REC_NEWHOME or REC_BIGONE, so the worker re-classifies after building.
  3. REC_BIGONE is the cheap case — overflow headers are stored at OR_MVCC_MAX_HEADER_SIZE, so heap_set_mvcc_rec_header_on_overflow stamps the DELID into pre-reserved space without moving the body (Invariant 7-D).
  4. Vacuum is always notified — even when the home slot is unchanged, heap_mvcc_log_home_no_change advances its vacuum chain (Invariant 7-C); physical delete happens only on non-MVCC paths (catalog, standalone) and when freeing forwarded/overflow bodies.
  5. The page anchor does NOT decide OID reuse (Invariant 7-E) — all heap pages are ANCHORED_DONT_REUSE_SLOTS (heap_get_spage_type), so spage_delete always stamps REC_MARKDELETED; a FILE_HEAP_REUSE_SLOTS delete (gated by heap_is_reusable_oid) appends a RVHF_MARK_REUSABLE_SLOT postpone whose handler runs spage_mark_deleted_slot_as_reusable to upgrade REC_MARKDELETED -> REC_DELETED_WILL_REUSE.

Chapter 8: Vacuum Reclamation and Page Vacuum Status


Chapters 6 and 7 left dead tuples on the page (a deleted record’s mvcc_del_id, a superseded version at the tail of a prev_version_lsa chain); neither was freed then, because an older snapshot might still need the old version. The deferred reclaimer is vacuum. This chapter answers: who physically reclaims the slot, and how does a page learn it is safe to deallocate? For MVCC visibility theory and the active-vs-vacuum split, see cubrid-heap-manager.md (“Vacuum and the dead-version problem”). Here we trace the code.

8.1 The threshold contract and the page-walk driver


Vacuum never reasons about individual snapshots. It is handed a single threshold_mvccid — the oldest MVCCID still possibly visible to any active transaction. Any version whose mvcc_del_id (delete) or insert MVCCID (superseded version) precedes that threshold is provably invisible to everyone and may be destroyed. heap_vacuum_all_objects is the bulk entry point (compactdb, class purge); it walks the file page-by-page.

// heap_vacuum_all_objects -- src/storage/heap_file.c
next_vpid.pageid = upd_scancache->node.hfid.hpgid; /* <- start at header page */
reusable = heap_is_reusable_oid (upd_scancache->file_type); /* <- slot reuse policy */
while (!VPID_ISNULL (&next_vpid)) {
vpid = next_vpid;
error_code = pgbuf_ordered_fix (thread_p, &vpid, OLD_PAGE, PGBUF_LATCH_WRITE, &pg_watcher);
if (error_code != NO_ERROR) { goto exit; } /* <- fix failed */
// ... unfix old_pg_watcher (previous page in the rolling double-watcher) ...
error_code = heap_vpid_next (thread_p, &upd_scancache->node.hfid, pg_watcher.pgptr, &next_vpid);
if (error_code != NO_ERROR) { assert (false); goto exit; } /* <- corrupt next_vpid chain */
worker.n_heap_objects = spage_number_of_slots (pg_watcher.pgptr) - 1; /* <- minus header slot */
if (worker.n_heap_objects > 0
&& heap_page_get_vacuum_status (thread_p, pg_watcher.pgptr) != HEAP_PAGE_VACUUM_NONE) {
// ... fill worker.heap_objects[i].oid.slotid = 1..n; skip already-clean pages ...
error_code = vacuum_heap_page (thread_p, worker.heap_objects, worker.n_heap_objects,
threshold_mvccid, &upd_scancache->node.hfid, &reusable, false);
if (error_code != NO_ERROR) { goto exit; }
}
pgbuf_replace_watcher (thread_p, &pg_watcher, &old_pg_watcher);
}
exit: // ... unfix both watchers, free worker.heap_objects ...

Error branches are annotated inline (Figure 8-1 covers all of them). The design-carrying branch is the HEAP_PAGE_VACUUM_NONE skip — a page owing no vacuum is never scanned. Otherwise the driver builds the OID array [1..n] and delegates to vacuum_heap_page (query/vacuum.c), the only caller of spage_vacuum_slot, heap_page_set_vacuum_status_none, and heap_remove_page_on_vacuum: heap_vacuum_all_objects selects candidate pages, vacuum_heap_page selects candidate slots.

Invariant — vacuum only ever frees provably-dead versions. A slot reaches spage_vacuum_slot only after vacuum_heap_record confirms its MVCCID precedes threshold_mvccid (via vacuum_is_mvccid_vacuumed). The != HEAP_PAGE_VACUUM_NONE gate is a coarse pre-filter; the per-record test is the real guard. Violating it strands an active snapshot on a freed slot.

flowchart TD
  A["heap_vacuum_all_objects\nnext_vpid = header page"] --> B{"VPID_ISNULL(next_vpid)?"}
  B -- yes --> Z["unfix, free, return"]
  B -- no --> C["pgbuf_ordered_fix(WRITE)"]
  C --> D{"fix ok?"}
  D -- no --> Z
  D -- yes --> E["heap_vpid_next -> next_vpid"]
  E --> F{"n_objects>0 AND status != NONE?"}
  F -- no --> G["replace watcher, loop"]
  F -- yes --> H["build OID[1..n], vacuum_heap_page"]
  H --> I{"error?"}
  I -- yes --> Z
  I -- no --> G
  G --> B

Figure 8-1. heap_vacuum_all_objects page-walk, every branch.

8.2 spage_vacuum_slot: turning a live slot into reclaimable space


vacuum_heap_record calls spage_vacuum_slot once per dead version. It credits the freed bytes back to total_free; it does not compact.

// spage_vacuum_slot -- src/storage/slotted_page.c
SPAGE_SLOT *slot_p = spage_find_slot (page_p, page_header_p, slotid, false);
if (slot_p->record_type == REC_MARKDELETED || slot_p->record_type == REC_DELETED_WILL_REUSE)
vacuum_er_log_error (..., "... was already vacuumed", ...); /* <- double-vacuum: log, continue */
page_header_p->num_records--; /* <- one fewer live record */
waste = DB_WASTED_ALIGN (slot_p->record_length, page_header_p->alignment);
page_header_p->total_free += slot_p->record_length + waste; /* <- bytes returned to the page */
slot_p->offset_to_record = SPAGE_EMPTY_OFFSET; /* <- slot now points at nothing */
if (reusable)
slot_p->record_type = REC_DELETED_WILL_REUSE;/* <- no refs: OID can be recycled */
else
slot_p->record_type = REC_MARKDELETED; /* <- refs may exist: keep slot id reserved */

When reusable == false the slot becomes REC_MARKDELETED and its slot id stays reserved — an external reference (an old version’s prev_version_lsa chain, or a relocation) must still resolve to a tombstone, not a recycled record. The already-vacuumed branch is defensive, tolerating the interrupted-and-replayed scenario (8.4). What it leaves behind: offset_to_record is SPAGE_EMPTY_OFFSET, but the slot-array entry is not removed and the record bytes are still present — only total_free was bumped. Shrinking the slot array is spage_reclaim’s job (8.5); recovering data-area free space is spage_compact’s (Chapter 9).
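
The accounting that spage_vacuum_slot performs can be modeled in a few lines. This is a simplified sketch, not the slotted-page code: the structs carry only the fields the excerpt touches, `TOY_EMPTY_OFFSET` is a stand-in value for SPAGE_EMPTY_OFFSET, and `toy_wasted_align` assumes the waste is the padding that rounds a length up to the page alignment.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_EMPTY_OFFSET (-1)   /* hypothetical stand-in for SPAGE_EMPTY_OFFSET */

typedef enum { TOY_SLOT_LIVE, TOY_SLOT_MARKDELETED, TOY_SLOT_DELETED_WILL_REUSE } toy_slot_state;

typedef struct { int offset_to_record; int record_length; toy_slot_state type; } toy_slot;
typedef struct { int num_records; int total_free; int alignment; } toy_page_header;

/* Padding bytes needed to round length up to a multiple of alignment. */
static int
toy_wasted_align (int length, int alignment)
{
  assert (alignment > 0);
  return (alignment - length % alignment) % alignment;
}

/* Credits the record's bytes back to the page and tombstones the slot.
   The slot-array entry and the record bytes themselves are left alone. */
static void
toy_vacuum_slot (toy_page_header *header, toy_slot *slot, bool reusable)
{
  header->num_records--;
  header->total_free += slot->record_length
                        + toy_wasted_align (slot->record_length, header->alignment);
  slot->offset_to_record = TOY_EMPTY_OFFSET;
  slot->type = reusable ? TOY_SLOT_DELETED_WILL_REUSE : TOY_SLOT_MARKDELETED;
}
```

For a 13-byte record on a page with 8-byte alignment, the sketch credits 16 bytes: the 13 record bytes plus 3 bytes of alignment padding.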

8.3 The HEAP_PAGE_VACUUM_STATUS state machine


A heap page must not be deallocated while a vacuum worker could still visit it. CUBRID tracks that with a three-state flag in the top two bits of HEAP_CHAIN.flags (the chain struct is fully tabulated in Chapter 1).

// HEAP_PAGE_VACUUM_STATUS enum -- src/storage/heap_file.h
HEAP_PAGE_VACUUM_NONE, /* Heap page is completely vacuumed. */
HEAP_PAGE_VACUUM_ONCE, /* Heap page requires one vacuum action. */
HEAP_PAGE_VACUUM_UNKNOWN /* Heap page requires an unknown number of vacuum actions. */
// flag bits -- src/storage/heap_file.c
#define HEAP_PAGE_FLAG_VACUUM_STATUS_MASK 0xC0000000
#define HEAP_PAGE_FLAG_VACUUM_ONCE 0x80000000
#define HEAP_PAGE_FLAG_VACUUM_UNKNOWN 0x40000000

HEAP_PAGE_GET_VACUUM_STATUS decodes those bits (both clear ⇒ NONE); HEAP_PAGE_SET_VACUUM_STATUS clears the mask then ORs the bit in. The low 30 bits of flags carry other attributes. Two HEAP_CHAIN fields drive the machine:

| Field | Role | Why it exists |
|---|---|---|
| flags (top 2 bits) | Current vacuum status (NONE/ONCE/UNKNOWN) | Advertises owed vacuum visits without scanning records |
| max_mvccid | Largest MVCCID of any MVCC op ever applied to the page | The only escape from UNKNOWN: older than vacuum’s horizon ⇒ all owed vacuum has run |
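
The flag encoding described above (decode the top two bits; clear the mask, then OR the new bit in) can be sketched as a pair of helpers. The bit values are the ones quoted from heap_file.c; the `toy_*` names and the use of `uint32_t` for the flags word are assumptions made for the example.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_VACUUM_STATUS_MASK 0xC0000000u
#define TOY_VACUUM_ONCE_BIT    0x80000000u
#define TOY_VACUUM_UNKNOWN_BIT 0x40000000u

typedef enum { TOY_VACUUM_NONE, TOY_VACUUM_ONCE, TOY_VACUUM_UNKNOWN } toy_vacuum_status;

/* Decodes the top two bits; both clear means NONE. */
static toy_vacuum_status
toy_get_vacuum_status (uint32_t flags)
{
  if ((flags & TOY_VACUUM_STATUS_MASK) == 0)
    return TOY_VACUUM_NONE;
  if (flags & TOY_VACUUM_ONCE_BIT)
    return TOY_VACUUM_ONCE;
  return TOY_VACUUM_UNKNOWN;
}

/* Clears the status bits, then ORs the requested one in.
   The low 30 bits are never touched. */
static uint32_t
toy_set_vacuum_status (uint32_t flags, toy_vacuum_status status)
{
  flags &= ~TOY_VACUUM_STATUS_MASK;
  switch (status)
    {
    case TOY_VACUUM_NONE:
      break;
    case TOY_VACUUM_ONCE:
      flags |= TOY_VACUUM_ONCE_BIT;
      break;
    case TOY_VACUUM_UNKNOWN:
      flags |= TOY_VACUUM_UNKNOWN_BIT;
      break;
    default:
      assert (0);               /* not a valid status */
    }
  return flags;
}
```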

The meaning of max_mvccid depends on the state. In NONE it is stale: the next op asserts that it precedes the new id, then bumps it. In ONCE the owed vacuum targets a version <= max_mvccid. In UNKNOWN it is the escape test: if vacuum_is_mvccid_vacuumed (max_mvccid) is true, all owed vacuum has run. MVCC ops push the machine forward; vacuum pulls it back:

// heap_page_update_chain_after_mvcc_op -- src/storage/heap_file.c
switch (vacuum_status) {
case HEAP_PAGE_VACUUM_NONE:
assert (MVCC_ID_PRECEDES (chain->max_mvccid, mvccid));
HEAP_PAGE_SET_VACUUM_STATUS (chain, HEAP_PAGE_VACUUM_ONCE); break; /* <- first op since clean */
case HEAP_PAGE_VACUUM_ONCE:
HEAP_PAGE_SET_VACUUM_STATUS (chain, HEAP_PAGE_VACUUM_UNKNOWN); break; /* <- future unpredictable */
case HEAP_PAGE_VACUUM_UNKNOWN:
if (vacuum_is_mvccid_vacuumed (chain->max_mvccid)) /* <- all prior owed vacuum ran */
HEAP_PAGE_SET_VACUUM_STATUS (chain, HEAP_PAGE_VACUUM_ONCE);
break; /* <- else: stays UNKNOWN */
}
if (MVCC_ID_PRECEDES (chain->max_mvccid, mvccid)) chain->max_mvccid = mvccid; /* <- track the max */
// heap_page_set_vacuum_status_none -- src/storage/heap_file.c
assert (HEAP_PAGE_GET_VACUUM_STATUS (chain) == HEAP_PAGE_VACUUM_ONCE); /* <- only ONCE -> NONE */
HEAP_PAGE_SET_VACUUM_STATUS (chain, HEAP_PAGE_VACUUM_NONE);
stateDiagram-v2
  [*] --> NONE
  NONE --> ONCE: mvcc op \n assert max_mvccid precedes new id
  ONCE --> UNKNOWN: second mvcc op \n future unpredictable
  ONCE --> NONE: vacuum visit \n set_vacuum_status_none
  UNKNOWN --> ONCE: mvcc op AND max_mvccid already vacuumed
  UNKNOWN --> UNKNOWN: mvcc op AND max_mvccid not yet vacuumed
  NONE --> [*]: page may be deallocated

Figure 8-2. HEAP_PAGE_VACUUM_STATUS transitions. The only exit edge to deallocation is from NONE.

UNKNOWN never collapses straight to NONE: once two MVCC ops stack up without an intervening vacuum, CUBRID stops counting owed visits. Only a new op observing that max_mvccid is past vacuum’s horizon drops it to ONCE, and from ONCE one vacuum visit returns it to NONE.
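
The transitions can be replayed with a toy model of the chain. This is a sketch, not the heap code: `toy_is_vacuumed` is a hypothetical stand-in for vacuum_is_mvccid_vacuumed that treats every id below a caller-supplied horizon as vacuumed, and MVCCID ordering is reduced to a plain integer comparison.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef enum { TOY_ST_NONE, TOY_ST_ONCE, TOY_ST_UNKNOWN } toy_status;
typedef struct { toy_status status; uint64_t max_mvccid; } toy_chain;

/* Hypothetical horizon test: ids below the horizon have been vacuumed. */
static bool
toy_is_vacuumed (uint64_t mvccid, uint64_t horizon)
{
  return mvccid < horizon;
}

/* An MVCC op pushes the state forward and tracks the largest id seen. */
static void
toy_after_mvcc_op (toy_chain *chain, uint64_t mvccid, uint64_t horizon)
{
  switch (chain->status)
    {
    case TOY_ST_NONE:
      assert (chain->max_mvccid < mvccid);
      chain->status = TOY_ST_ONCE;          /* first op since clean */
      break;
    case TOY_ST_ONCE:
      chain->status = TOY_ST_UNKNOWN;       /* stop counting owed visits */
      break;
    case TOY_ST_UNKNOWN:
      if (toy_is_vacuumed (chain->max_mvccid, horizon))
        chain->status = TOY_ST_ONCE;        /* all earlier owed vacuum ran */
      break;
    }
  if (chain->max_mvccid < mvccid)
    chain->max_mvccid = mvccid;
}

/* A vacuum visit pulls the state back, and only from ONCE. */
static void
toy_vacuum_visit (toy_chain *chain)
{
  assert (chain->status == TOY_ST_ONCE);
  chain->status = TOY_ST_NONE;
}
```

Starting from NONE, two ops in a row reach UNKNOWN; a later op that finds the old max_mvccid below the horizon drops the page to ONCE, and a single vacuum visit then returns it to NONE.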

Invariant — a page is deallocatable only in state NONE. Vacuum drives a page to NONE only from ONCE, never directly from UNKNOWN, and heap_page_set_vacuum_status_none asserts the prior state is ONCE. NONE therefore means no owed vacuum visit remains — the precondition heap_remove_page_on_vacuum relies on. Reaching it from any other state is a use-after-free of a slot a future worker still intends to touch.

8.4 Tying status to reclamation: the vacuum_heap_page decision


vacuum_heap_page reads the status after per-record work, then decides whether to flip it to NONE and whether to attempt removal:

// vacuum_heap_page (decision tail) -- src/query/vacuum.c
page_vacuum_status = heap_page_get_vacuum_status (thread_p, helper.home_page);
assert (page_vacuum_status != HEAP_PAGE_VACUUM_NONE || (was_interrupted && helper.n_vacuumed == 0));
if ((page_vacuum_status == HEAP_PAGE_VACUUM_ONCE && !was_interrupted)
|| (page_vacuum_status == HEAP_PAGE_VACUUM_NONE && was_interrupted))
{
if (page_vacuum_status == HEAP_PAGE_VACUUM_ONCE)
heap_page_set_vacuum_status_none (thread_p, helper.home_page); /* <- ONCE -> NONE, logged */
// ... pgbuf_set_dirty (home_page) ...
if (spage_number_of_records (helper.home_page) <= 1 && helper.reusable)
{ /* <- only header slot, reusable heap */
if (pgbuf_has_prevent_dealloc (helper.home_page) == false
&& heap_remove_page_on_vacuum (thread_p, &helper.home_page, &helper.hfid))
{ /* page gone */ goto end; }
}
}

Two branches match: ONCE && !was_interrupted (normal — flip to NONE, then consider removal) and NONE && was_interrupted (a replayed task whose vacuum ran in a prior life — already NONE, no re-flip). ONCE && was_interrupted is deliberately excluded: a replayed worker cannot tell whether the ONCE is its own task or a new delete logged after the crash, so flipping could strand owed vacuum. UNKNOWN never matches. Removal is attempted only for a page holding just the header slot, on a reusable heap, with no scanner pinning it.
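
The two-stage decision reduces to a pair of predicates. The following is a minimal sketch with hypothetical `toy_*` names; it models only the boolean logic of the excerpt, not the logging or page fixing around it.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum { TOY_PAGE_NONE, TOY_PAGE_ONCE, TOY_PAGE_UNKNOWN } toy_page_status;

/* True when vacuum may treat the page as fully vacuumed after this visit. */
static bool
toy_should_finalize (toy_page_status status, bool was_interrupted)
{
  /* Precondition from the excerpt: NONE is expected only on a replayed task. */
  assert (status != TOY_PAGE_NONE || was_interrupted);
  return (status == TOY_PAGE_ONCE && !was_interrupted)
         || (status == TOY_PAGE_NONE && was_interrupted);
}

/* True when removal of the page is worth attempting. */
static bool
toy_should_try_removal (int n_records, bool reusable, bool prevent_dealloc)
{
  return n_records <= 1 && reusable && !prevent_dealloc;
}
```

The excluded combination, ONCE on a replayed task, returns false from `toy_should_finalize`, as does UNKNOWN in either case.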

8.5 spage_reclaim: compacting the slot array of a don’t-reuse page


spage_vacuum_slot leaves dead slot-array entries; spage_reclaim shrinks the slot array for ANCHORED_DONT_REUSE_SLOTS pages, invoked from the reclaim-addresses path (xheap_reclaim_addresses, reached by compactdb), not the steady-state vacuum loop.

// spage_reclaim -- src/storage/slotted_page.c
if (page_header_p->num_slots > 0) {
first_slot_p = spage_find_slot (page_p, page_header_p, 0, false);
for (slot_id = page_header_p->num_slots - 1; slot_id >= 0; slot_id--) { /* <- backwards */
slot_p = first_slot_p - slot_id;
if (slot_p->offset_to_record == SPAGE_EMPTY_OFFSET
&& (slot_p->record_type == REC_MARKDELETED || slot_p->record_type == REC_DELETED_WILL_REUSE)) {
assert (page_header_p->anchor_type == ANCHORED_DONT_REUSE_SLOTS);
if ((slot_id + 1) == page_header_p->num_slots)
spage_reduce_a_slot (page_p); /* <- trailing dead slot: drop array entry */
else
slot_p->record_type = REC_DELETED_WILL_REUSE; /* <- interior: mark recyclable */
is_reclaim = true;
}
}
}
if (is_reclaim == true) {
if (page_header_p->num_slots == 0)
spage_initialize (thread_p, page_p, ...); /* <- fully empty: reset page to pristine layout */
pgbuf_set_dirty (thread_p, page_p, DONT_FREE);
}
return is_reclaim;

The load-bearing branch is backwards iteration. Only a trailing dead slot can be removed with spage_reduce_a_slot (which shortens the array), so walking from the last slot toward 0 lets each removal expose the next trailing dead slot. An interior dead slot cannot be removed without renumbering the live slots after it (invalidating their OIDs), so it is re-typed REC_DELETED_WILL_REUSE for a future insert. Exit branches: num_slots == 0 at entry returns false; if anything was reclaimed and the page is now fully empty, spage_initialize resets it; if nothing was reclaimed the page is untouched.
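
The effect of the backwards walk is easiest to see on a plain array. The sketch below is a toy model under stated assumptions: slots are reduced to a state enum, the slot array is indexed forward for simplicity (the real array grows downward from the page end), and the `toy_*` names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum { TOY_S_LIVE, TOY_S_MARKDELETED, TOY_S_DELETED_WILL_REUSE } toy_slot_kind;

/* Walks slots[0 .. *num_slots) from last to first. A dead slot at the
   current tail shrinks the array; an interior dead slot keeps its id and
   is re-typed as reusable. Returns true if any dead slot was seen. */
static bool
toy_reclaim (toy_slot_kind *slots, int *num_slots)
{
  bool is_reclaim = false;

  assert (*num_slots >= 0);
  for (int slot_id = *num_slots - 1; slot_id >= 0; slot_id--)
    {
      if (slots[slot_id] == TOY_S_LIVE)
        continue;
      if (slot_id + 1 == *num_slots)
        (*num_slots)--;                              /* trailing: drop the entry */
      else
        slots[slot_id] = TOY_S_DELETED_WILL_REUSE;   /* interior: recycle later */
      is_reclaim = true;
    }
  return is_reclaim;
}
```

With the layout live, dead, live, dead, dead, the walk removes the last two entries one after the other and re-types slot 1, so the live slot 2 keeps its id. A forward walk would find slot 3 still interior when it reached it and could only re-type it.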

The asymmetry with spage_vacuum_slot is the takeaway: vacuum frees space continuously via spage_vacuum_slot and logs it through vacuum_log_vacuum_heap_page. spage_reclaim reclaims slot-array entries only on the heavier reclaim-addresses path and only for the don’t-reuse anchor; there xheap_reclaim_addresses deliberately skips logging the reclaim (log_skip_logging, in the if (spage_reclaim (...) == true) block) because reusing unreferenced dead OIDs leaves the database logically unmodified, so neither REDO nor UNDO is required.

8.6 heap_remove_page_on_vacuum: unlinking the empty page


When a page reaches NONE and holds only its header slot, vacuum tries to give it back to the file manager. The function is minimally intrusive: every neighbor is fixed conditionally through ordered watchers, and any failure abandons the attempt (return false).

// heap_remove_page_on_vacuum -- src/storage/heap_file.c
assert (spage_number_of_records (*page_ptr) <= 1); /* <- precondition: page is empty */
pgbuf_get_vpid (*page_ptr, &page_vpid);
if (page_vpid.pageid == hfid->hpgid && page_vpid.volid == hfid->vfid.volid)
return false; /* <- never remove the header page */
if (pgbuf_ordered_fix (thread_p, &header_vpid, OLD_PAGE, PGBUF_LATCH_WRITE, &header_watcher) != NO_ERROR)
goto error; /* <- give up if header busy */
// ... heap_vpid_prev/next, conditional fix of prev (unless == header) and next ...
if (crt_watcher.page_was_unfixed) {
*page_ptr = crt_watcher.pgptr; /* <- ordered fix may have refixed home */
if (spage_number_of_records (crt_watcher.pgptr) > 1) goto error; /* <- re-filled while we waited */
}
if (pgbuf_has_prevent_dealloc (crt_watcher.pgptr)) goto error; /* <- a scanner reached us */
if (pgbuf_has_any_waiters (crt_watcher.pgptr)) { assert (false); goto error; }
log_sysop_start (thread_p); is_system_op_started = true; /* <- atomic unlink begins */
// ... scrub vpid from estimates.best[]/second_best[]/last_vpid/full_search_vpid,
// splice prev.next_vpid=next and next.prev_vpid=prev (RVHF_STATS / RVHF_CHAIN logs) ...
pgbuf_ordered_unfix_and_init (thread_p, *page_ptr, &crt_watcher);
if (file_dealloc (thread_p, &hfid->vfid, &page_vpid, FILE_HEAP) != NO_ERROR) goto error;
(void) heap_stats_del_bestspace_by_vpid (thread_p, &page_vpid); /* <- evict from cache */
log_sysop_commit (thread_p); is_system_op_started = false;
return true;
error:
if (is_system_op_started) log_sysop_abort (thread_p); /* <- roll back half-done unlink */
// ... unfix any fixed watchers; return false ...

Every branch is a “give up safely” path (Figure 8-3); the function never partially unlinks a page. A busy neighbor merely defers removal; page_was_unfixed with > 1 records means a concurrent insert re-filled the page while watchers were acquired in VPID order; has_any_waiters should be impossible (assert(false)). The happy path scrubs the page from the header’s best-space estimates, splices the chain (prev.next_vpid = next, next.prev_vpid = prev, plus heap_hdr.next_vpid = next when this was the file’s first data page), and file_deallocs it — all under one system op.

Invariant — chain splice and dealloc are atomic. The header-stats, prev-chain, and next-chain spage_updates plus file_dealloc run inside one log_sysop_start/log_sysop_commit op (RVHF_STATS / RVHF_CHAIN). A crash mid-splice therefore recovers all-or-nothing. A half-spliced chain would corrupt traversal, which is exactly what the sysop prevents.

flowchart TD
  A["remove_page_on_vacuum\nassert <=1 record"] --> B{"is header page?"}
  B -- yes --> R0["return false"]
  B -- no --> C["ordered_fix header"]
  C --> D{"fix ok?"}
  D -- no --> E["goto error -> false"]
  D -- yes --> F["vpid_prev / vpid_next"]
  F --> G["fix prev, fix next conditionally"]
  G --> H{"home refilled, >1 record?"}
  H -- yes --> E
  H -- no --> I{"prevent_dealloc or waiters?"}
  I -- yes --> E
  I -- no --> J["sysop_start"]
  J --> K["scrub bestspace, splice prev/next, file_dealloc"]
  K --> L{"all ok?"}
  L -- no --> M["sysop_abort -> false"]
  L -- yes --> N["sysop_commit, del bestspace cache, return true"]

Figure 8-3. heap_remove_page_on_vacuum — every path is fail-safe.
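The check-then-splice shape can be sketched on an in-memory doubly-linked list. This is a simplified model under stated assumptions: `toy_page` and `toy_unlink_empty_page` are hypothetical, and latching, logging, and the best-space scrub are omitted. What it shows is that every refusal happens before the first pointer is changed.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical in-memory stand-in for a heap page and its chain record. */
struct toy_page
{
  struct toy_page *prev;
  struct toy_page *next;
  int num_records;  /* 1 means only the header/chain slot remains */
  int is_header;    /* the file's header page is never removed */
};

/* Returns 1 if the page was unlinked, 0 if the attempt was abandoned.
 * All checks run before any pointer is modified. */
static int
toy_unlink_empty_page (struct toy_page *page)
{
  if (page->is_header || page->num_records > 1)
    return 0;

  if (page->prev != NULL)
    page->prev->next = page->next;
  if (page->next != NULL)
    page->next->prev = page->prev;
  page->prev = NULL;
  page->next = NULL;
  return 1;
}
```

In this model a refused call leaves the chain exactly as it was, which mirrors the "give up safely" branches of the flowchart.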

8.7 Best-space cache and the HEAP_DROP_FREE_SPACE threshold

The best-space machinery is dissected in Chapter 9; vacuum touches two seams. HEAP_DROP_FREE_SPACE (heap_file.h, (int)(DB_PAGESIZE * 0.3)) is the free-space level at which a page is worth advertising as an insert target; when spage_vacuum_slot returns enough bytes to cross it, the page can re-enter the best-space hints. Conversely, when heap_remove_page_on_vacuum deallocates a page it removes it from the cache (nulling the matching estimates.best[]/second_best[] slots and calling heap_stats_del_bestspace_by_vpid) so a future insert is never handed a freed VPID.

8.8 REC_MVCC_NEXT_VERSION (legacy) versus prev_version_lsa

Older CUBRID stored the forward pointer to a record’s successor version inside the heap as a REC_MVCC_NEXT_VERSION slot that vacuum had to chase. Current code supersedes that with prev_version_lsa: an update writes the log LSA of the previous version into the new record’s MVCC header (see heap_update_set_prev_version and Chapters 5-6), so version chaining lives in the log, not in extra heap slots. Consequently vacuum no longer walks a forwarding link to reclaim a successor — it tombstones the dead version’s own slot via spage_vacuum_slot, reclaiming its bytes in place, while readers needing the prior image follow prev_version_lsa into the log. This is also why spage_vacuum_slot on a reusable heap can immediately pick REC_DELETED_WILL_REUSE: with no in-heap next-version link to honor, there are no references to the slot.

  1. heap_vacuum_all_objects selects pages; vacuum_heap_page selects slots — the driver walks the next_vpid chain, skips HEAP_PAGE_VACUUM_NONE pages, delegating the per-record threshold_mvccid death test downward.
  2. spage_vacuum_slot frees bytes, not slot entries — it bumps total_free, sets offset_to_record = SPAGE_EMPTY_OFFSET, and stamps REC_DELETED_WILL_REUSE (reusable) or REC_MARKDELETED (referable); it never compacts.
  3. The NONE → ONCE → UNKNOWN machine predicts owed vacuum visits — MVCC ops push it forward, only a vacuum visit pulls it back (only ONCE → NONE), and UNKNOWN escapes only when max_mvccid falls below vacuum’s horizon.
  4. A page is deallocatable only in state NONE — the precondition heap_remove_page_on_vacuum relies on to avoid a future worker touching freed slots.
  5. spage_reclaim shrinks the slot array backwards, for ANCHORED_DONT_REUSE_SLOTS only, on the reclaim path where xheap_reclaim_addresses skips logging the reclaim; trailing tombstones drop via spage_reduce_a_slot, interior ones become reusable.
  6. heap_remove_page_on_vacuum is fail-safe and atomic — every contention case aborts (return false); the chain splice plus file_dealloc run in one logged system op (all-or-nothing recovery).
  7. prev_version_lsa replaced the in-heap REC_MVCC_NEXT_VERSION link, so vacuum reclaims a dead version’s slot in place; freed space crossing HEAP_DROP_FREE_SPACE (30%) re-feeds the best-space cache.

Chapter 9: Best Space and Free Space Management

Chapters 4 (Insert) and 6 (Update) deferred one function: “pick a page with enough room” — heap_stats_find_best_page. This chapter dissects the machinery behind it: the on-disk hints (HEAP_HDR_STATS.estimates), the in-memory cache (heap_Bestspace), and the lazy reconciliation (heap_stats_sync_bestspace) that keeps the picture current without scanning the whole heap on the hot path.

The governing idea: free-space accounting is deliberately approximate. Nothing here is logged for correctness (the log_skip_logging calls), entries may be stale, and a wrong hint costs at most a wasted page fix, never a data error. For why a heap keeps free-space hints, see cubrid-heap-manager.md, “Free-space management”; this is the implementation.

Two stores, consulted in order: (1) the in-memory cache heap_Bestspace (HEAP_STATS_BESTSPACE_CACHE), a process-global hash keyed by both HFID and VPID, bounded by PRM_ID_HF_MAX_BESTSPACE_ENTRIES (disabled when <= 0); (2) the on-disk hints in the heap header HEAP_HDR_STATS.estimates — a best[10] ring, a second_best[10] ring, and aggregate counters that persist across restarts. heap_stats_find_best_page reads both tiers; heap_stats_sync_bestspace rebuilds both.

9.2 heap_bestspace and the estimates block — every field

heap_bestspace (in heap_file.h) is the atom of hinting: one candidate page plus its estimated room.

// heap_bestspace -- src/storage/heap_file.h
struct heap_bestspace
{
  VPID vpid; /* Vpid of one of the best pages */
  int freespace; /* Estimated free space in this page */
};

| Field | Role | Why it exists |
| --- | --- | --- |
| vpid | Candidate page id | Fix the page directly, no directory walk |
| freespace | Estimated bytes free | Pre-filter: skip if < needed; lazy, may lie |

This struct appears three ways: the best[] ring element on disk, HEAP_STATS_ENTRY.best in memory, and a selector stack local. The on-disk ring lives in the anonymous estimates sub-struct of heap_hdr_stats (heap_file.c), whose every field follows:

| Field | Role | Why it exists |
| --- | --- | --- |
| num_pages | Approx page count | Sizes the bounded scan min(20%, 100) |
| num_recs | Approx record count | Avg record size for heap_stats_get_min_freespace |
| recs_sumlen | Approx sum of record lengths | Numerator of the avg-record-size estimate |
| num_other_high_best | Good pages not in best[] | Re-sync gate: scan only if num_other_high_best / num_pages >= 0.1 |
| num_high_best | best[] slots still good | 0 means exhausted; sync worthwhile |
| num_substitutions | Best-slot eviction count | % 1000 == 0 admits one to second_best[] |
| num_second_best | Live second_best[] entries | Tells empty from full ring (both head==tail) |
| head_second_best | second_best[] consume index | heap_stats_get_second_best pops here |
| tail_second_best | second_best[] produce index | heap_stats_put_second_best pushes here |
| head | best[] consume/insert index | Where the next substitution lands |
| last_vpid | The heap’s tail page | heap_vpid_alloc links the new page after it |
| full_search_vpid | Bounded-scan resume cursor | Spreads scanning over calls |
| second_best[10] | Ring of evicted-but-good pages | Reservoir; not all reused at once |
| best[10] | Ring of primary candidates | The hot-path source, read first |

Invariant — ring consistency. head stays in [0,10) and is advanced only via HEAP_STATS_NEXT_BEST_INDEX(i)=(i+1)%10; a bad head reads past the array. For second_best, (tail - head + 10) % 10 == num_second_best % 10 (asserted in the put/get helpers). The comparison is modulo 10 because an empty ring (num_second_best == 0) and a full ring (num_second_best == 10) both have head==tail, so num_second_best is the only way to tell them apart. None of these fields are logged, so after a crash they may be stale and are trusted only as hints.
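A small ring buffer makes the head/tail/count relationship concrete. This is a sketch under assumptions, not the CUBRID helpers: `toy_ring`, `toy_ring_put`, and `toy_ring_get` are hypothetical names, entries are plain ints rather than VPIDs, and the 1-in-1000 admission guard is left out. The consistency check compares both sides modulo the ring size so that a full ring (count 10, head == tail) passes.

```c
#include <assert.h>

#define TOY_RING_SIZE 10
#define TOY_NEXT(i) (((i) + 1) % TOY_RING_SIZE)

struct toy_ring
{
  int slots[TOY_RING_SIZE];
  int head;   /* consume index */
  int tail;   /* produce index */
  int count;  /* live entries: the only way to tell full from empty */
};

static int
toy_ring_consistent (const struct toy_ring *r)
{
  return (r->tail - r->head + TOY_RING_SIZE) % TOY_RING_SIZE
         == r->count % TOY_RING_SIZE;
}

/* Push at tail; when the ring is already full the oldest entry is
 * overwritten and head follows tail. */
static void
toy_ring_put (struct toy_ring *r, int value)
{
  r->slots[r->tail] = value;
  r->tail = TOY_NEXT (r->tail);
  if (r->count == TOY_RING_SIZE)
    r->head = r->tail;
  else
    r->count++;
  assert (toy_ring_consistent (r));
}

/* Pop at head; returns 0 when the ring is empty. */
static int
toy_ring_get (struct toy_ring *r, int *value)
{
  if (r->count == 0)
    return 0;
  *value = r->slots[r->head];
  r->head = TOY_NEXT (r->head);
  r->count--;
  assert (toy_ring_consistent (r));
  return 1;
}
```

After twelve puts into an empty ring, head and tail coincide with count at 10, and the first get returns the third value pushed, because the two oldest entries were overwritten.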

9.3 The thresholds: HEAP_DROP_FREE_SPACE, unfill_space, min-freespace

HEAP_DROP_FREE_SPACE (heap_file.h) is (int)(DB_PAGESIZE * 0.3) — the 30% admission floor. A page is cached only if more than ~30% of it is free; below that it is dropped (hence “drop free space”). unfill_space is reserved per-page headroom (DB_PAGESIZE * PRM_ID_HF_UNFILL_FACTOR, set at heap creation); the selector adds it to the request so a page must fit this record plus slack for future in-place update growth (Chapter 6: a growing update prefers to stay home). heap_stats_get_min_freespace combines both: it takes the average record size (recs_sumlen/num_recs, or header_size + 20 when num_recs == 0), adds unfill_space, then clamps with MIN(..., HEAP_DROP_FREE_SPACE).

Invariant — min-freespace never exceeds the drop floor. The final MIN(..., HEAP_DROP_FREE_SPACE) keeps every page that clears 30% eligible. Without it, large records could push min-freespace above 30%, and the updater would refuse pages that the sync (which checks only the 30% floor) keeps recording: a permanent disagreement.
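The clamp is easy to check with numbers. The function below is a hedged sketch of the arithmetic described above, not the body of heap_stats_get_min_freespace: `toy_min_freespace` and its parameter list are hypothetical, and the 16 KB page size is an assumption chosen for illustration.

```c
#include <assert.h>

#define TOY_PAGESIZE 16384                               /* assumed page size */
#define TOY_DROP_FREE_SPACE ((int) (TOY_PAGESIZE * 0.3)) /* 30% admission floor */

/* Average record size (or header_size + 20 when no records are known),
 * plus the per-page unfill reserve, clamped to the 30% floor. */
static int
toy_min_freespace (int num_recs, long recs_sumlen, int header_size,
                   int unfill_space)
{
  int min_freespace;

  if (num_recs > 0)
    min_freespace = (int) (recs_sumlen / num_recs);
  else
    min_freespace = header_size + 20;

  min_freespace += unfill_space;
  if (min_freespace > TOY_DROP_FREE_SPACE)
    min_freespace = TOY_DROP_FREE_SPACE;
  return min_freespace;
}
```

With an average record of 10000 bytes the unclamped value would exceed the floor, so the result is pinned to `TOY_DROP_FREE_SPACE`; with 200-byte records and a 1638-byte reserve it is simply 1838.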

9.4 heap_stats_find_page_in_bestspace — scanning the hints

The inner loop: given best[] (from the caller) and the cache, it returns a fixed, X-latched page with at least needed_space free, or HEAP_FINDSPACE_NOTFOUND / HEAP_FINDSPACE_ERROR. Every fix runs under LK_FORCE_ZERO_WAIT, so a busy page is skipped, never waited on:

// heap_stats_find_page_in_bestspace -- src/storage/heap_file.c
while (notfound_cnt < BEST_PAGE_SEARCH_MAX_COUNT /* cap self-heal at 100 misses */
&& (ent = mht_get2 (heap_Bestspace->hfid_ht, hfid, NULL)) != NULL)
{
if (ent->best.freespace >= needed_space) { best = ent->best; break; } /* hit */
mht_rem2 (...); mht_rem (...); /* stale: evict from both ht */
heap_stats_entry_free (thread_p, ent, NULL);
heap_Bestspace->num_stats_entries--; notfound_cnt++;
}

The cache pull self-heals — a hint short on room is evicted from both hash tables. On a cache miss (or disabled cache) best.freespace stays -1 and the code scans the on-disk best[] from best_array_index, setting best_hint_is_used so a refresh writes the corrected freespace back. Each candidate runs one fix-and-recheck cycle. Branches a modifier must respect:

  • Recheck, repair even on a miss. spage_max_space_for_new_record is the truth; if short, unfix and continue, but still refresh the hint/cache.
  • Returned freespace excludes unfill — only record_length + heap_Slotted_overhead is subtracted; the reserve was already in needed_space.
  • Error vs. timeout. NULL fix with er_errid()==NO_ERROR is the zero-wait timeout (try next); ER_INTERRUPTED aborts; any other error drops the hint and returns _ERROR with assert(false).
  • idx_badspace out-param returns the least-room slot; the caller makes it the new head, so the next substitution overwrites the worst slot.
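The recheck-and-repair rule and the idx_badspace out-parameter can be sketched without any page buffer. Everything here is an assumption made for illustration: `toy_hint`, `toy_find_in_hints`, and the `true_free` array (standing in for what spage_max_space_for_new_record would report on a fixed page) are hypothetical, latching and error handling are omitted, and on a hit the sketch subtracts the whole request rather than only record length plus slot overhead.

```c
#include <assert.h>

#define TOY_NUM_BEST 10

struct toy_hint
{
  int page;       /* page < 0 marks an empty slot */
  int freespace;  /* estimate; may be stale */
};

/* Scans the ring from `start`. Returns the chosen page id, or -1 if none fits.
 * *idx_badspace receives the index of the least-room slot seen during the scan. */
static int
toy_find_in_hints (struct toy_hint *best, int start, const int *true_free,
                   int needed_space, int *idx_badspace)
{
  int worst_space = -1;

  *idx_badspace = -1;
  for (int n = 0; n < TOY_NUM_BEST; n++)
    {
      int i = (start + n) % TOY_NUM_BEST;

      if (best[i].page < 0)
        continue;
      if (best[i].freespace >= needed_space)
        {
          int actual = true_free[best[i].page];  /* the page is the truth */

          if (actual >= needed_space)
            {
              best[i].freespace = actual - needed_space;
              return best[i].page;
            }
          best[i].freespace = actual;            /* repair even on a miss */
        }
      if (*idx_badspace < 0 || best[i].freespace < worst_space)
        {
          *idx_badspace = i;
          worst_space = best[i].freespace;
        }
    }
  return -1;
}
```

A stale hint that promised 500 bytes but really has 100 is corrected in place and becomes the worst slot, so it is the natural victim for the next substitution.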

9.5 heap_stats_find_best_page — the orchestrator

This is the function Chapters 4 and 6 call. It bumps the record estimates (num_recs += 1 on insert; recs_sumlen += needed_space always), then loops over hints, sync, and allocation. Two subtleties live in the code:

// heap_stats_find_best_page -- src/storage/heap_file.c
total_space = needed_space + heap_Slotted_overhead + heap_hdr->unfill_space;
if (heap_is_big_length (total_space)) /* unfill would overflow page */
total_space = needed_space + heap_Slotted_overhead; /* -> drop the reserve */
if (try_find >= 2 || other_high_best_ratio < HEAP_BESTSPACE_SYNC_THRESHOLD) /*0.1f*/
break; /* sync-admission gate */

Every branch, in order:

  1. Header fix fails → goto error, return NULL.
  2. Otherwise enter the loop (try_find++).
  3. heap_stats_find_page_in_bestspace returns HEAP_FINDSPACE_ERROR → unfix and goto error; it returns a page → finish.
  4. No page → the gate decides. num_pages<=0 or num_other_high_best<=0 forces the ratio to 0 (with assert(num_pages>0)); try_find>=2 or ratio<0.1 breaks out to allocation.
  5. Gate passes → the sync loop runs heap_stats_sync_bestspace(false,true) up to 3 times while nothing is found, then re-scans; num_pages_found<0 is a hard error, and <=0 after the retries falls through to allocation.
  6. Still NULL → heap_vpid_alloc (failure → goto error); otherwise the exit path: log_skip_logging on the header, set dirty + free, return pgptr.
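The sync-admission gate reduces to a small predicate. This is a sketch of the condition as described in this section, with hypothetical names (`toy_should_sync` and its parameters); it is not lifted from heap_file.c.

```c
#include <assert.h>

#define TOY_SYNC_THRESHOLD 0.1f  /* mirrors the 0.1f shown in the excerpt */

/* Returns 1 when a sync is worth trying, 0 when the caller should
 * stop looking and allocate a new page instead. */
static int
toy_should_sync (int try_find, int num_pages, int num_other_high_best)
{
  float ratio;

  if (num_pages <= 0 || num_other_high_best <= 0)
    ratio = 0.0f;  /* nothing known to be reusable */
  else
    ratio = (float) num_other_high_best / (float) num_pages;

  if (try_find >= 2 || ratio < TOY_SYNC_THRESHOLD)
    return 0;
  return 1;
}
```

With 100 pages, 20 known good pages outside best[] pass the gate on the first try, while 5 do not; a second try never syncs regardless of the ratio.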

Invariant — header held WRITE for the whole selection. Fixed PGBUF_LATCH_WRITE before any estimate is touched, released only at the single dirty-and-free exit (or each goto error). Estimate mutation, sync, and allocation all happen under that one latch, so two inserters can’t corrupt head/num_high_best. The log_skip_logging at exit leaves the mutations unlogged; they self-heal.

9.6 heap_stats_update — marking pages good after a delete/update

When a delete or in-place shrink frees space, the page may become a good reuse target. heap_stats_update is the lightweight notifier (delete/update paths and recovery redo, Chapter 10).

// heap_stats_update -- src/storage/heap_file.c
freespace = spage_get_free_space_without_saving (thread_p, pgptr, &need_update);
if (PRM_ID_HF_MAX_BESTSPACE_ENTRIES > 0 && prev_freespace < freespace)
  heap_stats_add_bestspace (thread_p, hfid, vpid, freespace); /* room grew: cache */
if (need_update || prev_freespace <= HEAP_DROP_FREE_SPACE)
  {
    if (freespace > HEAP_DROP_FREE_SPACE) /* now genuinely good */
      {
        error = heap_stats_update_internal (...); /* try to write best[] */
        if (error != NO_ERROR)
          spage_set_need_update_best_hint (..., true); /* defer */
        else if (need_update)
          spage_set_need_update_best_hint (..., false); /* clear */
      }
    else if (need_update)
      spage_set_need_update_best_hint (..., false); /* obsolete */
  }

Branches: cache tried first (if enabled and room grew — no latch); on-disk update only if need_update or prev_freespace <= 30%; heap_stats_update_internal takes a CONDITIONAL header latch — on failure it flags need_update_best_hint = true (“good page, couldn’t tell the header”), invisible until a future delete or full sync rediscovers it (the approximation tax). Success with need_update clears the flag; a now-sub-30% flagged page clears it too (obsolete note).

Inside heap_stats_update_internal, an evicted slot above 30% is offered to the reservoir via heap_stats_put_second_best, where the cadence lives: the guard if (heap_hdr->estimates.num_substitutions++ % 1000 == 0) admits one in 1000 evictions — pushing *vpid at tail_second_best, advancing the tail (and head if full), bumping num_second_best, and resetting the counter to 1.

Why 1-in-1000 (write-spreading). Caching every page of a bulk-emptied run would make the next inserts pile back in, re-dirtying just-emptied pages. Sampling every 1000th spreads reuse across the file so emptied extents can stay empty (and be returned to the OS).
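The cadence of the guard, including the reset to 1, can be checked in isolation. The helper below is a hypothetical reduction (`toy_admit_second_best` is not a CUBRID function) that keeps only the counter logic quoted above.

```c
#include <assert.h>

/* Returns 1 when this eviction is admitted to the reservoir.
 * The counter is tested, post-incremented, and reset to 1 on admission. */
static int
toy_admit_second_best (int *num_substitutions)
{
  if ((*num_substitutions)++ % 1000 == 0)
    {
      *num_substitutions = 1;
      return 1;
    }
  return 0;
}
```

Starting from a counter of 0, the first eviction is admitted, and after that one eviction in every 1000 is admitted, because the counter has to climb from 1 back to 1000 before the test succeeds again.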

9.7 heap_stats_sync_bestspace — the bounded rebuild

When the hints run dry, this rebuilds best[] and refreshes the counters by walking the chain — but only a slice, the answer to “without scanning the whole heap.” The cap is max_iterations = MAX(MIN((int)(num_pages*0.2), heap_Find_best_page_limit /*100*/), HEAP_NUM_BEST_SPACESTATS /*10*/): at most min(20% of pages, 100), never fewer than 10. The seed cascade: (1) cache enabled → search_all=true from full_search_vpid, the resume cursor written forward each step so syncs stride the file 20%-per-call (scan-level write-spreading); (2) else num_high_best > 0 → just behind head; (3) else num_second_best > 0 → pop a reservoir page (heap_stats_get_second_best); (4) else search_all=true from full_search_vpid, or if NULL the first page (hfid->hpgid) with can_cycle=false.
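The cap formula is worth evaluating for a few heap sizes. The function below is a sketch that restates the MAX(MIN(...), ...) expression quoted above with hypothetical names; the two constants carry the values given in the text (100 and 10).

```c
#include <assert.h>

#define TOY_FIND_BEST_PAGE_LIMIT 100  /* heap_Find_best_page_limit in the text */
#define TOY_NUM_BEST_SPACESTATS 10    /* HEAP_NUM_BEST_SPACESTATS in the text */

/* max_iterations = MAX (MIN (20% of pages, 100), 10) */
static int
toy_sync_max_iterations (int num_pages)
{
  int cap = (int) (num_pages * 0.2);

  if (cap > TOY_FIND_BEST_PAGE_LIMIT)
    cap = TOY_FIND_BEST_PAGE_LIMIT;
  if (cap < TOY_NUM_BEST_SPACESTATS)
    cap = TOY_NUM_BEST_SPACESTATS;
  return cap;
}
```

For a 30-page heap the floor of 10 wins over the 20% figure (6), a 250-page heap gets 50 iterations, and anything from 500 pages upward is pinned at 100.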

stateDiagram-v2
  [*] --> Seed
  Seed --> Walk: start_vpid by cascade
  Walk --> Inspect: fix page READ, prevent dealloc
  Inspect --> Walk: free_space <= 30 pct, skip
  Inspect --> Record: free_space > 30 pct
  Record --> Walk: best[] not full, store and num_high_best++
  Record --> Walk: best[] full, num_other_best++
  Walk --> Commit: iterations > max OR num_high_best == 10 OR at stopat_vpid
  Commit --> [*]: rebuild head, num_high_best, counters

Figure 9-3: heap_stats_sync_bestspace, bounded (scan_all=false) sync. On the cap it sets iterate_all and breaks; a seeding slot that yielded nothing (start_pos != -1 && num_high_best == 0) is NULLed so the next call won’t re-seed from a dead spot. Per page, spage_collect_statistics accumulates the counters; search_all persists full_search_vpid; can_cycle wraps to the heap head.

The commit phase: an early goto end if a bounded scan found nothing and second_best is empty; otherwise NULL the unused best[] tail and set head/num_high_best. If scan_all is set or the scanned num_pages >= the stored value, all counters are overwritten wholesale; otherwise a conservative merge applies (num_other_high_best -= num_high_best, bump to num_other_best if larger, overwrite record counts only if the partial scan saw more).

Invariant — sync never logs. Its header comment says so; the caller wraps it in log_skip_logging. A crash mid-sync leaves stale estimates that the next insert re-syncs — hence no undo data, and readers tolerate the chain walk (OLD_PAGE_PREVENT_DEALLOC, PGBUF_LATCH_READ, no record locks).

9.8 Allocation fallback: heap_vpid_alloc vs heap_alloc_new_page

When reuse fails, the orchestrator calls heap_vpid_alloc — the stats-aware allocator. It file_allocs a page, links it after last_vpid (chain logged RVHF_CHAIN), bumps last_vpid/num_pages, then installs it at head (which it advances). The three install branches:

// heap_vpid_alloc -- src/storage/heap_file.c
best = heap_hdr->estimates.head;
heap_hdr->estimates.head = HEAP_STATS_NEXT_BEST_INDEX (best);
if (VPID_ISNULL (&heap_hdr->estimates.best[best].vpid))
heap_hdr->estimates.num_high_best++; /* slot was empty */
else if (heap_hdr->estimates.best[best].freespace > HEAP_DROP_FREE_SPACE)
{ heap_hdr->estimates.num_other_high_best++; /* evict good page to reservoir */
heap_stats_put_second_best (heap_hdr, &heap_hdr->estimates.best[best].vpid); }
heap_hdr->estimates.best[best].vpid = vpid;
heap_hdr->estimates.best[best].freespace = DB_PAGESIZE; /* fresh page is all free */

Unlike the estimate mutations, the chain link and RVHF_STATS here are logged inside a log_sysop_start/commit — losing a fresh page from the chain would leak storage (Chapter 10). heap_alloc_new_page is the bare allocator (takeaway 7): file_alloc with NULL prev/next links, attach a watcher, touch no estimates.

  1. Two tiers, in order: heap_stats_find_page_in_bestspace probes the in-memory heap_Bestspace cache, then the on-disk best[10] ring; both self-heal stale entries on touch.
  2. heap_stats_find_best_page orchestrates under a held header WRITE latch: bump estimates, scan hints, conditionally sync, else heap_vpid_alloc. The gate (try_find >= 2 OR ratio < 0.1) is the storage-vs-CPU knob.
  3. Nothing here is logged for correctness — estimate mutations run under log_skip_logging, sync logs nothing; only heap_vpid_alloc logs the chain link and RVHF_STATS.
  4. HEAP_DROP_FREE_SPACE (30%) is the admission floor; unfill_space is update headroom, folded into needed_space then excluded from the returned freespace.
  5. Bounded, resumable scanning: a sync inspects min(20% of pages, 100), seeded from full_search_vpid and advancing that cursor, so syncs stride the file without blocking an insert.
  6. Second-best reservoir spreads writes: only every 1000th evicted-but-good page is admitted (num_substitutions % 1000), so a bulk-emptied region is reused sparsely.
  7. heap_vpid_alloc (stats-aware) vs heap_alloc_new_page (bare): the former installs the page as best[head] and maintains last_vpid/num_pages; the latter touches no estimates and serves callers that chain pages themselves.

Chapter 10: Crash Recovery and the Redo Undo Log Paths

This chapter answers the question every prior flow deferred: what happens after a crash? It traces the redo/undo handler each operation pins, shows how the undo image of an UPDATE is the exact bytes prev_version_lsa points at (tying recovery back to Chapter 6), and dissects the recovery-only slotted-page primitives. This is the edge path: it never invents a new record state, only reconstructs states the forward path produced. For WAL / ARIES theory see cubrid-heap-manager.md §“Durability and the Write-Ahead Log”.

RV_fun[] in recovery.c maps each LOG_RCVINDEX to an {undofun, redofun, ...} tuple. The driver calls the matching function with a LOG_RCV carrying rcv->pgptr (fixed page), rcv->offset (slotid, possibly OR-ed with a vacuum flag), rcv->length / rcv->data (payload), and rcv->mvcc_id. Two facts shape every handler: one redo handler serves physically identical operations (heap_rv_redo_insert redoes both RVHF_INSERT and RVHF_INSERT_NEWHOME), and undo is the redo of the inverse — the heap logs no separate undo image for insert/delete, so slotid plus the redo payload reverses it.

flowchart TB
  LOG["log record\nrcvindex + LOG_RCV{pgptr, offset, data, mvcc_id}"]
  DISP["recovery driver\nlooks up RV_fun[rcvindex]"]
  LOG --> DISP
  DISP -->|redo pass| REDO["redofun\nheap_rv_redo_* / heap_rv_mvcc_redo_*"]
  DISP -->|undo pass| UNDO["undofun\nheap_rv_undo_* / heap_rv_mvcc_undo_*"]
  REDO --> SPI["spage_insert_for_recovery /\nspage_update / spage_delete"]
  UNDO --> SPD["spage_delete_for_recovery /\nspage_update"]
  SPI --> DIRTY["pgbuf_set_dirty"]
  SPD --> DIRTY

Figure 10-1 — How a log record reaches a slotted-page primitive.

The forward path (heap_mvcc_log_insert, heap_mvcc_log_delete, heap_mvcc_log_home_change_on_delete) OR-s HEAP_RV_FLAG_VACUUM_STATUS_CHANGE (0x8000) into p_addr->offset when the operation also flipped the page’s vacuum status. Every MVCC handler therefore opens with the same masking lines, elided later as // ... mask vacuum flag, see 10.2 ...:

// shared preamble in every MVCC handler -- src/storage/heap_file.c
if (slotid & HEAP_RV_FLAG_VACUUM_STATUS_CHANGE) { vacuum_status_change = true; }
slotid = slotid & (~HEAP_RV_FLAG_VACUUM_STATUS_CHANGE); /* recover real slotid */

The bit decides whether to propagate the status change to the page chain (Chapter 8); the mask recovers the real slotid.

Invariant — the recovered slotid never carries the vacuum flag into a slotted-page call. A forgotten mask gives a slotid ≥ 32768 that fails spage_find_slot. The record-rebuilding handlers (heap_rv_mvcc_redo_insert, heap_rv_undoredo_update, heap_rv_redo_update_and_update_chain, heap_rv_mvcc_undo_delete) make assert (slotid > 0) the tripwire; heap_rv_undo_insert masks without asserting because it only deletes, and spage_delete_for_recovery rejects a bad slotid itself.
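The pack and unpack halves of the flag can be written as a pair of helpers. This is an illustrative sketch: the function names are hypothetical, and the 16-bit unsigned type is an assumption chosen to make the bit layout explicit; only the 0x8000 flag value comes from the text above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_RV_FLAG_VACUUM_STATUS_CHANGE 0x8000

/* Forward path: OR the flag into the logged offset. */
static uint16_t
toy_pack_offset (uint16_t slotid, bool vacuum_status_change)
{
  assert ((slotid & TOY_RV_FLAG_VACUUM_STATUS_CHANGE) == 0);
  return vacuum_status_change
         ? (uint16_t) (slotid | TOY_RV_FLAG_VACUUM_STATUS_CHANGE)
         : slotid;
}

/* Recovery path: read the flag, then mask it off to recover the real slotid. */
static uint16_t
toy_unpack_slotid (uint16_t offset, bool *vacuum_status_change)
{
  *vacuum_status_change = (offset & TOY_RV_FLAG_VACUUM_STATUS_CHANGE) != 0;
  return (uint16_t) (offset & ~TOY_RV_FLAG_VACUUM_STATUS_CHANGE);
}
```

Slot 5 with a status change is logged as 0x8005; skipping the mask on the way back would hand 32773 to the slotted-page layer instead of 5.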

10.3 INSERT recovery — heap_rv_redo_insert and heap_rv_mvcc_redo_insert

Non-MVCC redo (heap_rv_redo_insert). Payload is [INT16 record_type][record bytes]:

// heap_rv_redo_insert -- src/storage/heap_file.c
recdes.type = *(INT16 *) (rcv->data); /* <- type prefix; recdes points past it */
if (recdes.type == REC_ASSIGN_ADDRESS) /* <- reserved-only slot: data IS byte count */
{ recdes.area_size = recdes.length = *(INT16 *) recdes.data; recdes.data = NULL; }
sp_success = spage_insert_for_recovery (thread_p, rcv->pgptr, slotid, &recdes); /* fail: fatal er_set */

An ordinary record copies its bytes into the slot; REC_ASSIGN_ADDRESS (Chapter 4) reserves the space without copying. This handler is the redo for RVHF_INSERT / RVHF_INSERT_NEWHOME.
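Decoding the [INT16 record_type][record bytes] payload can be sketched as follows. The struct, function, and the two numeric type codes are hypothetical placeholders (the real REC_* values are not given in this document), and memcpy is used for the 16-bit reads so the sketch does not depend on payload alignment.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_REC_HOME 1            /* placeholder value, not CUBRID's */
#define TOY_REC_ASSIGN_ADDRESS 2  /* placeholder value, not CUBRID's */

struct toy_recdes
{
  int16_t type;
  const char *data;  /* NULL when the slot is only reserved */
  int length;
};

static void
toy_decode_insert_payload (const char *payload, int payload_len,
                           struct toy_recdes *out)
{
  memcpy (&out->type, payload, sizeof (out->type));  /* type prefix */
  out->data = payload + sizeof (out->type);
  out->length = payload_len - (int) sizeof (out->type);

  if (out->type == TOY_REC_ASSIGN_ADDRESS)
    {
      int16_t reserved;  /* the body is a byte count, not record bytes */

      memcpy (&reserved, out->data, sizeof (reserved));
      out->length = reserved;
      out->data = NULL;
    }
}
```

An ordinary payload yields a descriptor pointing just past the two-byte prefix, while a reserve-only payload yields a length and no data pointer, which is the cue to reserve without copying.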

MVCC redo (heap_rv_mvcc_redo_insert). The insert MVCCID is not in the payload — it lives in rcv->mvcc_id, so the handler rebuilds the header to keep visibility data (Chapter 3) correct:

// heap_rv_mvcc_redo_insert -- src/storage/heap_file.c
// ... mask vacuum flag, see 10.2 ...
if (record_type == REC_BIGONE) /* <- overflow: no inline header rebuild */
{ HEAP_SET_RECORD (&recdes, ..., REC_BIGONE, rcv->data + sizeof (record_type)); }
else /* inline record: rebuild the header */
{ MVCC_SET_INSID (&mvcc_rec_header, rcv->mvcc_id); /* <- INSID from log header, not page */
or_mvcc_add_header (&recdes, &mvcc_rec_header, ...); }
spage_insert_for_recovery (thread_p, rcv->pgptr, slotid, &recdes);
heap_page_rv_chain_update (thread_p, rcv->pgptr, rcv->mvcc_id, vacuum_status_change); /* re-apply chain */

Both branches insert, then heap_page_rv_chain_update re-applies the page-chain MVCCID and saved vacuum-status change.

Invariant — a redone MVCC insert carries the same INSID it had before the crash. INSID comes from rcv->mvcc_id, not page bytes, so redo is idempotent and snapshot-correct; stale bytes would diverge from committed history.

10.4 INSERT undo — heap_rv_undo_insert is a delete

An uncommitted insert must vanish on rollback: delete the slot and, only when the system is fully up, repair free-space stats:

// heap_rv_undo_insert -- src/storage/heap_file.c
if (LOG_ISRESTARTED ()) /* <- capture pre-delete free space, only once restarted */
{ free_space = spage_get_free_space_without_saving (thread_p, rcv->pgptr, NULL); }
slotid = rcv->offset & (~HEAP_RV_FLAG_VACUUM_STATUS_CHANGE);
(void) spage_delete_for_recovery (thread_p, rcv->pgptr, slotid); /* <- reuse the slot */
pgbuf_set_dirty (thread_p, rcv->pgptr, DONT_FREE);
if (LOG_ISRESTARTED ()) /* <- look up HFID (best-effort), fix stats */
{ if (heap_get_class_oid_from_page (...) != NO_ERROR || heap_get_class_info (...) != NO_ERROR) goto end;
heap_stats_update (thread_p, rcv->pgptr, &hfid, free_space); }
end: ; /* falls through to return NO_ERROR */

During crash recovery only the slot delete runs (the best-space stats, Chapter 9, are not yet trustworthy). After restart, a failure of either heap_get_class_* lookup jumps to end and the error is swallowed, because stats are only a hint. The OID is reused: an uncommitted insert was never permanent. Undo for RVHF_INSERT, RVHF_MVCC_INSERT, RVHF_INSERT_NEWHOME, RVHF_MVCC_REDISTRIBUTE.

10.5 UPDATE recovery — heap_rv_undoredo_update and the prev_version tie-in

UPDATE recovery is unusual: one function serves both undo and redo, since either direction overwrites slot slotid with the payload bytes. heap_rv_redo_update is a one-line wrapper; heap_rv_undo_update calls the same core then adds a vacuum check.

// heap_rv_undoredo_update -- src/storage/heap_file.c
// ... mask vacuum flag (10.2); assert (slotid > 0); point recdes past the type prefix ...
if (recdes.area_size <= 0) { sp_success = SP_SUCCESS; } /* <- empty image: header-only change */
else if (heap_update_physical (thread_p, rcv->pgptr, slotid, &recdes) != NO_ERROR)
{ assert_release (false); return ER_FAILED; } /* heap_update_physical = spage_update + type fix-up */

Why the same payload works both ways — the prev_version_lsa link. On the forward UPDATE path, before logging, CUBRID stamps the new record’s header with the LSA of the old record’s undo log record:

// heap_update_set_prev_version -- src/storage/heap_file.c
if (recdes.type == REC_HOME)
{ or_mvcc_set_log_lsa_to_record (&recdes, prev_version_lsa); } /* <- LSA of the old record's undo log */

So the UPDATE’s undo image (the old bytes) lives at exactly the LSA the new record’s prev_version_lsa records. The version-chain reader (Chapter 5) fetches the undo record there — precisely what heap_rv_undoredo_update would replay on rollback: recovery and MVCC time-travel read the same physical undo record, no second copy. (heap_update_bigone wires the same link for overflow.)

Invariant — prev_version_lsa equals the LSA of the undo log record holding the predecessor’s bytes. Enforced by heap_update_set_prev_version and heap_update_bigone. Drift means rollback restores wrong bytes or a snapshot read walks to garbage — silent corruption, not a crash.
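The "one undo record, two readers" idea can be modeled with an array standing in for the log. This is a toy under stated assumptions: an LSA is just an array index, a record is a single int, and all names (`toy_log`, `toy_update`, `toy_undo_update`, `toy_read_older`) are hypothetical. It only illustrates that rollback and an older-version read consume the same stored image.

```c
#include <assert.h>

#define TOY_NULL_LSA (-1)
#define TOY_LOG_CAPACITY 16

/* One undo log record: the predecessor's value and its own back-link. */
struct toy_undo { int old_value; int prev_version_lsa; };

struct toy_log { struct toy_undo recs[TOY_LOG_CAPACITY]; int next_lsa; };

struct toy_record { int value; int prev_version_lsa; };

/* Forward UPDATE: append the old image, then stamp its LSA into the new record. */
static void
toy_update (struct toy_log *log, struct toy_record *rec, int new_value)
{
  int lsa = log->next_lsa++;

  assert (lsa < TOY_LOG_CAPACITY);
  log->recs[lsa].old_value = rec->value;
  log->recs[lsa].prev_version_lsa = rec->prev_version_lsa;
  rec->value = new_value;
  rec->prev_version_lsa = lsa;
}

/* Rollback of the latest update: replay the undo image at the record's link. */
static void
toy_undo_update (const struct toy_log *log, struct toy_record *rec)
{
  const struct toy_undo *u = &log->recs[rec->prev_version_lsa];

  rec->value = u->old_value;
  rec->prev_version_lsa = u->prev_version_lsa;
}

/* Older-version read: walk back `steps` versions through the same undo records. */
static int
toy_read_older (const struct toy_log *log, const struct toy_record *rec, int steps)
{
  int value = rec->value;
  int lsa = rec->prev_version_lsa;

  while (steps-- > 0 && lsa != TOY_NULL_LSA)
    {
      value = log->recs[lsa].old_value;
      lsa = log->recs[lsa].prev_version_lsa;
    }
  return value;
}
```

After two updates (10 to 20 to 30), reading one step back and undoing the last update both land on 20, and both got there through the undo record at LSA 1.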

10.6 The MVCC-delete redo — heap_rv_redo_update_and_update_chain

An MVCC delete stamps a delete-MVCCID into the header (Chapter 7) rather than erasing the record — physically an update. So after masking the vacuum flag (§10.2), heap_rv_redo_update_and_update_chain is literally heap_rv_redo_update (thread_p, rcv) (the §10.5 core) then heap_page_rv_chain_update (..., rcv->mvcc_id, vacuum_status_change); an inner error propagates via ASSERT_ERROR (). It is the redo for both RVHF_MVCC_DELETE_MODIFY_HOME and RVHF_UPDATE_NOTIFY_VACUUM. The undo side, heap_rv_undo_update, restores the old header bytes and runs vacuum_rv_check_at_undo (REC_HOME / REC_NEWHOME) so a rolled-back delete is not left falsely visible to vacuum.

10.7 DELETE recovery — non-MVCC and MVCC undo

Non-MVCC delete. Redo (heap_rv_redo_delete) is a bare spage_delete; undo (heap_rv_undo_delete) is the mirror of an insert:

// heap_rv_undo_delete -- src/storage/heap_file.c
error_code = heap_rv_redo_insert (thread_p, rcv); /* <- re-insert the deleted record */
if (error_code != NO_ERROR) { return error_code; }
recdes_type = *(INT16 *) (rcv->data);
if (recdes_type == REC_NEWHOME) /* <- only REC_NEWHOME needs the guard */
{ vacuum_rv_check_at_undo (thread_p, rcv->pgptr, slotid, recdes_type); } /* fail: assert_release+ER_FAILED */

Re-inserting works because the redo payload still carries type and bytes; only REC_NEWHOME also runs the vacuum atomicity check.

MVCC delete undo (heap_rv_mvcc_undo_delete) clears the delete-MVCCID flag rather than deleting anything:

// heap_rv_mvcc_undo_delete -- src/storage/heap_file.c
slotid = rcv->offset & (~HEAP_RV_FLAG_VACUUM_STATUS_CHANGE);
spage_get_record (..., slotid, &rebuild_record, COPY); /* read current page bytes */
or_mvcc_get_header (&rebuild_record, &mvcc_rec_header);
assert (MVCC_IS_FLAG_SET (&mvcc_rec_header, OR_MVCC_FLAG_VALID_DELID)); /* must have been deleted */
MVCC_CLEAR_FLAG_BITS (&mvcc_rec_header, OR_MVCC_FLAG_VALID_DELID); /* <- un-delete, then or_mvcc_set_header */
spage_update (..., slotid, &rebuild_record); /* each step: assert_release(false)+ER_FAILED on fail */

The handler reads no record image from the log: the undo image is “the page minus the DELID flag.” Undo for RVHF_MVCC_DELETE_REC_HOME / ..._REC_NEWHOME.

10.8 Page-level recovery — heap_rv_redo_newpage and heap_rv_redo_reuse_page

New page (heap_rv_redo_newpage) redoes a page’s first state — set type, spage_initialize with recovery-space saving on, insert the header/chain record at the reserved slot:

// heap_rv_redo_newpage -- src/storage/heap_file.c
spage_initialize (thread_p, rcv->pgptr, heap_get_spage_type (), HEAP_MAX_ALIGN, SAFEGUARD_RVSPACE);
sp_success = spage_insert (thread_p, rcv->pgptr, &recdes, &slotid); /* recdes.type = REC_HOME */
if (sp_success != SP_SUCCESS || slotid != HEAP_HEADER_AND_CHAIN_SLOTID)
{ er_set (ER_FATAL_ERROR_SEVERITY, ...); return er_errid (); } /* header/chain must be slot 0, else fatal */

Reuse page (heap_rv_redo_reuse_page) bulk-deletes all records when a page is recycled for a different class:

// heap_rv_redo_reuse_page -- src/storage/heap_file.c
const bool is_header_page = ((rcv->offset != 0) ? true : false);
(void) heap_delete_all_page_records (thread_p, &vpid, rcv->pgptr); /* idempotent: redo may run twice */
if (!is_header_page) /* header page skips, fixed later via RVHF_STATS */
{ COPY_OID (&(chain->class_oid), (OID *) (rcv->data)); ... } /* <- re-stamp class, reset max_mvccid, vacuum=NONE */

heap_delete_all_page_records must be idempotent because recovery may replay this redo more than once; header pages skip the chain rewrite (fixed later via RVHF_STATS).

spage_insert_for_recovery and spage_delete_for_recovery exist only because of recovery: where a forward spage_insert picks a slot, recovery must place a record at a specific, previously assigned slotid so OIDs stay stable across a crash.

// spage_insert_for_recovery -- src/storage/slotted_page.c
if (anchor_type != ANCHORED && anchor_type != ANCHORED_DONT_REUSE_SLOTS)
{ return spage_insert_at (thread_p, page_p, slot_id, record_descriptor_p); } /* unanchored: shift */
if (slot_id < page_header_p->num_slots) /* slot exists: assert empty, then refill */
{ slot_p->record_type = REC_DELETED_WILL_REUSE; } /* <- keeps the OID at this slotid */
spage_find_empty_slot_at (thread_p, page_p, slot_id, ...);
if (record_descriptor_p->type != REC_ASSIGN_ADDRESS) /* <- ASSIGN_ADDRESS reserves, copies nothing */
{ memcpy ((char *) page_p + slot_p->offset_to_record, record_descriptor_p->data, ...); }

// spage_delete_for_recovery -- src/storage/slotted_page.c
if (spage_delete (thread_p, page_p, slot_id) != slot_id) { return NULL_SLOTID; }
if (page_header_p->anchor_type == ANCHORED_DONT_REUSE_SLOTS) /* normal delete left REC_MARKDELETED */
{
slot_p = spage_find_slot (page_p, page_header_p, slot_id, false);
if (slot_p->offset_to_record == SPAGE_EMPTY_OFFSET && slot_p->record_type == REC_MARKDELETED)
{ slot_p->record_type = REC_DELETED_WILL_REUSE; pgbuf_set_dirty (...); } /* <- override no-reuse */
}

Invariant — recovery never burns a slot for an uncommitted OID. spage_delete_for_recovery forces REC_DELETED_WILL_REUSE even on ANCHORED_DONT_REUSE_SLOTS pages; leaving REC_MARKDELETED would leak a slot per rolled-back insert and inflate OIDs after crashes.

10.10 The is_saving / spage_save_space undo-space reservation


SPAGE_HEADER::is_saving (Chapter 2; set at spage_initialize via SAFEGUARD_RVSPACE = true) exists purely for recovery: when a transaction frees space, the page must reserve that space so a later rollback can re-grow the record. spage_delete calls spage_save_space only when is_saving is set, and spage_save_space itself returns early, without recording an entry, through three guard checks:

// spage_save_space -- src/storage/slotted_page.c
if (space == 0 || log_is_in_crash_recovery ()) { return NO_ERROR; }
if (VACUUM_IS_THREAD_VACUUM_WORKER (thread_p)) { return NO_ERROR; } /* vacuum never rolls back */
if (space < 0 || !logtb_is_active (thread_p, tranid)) { return NO_ERROR; }
// ... otherwise: find_or_insert SPAGE_SAVE_HEAD for VPID, extend SPAGE_SAVE_ENTRY for tranid ...

Only an active forward transaction freeing positive space records an entry, keyed by VPID in spage_Saving_hashmap and threaded onto the TDES for release at transaction end (§10.4’s heap_rv_undo_insert reads through these reservations).

Invariant — freed space is reserved for the freeing transaction until it commits or aborts, but never during recovery. Enforced by the is_saving gate plus the log_is_in_crash_recovery short-circuit; otherwise the hashmap fills with phantom entries for dead transactions.

  1. Recovery is table-driven and symmetric. RV_fun[] pins each RVHF_* to an undo/redo pair; undo-insert deletes, undo-delete inserts, and UPDATE shares one heap_rv_undoredo_update core both ways.
  2. The vacuum bit rides in rcv->offset. Handlers mask HEAP_RV_FLAG_VACUUM_STATUS_CHANGE (0x8000); record-rebuilders also assert (slotid > 0), while heap_rv_undo_insert only masks.
  3. MVCC redo rebuilds the header from the log. INSID comes from rcv->mvcc_id, not page bytes, making redo idempotent and snapshot-correct.
  4. The UPDATE undo image is the version-chain predecessor. heap_update_set_prev_version / heap_update_bigone stamp the new record’s prev_version_lsa with the old record’s undo-record LSA, so MVCC reads (Chapter 5) and rollback replay the same physical record.
  5. Recovery has its own slotted-page primitives. spage_insert_for_recovery keeps OIDs stable at a specific slotid; spage_delete_for_recovery forces REC_DELETED_WILL_REUSE so a rolled-back insert never leaks a slot.
  6. is_saving reserves freed space for rollback, but stands down during recovery via the log_is_in_crash_recovery short-circuit.
  7. This path reconstructs states; it never creates new ones — every handler is the mechanical inverse or replay of a Chapter 4–9 operation.
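Takeaway 2 can be made concrete with a pack/unpack pair. This is a minimal sketch: the 0x8000 mask comes from the text above, while the `toy_*` helpers and the use of `uint16_t` for the offset field are assumptions chosen to keep the bit arithmetic unsigned.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Mirrors HEAP_RV_FLAG_VACUUM_STATUS_CHANGE (0x8000) as described above. */
#define TOY_RV_FLAG_VACUUM_STATUS_CHANGE 0x8000u

/* Combine a slot id and the vacuum-status-change bit into one offset value. */
static uint16_t
toy_pack_offset (uint16_t slotid, bool vacuum_status_change)
{
  assert ((slotid & TOY_RV_FLAG_VACUUM_STATUS_CHANGE) == 0);   /* bit must be free */
  return (uint16_t) (slotid
                     | (vacuum_status_change ? TOY_RV_FLAG_VACUUM_STATUS_CHANGE : 0u));
}

/* Recover the slot id by masking the flag bit off. */
static uint16_t
toy_unpack_slotid (uint16_t offset)
{
  return (uint16_t) (offset & ~TOY_RV_FLAG_VACUUM_STATUS_CHANGE);
}

/* Report whether the log record also signals a vacuum status change. */
static bool
toy_unpack_vacuum_change (uint16_t offset)
{
  return (offset & TOY_RV_FLAG_VACUUM_STATUS_CHANGE) != 0;
}
```

A handler that forgets the mask would read a slot id of 0x8000 or more whenever the bit is set, which is why every handler masks before using the value.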

The following are line numbers as observed on 2026-06-08; symbols are the canonical anchor and line numbers are hints that decay.

| Symbol | File | Line |
| --- | --- | --- |
| OR_MVCC_DELETE_ID_OFFSET | src/base/object_representation.h | 486 |
| OR_MVCC_MAX_HEADER_SIZE | src/base/object_representation_constants.h | 142 |
| OR_MVCC_MIN_HEADER_SIZE | src/base/object_representation_constants.h | 145 |
| OR_MVCC_FLAG_MASK | src/base/object_representation_constants.h | 160 |
| OR_MVCC_FLAG_VALID_INSID | src/base/object_representation_constants.h | 165 |
| OR_MVCC_FLAG_VALID_DELID | src/base/object_representation_constants.h | 168 |
| OR_MVCC_FLAG_VALID_PREV_VERSION | src/base/object_representation_constants.h | 171 |
| OR_MVCC_REPID_MASK | src/base/object_representation_constants.h | 173 |
| or_mvcc_get_header | src/base/object_representation_sr.c | 4237 |
| or_mvcc_set_header | src/base/object_representation_sr.c | 4296 |
| mvcc_header_size_lookup | src/object/object_representation.c | 70 |
| or_header_size | src/object/object_representation.c | 5757 |
| vacuum_heap_page | src/query/vacuum.c | 1577 |
| vacuum_is_mvccid_vacuumed | src/query/vacuum.c | 7463 |
| HEAP_BESTSPACE_SYNC_THRESHOLD | src/storage/heap_file.c | 90 |
| HEAP_MVCC_SET_HEADER_MAXIMUM_SIZE | src/storage/heap_file.c | 129 |
| HEAP_UPDATE_IS_MVCC_OP | src/storage/heap_file.c | 151 |
| HEAP_NUM_BEST_SPACESTATS | src/storage/heap_file.c | 182 |
| HEAP_STATS_NEXT_BEST_INDEX | src/storage/heap_file.c | 185 |
| HEAP_STATS_PREV_BEST_INDEX | src/storage/heap_file.c | 187 |
| heap_hdr_stats | src/storage/heap_file.c | 191 |
| HEAP_PAGE_FLAG_VACUUM_STATUS_MASK | src/storage/heap_file.c | 240 |
| HEAP_PAGE_SET_VACUUM_STATUS | src/storage/heap_file.c | 244 |
| HEAP_PAGE_GET_VACUUM_STATUS | src/storage/heap_file.c | 262 |
| heap_chain | src/storage/heap_file.c | 270 |
| struct heap_chain | src/storage/heap_file.c | 270 |
| heap_stats_bestspace_cache | src/storage/heap_file.c | 469 |
| heap_Find_best_page_limit | src/storage/heap_file.c | 488 |
| heap_Bestspace | src/storage/heap_file.c | 499 |
| HEAP_RV_FLAG_VACUUM_STATUS_CHANGE | src/storage/heap_file.c | 514 |
| heap_stats_add_bestspace | src/storage/heap_file.c | 1024 |
| heap_is_big_length | src/storage/heap_file.c | 1330 |
| heap_get_spage_type | src/storage/heap_file.c | 1353 |
| heap_is_reusable_oid | src/storage/heap_file.c | 1364 |
| heap_stats_get_min_freespace | src/storage/heap_file.c | 2917 |
| heap_stats_update | src/storage/heap_file.c | 2966 |
| heap_stats_update_internal | src/storage/heap_file.c | 3020 |
| heap_stats_put_second_best | src/storage/heap_file.c | 3142 |
| heap_stats_get_second_best | src/storage/heap_file.c | 3184 |
| heap_stats_find_page_in_bestspace | src/storage/heap_file.c | 3272 |
| heap_stats_find_best_page | src/storage/heap_file.c | 3519 |
| heap_stats_sync_bestspace | src/storage/heap_file.c | 3728 |
| heap_vpid_alloc | src/storage/heap_file.c | 4284 |
| heap_remove_page_on_vacuum | src/storage/heap_file.c | 4698 |
| heap_vpid_next | src/storage/heap_file.c | 5038 |
| heap_assign_address | src/storage/heap_file.c | 6015 |
| xheap_reclaim_addresses | src/storage/heap_file.c | 6227 |
| heap_ovf_insert | src/storage/heap_file.c | 6569 |
| heap_ovf_update | src/storage/heap_file.c | 6597 |
| heap_get_if_diff_chn | src/storage/heap_file.c | 7400 |
| heap_prepare_get_context | src/storage/heap_file.c | 7512 |
| heap_get_mvcc_header | src/storage/heap_file.c | 7747 |
| heap_get_record_data_when_all_ready | src/storage/heap_file.c | 7834 |
| heap_next_internal | src/storage/heap_file.c | 7902 |
| heap_rv_redo_newpage | src/storage/heap_file.c | 16203 |
| heap_rv_redo_insert | src/storage/heap_file.c | 16321 |
| heap_mvcc_log_insert | src/storage/heap_file.c | 16371 |
| heap_rv_mvcc_redo_insert | src/storage/heap_file.c | 16442 |
| heap_rv_undo_insert | src/storage/heap_file.c | 16536 |
| heap_rv_redo_delete | src/storage/heap_file.c | 16589 |
| heap_mvcc_log_delete | src/storage/heap_file.c | 16610 |
| heap_rv_mvcc_undo_delete | src/storage/heap_file.c | 16663 |
| heap_rv_redo_mark_reusable_slot | src/storage/heap_file.c | 16929 |
| heap_rv_undo_delete | src/storage/heap_file.c | 16946 |
| heap_rv_undo_update | src/storage/heap_file.c | 16981 |
| heap_rv_redo_update | src/storage/heap_file.c | 17018 |
| heap_rv_undoredo_update | src/storage/heap_file.c | 17029 |
| heap_rv_redo_reuse_page | src/storage/heap_file.c | 17065 |
| heap_next | src/storage/heap_file.c | 19427 |
| heap_get_mvcc_rec_header_from_overflow | src/storage/heap_file.c | 19540 |
| heap_set_mvcc_rec_header_on_overflow | src/storage/heap_file.c | 19566 |
| heap_set_mvcc_rec_header_on_overflow | src/storage/heap_file.c | 19567 |
| heap_get_bigone_content | src/storage/heap_file.c | 19610 |
| heap_mvcc_log_home_change_on_delete | src/storage/heap_file.c | 19689 |
| heap_mvcc_log_home_no_change | src/storage/heap_file.c | 19724 |
| heap_rv_redo_update_and_update_chain | src/storage/heap_file.c | 19745 |
| heap_clear_operation_context | src/storage/heap_file.c | 20231 |
| heap_build_forwarding_recdes | src/storage/heap_file.c | 20516 |
| heap_insert_adjust_recdes_header | src/storage/heap_file.c | 20539 |
| heap_insert_adjust_recdes_header | src/storage/heap_file.c | 20540 |
| heap_update_adjust_recdes_header | src/storage/heap_file.c | 20671 |
| heap_insert_handle_multipage_record | src/storage/heap_file.c | 20834 |
| heap_get_insert_location_with_lock | src/storage/heap_file.c | 20885 |
| heap_find_location_and_insert_rec_newhome | src/storage/heap_file.c | 21022 |
| heap_insert_newhome | src/storage/heap_file.c | 21105 |
| heap_insert_physical | src/storage/heap_file.c | 21169 |
| heap_log_insert_physical | src/storage/heap_file.c | 21229 |
| heap_delete_adjust_header | src/storage/heap_file.c | 21290 |
| heap_delete_bigone | src/storage/heap_file.c | 21389 |
| heap_delete_relocation | src/storage/heap_file.c | 21570 |
| heap_delete_home | src/storage/heap_file.c | 22067 |
| heap_delete_physical | src/storage/heap_file.c | 22388 |
| heap_log_delete_physical | src/storage/heap_file.c | 22428 |
| heap_update_bigone | src/storage/heap_file.c | 22484 |
| heap_update_relocation | src/storage/heap_file.c | 22700 |
| heap_update_home | src/storage/heap_file.c | 23026 |
| heap_update_physical | src/storage/heap_file.c | 23257 |
| heap_create_insert_context | src/storage/heap_file.c | 23358 |
| heap_create_delete_context | src/storage/heap_file.c | 23385 |
| heap_create_update_context | src/storage/heap_file.c | 23412 |
| heap_insert_logical | src/storage/heap_file.c | 23460 |
| heap_delete_logical | src/storage/heap_file.c | 23676 |
| heap_update_logical | src/storage/heap_file.c | 23867 |
| heap_vacuum_all_objects | src/storage/heap_file.c | 24408 |
| heap_page_update_chain_after_mvcc_op | src/storage/heap_file.c | 24785 |
| heap_page_set_vacuum_status_none | src/storage/heap_file.c | 24939 |
| heap_page_get_vacuum_status | src/storage/heap_file.c | 25014 |
| heap_get_visible_version_from_log | src/storage/heap_file.c | 25329 |
| heap_get_visible_version | src/storage/heap_file.c | 25456 |
| heap_scan_get_visible_version | src/storage/heap_file.c | 25494 |
| heap_get_visible_version_internal | src/storage/heap_file.c | 25577 |
| heap_update_set_prev_version | src/storage/heap_file.c | 25689 |
| heap_get_last_version | src/storage/heap_file.c | 25793 |
| heap_prepare_object_page | src/storage/heap_file.c | 25856 |
| heap_clean_get_context | src/storage/heap_file.c | 25904 |
| heap_init_get_context | src/storage/heap_file.c | 25944 |
| heap_alloc_new_page | src/storage/heap_file.c | 26241 |
| HEAP_HEADER_AND_CHAIN_SLOTID | src/storage/heap_file.h | 62 |
| HEAP_ISJUNK_OID | src/storage/heap_file.h | 66 |
| HEAP_SCANCACHE_SET_NODE | src/storage/heap_file.h | 83 |
| HEAP_DROP_FREE_SPACE | src/storage/heap_file.h | 103 |
| heap_bestspace | src/storage/heap_file.h | 120 |
| heap_scancache_node | src/storage/heap_file.h | 127 |
| heap_scancache | src/storage/heap_file.h | 143 |
| HEAP_OPERATION_TYPE | src/storage/heap_file.h | 251 |
| update_inplace_style | src/storage/heap_file.h | 253 |
| HEAP_IS_UPDATE_INPLACE | src/storage/heap_file.h | 262 |
| heap_operation_context | src/storage/heap_file.h | 267 |
| HEAP_PAGE_VACUUM_STATUS | src/storage/heap_file.h | 354 |
| heap_get_context | src/storage/heap_file.h | 362 |
| spage_verify_header | src/storage/slotted_page.c | 346 |
| spage_is_valid_anchor_type | src/storage/slotted_page.c | 375 |
| spage_free_saved_spaces | src/storage/slotted_page.c | 393 |
| spage_save_space | src/storage/slotted_page.c | 488 |
| spage_initialize | src/storage/slotted_page.c | 1094 |
| spage_compact | src/storage/slotted_page.c | 1174 |
| spage_find_free_slot | src/storage/slotted_page.c | 1294 |
| spage_check_space | src/storage/slotted_page.c | 1347 |
| spage_find_empty_slot | src/storage/slotted_page.c | 1396 |
| spage_add_new_slot | src/storage/slotted_page.c | 1568 |
| spage_take_slot_in_use | src/storage/slotted_page.c | 1608 |
| spage_find_empty_slot_at | src/storage/slotted_page.c | 1674 |
| spage_check_record_for_insert | src/storage/slotted_page.c | 1745 |
| spage_insert | src/storage/slotted_page.c | 1769 |
| spage_find_slot_for_insert | src/storage/slotted_page.c | 1801 |
| spage_insert_data | src/storage/slotted_page.c | 1841 |
| spage_insert_at | src/storage/slotted_page.c | 1902 |
| spage_insert_for_recovery | src/storage/slotted_page.c | 1962 |
| spage_is_record_located_at_end | src/storage/slotted_page.c | 2039 |
| spage_reduce_a_slot | src/storage/slotted_page.c | 2057 |
| spage_delete | src/storage/slotted_page.c | 2084 |
| spage_delete_for_recovery | src/storage/slotted_page.c | 2177 |
| spage_check_updatable | src/storage/slotted_page.c | 2223 |
| spage_update_record_in_place | src/storage/slotted_page.c | 2409 |
| spage_update_record_after_compact | src/storage/slotted_page.c | 2465 |
| spage_update | src/storage/slotted_page.c | 2556 |
| spage_reclaim | src/storage/slotted_page.c | 2719 |
| spage_mark_deleted_slot_as_reusable | src/storage/slotted_page.c | 4022 |
| spage_find_slot | src/storage/slotted_page.c | 4609 |
| spage_has_enough_total_space | src/storage/slotted_page.c | 4639 |
| spage_has_enough_contiguous_space | src/storage/slotted_page.c | 4679 |
| spage_vacuum_slot | src/storage/slotted_page.c | 4857 |
| spage_need_compact | src/storage/slotted_page.c | 5275 |
| ANCHORED | src/storage/slotted_page.h | 38 |
| SP_ERROR | src/storage/slotted_page.h | 49 |
| SP_SUCCESS | src/storage/slotted_page.h | 50 |
| SP_DOESNT_FIT | src/storage/slotted_page.h | 51 |
| SAFEGUARD_RVSPACE | src/storage/slotted_page.h | 53 |
| SPAGE_HEADER_FLAG_NONE | src/storage/slotted_page.h | 57 |
| spage_header | src/storage/slotted_page.h | 64 |
| spage_slot | src/storage/slotted_page.h | 88 |
| hfid | src/storage/storage_common.h | 193 |
| record_type | src/storage/storage_common.h | 1145 |
| REC_UNKNOWN | src/storage/storage_common.h | 1148 |
| REC_ASSIGN_ADDRESS | src/storage/storage_common.h | 1151 |
| REC_HOME | src/storage/storage_common.h | 1154 |
| REC_NEWHOME | src/storage/storage_common.h | 1157 |
| REC_RELOCATION | src/storage/storage_common.h | 1160 |
| REC_BIGONE | src/storage/storage_common.h | 1163 |
| REC_MARKDELETED | src/storage/storage_common.h | 1168 |
| REC_DELETED_WILL_REUSE | src/storage/storage_common.h | 1173 |
| REC_4BIT_USED_TYPE_MAX | src/storage/storage_common.h | 1185 |
| mvcc_rec_header | src/transaction/mvcc.h | 38 |
| MVCC_REC_HEADER_INITIALIZER | src/transaction/mvcc.h | 47 |
| MVCC_IS_REC_DELETED_BY | src/transaction/mvcc.h | 130 |
| MVCC_IS_CHN_UPTODATE | src/transaction/mvcc.h | 137 |
| RVHF_INSERT | src/transaction/recovery.c | 279 |

  • cubrid-heap-manager.md — the high-level companion (design intent, theory).
  • Raw analyses under raw/code-analysis/cubrid/storage/heap_manager/.
  • Code: src/storage/heap_file.{c,h}, src/storage/slotted_page.{c,h}.
  • Methodology: knowledge/methodology/code-analysis-detail-doc.md.