CUBRID MVCC — Code-Level Deep Dive
Where this document fits: The high-level analysis cubrid-mvcc.md covers design intent and theoretical background. This document traces every branch and field at the code level. Each chapter is self-contained, but reading in order follows the full lifecycle of a row version and the snapshot that decides its visibility inside the kernel.
Chapter 1: Data-Structure Map
Field-by-field reference for every structure the MVCC module owns. SI theory is not re-derived here; see cubrid-mvcc.md.
| Header | Owns |
|---|---|
| mvcc.h | mvcc_rec_header, mvcc_snapshot, mvcc_info, the three result enums |
| mvcc_active_tran.hpp | mvcc_active_tran (bit-area active-set engine) |
| mvcc_table.hpp | mvcc_trans_status, mvcctable (global coordinator) |
| storage_common.h | MVCCID typedef + sentinel ladder |
1.1 The MVCCID type and its sentinel ladder
One unsigned 64-bit counter; the low values are reserved: 0 and 3 are sentinels, 1 and 2 are unused, and the first real id is 4.

    // MVCCID + sentinels -- src/storage/storage_common.h
    typedef UINT64 MVCCID; /* MVCC ID */
    #define MVCCID_NULL (0)
    #define MVCCID_ALL_VISIBLE ((MVCCID) 3) /* visible for all transactions */
    #define MVCCID_FIRST ((MVCCID) 4)

| Value | Name | Role / why |
|---|---|---|
| 0 | MVCCID_NULL | "no id"; unset/uninitialized field |
| 3 | MVCCID_ALL_VISIBLE | predates any live snapshot; visible to all (set by vacuum stripping the insert id) |
| 4 | MVCCID_FIRST | first id a normal tran can get; counter never falls below |
Predicate macros classify an id (MVCCID_IS_VALID tests != MVCCID_NULL, MVCCID_IS_NORMAL tests >= MVCCID_FIRST); the step macro skips the reserved band:

    // MVCCID_FORWARD -- src/storage/storage_common.h
    #define MVCCID_FORWARD(id) \
      do { (id)++; if ((id) < MVCCID_FIRST) (id) = MVCCID_FIRST; } while (0)

Invariant: no value in the reserved band [0, 3] is ever a live id. MVCCID_FORWARD snaps any post-increment result below 4 back to 4, so a real id never collides with MVCCID_ALL_VISIBLE (3).
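The stepping rule can be checked in isolation with a function form of the macro. This is a minimal sketch: the `k...` constants and the `mvccid_*` function names are hypothetical stand-ins for the macros in storage_common.h, not CUBRID identifiers.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-ins for the macros in storage_common.h.
using MVCCID = std::uint64_t;
constexpr MVCCID kMvccidNull = 0;
constexpr MVCCID kMvccidAllVisible = 3;
constexpr MVCCID kMvccidFirst = 4;

inline bool mvccid_is_valid (MVCCID id) { return id != kMvccidNull; }
inline bool mvccid_is_normal (MVCCID id) { return id >= kMvccidFirst; }

// Function form of MVCCID_FORWARD: increment, then skip the reserved band.
inline MVCCID mvccid_forward (MVCCID id)
{
  ++id;   // unsigned wrap-around is well defined, so UINT64_MAX becomes 0 here
  return id < kMvccidFirst ? kMvccidFirst : id;
}
```

Under these assumptions a counter that wraps past the 64-bit maximum lands on 4 rather than on a sentinel, and MVCCID_ALL_VISIBLE is valid but not normal.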
1.2 mvcc_rec_header — the on-record MVCC stamp
The per-record header stored with each heap object. It is the only MVCC struct on disk; everything else here is in-memory.

    // mvcc_rec_header -- src/transaction/mvcc.h
    struct mvcc_rec_header
    {
      INT32 mvcc_flag:8;
      INT32 repid:24;
      int chn;
      MVCCID mvcc_ins_id;
      MVCCID mvcc_del_id;
      LOG_LSA prev_version_lsa;
    };

| Field | Role / why |
|---|---|
| mvcc_flag:8 | low byte of the packed INT32; OR_MVCC_FLAG_* bits say which fields are present |
| repid:24 | upper 24 bits; schema representation, packed in the same word |
| chn | change counter, written instead of a delete id when no DELID; detects stale caches |
| mvcc_ins_id | birth stamp; did the inserter commit before my snapshot |
| mvcc_del_id | death stamp; did the deleter commit before my snapshot |
| prev_version_lsa | version-chain link; a too-new record walks back through it |
The OR_MVCC_FLAG_* bits (in object_representation_constants.h) tag which
fields are meaningful — VALID_INSID/VALID_DELID/VALID_PREV_VERSION under
OR_MVCC_FLAG_MASK.
Invariant: DELID and chn are logically exclusive, not a physical union. The on-disk header writes only one (selected by OR_MVCC_FLAG_VALID_DELID); MVCC_IS_HEADER_DELID_VALID gates delete-id reads on that flag and MVCCID_IS_VALID, so a chn is never read as a tran id.
The initializer zeroes flags/repid, sets chn to NULL_CHN (-1), both ids to MVCCID_NULL, and the LSA to null:

    // MVCC_REC_HEADER_INITIALIZER -- src/transaction/mvcc.h
    #define MVCC_REC_HEADER_INITIALIZER \
      { 0, 0, NULL_CHN, MVCCID_NULL, MVCCID_NULL, LSA_INITIALIZER }

MVCC_IS_HEADER_ALL_VISIBLE detects the vacuumed-clean case: no insid/delid flags set and mvcc_ins_id == MVCCID_ALL_VISIBLE.
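The two flag-gated checks described above can be sketched as plain predicates. The struct, the function names and especially the flag bit values below are assumptions made for illustration; the real OR_MVCC_FLAG_* constants are defined in object_representation_constants.h and may differ.

```cpp
#include <cassert>
#include <cstdint>

using MVCCID = std::uint64_t;
constexpr MVCCID kMvccidNull = 0;
constexpr MVCCID kMvccidAllVisible = 3;

// Assumed bit values, for illustration only.
constexpr unsigned kFlagValidInsid = 0x1;
constexpr unsigned kFlagValidDelid = 0x2;

struct RecHeader   // simplified, hypothetical stand-in for mvcc_rec_header
{
  unsigned flags;
  int chn;
  MVCCID ins_id;
  MVCCID del_id;
};

// The delete id is trusted only when its flag is set AND the id is not the null sentinel.
inline bool header_delid_valid (const RecHeader &h)
{
  return (h.flags & kFlagValidDelid) != 0 && h.del_id != kMvccidNull;
}

// Vacuumed-clean case: neither id flag set and the insert id equals ALL_VISIBLE.
inline bool header_all_visible (const RecHeader &h)
{
  return (h.flags & (kFlagValidInsid | kFlagValidDelid)) == 0 && h.ins_id == kMvccidAllVisible;
}
```

Gating on the flag first is what keeps a chn value from ever being interpreted as a transaction id.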
1.3 mvcc_active_tran — the bit-area active set
Answers “is MVCCID x still active?” lock-free. Embedded by value in both mvcc_snapshot and mvcc_trans_status. Chapter 4 dissects the probe.

    // mvcc_active_tran private state -- src/transaction/mvcc_active_tran.hpp
    using unit_type = std::uint64_t;
    static const size_t BITAREA_MAX_SIZE = 500;
    static const unit_type ALL_ACTIVE = 0;
    static const unit_type ALL_COMMITTED = (unit_type) -1;

    unit_type *m_bit_area;
    volatile MVCCID m_bit_area_start_mvccid;
    volatile size_t m_bit_area_length;
    MVCCID *m_long_tran_mvccids;
    volatile size_t m_long_tran_mvccids_length;
    bool m_initialized;

| Field | Role / why |
|---|---|
| m_bit_area | 64-bit words, one bit per id: set = completed, clear = active |
| m_bit_area_start_mvccid | id of bit 0; window left edge, offset = id - start |
| m_bit_area_length | length in bits; trims from the left as old ids complete |
| m_long_tran_mvccids | overflow array of still-active ids past the left edge; lets the window slide |
| m_long_tran_mvccids_length | overflow-scan bound |
| m_initialized | guard between default-constructed and live; used by finalize/reset |
ALL_ACTIVE (0) / ALL_COMMITTED ((unit_type)-1) are the all-clear /
all-set word patterns; BITAREA_MAX_SIZE (500 words) caps the window at
500 × 64 = 32 000 ids before overflow.
Invariant: the bit offset stays inside [0, m_bit_area_length). The Chapter 4 probe range-checks before touching m_bit_area, else it reads past the 500-word area; volatile window fields let readers snapshot (start, length) while a committer slides it.
operator= is deleted so m_bit_area is never shallow-copied; callers use
copy_to, parameterized by enum class copy_safety { THREAD_SAFE, THREAD_UNSAFE }.
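To make the offset arithmetic and the range check concrete, here is a small sketch of a sliding bit window with a long-transaction overflow list. `ActiveWindow` and its members are hypothetical and single-threaded; it uses a growable vector instead of the fixed 500-word buffer and does not model trimming, so it shows the lookup idea only, not the real probe from Chapter 4.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using MVCCID = std::uint64_t;
using unit_type = std::uint64_t;
constexpr std::size_t kUnitBits = 64;

struct ActiveWindow   // hypothetical, simplified mvcc_active_tran
{
  std::vector<unit_type> bits;      // set bit = completed, clear bit = active
  MVCCID start = 4;                 // id mapped to bit 0
  std::size_t length = 0;           // live bits
  std::vector<MVCCID> long_trans;   // active ids older than `start`

  // Assumes id >= start.
  void mark_completed (MVCCID id)
  {
    std::size_t off = static_cast<std::size_t> (id - start);
    if (off >= length) { length = off + 1; }
    bits.resize ((length + kUnitBits - 1) / kUnitBits, 0);
    bits[off / kUnitBits] |= unit_type {1} << (off % kUnitBits);
  }

  bool is_active (MVCCID id) const
  {
    if (id < start)
      {
        for (MVCCID lt : long_trans) { if (lt == id) { return true; } }
        return false;                       // left of the window and not a long tran
      }
    std::size_t off = static_cast<std::size_t> (id - start);
    if (off >= length) { return true; }     // range check before touching the bits
    return (bits[off / kUnitBits] & (unit_type {1} << (off % kUnitBits))) == 0;
  }
};
```

An id to the right of the live length is reported active without indexing the buffer, which is the property the invariant above protects.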
1.4 mvcc_snapshot — a transaction’s frozen view
The immutable picture of “who had committed” when a read began.

    // mvcc_snapshot -- src/transaction/mvcc.h
    struct mvcc_snapshot
    {
      MVCCID lowest_active_mvccid;
      MVCCID highest_completed_mvccid;
      mvcc_active_tran m_active_mvccs;
      MVCC_SNAPSHOT_FUNC snapshot_fnc;
      bool valid;
      // ... mvcc_snapshot() ctor + reset() omitted ...
      mvcc_snapshot &operator= (const mvcc_snapshot& snapshot) = delete;
      void copy_to (mvcc_snapshot & dest) const;
    };

| Field | Role / why |
|---|---|
| lowest_active_mvccid | low watermark; id < this is committed-visible without the bits |
| highest_completed_mvccid | high watermark; id > this is invisible without the bits |
| m_active_mvccs | embedded mvcc_active_tran; precise answer between the watermarks |
| snapshot_fnc | swappable predicate; same snapshot serves normal vs dirty reads |
| valid | false = not built / reset on a reused tdes slot, triggers a rebuild |
MVCC_SNAPSHOT_FUNC is a (*)(THREAD_ENTRY *, MVCC_REC_HEADER *, MVCC_SNAPSHOT *) returning MVCC_SATISFIES_SNAPSHOT_RESULT.
Invariant: the watermarks bracket the bit area (lowest_active_mvccid <= highest_completed_mvccid + 1, bit area consulted only between them). The deleted operator= forces copies through copy_to, deep-copying the bits so watermarks and bits stay consistent.
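The three zones created by the two watermarks can be sketched as follows. `SnapshotSketch` is a hypothetical stand-in, and a `std::set` replaces the embedded bit area purely to keep the example short.

```cpp
#include <cassert>
#include <cstdint>
#include <set>

using MVCCID = std::uint64_t;

struct SnapshotSketch   // hypothetical stand-in for mvcc_snapshot
{
  MVCCID lowest_active;
  MVCCID highest_completed;
  std::set<MVCCID> active_between;   // stands in for the embedded bit area

  // True when `id` had not completed at snapshot time, so its effects are invisible.
  bool is_active (MVCCID id) const
  {
    if (id < lowest_active) { return false; }      // below the low watermark: completed
    if (id > highest_completed) { return true; }   // above the high watermark: not completed yet
    return active_between.count (id) != 0;         // in between: precise lookup
  }
};
```

Only ids between the watermarks pay for the precise lookup; the two comparisons settle everything else.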
1.5 mvcc_info — per-transaction MVCC state on the tdes
One per active transaction descriptor (log_tdes).

    // mvcc_info -- src/transaction/mvcc.h
    struct mvcc_info
    {
      MVCC_SNAPSHOT snapshot;
      MVCCID id;
      MVCCID recent_snapshot_lowest_active_mvccid;
      std::vector<MVCCID> sub_ids;
      bool is_sub_active;
      // ... mvcc_info() ctor + init() + reset() omitted ...
      void copy_to (mvcc_info & dest) const;
    };

| Field | Role / why |
|---|---|
| snapshot | current mvcc_snapshot (by value); rebuilt per statement (RC) or held (SR) |
| id | this tran’s MVCCID, MVCCID_NULL until first write; lazily assigned (Ch 3) |
| recent_snapshot_lowest_active_mvccid | cached low watermark; fast “definitely inactive” cutoff so is_active skips the global table |
| sub_ids | sub-transaction ids, one per nested sub (Ch 10) |
| is_sub_active | true while a sub-transaction is open, so visibility also checks sub_ids |
1.6 mvcc_trans_status — one immutable history snapshot of the active set
The global table keeps a ring of versioned status records; each slot is an mvcc_trans_status.

    // mvcc_trans_status -- src/transaction/mvcc_table.hpp
    struct mvcc_trans_status
    {
      using version_type = unsigned int;
      enum event_type { COMMIT, ROLLBACK, SUBTRAN };

      mvcc_active_tran m_active_mvccs;
      MVCCID m_last_completed_mvccid; // just for info
      event_type m_event_type; // just for info
      std::atomic<version_type> m_version;
    };

| Field | Role / why |
|---|---|
| m_active_mvccs | the active-set bits as of this version |
| m_last_completed_mvccid | diagnostic only; last id completed at this version |
| m_event_type | diagnostic only; COMMIT / ROLLBACK / SUBTRAN tag |
| m_version | seqlock linchpin: read, copy bits, re-read; equal-and-even = consistent (Ch 5) |
Invariant: m_version brackets a consistent copy. A committer bumps it before and after mutating the bits; a reader seeing the same version before/after its copy knows no write interleaved (Ch 5). The two diagnostic fields are not part of this contract.
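The read, copy, re-read pattern can be sketched with a generic seqlock. This is an illustration of the protocol as described in this section, not the code of build_mvcc_info: `StatusSketch`, `try_copy` and `publish` are hypothetical names, and the vector copy is only safe here because the sketch is exercised from a single thread.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <vector>

struct StatusSketch   // hypothetical, simplified mvcc_trans_status
{
  std::vector<std::uint64_t> bits;
  std::atomic<unsigned int> version {0};
};

// One reader attempt: succeeds only if the version was even (no write in
// progress) and unchanged across the copy.
inline bool try_copy (const StatusSketch &src, std::vector<std::uint64_t> &dest)
{
  unsigned int before = src.version.load ();
  if ((before & 1u) != 0) { return false; }
  dest = src.bits;
  return before == src.version.load ();
}

// Writer side: bump to odd, mutate, bump back to even.
inline void publish (StatusSketch &s, std::uint64_t word)
{
  s.version.fetch_add (1);
  s.bits.push_back (word);
  s.version.fetch_add (1);
}
```

A caller would loop on `try_copy` until it returns true; the diagnostic fields play no part in that decision.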
1.7 mvcctable — the global coordinator
One instance in log_Gl.mvcc_table; owns id assignment, the status history ring, the per-tran lowest-visible array, and the oldest-visible watermark driving vacuum (lowest_active_mvccid_type is std::atomic<MVCCID>).

    // mvcctable private members -- src/transaction/mvcc_table.hpp
    static const size_t HISTORY_MAX_SIZE = 2048; // must be a power of 2
    static const size_t HISTORY_INDEX_MASK = HISTORY_MAX_SIZE - 1;

    lowest_active_mvccid_type *m_transaction_lowest_visible_mvccids; /* size = NUM_TOTAL_TRAN_INDICES */
    size_t m_transaction_lowest_visible_mvccids_size;
    lowest_active_mvccid_type m_current_status_lowest_active_mvccid;

    mvcc_trans_status m_current_trans_status;
    std::atomic<size_t> m_trans_status_history_position;
    mvcc_trans_status *m_trans_status_history; /* ring of HISTORY_MAX_SIZE */

    std::mutex m_new_mvccid_lock;
    std::mutex m_active_trans_mutex;

    std::atomic<MVCCID> m_oldest_visible;
    std::atomic<size_t> m_ov_lock_count;

| Field | Role / why |
|---|---|
| m_transaction_lowest_visible_mvccids | per-tran atomic array; each tran publishes its oldest needed id, the min feeds oldest-visible (Ch 9) |
| m_transaction_lowest_visible_mvccids_size | min-scan bound; = NUM_TOTAL_TRAN_INDICES |
| m_current_status_lowest_active_mvccid | atomic low watermark of current status; fast snapshot lower bound |
| m_current_trans_status | the live, newest status; mutated under m_active_trans_mutex |
| m_trans_status_history_position | atomic ring cursor at newest slot; masked by HISTORY_INDEX_MASK |
| m_trans_status_history | ring of HISTORY_MAX_SIZE; readers grab a stable past version (Ch 5/8) |
| m_new_mvccid_lock | serializes get_new_mvccid / get_two_new_mvccid for monotonic ids |
| m_active_trans_mutex | serializes current-status mutation + history advance |
| m_oldest_visible | atomic global watermark; vacuum reclaims nothing newer |
| m_ov_lock_count | soft pin; nonzero freezes m_oldest_visible for a stable floor (Ch 9) |
Invariant: HISTORY_MAX_SIZE is a power of two. The ring index pos & HISTORY_INDEX_MASK is a correct modulo only then; otherwise the mask aliases non-adjacent slots and hands readers the wrong past version.
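The equivalence between masking and modulo, and how it breaks for other sizes, can be verified directly. The constant and function names below are hypothetical; only the values 2048 and size - 1 come from the header shown above.

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t kHistoryMaxSize = 2048;
constexpr std::size_t kHistoryIndexMask = kHistoryMaxSize - 1;

// A power of two has exactly one bit set, so n & (n - 1) is zero.
static_assert ((kHistoryMaxSize & (kHistoryMaxSize - 1)) == 0, "ring size must be a power of two");

// Advance the ring cursor with a mask instead of a modulo.
constexpr std::size_t ring_next (std::size_t pos) { return (pos + 1) & kHistoryIndexMask; }

// What goes wrong otherwise: masking with (size - 1) for an arbitrary size.
constexpr std::size_t bad_mask_index (std::size_t pos, std::size_t size) { return pos & (size - 1); }
```

With a size of 6 the mask is binary 101, so position 2 maps to slot 0 instead of slot 2, which is the aliasing the invariant warns about.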
Invariant: m_oldest_visible only moves forward while unpinned (monotonic, frozen whenever m_ov_lock_count > 0). Otherwise vacuum could reclaim a version that a running reader still needs, and that reader would see a missing row (Ch 9).
1.8 The three result enums
Visibility returns a typed enum, not a bool, so callers distinguish “too old” from “too new” (the latter triggers a version-chain walk).

    // mvcc_satisfies_snapshot_result -- src/transaction/mvcc.h
    enum mvcc_satisfies_snapshot_result
    {
      TOO_OLD_FOR_SNAPSHOT,
      SNAPSHOT_SATISFIED,
      TOO_NEW_FOR_SNAPSHOT
    };

TOO_OLD_FOR_SNAPSHOT = dead to me, stop; SNAPSHOT_SATISFIED = read it; TOO_NEW_FOR_SNAPSHOT = born after my snapshot, follow prev_version_lsa.

    // mvcc_satisfies_delete_result -- src/transaction/mvcc.h
    enum mvcc_satisfies_delete_result
    {
      DELETE_RECORD_INSERT_IN_PROGRESS,
      DELETE_RECORD_CAN_DELETE,
      DELETE_RECORD_DELETED,
      DELETE_RECORD_DELETE_IN_PROGRESS,
      DELETE_RECORD_SELF_DELETED
    };

INSERT_IN_PROGRESS = inserter uncommitted, cannot touch; CAN_DELETE = clear; DELETED = another tran committed the delete; DELETE_IN_PROGRESS = a live tran holds it, wait or abort; SELF_DELETED = I deleted it in this tran.

    // mvcc_satisfies_vacuum_result -- src/transaction/mvcc.h
    enum mvcc_satisfies_vacuum_result
    {
      VACUUM_RECORD_REMOVE,
      VACUUM_RECORD_DELETE_INSID_PREV_VER,
      VACUUM_RECORD_CANNOT_VACUUM
    };

REMOVE = reclaim the dead version; DELETE_INSID_PREV_VER = keep row, drop the useless insert id + prev-version link; CANNOT_VACUUM = not reclaimable yet.
These enums are dissected against their producing functions in Chapters 6–7.
1.9 How they point at one another
One global mvcctable in log_Gl, one mvcc_info per tdes, and the on-disk
mvcc_rec_header that visibility compares against the snapshot in that
mvcc_info. Every embedded mvcc_active_tran is an independent value-copy, not
a shared pointer — so a reader holds a stable snapshot while the global status
advances.
graph TD
subgraph Global["log_Gl"]
MT["mvcctable"]
MT --> CUR["m_current_trans_status"]
MT --> RING["m_trans_status_history[2048]"]
MT --> LVA["m_transaction_lowest_visible_mvccids[]"]
MT --> OV["m_oldest_visible + m_ov_lock_count"]
CUR --> CAT["m_active_mvccs (by value)"]
RING --> RAT["m_active_mvccs (by value)"]
end
subgraph Tdes["log_tdes (per transaction)"]
MI["mvcc_info"]
MI --> SNAP["snapshot (mvcc_snapshot)"]
MI --> ID["id"]
MI --> SUB["sub_ids + is_sub_active"]
SNAP --> SAT["m_active_mvccs (by value)"]
SNAP --> WM["lowest/highest watermarks"]
end
subgraph Disk["Heap record (on disk)"]
RH["mvcc_rec_header"]
RH --> INS["mvcc_ins_id / mvcc_del_id"]
RH --> PV["prev_version_lsa"]
end
MT -. "build_mvcc_info copies bits + watermarks" .-> SNAP
SNAP -. "snapshot_fnc compares" .-> RH
PV -. "version chain walk" .-> Disk
Figure 1-1. Solid arrows = containment (by value unless labelled); dashed
arrows = runtime data flows: table seeds a snapshot, snapshot_fnc evaluates a
header, a too-new record walks prev_version_lsa.
1.10 Cross-check notes
- The legacy macro MVCC_IS_REC_DELETED_BY still dereferences a delid_chn.mvcc_del_id union member that is gone from mvcc_rec_header (now separate chn and mvcc_del_id). It is dead code; live reads use MVCC_GET_DELID / MVCC_IS_HEADER_DELID_VALID.
1.11 Chapter summary — key takeaways
- MVCCID is a 64-bit counter with a reserved low band: 0 = MVCCID_NULL and 3 = MVCCID_ALL_VISIBLE are sentinels, and 4 = MVCCID_FIRST is the first live id; MVCCID_FORWARD keeps live ids at or above MVCCID_FIRST.
- mvcc_rec_header is the only on-disk MVCC struct; mvcc_flag:8 selects which of mvcc_ins_id, mvcc_del_id (vs chn), prev_version_lsa apply. DELID and chn are exclusive on disk, distinct in memory.
- mvcc_active_tran is a sliding bit window plus a long-tran overflow array (offset = id - m_bit_area_start_mvccid, window ≤ 500×64 ids, volatile fields for lock-free reads).
- mvcc_snapshot brackets the bit area with two watermarks, takes a swappable snapshot_fnc, and copies only via copy_to.
- mvcc_info is the per-tdes bundle of snapshot, own id, the fast recent_snapshot_lowest_active_mvccid cutoff, and sub-transaction state.
- mvcc_trans_status is a seqlock-versioned active-set snapshot validated by m_version alone; the other two fields are diagnostic.
- mvcctable is the single global coordinator: id minting, a power-of-two history ring, the lowest-visible array, and the monotonic, pinnable m_oldest_visible watermark gating vacuum.
Chapter 2: Initialization and Memory Layout
Reader question: before any transaction runs, how is each MVCC structure allocated, sized, and bootstrapped, and where do the magic sizes come from? Chapter 1 mapped the three owning objects; here we trace every constructor, initialize/finalize, reset*, and size helper. Visibility theory lives in cubrid-mvcc.md.
The MVCCID counter does not live in any of these three structures — it lives in the log header, log_Gl.hdr.mvcc_next_id (log_storage.hpp); the table only reads and forwards it under m_new_mvccid_lock and seeds its own derived start markers from it. This governs 2.6; the structs here hold only derived state.
2.1 The two sizing axes
- Dynamic, per-tran-index: sized to logtb_get_number_of_total_tran_indices () (= log_Gl.trantable.num_total_indices). This covers m_long_tran_mvccids and m_transaction_lowest_visible_mvccids. Re-sizable as the tran table grows.
- Fixed, compile-time: bit area BITAREA_MAX_SIZE = 500 units, ring HISTORY_MAX_SIZE = 2048 slots. These never resize; overflow migrates (bit area to long-tran array) or overwrites (ring wraps), never reallocs.

    // mvcc_active_tran (private) -- src/transaction/mvcc_active_tran.hpp
    using unit_type = std::uint64_t;
    static const size_t BITAREA_MAX_SIZE = 500; // 500 units, fixed
    static const size_t UNIT_BIT_COUNT = sizeof (unit_type) * BYTE_BIT_COUNT; // 64
    static const size_t BITAREA_MAX_MEMSIZE = BITAREA_MAX_SIZE * UNIT_BYTE_COUNT; // 4000 bytes
    static const size_t BITAREA_MAX_BITS = BITAREA_MAX_SIZE * UNIT_BIT_COUNT; // 32000 bits
    static const unit_type ALL_ACTIVE = 0;
    static const unit_type ALL_COMMITTED = (unit_type) -1;

A unit_type is 64 bits, so the bit area is 500 words = 4000 bytes tracking 32000 MVCCIDs. ALL_ACTIVE = 0 is load-bearing: a new[]()-zeroed buffer already means “every slot active,” so init never scrubs. The helpers convert between the three units:

    // mvcc_active_tran helpers -- src/transaction/mvcc_active_tran.cpp
    size_t bit_size_to_unit_size (size_t b) { return (b + UNIT_BIT_COUNT - 1) / UNIT_BIT_COUNT; } // bits -> words, ceil
    size_t units_to_bits (size_t n) { return n * UNIT_BIT_COUNT; } // words -> bits
    size_t units_to_bytes (size_t n) { return n * UNIT_BYTE_COUNT; } // words -> bytes
    size_t get_area_size () const { return bit_size_to_unit_size (m_bit_area_length); } // LIVE words
    size_t get_bit_area_memsize () const { return units_to_bytes (get_area_size ()); } // LIVE bytes

get_area_size() is live words (from m_bit_area_length); BITAREA_MAX_SIZE is allocated words. The buffer is always full-width; reset and copy paths touch only get_bit_area_memsize() bytes, hence trailing words must stay ALL_ACTIVE.
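A few worked values make the difference between live and allocated sizes concrete. The free functions and `k...` constants below are a standalone re-statement of the helpers for illustration; they assume 8-bit bytes and are not the member functions themselves.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

using unit_type = std::uint64_t;
constexpr std::size_t kUnitByteCount = sizeof (unit_type);   // 8
constexpr std::size_t kUnitBitCount = kUnitByteCount * 8;    // 64, assuming 8-bit bytes
constexpr std::size_t kBitareaMaxSize = 500;

// bits -> words, rounding up
constexpr std::size_t bit_size_to_unit_size (std::size_t bits)
{
  return (bits + kUnitBitCount - 1) / kUnitBitCount;
}

// words -> bytes
constexpr std::size_t units_to_bytes (std::size_t units) { return units * kUnitByteCount; }

static_assert (kBitareaMaxSize * kUnitBitCount == 32000, "window capacity in bits");
static_assert (units_to_bytes (kBitareaMaxSize) == 4000, "allocated bytes");
static_assert (bit_size_to_unit_size (0) == 0, "empty window has no live words");
static_assert (bit_size_to_unit_size (64) == 1, "exactly one full word");
static_assert (bit_size_to_unit_size (65) == 2, "one extra bit needs a second word");
```

So a window with 130 live bits spans 3 live words, and a prefix-only memset would touch 24 of the 4000 allocated bytes.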
flowchart LR
subgraph mvcctable["mvcctable (one per log_Gl)"]
A["m_transaction_lowest_visible_mvccids<br/>atomic MVCCID [ total_tran_indices ]"]
B["m_trans_status_history<br/>mvcc_trans_status [ 2048 ]"]
C["m_current_trans_status<br/>mvcc_trans_status"]
end
C --> F["m_active_mvccs : mvcc_active_tran"]
B --> G["[i].m_active_mvccs : mvcc_active_tran"]
F --> H["m_bit_area : unit_type [500]<br/>m_long_tran_mvccids : MVCCID [ total_tran_indices ]"]
G --> H
Figure 2-1. Ownership and sizing axes. The two MVCCID arrays are dynamic-axis; the [500] bit area and [2048] ring are fixed-axis. Each mvcc_trans_status (current + 2048 ring) embeds a full mvcc_active_tran, so a live table holds 2049 bit areas.
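The footprint implied by Figure 2-1 follows from simple multiplication. The sketch below assumes 8-byte MVCCIDs and counts only the two heap arrays per mvcc_active_tran; the names are hypothetical and per-object overhead is ignored.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kBitAreaBytes = 500 * sizeof (std::uint64_t);   // 4000 bytes per bit area
constexpr std::size_t kStatusCount = 2048 + 1;                        // ring slots + current status

// Total bytes held in bit areas by one live table.
constexpr std::size_t bit_area_total_bytes () { return kStatusCount * kBitAreaBytes; }

// Total bytes held in long-tran arrays, which scale with the tran table.
constexpr std::size_t long_tran_total_bytes (std::size_t total_tran_indices)
{
  return kStatusCount * total_tran_indices * sizeof (std::uint64_t);
}

static_assert (bit_area_total_bytes () == 8196000, "about 8.2 MB of bit areas");
```

Under these assumptions the fixed axis costs roughly 8.2 MB regardless of load, while the dynamic axis adds 2049 × 8 bytes per transaction index.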
2.2 mvcc_active_tran: construction, initialize, finalize, reset
The default constructor builds an empty, uninitialized object: no heap, pointers NULL, start marker MVCCID_FIRST. initialize makes the struct’s only two heap allocations, guarded for idempotency:

    // mvcc_active_tran::mvcc_active_tran -- src/transaction/mvcc_active_tran.cpp
    mvcc_active_tran::mvcc_active_tran ()
      : m_bit_area (NULL)
      , m_bit_area_start_mvccid (MVCCID_FIRST) /* <- 4, never 0 */
      , m_bit_area_length (0)
      , m_long_tran_mvccids (NULL)
      , m_long_tran_mvccids_length (0)
      , m_initialized (false)
    {
    }

    // mvcc_active_tran::initialize -- src/transaction/mvcc_active_tran.cpp
    void mvcc_active_tran::initialize ()
    {
      if (m_initialized) { return; } /* <- branch 1: already up, no-op */
      m_bit_area = new unit_type[BITAREA_MAX_SIZE] (); /* <- () zero-inits => ALL_ACTIVE */
      m_bit_area_start_mvccid = MVCCID_FIRST;
      m_bit_area_length = 0;
      m_long_tran_mvccids = new MVCCID[long_tran_max_size ()] (); /* <- sized to tran indices */
      m_long_tran_mvccids_length = 0;
      m_initialized = true;
    }

The trailing () value-initializes both arrays to zero; because ALL_ACTIVE == 0, that is a valid “all active, length 0” state. long_tran_max_size () returns logtb_get_number_of_total_tran_indices (), the ceiling on simultaneously-active “long” transactions (older than m_bit_area_start_mvccid); add_long_transaction asserts m_long_tran_mvccids_length < long_tran_max_size ().
finalize frees both arrays, nulls them, drops the flag — unlike ~mvcc_active_tran, it resets state so the object can be initialized again. reset and reset_active_transactions are not finalize: both keep allocations and only wipe content (snapshot-copy retry path, Chapter 5). They differ: reset memsets only the live prefix (get_bit_area_memsize (), guarded against a zero-byte call) and unseeds the start to MVCCID_NULL; reset_active_transactions memsets the whole BITAREA_MAX_MEMSIZE and keeps the start. The full clear is needed because a failed lock-free copy (version changed mid-memcpy) may have written garbage past the prefix that a prefix-only clear would miss before retry.

    // mvcc_active_tran::finalize -- src/transaction/mvcc_active_tran.cpp
    void mvcc_active_tran::finalize ()
    {
      delete [] m_bit_area;
      m_bit_area = NULL;
      delete [] m_long_tran_mvccids;
      m_long_tran_mvccids = NULL;
      m_initialized = false;
    }

    // mvcc_active_tran::reset -- src/transaction/mvcc_active_tran.cpp
    void mvcc_active_tran::reset ()
    {
      if (!m_initialized) { return; } /* <- branch 1: bare object => no-op */
      if (m_bit_area_length > 0) /* <- branch 2: memset only LIVE prefix */
        {
          std::memset (m_bit_area, 0, get_bit_area_memsize ());
        }
      m_bit_area_length = 0;
      m_bit_area_start_mvccid = MVCCID_NULL; /* <- NULL (0), not MVCCID_FIRST */
      m_long_tran_mvccids_length = 0;
      check_valid ();
    }

    // mvcc_active_tran::reset_active_transactions -- src/transaction/mvcc_active_tran.cpp
    void mvcc_active_tran::reset_active_transactions ()
    {
      std::memset (m_bit_area, 0, BITAREA_MAX_MEMSIZE); /* <- full 4000 bytes */
      m_bit_area_length = 0;
      m_long_tran_mvccids_length = 0;
    }

Invariant (trailing-words-clear). Every word past get_unit_of(m_bit_area_length) + 1 up to m_bit_area + BITAREA_MAX_SIZE equals ALL_ACTIVE (0), and bits past m_bit_area_length in the last partial word are 0. Enforced by check_valid (an #ifndef NDEBUG loop asserting *p_area == ALL_ACTIVE); maintained by initialize, reset, ltrim_area, reset_active_transactions. If violated, compute_lowest_active_mvccid / compute_highest_completed_mvccid read stale set-bits beyond the live length and report a wrong watermark, corrupting visibility.
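Why the retry path needs the full clear can be shown with a small model of the buffer. `AreaSketch` and its members are hypothetical; `trailing_clear` follows the idea of check_valid as described above rather than its actual code.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kWords = 500;
constexpr std::size_t kUnitBits = 64;

struct AreaSketch   // hypothetical stand-in for the bit area plus its live length
{
  std::array<std::uint64_t, kWords> words {};   // zero-initialized: all active
  std::size_t length_bits = 0;                  // assumed <= kWords * kUnitBits

  std::size_t live_words () const { return (length_bits + kUnitBits - 1) / kUnitBits; }

  // Nothing may be set beyond the live length.
  bool trailing_clear () const
  {
    std::size_t full = length_bits / kUnitBits;
    std::size_t rem = length_bits % kUnitBits;
    std::size_t first_trailing = full;
    if (rem != 0)
      {
        if ((words[full] >> rem) != 0) { return false; }   // stray bits in the partial word
        first_trailing = full + 1;
      }
    return std::all_of (words.begin () + first_trailing, words.end (),
                        [] (std::uint64_t w) { return w == 0; });
  }

  // Prefix-only clear, in the spirit of reset.
  void clear_prefix ()
  {
    std::fill_n (words.begin (), live_words (), std::uint64_t {0});
    length_bits = 0;
  }

  // Full clear, in the spirit of reset_active_transactions.
  void clear_all () { words.fill (0); length_bits = 0; }
};
```

If a torn copy leaves a set bit in a word beyond the live prefix, `clear_prefix` leaves it in place and the invariant stays broken, while `clear_all` restores it.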
Struct: mvcc_active_tran — every field
| Field | Role | Why it exists |
|---|---|---|
| m_bit_area | Window of BITAREA_MAX_SIZE words; bit 0 = active, 1 = committed | One bit per recent MVCCID, 64/word |
| m_bit_area_start_mvccid (volatile MVCCID) | MVCCID mapped to bit offset 0 | Anchors the window |
| m_bit_area_length (volatile size_t) | Bits in use (not bytes, not alloc size) | Bounds scans/memsets to live prefix |
| m_long_tran_mvccids | Ascending array of active MVCCIDs below the window | Overflow store for older transactions |
| m_long_tran_mvccids_length (volatile size_t) | Live entries in the long-tran array | Bounds scans; asserted < long_tran_max_size() |
| m_initialized | Has initialize run (and not finalized) | Idempotent init, safe-no-op reset |
volatile on three fields reflects lock-free reads of the active set mutated under m_active_trans_mutex; the real fence is the version recheck in build_mvcc_info (Chapter 5).
2.3 mvcc_trans_status: the version-tagged wrapper
A thin envelope: one mvcc_active_tran plus bookkeeping and the atomic version counter readers spin on. The ctor sets version 0 and neutral info fields; initialize delegates down and re-stamps version 0 (it may run on a recycled object whose version was bumped); finalize mirrors it.

    // mvcc_trans_status -- src/transaction/mvcc_table.hpp
    struct mvcc_trans_status
    {
      using version_type = unsigned int;
      enum event_type { COMMIT, ROLLBACK, SUBTRAN };
      mvcc_active_tran m_active_mvccs;
      MVCCID m_last_completed_mvccid; // just for info
      event_type m_event_type; // just for info
      std::atomic<version_type> m_version;
    };

    // ctor / initialize / finalize -- src/transaction/mvcc_table.cpp
    mvcc_trans_status::mvcc_trans_status ()
      : m_active_mvccs ()
      , m_last_completed_mvccid (MVCCID_NULL)
      , m_event_type (COMMIT)
      , m_version (0)
    {
    }
    void mvcc_trans_status::initialize () { m_active_mvccs.initialize (); m_version = 0; }
    void mvcc_trans_status::finalize () { m_active_mvccs.finalize (); }

Struct: mvcc_trans_status — every field

| Field | Role | Why it exists |
|---|---|---|
| m_active_mvccs | The active-transaction bitmap for this snapshot | The payload; rest is metadata |
| m_last_completed_mvccid | Last MVCCID committed/rolled back here | Info/debug; visibility ignores it |
| m_event_type | COMMIT/ROLLBACK/SUBTRAN | Info; debugger trace of ring advances |
| m_version (atomic) | Monotonic stamp bumped on every status change | Lock-free guard: re-read unchanged = consistent copy |
2.4 mvcctable::initialize: the 2048-slot ring and the per-tran array
mvcctable is the single owning object (log_Gl.mvcc_table). Its constructor wires every member to a benign default but allocates nothing (markers MVCCID_FIRST/MVCCID_NULL, pointers NULL, sizes/counts 0). initialize (called once at boot from logtb_define_trantable_log_latch) does the allocation:

    // mvcctable::initialize -- src/transaction/mvcc_table.cpp
    void mvcctable::initialize ()
    {
      m_current_trans_status.initialize (); /* <- 1: seed the live status */
      m_trans_status_history = new mvcc_trans_status[HISTORY_MAX_SIZE]; /* <- 2: 2048 slots, ctors only */
      for (size_t idx = 0; idx < HISTORY_MAX_SIZE; idx++)
        {
          m_trans_status_history[idx].initialize (); /* <- 3: each slot allocs its bit area */
        }
      m_trans_status_history_position = 0; /* <- 4: ring head at slot 0 */
      m_current_status_lowest_active_mvccid = MVCCID_FIRST; /* <- 5: nothing older than 4 active */
      alloc_transaction_lowest_active (); /* <- 6: per-tran array */
    }

Step ordering is deliberate: the bare new[2048] only runs constructors (each embedded active set NULL); the per-slot initialize does the heap work. So a live table holds 2049 mvcc_active_tran instances (current + 2048 ring), each a 4000-byte buffer. HISTORY_MAX_SIZE = 2048 is a power of two so the ring wraps with & HISTORY_INDEX_MASK (= 2047), not a modulo.
Invariant (history-position-in-range). m_trans_status_history_position < HISTORY_MAX_SIZE always. Enforced at init (0) and on advance by (pos + 1) & HISTORY_INDEX_MASK (masks into [0, 2047]); build_mvcc_info and is_active assert (index < HISTORY_MAX_SIZE). If violated, readers index past the ring and read unrelated memory as an active set.
finalize tears down in reverse, zeroing the per-tran size so a later alloc re-allocates. It does not loop the ring slots: delete [] m_trans_status_history runs each ~mvcc_trans_status then ~mvcc_active_tran, freeing every slot’s buffers; the current status’s bit area is freed explicitly via m_current_trans_status.finalize ().

    // mvcctable::finalize -- src/transaction/mvcc_table.cpp
    void mvcctable::finalize ()
    {
      m_current_trans_status.finalize ();
      delete [] m_trans_status_history;
      m_trans_status_history = NULL;
      delete [] m_transaction_lowest_visible_mvccids;
      m_transaction_lowest_visible_mvccids = NULL;
      m_transaction_lowest_visible_mvccids_size = 0; /* <- forces re-alloc on next init */
    }

flowchart TD
S(["mvcctable::initialize"]) --> A["m_current_trans_status.initialize()<br/>=> allocs current bit area + long-tran array"]
A --> B["new mvcc_trans_status[2048]<br/>=> 2048 default ctors, arrays still NULL"]
B --> C{"loop idx 0..2047"}
C -->|each| D["history[idx].initialize()<br/>=> allocs that slot's bit area + long-tran array"]
C -->|done| E["history_position = 0"]
E --> F["current_status_lowest_active = MVCCID_FIRST"]
F --> G(["alloc_transaction_lowest_active()"])
Figure 2-2. mvcctable::initialize control flow.
2.5 alloc_transaction_lowest_active: size-change detection
The one allocation that can run more than once (at boot and on every transaction-table resize), written as a re-alloc-on-resize check:

    // mvcctable::alloc_transaction_lowest_active -- src/transaction/mvcc_table.cpp
    void mvcctable::alloc_transaction_lowest_active ()
    {
      if (m_transaction_lowest_visible_mvccids_size != (size_t) logtb_get_number_of_total_tran_indices ())
        {
          // first time or tran table resized
          delete [] m_transaction_lowest_visible_mvccids; /* <- delete NULL is fine first time */
          m_transaction_lowest_visible_mvccids_size = logtb_get_number_of_total_tran_indices ();
          m_transaction_lowest_visible_mvccids =
            new lowest_active_mvccid_type[m_transaction_lowest_visible_mvccids_size] (); // all 0 = MVCCID_NULL
        }
    }

Two branches: (1) size matches: body skipped (steady-state when reached redundantly, e.g. logtb_expand_trantable re-invoked with no net change); (2) size differs (first call has stored size 0, or a genuine resize): old array freed (delete [] NULL is legal), size updated, new lowest_active_mvccid_type (= std::atomic<MVCCID>) array value-initialized so every element reads MVCCID_NULL (0) = “no snapshot lower bound.” The hook is wired in logtb_expand_trantable, so the array tracks growth without re-running initialize.
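The size-change guard can be sketched with a smart pointer in place of the raw delete/new pair. `LowestVisibleArray` and `ensure_size` are hypothetical names; the boolean return value exists only so the two branches can be observed.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>

using MVCCID = std::uint64_t;

struct LowestVisibleArray   // hypothetical stand-in for the per-tran array and its size field
{
  std::unique_ptr<std::atomic<MVCCID>[]> slots;
  std::size_t size = 0;

  // Returns true when it (re)allocated, false when the size already matched.
  bool ensure_size (std::size_t total_tran_indices)
  {
    if (size == total_tran_indices) { return false; }
    // The trailing () value-initializes every slot to 0, i.e. MVCCID_NULL.
    slots.reset (new std::atomic<MVCCID>[total_tran_indices] ());
    size = total_tran_indices;
    return true;
  }
};
```

As in the original, a resize starts from a fresh zeroed array rather than copying old entries, so every slot reads as "no snapshot lower bound" until its transaction publishes one.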
Invariant (per-tran-array sized to live tran count). m_transaction_lowest_visible_mvccids_size == logtb_get_number_of_total_tran_indices () whenever the array is used. Enforced by this guard each time the table could grow. If violated (array too small), build_mvcc_info / complete_mvcc indexing [tdes.tran_index] writes out of bounds; build_mvcc_info defends with assert (tdes.tran_index < logtb_get_number_of_total_tran_indices ()).
Struct: mvcctable — every field
| Field | Role | Why it exists |
|---|---|---|
| m_transaction_lowest_visible_mvccids | One atomic<MVCCID> per tran index: lowest visible MVCCID | Source for global oldest-visible; dynamic axis |
| m_transaction_lowest_visible_mvccids_size | Allocated length of that array | Resize detection in alloc_transaction_lowest_active |
| m_current_status_lowest_active_mvccid (atomic) | Lowest active MVCCID for the current status | Fast watermark; seeds snapshots |
| m_current_trans_status | Single live, mutable status (under mutex) | Write target; published into the ring |
| m_trans_status_history_position (atomic) | Index of most-recently-published ring slot | Lock-free reader entry point |
| m_trans_status_history | 2048-slot ring of published read-only snapshots | Stable past status without blocking writers |
| m_new_mvccid_lock (mutex) | Guards read-and-forward of log_Gl.hdr.mvcc_next_id | Serializes MVCCID issuance |
| m_active_trans_mutex (mutex) | Guards current-status mutation + ring publish | One completion mutates status at a time |
| m_oldest_visible (atomic) | Cached global oldest-visible for vacuum | Avoids re-scanning per query |
| m_ov_lock_count (atomic) | Holders pinning m_oldest_visible | Vacuum freezes the watermark (Chapter 9) |
2.6 Boot/restart seeding: reset_start_mvccid
initialize seeds markers to MVCCID_FIRST, but a real boot/restart has already issued MVCCIDs, so markers must be re-seeded from the persisted counter (introduced above):

    // mvcctable::reset_start_mvccid -- src/transaction/mvcc_table.cpp
    void mvcctable::reset_start_mvccid () // not thread safe (header comment)
    {
      m_current_trans_status.m_active_mvccs.reset_start_mvccid (log_Gl.hdr.mvcc_next_id);
      assert (m_trans_status_history_position < HISTORY_MAX_SIZE);
      m_trans_status_history[m_trans_status_history_position]
        .m_active_mvccs.reset_start_mvccid (log_Gl.hdr.mvcc_next_id); /* <- only the CURRENT ring slot */
      m_current_status_lowest_active_mvccid.store (log_Gl.hdr.mvcc_next_id);
    }

    // mvcc_active_tran::reset_start_mvccid -- src/transaction/mvcc_active_tran.cpp
    void mvcc_active_tran::reset_start_mvccid (MVCCID mvccid)
    {
      m_bit_area_start_mvccid = mvccid;
      if (m_initialized) { check_valid (); }
    }

It re-points three things at the recovered value: the current status’s active set, the currently-published ring slot’s active set, and the global lowest-active. Only the slot at m_trans_status_history_position is touched, because it alone is live after restart; the other 2047 are stale-but-zeroed and re-published as the ring advances. The only code that advances log_Gl.hdr.mvcc_next_id is get_new_mvccid / get_two_new_mvccid (MVCCID_FORWARD under m_new_mvccid_lock). reset_start_mvccid runs from log_initialize_internal and three points in log_recovery.c (after analysis, after redo, finish), each right after the header is rebuilt; “not thread safe” is fine because these precede concurrent transactions.
```mermaid
stateDiagram-v2
    [*] --> Constructed : mvcctable ctor \n markers MVCCID_FIRST, arrays NULL
    Constructed --> Initialized : initialize \n allocs ring + per-tran array, seeds MVCCID_FIRST
    Initialized --> Reseeded : reset_start_mvccid \n markers to log_Gl.hdr.mvcc_next_id
    Reseeded --> Reseeded : recovery re-runs reset_start_mvccid
    Reseeded --> Serving : recovery done, accept transactions
    Serving --> Finalized : finalize \n free ring, free per-tran array
    Finalized --> [*]
```
Figure 2-3. mvcctable lifecycle: initialize once; reset_start_mvccid once per recovery phase; alloc_transaction_lowest_active re-runs on resize.
2.7 Chapter summary — key takeaways
- The MVCCID counter is not in the table: it lives in `log_Gl.hdr.mvcc_next_id`, advanced only by `get_new_mvccid`/`get_two_new_mvccid` under `m_new_mvccid_lock`; the table holds derived markers re-synced via `reset_start_mvccid`.
- Two sizing axes: `m_long_tran_mvccids` and `m_transaction_lowest_visible_mvccids` are sized to `logtb_get_number_of_total_tran_indices ()` (dynamic); the bit area (500 words = 32000 bits) and the ring (2048) are fixed, absorbing overflow by migration/wrap.
- `ALL_ACTIVE == 0` makes zero-init meaningful: every `new[]()` already means "all active, length 0," so `initialize` needs no scrub and the trailing-words-clear invariant holds for free.
- A live table holds 2049 bit areas: current status plus 2048 ring slots; the bare `new[2048]` only runs ctors, the per-slot `initialize` does the heap work.
- `initialize`/`reset`/`reset_active_transactions`/`finalize` are distinct: allocate; zero the live prefix and unseed to `MVCCID_NULL`; zero the full buffer for the failed-copy retry; free and de-initialize.
- `alloc_transaction_lowest_active` is the only re-runnable allocation: its size-change guard reallocates only when `total_tran_indices` differs (slots value-init to `MVCCID_NULL`); wired into `logtb_expand_trantable`.
- `reset_start_mvccid` is the boot/restart seam: it re-points the current status, the current ring slot, and the global lowest-active at the recovered counter; single-threaded recovery only.
Chapter 3: MVCCID Birth and the On-Record Header
This chapter answers: when does a version acquire its MVCCID, and how is
that stamp serialized onto a heap record so records that never needed delete
or prev-version metadata cost zero extra bytes? The companion
(cubrid-mvcc.md, MVCCID assignment policy / Per-record header)
establishes why CUBRID uses lazy issuance and a flag-driven header; here we
trace every branch. The lifecycle splits into birth (a 64-bit MVCCID
minted lazily, once per writing transaction, under a dedicated lock) and
serialization (that MVCCID plus optional delete-id and prev-version
pointer encoded into a variable-length header whose length is selected by a
5-bit flag field in the high byte of the rep word).
3.1 Birth: lazy issuance through curr_mvcc_info->id
A read-only transaction never writes, so it never needs an MVCCID. The
transaction descriptor (LOG_TDES) carries an mvccinfo.id that is
MVCCID_NULL (= 0) until the first write demands a stamp. The gate is
logtb_get_current_mvccid:
```cpp
// logtb_get_current_mvccid -- src/transaction/log_tran_table.c
if (MVCCID_IS_VALID (curr_mvcc_info->id) == false)   /* <- mint UNCONDITIONALLY, first write only */
  curr_mvcc_info->id = log_Gl.mvcc_table.get_new_mvccid ();
if (!tdes->mvccinfo.sub_ids.empty ())
  return tdes->mvccinfo.sub_ids.back ();   /* <- sub shadows parent; parent already minted */
return curr_mvcc_info->id;
```

Order is load-bearing: the mint/validity test runs first and
unconditionally, so even when a sub-transaction is open the parent id is
minted (or confirmed valid) before `sub_ids` returns the sub-id. Branches:
(1) id valid -> no allocation; the same MVCCID serves every row (atomic
visibility); (2) `sub_ids` non-empty -> the sub-id (Chapter 10); (3) otherwise
the parent id.
Invariant (one stamp per writing transaction). `curr_mvcc_info->id` transitions `MVCCID_NULL -> <normal id>` exactly once, only on a write, enforced by the `MVCCID_IS_VALID` guard and reset to `MVCCID_NULL` only at transaction end (`logtb_complete_mvcc`/`reset()`). Re-minting mid-transaction would give two rows of the same transaction different stamps, so a reader could see a torn write.
A non-lazy entry exists for crash recovery:
logtb_rv_assign_mvccid_for_undo_recovery forces tdes->mvccinfo.id = mvccid straight from a log record — at recovery the id is already known and
restored verbatim, not re-minted. The “valid”/“all-visible” contract comes
from storage_common.h: MVCCID_NULL is 0, MVCCID_ALL_VISIBLE is the
literal 3, MVCCID_FIRST is 4, so a normal id is always >= 4. These
reserved low values let the header overload “no insert id” to mean “visible
to everyone” (§3.5).
```mermaid
flowchart TD
    A["needs current MVCCID"] --> D{"id valid?"}
    D -- "yes" --> S{"sub_ids empty?"}
    D -- "no, first write" --> F["get_new_mvccid(): id = mvcc_next_id; FORWARD"]
    F --> S
    S -- "no" --> C["return sub_ids.back()"]
    S -- "yes" --> E["return curr_mvcc_info->id"]
```
Figure 3-1: lazy issuance in logtb_get_current_mvccid; mint/validity test
precedes the sub_ids test, matching source order.
3.2 The allocator: get_new_mvccid and get_two_new_mvccid
The counter lives in the global log header (log_Gl.hdr.mvcc_next_id),
bumped under a dedicated lock with a tiny critical section:
```cpp
// mvcctable::get_new_mvccid -- src/transaction/mvcc_table.cpp
m_new_mvccid_lock.lock ();
id = log_Gl.hdr.mvcc_next_id;
MVCCID_FORWARD (log_Gl.hdr.mvcc_next_id);   /* <- ++, skipping reserved 0..3 */
m_new_mvccid_lock.unlock ();
```

`MVCCID_FORWARD` is `(id)++` with a wrap-guard: if the post-increment lands
below `MVCCID_FIRST` (4), which happens only at the 64-bit unsigned wrap, it
snaps back to 4, so 0..3 are never handed out live.
Invariant (monotonic, gap-tolerant issuance). Under `m_new_mvccid_lock` each caller reads then forwards, so every id is strictly greater than the last. Gaps are expected and harmless: a transaction may mint an id and roll back, and visibility depends on ordering, not contiguity. Dropping the lock would let two threads read the same value and share a stamp, breaking §3.1.
get_two_new_mvccid handles the parent+sub case where a transaction’s first
write happens inside a sub-transaction. It forwards twice in one lock
acquisition (first to the parent, second to the sub). The single caller,
logtb_get_new_subtransaction_mvccid, mints just one (get_new_mvccid) if
the parent already has a valid id; only when the parent is still MVCCID_NULL
does it call get_two_new_mvccid, so the parent always gets the smaller of
the pair — it must sort before its own sub-transaction in the active set
(Chapter 10).
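The issuance rules of §3.1 and §3.2 can be modeled in a few lines. The following is a minimal sketch, not the CUBRID implementation: `mvccid_allocator_sketch` and `current_mvccid_sketch` are hypothetical names, the counter is a plain member instead of `log_Gl.hdr.mvcc_next_id`, and `std::mutex` is assumed as the lock type.

```cpp
#include <cassert>
#include <cstdint>
#include <mutex>
#include <utility>

using MVCCID = std::uint64_t;
constexpr MVCCID MVCCID_NULL = 0;
constexpr MVCCID MVCCID_FIRST = 4;

// Hypothetical stand-in for the allocator; the real counter lives in the log header.
class mvccid_allocator_sketch
{
public:
  explicit mvccid_allocator_sketch (MVCCID next) : m_next (next) {}

  // Read the current value, then forward the counter, all under the lock.
  MVCCID get_new ()
  {
    std::lock_guard<std::mutex> guard (m_lock);
    MVCCID id = m_next;
    forward (m_next);
    return id;
  }

  // Returns {parent, sub}: two ids minted in one lock acquisition, parent first.
  std::pair<MVCCID, MVCCID> get_two_new ()
  {
    std::lock_guard<std::mutex> guard (m_lock);
    MVCCID parent = m_next;
    forward (m_next);
    MVCCID sub = m_next;
    forward (m_next);
    return {parent, sub};
  }

private:
  // Same shape as MVCCID_FORWARD: increment, then skip the reserved band.
  static void forward (MVCCID &id)
  {
    ++id;
    if (id < MVCCID_FIRST)
      id = MVCCID_FIRST;
  }

  std::mutex m_lock;
  MVCCID m_next;
};

// Lazy issuance as described in 3.1: mint only while the id is still MVCCID_NULL.
inline MVCCID current_mvccid_sketch (MVCCID &tran_id, mvccid_allocator_sketch &alloc)
{
  if (tran_id == MVCCID_NULL)
    tran_id = alloc.get_new ();
  return tran_id;
}
```

Calling `current_mvccid_sketch` twice on the same transaction id returns the same stamp, and `get_two_new` always hands the smaller id of the pair to the parent, which is the ordering property the text relies on.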
3.3 The struct: mvcc_rec_header
or_mvcc_get_header deserializes a record into a MVCC_REC_HEADER. It is
wider than the on-disk form: all fields always exist in RAM, only the flagged
ones are written back.
```cpp
// mvcc_rec_header -- src/transaction/mvcc.h
struct mvcc_rec_header
{
  INT32 mvcc_flag:8;            /* MVCC flags */
  INT32 repid:24;               /* representation id */
  int chn;                      /* cache coherency number */
  MVCCID mvcc_ins_id;           /* MVCC insert id */
  MVCCID mvcc_del_id;           /* MVCC delete id */
  LOG_LSA prev_version_lsa;     /* log address of previous version */
};
```

| Field | Role | Why it exists |
|---|---|---|
| `mvcc_flag:8` | Low 5 bits (`OR_MVCC_FLAG_MASK = 0x1f`) say which optional members are present on disk | The flag is the schema for the variable-length encoding; it decides header length and every branch in §3.4 |
| `repid:24` | Representation id of the row's schema version | Packed into the same 32-bit word (flags top byte, repid low 24); one `OR_GET_INT` recovers both |
| `chn` | Cache coherency number, bumped on non-MVCC updates. Role-shared with delete-id: `VALID_DELID` clear (live) -> slot is a real CHN, `mvcc_del_id` in-RAM `MVCCID_NULL`; set -> slot superseded, real deleter MVCCID 8 bytes later | A record is CHN-bearing or DELID-bearing in meaning, never both at once; see the `VALID_DELID` invariant in §3.4 |
| `mvcc_ins_id` | MVCCID of the inserting transaction | Birth stamp; compared against a reader's snapshot for insert-visibility |
| `mvcc_del_id` | MVCCID of the deleting/updating transaction | Present only on delete/in-place-update; common-case absence saves 8 bytes |
| `prev_version_lsa` | Log LSA of the previous version | Walks the version chain backward; present only on updated rows |
```mermaid
classDiagram
    class mvcc_rec_header {
        +INT32 mvcc_flag : 8
        +INT32 repid : 24
        +int chn
        +MVCCID mvcc_ins_id
        +MVCCID mvcc_del_id
        +LOG_LSA prev_version_lsa
    }
    class OR_MVCC_FLAGS {
        VALID_INSID 0x01
        VALID_DELID 0x02
        VALID_PREV_VERSION 0x04
        MASK 0x1f
    }
    mvcc_rec_header --> OR_MVCC_FLAGS : low 5 bits select on-disk fields
```
Figure 3-2: the in-memory header and the flags that govern its on-disk projection.
3.4 The on-disk layout and its offset arithmetic
The first 4-byte word (OR_REP_OFFSET, OR_MVCC_REP_SIZE = 4) packs repid in
its low 24 bits and the flags in its high byte, shifted by
OR_MVCC_FLAG_SHIFT_BITS = 24. The CHN word (OR_CHN_SIZE = 4) always
follows. Everything after is conditional and cumulative — each optional
offset adds the size of every earlier present field:
```c
// offset macros -- src/base/object_representation.h
#define OR_MVCC_INSERT_ID_OFFSET (OR_CHN_OFFSET + OR_CHN_SIZE)   /* = 8 */
#define OR_MVCC_DELETE_ID_OFFSET(f) \
  (OR_MVCC_INSERT_ID_OFFSET + (((f) & OR_MVCC_FLAG_VALID_INSID) ? OR_MVCC_INSERT_ID_SIZE : 0))
#define OR_MVCC_PREV_VERSION_LSA_OFFSET(f) \
  (OR_MVCC_DELETE_ID_OFFSET(f) + (((f) & OR_MVCC_FLAG_VALID_DELID) ? OR_MVCC_DELETE_ID_SIZE : 0))
```

A record with only VALID_INSID puts nothing where delete-id would go. The
flag-to-length lookup is mvcc_header_size_lookup[8] (indexed by the 3 active
bits): flag 000 -> 8, 001/010 -> 16, 011 -> 24, … 111 -> 32,
adding OR_MVCCID_SIZE per id flag and OR_MVCC_PREV_VERSION_LSA_SIZE for
the prev-version bit. Endpoints are named OR_MVCC_MIN_HEADER_SIZE = 8 (no
optional fields) and OR_MVCC_MAX_HEADER_SIZE = 32 (all three). A live,
never-updated row carries a 16-byte header (rep + CHN + insid); the delete-id
and prev-version slots simply do not exist on the page. This is the “unused
slots cost zero bytes” property — it falls out of the conditional offset
macros plus the size table.
Invariant (flag bits and physical length agree). `mvcc_header_size_lookup[flag]` must equal the bytes the `or_mvcc_set_*` sequence writes. `or_mvcc_set_header` enforces it: it compares `old_mvcc_size` vs `new_mvcc_size` and, if they differ, calls `HEAP_MOVE_INSIDE_RECORD` to grow/shrink the slot region so payload bytes are not overwritten. A setter writing a field whose flag was clear (or vice-versa) would misalign every later offset and parse the body as garbage.
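The cumulative offset rule and the size table can be checked against each other with a small model. This is a sketch under the sizes quoted in this section (4-byte rep word, 4-byte CHN, 8-byte ids, 8-byte prev-version LSA); the `*_sketch` function names are illustrative and do not exist in CUBRID.

```cpp
#include <cassert>
#include <cstddef>

// Flag values as listed for OR_MVCC_FLAG_* in this chapter.
constexpr unsigned FLAG_VALID_INSID = 0x01;
constexpr unsigned FLAG_VALID_DELID = 0x02;
constexpr unsigned FLAG_VALID_PREV_VERSION = 0x04;

constexpr std::size_t REP_SIZE = 4;        // OR_MVCC_REP_SIZE
constexpr std::size_t CHN_SIZE = 4;        // OR_CHN_SIZE
constexpr std::size_t MVCCID_SIZE = 8;     // OR_MVCCID_SIZE
constexpr std::size_t PREV_LSA_SIZE = 8;   // OR_MVCC_PREV_VERSION_LSA_SIZE

// Fixed part: rep word plus CHN word.
constexpr std::size_t insert_id_offset_sketch ()
{
  return REP_SIZE + CHN_SIZE;
}

// Each optional offset adds the size of every earlier field that is present.
constexpr std::size_t delete_id_offset_sketch (unsigned f)
{
  return insert_id_offset_sketch () + ((f & FLAG_VALID_INSID) ? MVCCID_SIZE : 0);
}

constexpr std::size_t prev_version_offset_sketch (unsigned f)
{
  return delete_id_offset_sketch (f) + ((f & FLAG_VALID_DELID) ? MVCCID_SIZE : 0);
}

// Total header length: offset of the last optional field plus its size if present.
constexpr std::size_t header_size_sketch (unsigned f)
{
  return prev_version_offset_sketch (f) + ((f & FLAG_VALID_PREV_VERSION) ? PREV_LSA_SIZE : 0);
}
```

Evaluating `header_size_sketch` for flags 0 through 7 reproduces the endpoints named above (8 and 32) and the 16-byte header of a live, never-updated row; the intermediate values follow from the same per-flag additions.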
3.4.1 or_mvcc_get_header — branch by branch
Deserialization reads in fixed field order, each step consulting the flag:
repid and mvcc_flag are unpacked from the rep word, then or_mvcc_get_chn
(always), or_mvcc_get_insid, or_mvcc_get_delid, and
or_mvcc_get_prev_version_lsa (each flag-gated), with if (rc != NO_ERROR) goto exit_on_error; after each — exit_on_error returns the code, falling
back to er_errid() / ER_FAILED. The helpers share one governing property:
a flag-gated getter that finds its flag clear returns a sentinel and leaves
buf->ptr exactly where it was, so the next read stays aligned and the
cumulative offsets remain self-consistent without the reader touching the
offset macros. Only the sentinel differs per field:
- `or_mvcc_get_insid`: flag clear -> returns `MVCCID_ALL_VISIBLE`; set -> reads a BIGINT, advances `OR_MVCCID_SIZE`.

  ```c
  // or_mvcc_get_insid -- src/base/object_representation_sr.c
  if (!(mvcc_flags & OR_MVCC_FLAG_VALID_INSID))
    return MVCCID_ALL_VISIBLE;   /* <- ptr NOT advanced */
  // ... reads BIGINT, buf->ptr += OR_MVCCID_SIZE ...
  ```

- `or_mvcc_get_delid`: clear -> returns `MVCCID_NULL`; set -> reads-and-advances.
- `or_mvcc_get_chn`: unconditional; CHN has no flag (the slot always exists); reads `OR_INT_SIZE`, advances. Whether the slot is a CHN or a displaced delete id is decided by `VALID_DELID` (§3.3).
- `or_mvcc_get_prev_version_lsa`: clear -> `LSA_SET_NULL`; set -> copies 8 bytes via struct assignment, advances.
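The "sentinel without advancing" property of the flag-gated getters can be shown on a raw byte buffer. The sketch below is illustrative only: `cursor_sketch`, `read_u64`, and the `get_*_sketch` functions are hypothetical, and values are read in host byte order, which is an assumption (the real `OR_*` accessors handle the on-disk byte order).

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

using MVCCID = std::uint64_t;
constexpr MVCCID MVCCID_NULL = 0;
constexpr MVCCID MVCCID_ALL_VISIBLE = 3;
constexpr unsigned FLAG_VALID_INSID = 0x01;
constexpr unsigned FLAG_VALID_DELID = 0x02;

// Hypothetical read cursor standing in for buf->ptr.
struct cursor_sketch
{
  const unsigned char *ptr;
};

// Reads 8 bytes in host byte order (assumption) and advances the cursor.
inline MVCCID read_u64 (cursor_sketch &c)
{
  MVCCID v;
  std::memcpy (&v, c.ptr, sizeof v);
  c.ptr += sizeof v;
  return v;
}

// Flag clear: return the sentinel and leave the cursor where it was.
inline MVCCID get_insid_sketch (cursor_sketch &c, unsigned flags)
{
  if (!(flags & FLAG_VALID_INSID))
    return MVCCID_ALL_VISIBLE;
  return read_u64 (c);
}

inline MVCCID get_delid_sketch (cursor_sketch &c, unsigned flags)
{
  if (!(flags & FLAG_VALID_DELID))
    return MVCCID_NULL;
  return read_u64 (c);
}
```

With only `FLAG_VALID_DELID` set, the insid getter returns the sentinel without moving the cursor, so the delid getter that follows reads the correct 8 bytes. That is the alignment guarantee the list above describes.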
3.4.2 or_mvcc_set_header, or_mvcc_add_header, and the setters
or_mvcc_set_header rewrites an existing header (the resize-aware path
above); or_mvcc_add_header prepends one to a fresh record (asserts
record->length == 0, sets record->length to bytes written). Both run the
same sequence — set_repid_and_flags -> set_chn -> set_insid -> set_delid -> set_prev_version_lsa — each short-circuiting on its flag: or_mvcc_set_insid
returns NO_ERROR writing nothing when VALID_INSID is clear, else
or_put_bigint; or_mvcc_set_delid and or_mvcc_set_prev_version_lsa mirror
it; or_mvcc_set_chn is unconditional. The only structural difference: set
first calls HEAP_MOVE_INSIDE_RECORD to reconcile size; add appends to a
zero-length record.
or_mvcc_get_flag / or_mvcc_set_flag are narrow accessors used when only
the flag byte must change:
```c
// or_mvcc_set_flag -- src/base/object_representation_sr.c
repid_and_flag = OR_GET_INT (record->data + OR_REP_OFFSET);
repid_and_flag &= ~OR_MVCC_FLAG_MASK;   /* <- clears LOW 5 bits 0x1f */
repid_and_flag += ((flags & OR_MVCC_FLAG_MASK) << OR_MVCC_FLAG_SHIFT_BITS);   /* <- ADD into bits 24+ */
// ... or_put_int writes it back at OR_REP_OFFSET ...
```

Note the quirk: the mask target (low 5 bits) is not where the flags are
combined (bits 24+), and the combine is +=, not bitwise OR — it works only
because the high flag region was already cleared by whatever last wrote the
word. or_mvcc_set_flag does not resize the record, so callers must
separately ensure the physical layout matches the new flag — a sharp edge
versus set_header.
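The difference between `+=` and a bitwise OR is easiest to see on concrete words. The sketch below models only the three statements as quoted above on a plain 32-bit integer; it is not compiled from the CUBRID tree, and `set_flag_clear_then_or` is a hypothetical alternative included purely for comparison.

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint32_t FLAG_MASK = 0x1f;
constexpr unsigned FLAG_SHIFT_BITS = 24;

// Mirrors the quoted statements: mask the low 5 bits, then add the shifted flags.
inline std::uint32_t set_flag_as_quoted (std::uint32_t word, std::uint32_t flags)
{
  word &= ~FLAG_MASK;                               // clears bits 0..4 of the word
  word += (flags & FLAG_MASK) << FLAG_SHIFT_BITS;   // arithmetic add into bits 24+
  return word;
}

// Hypothetical comparison: clear the shifted flag region, then OR the new flags in.
inline std::uint32_t set_flag_clear_then_or (std::uint32_t word, std::uint32_t flags)
{
  word &= ~(FLAG_MASK << FLAG_SHIFT_BITS);
  word |= (flags & FLAG_MASK) << FLAG_SHIFT_BITS;
  return word;
}
```

When the flag region starts out clear, both functions agree. When a flag bit is already set, the add carries into the next bit, so setting `0x01` on a word that already holds `0x01` in the flag region yields `0x02` there. This is why the quoted form depends on the region having been cleared beforehand.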
```mermaid
flowchart TD
    A["or_mvcc_set_header(record, hdr)"] --> B["old=lookup[old_flag]\nnew=lookup[hdr.flag]"]
    B --> C{"old != new?"}
    C -- "yes" --> D{"area_size big enough?"}
    D -- "no" --> E["assert(false); exit_on_error"]
    D -- "yes" --> F["HEAP_MOVE_INSIDE_RECORD"]
    C -- "no" --> G["set repid+flags"]
    F --> G
    G --> H["set_chn -> set_insid -> set_delid -> set_prev_version_lsa"]
    H --> I["NO_ERROR"]
```
Figure 3-3: resize-aware header rewrite in or_mvcc_set_header.
3.5 Heap-layer entry points and where the stamp lands
heap_get_mvcc_header dispatches on context->record_type. REC_HOME ->
spage_get_record(..., PEEK) then or_mvcc_get_header; a PEEK failure or get
error both assert(false) and return S_ERROR (impossible by construction —
page latched, slot validated). REC_BIGONE -> needs a forward page;
delegates to the overflow reader. REC_RELOCATION -> reads the forward
page at context->forward_oid.slotid, same get + error handling. default
-> assert(false), S_ERROR — any other type is a caller bug.
heap_get_mvcc_rec_header_from_overflow is the special case: overflow records
always store a maximum-size header. If the caller passes a NULL
peek_recdes, it falls back to a stack ovf_recdes scratch buffer before
pointing ->data at overflow_get_first_page_data (ovf_page) and forcing
->length = OR_MVCC_MAX_HEADER_SIZE, then calls or_mvcc_get_header, because
on overflow pages the optional fields are always materialized (its sibling
heap_set_mvcc_rec_header_on_overflow force-sets VALID_INSID/VALID_DELID
and asserts OR_MVCC_MAX_HEADER_SIZE). The zero-cost-slot optimization is
thus a home/relocation property only; big records trade space for a fixed
layout.
Where does the insert stamp get written? The freshly built record carries
VALID_INSID, set in heap_attrinfo_transform_header_to_disk (and
heap_insert_adjust_recdes_header) via the repid_bits |= (OR_MVCC_FLAG_VALID_INSID << OR_MVCC_FLAG_SHIFT_BITS) branch, but the insid
slot is initially 0. The real MVCCID is fetched via
logtb_get_current_mvccid(thread_p) at log time. heap_mvcc_log_insert
forces lazy issuance per §3.1 and logs only the rep-word, CHN, and body as
redo crumbs — never the insid bytes. It branches on record size and on the
logging mode. For a non-REC_BIGONE record it emits four redo crumbs (record
type, the rep-word OR_INT_SIZE, CHN OR_INT_SIZE, then the body from
OR_HEADER_SIZE(p_recdes->data) onward — past the insid slot); for
REC_BIGONE it skips the header crumbs and logs only record type plus the
full record body (the overflow page carries its own max-size header). When
thread_p->no_logging is set it calls log_append_undo_crumbs (RVHF_MVCC_INSERT, ...) with no redo, otherwise log_append_undoredo_crumbs:
```c
// heap_mvcc_log_insert -- src/storage/heap_file.c
redo_crumbs[n_redo_crumbs].length = sizeof (p_recdes->type);   /* <- always: record type */
redo_crumbs[n_redo_crumbs++].data = &p_recdes->type;
if (p_recdes->type != REC_BIGONE)
  {
    // ... rep-word OR_INT_SIZE, CHN OR_INT_SIZE crumbs ...
    data_copy_offset = OR_HEADER_SIZE (p_recdes->data);   /* <- body starts past insid slot */
  }                                                       /* <- REC_BIGONE: data_copy_offset stays 0 */
redo_crumbs[n_redo_crumbs].length = p_recdes->length - data_copy_offset;
redo_crumbs[n_redo_crumbs++].data = p_recdes->data + data_copy_offset;
if (thread_p->no_logging)
  log_append_undo_crumbs (thread_p, RVHF_MVCC_INSERT, p_addr, 0, NULL);   /* <- undo only */
else
  log_append_undoredo_crumbs (thread_p, RVHF_MVCC_INSERT, p_addr, 0, n_redo_crumbs, NULL, redo_crumbs);
```

On redo, heap_rv_mvcc_redo_insert
re-stamps via MVCC_SET_INSID (&mvcc_rec_header, rcv->mvcc_id) then
or_mvcc_add_header. So the MVCCID is logically issued lazily (§3.1),
threaded through the log record’s mvcc_id, and physically written into the
insid slot by the redo/apply path — recovery re-derives the stamp from
rcv->mvcc_id, avoiding a double source of truth.
3.6 Interpreting the flags: the MVCC_IS_HEADER_* macros
Once a header is in RAM, callers ask three boolean questions, each combining a flag test with a value test:

```c
// MVCC_IS_HEADER_* -- src/transaction/mvcc.h
#define MVCC_IS_HEADER_DELID_VALID(h) \
  (MVCC_IS_FLAG_SET (h, OR_MVCC_FLAG_VALID_DELID) && MVCCID_IS_VALID (MVCC_GET_DELID (h)))
#define MVCC_IS_HEADER_INSID_NOT_ALL_VISIBLE(h) \
  (MVCC_IS_FLAG_SET (h, OR_MVCC_FLAG_VALID_INSID) && MVCC_GET_INSID (h) != MVCCID_ALL_VISIBLE)
#define MVCC_IS_HEADER_ALL_VISIBLE(h) \
  (!MVCC_IS_FLAG_SET (h, OR_MVCC_FLAG_VALID_INSID|OR_MVCC_FLAG_VALID_DELID) \
   && MVCC_GET_INSID (h) == MVCCID_ALL_VISIBLE)
```

The double test guards a half-built header: a flag may be set while the id is
still MVCCID_NULL (during construction), or the in-RAM struct may hold
MVCCID_ALL_VISIBLE with no flag set (what or_mvcc_get_insid returns when
VALID_INSID is clear). ALL_VISIBLE is the steady state of a row old enough
that vacuum stripped its insid flag — no insid, no delid, in-RAM insid reads
as literal 3 — so it is unconditionally visible with no snapshot comparison.
These feed Chapter 6 (visibility) and Chapter 7 (vacuum predicates).
3.7 Open question — the two spare flag bits
OR_MVCC_FLAG_MASK reserves five bits (0x1f), but only three are
defined: VALID_INSID (0x01), VALID_DELID (0x02), VALID_PREV_VERSION
(0x04). Bits 0x08 and 0x10 are masked-in and shifted but never assigned a
meaning, and mvcc_header_size_lookup is sized [8] — it indexes only the
three defined bits, so setting 0x08 would index out of bounds. The header
comment implies deliberate reserve, but the intended use is undocumented.
Anyone adding a fourth on-record MVCC field must widen
mvcc_header_size_lookup, extend every cumulative offset macro, and audit the
OR_MVCC_MAX_HEADER_SIZE = 32 assertion on overflow pages.
3.8 Chapter summary — key takeaways
- MVCCIDs are issued lazily, once per writing transaction. `logtb_get_current_mvccid` mints `curr_mvcc_info->id` only when `MVCCID_NULL`; the mint/validity test runs unconditionally before the `sub_ids` branch, so even a sub-transaction path mints the parent id first.
- The allocator is a tiny lock-guarded counter. `get_new_mvccid` reads and `MVCCID_FORWARD`s `log_Gl.hdr.mvcc_next_id` under `m_new_mvccid_lock`; `get_two_new_mvccid` forwards twice for the parent+sub bootstrap, giving the parent the lower id. Gaps from rollbacks are harmless.
- `mvcc_rec_header` is wider in RAM than on disk. `mvcc_flag`'s low 5 bits decide which of insid, delid, and prev_version_lsa are physically present; the `chn` slot is role-shared with the delete-id region.
- Zero-cost unused slots come from cumulative offset macros plus `mvcc_header_size_lookup[8]`. A live un-updated row carries a 16-byte header. Bounds: `OR_MVCC_MIN_HEADER_SIZE = 8`, `OR_MVCC_MAX_HEADER_SIZE = 32`.
- Get/set helpers are flag-gated and pointer-consistent. A getter that skips a field must not advance the pointer; `or_mvcc_set_header` reconciles size via `HEAP_MOVE_INSIDE_RECORD`, while `or_mvcc_set_flag` clears the low 5 bits and `+=`-combines new flags into bits 24+ without resizing, which is a sharp edge.
- The stamp is written on the recovery/apply path. Insert builds a header with `VALID_INSID` (via `heap_attrinfo_transform_header_to_disk`/`heap_insert_adjust_recdes_header`) and a zero insid; `heap_mvcc_log_insert` forces issuance, and `heap_rv_mvcc_redo_insert`'s `MVCC_SET_INSID(..., rcv->mvcc_id)` is where the real MVCCID lands. Overflow records always store the max-size header.
- Two of the five flag bits are unused and unreachable through the `[8]`-wide size table, a constraint for any future on-record MVCC field.
Chapter 4: Active-Set Reads — Bit-Area Probe and Cached Scalars
Chapters 1–3 built the active-set representation (sliding m_bit_area,
overflow m_long_tran_mvccids, on-record header). This chapter answers:
given an MVCCID, how does the code decide whether it is still active,
and how are the two cached short-circuit scalars (lowest_active_mvccid,
highest_completed_mvccid) derived from the raw bits?
Three layers stack: the bit-area probe mvcc_active_tran::is_active; the
derivation scans compute_highest_completed_mvccid /
compute_lowest_active_mvccid that flatten the bit area into a scalar;
and the wrappers mvcc_is_id_in_snapshot / mvcc_is_active_id that try
the scalars first and fall back to the probe only inside the active
window. For why the short-circuit is the fast path, see Snapshot
Visibility in cubrid-mvcc.md; this chapter traces the code.
4.1 The three structs in play
The probe lives on mvcc_active_tran, the scalars on mvcc_snapshot, the
per-transaction cache on mvcc_info.
```mermaid
flowchart TB
    INFO["mvcc_info (per active tran, in LOG_TDES)<br/>id / recent_snapshot_lowest_active_mvccid / sub_ids"]
    SNAPSHOT["mvcc_snapshot<br/>lowest_active_mvccid / highest_completed_mvccid (scalars)"]
    ACTIVE["mvcc_active_tran<br/>m_bit_area + start + length / m_long_tran_mvccids"]
    INFO -->|owns snapshot| SNAPSHOT
    SNAPSHOT -->|owns m_active_mvccs| ACTIVE
```
Figure 4-1. mvcc_info owns a mvcc_snapshot, which owns the
mvcc_active_tran bit area plus the two derived scalars.
mvcc_active_tran — mapped in Ch. 1; read-path fields recapped.
| Field | Role | Why it exists |
|---|---|---|
| `m_bit_area` | `unit_type[BITAREA_MAX_SIZE]`; bit i set ⇒ start+i committed | Dense window of recently-completed MVCCIDs |
| `m_bit_area_start_mvccid` | MVCCID at bit offset 0 | Probe subtracts it to make a bit offset |
| `m_bit_area_length` | Window length in bits | Bounds-check; 0 ⇒ all active |
| `m_long_tran_mvccids` | Sorted-ascending active MVCCIDs below window | Long-runners the window evicted |
| `m_long_tran_mvccids_length` | Count of the above | Loop bound; `[0]` is global lowest active |
| `m_initialized` | Allocation guard | Read paths assume `m_bit_area != NULL` |
mvcc_snapshot — the caller-facing snapshot record.
| Field | Role | Why it exists |
|---|---|---|
| `lowest_active_mvccid` | Scalar: anything strictly below is committed | 1st short-circuit: PRECEDES ⇒ not in snapshot |
| `highest_completed_mvccid` | Scalar: anything >= is active wrt this snapshot | 2nd short-circuit: FOLLOW_OR_EQUAL ⇒ in snapshot |
| `m_active_mvccs` | The `mvcc_active_tran` bit area copied at build time | Authoritative answer inside the active window |
| `snapshot_fnc` | Visibility function bound to the snapshot | Set by `build_mvcc_info`; Ch. 5–6 |
| `valid` | Whether the snapshot is populated | Guards stale reads |
mvcc_info — attached to every active transaction’s LOG_TDES.
| Field | Role | Why it exists |
|---|---|---|
| `snapshot` | The transaction's own `mvcc_snapshot` | Built once per statement/transaction |
| `id` | Own MVCCID (`MVCCID_NULL` if not written) | `logtb_is_current_mvccid` matches it; own writes are active to me |
| `recent_snapshot_lowest_active_mvccid` | Lowest-active from the most recent snapshot | `mvcc_is_active_id` short-circuit: below ⇒ committed, skip table |
| `sub_ids` | Running sub-transaction MVCCIDs | `logtb_is_current_mvccid` also matches these (Ch. 10) |
| `is_sub_active` | Sub-transaction running | Sub-transaction bookkeeping (Ch. 10) |
Invariant (bit semantics). A set bit means committed, a clear bit means active. This is the inverse of the naive reading, so `is_active` returns `!is_set(position)`. `ALL_ACTIVE = 0` and `ALL_COMMITTED = (unit_type) -1` are the extreme units. Flip this polarity and every visibility decision inverts: committed rows vanish.
4.2 Bit addressing — the four helpers
Every probe reduces to “which unit, which bit”. Four tiny but load-bearing inline helpers do the arithmetic; an off-by-one corrupts visibility for the whole window.

```cpp
// get_bit_offset / get_unit_of / get_mask_of / is_set -- src/transaction/mvcc_active_tran.cpp
size_t get_bit_offset (MVCCID mvccid) const
{ return static_cast<size_t> (mvccid - m_bit_area_start_mvccid); }   /* <- MVCCID to bit index */

unit_type *get_unit_of (size_t bit_offset) const
{ return m_bit_area + (bit_offset / UNIT_BIT_COUNT); }   /* <- which 64-bit word (UNIT_BIT_COUNT==64) */

static unit_type get_mask_of (size_t bit_offset)
{ return ((unit_type) 1) << (bit_offset & 0x3F); }   /* <- bit within word; &0x3F == mod 64 */

bool is_set (size_t bit_offset) const
{ return ((*get_unit_of (bit_offset)) & get_mask_of (bit_offset)) != 0; }
```

get_unit_of divides by 64 for the word; get_mask_of’s & 0x3F is a
fast % 64 for the intra-word bit; is_set composes them. is_set has
no bounds check — callers must verify the offset against
m_bit_area_length first (§4.3 invariant).
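A worked instance of the addressing arithmetic may help. The sketch below uses free functions over a raw array instead of the member helpers; the `*_sketch` names are illustrative, and the 64-bit unit width is taken from the comment in the quoted code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

using unit_type = std::uint64_t;
constexpr std::size_t UNIT_BIT_COUNT = 64;

// Which 64-bit word holds the bit.
inline std::size_t unit_index_sketch (std::size_t bit_offset)
{
  return bit_offset / UNIT_BIT_COUNT;
}

// Mask for the bit inside its word; & 0x3F is the same as % 64.
inline unit_type mask_sketch (std::size_t bit_offset)
{
  return unit_type {1} << (bit_offset & 0x3F);
}

// No bounds check here, matching the text: callers validate the offset first.
inline bool is_set_sketch (const unit_type *area, std::size_t bit_offset)
{
  return (area[unit_index_sketch (bit_offset)] & mask_sketch (bit_offset)) != 0;
}
```

For a window starting at MVCCID 100, MVCCID 170 maps to bit offset 70, which lands in word 1 at intra-word bit 6 (mask 64).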
4.3 is_active — the three-case probe
The bottom of the stack: “is this MVCCID active in this captured active set”, in three mutually exclusive cases.

```cpp
// mvcc_active_tran::is_active -- src/transaction/mvcc_active_tran.cpp
  if (MVCC_ID_PRECEDES (mvccid, m_bit_area_start_mvccid))   /* CASE 1: below the window */
    {
      if (m_long_tran_mvccids != NULL)
        for (size_t i = 0; i < m_long_tran_mvccids_length; i++)
          if (mvccid == m_long_tran_mvccids[i])
            return true;   /* <- in long-tran overflow: active */
      return false;        /* <- below window, not long-tran: committed */
    }
  else if (m_bit_area_length == 0)   /* CASE 2: empty window */
    return true;                     /* <- nothing committed yet: active */
  else                               /* CASE 3: inside / above the window */
    {
      size_t position = get_bit_offset (mvccid);
      if (position < m_bit_area_length)
        return !is_set (position);   /* <- in window: set bit == committed */
      else
        return true;                 /* <- above highest tracked bit: active */
    }
```

MVCC_ID_PRECEDES(id1, id2) is (id1) < (id2); the CASE 1/2/3 comments
above annotate each branch.
Invariant (probe bounds safety). `is_set`/`get_unit_of` are dereferenced only after the `position < m_bit_area_length` test in CASE 3, reached only when `m_bit_area_length != 0`. CASE 2 answers the empty window explicitly before any offset is computed; since `position` is an unsigned `size_t`, the CASE 3 comparison could not pass against a zero length either, so the bit area is never read for an empty window. The "all bits beyond `m_bit_area_length` are `ALL_ACTIVE`" invariant (`check_valid`, Ch. 2) makes the out-of-range case safe to call active.
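The three cases can be exercised on a small stand-alone model. This is a sketch, not the CUBRID class: `active_set_sketch` is a hypothetical struct whose members echo the §4.1 field table, with `std::vector` replacing the raw arrays.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using MVCCID = std::uint64_t;
using unit_type = std::uint64_t;

// Illustrative model only; a set bit means committed, a clear bit means active.
struct active_set_sketch
{
  std::vector<unit_type> bit_area;   // the sliding window
  MVCCID start = 4;                  // MVCCID at bit offset 0
  std::size_t length = 0;            // number of tracked bits
  std::vector<MVCCID> long_trans;    // active ids below the window, ascending

  bool is_active (MVCCID id) const
  {
    if (id < start)                  // CASE 1: below the window
      {
        for (MVCCID lt : long_trans)
          if (lt == id)
            return true;             // still running long transaction
        return false;                // below window and not listed: committed
      }
    if (length == 0)                 // CASE 2: empty window
      return true;
    std::size_t pos = static_cast<std::size_t> (id - start);   // CASE 3
    if (pos >= length)
      return true;                   // above the highest tracked bit
    return (bit_area[pos / 64] & (unit_type {1} << (pos & 0x3F))) == 0;
  }
};
```

With `start = 100`, `length = 8`, and a first word of `0x05`, ids 100 and 102 read as committed, 101 as active, and anything at or beyond 108 as active because it falls outside the tracked bits.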
4.4 compute_highest_completed_mvccid — top-down highest set bit
Flattens the bit area into the highest completed MVCCID; build_mvcc_info
(Ch. 5) MVCCID_FORWARDs the result by one.
```cpp
// mvcc_active_tran::compute_highest_completed_mvccid -- src/transaction/mvcc_active_tran.cpp
  if (m_bit_area_length == 0)
    return m_bit_area_start_mvccid - 1;   /* <- EMPTY: nothing completed; one below start */
  // ... declarations condensed ...
  for (highest_completed_bit_area = get_unit_of (m_bit_area_length - 1);   /* <- scan units top-down */
       highest_completed_bit_area >= m_bit_area; --highest_completed_bit_area)
    {
      bits = *highest_completed_bit_area;
      if (bits == 0)
        continue;   /* <- ALL_ACTIVE unit: keep going down */
      for (bit_pos = 0, count_bits = UNIT_BIT_COUNT / 2; count_bits > 0; count_bits /= 2)
        if (bits >= (1ULL << count_bits))   /* <- in-word search for highest set bit */
          {
            bit_pos += count_bits;
            bits >>= count_bits;
          }
      highest_bit_position = bit_pos;
      break;
    }
  if (highest_completed_bit_area < m_bit_area)   /* <- ran off bottom: no set bit anywhere */
    return m_bit_area_start_mvccid - 1;
  else
    return get_mvccid (units_to_bits (highest_completed_bit_area - m_bit_area) + highest_bit_position);
```

Empty window and not-found both yield m_bit_area_start_mvccid - 1. The
found path scans units top-down (skipping ALL_ACTIVE) with a 6-step
(log2 64) in-word binary search for the most-significant set bit
(software clz).
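The in-word search is compact enough to isolate. The sketch below lifts the inner loop into a free function with a hypothetical name, `highest_set_bit_sketch`, assuming a 64-bit unit; it is meant as a way to trace the six halving steps, not as the CUBRID code.

```cpp
#include <cassert>
#include <cstdint>

// Precondition: bits != 0. Returns the index (0..63) of the most significant set bit.
inline unsigned highest_set_bit_sketch (std::uint64_t bits)
{
  assert (bits != 0);
  unsigned bit_pos = 0;
  // Halve the search width each step: 32, 16, 8, 4, 2, 1.
  for (unsigned count_bits = 32; count_bits > 0; count_bits /= 2)
    if (bits >= (std::uint64_t {1} << count_bits))
      {
        // Something is set at or above count_bits: discard the lower half.
        bit_pos += count_bits;
        bits >>= count_bits;
      }
  return bit_pos;
}
```

For `0x28` (bits 3 and 5 set) the first three comparisons fail, the 4-bit step succeeds and shifts the value to `0x2`, and the 1-bit step succeeds, giving position 5.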
4.5 compute_lowest_active_mvccid — bottom-up lowest clear bit
The mirror: lowest still-active MVCCID. complete_mvcc (Ch. 8) calls it
to advance the watermark.
```cpp
// mvcc_active_tran::compute_lowest_active_mvccid -- src/transaction/mvcc_active_tran.cpp
  if (m_long_tran_mvccids_length > 0 && m_long_tran_mvccids != NULL)
    return m_long_tran_mvccids[0];   /* <- SHORTCUT: sorted, [0] is lowest active */
  if (m_bit_area_length == 0)
    return m_bit_area_start_mvccid;  /* <- EMPTY: lowest active is window start */
  unit_type *end_bit_area = get_unit_of (m_bit_area_length - 1);
  size_t lowest_bit_pos = 0;
  // other declarations condensed
  for (lowest_active_bit_area = m_bit_area; lowest_active_bit_area <= end_bit_area;
       ++lowest_active_bit_area)
    {
      bits = *lowest_active_bit_area;
      if (bits == ALL_COMMITTED)   /* <- whole word committed: skip 64 bits */
        {
          lowest_bit_pos += UNIT_BIT_COUNT;
          continue;
        }
      for (bit_pos = 0, count_bits = UNIT_BIT_COUNT / 2; count_bits > 0; count_bits /= 2)
        {
          mask = (1ULL << count_bits) - 1;
          if ((bits & mask) == mask)   /* <- low count_bits all set: clear bit is higher */
            {
              bit_pos += count_bits;
              bits >>= count_bits;
            }
        }
      lowest_bit_pos += bit_pos;
      break;
    }
  if (lowest_active_bit_area > end_bit_area)   /* <- every tracked bit set: no active bit */
    return get_mvccid (m_bit_area_length);
  else
    return get_mvccid (lowest_bit_pos);
```

The early returns cover the sorted-array [0] shortcut and the empty
window; otherwise units are scanned upward (skipping ALL_COMMITTED by
+64) and the inner loop mirrors §4.4 for the least-significant clear
bit. Not-found returns get_mvccid(m_bit_area_length).
Invariant (sorted overflow array). The `[0]` shortcut is correct only because `m_long_tran_mvccids` is ascending: `add_long_transaction` asserts each new entry exceeds the last. If the order broke, the watermark would jump forward and VACUUM could reclaim still-visible versions.
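The inner loop of §4.5 (finding the least-significant clear bit of one word) can be traced in isolation. The following is a sketch with a hypothetical function name, `lowest_clear_bit_sketch`, assuming a 64-bit unit; the caller is expected to have skipped fully committed words, as the outer loop does.

```cpp
#include <cassert>
#include <cstdint>

// Precondition: bits is not all-ones. Returns the index (0..63) of the lowest clear bit.
inline unsigned lowest_clear_bit_sketch (std::uint64_t bits)
{
  assert (bits != ~std::uint64_t {0});
  unsigned bit_pos = 0;
  // Halve the search width each step: 32, 16, 8, 4, 2, 1.
  for (unsigned count_bits = 32; count_bits > 0; count_bits /= 2)
    {
      std::uint64_t mask = (std::uint64_t {1} << count_bits) - 1;
      if ((bits & mask) == mask)
        {
          // The low count_bits are all set, so the first clear bit lies above them.
          bit_pos += count_bits;
          bits >>= count_bits;
        }
    }
  return bit_pos;
}
```

For `0x7` (bits 0 to 2 committed) the 2-bit and 1-bit steps both succeed, so the result is 3, the first still-active position in the word.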
4.6 Snapshot-layer wrapper — mvcc_is_id_in_snapshot
Puts the two scalars in front of the probe so the scan runs only inside the active window.

```cpp
// mvcc_is_id_in_snapshot -- src/transaction/mvcc.c (signature/assert elided)
  if (MVCC_ID_PRECEDES (mvcc_id, snapshot->lowest_active_mvccid))
    return false;   /* <- below lowest active: committed before snapshot */
  if (MVCC_ID_FOLLOW_OR_EQUAL (mvcc_id, snapshot->highest_completed_mvccid))
    return true;    /* <- at/above highest completed: active wrt snapshot */
  return snapshot->m_active_mvccs.is_active (mvcc_id);   /* <- gray zone: consult the bit area */
```

Only the gray band reaches the §4.3 probe. (MVCC_ID_FOLLOW_OR_EQUAL is
>=; build_mvcc_info sets highest_completed_mvccid = compute_highest_completed_mvccid() + 1 — the MVCCID_FORWARD in Ch. 5.)
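A toy model makes the three-way ordering concrete. This is a sketch under stated assumptions: toy_snapshot, toy_is_id_in_snapshot, and example_snapshot are hypothetical names, a std::set stands in for the bit area, and the precede/follow macros are taken to be plain integer comparisons.

```cpp
#include <cassert>
#include <cstdint>
#include <set>

using MVCCID = std::uint64_t;

// Toy snapshot: a std::set replaces the real bit area (Ch. 4).
struct toy_snapshot
{
  MVCCID lowest_active;       // ids below this committed before the snapshot
  MVCCID highest_completed;   // exclusive bound: ids at or above count as active
  std::set<MVCCID> active;    // ids inside the band that were still running
};

// Same ordering as the wrapper: two scalar cutoffs, then the exact probe.
inline bool toy_is_id_in_snapshot (const toy_snapshot &s, MVCCID id)
{
  if (id < s.lowest_active)
    {
      return false;   // below the band: committed before the snapshot
    }
  if (id >= s.highest_completed)
    {
      return true;    // at or above the band: not completed at snapshot time
    }
  return s.active.count (id) != 0;   // gray zone: exact membership test
}

// Example: band [10, 20) with ids 10 and 15 still active.
inline toy_snapshot example_snapshot ()
{
  return toy_snapshot{10, 20, {10, 15}};
}
```

Against the example snapshot, id 9 is answered by the first cutoff, id 20 by the second, and only ids 10 through 19 reach the set lookup.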
4.7 Snapshot-layer wrapper — mvcc_is_active_id
Answers “is this MVCCID active right now” (not against a frozen snapshot; dirty/delete paths in Ch. 7).

    // mvcc_is_active_id -- src/transaction/mvcc.c (tdes lookup / asserts elided)
    curr_mvcc_info = &tdes->mvccinfo;
    if (MVCC_ID_PRECEDES (mvccid, curr_mvcc_info->recent_snapshot_lowest_active_mvccid))
      return false;   /* <- below recent lowest-active: committed */
    if (logtb_is_current_mvccid (thread_p, mvccid))
      return true;    /* <- my own id or a sub-id: active to me */
    return log_Gl.mvcc_table.is_active (mvccid);   /* <- otherwise ask the shared table */

Both short-circuits answer from transaction-local state without touching the shared table: an id below
recent_snapshot_lowest_active_mvccid is committed, and an id matched by
logtb_is_current_mvccid against the transaction's own id or sub_ids is active to itself.
Everything else consults mvcctable::is_active (§4.8).
4.8 Table-level mvcctable::is_active — version-validated retry
The shared active set lives in the lock-free history ring
(m_trans_status_history, Ch. 8). A committing transaction can swap the
live entry mid-scan, so is_active validates m_version around the probe.

    // mvcctable::is_active -- src/transaction/mvcc_table.cpp (decls elided)
    do
      {
        index = m_trans_status_history_position.load ();             /* <- current ring slot */
        version = m_trans_status_history[index].m_version.load ();   /* <- snapshot the version BEFORE */
        ret_active = m_trans_status_history[index].m_active_mvccs.is_active (mvccid);   /* <- §4.3 probe */
      }
    while (version != m_trans_status_history[index].m_version.load ());   /* <- version moved? redo */
    return ret_active;

Invariant (version stability). A read is trusted only if m_version matches before and after the probe. Writers publish a new status slot with an incremented m_version, then advance m_trans_status_history_position to it (next_trans_status_start stamps the bumped version on the next slot; next_tran_status_finish advances the position once the slot is built; both in Ch. 8). A reader that latched index + version before the swap sees a mismatch on reload and retries. There is no retry cap; progress relies on writers being short relative to readers.
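The read-validate-retry shape can be sketched with standard atomics. This is an illustrative model only: toy_slot, toy_ring, publish, read, and publish_then_read are hypothetical names, a single writer is assumed, and the payload is itself an atomic integer so the sketch stays free of data races (the real slot carries a whole bit area).

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t TOY_RING_SIZE = 4;   // hypothetical ring length

struct toy_slot
{
  std::atomic<std::uint64_t> version{0};
  std::atomic<std::uint64_t> payload{0};   // stands in for the active set
};

struct toy_ring
{
  toy_slot slots[TOY_RING_SIZE];
  std::atomic<std::size_t> position{0};

  // Writer: stamp the next slot with a bumped version, fill it, then
  // advance the position so readers start using it.
  void publish (std::uint64_t new_payload)
  {
    const std::size_t cur = position.load ();
    const std::size_t next = (cur + 1) % TOY_RING_SIZE;
    slots[next].version.store (slots[cur].version.load () + 1);
    slots[next].payload.store (new_payload);
    position.store (next);
  }

  // Reader: trust the payload only if the version is unchanged around the read.
  std::uint64_t read () const
  {
    std::size_t index;
    std::uint64_t version;
    std::uint64_t value;
    do
      {
        index = position.load ();
        version = slots[index].version.load ();
        value = slots[index].payload.load ();
      }
    while (version != slots[index].version.load ());
    return value;
  }
};

// Single-threaded usage: publish count values starting at first, then read.
inline std::uint64_t publish_then_read (std::uint64_t first, std::uint64_t count)
{
  toy_ring ring;
  for (std::uint64_t i = 0; i < count; ++i)
    {
      ring.publish (first + i);
    }
  return ring.read ();
}
```

Because versions grow monotonically across the ring, a reader that latched a slot before the writer wrapped around onto it observes a different version on reload and repeats the loop.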
4.9 Chapter summary — key takeaways
- Set bit means committed, clear bit means active (inverse of the naive reading); is_active returns !is_set(position), with ALL_ACTIVE = 0 / ALL_COMMITTED = -1 as the extreme units.
- is_active has three ordered cases: below the window (scan sorted long-tran array, else committed), empty (active), inside/above (bit lookup, or active above the tracked length); the order keeps is_set from being dereferenced on an empty/out-of-range window.
- The two derivation scans mirror each other (highest set bit top-down vs. lowest clear bit bottom-up); both are a 6-step in-word binary search with explicit not-found fallbacks (start - 1, get_mvccid(m_bit_area_length)).
- compute_lowest_active_mvccid short-circuits on m_long_tran_mvccids[0] because the overflow array is kept sorted ascending.
- The cached scalars front-run the probe: mvcc_is_id_in_snapshot touches the bit area only for the gray band between the two scalars; mvcc_is_active_id adds the recent_snapshot_lowest_active_mvccid cache and logtb_is_current_mvccid (own id + sub-ids).
- The shared table read is optimistic, not locked: mvcctable::is_active validates m_version around the probe, retrying on change.
Chapter 5: Snapshot Construction
A read-only transaction photographs the global commit state into a private
structure, then reads against that photograph rather than freezing the system.
This chapter dissects how the photograph is taken: the entry guard in
logtb_get_mvcc_snapshot, the lock-free retry loop inside
mvcctable::build_mvcc_info, the strict-order publish dance that keeps VACUUM
honest, and which bytes fill each snapshot field. The meaning of a snapshot
is in the companion cubrid-mvcc.md §“Snapshot isolation”; Ch. 6 consumes the
structure built here; Ch. 4 explains the mvcc_active_tran layout it copies.
5.1 The four structures this chapter fills
A snapshot is not one object. It is mvcc_snapshot nested inside mvcc_info,
fed by a copy of one mvcc_trans_status slot, whose payload is a
mvcc_active_tran bit-area. Figure 5-1 shows the containment.
flowchart LR
subgraph TDES["log_tdes (per transaction)"]
INFO["mvcc_info mvccinfo"]
end
INFO --> SNAP["mvcc_snapshot snapshot"]
SNAP --> ACT["mvcc_active_tran m_active_mvccs"]
subgraph GLOBAL["mvcctable (global)"]
HIST["m_trans_status_history[index]<br/>mvcc_trans_status"]
end
HIST --> HACT["mvcc_active_tran m_active_mvccs"]
HACT -. "copy_to THREAD_UNSAFE" .-> ACT
Figure 5-1. Snapshot construction copies one global history slot’s bit-area into the transaction-private snapshot.
mvcc_snapshot (mvcc.h) — the read-against photograph:
| Field | Role | Why it exists |
|---|---|---|
| lowest_active_mvccid | Anything < this is committed; never needs a bit probe | Fast lower-bound cutoff during visibility (Ch. 6) |
| highest_completed_mvccid | Anything >= this was born after the snapshot, hence invisible | Fast upper-bound cutoff during visibility (Ch. 6) |
| m_active_mvccs | Bit-area: per-MVCCID committed/active status for the gap between the two bounds | Exact answer for IDs in the ambiguous middle range |
| snapshot_fnc | Function pointer to the visibility predicate, set to mvcc_satisfies_snapshot | Lets callers invoke visibility polymorphically (dirty/snapshot variants share the call shape) |
| valid | True once fully built; checked by the entry guard | Avoids rebuilding mid-transaction (RR/SR) and signals RC invalidation |
mvcc_info (mvcc.h) — the per-transaction MVCC envelope:
| Field | Role | Why it exists |
|---|---|---|
| snapshot | The mvcc_snapshot built here | The read photograph |
| id | This transaction’s own MVCCID (Ch. 3); MVCCID_NULL until first write | Self-visibility checks |
| recent_snapshot_lowest_active_mvccid | Cached copy of the snapshot’s lowest active | A second fast cutoff used outside the snapshot struct (sibling predicates, Ch. 7) |
| sub_ids | Sub-transaction MVCCID stack (Ch. 10) | Savepoint / nested-statement visibility |
| is_sub_active | True while a sub-transaction runs | Routes special paths (Ch. 10) |
mvcc_trans_status (mvcc_table.hpp) — one global commit-state ring slot:
| Field | Role | Why it exists |
|---|---|---|
| m_active_mvccs | The authoritative live bit-area at the moment this slot was published | The data the snapshot copies |
| m_last_completed_mvccid | Last MVCCID completed when slot was written; “just for info” | Debugging / history forensics only |
| m_event_type | COMMIT/ROLLBACK/SUBTRAN, “just for info” | Debugging only |
| m_version | atomic<version_type> bumped each time the writer rewrites this slot | The lock-free guard: read it before and after the copy |
mvcc_active_tran (mvcc_active_tran.hpp) — the bit-area itself. Deep
semantics (bit packing, long-tran migration, BITAREA_MAX_SIZE = 500) are in
Ch. 4; this chapter only exercises copy and reset (§5.4). All six fields:
| Field | Role | Why it exists |
|---|---|---|
| m_bit_area | Pointer to the unit_type[] bit buffer; bit n = status of start + n | The packed committed/active map (Ch. 4) |
| m_bit_area_start_mvccid | MVCCID mapped by bit 0 of m_bit_area | Anchors the bit range to an absolute MVCCID |
| m_bit_area_length | Live length in bits; bits past it are all-active (0) | Bounds the valid prefix; drives get_bit_area_memsize |
| m_long_tran_mvccids | Overflow array of still-active MVCCIDs older than start | Holds long transactions that fell off the bit-area window |
| m_long_tran_mvccids_length | Count of entries in the overflow array | Bounds the long-tran copy/scan |
| m_initialized | True once buffers are allocated | copy_to/check_valid assert on it before touching buffers |
5.2 The entry guard: who is allowed a snapshot
Every read path funnels through logtb_get_mvcc_snapshot. It is a guard, not
a builder: it decides whether to (re)build at all.

    // logtb_get_mvcc_snapshot -- src/transaction/log_tran_table.c
    LOG_TDES *tdes = LOG_FIND_TDES (LOG_FIND_THREAD_TRAN_INDEX (thread_p));
    if (!tdes->is_active_worker_transaction ())
      {
        return NULL;   /* <- system trans read latest committed, no MVCC photo */
      }
    assert (tdes != NULL);   /* <- in source: AFTER the early return */
    THREAD_ENTRY *main_thread_p = NULL;
    if (thread_p->m_px_orig_thread_entry != NULL)
      {
        main_thread_p = thread_get_main_thread (thread_p);
        pthread_mutex_lock (&main_thread_p->m_px_lock_mutex);   /* <- parallel-px workers share one snapshot */
      }
    if (!tdes->mvccinfo.snapshot.valid)
      {
        log_Gl.mvcc_table.build_mvcc_info (*tdes);   /* <- only build when invalid */
      }
    if (main_thread_p != NULL)
      {
        pthread_mutex_unlock (&main_thread_p->m_px_lock_mutex);
      }
    return &tdes->mvccinfo.snapshot;

Branch accounting:

1. Not an active worker transaction → return NULL. System transactions (VACUUM, checkpoint, recovery) have no MVCC photo; callers treat NULL as “see everything committed.”
2. Parallel-px worker (m_px_orig_thread_entry != NULL) → take the main thread’s m_px_lock_mutex. px sub-threads share the main transaction’s tdes, so the lock serializes the valid check and build: two workers must not both call build_mvcc_info on one tdes.
3. Snapshot valid → skip the build (common for RR/SR after the first statement); invalid → build, then unlock if step 2 locked. Always return the pointer.
Invariant — one build per validity epoch. build_mvcc_info runs only while
snapshot.valid == false. RR/SR stay valid for the whole transaction (built
once); RC’s logtb_invalidate_snapshot_data sets valid = false per statement
(§5.6). Building while valid == true would overwrite a live snapshot mid-scan
and corrupt in-flight visibility decisions.
5.3 build_mvcc_info: the lock-free copy with version re-check
mvcctable::build_mvcc_info must copy a global ring slot that the commit path
may be rewriting concurrently, without taking a lock. It does so with an
optimistic version re-check. Figure 5-2 is the authoritative control flow; the
notes below add only what the diagram cannot show.
flowchart TD
A["initialize snapshot bit-area"] --> B["tx_lowest_active =<br/>load m_transaction_lowest_visible[tran_index]"]
B --> C{"MVCCID_IS_VALID<br/>tx_lowest_active?"}
C -- "no = not yet published" --> D["set my slot = MVCCID_ALL_VISIBLE<br/>then read crt_status_lowest_active<br/>then store it back -- strict order"]
C -- "yes = already published" --> E["crt_status_lowest_active =<br/>load m_current_status_lowest_active"]
D --> F["index = history_position.load"]
E --> F
F --> G["ver = slot.m_version.load"]
G --> H["slot.m_active_mvccs.copy_to<br/>dest, THREAD_UNSAFE"]
H --> I["logtb_load_global_statistics_to_tran"]
I --> J{"ver == slot.m_version.load?"}
J -- "yes = stable" --> K["break"]
J -- "no = writer raced us" --> L["dest.reset_active_transactions"]
L --> M["retry_count++"]
M --> G
K --> N["check_valid; fill scalar fields"]
Figure 5-2. build_mvcc_info principal control flow. The non-fatal statistics-load error path (node I) is described in prose below; it sets an error and continues without breaking the loop.
The lowest-visible publish dance — the documented VACUUM race. The function’s
subtlest code. When !MVCCID_IS_VALID (tx_lowest_active) (no lowest-visible
value published yet), it does three atomics in strict order: publish the
sentinel MVCCID_ALL_VISIBLE into my slot, read the global lowest active,
store that global value back into my slot.
    // mvcctable::build_mvcc_info -- src/transaction/mvcc_table.cpp
    oldest_active_set (m_transaction_lowest_visible_mvccids[tdes.tran_index], tdes.tran_index,
                       MVCCID_ALL_VISIBLE, oldest_active_event::BUILD_MVCC_INFO);
    /* Is important that between next two code lines to not have delays (e.g. instructions adding). */
    crt_status_lowest_active = oldest_active_get (m_current_status_lowest_active_mvccid, 0,
                                                  oldest_active_event::BUILD_MVCC_INFO);
    oldest_active_set (m_transaction_lowest_visible_mvccids[tdes.tran_index], tdes.tran_index,
                       crt_status_lowest_active, oldest_active_event::BUILD_MVCC_INFO);

The in-source comment walks a 5-step scenario the sentinel defeats. The key steps, quoted verbatim:
- the transaction having global lowest active MVCCID commits, so the global value is updated (advanced)
- the VACUUM thread computes the MVCCID threshold as the updated global lowest active MVCCID
- the snapshot thread resumes and p_transaction_lowest_active_mvccid is set to initial value of global lowest active MVCCID
- the VACUUM thread computes the threshold again and found a value (initial global lowest active MVCCID) less than the previous threshold
That is: VACUUM computes the threshold from the advanced global lowest, then
the suspended snapshot thread resumes and stores its older initial value, so
VACUUM’s next threshold comes out less than the previous one — moving the
watermark backward. Setting the sentinel makes compute_oldest_visible_mvccid
(Ch. 9) wait for this slot. When tx_lowest_active is already valid (the
common retry path), the dance is skipped and only the single
m_current_status_lowest_active_mvccid load runs.
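The ordering of the three atomics can be sketched as follows. This is a simplified model, not the CUBRID code: toy_table, toy_publish_lowest_visible, and toy_slot_is_settled are hypothetical names, one transaction slot is modeled, and the VACUUM side is reduced to a predicate that reports whether the slot may be used yet.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

using MVCCID = std::uint64_t;
constexpr MVCCID MVCCID_NULL = 0;
constexpr MVCCID MVCCID_ALL_VISIBLE = 3;

struct toy_table
{
  std::atomic<MVCCID> global_lowest_active{4};
  std::atomic<MVCCID> my_lowest_visible{MVCCID_NULL};   // this transaction's slot
};

// Snapshot side: sentinel first, then read the global value, then store it.
inline MVCCID toy_publish_lowest_visible (toy_table &t)
{
  if (t.my_lowest_visible.load () == MVCCID_NULL)
    {
      t.my_lowest_visible.store (MVCCID_ALL_VISIBLE);         // 1. mark "computing"
      const MVCCID lowest = t.global_lowest_active.load ();   // 2. read global lowest
      t.my_lowest_visible.store (lowest);                     // 3. publish it
      return lowest;
    }
  return t.global_lowest_active.load ();   // already published: single load
}

// VACUUM side (hypothetical): a slot holding the sentinel is mid-publish, so
// the caller must wait instead of using it for the threshold.
inline bool toy_slot_is_settled (const toy_table &t)
{
  return t.my_lowest_visible.load () != MVCCID_ALL_VISIBLE;
}
```

If the snapshot thread is suspended between steps 1 and 3, the slot still holds the sentinel, so a threshold computation that honors toy_slot_is_settled cannot run ahead of it and later be forced backward.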
Snapshotting the history slot. m_trans_status_history_position always
points at the current (newest) slot; the commit path advances it (Ch. 8). The
index is read once, then that slot’s m_version is read before the copy.
    // mvcctable::build_mvcc_info -- src/transaction/mvcc_table.cpp
    index = m_trans_status_history_position.load ();
    assert (index < HISTORY_MAX_SIZE);
    const mvcc_trans_status &trans_status = m_trans_status_history[index];
    trans_status_version = trans_status.m_version.load ();   /* <- version BEFORE copy */
    trans_status.m_active_mvccs.copy_to (tdes.mvccinfo.snapshot.m_active_mvccs,
                                         mvcc_active_tran::copy_safety::THREAD_UNSAFE);

THREAD_UNSAFE is deliberate: check_valid cannot run mid-copy on a slot that
may mutate under us; validity is verified only after the loop confirms
stability. logtb_load_global_statistics_to_tran runs next; on error it sets
ER_MVCC_CANT_GET_SNAPSHOT but does not abort the build or break the loop.
The version re-check — the lock-free pivot.
    // mvcctable::build_mvcc_info -- src/transaction/mvcc_table.cpp
    if (trans_status_version == trans_status.m_version.load ())   /* <- version AFTER copy */
      {
        break;   /* <- writer did not touch slot; copy is consistent */
      }
    else
      {
        tdes.mvccinfo.snapshot.m_active_mvccs.reset_active_transactions ();   /* <- discard torn copy */
      }

Invariant (version-stable copy). The bit-area handed to the caller equals
exactly one published mvcc_trans_status image, because the same m_version
was observed before and after copy_to. The commit path bumps m_version
around every slot rewrite (Ch. 8), so any concurrent write is detected and
forces a retry; on detection reset_active_transactions zeroes the destination
so a stale tail cannot poison the next attempt. Without the re-check, a snapshot
could mix a pre-commit bit-area with post-commit bounds, and visibility (Ch. 6)
would answer inconsistently for the racing MVCCID.
Scalar fill after the loop.
    // mvcctable::build_mvcc_info -- src/transaction/mvcc_table.cpp
    tdes.mvccinfo.snapshot.m_active_mvccs.check_valid ();   /* <- now safe to validate */
    highest_completed_mvccid = tdes.mvccinfo.snapshot.m_active_mvccs.compute_highest_completed_mvccid ();
    MVCCID_FORWARD (highest_completed_mvccid);   /* <- exclusive upper bound */
    tdes.mvccinfo.recent_snapshot_lowest_active_mvccid = crt_status_lowest_active;
    tdes.mvccinfo.snapshot.snapshot_fnc = mvcc_satisfies_snapshot;
    tdes.mvccinfo.snapshot.lowest_active_mvccid = crt_status_lowest_active;
    tdes.mvccinfo.snapshot.highest_completed_mvccid = highest_completed_mvccid;
    tdes.mvccinfo.snapshot.valid = true;   /* <- LAST; publishes the snapshot */

compute_highest_completed_mvccid (Ch. 4) returns m_bit_area_start_mvccid - 1
if empty; MVCCID_FORWARD advances it into the exclusive upper bound. Both
lowest fields take crt_status_lowest_active; valid = true is set last so a
peer never observes a half-filled snapshot. Perf accounting then adds
snapshot_retry_count - 1 to PSTAT_LOG_SNAPSHOT_RETRY_COUNTERS (the “minus
one” drops the mandatory first pass, making the metric contention, not work)
and elapsed time to PSTAT_LOG_SNAPSHOT_TIME_COUNTERS.
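The upper-bound arithmetic is small enough to check by hand. The sketch below restates MVCCID_FORWARD as a function (mvccid_forward is a hypothetical name) so the empty-window case can be worked through: with the window starting at 4 and nothing completed, the highest completed id is 3, and forwarding it yields 4, so every real id is at or above the bound.

```cpp
#include <cassert>
#include <cstdint>

using MVCCID = std::uint64_t;
constexpr MVCCID MVCCID_FIRST = 4;

// Function form of the MVCCID_FORWARD macro: step, then skip the sentinel band.
inline MVCCID mvccid_forward (MVCCID id)
{
  ++id;
  if (id < MVCCID_FIRST)
    {
      id = MVCCID_FIRST;
    }
  return id;
}

// Exclusive upper bound stored in the snapshot: ids >= the result are
// treated as not completed at snapshot time.
inline MVCCID toy_exclusive_upper_bound (MVCCID highest_completed)
{
  return mvccid_forward (highest_completed);
}
```

For a non-empty window whose highest completed id is 41, the bound is 42, which is why the visibility test in Ch. 6 uses >= rather than >.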
5.4 mvcc_active_tran::copy_to internals
The copy is a sized memcpy with a shrink-clear and a long-transaction tail,
gated by the safety flag.
    // mvcc_active_tran::copy_to -- src/transaction/mvcc_active_tran.cpp
    assert (m_initialized && dest.m_initialized);
    if (safety == copy_safety::THREAD_SAFE)
      {
        check_valid ();   /* <- source validated only when safe */
        dest.check_valid ();
      }
    size_t new_bit_area_memsize = get_bit_area_memsize ();        /* <- source live bytes */
    size_t old_bit_area_memsize = dest.get_bit_area_memsize ();   /* <- dest's previous live bytes */
    char *dest_bit_area = (char *) dest.m_bit_area;
    if (new_bit_area_memsize > 0)
      {
        std::memcpy (dest_bit_area, m_bit_area, new_bit_area_memsize);
      }
    if (old_bit_area_memsize > new_bit_area_memsize)
      {
        /* <- dest was longer last time; zero the now-unused tail */
        std::memset (dest_bit_area + new_bit_area_memsize, 0, old_bit_area_memsize - new_bit_area_memsize);
      }
    if (m_long_tran_mvccids_length > 0)
      {
        std::memcpy (dest.m_long_tran_mvccids, m_long_tran_mvccids, get_long_tran_memsize ());
      }
    dest.m_bit_area_start_mvccid = m_bit_area_start_mvccid;
    dest.m_bit_area_length = m_bit_area_length;
    dest.m_long_tran_mvccids_length = m_long_tran_mvccids_length;
    if (safety == copy_safety::THREAD_SAFE)
      {
        dest.check_valid ();
      }

The five branches: the two THREAD_SAFE checks bracket the copy with check_valid calls
(clone wrappers, §5.5, over quiescent sources), while THREAD_UNSAFE skips them
(build_mvcc_info over a live slot, relying on the re-check); the
new_bit_area_memsize > 0 guard skips the memcpy when the source area is empty (a fresh system);
old_bit_area_memsize > new_bit_area_memsize zeroes the leftover tail when the
destination held a longer area; m_long_tran_mvccids_length > 0 copies the
overflow array (Ch. 4). The scalar assignments then mirror the metadata.
Invariant — the tail is always all-active (zero). check_valid asserts in
debug builds that every bit past m_bit_area_length is ALL_ACTIVE (0), and
that long-tran MVCCIDs are strictly ordered and precede
m_bit_area_start_mvccid. The shrink-clear above and reset_active_transactions
keep this true. A stale committed tail bit would make
compute_highest_completed_mvccid report an MVCCID outside the active set,
corrupting the upper bound.
reset_active_transactions is the blunt reset used on a torn copy:
    // mvcc_active_tran::reset_active_transactions -- src/transaction/mvcc_active_tran.cpp
    std::memset (m_bit_area, 0, BITAREA_MAX_MEMSIZE);   /* <- zero the WHOLE max buffer, not just live part */
    m_bit_area_length = 0;
    m_long_tran_mvccids_length = 0;

It zeroes the entire BITAREA_MAX_MEMSIZE buffer and drops both lengths to zero, leaves
m_bit_area_start_mvccid untouched, and does not call check_valid (the
caller is mid-retry, not yet consistent).
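The shrink-clear and the blunt reset can be modeled on a small fixed buffer. This is a sketch under assumptions: toy_bit_area and TOY_MAX_UNITS are hypothetical, the long-transaction array and start id are omitted, and tail_is_zero is a stand-in for the part of check_valid that inspects whole units past the live prefix.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

constexpr std::size_t TOY_MAX_UNITS = 8;   // hypothetical buffer capacity

struct toy_bit_area
{
  std::vector<std::uint64_t> units = std::vector<std::uint64_t> (TOY_MAX_UNITS, 0);
  std::size_t length_bits = 0;

  // Bytes covered by the live prefix, rounded up to whole units.
  std::size_t memsize () const
  {
    return ((length_bits + 63) / 64) * sizeof (std::uint64_t);
  }

  void copy_to (toy_bit_area &dest) const
  {
    const std::size_t new_size = memsize ();
    const std::size_t old_size = dest.memsize ();
    char *dest_bytes = reinterpret_cast<char *> (dest.units.data ());
    if (new_size > 0)
      {
        std::memcpy (dest_bytes, units.data (), new_size);
      }
    if (old_size > new_size)
      {
        // Destination was longer before: zero the now-unused tail.
        std::memset (dest_bytes + new_size, 0, old_size - new_size);
      }
    dest.length_bits = length_bits;
  }

  // Blunt reset: zero the whole buffer, not only the live part.
  void reset ()
  {
    std::memset (units.data (), 0, TOY_MAX_UNITS * sizeof (std::uint64_t));
    length_bits = 0;
  }

  // True when every whole unit past the live prefix is zero (all-active).
  bool tail_is_zero () const
  {
    for (std::size_t i = (length_bits + 63) / 64; i < TOY_MAX_UNITS; ++i)
      {
        if (units[i] != 0)
          {
            return false;
          }
      }
    return true;
  }
};
```

Copying a one-unit source over a destination that previously held three units leaves units 1 and 2 zeroed, which is the property the tail invariant depends on.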
5.5 The other two copy_to wrappers
mvcc_snapshot::copy_to and mvcc_info::copy_to are not on the hot path;
they clone an already-built snapshot (e.g. parent → sub-transaction, Ch. 10),
using THREAD_SAFE because the source is a finished, non-mutating local.
mvcc_snapshot::copy_to calls dest.m_active_mvccs.initialize (), then
copy_to (..., THREAD_SAFE), then mirrors lowest_active_mvccid,
highest_completed_mvccid, snapshot_fnc, and valid — every field, so the
clone is usable without a rebuild. mvcc_info::copy_to calls
this->snapshot.copy_to (dest.snapshot) then layers the envelope fields:
    // mvcc_info::copy_to -- src/transaction/mvcc.c
    dest.id = this->id;
    dest.recent_snapshot_lowest_active_mvccid = this->recent_snapshot_lowest_active_mvccid;
    dest.sub_ids = this->sub_ids;   /* <- std::vector deep copy */
    dest.is_sub_active = this->is_sub_active;

5.6 Isolation level decides when, not how

build_mvcc_info is isolation-agnostic. The acquisition timing lives at the
call sites and at logtb_invalidate_snapshot_data, not here.
| Isolation | Snapshot taken | Mechanism |
|---|---|---|
| TRAN_READ_COMMITTED | Once per statement | logtb_invalidate_snapshot_data sets valid = false at each statement boundary; next logtb_get_mvcc_snapshot rebuilds |
| TRAN_REPEATABLE_READ | Once per transaction | First logtb_get_mvcc_snapshot builds; valid stays true (invalidate is a no-op) |
| TRAN_SERIALIZABLE | Once per transaction | Same as RR for snapshot acquisition |
The guard is logtb_invalidate_snapshot_data:
    // logtb_invalidate_snapshot_data -- src/transaction/log_tran_table.c
    if (tdes == NULL || tdes->isolation >= TRAN_REPEATABLE_READ)
      {
        return NO_ERROR;   /* <- RR/SR keep their snapshot across statements */
      }
    if (tdes->mvccinfo.snapshot.valid)
      {
        tdes->mvccinfo.snapshot.valid = false;   /* <- RC: drop it so next read rebuilds */
        logtb_tran_reset_count_optim_state (thread_p);
      }

RC sees transactions that committed between its statements; RR/SR are pinned to
the first statement’s photograph. The >= TRAN_REPEATABLE_READ test relies on
the enum ordering RC < RR < SR. None of this touches build_mvcc_info.
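The interplay between the entry guard and the invalidation hook can be counted in a toy model. Everything here is hypothetical (toy_isolation, toy_tdes, and the helper names); only the relative ordering of the enum values matters, and each loop iteration stands for one statement that reads once.

```cpp
#include <cassert>

// Hypothetical isolation enum; only the ordering RC < RR < SR is relied on.
enum toy_isolation { TOY_READ_COMMITTED = 1, TOY_REPEATABLE_READ = 2, TOY_SERIALIZABLE = 3 };

struct toy_tdes
{
  toy_isolation isolation;
  bool snapshot_valid;
  int builds;   // how many times the builder ran
};

// Entry guard: build only while the snapshot is invalid.
inline void toy_get_snapshot (toy_tdes &t)
{
  if (!t.snapshot_valid)
    {
      ++t.builds;
      t.snapshot_valid = true;
    }
}

// Statement boundary: RR/SR keep the snapshot, RC drops it.
inline void toy_invalidate (toy_tdes &t)
{
  if (t.isolation >= TOY_REPEATABLE_READ)
    {
      return;
    }
  t.snapshot_valid = false;
}

// Run a number of statements and report how many builds happened.
inline int toy_builds_over_statements (toy_isolation isolation, int statements)
{
  toy_tdes t{isolation, false, 0};
  for (int i = 0; i < statements; ++i)
    {
      toy_get_snapshot (t);
      toy_invalidate (t);
    }
  return t.builds;
}
```

Three statements cost three builds under the read-committed setting and one build under the other two, with no isolation check inside the builder itself.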
5.7 Chapter summary — key takeaways
- The entry guard gates everything. logtb_get_mvcc_snapshot returns NULL for system transactions, serializes parallel-px workers on m_px_lock_mutex, and builds only when valid == false.
- Construction is lock-free via version re-check. build_mvcc_info reads m_version before and after a THREAD_UNSAFE copy_to; equal → coherent (break), unequal → torn (reset_active_transactions and retry).
- The lowest-visible dance prevents a backward VACUUM watermark. Write the MVCCID_ALL_VISIBLE sentinel first, read the global lowest, store it back with no intervening work, so VACUUM (Ch. 9) waits rather than undercut it.
- THREAD_UNSAFE vs THREAD_SAFE is about the source. The build path copies a live mutating slot (skips check_valid, relies on the re-check); clone paths copy quiescent locals (validate).
- The bit-area tail must stay zero. copy_to’s shrink-clear, reset_active_transactions’s full-buffer memset, and check_valid’s asserts ensure no stale committed bit survives past m_bit_area_length.
- Scalar fields fill in a fixed order, valid last. highest_completed = MVCCID_FORWARD(...) is the exclusive upper bound; both lowest fields take the global lowest; snapshot_fnc = mvcc_satisfies_snapshot.
- Isolation timing is external. RC rebuilds per statement (via logtb_invalidate_snapshot_data), RR/SR build once; the builder is isolation-agnostic.
title: “CUBRID MVCC Detail — Chapter 6: Visibility Evaluation” category: code-analysis project: cubrid module: mvcc sources: [raw/code-analysis/cubrid/storage/mvcc/] references: [src/transaction/mvcc.c, src/transaction/mvcc.h] summary: “Branch-complete dissection of mvcc_satisfies_snapshot — the verdict for every insert/delete state, the insert/delete asymmetry, and the perfmon classification leaves.” created: 2026-06-07 updated: 2026-06-07 tags: [mvcc, visibility, snapshot, perfmon, detail]
Chapter 6: Visibility Evaluation
The MVCC apparatus of Chapters 2–5 exists to answer one yes/no question, asked millions of times per second: given the snapshot this transaction reads under and the MVCC header on a record version, is this the version I should see? This chapter dissects mvcc_satisfies_snapshot branch by branch: every conditional, return, and perfmon leaf. For the conceptual framing (snapshot as a half-open MVCCID interval, what “active” means, the committed-before / active / committed-after model) see the companion cubrid-mvcc.md, §“Snapshot semantics” and §“Visibility”; this chapter does not re-derive it.
6.1 The three inputs and the three outputs
mvcc_satisfies_snapshot is a pure decision function over the thread (who am I), a record header, and a snapshot. It does not modify either input (perfmon counters aside), unlike mvcc_satisfies_dirty (Ch. 7), which mutates the snapshot. The verdict is one of three enum values:

    // mvcc_satisfies_snapshot_result -- src/transaction/mvcc.h
    enum mvcc_satisfies_snapshot_result
    {
      TOO_OLD_FOR_SNAPSHOT,   /* not visible, deleted by me or deleted by inactive transaction */
      SNAPSHOT_SATISFIED,     /* is visible and valid */
      TOO_NEW_FOR_SNAPSHOT    /* not visible, inserter is still active.
                               * ... check previous versions in log ... */
    };

| Verdict | Meaning | What the caller does next |
|---|---|---|
| SNAPSHOT_SATISFIED | This exact version is the one the reader sees. | Return the record. |
| TOO_NEW_FOR_SNAPSHOT | Born too late; an older version may be visible. | Walk prev_version_lsa back one link and re-evaluate (Ch. 3). |
| TOO_OLD_FOR_SNAPSHOT | Already dead from the reader’s view; no older version can save it. | Stop: the reader sees nothing on this chain head. |
The directionality invariant. TOO_NEW points backward along the version chain; TOO_OLD is terminal. A delete committed within the snapshot can never be undone by an older version — that version is the same logical row that was deleted. Hence only TOO_NEW’s enum comment mentions “check previous versions in log.” §6.7 gives the worked example.
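The caller-side consequence of the directionality invariant can be sketched as a loop over a version chain. This is a hypothetical model, not the heap code: the chain is given as a vector of precomputed verdicts (index 0 is the chain head, higher indexes are older versions), and find_visible is an invented name.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

enum class verdict { too_old, satisfied, too_new };

// Returns the index of the visible version, or -1 when the reader sees nothing.
inline int find_visible (const std::vector<verdict> &chain_verdicts)
{
  for (std::size_t i = 0; i < chain_verdicts.size (); ++i)
    {
      switch (chain_verdicts[i])
        {
        case verdict::satisfied:
          return static_cast<int> (i);   // this exact version is visible
        case verdict::too_old:
          return -1;                     // terminal: older versions cannot help
        case verdict::too_new:
          break;                         // follow the link to the previous version
        }
    }
  return -1;   // ran off the end of the chain while still too new
}
```

A chain of {too_new, too_new, satisfied} resolves to index 2, while {too_old, satisfied} stops at the head and returns -1 without ever looking at the older version.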
6.2 The structs in play
mvcc_satisfies_snapshot reads two structs: mvcc_rec_header (per-version stamps) and mvcc_snapshot (the visibility frontier). Both are defined in full in Chapter 1; the field tables below cover every member.
mvcc_rec_header (src/transaction/mvcc.h) — fields mvcc_flag:8, repid:24, chn, mvcc_ins_id, mvcc_del_id, prev_version_lsa:
| Field | Role here | Why it exists |
|---|---|---|
| mvcc_flag | Read via MVCC_IS_HEADER_DELID_VALID and MVCC_IS_FLAG_SET(.., VALID_INSID) to learn which stamps are present. | A version may lack an insert stamp (vacuum stripped it) or delete stamp (never deleted). |
| repid / chn | Not read. | Representation id / cache-coherency number (record format and client cache). |
| mvcc_ins_id | Inserter’s MVCCID; compared vs snapshot and “me.” | The transaction whose commit makes this version appear. |
| mvcc_del_id | Deleter’s MVCCID; compared vs snapshot and “me.” | The transaction whose commit makes this version disappear. |
| prev_version_lsa | Not read inside; the link the caller follows on TOO_NEW. | Threads the version chain backward (Ch. 3). |
The flag bits live in object_representation_constants.h: OR_MVCC_FLAG_VALID_INSID = 0x01, OR_MVCC_FLAG_VALID_DELID = 0x02. MVCC_IS_HEADER_DELID_VALID(h) is MVCC_IS_FLAG_SET (h, OR_MVCC_FLAG_VALID_DELID) && MVCCID_IS_VALID (MVCC_GET_DELID (h)) — the flag set and the id not MVCCID_NULL.
mvcc_snapshot (src/transaction/mvcc.h) — fields lowest_active_mvccid, highest_completed_mvccid, m_active_mvccs, snapshot_fnc, valid, plus member functions (ctor, reset, deleted operator=, copy_to):
| Field | Role here | Why it exists |
|---|---|---|
| lowest_active_mvccid | mvcc_is_id_in_snapshot: id strictly below is committed-before (visible). | Lower bound of the in-doubt band; fast reject. |
| highest_completed_mvccid | mvcc_is_id_in_snapshot: id at-or-above is active at snapshot time (in-snapshot). | Upper bound; fast accept. |
| m_active_mvccs | Bit-area / cached-scalar probe for ids inside the band (Ch. 4). | Exact membership test for concurrent ids. |
| snapshot_fnc | Not read; it is the pointer that selected this function. | Plugs in snapshot / dirty / delete polymorphically. |
| valid | Not read (caller guarantees a built snapshot). | Marks a snapshot as constructed (Ch. 5). |
| member functions | Not invoked here. | Construction / copy helpers (Ch. 5). |
flowchart TD
H["mvcc_rec_header<br/>mvcc_flag, mvcc_ins_id, mvcc_del_id"]
S["mvcc_snapshot<br/>lowest_active_mvccid, highest_completed_mvccid, m_active_mvccs"]
F["mvcc_satisfies_snapshot"]
H -->|"DELID_VALID? INSID flag? ins_id / del_id"| F
S -->|"is_id_in_snapshot(ins_id / del_id)"| F
TH["thread_p<br/>logtb_is_current_mvccid -> me?"]
TH -->|"INSERTED_BY_ME / DELETED_BY_ME"| F
F --> V{"verdict"}
V --> R1["SNAPSHOT_SATISFIED"]
V --> R2["TOO_NEW_FOR_SNAPSHOT"]
V --> R3["TOO_OLD_FOR_SNAPSHOT"]
Figure 6-1 — Inputs to the verdict. Header supplies stamps and flags; snapshot supplies the frontier; thread supplies identity.
6.3 The top split and the frontier helper
The first branch is the only structural fork in the function:

    // mvcc_satisfies_snapshot -- src/transaction/mvcc.c
    assert (rec_header != NULL && snapshot != NULL);

    if (!MVCC_IS_HEADER_DELID_VALID (rec_header))
      {
        /* The record is not deleted */
        // ... insert-side ladder (§6.4) ...
      }
    else
      {
        /* The record is deleted */
        // ... delete-side ladder (§6.5) ...
      }

A version is “not deleted” when VALID_DELID is clear or mvcc_del_id == MVCCID_NULL; either condition makes MVCC_IS_HEADER_DELID_VALID false. The not-deleted side asks only “did the inserter become visible to me?”; the deleted side reasons about inserter and deleter, so its ladder is longer. Both lean on mvcc_is_id_in_snapshot, the band test for one MVCCID:

    // mvcc_is_id_in_snapshot -- src/transaction/mvcc.c (body, condensed)
    if (MVCC_ID_PRECEDES (mvcc_id, snapshot->lowest_active_mvccid))
      return false;   /* below band -> committed-before, NOT in-snapshot */
    if (MVCC_ID_FOLLOW_OR_EQUAL (mvcc_id, snapshot->highest_completed_mvccid))
      return true;    /* at/above band -> not completed, IS in-snapshot */
    return snapshot->m_active_mvccs.is_active (mvcc_id);   /* inside band -> exact probe (Ch. 4) */

“In snapshot” means this MVCCID was still active, or had not yet started, when the snapshot was taken, so its effects must be invisible. The macros MVCC_IS_REC_INSERTER_IN_SNAPSHOT / MVCC_IS_REC_DELETER_IN_SNAPSHOT are thin wrappers feeding mvcc_ins_id / mvcc_del_id into this helper.
6.4 The not-deleted ladder — four cases
    // mvcc_satisfies_snapshot -- src/transaction/mvcc.c (not-deleted branch, condensed)
    if (!MVCC_IS_HEADER_DELID_VALID (rec_header))
      {
        if (!MVCC_IS_FLAG_SET (rec_header, OR_MVCC_FLAG_VALID_INSID))
          { /* ... perfmon ... */ return SNAPSHOT_SATISFIED; }     /* (a) no insert stamp -> all-visible */
        else if (MVCC_IS_REC_INSERTED_BY_ME (thread_p, rec_header))
          { /* ... perfmon ... */ return SNAPSHOT_SATISFIED; }     /* (b) I inserted it */
        else if (MVCC_IS_REC_INSERTER_IN_SNAPSHOT (thread_p, rec_header, snapshot))
          { /* ... perfmon ... */ return TOO_NEW_FOR_SNAPSHOT; }   /* (c) inserter still in-snapshot */
        else
          { /* ... perfmon ... */ return SNAPSHOT_SATISFIED; }     /* (d) inserter committed before snapshot */
      }

Figure 6-2 encodes the four short-circuit cases: (a) a stripped insert stamp means vacuum already declared the version all-visible; (b) MVCC_IS_REC_INSERTED_BY_ME reaches logtb_is_current_mvccid, which also matches the transaction’s own sub-transaction ids (Ch. 10); (c) the lone TOO_NEW path; (d) the by-elimination residue, the only leaf with sub-classification (§6.8).
Insert-side invariant. On the not-deleted branch the verdict is TOO_NEW iff the inserter is concurrent (case c); otherwise SNAPSHOT_SATISFIED. Enforced by ordering the unconditional-visible tests (a, b) before the in-snapshot test (c) — identity before frontier. If (c) fired for an inserter that had committed before the snapshot, the reader would chase a previous version and surface a stale row; correctness rests on mvcc_is_id_in_snapshot being exact inside the band (Ch. 4).
```mermaid
flowchart TD
    A{"VALID_INSID flag set?"}
    A -->|"no"| AV["SNAPSHOT_SATISFIED<br/>INSERTED_VACUUMED / VISIBLE"]
    A -->|"yes"| B{"INSERTED_BY_ME?"}
    B -->|"yes"| BV["SNAPSHOT_SATISFIED<br/>INSERTED_CURR_TRAN / VISIBLE"]
    B -->|"no"| C{"inserter IN_SNAPSHOT?"}
    C -->|"yes"| CV["TOO_NEW_FOR_SNAPSHOT<br/>INSERTED_OTHER_TRAN / INVISIBLE"]
    C -->|"no"| DV["SNAPSHOT_SATISFIED<br/>INSERTED_COMMITED[_LOST] / VISIBLE"]
```
Figure 6-2 — Not-deleted ladder. Only the in-snapshot inserter yields TOO_NEW.
6.5 The deleted ladder — four cases
When MVCC_IS_HEADER_DELID_VALID is true the version carries a committed-or-pending delete stamp, so both inserter and deleter matter.
```c
// mvcc_satisfies_snapshot -- src/transaction/mvcc.c (deleted branch, condensed)
  else
    {
      if (MVCC_IS_REC_DELETED_BY_ME (thread_p, rec_header))
        { /* ... perfmon ... */ return TOO_OLD_FOR_SNAPSHOT; }    /* (e) I deleted it */
      else if (MVCC_IS_REC_INSERTER_IN_SNAPSHOT (thread_p, rec_header, snapshot))
        {
          /* !!TODO: Is this check necessary? It seems that if inserter is active, then so will be the deleter (actually
           * they will be the same). It only adds an extra-check in a function frequently called. */
          /* ... perfmon ... */
          return TOO_NEW_FOR_SNAPSHOT;    /* (f) inserter in-snapshot */
        }
      else if (MVCC_IS_REC_DELETER_IN_SNAPSHOT (thread_p, rec_header, snapshot))
        { /* ... perfmon ... */ return SNAPSHOT_SATISFIED; }      /* (g) deleter in-snapshot -> delete not yet visible */
      else
        { /* ... perfmon ... */ return TOO_OLD_FOR_SNAPSHOT; }    /* (h) deleter committed before snapshot */
    }
```

Figure 6-3 encodes the cases: (e) is terminal, since chasing back would resurrect my own delete; (f) a concurrent transaction inserted then deleted before committing, so the version never existed for the reader (TOO_NEW); (g) is the lone visible deleted leaf: the inserter committed before the snapshot but the deleter is still concurrent, so the row shows as if undeleted; (h) is the fall-through, with both stamps committed before the snapshot (TOO_OLD, terminal). The !!TODO on (f) is discussed in §6.10.
Delete-side invariant. On the deleted branch, SNAPSHOT_SATISFIED only in (g) (deleter concurrent); TOO_NEW only in (f) (inserter concurrent); every other case is TOO_OLD. Ordering: (e) before (f) decides a self-delete by identity, not frontier; (f) before (g) because an in-snapshot inserter is a strictly stronger reason to look backward. Testing (g) first would let an insert-and-delete by one concurrent transaction wrongly return SNAPSHOT_SATISFIED, surfacing a row that never committed.
```mermaid
flowchart TD
    E{"DELETED_BY_ME?"}
    E -->|"yes"| EV["TOO_OLD_FOR_SNAPSHOT<br/>DELETED_CURR_TRAN / INVISIBLE"]
    E -->|"no"| F{"inserter IN_SNAPSHOT?"}
    F -->|"yes"| FV["TOO_NEW_FOR_SNAPSHOT<br/>INSERTED_DELETED / INVISIBLE"]
    F -->|"no"| G{"deleter IN_SNAPSHOT?"}
    G -->|"yes"| GV["SNAPSHOT_SATISFIED<br/>DELETED_OTHER_TRAN / VISIBLE"]
    G -->|"no"| HV["TOO_OLD_FOR_SNAPSHOT<br/>DELETED_COMMITTED[_LOST] / INVISIBLE"]
```
Figure 6-3 — Deleted ladder. Visible only when the deleter is still concurrent.
6.6 How TOO_NEW drives the version-chain walk
mvcc_satisfies_snapshot never touches prev_version_lsa; it only emits TOO_NEW_FOR_SNAPSHOT, telling the caller to walk it (the link is set at insert/update time, Ch. 3). The scan/fetch layer (heap_file.c) interprets the verdict: SNAPSHOT_SATISFIED returns the version; TOO_OLD means the chain head is dead; TOO_NEW dereferences prev_version_lsa, loads the prior header, and re-calls. The walk terminates because each link’s inserter MVCCID is strictly older than the prior link’s, so eventually an in-snapshot test fails (case d) or the chain ends at a null LSA. Only cases (c) and (f) produce TOO_NEW.
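The caller-side loop can be pictured as follows. This is a sketch under stated assumptions, not the code in heap_file.c: `ToyVersion`, `toy_walk_chain`, and the integer `prev` index (standing in for `prev_version_lsa`, with -1 as the null link) are all hypothetical, and the predicate is passed in as a callback.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <vector>

enum class Verdict { TooOld, Satisfied, TooNew };

// Hypothetical version record; prev is an index into the chain, -1 = null link.
struct ToyVersion
{
  std::uint64_t ins_id;
  int prev;
};

// Returns the index of the version the reader may see, or -1 if none.
int toy_walk_chain (const std::vector<ToyVersion> &chain, int head,
                    const std::function<Verdict (const ToyVersion &)> &satisfies)
{
  int cur = head;
  while (cur != -1)
    {
      assert (cur >= 0 && cur < static_cast<int> (chain.size ()));
      switch (satisfies (chain[cur]))
        {
        case Verdict::Satisfied:
          return cur;               // visible: hand this version back
        case Verdict::TooOld:
          return -1;                // dead: stop, nothing older can help
        case Verdict::TooNew:
          cur = chain[cur].prev;    // look one version further back
          break;
        }
    }
  return -1;                        // chain ended at a null link
}
```

For a chain whose inserters are 50, 100, 120 (oldest first) and a predicate that treats ids at or above 100 as too new, a walk starting at the newest link stops at the version inserted by 50.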
6.7 The insert/delete asymmetry — a worked example
Reader R holds a snapshot with frontier [lowest=100, highest=100): transaction 100 is the only active id R knows of, and the snapshot was taken just before 100 did anything. Four scenarios against the §6.4–§6.5 ladders:
| Scenario | Header | Helper results | Case | Verdict |
|---|---|---|---|---|
| Committed T50 insert, never deleted | ins=50, no DELID | inserter not in-snapshot (50 < 100) | (d) | SNAPSHOT_SATISFIED |
| Concurrent T100 insert, uncommitted | ins=100, no DELID | inserter in-snapshot (100 >= 100) | (c) | TOO_NEW -> walk prev |
| T50 insert, concurrent T100 delete | ins=50, del=100 | inserter not in-snapshot; deleter in-snapshot | (g) | SNAPSHOT_SATISFIED |
| T50 insert, committed T60 delete | ins=50, del=60 | neither in-snapshot | (h) | TOO_OLD |
Rows 2 and 3 are the asymmetry: a too-new insert sends the reader backward for an older version; a too-new delete keeps the row visible (a delete uncommitted from R’s view has not happened). An insert MVCCID gates a version’s appearance; a delete MVCCID gates its disappearance.
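The four rows can be replayed against a toy model of both ladders. This is a sketch, not the kernel predicate: `ToyHeader`, `ToyReader`, and `toy_satisfies_snapshot` are hypothetical, the by-me test ignores sub-transactions, and the snapshot band is assumed empty (lowest equals highest, as in the example) so the exact probe is never needed.

```cpp
#include <cassert>
#include <cstdint>

using MVCCID = std::uint64_t;
enum class Verdict { TooOld, Satisfied, TooNew };

struct ToyHeader { bool has_insid; MVCCID ins_id; bool has_delid; MVCCID del_id; };

struct ToyReader
{
  MVCCID my_id;               // 0 = the reader holds no MVCCID
  MVCCID lowest_active;
  MVCCID highest_completed;
};

// With an empty band, "in snapshot" reduces to the upper-bound test.
static bool toy_in_snapshot (MVCCID id, const ToyReader &r)
{
  return id >= r.highest_completed;
}

Verdict toy_satisfies_snapshot (const ToyHeader &h, const ToyReader &r)
{
  assert (r.lowest_active == r.highest_completed);   // toy: empty band only
  if (!h.has_delid)
    {
      if (!h.has_insid) return Verdict::Satisfied;                  // (a)
      if (h.ins_id == r.my_id) return Verdict::Satisfied;           // (b)
      if (toy_in_snapshot (h.ins_id, r)) return Verdict::TooNew;    // (c)
      return Verdict::Satisfied;                                    // (d)
    }
  if (h.del_id == r.my_id) return Verdict::TooOld;                  // (e)
  if (h.has_insid && toy_in_snapshot (h.ins_id, r))
    return Verdict::TooNew;                                         // (f)
  if (toy_in_snapshot (h.del_id, r)) return Verdict::Satisfied;     // (g)
  return Verdict::TooOld;                                           // (h)
}
```

Feeding the table's headers to a reader with both bounds at 100 reproduces the verdict column: satisfied, too new, satisfied, too old.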
6.8 Performance-monitoring leaves
Every leaf, when tracking is on, calls perfmon_mvcc_snapshot(thread_p, snapshot_type, rec_type, visibility): snapshot_type is always PERF_SNAPSHOT_SATISFIES_SNAPSHOT, rec_type classifies why, and visibility is the outcome. The call is guarded by perfmon_is_perf_tracking_and_active (PERFMON_ACTIVATION_FLAG_MVCC_SNAPSHOT), so the hot path pays only the guard test when tracking is off. The rec_type bucket per leaf is PERF_SNAPSHOT_RECORD_<suffix>, where Figures 6-2/6-3 already give the eight suffixes and visibility bits: (a) INSERTED_VACUUMED, (b) INSERTED_CURR_TRAN, (c) INSERTED_OTHER_TRAN, (d) INSERTED_COMMITED, (e) DELETED_CURR_TRAN, (f) INSERTED_DELETED, (g) DELETED_OTHER_TRAN, (h) DELETED_COMMITTED. Two leaves split a step further: (d) and (h) each have a _LOST variant, INSERTED_COMMITED_LOST (d’) and DELETED_COMMITTED_LOST (h’).
The _LOST variants are the interesting ones. In the committed-inserter leaf (d) the code asks an extra question before choosing between (d) and (d’):
```c
// mvcc_satisfies_snapshot -- src/transaction/mvcc.c (case d, perfmon detail)
  if (rec_header->mvcc_ins_id != MVCCID_ALL_VISIBLE && vacuum_is_mvccid_vacuumed (rec_header->mvcc_ins_id))
    {
      perfmon_mvcc_snapshot (thread_p, PERF_SNAPSHOT_SATISFIES_SNAPSHOT,
                             PERF_SNAPSHOT_RECORD_INSERTED_COMMITED_LOST, PERF_SNAPSHOT_VISIBLE);
    }
  else
    {
      perfmon_mvcc_snapshot (thread_p, PERF_SNAPSHOT_SATISFIES_SNAPSHOT,
                             PERF_SNAPSHOT_RECORD_INSERTED_COMMITED, PERF_SNAPSHOT_VISIBLE);
    }
```

vacuum_is_mvccid_vacuumed (in vacuum.h) returns true when the MVCCID is older than vacuum’s oldest-visible watermark (Ch. 9): vacuum was entitled to strip this stamp but the version still carries it. So _LOST counts versions still wearing a stamp vacuum should have removed, which makes it a vacuum-lag measure. The != MVCCID_ALL_VISIBLE guard skips the all-visible sentinel (never a real id). The delete-side _LOST leaf (h’) is the symmetric probe on mvcc_del_id, without the guard since a delete stamp is never the sentinel.
6.9 The visibility invariant, stated precisely
Visibility invariant. For a built (non-dirty) snapshot S and header H, mvcc_satisfies_snapshot returns SNAPSHOT_SATISFIED iff H’s inserter is visible to S (committed-before, or me, or vacuum-stripped) AND H’s deleter is not-yet-visible (no delete stamp, or deleter concurrent); TOO_NEW_FOR_SNAPSHOT exactly when the inserter is concurrent (regardless of delete state); TOO_OLD_FOR_SNAPSHOT exactly when the inserter is visible but the deleter is also visible-or-me. The per-side ordering arguments in §6.4 and §6.5 enforce it; any reordering produces phantom rows (visible too-new versions) or vanished rows (invisible committed-before versions).
6.10 Cross-check notes
- The `!!TODO` on case (f). The comment doubts the inserter-in-snapshot test is reachable independently of the deleter test. Sound for the common case (one transaction inserts and deletes), but not obviously safe to remove: a row inserted by concurrent T_a then deleted by concurrent T_b (a != b, both in-snapshot) would, without (f), fall through to (g) and return `SNAPSHOT_SATISFIED`, exposing a row whose insert never committed. Whether such a header can arise depends on locking rules outside `mvcc.c`. Treat the branch as load-bearing.
- No side effects. `mvcc_satisfies_snapshot` writes nothing to the snapshot, whereas `mvcc_satisfies_dirty` (same file) mutates `snapshot->lowest_active_mvccid` / `highest_completed_mvccid`; `snapshot_fnc` selects which runs. Ch. 7 covers the dirty/delete/vacuum siblings.
6.11 Chapter summary — key takeaways
- `mvcc_satisfies_snapshot` is pure and side-effect-free, with one top split on `MVCC_IS_HEADER_DELID_VALID` yielding `SNAPSHOT_SATISFIED`, `TOO_NEW_FOR_SNAPSHOT`, or `TOO_OLD_FOR_SNAPSHOT`.
- Not-deleted ladder, four ordered cases: no insert stamp (visible), inserted-by-me (visible), inserter-in-snapshot (`TOO_NEW`), fall-through committed-before (visible). Only the in-snapshot inserter looks backward.
- Deleted ladder, four ordered cases: deleted-by-me (`TOO_OLD`), inserter-in-snapshot (`TOO_NEW`), deleter-in-snapshot (visible), fall-through (`TOO_OLD`). Visible only when the deleter is still concurrent.
- The asymmetry: a too-new insert hides this version (look older); a too-new delete keeps it visible. `TOO_NEW` walks `prev_version_lsa` backward; `TOO_OLD` is terminal.
- Every leaf reports a `PERF_SNAPSHOT_*` bucket; the `_LOST` buckets fire when `vacuum_is_mvccid_vacuumed` says a stamp should already be gone. This is a vacuum-lag measure, guarded against `MVCCID_ALL_VISIBLE` on the insert side.
- Visibility invariant: visible iff inserter visible AND deleter not-yet-visible; `TOO_NEW` iff inserter concurrent; `TOO_OLD` iff inserter visible but deleter visible-or-me. Case ordering (identity before frontier) enforces it.
- The `!!TODO` on the deleted-branch inserter test is an open redundancy question, not dead code: removing it risks exposing rows whose insert never committed when insert and delete come from two distinct concurrent transactions.
Chapter 7: Sibling Predicates for Delete Dirty and Vacuum
mvcc_satisfies_snapshot (Chapter 6) answers reads. But the same MVCC_REC_HEADER, the four-field on-record stamp from Chapter 3, is interrogated by three other callers that need a different verdict from the same bytes: a writer about to delete/update needs liveness (any in-progress deleter to block behind?); a dirty read (mvcc_satisfies_dirty) needs in-progress writers’ effects and reports which MVCCID it treated as active; vacuum needs to know whether a dead version is old enough that no running transaction can still need it.
Each tilts one comparison differently — frozen snapshot membership vs. raw liveness vs. a static watermark, the axis carried by the 7.7 contrast table. The header-decode idioms (MVCC_IS_HEADER_DELID_VALID, MVCC_IS_FLAG_SET) are assumed from Chapters 3 and 6.
7.1 The two comparison primitives: ACTIVE vs. IN-SNAPSHOT
Everything hinges on two helper-macro families. The _ACTIVE family wraps the live probe; the _IN_SNAPSHOT family wraps the frozen one (the DELETER variants pass mvcc_del_id instead of mvcc_ins_id):
```c
// MVCC_IS_REC_{INSERTER,DELETER}_{ACTIVE,IN_SNAPSHOT} -- src/transaction/mvcc.c
#define MVCC_IS_REC_INSERTER_ACTIVE(thread_p, rec_header_p) \
  (mvcc_is_active_id (thread_p, (rec_header_p)->mvcc_ins_id))
#define MVCC_IS_REC_INSERTER_IN_SNAPSHOT(thread_p, rec_header_p, snapshot) \
  (mvcc_is_id_in_snapshot (thread_p, (rec_header_p)->mvcc_ins_id, (snapshot)))
```

mvcc_is_id_in_snapshot is the frozen test. It compares against the captured snapshot bounds and bit-area (Chapter 4): “was this writer active at the instant the snapshot was taken?” mvcc_is_active_id is the live test. It consults the current global active set, not a captured copy:
```c
// mvcc_is_active_id -- src/transaction/mvcc.c
  if (MVCC_ID_PRECEDES (mvccid, curr_mvcc_info->recent_snapshot_lowest_active_mvccid))
    return false;    /* below recent watermark: long dead */
  if (logtb_is_current_mvccid (thread_p, mvccid))
    return true;     /* mine (or my sub-tx) */
  return log_Gl.mvcc_table.is_active (mvccid);    /* live global probe, not the snapshot copy */
```

Invariant: delete uses liveness, read uses the frozen snapshot. A reader probes the captured snapshot (`mvcc_is_id_in_snapshot`) for a stable view; a writer probes the live set (`mvcc_is_active_id`) and blocks on any deleter active right now, so it never loses an update. Swapping the two breaks one side or the other: a writer on the frozen probe can miss a post-snapshot deleter and lose an update, and a reader on the live probe no longer has a stable view, because its verdicts shift as transactions commit after the snapshot.
The watermark macros MVCC_IS_REC_{INSERTED,DELETED}_SINCE_MVCCID (vacuum only) are pure arithmetic: INSERTED_SINCE is !MVCC_ID_PRECEDES(ins_id, mvcc_id), i.e. ins_id >= mvcc_id, with no table probe.
7.2 Result struct/enum coverage
The four predicates return three enums. mvcc_satisfies_snapshot_result (from Chapter 6) is reused by ..._dirty and ..._is_not_deleted_for_snapshot with a narrower domain.
mvcc_satisfies_snapshot_result (returned by ..._dirty, ..._is_not_deleted_for_snapshot):
| Value | Role | Why it exists |
|---|---|---|
| TOO_OLD_FOR_SNAPSHOT | not visible; deleted by me or by a committed tx | “dead for good” vs. “not yet born” so a chain walker stops |
| SNAPSHOT_SATISFIED | visible and valid | the only “yes” answer |
| TOO_NEW_FOR_SNAPSHOT | not visible; inserter still active, so follow prev_version_lsa | only mvcc_satisfies_snapshot returns it (7.4, 7.5 never do) |
mvcc_satisfies_delete_result (returned by ..._delete):
| Value | Role | Why it exists |
|---|---|---|
| DELETE_RECORD_INSERT_IN_PROGRESS | inserted by another active tx | not yet committed-visible; do not touch |
| DELETE_RECORD_CAN_DELETE | visible, valid; proceed | all-visible, by-me, or inserter-committed |
| DELETE_RECORD_DELETED | deleted by a committed tx | target gone; caller raises serialization/not-found |
| DELETE_RECORD_DELETE_IN_PROGRESS | deleted by another active tx | caller must wait on the deleter (lock-manager handoff, see cubrid-lock-manager-detail) then retry |
| DELETE_RECORD_SELF_DELETED | deleted by this tx | idempotent; treated as removed |
mvcc_satisfies_vacuum_result (returned by ..._vacuum):
| Value | Role | Why it exists |
|---|---|---|
| VACUUM_RECORD_REMOVE | physically remove the whole record | deleter committed before the oldest reader, so the version is never seen again |
| VACUUM_RECORD_DELETE_INSID_PREV_VER | keep record, strip insert MVCCID and prev_version_lsa | all-visible-and-live: stamp and chain are dead weight, row still needed |
| VACUUM_RECORD_CANNOT_VACUUM | leave alone | already vacuumed, or recently inserted/deleted, so a running tx may need it |
7.3 mvcc_satisfies_delete — the five-state liveness verdict
The predicate a DELETE or UPDATE runs against the heap row it intends to modify. It takes no snapshot argument, because liveness is always “now”. Top split is MVCC_IS_HEADER_DELID_VALID: does the row carry a delete stamp?
Not-yet-deleted branch (!MVCC_IS_HEADER_DELID_VALID):
```c
// mvcc_satisfies_delete -- src/transaction/mvcc.c
  if (!MVCC_IS_HEADER_DELID_VALID (rec_header))
    {
      if (!MVCC_IS_FLAG_SET (rec_header, OR_MVCC_FLAG_VALID_INSID))
        return DELETE_RECORD_CAN_DELETE;            /* no insid stamp: all-visible */
      if (MVCC_IS_REC_INSERTED_BY_ME (thread_p, rec_header))
        return DELETE_RECORD_CAN_DELETE;            /* only I can see it; safe to drop */
      else if (MVCC_IS_REC_INSERTER_ACTIVE (thread_p, rec_header))
        return DELETE_RECORD_INSERT_IN_PROGRESS;    /* another tx is still inserting */
      else                                          /* inserter committed; ... perfmon ... */
        return DELETE_RECORD_CAN_DELETE;
    }
```

Four terminal sub-branches: no VALID_INSID (insid vacuumed away, Chapter 9), inserted-by-me, and inserter-committed all yield CAN_DELETE; only an ACTIVE inserter yields INSERT_IN_PROGRESS. The else perfmon block only splits “committed” from “committed-then-insid-vacuumed” for stats; no result change.
Already-deleted branch (else) — symmetric three-way split, same live probe:
```c
// mvcc_satisfies_delete -- src/transaction/mvcc.c (else arm)
  else if (MVCC_IS_REC_DELETED_BY_ME (thread_p, rec_header))
    return DELETE_RECORD_SELF_DELETED;
  else if (MVCC_IS_REC_DELETER_ACTIVE (thread_p, rec_header))
    return DELETE_RECORD_DELETE_IN_PROGRESS;    /* must WAIT on that deleter */
  else                                          /* ... perfmon ... */
    return DELETE_RECORD_DELETED;               /* deleter committed: target gone */
```

Three terminal sub-branches: deleted-by-me → SELF_DELETED; deleter-committed → DELETED; deleter still ACTIVE → DELETE_IN_PROGRESS. The last is the case that requires the live probe: the caller blocks behind the in-progress deleter and re-reads (a frozen snapshot could miss a post-snapshot deleter and lose the update). Figure 7-1 maps both DELID_VALID arms onto the five verdicts.
```mermaid
flowchart TD
    A["DELID valid?"] -->|no| B["ins state?"]
    A -->|yes| F["del state?"]
    B -->|else| C1["CAN_DELETE"]
    B -->|ins ACTIVE| G1["INSERT_IN_PROGRESS"]
    F -->|mine| H1["SELF_DELETED"]
    F -->|del ACTIVE| H2["DELETE_IN_PROGRESS"]
    F -->|committed| H3["DELETED"]
```

Figure 7-1: both DELID_VALID arms mapped onto the five delete verdicts.
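The five-state verdict can be modeled with a live active set in place of the global table. This is an illustrative sketch only: `ToyHeader`, `DeleteVerdict`, and `toy_satisfies_delete` are hypothetical names, the live probe is a `std::set` lookup, and the by-me test ignores sub-transactions and the recent-watermark shortcut.

```cpp
#include <cassert>
#include <cstdint>
#include <set>

using MVCCID = std::uint64_t;

enum class DeleteVerdict
{ InsertInProgress, CanDelete, Deleted, DeleteInProgress, SelfDeleted };

struct ToyHeader { bool has_insid; MVCCID ins_id; bool has_delid; MVCCID del_id; };

// `live` holds the ids active right now (stand-in for the global active set).
DeleteVerdict toy_satisfies_delete (const ToyHeader &h, MVCCID my_id,
                                    const std::set<MVCCID> &live)
{
  assert (!h.has_delid || h.del_id != 0);   // a valid delete stamp is never the null id
  if (!h.has_delid)
    {
      if (!h.has_insid || h.ins_id == my_id)
        return DeleteVerdict::CanDelete;          // all-visible, or my own insert
      if (live.count (h.ins_id) != 0)
        return DeleteVerdict::InsertInProgress;   // another tx still inserting
      return DeleteVerdict::CanDelete;            // inserter committed
    }
  if (h.del_id == my_id)
    return DeleteVerdict::SelfDeleted;
  if (live.count (h.del_id) != 0)
    return DeleteVerdict::DeleteInProgress;       // caller must wait, then retry
  return DeleteVerdict::Deleted;                  // deleter committed
}
```

Note that the function takes no snapshot: the only state it consults besides the header is the set of ids active at call time.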
7.4 mvcc_satisfies_dirty — the side-effecting predicate
mvcc_satisfies_dirty answers read-uncommitted visibility: a dirty read sees committed and in-progress effects. The function header warns that it has side effects, changing snapshot->lowest_active_mvccid and snapshot->highest_completed_mvccid, and that the snapshot argument can never be the transaction snapshot. Here snapshot is a scratch struct whose two scalars are output channels: the predicate resets both to MVCCID_NULL, walks the same DELID_VALID split as 7.3 with the live ACTIVE probe, and stamps at most one scalar, in one of two mutually exclusive ACTIVE arms:
```c
// mvcc_satisfies_dirty -- src/transaction/mvcc.c
  snapshot->lowest_active_mvccid = MVCCID_NULL;        /* both scalars cleared up front */
  snapshot->highest_completed_mvccid = MVCCID_NULL;
  // ... not-deleted arm: only the ACTIVE-inserter branch writes a scalar ...
  else if (MVCC_IS_REC_INSERTER_ACTIVE (thread_p, rec_header))
    snapshot->lowest_active_mvccid = MVCC_GET_INSID (rec_header);     /* side effect, then SATISFIED */
  // ... already-deleted arm: only the ACTIVE-deleter branch writes a scalar ...
  else if (MVCC_IS_REC_DELETER_ACTIVE (thread_p, rec_header))
    snapshot->highest_completed_mvccid = rec_header->mvcc_del_id;     /* side effect, then SATISFIED */
```

Branch shape mirrors 7.3. Not-deleted arm, four sub-branches all SNAPSHOT_SATISFIED: no VALID_INSID / inserted-by-me / inserter-committed write nothing; only inserter ACTIVE stamps lowest_active_mvccid. Already-deleted arm, three sub-branches: deleted-by-me and deleter-committed → TOO_OLD_FOR_SNAPSHOT; only deleter ACTIVE → SNAPSHOT_SATISFIED, stamping highest_completed_mvccid. Dirty never returns TOO_NEW, because it accepts active inserters as visible.
Invariant — dirty’s two scalars are an output, not the transaction view. On the real snapshot (Chapters 5–6) these fields are captured inputs bounding the active set; here they are outputs on a throwaway struct, at most one non-NULL per call. Passing the live snapshot would corrupt its bounds — hence the header note.
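The output-channel behavior is easiest to see in a toy version. The sketch below is hypothetical throughout (`ToyScratch`, `toy_satisfies_dirty`, a `std::set` as the live probe, no sub-transaction handling); it only mirrors the branch shape described above.

```cpp
#include <cassert>
#include <cstdint>
#include <set>

using MVCCID = std::uint64_t;
enum class Verdict { TooOld, Satisfied, TooNew };   // TooNew is never produced here

struct ToyHeader { bool has_insid; MVCCID ins_id; bool has_delid; MVCCID del_id; };

// Scratch outputs; 0 plays the role of MVCCID_NULL.
struct ToyScratch { MVCCID lowest_active = 0; MVCCID highest_completed = 0; };

Verdict toy_satisfies_dirty (const ToyHeader &h, MVCCID my_id,
                             const std::set<MVCCID> &live, ToyScratch &out)
{
  assert (!h.has_delid || h.del_id != 0);   // a valid delete stamp is never the null id
  out.lowest_active = 0;                    // both outputs cleared up front
  out.highest_completed = 0;
  if (!h.has_delid)
    {
      if (h.has_insid && h.ins_id != my_id && live.count (h.ins_id) != 0)
        out.lowest_active = h.ins_id;       // report the active inserter
      return Verdict::Satisfied;            // every not-deleted leaf is visible
    }
  if (h.del_id == my_id)
    return Verdict::TooOld;
  if (live.count (h.del_id) != 0)
    {
      out.highest_completed = h.del_id;     // report the active deleter
      return Verdict::Satisfied;
    }
  return Verdict::TooOld;                   // deleter committed
}
```

After any call at most one of the two outputs is non-zero, which is how a caller can tell which writer, if any, it would have to wait for.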
7.5 mvcc_is_not_deleted_for_snapshot — the cheap still-deletable check
The lightest predicate: “is this row not deleted from my snapshot’s view?” It is used where the caller already knows the row is otherwise visible. Unlike delete and dirty, it uses frozen IN-SNAPSHOT semantics.
```c
// mvcc_is_not_deleted_for_snapshot -- src/transaction/mvcc.c
  if (!MVCC_IS_HEADER_DELID_VALID (rec_header))
    return SNAPSHOT_SATISFIED;      /* never deleted: trivially "not deleted" */
  else if (MVCC_IS_REC_DELETED_BY_ME (thread_p, rec_header))
    return TOO_OLD_FOR_SNAPSHOT;    /* I deleted it: gone for me */
  else if (MVCC_IS_REC_DELETER_IN_SNAPSHOT (thread_p, rec_header, snapshot))    /* frozen */
    return SNAPSHOT_SATISFIED;      /* deleter active/after-snapshot: still here */
  else
    return TOO_OLD_FOR_SNAPSHOT;    /* deleter committed before snapshot: gone */
```

Four terminal branches (the not-deleted short-circuit plus a three-way deleted arm), no insert-side logic (the caller’s job), no TOO_NEW. The deleter test is MVCC_IS_REC_DELETER_IN_SNAPSHOT (frozen), not delete’s ACTIVE probe, because this is a read-style verdict: an in-snapshot deleter (active, or committed after the snapshot) is not yet visible, so the row counts as “not deleted”.
7.6 mvcc_satisfies_vacuum — the three-way watermark verdict
Vacuum takes only oldest_mvccid, the oldest-active watermark from Chapter 9. Below it no running transaction can need a version, so the decision is pure arithmetic. The outer split is “not deleted, or deleted too recently (del_id >= oldest) to remove wholesale.”
```c
// mvcc_satisfies_vacuum -- src/transaction/mvcc.c
  if (!MVCC_IS_HEADER_DELID_VALID (rec_header) || MVCC_IS_REC_DELETED_SINCE_MVCCID (rec_header, oldest_mvccid))
    {
      if (!MVCC_IS_HEADER_INSID_NOT_ALL_VISIBLE (rec_header)
          || MVCC_IS_REC_INSERTED_SINCE_MVCCID (rec_header, oldest_mvccid))
        return VACUUM_RECORD_CANNOT_VACUUM;            /* insid gone OR inserted too recently; ...perfmon... */
      else
        return VACUUM_RECORD_DELETE_INSID_PREV_VER;    /* inserter committed before oldest: insid dead weight */
    }
  else
    return VACUUM_RECORD_REMOVE;                       /* deleter committed before oldest: nobody sees it */
```

Three terminal outcomes, every branch accounted for: (1) REMOVE (outer else) when del_id < oldest_mvccid; (2) CANNOT_VACUUM (first arm) when the insid is gone (!INSID_NOT_ALL_VISIBLE) or inserted-since-oldest (ins_id >= oldest_mvccid); (3) DELETE_INSID_PREV_VER (inner else) when the inserter committed before oldest_mvccid with insid still present, so the insert stamp and prev_version_lsa are dead metadata.
Invariant: vacuum only looks backward past a single global watermark. Every comparison is `header_id` vs. `oldest_mvccid`: no snapshot, no live probe. The watermark is monotonic non-decreasing (Chapter 9), so a version judged `REMOVE` can never later become needed; an `oldest_mvccid` set too high would vacuum a still-visible version, hence Chapter 9’s conservative computation.
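Because the verdict is pure arithmetic on the header and one watermark, it fits in a few lines. This is a sketch under assumptions: `ToyHeader`, `VacuumVerdict`, and `toy_satisfies_vacuum` are hypothetical, and MVCCID ordering is taken to be plain unsigned comparison.

```cpp
#include <cassert>
#include <cstdint>

using MVCCID = std::uint64_t;
enum class VacuumVerdict { Remove, DeleteInsidPrevVer, CannotVacuum };

struct ToyHeader
{
  bool insid_not_all_visible;   // insert stamp still present on the record
  MVCCID ins_id;
  bool has_delid;
  MVCCID del_id;
};

VacuumVerdict toy_satisfies_vacuum (const ToyHeader &h, MVCCID oldest)
{
  assert (oldest != 0);   // the watermark is never the null id in this toy
  if (!h.has_delid || h.del_id >= oldest)
    {
      // Not deleted, or deleted too recently to remove wholesale.
      if (!h.insid_not_all_visible || h.ins_id >= oldest)
        return VacuumVerdict::CannotVacuum;        // nothing to strip, or too recent
      return VacuumVerdict::DeleteInsidPrevVer;    // insert stamp is dead weight
    }
  return VacuumVerdict::Remove;                    // deleter is fully in the past
}
```

With a watermark of 100, a header deleted by 60 is removed, a live row inserted by 50 loses its insert stamp, and a row inserted by 120 is left alone.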
7.7 Four predicates, one header — the contrast table
| Predicate | Extra input | Comparison primitive | Sees in-progress writers? | Result domain | Side effects |
|---|---|---|---|---|---|
| mvcc_satisfies_snapshot (Ch.6) | transaction snapshot | frozen IN-SNAPSHOT | no (active inserter -> TOO_NEW) | 3-state snapshot | none |
| mvcc_is_not_deleted_for_snapshot | snapshot | frozen IN-SNAPSHOT (deleter only) | no | 2 of 3 snapshot (no TOO_NEW) | none |
| mvcc_satisfies_dirty | scratch snapshot struct | live ACTIVE | yes (active writer -> SNAPSHOT_SATISFIED) | 2 of 3 snapshot (no TOO_NEW) | writes lowest_active/highest_completed |
| mvcc_satisfies_delete | none | live ACTIVE | yes, as wait signals | 5-state delete | none |
| mvcc_satisfies_vacuum | oldest_mvccid watermark | watermark SINCE | no, only fully-past versions act | 3-state vacuum | none |
The comparison-primitive column is the essential difference; the DELID_VALID split, the inserted-by-me short-circuit, and the perfmon bookkeeping are largely shared.
7.8 Chapter summary — key takeaways
- The four predicates differ almost entirely in one comparison primitive: `mvcc_is_id_in_snapshot` (frozen, reads), `mvcc_is_active_id` (live, delete and dirty), or `MVCC_ID_PRECEDES` against `oldest_mvccid` (watermark, vacuum). Only `mvcc_satisfies_snapshot` ever returns `TOO_NEW`.
- `mvcc_satisfies_delete` uses the live ACTIVE probe so a writer blocks behind an in-progress deleter (`DELETE_IN_PROGRESS`), avoiding lost updates. Five states: a 4-branch not-deleted arm (all but one `CAN_DELETE`) and a 3-branch already-deleted arm (`SELF_DELETED` / `DELETE_IN_PROGRESS` / `DELETED`).
- `mvcc_satisfies_dirty` is the only side-effecting predicate, also live: on `SNAPSHOT_SATISFIED` it stamps the active inserter into `lowest_active_mvccid` or the active deleter into `highest_completed_mvccid` (never both). These are outputs on a scratch struct, never the real one.
- `mvcc_is_not_deleted_for_snapshot` is the cheap, delete-only, frozen check: four branches (not-deleted short-circuit plus a three-way deleted arm) via `MVCC_IS_REC_DELETER_IN_SNAPSHOT`, no insert logic, no `TOO_NEW`.
- `mvcc_satisfies_vacuum` is pure watermark arithmetic: `REMOVE` when the deleter committed before `oldest_mvccid`, `DELETE_INSID_PREV_VER` when the inserter did, `CANNOT_VACUUM` otherwise. It is safe only because the watermark is conservatively low.
Chapter 8: Commit and the History Ring Advance
When a write transaction finishes, three things must happen, in an order that survives a crash and stays correct for lock-free snapshot readers: the transaction’s MVCCID is marked inactive in the global active set, the new active set is published into the history ring so concurrent build_mvcc_info callers see it, and the global oldest-active watermark VACUUM trusts is advanced, but not so far that VACUUM erases data the still-uncommitted transaction may need to recover. This chapter traces that path end to end.
For the read side of these structures (bit-area probe, cached scalars, version-recheck on read) see the high-level companion cubrid-mvcc.md and Chapters 4–5. Here we cover only the write side: retirement, publication, and watermark maintenance.
8.1 The three structures in play
Commit touches all three central structs (full field roles for mvcc_active_tran are in Ch. 1; the commit-relevant maintenance fields are repeated here).
mvcctable — the process-global coordinator (one instance, log_Gl.mvcc_table):
| Field | Role | Why it exists |
|---|---|---|
| m_transaction_lowest_visible_mvccids | per-tran-index atomic<MVCCID>; oldest MVCCID this tran must keep visible | VACUUM floor; commit clamps the committer’s slot |
| m_transaction_lowest_visible_mvccids_size | length of that array | realloc guard |
| m_current_status_lowest_active_mvccid | atomic global oldest-active watermark | advance_oldest_active CAS-bumps it; VACUUM reads it |
| m_current_trans_status | the live mvcc_trans_status, mutated under the mutex | never read lock-free |
| m_trans_status_history_position | atomic ring index of the newest published status | the single visibility store |
| m_trans_status_history | ring of HISTORY_MAX_SIZE (2048) status slots | lock-free readers grab a recent snapshot |
| m_active_trans_mutex | serializes status mutation | one completer at a time |
| m_new_mvccid_lock, m_oldest_visible, m_ov_lock_count | MVCCID issuance + oldest-visible watermark | off commit path; Ch. 3 and Ch. 9 |
mvcc_trans_status — one global “snapshot generation”:
| Field | Role | Why it exists |
|---|---|---|
| m_active_mvccs | the mvcc_active_tran payload | the active-set data |
| m_last_completed_mvccid | last MVCCID retired into this status | highest_completed hints |
| m_event_type | COMMIT, ROLLBACK, or SUBTRAN | post-mortem of the generation |
| m_version | atomic<version_type>, bumped per generation | reader recheck token |
mvcc_active_tran — every field (maintenance side at commit):
| Field | Role at commit | Why it exists |
|---|---|---|
| m_bit_area | 500 uint64_t units; bit set = MVCCID committed | O(1) recent-MVCCID status |
| m_bit_area_start_mvccid | MVCCID mapped to bit 0 | window base; ltrim_area advances it |
| m_bit_area_length | window length in bits | grown by set_bitarea_mvccid, shrunk by trims |
| m_long_tran_mvccids | sorted array of active MVCCIDs older than the window | window can slide past stragglers |
| m_long_tran_mvccids_length | long-tran entry count | bounds array; drives compute_lowest_active_mvccid |
| m_initialized | lifecycle flag; asserted by copy_to and reset_start_mvccid | guards init/finalize idempotency, no use-before-init |
```mermaid
flowchart LR
    CUR["m_current_trans_status<br/>(live, under mutex)"] -->|copy_to THREAD_SAFE| RING["m_trans_status_history[2048]"]
    POS["m_trans_status_history_position"] -->|newest| RING
    CUR -->|set_inactive_mvccid| AT["m_active_mvccs<br/>m_bit_area / m_long_tran_mvccids"]
    CUR -.->|compute_lowest + CAS| LOW["m_current_status_lowest_active_mvccid"]
```
Figure 8-1: the live status feeds the published ring (copy_to), the active set (set_inactive_mvccid), and the watermark (CAS).
8.2 logtb_complete_mvcc — caller and read-only fast path
logtb_complete_mvcc (in log_tran_table.c) runs at every commit and rollback, first deciding whether the transaction even has an MVCCID to retire:
```c
// logtb_complete_mvcc -- src/transaction/log_tran_table.c
mvccid = curr_mvcc_info->id;
tran_index = LOG_FIND_THREAD_TRAN_INDEX (thread_p);

if (MVCCID_IS_VALID (mvccid))
  {
    mvcc_table->complete_mvcc (tran_index, mvccid, committed);    /* <- write tran: full path */
  }
else
  {
    if (committed && logtb_tran_update_all_global_unique_stats (thread_p) != NO_ERROR)
      {
        assert (false);
      }
    /* read-only tran never allocated an MVCCID; just drop its visibility floor */
    log_Gl.mvcc_table.reset_transaction_lowest_active (tran_index);    /* <- stores MVCCID_NULL */
  }
curr_mvcc_info->recent_snapshot_lowest_active_mvccid = MVCCID_NULL;
// ... condensed: reset count-optim state, curr_mvcc_info->reset (), perf ...
```

A read-only tran (no MVCCID) has nothing in any active set: it skips complete_mvcc, and reset_transaction_lowest_active stores MVCCID_NULL into its slot, releasing its floor. No mutex, no ring advance. Everything below is the write path.
8.3 complete_mvcc end to end
The body runs under m_active_trans_mutex until an explicit ulock.unlock (). Branch-complete walkthrough:

1. Lock `m_active_trans_mutex`.
2. `next_trans_status_start` (§8.4): reserve the next ring slot, bump the version, invalidate the slot.
3. `if (committed)` → `logtb_tran_update_all_global_unique_stats`; failure trips `assert (false)`. On rollback, skip.
4. `set_inactive_mvccid(mvccid)` (§8.7), then set `m_last_completed_mvccid = mvccid` and `m_event_type = COMMIT/ROLLBACK`.
5. `next_tran_status_finish` (§8.5): copy the active set, publish the position.
6. Clamp branch: `if (committed)` clamp the floor up to `mvccid` (only when the slot is `MVCCID_NULL` or precedes `mvccid`); else set the floor to `MVCCID_NULL`.
7. `unlock`.
8. Post-unlock advance: `if` global `== mvccid` or `mvccid` precedes `bit_area_start` → `compute_lowest_active_mvccid`, then `if` the version is unchanged → `advance_oldest_active`; else skip the stale result. Otherwise skip entirely.
Step 6, the clamp branch, adjusts the committer’s own slot in m_transaction_lowest_visible_mvccids[tran_index]:
```cpp
// mvcctable::complete_mvcc -- src/transaction/mvcc_table.cpp
if (committed)
  {
    /* be sure that transaction modifications can't be vacuumed up to LOG_COMMIT. ...
     * It will be set to NULL after LOG_COMMIT */
    MVCCID tran_lowest_active = oldest_active_get (m_transaction_lowest_visible_mvccids[tran_index], ...);
    if (tran_lowest_active == MVCCID_NULL || MVCC_ID_PRECEDES (tran_lowest_active, mvccid))
      {
        oldest_active_set (..., mvccid, ...);    /* <- clamp UP to mvccid, never down */
      }
  }
else
  {
    oldest_active_set (..., MVCCID_NULL, ...);    /* <- rollback releases the floor immediately */
  }
```

Invariant: VACUUM must not pass a committing transaction before its `LOG_COMMIT`. Commit raises the slot to `mvccid` (only if it is `MVCCID_NULL` or strictly older), pinning VACUUM at or below `mvccid` until `LOG_COMMIT` is durable; it resets to `MVCCID_NULL` only after `LOG_COMMIT`. Enforced by the clamp condition, which never lowers the floor here. If violated, VACUUM could erase this tran’s modifications, and a crash before `LOG_COMMIT` would leave them unrecoverable. Rollback has no such hazard, so it drops the floor immediately.
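The clamp rule can be checked on its own. A minimal sketch, assuming a single `std::atomic` slot in place of the per-transaction array and plain `load`/`store` in place of `oldest_active_get` / `oldest_active_set` (whose elided arguments are not reproduced); `toy_clamp_floor` is a hypothetical name and 0 stands for `MVCCID_NULL`.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

using MVCCID = std::uint64_t;

// Adjust one transaction's visibility floor at completion time.
void toy_clamp_floor (std::atomic<MVCCID> &slot, MVCCID mvccid, bool committed)
{
  assert (mvccid != 0);   // only write transactions reach this path
  if (committed)
    {
      MVCCID cur = slot.load ();
      if (cur == 0 || cur < mvccid)
        slot.store (mvccid);   // raise the floor, never lower it
    }
  else
    {
      slot.store (0);          // rollback releases the floor at once
    }
}
```

A floor of 25 rises to 40 on commit of 40, a floor of 60 is left untouched, and any rollback resets the slot to the null id.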
8.4 next_trans_status_start — reserve and invalidate
```cpp
// mvcctable::next_trans_status_start -- src/transaction/mvcc_table.cpp
next_index = (m_trans_status_history_position.load () + 1) & HISTORY_INDEX_MASK;    /* ring +1 */
next_version = ++m_current_trans_status.m_version;                                   /* bump GLOBAL version */
mvcc_trans_status &next_trans_status = m_trans_status_history[next_index];
next_trans_status.m_version.store (next_version);                                    /* poison the target slot */
return next_trans_status;
```

Three effects under the mutex: the ring index advances modulo HISTORY_MAX_SIZE (2048, power of two, so & HISTORY_INDEX_MASK wraps); the current status’s version increments (the reader’s recheck token); and the target slot’s version is stamped to next_version before its payload exists.
Invariant — a half-written slot is detectably stale, and publication is the last store. The slot’s
m_version is stamped before its bit-area is populated; next_tran_status_finish (§8.5) fills the payload and only then stores m_trans_status_history_position, the single publish. The version is a plain monotonically incrementing unsigned int (version_type), +1 per generation; the reader’s recheck is an exact-value compare (next_status.m_version.load () == next_version), not seqlock parity. Enforced by stamp-first + publish-last statement ordering plus the reader recheck (Ch. 4/5). If violated, a reader could match an advanced version over partial bits, accepting a torn active set.
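The stamp-first, publish-last ordering can be illustrated with a toy ring. This is a sketch under stated assumptions: the ring has 4 slots instead of 2048, the payload is one 64-bit word rather than a bit area, and the names `toy_slot`, `toy_ring`, `publish` and `try_read` are hypothetical. The reader side models the exact-value recheck described above; it is not a copy of the real reader.

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t RING_SIZE = 4;               // power of two, so the mask wraps the index
constexpr std::size_t RING_MASK = RING_SIZE - 1;

struct toy_slot
{
  std::atomic<unsigned int> version{0};
  std::atomic<std::uint64_t> payload{0};
};

struct toy_ring
{
  std::array<toy_slot, RING_SIZE> slots;
  std::atomic<std::size_t> position{0};
  unsigned int current_version = 0;                // writer-side state (mutex-protected in the real code)

  // Writer: stamp the target slot first, fill it, publish the index last.
  void publish (std::uint64_t payload)
  {
    std::size_t next = (position.load () + 1) & RING_MASK;
    unsigned int next_version = ++current_version;
    slots[next].version.store (next_version);      // stamp first
    slots[next].payload.store (payload);           // then fill
    position.store (next);                         // publish last
  }

  // Reader: copy the payload, accept it only if the version did not change meanwhile.
  bool try_read (std::uint64_t &out) const
  {
    const toy_slot &slot = slots[position.load ()];
    unsigned int seen = slot.version.load ();
    std::uint64_t copy = slot.payload.load ();
    if (slot.version.load () != seen)
      {
        return false;                              // slot was reused mid-copy; the caller retries
      }
    out = copy;
    return true;
  }
};
```

A reader that loses the race sees a different version on the second load and retries, which is why the version has to be stamped before any payload byte is written.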
8.5 next_tran_status_finish — publish last
```cpp
// mvcctable::next_tran_status_finish -- src/transaction/mvcc_table.cpp
m_current_trans_status.m_active_mvccs.copy_to (next_trans_status.m_active_mvccs,
                                               mvcc_active_tran::copy_safety::THREAD_SAFE);   /* deep-copy the active set */
next_trans_status.m_last_completed_mvccid = m_current_trans_status.m_last_completed_mvccid;
next_trans_status.m_event_type = m_current_trans_status.m_event_type;
m_trans_status_history_position.store (next_index);   /* <- THE publish */
```

copy_safety::THREAD_SAFE makes copy_to run check_valid on source and destination. The payload fills the already-version-stamped slot; the trailing m_trans_status_history_position.store is the publish (see the §8.4 invariant). Until it lands, readers see the previous slot as newest.
8.6 advance_oldest_active — the post-unlock CAS loop
After unlocking, complete_mvcc recomputes the global watermark only when the retired mvccid was the bottleneck:
// mvcctable::complete_mvcc (post-unlock) -- src/transaction/mvcc_table.cppMVCCID global_lowest_active = m_current_status_lowest_active_mvccid;if (global_lowest_active == mvccid || MVCC_ID_PRECEDES (mvccid, next_status.m_active_mvccs.get_bit_area_start_mvccid ())) { MVCCID new_lowest_active = next_status.m_active_mvccs.compute_lowest_active_mvccid (); if (next_status.m_version.load () == next_version) /* <- recheck: result still ours? */ { advance_oldest_active (new_lowest_active); } }Two trigger conditions. Recompute only if (a) mvccid equals the global watermark — it was the oldest — or (b) mvccid precedes the slot’s bit_area_start_mvccid — a long transaction finished and the floor lives in the long-tran array; otherwise the watermark is unaffected and the work skipped. Version recheck. compute_lowest_active_mvccid reads next_status lock-free; if another completer reused the slot its version differs and the value is discarded.
// mvcctable::advance_oldest_active -- src/transaction/mvcc_table.cppdo { crt_oldest_active = m_current_status_lowest_active_mvccid.load (); if (crt_oldest_active >= next_oldest_active) { return; } /* <- monotonic guard */ }while (!m_current_status_lowest_active_mvccid.compare_exchange_strong (crt_oldest_active, next_oldest_active));Invariant — the global watermark is monotonically non-decreasing.
advance_oldest_active only ever raises it. Enforced by the crt_oldest_active >= next_oldest_active early return inside the CAS loop, re-evaluated on each retry. If violated, VACUUM could see the value drop and reclaim data an active reader needs. The CAS handles racing completers: the higher value wins, losers re-read and bail.
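The monotonic guard inside a CAS loop is a general pattern, and a standalone version makes the retry behaviour easier to see. This is a minimal sketch using only `std::atomic`; the function name `advance_monotonic` and its boolean return value are illustrative additions, not part of the CUBRID API.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

using MVCCID = std::uint64_t;

// Raise `watermark` to `candidate` unless it is already at or above it.
// Returns true only if this call performed the raise.
inline bool advance_monotonic (std::atomic<MVCCID> &watermark, MVCCID candidate)
{
  MVCCID current = watermark.load ();
  do
    {
      if (current >= candidate)
        {
          return false;   // monotonic guard: never lower the value
        }
    }
  while (!watermark.compare_exchange_strong (current, candidate));
  // On CAS failure `current` is refreshed with the observed value,
  // so the guard is re-evaluated before the next attempt.
  return true;
}
```

Because a failed `compare_exchange_strong` writes the observed value back into `current`, a thread that lost the race to a higher value exits through the guard on its next iteration.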
8.7 set_inactive_mvccid — routing the retirement
Back inside the mutex, set_inactive_mvccid routes the retiring MVCCID:
// mvcc_active_tran::set_inactive_mvccid -- src/transaction/mvcc_active_tran.cppif (MVCC_ID_PRECEDES (mvccid, m_bit_area_start_mvccid)) { remove_long_transaction (mvccid); /* <- slid out of the window: it's a long tran */ }else { set_bitarea_mvccid (mvccid); /* <- still in the window: set its committed bit */ }An MVCCID older than the window base is in the long-tran array; everything else lives in the bit area (two-tier design, Ch. 4). The bit-area branch (set_bitarea_mvccid) then runs three maintenance triggers, traced in §8.9.
8.8 remove_long_transaction and add_long_transaction
```cpp
// mvcc_active_tran::remove_long_transaction -- src/transaction/mvcc_active_tran.cpp
assert (m_long_tran_mvccids_length > 0);
for (i = 0; i < m_long_tran_mvccids_length - 1; i++)
  {
    if (m_long_tran_mvccids[i] == mvccid)
      {
        size_t memsize = (m_long_tran_mvccids_length - i - 1) * sizeof (MVCCID);
        std::memmove (&m_long_tran_mvccids[i], &m_long_tran_mvccids[i + 1], memsize);   /* close the gap */
        break;
      }
  }
assert ((i < m_long_tran_mvccids_length - 1) || m_long_tran_mvccids[i] == mvccid);
--m_long_tran_mvccids_length;
```

Linear scan; on match, memmove closes the gap (keeping the array dense and sorted), then length decrements. The loop stops at length - 1: a last-element target is never matched in the body, but the trailing assert confirms it was the tail and the unconditional --m_long_tran_mvccids_length drops it.
add_long_transaction is the inverse, used only during migration:
// mvcc_active_tran::add_long_transaction -- src/transaction/mvcc_active_tran.cppassert (m_long_tran_mvccids_length < long_tran_max_size ());assert (m_long_tran_mvccids_length == 0 || m_long_tran_mvccids[m_long_tran_mvccids_length - 1] < mvccid);m_long_tran_mvccids[m_long_tran_mvccids_length++] = mvccid; /* append; caller guarantees ascending */Invariant — the long-tran array is sorted ascending and bounded. Enforced by
add_long_transaction’s two asserts (each append exceeds the prior tail; length below long_tran_max_size); the migration source iterates the bit-area low-to-high, so appends are ascending. If violated: compute_lowest_active_mvccid (Ch. 6/9) returns m_long_tran_mvccids[0] as the minimum, so an unsorted array yields a wrong watermark.
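The two array operations can be exercised together in a small model. This sketch assumes a fixed capacity of 8 (a hypothetical stand-in for `long_tran_max_size ()`) and plain `<` ordering; the type `toy_long_trans` is illustrative, not the real class.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

using MVCCID = std::uint64_t;

struct toy_long_trans
{
  static constexpr std::size_t CAPACITY = 8;   // hypothetical; stands in for long_tran_max_size ()
  MVCCID ids[CAPACITY] = {};
  std::size_t length = 0;

  // Append only: the caller must supply ids in ascending order.
  void add (MVCCID id)
  {
    assert (length < CAPACITY);
    assert (length == 0 || ids[length - 1] < id);
    ids[length++] = id;
  }

  // Remove one id; the scan skips the tail, which is dropped by the final decrement.
  void remove (MVCCID id)
  {
    assert (length > 0);
    std::size_t i;
    for (i = 0; i < length - 1; i++)
      {
        if (ids[i] == id)
          {
            std::memmove (&ids[i], &ids[i + 1], (length - i - 1) * sizeof (MVCCID));
            break;
          }
      }
    assert (i < length - 1 || ids[i] == id);   // not matched in the body, so it must be the tail
    --length;
  }
};
```

Removing a middle element shifts the survivors down and keeps them sorted; removing the tail touches no memory and only shortens the array.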
8.9 set_bitarea_mvccid — set the bit and trigger maintenance
```cpp
// mvcc_active_tran::set_bitarea_mvccid -- src/transaction/mvcc_active_tran.cpp
const size_t CLEANUP_THRESHOLD = UNIT_BIT_COUNT;   /* 64 bits */
const size_t LONG_TRAN_THRESHOLD = BITAREA_MAX_BITS - long_tran_max_size ();

size_t position = get_bit_offset (mvccid);
if (position >= BITAREA_MAX_BITS)
  {
    cleanup_migrate_to_long_transations ();   /* <- window full: force migration to make room */
    position = get_bit_offset (mvccid);       /* recompute: start_mvccid moved */
  }
assert (position < BITAREA_MAX_BITS);
if (position >= m_bit_area_length)
  {
    m_bit_area_length = position + 1;         /* extend; new bits already zero (ALL_ACTIVE) */
  }
*get_unit_of (position) |= get_mask_of (position);   /* set the committed bit */
check_valid ();

if (m_bit_area_length > CLEANUP_THRESHOLD)
  {
    /* trim all-committed prefix units */
    for (first_not_all_committed = 0; first_not_all_committed < get_area_size (); first_not_all_committed++)
      if (m_bit_area[first_not_all_committed] != ALL_COMMITTED)
        break;
    ltrim_area (first_not_all_committed);
    check_valid ();
  }
if (m_bit_area_length > LONG_TRAN_THRESHOLD)
  {
    cleanup_migrate_to_long_transations ();
  }
```

- Overflow branch. If the offset reaches BITAREA_MAX_BITS (500 × 64 = 32000 bits), the window cannot hold it: cleanup_migrate_to_long_transations slides the window forward, the offset recomputes against the new base, and the assert guarantees it fits.
- Extend branch. A bit past the current length simply raises m_bit_area_length; storage is pre-zeroed (ALL_ACTIVE), so no clearing is needed.
- Cleanup threshold. Once length exceeds one unit (64), the code finds the first unit not ALL_COMMITTED and ltrim_areas everything before it. This is the cheap, common compaction.
- Long-tran threshold. If length still exceeds LONG_TRAN_THRESHOLD (bit-area max minus long-tran capacity), remaining stragglers migrate wholesale, leaving exactly enough long-tran slots for every possible active tran.
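The arithmetic behind the offsets and thresholds can be checked with a few constant expressions. This is a sketch under assumptions: the helper names are hypothetical, the mapping "offset = mvccid minus window base" is inferred from how ltrim_area advances the base, and `long_tran_capacity` is a made-up parameter standing in for `long_tran_max_size ()`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

using MVCCID = std::uint64_t;
using unit_type = std::uint64_t;

constexpr std::size_t UNIT_BIT_COUNT = 64;
constexpr std::size_t BITAREA_MAX_SIZE = 500;                                // units
constexpr std::size_t BITAREA_MAX_BITS = BITAREA_MAX_SIZE * UNIT_BIT_COUNT;  // 32000 bits

// Assumed mapping: the offset counts ids from the window base.
constexpr std::size_t bit_offset (MVCCID start_mvccid, MVCCID mvccid)
{
  return static_cast<std::size_t> (mvccid - start_mvccid);
}

constexpr std::size_t unit_index (std::size_t offset)
{
  return offset / UNIT_BIT_COUNT;
}

constexpr unit_type bit_mask (std::size_t offset)
{
  return unit_type{1} << (offset % UNIT_BIT_COUNT);
}

// Prefix trim is considered once the window is longer than one unit.
constexpr bool needs_prefix_trim (std::size_t length_bits)
{
  return length_bits > UNIT_BIT_COUNT;
}

// Migration is considered once the window exceeds max bits minus long-tran capacity.
constexpr bool needs_migration (std::size_t length_bits, std::size_t long_tran_capacity)
{
  return length_bits > BITAREA_MAX_BITS - long_tran_capacity;
}
```

With a window base of 100, MVCCID 165 lands at offset 65, which is bit 1 of unit 1. With a hypothetical long-tran capacity of 1000, migration is considered only once the window is longer than 31000 bits.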
8.10 ltrim_area — slide the window base
```cpp
// mvcc_active_tran::ltrim_area -- src/transaction/mvcc_active_tran.cpp
if (trim_size == 0)
  {
    return;
  }
size_t new_memsize = (get_area_size () - trim_size) * sizeof (unit_type);
if (new_memsize > 0)
  {
    std::memmove (m_bit_area, &m_bit_area[trim_size], new_memsize);   /* shift survivors down */
  }
size_t trimmed_bits = units_to_bits (trim_size);
m_bit_area_length -= trimmed_bits;
m_bit_area_start_mvccid += trimmed_bits;   /* base advances */
std::memset (&m_bit_area[get_area_size ()], ALL_ACTIVE, trim_size * sizeof (unit_type));   /* re-zero tail */
```

ltrim_area removes trim_size units from the front: survivors memmove down, length shrinks by the trimmed bit count, and m_bit_area_start_mvccid advances by the same amount so the MVCCID↔offset mapping stays consistent. Vacated tail units reset to ALL_ACTIVE (0).
Invariant — units beyond
bit_area_length are always ALL_ACTIVE (0). Enforced by the trailing memset here, the pre-zeroed initialize, and check_valid’s debug loop. If violated, the “extend by raising length” shortcut in set_bitarea_mvccid inherits stale committed bits, marking never-seen MVCCIDs committed.
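A four-unit toy window shows the trim and the re-zeroed tail together. This is a sketch, assuming that the area size is the length rounded up to whole units; `toy_bit_area` and its members are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

using MVCCID = std::uint64_t;
using unit_type = std::uint64_t;

constexpr std::size_t UNIT_BIT_COUNT = 64;
constexpr unit_type ALL_ACTIVE = 0;

struct toy_bit_area
{
  static constexpr std::size_t MAX_UNITS = 4;
  unit_type units[MAX_UNITS] = {};
  std::size_t length_bits = 0;
  MVCCID start_mvccid = 4;

  // Assumption: units in use = length rounded up to a whole unit.
  std::size_t area_size () const
  {
    return (length_bits + UNIT_BIT_COUNT - 1) / UNIT_BIT_COUNT;
  }

  void ltrim (std::size_t trim_units)
  {
    if (trim_units == 0)
      {
        return;
      }
    assert (trim_units * UNIT_BIT_COUNT <= length_bits);
    std::size_t survivors = area_size () - trim_units;
    if (survivors > 0)
      {
        std::memmove (units, &units[trim_units], survivors * sizeof (unit_type));   // shift down
      }
    length_bits -= trim_units * UNIT_BIT_COUNT;
    start_mvccid += trim_units * UNIT_BIT_COUNT;                                    // base advances
    std::memset (&units[area_size ()], ALL_ACTIVE, trim_units * sizeof (unit_type)); // re-zero tail
  }
};
```

After trimming two full units from a 130-bit window, the one surviving unit sits at index 0, the base has moved forward by 128, and every unit past the new length reads as zero again.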
8.11 cleanup_migrate_to_long_transations — keep 16, evict the rest
```cpp
// mvcc_active_tran::cleanup_migrate_to_long_transations -- src/transaction/mvcc_active_tran.cpp
const size_t BITAREA_SIZE_AFTER_CLEANUP = 16;
size_t delete_count = get_area_size () - BITAREA_SIZE_AFTER_CLEANUP;
for (size_t i = 0; i < delete_count; i++)
  {
    bits = m_bit_area[i];
    for (bit_pos = 0, mask = 1, long_tran_mvccid = get_mvccid (i * UNIT_BIT_COUNT);
         bit_pos < UNIT_BIT_COUNT && bits != ALL_COMMITTED;
         ++bit_pos, mask <<= 1, ++long_tran_mvccid)
      {
        if ((bits & mask) == 0)   /* bit clear == still active */
          {
            add_long_transaction (long_tran_mvccid);   /* push straggler to long-tran array */
            bits |= mask;                              /* set locally to allow early ALL_COMMITTED exit */
          }
      }
  }
ltrim_area (delete_count);
```

Retain the most-recent 16 units, evict the older get_area_size () - 16. In each evicted unit, every clear bit (still-active MVCCID) is appended to the long-tran array; setting the bit locally lets the inner loop short-circuit once the unit reads ALL_COMMITTED. Then ltrim_area (delete_count) drops the migrated units. Low-to-high scanning makes appends ascending, satisfying the §8.8 sort invariant.
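The inner scan can be tried on its own. The sketch below collects the MVCCIDs of clear bits into a `std::vector` instead of calling add_long_transaction; the function name `collect_active` and the explicit `start_mvccid` parameter are assumptions made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using MVCCID = std::uint64_t;
using unit_type = std::uint64_t;

constexpr std::size_t UNIT_BIT_COUNT = 64;
constexpr unit_type ALL_COMMITTED = ~unit_type{0};

// Return, in ascending order, the ids whose bit is clear in the first `delete_count` units.
inline std::vector<MVCCID> collect_active (const unit_type *units, std::size_t delete_count,
                                           MVCCID start_mvccid)
{
  std::vector<MVCCID> stragglers;
  for (std::size_t i = 0; i < delete_count; i++)
    {
      unit_type bits = units[i];
      unit_type mask = 1;
      MVCCID id = start_mvccid + i * UNIT_BIT_COUNT;
      for (std::size_t bit_pos = 0; bit_pos < UNIT_BIT_COUNT && bits != ALL_COMMITTED;
           ++bit_pos, mask <<= 1, ++id)
        {
          if ((bits & mask) == 0)          // clear bit: still active
            {
              stragglers.push_back (id);
              bits |= mask;                // local copy only, enables the early exit
            }
        }
    }
  return stragglers;
}
```

A fully committed unit leaves the inner loop before its first iteration, so the cost of the scan is driven by the position of the highest clear bit in each unit rather than by the window size.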
8.12 check_valid — the debug invariant gate
```cpp
// mvcc_active_tran::check_valid -- src/transaction/mvcc_active_tran.cpp (debug-only, #if !defined(NDEBUG))
// 1. bits in the final partial unit, past bit_area_length, must be 0
if ((m_bit_area_length % UNIT_BIT_COUNT) != 0)
  {
    size_t last_bit_pos = m_bit_area_length - 1;
    unit_type last_unit = *get_unit_of (last_bit_pos);
    for (size_t i = (last_bit_pos + 1); i < UNIT_BIT_COUNT; i++)
      if ((get_mask_of (i) & last_unit) != 0)
        {
          assert (false);   /* a set bit past the length is corruption */
        }
  }
// 2. every unit fully past bit_area_length must equal ALL_ACTIVE
for (unit_type *p_area = get_unit_of (m_bit_area_length) + 1; p_area < m_bit_area + BITAREA_MAX_SIZE; ++p_area)
  if (*p_area != ALL_ACTIVE)
    {
      assert (false);
    }
// 3. long-tran array is ascending and every entry precedes bit_area_start_mvccid
for (size_t i = 0; i < m_long_tran_mvccids_length; i++)
  {
    assert (MVCC_ID_PRECEDES (m_long_tran_mvccids[i], m_bit_area_start_mvccid));
    assert (i == 0 || MVCC_ID_PRECEDES (m_long_tran_mvccids[i - 1], m_long_tran_mvccids[i]));
  }
```

check_valid is a no-op in release builds (its body is compiled under #if !defined (NDEBUG)) but runs after every commit-path mutation (set_bitarea_mvccid, ltrim_area, cleanup_migrate_to_long_transations, remove_long_transaction, copy_to under THREAD_SAFE). Its first conditional fires only when m_bit_area_length is not unit-aligned: it scans the final partial unit from last_bit_pos + 1 to UNIT_BIT_COUNT, asserting each bit clear. Check 2 covers whole units past the length; check 3 verifies that the long-tran array is sorted and strictly below the base. Any violation aborts in debug, surfacing maintenance bugs at the point of corruption.
8.13 Chapter summary — key takeaways
- A read-only commit is nearly free. With no MVCCID, logtb_complete_mvcc skips complete_mvcc and only resets the visibility slot to MVCCID_NULL; only write transactions take the mutex-protected path.
- Publication is a single, last store. next_trans_status_start stamps the slot’s version (a plain monotonic unsigned int, +1 per generation, rechecked by exact equality) before the payload exists; next_tran_status_finish copies the active set and only then stores m_trans_status_history_position, which, together with the version recheck, makes the new snapshot atomically visible to lock-free readers.
- Commit raises the visibility floor, rollback drops it. Commit clamps the per-tran slot up to mvccid (never down) so VACUUM cannot reclaim data before LOG_COMMIT is durable; rollback sets MVCCID_NULL at once.
- The watermark advance is lazy and monotonic. Post-unlock, complete_mvcc recomputes oldest-active only when the retired MVCCID was the bottleneck, rechecks the slot version, then advance_oldest_active CAS-bumps the watermark, which only ever increases.
- Retirement routes by window position and self-compacts. set_inactive_mvccid sends sub-window MVCCIDs to remove_long_transaction and the rest to set_bitarea_mvccid, whose three triggers compact the window: ltrim_area drops the all-committed prefix past 64 bits, cleanup_migrate_to_long_transations keeps 16 units past LONG_TRAN_THRESHOLD, and a BITAREA_MAX_BITS overflow forces migration. check_valid (debug-only) asserts clean tail bits/units and a sorted long-tran array below the base after every mutation.
Chapter 9: Vacuum Coordination and the Oldest-Visible Watermark
MVCC keeps an old version until no live snapshot can need it. That decision
collapses into one global scalar — the oldest-visible MVCCID watermark,
mvcctable::m_oldest_visible. This chapter: how it is computed across every
live snapshot, how vacuum consumes it to drive VACUUM_RECORD_REMOVE, and why
one long-running small-MVCCID writer freezes reclamation database-wide. The
companion cubrid-mvcc.md covers why vacuum exists and the master/worker
split; inputs (m_transaction_lowest_visible_mvccids[] seeded with
MVCCID_ALL_VISIBLE) came from Ch.4/Ch.5, the mvcc_satisfies_vacuum
predicate from Ch.7. Here we close the loop.
9.1 The mvcctable struct — watermark fields in context
History-ring fields were covered in Ch.8. This chapter’s watermark substate:
```cpp
// class mvcctable (watermark fields) -- src/transaction/mvcc_table.hpp
using lowest_active_mvccid_type = std::atomic<MVCCID>;
// ... history-ring fields elided (Ch.8) ...
lowest_active_mvccid_type *m_transaction_lowest_visible_mvccids;   /* per-tran array */
size_t m_transaction_lowest_visible_mvccids_size;
lowest_active_mvccid_type m_current_status_lowest_active_mvccid;   /* global floor */
std::atomic<MVCCID> m_oldest_visible;                              /* cached watermark */
std::atomic<size_t> m_ov_lock_count;                               /* >0 pins the watermark */
```

| Field | Role | Why it exists |
|---|---|---|
| m_transaction_lowest_visible_mvccids | Array, one `atomic<MVCCID>` per tran index; each slot is the oldest MVCCID that transaction’s snapshot must keep visible. | Per-snapshot input to the min. Sized to logtb_get_number_of_total_tran_indices(). |
| m_transaction_lowest_visible_mvccids_size | Cached array length. | Sweep iterates without re-querying the transaction table. |
| m_current_status_lowest_active_mvccid | Global floor: lowest active MVCCID in the current trans-status; advanced monotonically by advance_oldest_active (Ch.8). | Seeds the sweep so a transaction that has not yet published its per-tran value is still bounded. |
| m_oldest_visible | Cached watermark; stored by update_global_oldest_visible, read by get_global_oldest_visible. | One atomic read per vacuum consumer; recompute amortized to the master heartbeat. |
| m_ov_lock_count | Count of operations that have pinned the watermark; > 0 means the cached value must not advance. | A caller that reads get_global_oldest_visible() and acts on it holds the floor steady until it finishes. |
| m_current_trans_status / m_trans_status_history / m_trans_status_history_position / m_new_mvccid_lock / m_active_trans_mutex | Owned by other chapters: the history ring and status mutex (Ch.8), the MVCCID-allocation lock (Ch.3). | Not watermark state; the lock-free sweep reads none of them. |
flowchart TB
FLOOR["m_current_status_lowest_active_mvccid\n(global floor, monotonic)"] --> COMPUTE["compute_oldest_visible_mvccid()"]
ARR["m_transaction_lowest_visible_mvccids[0..N]\n(per-snapshot inputs)"] --> COMPUTE
COMPUTE --> UPDATE["update_global_oldest_visible()"]
LC["m_ov_lock_count\n(pin counter)"] --> UPDATE
UPDATE --> OV["m_oldest_visible\n(cached watermark)"]
OV --> GET["get_global_oldest_visible()"] --> VAC["vacuum consumers\n(threshold_mvccid)"]
Figure 9-1 — Watermark substate, from inputs to the cached scalar vacuum reads.
Invariant (watermark monotonicity): m_oldest_visible never decreases.
update_global_oldest_visible enforces it with
assert (m_oldest_visible.load () <= oldest_visible) before the store. A
regression would let a worker holding an older threshold_mvccid remove a
version a newer-but-lower snapshot still needs. The monotonic floor
m_current_status_lowest_active_mvccid keeps the computed min non-decreasing.
9.2 compute_oldest_visible_mvccid — the cross-snapshot sweep
const, lock-free: reads atomics, never takes m_active_trans_mutex. Returns
the min MVCCID any live snapshot can still see.
// mvcctable::compute_oldest_visible_mvccid -- src/transaction/mvcc_table.cppcubmem::appendable_array<size_t, 32> waiting_mvccids_pos;MVCCID lowest_active_mvccid = oldest_active_get (m_current_status_lowest_active_mvccid, 0, /*...*/); /* <- seed = floor */for (size_t idx = 0; idx < m_transaction_lowest_visible_mvccids_size; idx++) { MVCCID loaded = oldest_active_get (m_transaction_lowest_visible_mvccids[idx], idx, /*...*/); if (loaded == MVCCID_ALL_VISIBLE) waiting_mvccids_pos.append (idx); /* <- in flight; defer (9.2.1) */ else if (loaded != MVCCID_NULL && MVCC_ID_PRECEDES (loaded, lowest_active_mvccid)) lowest_active_mvccid = loaded; /* <- min; NULL = ended, ignored */ }// ... re-check loop for deferred slots (9.2.1) ...assert (MVCCID_IS_NORMAL (lowest_active_mvccid)); /* return value */The sweep classifies each slot by the three sentinel cases Ch.4/Ch.5 wrote
there: MVCCID_ALL_VISIBLE (== 3) means build_mvcc_info is mid-flight (slot
claimed, real value not yet published) — defer the index and re-check;
MVCCID_NULL (== 0) means the transaction ended (written by
reset_transaction_lowest_active, 9.5) — ignore; any >= MVCCID_FIRST is
a published value — take the min via MVCC_ID_PRECEDES.
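The three-way classification of the first pass can be modelled over a plain vector. This is a minimal sketch, assuming the slots are ordinary integers rather than atomics and that `MVCC_ID_PRECEDES` is plain `<`; `sweep_result` and `sweep_once` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using MVCCID = std::uint64_t;
constexpr MVCCID MVCCID_NULL = 0;
constexpr MVCCID MVCCID_ALL_VISIBLE = 3;
constexpr MVCCID MVCCID_FIRST = 4;

struct sweep_result
{
  MVCCID lowest;                        // running minimum, starts at the seed
  std::vector<std::size_t> deferred;    // indices that still showed ALL_VISIBLE
};

// One pass over the per-transaction slots, seeded with the global floor.
inline sweep_result sweep_once (MVCCID seed, const std::vector<MVCCID> &slots)
{
  assert (seed >= MVCCID_FIRST);
  sweep_result result{seed, {}};
  for (std::size_t idx = 0; idx < slots.size (); idx++)
    {
      MVCCID loaded = slots[idx];
      if (loaded == MVCCID_ALL_VISIBLE)
        {
          result.deferred.push_back (idx);      // in flight: re-check later
        }
      else if (loaded != MVCCID_NULL && loaded < result.lowest)
        {
          result.lowest = loaded;               // published value: take the min
        }
      // MVCCID_NULL: transaction ended, ignored
    }
  return result;
}
```

For a seed of 50 and slots `{0, 3, 40, 60, 3, 45}`, the pass lowers the minimum to 40 and defers indices 1 and 4.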
9.2.1 The deferred re-check loop and the 20-retry backoff
MVCCID_ALL_VISIBLE is transient (Ch.5: stamp, read floor, overwrite), so the
loop spins until each deferred slot publishes a real value or drops to
MVCCID_NULL:
// mvcctable::compute_oldest_visible_mvccid (re-check loop) -- src/transaction/mvcc_table.cppsize_t retry_count = 0;while (waiting_mvccids_pos.get_size () > 0) { if (++retry_count % 20 == 0) { thread_sleep (10); } /* <- 10ms backoff every 20 spins */ for (size_t i = waiting_mvccids_pos.get_size () - 1; i < waiting_mvccids_pos.get_size (); --i) { /* reverse walk: decrement past 0 wraps >= size, so erase shrinks the tail safely */ size_t pos = waiting_mvccids_pos.get_array ()[i]; MVCCID loaded = oldest_active_get (m_transaction_lowest_visible_mvccids[pos], pos, /*...*/); if (loaded == MVCCID_ALL_VISIBLE) { continue; } /* <- still unset; keep in set */ if (loaded != MVCCID_NULL && MVCC_ID_PRECEDES (loaded, lowest_active_mvccid)) lowest_active_mvccid = loaded; waiting_mvccids_pos.erase (i); /* <- resolved (value or NULL); drop */ } }A re-read still showing MVCCID_ALL_VISIBLE keeps the slot (continue); any
other value resolves it — a normal MVCCID smaller than the current min lowers
the min, anything else is dropped via erase.
Invariant (sweep terminates with a normal result): ends with
assert (MVCCID_IS_NORMAL (...)). The seed is always >= MVCCID_FIRST, and
every MVCCID_ALL_VISIBLE slot resolves because the publishing writer runs its
two lines back-to-back (Ch.5); the loop cannot hang under correct operation.
flowchart TD
A["seed = m_current_status_lowest_active_mvccid"] --> B["sweep idx 0..size"]
B --> C{slot value?}
C -->|ALL_VISIBLE| D["defer: append idx"]
C -->|NULL| E["ignore"]
C -->|normal| F["if PRECEDES min: min = slot"]
D --> G{waiting set empty?}
E --> G
F --> G
G -->|yes| Z["assert NORMAL; return min"]
G -->|no| H["retry++; every 20th sleep 10ms"]
H --> I["reverse-walk waiting set"] --> J{re-read slot}
J -->|still ALL_VISIBLE| K["keep"] --> G
J -->|normal/NULL| L["maybe update min; erase"] --> G
Figure 9-2 — compute_oldest_visible_mvccid control flow, all branches.
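The reverse walk over an unsigned index is the least obvious part of the re-check loop, so here is one pass of it in isolation. The sketch uses `std::vector` in place of `cubmem::appendable_array` and plain integers in place of atomic slots; `resolve_pass` is a hypothetical name, and the backoff sleep is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using MVCCID = std::uint64_t;
constexpr MVCCID MVCCID_NULL = 0;
constexpr MVCCID MVCCID_ALL_VISIBLE = 3;

// One re-check pass: resolved indices are erased, unresolved ones stay in `waiting`.
inline void resolve_pass (std::vector<std::size_t> &waiting, const std::vector<MVCCID> &slots,
                          MVCCID &lowest)
{
  // Reverse walk: decrementing past 0 wraps to a huge value, which fails `i < size`
  // and ends the loop. Erasing index i never disturbs the indices below it.
  for (std::size_t i = waiting.size () - 1; i < waiting.size (); --i)
    {
      MVCCID loaded = slots[waiting[i]];
      if (loaded == MVCCID_ALL_VISIBLE)
        {
          continue;                              // still unset: keep it in the set
        }
      if (loaded != MVCCID_NULL && loaded < lowest)
        {
          lowest = loaded;
        }
      waiting.erase (waiting.begin () + static_cast<std::ptrdiff_t> (i));
    }
}
```

An empty waiting set is also safe: `size () - 1` wraps immediately, the condition is false, and the body never runs.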
9.3 update_global_oldest_visible — the pinned double-check store
The master heartbeat (9.6) recomputes only if no operation pins the watermark.
m_ov_lock_count is checked twice — before computing and after the sweep,
before the store.
// mvcctable::update_global_oldest_visible -- src/transaction/mvcc_table.cppMVCCID mvcctable::update_global_oldest_visible (){ if (m_ov_lock_count == 0) /* <- gate 1: skip work if pinned */ { MVCCID oldest_visible = compute_oldest_visible_mvccid (); if (m_ov_lock_count == 0) /* <- gate 2: pin may have arrived during sweep */ { assert (m_oldest_visible.load () <= oldest_visible); /* monotonicity (9.1) */ m_oldest_visible.store (oldest_visible); } } return m_oldest_visible.load (); /* <- always return cached (possibly stale) */}Three outcomes: a pin at gate 1 skips the sweep and returns the cached value; a
pin arriving during the sweep (gate 2 != 0) discards the fresh value; only
== 0 at both gates asserts monotonicity and stores. Gate 2 matters because the
sweep can take milliseconds — and a plain atomic load suffices, since a pinning
caller increments m_ov_lock_count before reading the watermark, so the
happens-before edge lives on the caller side.
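The two gates can be exercised with a toy watermark whose sweep is supplied as a callable, which makes it possible to simulate a pin arriving mid-sweep. This is a sketch, not the CUBRID class: `toy_watermark`, its member names and the templated `update` are assumptions for illustration.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>

using MVCCID = std::uint64_t;

struct toy_watermark
{
  std::atomic<MVCCID> oldest_visible{4};     // starts at MVCCID_FIRST
  std::atomic<std::size_t> lock_count{0};

  void lock () { ++lock_count; }
  void unlock () { assert (lock_count > 0); --lock_count; }

  // `compute` stands in for compute_oldest_visible_mvccid ().
  template <typename Compute>
  MVCCID update (Compute compute)
  {
    if (lock_count.load () == 0)                      // gate 1: skip the sweep if pinned
      {
        MVCCID fresh = compute ();
        if (lock_count.load () == 0)                  // gate 2: a pin may have arrived meanwhile
          {
            assert (oldest_visible.load () <= fresh); // monotonicity
            oldest_visible.store (fresh);
          }
      }
    return oldest_visible.load ();                    // always the cached value
  }
};
```

Passing a callable that pins the watermark before returning reproduces the mid-sweep case: the freshly computed value is thrown away and the caller still gets the old cached watermark.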
9.4 The pin API — lock / unlock / is_locked / get
All trivial atomics:
// pin API -- src/transaction/mvcc_table.cppMVCCID mvcctable::get_global_oldest_visible () const { return m_oldest_visible.load (); }void mvcctable::lock_global_oldest_visible () { ++m_ov_lock_count; }void mvcctable::unlock_global_oldest_visible () { assert (m_ov_lock_count > 0); --m_ov_lock_count; }bool mvcctable::is_global_oldest_visible_locked () const { return m_ov_lock_count != 0; }get_global_oldest_visible is the vacuum fast path; lock/unlock are the pin
pair; is_..._locked reports whether any pin is outstanding.
Invariant (balanced pins): every lock pairs with exactly one unlock;
the unlock asserts m_ov_lock_count > 0, tripping on double-unlock or
missing-lock. The log_tdes wrappers carry the pairing across function
boundaries: the unlock lives in log_complete (log_manager.c), while the
matching lock is taken on the locator side (locator_sr.c, e.g. in
xlocator_upgrade_instances_domain just before heap_vacuum_all_objects, and
in redistribute_partition_data) so the watermark stays pinned while that
operation reads get_global_oldest_visible(). A leaked pin freezes
m_oldest_visible forever and stops reclamation.
9.5 reset_transaction_lowest_active — clearing a slot at transaction end
A finished transaction’s slot must return to MVCCID_NULL. This is the only
writer of MVCCID_NULL into the per-tran array from the commit path:
// mvcctable::reset_transaction_lowest_active -- src/transaction/mvcc_table.cppvoid mvcctable::reset_transaction_lowest_active (int tran_index){ oldest_active_set (m_transaction_lowest_visible_mvccids[tran_index], tran_index, MVCCID_NULL, oldest_active_event::RESET);}Ordering against LOG_COMMIT is the reason for the pin (9.4). log_complete
does, in order: append commit/abort record → drop pin → then, if
committed, reset the slot:
// log_complete (commit tail) -- src/transaction/log_manager.clog_append_commit_log (thread_p, tdes, &commit_lsa);/* ... */tdes->unlock_global_oldest_visible_mvccid (); /* <- drop the pin first */if (iscommitted == LOG_COMMIT) log_Gl.mvcc_table.reset_transaction_lowest_active (LOG_FIND_THREAD_TRAN_INDEX (thread_p));complete_mvcc (Ch.8) already set the slot to this transaction’s own mvccid,
and the pin held the watermark steady; only after LOG_COMMIT is appended is
the pin dropped and the slot reset. Resetting earlier would let vacuum clean
modifications a post-crash recovery expects to exist.
9.6 The vacuum side — master heartbeat and per-record consumption
On every vacuum_master_task::execute the master refreshes and captures the
watermark; it also runs at vacuum_boot (“for debug only”) and
vacuum_data_load_and_recover, so a fresh server always has a watermark before
the first job:
// vacuum_master_task::execute -- src/query/vacuum.cm_oldest_visible_mvccid = log_Gl.mvcc_table.update_global_oldest_visible ();The master gates block eligibility on it:
// vacuum_master_task::is_cursor_entry_ready_to_vacuum -- src/query/vacuum.cif (m_cursor.get_current_entry ().newest_mvccid >= m_oldest_visible_mvccid) return false; /* <- newest still visible; whole block not vacuumable */Blocks are scanned in blockid order and a later block cannot be ready if the
current one is not, so the master breaks on the first not-ready block.
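The break-on-first-not-ready rule amounts to counting a ready prefix. The sketch below assumes blocks arrive already sorted by blockid and reduces each entry to the two fields the gate needs; `toy_block` and `count_ready_blocks` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using MVCCID = std::uint64_t;

struct toy_block
{
  std::uint64_t blockid;
  MVCCID newest_mvccid;   // newest MVCCID logged in the block
};

// Number of leading blocks (in blockid order) that are ready to vacuum.
inline std::size_t count_ready_blocks (const std::vector<toy_block> &blocks, MVCCID oldest_visible)
{
  std::size_t ready = 0;
  for (const toy_block &block : blocks)
    {
      if (block.newest_mvccid >= oldest_visible)
        {
          break;          // newest id may still be visible: stop at this block
        }
      ++ready;
    }
  return ready;
}
```

With a watermark of 30 and blocks whose newest ids are 10, 20, 35 and 40, the first two blocks are ready and the scan stops at the third.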
9.6.1 vacuum_process_log_block — capturing the threshold per job
Each worker re-reads the watermark into a local threshold_mvccid, then a
debug-only tripwire (compiled under #if !defined (NDEBUG)) bounds every op three ways:
// vacuum_process_log_block -- src/query/vacuum.cMVCCID threshold_mvccid = log_Gl.mvcc_table.get_global_oldest_visible (); /* <- one atomic load */#if !defined (NDEBUG)if (MVCC_ID_FOLLOW_OR_EQUAL (mvccid, threshold_mvccid) /* not yet below watermark? */ || MVCC_ID_PRECEDES (mvccid, data->oldest_visible_mvccid) /* older than block floor? */ || MVCC_ID_PRECEDES (data->newest_mvccid, mvccid)) /* newer than block ceiling? */ { assert (0); logpb_fatal_error (thread_p, true, ARG_FILE_LINE, "vacuum_process_log_block"); goto end; }#endifVACUUM_DATA_ENTRY::oldest_visible_mvccid (captured at log time, 9.6.3) bounds
the job from below, live threshold_mvccid from above. An op at-or-above the
current watermark should never reach a job — the master gate would have deferred
its block — hence the assert. Gathered heap objects go to vacuum_heap_page,
which carries threshold_mvccid to the record predicate.
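Written as a positive predicate, the tripwire says that every op in a job must sit inside a half-open window. This is a sketch assuming `MVCC_ID_PRECEDES` is `<` and `MVCC_ID_FOLLOW_OR_EQUAL` is `>=`; the function name `op_within_job_bounds` is hypothetical.

```cpp
#include <cassert>
#include <cstdint>

using MVCCID = std::uint64_t;

// True when the op passes all three checks, i.e. the tripwire would NOT fire.
constexpr bool op_within_job_bounds (MVCCID mvccid, MVCCID threshold_mvccid,
                                     MVCCID block_oldest_visible, MVCCID block_newest)
{
  return mvccid < threshold_mvccid           // strictly below the live watermark
         && mvccid >= block_oldest_visible   // not older than the block floor
         && mvccid <= block_newest;          // not newer than the block ceiling
}
```

For a block logged with floor 100 and ceiling 200, and a live watermark of 250, any id from 100 to 200 passes; an id of 99 or 201 trips the check, and so does any id once the watermark is at or below it.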
9.6.2 mvcc_satisfies_vacuum — the per-record verdict
vacuum_heap_page asserts MVCCID_IS_NORMAL (threshold_mvccid) and dispatches
on the Ch.7 predicate’s verdict for each candidate record:
// vacuum_heap_page (per-record) -- src/query/vacuum.chelper.can_vacuum = mvcc_satisfies_vacuum (thread_p, &helper.mvcc_header, threshold_mvccid);if (helper.can_vacuum == VACUUM_RECORD_REMOVE) vacuum_heap_record (thread_p, &helper); /* <- whole version dies */else if (helper.can_vacuum == VACUUM_RECORD_DELETE_INSID_PREV_VER) vacuum_heap_record_insid_and_prev_version (thread_p, &helper); /* <- shrink header *//* else VACUUM_RECORD_CANNOT_VACUUM: leave it */The predicate (whose body is dissected in Ch.7) takes oldest_mvccid, which
here is the watermark — so m_oldest_visible alone decides each verdict
against the record header:
| Record state vs. watermark | Verdict | Effect |
|---|---|---|
| Deleter committed and delete MVCCID < watermark | VACUUM_RECORD_REMOVE | Remove entirely. |
| Not deleted (or deleted >= watermark) and insert all-visible / < watermark | VACUUM_RECORD_DELETE_INSID_PREV_VER | Keep version; trim insert MVCCID + prev-version LSA. |
| Inserted >= watermark, or insert not all-visible | VACUUM_RECORD_CANNOT_VACUUM | A live snapshot may need it; leave it. |
The predicate is invoked from the per-record job path in vacuum_heap_page;
two other sites exist — is_not_vacuumed_and_lost (a consistency check against
vacuum_Data.oldest_unvacuumed_mvccid) and vacuum_rv_check_at_undo (the
undo-time recheck against get_global_oldest_visible()) — but neither is on the
block-job hot path. All three feed the same monotonic watermark lineage.
9.6.3 Block-level capture of the watermark at log time
VACUUM_DATA_ENTRY records the watermark as of when the block was logged, so
the job keeps that floor even after the global watermark advances:
// struct vacuum_data_entry -- src/query/vacuum.cstruct vacuum_data_entry { VACUUM_LOG_BLOCKID blockid; LOG_LSA start_lsa; // lsa of last mvcc op log record in block MVCCID oldest_visible_mvccid; // oldest visible MVCCID while block was logged MVCCID newest_mvccid; // newest MVCCID in log block // ...};On append, oldest_visible_mvccid is asserted <= get_global_oldest_visible()
and >= the previous block’s value — the 9.1 monotonicity invariant projected
onto the block stream.
9.7 The structural limitation — one small writer pins everything
The watermark is a global minimum, only ever as fresh as the oldest live
snapshot. An idle T_old holding a small MVCCID keeps its slot small, so every
sweep takes it as the min and m_oldest_visible freezes:
flowchart LR
TOLD["T_old (small MVCCID, idle)"] -->|slot stays small| ARR["per-tran array min"]
ARR --> WM["m_oldest_visible frozen"]
WM --> PRED["mvcc_satisfies_vacuum -> CANNOT_VACUUM"]
PRED --> ACC["dead versions accumulate"]
Figure 9-3 — A single long-running writer pins the global watermark and stalls reclamation database-wide.
This is inherent to a single watermark with no per-table or per-tablespace
scope: the slowest snapshot governs the system, much as a long transaction
holds back the xmin horizon (and with it autovacuum) in PostgreSQL. Remedies
are operational (bound transaction lifetime, avoid idle-in-transaction
sessions). The m_ov_lock_count pin (9.3/9.4) is a deliberate, bounded version
of the same freeze, scoped to one pinned operation window.
9.8 Chapter summary — key takeaways
- The watermark is one scalar, m_oldest_visible: the min MVCCID any live snapshot can see; vacuum reads it via get_global_oldest_visible, recomputed at the master heartbeat.
- compute_oldest_visible_mvccid sweeps the per-tran array lock-free, seeded by the monotonic floor: MVCCID_ALL_VISIBLE → defer, MVCCID_NULL → ignore, normal → min; the deferred slots re-check until they publish a value or drop to MVCCID_NULL (10 ms backoff every 20 spins), then assert a normal result.
- update_global_oldest_visible double-checks m_ov_lock_count (before sweep, before store), so a mid-sweep pin discards the new value; the store asserts monotonicity.
- The pin freezes the watermark across the pinned operation window; the lock is taken on the locator side, the unlock in log_complete, after which the slot is reset to MVCCID_NULL by reset_transaction_lowest_active.
- Vacuum consumes the watermark twice: master block gate (newest_mvccid >= m_oldest_visible_mvccid → skip), and worker threshold_mvccid into mvcc_satisfies_vacuum, which alone picks REMOVE / DELETE_INSID_PREV_VER / CANNOT_VACUUM.
- One small-MVCCID long-running writer pins the watermark and stalls reclamation database-wide. That is the cost of a single global minimum with no per-object scoping; the remedy is operational.
Chapter 10: Sub-Transactions and Special Paths
Chapters 3 through 9 traced the clean lifecycle: one transaction gets one
MVCCID, stamps a record, becomes visible at commit, is reclaimed by vacuum.
This chapter covers the paths that do not fit that model: sub-transactions
(savepoints / system operations) that need their own MVCCID while the parent is
still open and complete before it; MVCC-disabled classes (root class,
_db_serial, collation/HA cached-OID classes) whose records carry no MVCC
header fields; restart seeding via reset_start_mvccid; and the
2048-ring saturation question deferred from Chapter 5.
For the snapshot/visibility theory these paths perturb, see cubrid-mvcc.md.
For the sub-transaction boundary vs. lock escalation and SERIALIZABLE
write-skew, see §10.8.
10.1 The two structs, completed
This section completes the coverage of mvcc_info (per-transaction state,
Ch.1/Ch.3) and mvcc_trans_status (one ring slot, Ch.1/Ch.8) by adding the
fields that only the sub-transaction paths use.
// struct mvcc_info -- src/transaction/mvcc.h
struct mvcc_info
{
  MVCC_SNAPSHOT snapshot;                       /* MVCC Snapshot */
  MVCCID id;                                    /* the transaction's own MVCCID (Ch.3) */
  MVCCID recent_snapshot_lowest_active_mvccid;  /* fast-reject floor (Ch.4) */
  std::vector<MVCCID> sub_ids;                  /* MVCC sub-transaction ID array */
  bool is_sub_active;                           /* true while a sub-transaction is running */
  // ... methods condensed ...
};

| Field | Role | Why it exists |
|---|---|---|
snapshot | The read view (built Ch.5, consumed Ch.6). | One snapshot per worker per consistency window. |
id | The transaction’s own MVCCID, lazily allocated (Ch.3). | The stamp written into record headers by top-level writes. |
recent_snapshot_lowest_active_mvccid | Floor below which any MVCCID is definitely inactive — fast-reject gate in mvcc_is_active_id (Ch.4). | Avoids a global probe for old IDs. |
sub_ids | LIFO stack of MVCCIDs for nested sub-transactions, newest at back(). | A parent may need several MVCCIDs over its life, one per open system operation; they must complete in reverse order. |
is_sub_active | Flag set true while a sub-transaction owns the “current” write identity. | Signals the active write MVCCID is sub_ids.back(), not id. Mirrored in copy_to but never read inside the MVCC core — informational state for passive servers. |
sub_ids is a stack, not a set. logtb_assign_subtransaction_mvccid only
push_backs; logtb_complete_sub_mvcc only pop_backs the value read from
back(). Out-of-order completion would pop the wrong id; CUBRID’s
system-operation/savepoint machinery guarantees the required LIFO nesting.
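The stack discipline can be illustrated with a minimal sketch. It is a simplified model, not the CUBRID source: sub_stack_info, push_sub, and complete_innermost_sub are hypothetical names, and 0 stands in for an unstamped parent id.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

using MVCCID = std::uint64_t;

// Simplified stand-in for the id / sub_ids part of mvcc_info.
struct sub_stack_info
{
  MVCCID id = 0;                 // parent id; 0 means "not stamped" in this sketch
  std::vector<MVCCID> sub_ids;   // newest sub-transaction id at back()
};

// Assignment only ever appends.
void push_sub (sub_stack_info &info, MVCCID sub_id)
{
  assert (info.id != 0);         // the parent must be stamped first
  info.sub_ids.push_back (sub_id);
}

// Completion reads back() and pops exactly that element.
MVCCID complete_innermost_sub (sub_stack_info &info)
{
  assert (!info.sub_ids.empty ());
  MVCCID done = info.sub_ids.back ();
  info.sub_ids.pop_back ();
  return done;
}
```

Pushing 101 and then 102 onto a parent with id 100 means the first completion retires 102 and the second retires 101. There is no way to name an arbitrary sub-id to complete, which is why correct behavior depends on the callers nesting properly.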
// struct mvcc_trans_status -- src/transaction/mvcc_table.hpp
struct mvcc_trans_status
{
  enum event_type { COMMIT, ROLLBACK, SUBTRAN };
  mvcc_active_tran m_active_mvccs;       /* the bit-area + long-list snapshot */
  MVCCID m_last_completed_mvccid;        // just for info
  event_type m_event_type;               // just for info
  std::atomic<version_type> m_version;
  // ... methods condensed ...
};

| Field | Role | Why it exists |
|---|---|---|
m_active_mvccs | The active-transaction set (bit-area + long-list) snapshotted in this slot. | The payload a snapshot builder copies (Ch.4/Ch.5). |
m_last_completed_mvccid | MVCCID of the completion that produced this slot. Diagnostic only. | Debug/trace aid; not read by visibility. |
m_event_type | Tags the completion: COMMIT, ROLLBACK, or SUBTRAN. | Diagnostic. Distinguishes a sub-completion from a top-level one. |
m_version | Monotonic counter bumped on every status transition; re-read by snapshot builders to detect mid-copy mutation (the Ch.5 retry loop). | The lock-free consistency check letting readers copy without holding the mutex. |
Role matrix for m_event_type:
| Producing call | m_event_type | Advances oldest? |
|---|---|---|
complete_mvcc(.., committed=true) | COMMIT | Yes — committed work may raise the floor. |
complete_mvcc(.., committed=false) | ROLLBACK | Yes — rolled-back ID leaves the active set. |
complete_sub_mvcc | SUBTRAN intended; actually left untouched (§10.5 bug) | No — parent still open, so the sub-id can never be the lowest. |
10.2 Allocating a sub-transaction MVCCID
logtb_get_new_subtransaction_mvccid is the entry point when a system
operation or savepoint needs to write under its own identity.

// logtb_get_new_subtransaction_mvccid -- src/transaction/log_tran_table.c
void
logtb_get_new_subtransaction_mvccid (THREAD_ENTRY * thread_p, MVCC_INFO * curr_mvcc_info)
{
  MVCCID mvcc_subid;
  mvcctable *mvcc_table = &log_Gl.mvcc_table;

  if (MVCCID_IS_VALID (curr_mvcc_info->id))
    {
      mvcc_subid = mvcc_table->get_new_mvccid ();  /* parent already has an id */
    }
  else
    {
      mvcc_table->get_two_new_mvccid (curr_mvcc_info->id, mvcc_subid);  /* seed parent + sub */
    }
  logtb_assign_subtransaction_mvccid (thread_p, curr_mvcc_info, mvcc_subid);
}

There are two branches, chosen by whether the parent already owns an MVCCID.
Parent id valid → allocate one via get_new_mvccid. Parent id NULL → the parent
never wrote (Ch.3’s lazy allocation never fired); since visibility (Ch.6)
requires the parent’s id to precede its sub’s, get_two_new_mvccid pulls two
consecutive ids under one lock, the first to the parent (by reference) and the
second to the sub:
// mvcctable::get_two_new_mvccid -- src/transaction/mvcc_table.cpp
void
mvcctable::get_two_new_mvccid (MVCCID &first, MVCCID &second)
{
  m_new_mvccid_lock.lock ();
  first = log_Gl.hdr.mvcc_next_id;
  MVCCID_FORWARD (log_Gl.hdr.mvcc_next_id);
  second = log_Gl.hdr.mvcc_next_id;
  MVCCID_FORWARD (log_Gl.hdr.mvcc_next_id);
  m_new_mvccid_lock.unlock ();
}

Invariant: parent id strictly precedes every sub-id. Either branch keeps the
parent below the sub (the sub is drawn later from the same counter, or first
vs. second = first + 1), and in the two-id case both ids are taken under one
m_new_mvccid_lock so nothing slots between them. Visibility relies on this so
a sub-stamped record reads as “newer” than a parent-stamped one.
flowchart TD
A["logtb_get_new_subtransaction_mvccid"] --> B{"MVCCID_IS_VALID(curr_mvcc_info->id)?"}
B -- "yes (parent stamped)" --> C["get_new_mvccid() -> mvcc_subid"]
B -- "no (parent unstamped)" --> D["get_two_new_mvccid(id, mvcc_subid)<br/>id := first, mvcc_subid := second"]
C --> E["logtb_assign_subtransaction_mvccid"]
D --> E
E --> F["sub_ids.push_back(mvcc_subid)"]
Figure 10-1: branch structure of sub-transaction MVCCID allocation.
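To make the two-ids-under-one-lock argument concrete, here is a minimal sketch, assuming a standalone counter guarded by a std::mutex. The class toy_mvccid_allocator and the function mvccid_forward are illustrative stand-ins (the real counter is log_Gl.hdr.mvcc_next_id and the real step is the MVCCID_FORWARD macro); this is not the CUBRID implementation.

```cpp
#include <cassert>
#include <cstdint>
#include <mutex>

using MVCCID = std::uint64_t;
constexpr MVCCID TOY_MVCCID_FIRST = 4;   // same value as MVCCID_FIRST

// Function form of the step macro: increment, then skip the reserved band.
inline void mvccid_forward (MVCCID &id)
{
  ++id;
  if (id < TOY_MVCCID_FIRST)
    {
      id = TOY_MVCCID_FIRST;
    }
}

// Hypothetical allocator owning its own counter and lock.
class toy_mvccid_allocator
{
  public:
    explicit toy_mvccid_allocator (MVCCID next) : m_next (next) {}

    MVCCID get_new ()
    {
      std::lock_guard<std::mutex> guard (m_lock);
      MVCCID id = m_next;
      mvccid_forward (m_next);
      return id;
    }

    // Both ids leave one critical section, so no other caller can be
    // handed an id that falls between them.
    void get_two_new (MVCCID &first, MVCCID &second)
    {
      std::lock_guard<std::mutex> guard (m_lock);
      first = m_next;
      mvccid_forward (m_next);
      second = m_next;
      mvccid_forward (m_next);
    }

  private:
    std::mutex m_lock;
    MVCCID m_next;
};
```

Starting the counter at 4, get_two_new hands 4 to the parent and 5 to the sub, and the next get_new caller receives 6. Calling get_new twice instead would give the same ordering but would allow another thread to receive the id in between.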
logtb_assign_subtransaction_mvccid carries the load-bearing assertion:
// logtb_assign_subtransaction_mvccid -- src/transaction/log_tran_table.c
static void
logtb_assign_subtransaction_mvccid (THREAD_ENTRY * thread_p, MVCC_INFO * curr_mvcc_info, MVCCID mvcc_subid)
{
  assert (MVCCID_IS_VALID (curr_mvcc_info->id));  /* <- parent MUST be stamped by now */
  curr_mvcc_info->sub_ids.push_back (mvcc_subid);
}

By the time we push, the parent’s id is valid (valid on entry, or just set by
get_two_new_mvccid); a push onto an unstamped parent is a bug, caught here in
debug builds.
10.3 A parent sees its own sub-transaction writes
A sub-stamped record must read back to the parent (and its later subs) as
“written by me”, never as foreign active work. That is the job of
logtb_is_current_mvccid, reached through Chapter 6’s
MVCC_IS_REC_INSERTED_BY_ME / MVCC_IS_REC_DELETED_BY_ME macros.
// logtb_is_current_mvccid -- src/transaction/log_tran_table.c
bool
logtb_is_current_mvccid (THREAD_ENTRY * thread_p, MVCCID mvccid)
{
  // ... condensed: tdes lookup + assert ...
  MVCC_INFO *curr_mvcc_info = &tdes->mvccinfo;
  if (curr_mvcc_info->id == mvccid)
    {
      return true;  /* the parent's own id */
    }
  else if (curr_mvcc_info->sub_ids.size () > 0)
    {
      for (size_t i = 0; i < curr_mvcc_info->sub_ids.size (); i++)
        {
          if (curr_mvcc_info->sub_ids[i] == mvccid)
            {
              return true;  /* one of my sub-transactions */
            }
        }
    }
  return false;
}

Every exit: (1) id == mvccid → true, the parent’s top-level write. (2) Else, if
sub_ids is non-empty, linear-scan the whole vector (i < size(), not just
back()): nested system operations may stack several sub-ids, and a record
from an earlier still-open sub must also count as “mine”. (3) Empty or no
match → false, falling through to the snapshot-based active check.
MVCC_IS_REC_INSERTED_BY_ME expands straight to
logtb_is_current_mvccid (thread_p, rec_header->mvcc_ins_id).
The companions logtb_find_current_mvccid / logtb_get_current_mvccid resolve
the write identity from the other end: sub_ids.back() if non-empty (innermost
open sub), else id. So a write while a sub is open is stamped with the sub-id,
and logtb_is_current_mvccid guarantees it reads back as “mine”.
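The write-side and read-side rules fit together as in the following minimal sketch. It assumes a stripped-down state struct; write_identity, current_write_id, and is_mine are hypothetical names that model the described behavior of logtb_find_current_mvccid and logtb_is_current_mvccid, not their actual code.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

using MVCCID = std::uint64_t;

// Stripped-down per-transaction state: parent id plus open sub-ids.
struct write_identity
{
  MVCCID id = 0;
  std::vector<MVCCID> sub_ids;   // innermost open sub at back()
};

// Write side: stamp with the innermost open sub-id, else the parent id.
MVCCID current_write_id (const write_identity &w)
{
  return w.sub_ids.empty () ? w.id : w.sub_ids.back ();
}

// Read side: the parent id or ANY open sub-id counts as "mine".
bool is_mine (const write_identity &w, MVCCID mvccid)
{
  if (w.id == mvccid)
    {
      return true;
    }
  return std::find (w.sub_ids.begin (), w.sub_ids.end (), mvccid) != w.sub_ids.end ();
}
```

For a parent 100 with open subs 101 and 102, new writes are stamped 102, while records stamped 100, 101, or 102 all read back as owned; an unrelated id such as 103 does not.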
mvcc_is_active_id (Ch.4) layers the fast-reject floor on top:
// mvcc_is_active_id -- src/transaction/mvcc.c
STATIC_INLINE bool
mvcc_is_active_id (THREAD_ENTRY * thread_p, MVCCID mvccid)
{
  // ... condensed: tdes lookup + assert ...
  MVCC_INFO *curr_mvcc_info = &tdes->mvccinfo;
  if (MVCC_ID_PRECEDES (mvccid, curr_mvcc_info->recent_snapshot_lowest_active_mvccid))
    {
      return false;  /* below the floor: definitely inactive */
    }
  if (logtb_is_current_mvccid (thread_p, mvccid))
    {
      return true;  /* mine (parent or any sub) */
    }
  return log_Gl.mvcc_table.is_active (mvccid);  /* foreign: global probe */
}

stateDiagram-v2
[*] --> CheckFloor
CheckFloor --> Inactive: mvccid precedes recent_lowest
CheckFloor --> CheckMine: at or above floor
CheckMine --> Active: id or any sub_id matches
CheckMine --> GlobalProbe: no local match
GlobalProbe --> Active: mvcc_table.is_active true
GlobalProbe --> Inactive: not in active set
Figure 10-2: mvcc_is_active_id — the local sub_ids check sits between the
cheap floor reject and the expensive global probe.
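The ordering of the three stages can be modeled in a few lines. This is a sketch under stated assumptions: the global active set is a plain std::set, "precedes" is taken to be a plain less-than comparison, and toy_tran, probe_stage, probe_result, and toy_is_active_id are hypothetical names that do not exist in CUBRID.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <set>
#include <vector>

using MVCCID = std::uint64_t;

enum class probe_stage { FLOOR_REJECT, LOCAL_MATCH, GLOBAL_PROBE };

struct toy_tran
{
  MVCCID id;
  std::vector<MVCCID> sub_ids;
  MVCCID recent_lowest_active;   // fast-reject floor
};

struct probe_result
{
  bool active;
  probe_stage decided_by;        // which stage produced the answer
};

probe_result toy_is_active_id (const toy_tran &tran, const std::set<MVCCID> &global_active, MVCCID mvccid)
{
  if (mvccid < tran.recent_lowest_active)
    {
      return { false, probe_stage::FLOOR_REJECT };   // cheapest exit
    }
  bool mine = (mvccid == tran.id)
              || std::find (tran.sub_ids.begin (), tran.sub_ids.end (), mvccid) != tran.sub_ids.end ();
  if (mine)
    {
      return { true, probe_stage::LOCAL_MATCH };     // no shared state touched
    }
  return { global_active.count (mvccid) != 0, probe_stage::GLOBAL_PROBE };
}
```

Returning the deciding stage is purely for illustration; it makes visible that ids below the floor and the transaction's own ids never reach the shared table.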
10.4 Completing a sub-transaction
A sub ends before its parent. logtb_complete_sub_mvcc runs the
per-transaction half, then patches the parent’s live snapshot.

// logtb_complete_sub_mvcc -- src/transaction/log_tran_table.c
void
logtb_complete_sub_mvcc (THREAD_ENTRY * thread_p, LOG_TDES * tdes)
{
  MVCC_INFO *curr_mvcc_info = &tdes->mvccinfo;
  mvcctable *mvcc_table = &log_Gl.mvcc_table;            /* global table handle */
  MVCCID mvcc_sub_id = curr_mvcc_info->sub_ids.back ();  /* innermost open sub */

  mvcc_table->complete_sub_mvcc (mvcc_sub_id);           /* global half */
  curr_mvcc_info->sub_ids.pop_back ();                   /* drop it from the stack */

  if (tdes->mvccinfo.snapshot.valid)
    {
      MVCC_SNAPSHOT *snapshot = &tdes->mvccinfo.snapshot;
      if (mvcc_sub_id >= snapshot->highest_completed_mvccid)
        {
          snapshot->highest_completed_mvccid = mvcc_sub_id;
          MVCCID_FORWARD (snapshot->highest_completed_mvccid);
        }
      snapshot->m_active_mvccs.set_inactive_mvccid (mvcc_sub_id);
    }
}

Branches: (1) Read sub_ids.back() (LIFO, §10.1), call the global
complete_sub_mvcc (§10.5), then pop_back. After the pop,
logtb_is_current_mvccid no longer matches the sub-id for new reads, so the
fix-up must repair the existing snapshot. (2) Valid snapshot → if
mvcc_sub_id >= highest_completed_mvccid, raise the ceiling to one past the
sub-id; then unconditionally clear the sub-id from the active bit-area
(set_inactive_mvccid). (3) No valid snapshot (READ COMMITTED between
statements, or none yet) → skip; the next build_mvcc_info picks up the global
state updated in step 1.
Invariant: a parent’s snapshot never loses sight of its own committed sub-transaction. The sub-id was allocated after the snapshot’s ceiling, so an unpatched snapshot would judge it “too new”; the ceiling-raise plus active-set clear repair both halves of the Ch.6 predicate so the parent reads its own sub’s rows immediately.
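The two-part repair can be sketched with an explicit set in place of the bit-area. This is a simplified model, not the CUBRID code: toy_snapshot, is_hidden, and patch_for_completed_sub are hypothetical names, the hidden test assumes that ids at or above highest_completed_mvccid count as too new, and the loop that marks newly covered ids as active is a modeling assumption standing in for what the bit-area does when it is extended.

```cpp
#include <cassert>
#include <cstdint>
#include <set>

using MVCCID = std::uint64_t;

struct toy_snapshot
{
  bool valid = false;
  MVCCID highest_completed_mvccid = 0;   // ids at or above this are "too new"
  std::set<MVCCID> active;               // ids below the ceiling still in flight
};

// Assumed shape of the snapshot test: true means hidden from this snapshot.
bool is_hidden (const toy_snapshot &snap, MVCCID id)
{
  if (id >= snap.highest_completed_mvccid)
    {
      return true;
    }
  return snap.active.count (id) != 0;
}

void patch_for_completed_sub (toy_snapshot &snap, MVCCID sub_id)
{
  if (!snap.valid)
    {
      return;   // nothing to repair; the next snapshot build sees global state
    }
  if (sub_id >= snap.highest_completed_mvccid)
    {
      // Modeling assumption: ids newly covered by the raised ceiling stay active.
      for (MVCCID id = snap.highest_completed_mvccid; id < sub_id; ++id)
        {
          snap.active.insert (id);
        }
      snap.highest_completed_mvccid = sub_id + 1;   // one past the sub-id
    }
  snap.active.erase (sub_id);   // unconditional, as in the described fix-up
}
```

With a ceiling of 100 and a sub-id of 103, the record is hidden before the patch and visible after it, while ids 100 through 102 remain hidden because they were never completed from this snapshot's point of view.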
10.5 The SUBTRAN event in the global ring
mvcctable::complete_sub_mvcc is the global counterpart. It is almost identical
to complete_mvcc (Ch.8) but omits the oldest-active recompute.

// mvcctable::complete_sub_mvcc -- src/transaction/mvcc_table.cpp
void
mvcctable::complete_sub_mvcc (MVCCID mvccid)
{
  assert (MVCCID_IS_VALID (mvccid));
  std::unique_lock<std::mutex> ulock (m_active_trans_mutex);  /* only one status change at a time */

  mvcc_trans_status::version_type next_version;
  size_t next_index;
  mvcc_trans_status &next_status = next_trans_status_start (next_version, next_index);

  // update current trans status
  m_current_trans_status.m_active_mvccs.set_inactive_mvccid (mvccid);
  m_current_trans_status.m_last_completed_mvccid = mvccid;
  m_current_trans_status.m_last_completed_mvccid = mvcc_trans_status::SUBTRAN;  /* source-as-is; see note */

  next_tran_status_finish (next_status, next_index);  /* publish new ring slot */
  ulock.unlock ();
  // mvccid can't be lowest, so no need to update it here
}

Walkthrough: (1) Take m_active_trans_mutex. (2) next_trans_status_start
bumps m_version and reserves and invalidates the next slot (Ch.8’s version
protocol; the bump makes a concurrent snapshot copy retry). (3) Clear the
sub-id from the current status’s active set. (4) Record the info fields, then
publish via next_tran_status_finish (copies the active set into the reserved
slot, advances m_trans_status_history_position). (5) No
advance_oldest_active, for the reason the source comment gives: an open parent
holds an older id, so the sub-id can never be the lowest active id and
therefore cannot hold back the oldest-visible watermark (Ch.9). The double
assignment to m_last_completed_mvccid looks like a copy-paste slip vs.
complete_mvcc: the second line presumably meant m_event_type. It is harmless
since both fields are // just for info (Open Question #2).
flowchart TD
A["complete_sub_mvcc(mvccid)"] --> B["lock m_active_trans_mutex"]
B --> C["next_trans_status_start<br/>bump m_version, reserve slot"]
C --> D["m_current.set_inactive_mvccid(mvccid)"]
D --> E["record info fields"]
E --> F["next_tran_status_finish<br/>copy active set, advance position"]
F --> G["unlock"]
G --> H["return (NO advance_oldest_active)"]
Figure 10-3: complete_sub_mvcc flow — note the absent oldest-active recompute.
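The start/update/finish sequence can be walked through on a toy ring. This is a single-threaded sketch under loud assumptions: the ring has 4 slots instead of 2048, there is no mutex and no atomic version, the next slot is assumed to be (position + 1) masked to the ring size, and toy_status, toy_ring_table, and complete_sub are hypothetical names rather than the mvcctable API.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <set>

using MVCCID = std::uint64_t;

constexpr std::size_t TOY_HISTORY_SIZE = 4;                 // the real ring holds 2048 slots
constexpr std::size_t TOY_INDEX_MASK = TOY_HISTORY_SIZE - 1;

struct toy_status
{
  std::set<MVCCID> active;       // stand-in for the bit-area + long list
  std::uint64_t version = 0;
};

struct toy_ring_table
{
  toy_status current;
  std::array<toy_status, TOY_HISTORY_SIZE> history;
  std::size_t position = 0;

  void complete_sub (MVCCID mvccid)
  {
    // "start": pick the next slot and stamp it with the bumped version
    std::size_t next_index = (position + 1) & TOY_INDEX_MASK;
    std::uint64_t next_version = ++current.version;
    history[next_index].version = next_version;

    // update the current status
    current.active.erase (mvccid);

    // "finish": copy the active set into the slot and advance the position
    history[next_index].active = current.active;
    position = next_index;

    // deliberately no lowest-active recompute on this path
  }
};
```

After four completions on this 4-slot ring the position is back at slot 0, which now carries version 4: a slot is reused once the ring has cycled, and only the version stamp tells a reader that its content has changed.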
10.6 MVCC-disabled classes
mvcc_is_mvcc_disabled_class decides participation purely from the class OID:

// mvcc_is_mvcc_disabled_class -- src/transaction/mvcc.c
bool
mvcc_is_mvcc_disabled_class (const OID * class_oid)
{
  if (OID_ISNULL (class_oid) || OID_IS_ROOTOID (class_oid))
    {
      return true;  /* root class (the class-of-classes) */
    }
  if (oid_is_serial (class_oid))
    {
      return true;  /* _db_serial: serial/auto-increment generators */
    }
  if (oid_check_cached_class_oid (OID_CACHE_COLLATION_CLASS_ID, class_oid))
    {
      return true;  /* _db_collation */
    }
  if (oid_check_cached_class_oid (OID_CACHE_HA_APPLY_INFO_CLASS_ID, class_oid))
    {
      return true;  /* HA apply-info catalog */
    }
  return false;  /* normal MVCC class */
}

| Branch | Class | Why MVCC is disabled |
|---|---|---|
OID_ISNULL \|\| OID_IS_ROOTOID | Root class (schema metaclass) | Catalog bootstrap; cannot itself be versioned. |
oid_is_serial | _db_serial | Generated values must be globally visible at once; a versioned serial would let two txns draw the same value. |
OID_CACHE_COLLATION_CLASS_ID | _db_collation | Effectively static metadata; in-place is cheaper. |
OID_CACHE_HA_APPLY_INFO_CLASS_ID | HA apply-info | Replication progress must be observed without snapshot lag. |
What “MVCC disabled” means for a record: its header carries no
OR_MVCC_FLAG_VALID_INSID flag, so mvcc_ins_id reads as MVCCID_ALL_VISIBLE,
and every visibility/vacuum entry guards on that value. In
mvcc_satisfies_snapshot (Ch.6) the first branch short-circuits to
SNAPSHOT_SATISFIED (always visible); the perfmon block below skips its
..._LOST accounting via the same != MVCCID_ALL_VISIBLE test; and the vacuum
predicate mvcc_satisfies_vacuum asks the identical question through the
MVCC_IS_HEADER_INSID_NOT_ALL_VISIBLE macro:
// mvcc_satisfies_snapshot guard -- src/transaction/mvcc.c
if (rec_header->mvcc_ins_id != MVCCID_ALL_VISIBLE && vacuum_is_mvccid_vacuumed (rec_header->mvcc_ins_id))
  {
    perfmon_mvcc_snapshot (thread_p, PERF_SNAPSHOT_SATISFIES_SNAPSHOT,
                           PERF_SNAPSHOT_RECORD_INSERTED_COMMITED_LOST, PERF_SNAPSHOT_VISIBLE);
  }
// ... condensed: same != MVCCID_ALL_VISIBLE guard recurs in mvcc_satisfies_delete / _dirty ...

Invariant: a disabled-class record is never handed to the active/visible
machinery. With mvcc_ins_id == MVCCID_ALL_VISIBLE, those guards read false,
so the row is committed-visible and un-vacuumable on the insert side. Callers
memoize the verdict per class (heap insert) rather than re-walk it per row.
Cross-check. The function header comment lists “root class and
_db_serial,db_partition”, but the code checks collation and HA apply-info, not partition. The comment is stale; the branch table above follows the actual oid_check_cached_class_oid branches.
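The "no flag means all-visible" rule can be sketched as follows. This is an illustration only: toy_rec_header, effective_ins_id, and insert_side_needs_check are hypothetical names, and the bit value chosen for TOY_FLAG_VALID_INSID is an assumption, not taken from object_representation_constants.h.

```cpp
#include <cassert>
#include <cstdint>

using MVCCID = std::uint64_t;
constexpr MVCCID TOY_MVCCID_ALL_VISIBLE = 3;     // same value as MVCCID_ALL_VISIBLE
constexpr unsigned TOY_FLAG_VALID_INSID = 0x1;   // hypothetical bit position

struct toy_rec_header
{
  unsigned flags;
  MVCCID stored_ins_id;   // meaningful only when the flag is set
};

// A header without the insert-id flag reads back as "visible to all".
MVCCID effective_ins_id (const toy_rec_header &hdr)
{
  return (hdr.flags & TOY_FLAG_VALID_INSID) != 0 ? hdr.stored_ins_id : TOY_MVCCID_ALL_VISIBLE;
}

// The guard shape used by the visibility and vacuum entry points.
bool insert_side_needs_check (const toy_rec_header &hdr)
{
  return effective_ins_id (hdr) != TOY_MVCCID_ALL_VISIBLE;
}
```

A record of a disabled class (flags 0) never passes the guard, so none of the snapshot or vacuum logic behind it runs; a normal record stamped with id 120 does pass it.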
10.7 Restart seeding — re-anchoring the bit-area
After recovery, the in-memory mvcctable is re-anchored to the MVCCID counter
restored into log_Gl.hdr.mvcc_next_id:

// mvcctable::reset_start_mvccid -- src/transaction/mvcc_table.cpp
void
mvcctable::reset_start_mvccid ()
{
  m_current_trans_status.m_active_mvccs.reset_start_mvccid (log_Gl.hdr.mvcc_next_id);

  assert (m_trans_status_history_position < HISTORY_MAX_SIZE);
  m_trans_status_history[m_trans_status_history_position].m_active_mvccs.reset_start_mvccid (log_Gl.hdr.mvcc_next_id);

  m_current_status_lowest_active_mvccid.store (log_Gl.hdr.mvcc_next_id);
}

Three places are seeded from the same restored counter: (1) the current
status’s active-set start (m_bit_area_start_mvccid; the mvcc_active_tran half
is shown below); (2) the current ring slot’s active-set start, while the other
2047 slots stay untouched and are overwritten lazily as completions cycle the
ring (Ch.8); (3) the cached m_current_status_lowest_active_mvccid scalar, set
to mvcc_next_id (no active transactions yet).
// mvcc_active_tran::reset_start_mvccid -- src/transaction/mvcc_active_tran.cpp
void
mvcc_active_tran::reset_start_mvccid (MVCCID mvccid)
{
  m_bit_area_start_mvccid = mvccid;
  if (m_initialized)
    {
      check_valid ();  /* debug: bits past length must be zero */
    }
}

Invariant: after restart, the active-set origin equals the next-to-issue
MVCCID, with an empty active region. Every id the recovered database issues is
>= mvcc_next_id, so the bit-area starts empty and correctly positioned. The
function is called from the boot path in log_manager.c and from three points
in log_recovery.c (after analysis, after redo, after the final pass). It is
explicitly marked // not thread safe and runs single-threaded, before any
worker can build a snapshot.
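The three seeded places can be shown on a reduced model. This is a sketch, assuming a 4-slot ring and a bare origin field in place of the full bit-area; toy_active_origin and toy_restart_table are hypothetical names, and the real function reads the counter from log_Gl.hdr.mvcc_next_id instead of taking a parameter.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

using MVCCID = std::uint64_t;

// Only the origin of the active set matters for this sketch.
struct toy_active_origin
{
  MVCCID bit_area_start = 0;

  void reset_start (MVCCID mvccid)
  {
    bit_area_start = mvccid;
  }
};

struct toy_restart_table
{
  toy_active_origin current;                      // current status
  std::array<toy_active_origin, 4> history {};    // the real ring holds 2048 slots
  std::size_t position = 0;                       // current ring slot
  MVCCID lowest_active = 0;                       // cached scalar

  void reset_start_mvccid (MVCCID next_id)
  {
    current.reset_start (next_id);                // (1) current status
    history[position].reset_start (next_id);      // (2) current ring slot only
    lowest_active = next_id;                      // (3) cached lowest-active value
  }
};
```

If the ring position is 2 and the restored counter is 5000, the current status, slot 2, and the cached scalar all read 5000 afterwards, while slots 0, 1, and 3 keep whatever they held before.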
10.8 Open questions
1. 2048-ring saturation under a slow snapshot build. Chapter 5’s build_mvcc_info copies a slot’s m_active_mvccs, then re-checks the captured trans_status_version against m_version.load (), resetting and looping on mismatch. The retry defends against a single concurrent mutation, but whether it is provably safe against a full-ring (HISTORY_MAX_SIZE = 2048) overwrite during one uninterrupted copy_to, where the slot mutates without a distinguishable version change, is not established by code or comments. It is not expected in practice, but no explicit bound enforces it.
2. complete_sub_mvcc informational-field bug (§10.5). The double assignment to m_last_completed_mvccid (the second writes the SUBTRAN enum) plus the never-assigned m_event_type look like a copy-paste slip vs. complete_mvcc. Harmless (both are // just for info), but a reader trusting the field on a sub-tran slot gets the enum value, not an MVCCID.
3. is_sub_active write path. Copied in mvcc_info::copy_to but never set by the MVCC core code read here; its producer lives in the savepoint/system-operation layer. For visibility, sub_ids emptiness is the operative signal.

For the sub-transaction boundary vs. lock acquisition, escalation, and
SERIALIZABLE write-skew detection — where MVCC visibility alone is
insufficient and locks must close the gap — see the lock-manager detail
companion (cubrid-lock-manager-detail.md), chapters on escalation and
serializable conflict handling.
10.9 Chapter summary — key takeaways
- Sub-transactions get their own MVCCIDs on a LIFO sub_ids stack. logtb_get_new_subtransaction_mvccid allocates one id (parent stamped) or two atomically via get_two_new_mvccid (parent unstamped), always keeping the parent below every sub-id.
- A parent sees its own and its subs’ writes via logtb_is_current_mvccid, which checks id and then linear-scans the whole sub_ids vector (not just the top), so an earlier still-open sub’s write also reads back as “mine”.
- Sub-completion is a snapshot fix-up, not a vacuum event. logtb_complete_sub_mvcc bumps the snapshot’s highest_completed_mvccid past the sub-id and clears it from the active set; mvcctable::complete_sub_mvcc publishes a new ring slot (intended to be tagged SUBTRAN, see §10.5) but skips advance_oldest_active since an open parent’s sub can never be the oldest.
- MVCC-disabled classes carry no insert id. mvcc_is_mvcc_disabled_class returns true for the root class, _db_serial, _db_collation, and HA apply-info; their records read MVCCID_ALL_VISIBLE, short-circuiting every visibility/vacuum guard to “always visible, never reclaimed”.
- Restart re-anchors the table from the log header. reset_start_mvccid sets the current-status and current-ring-slot bit-area origins plus the cached lowest-active scalar to log_Gl.hdr.mvcc_next_id, leaving the active region empty; it runs single-threaded during boot and at three points in recovery.
- Two latent issues are open questions: a full-ring (2048) overwrite during one in-flight snapshot copy, and the complete_sub_mvcc informational-field double assignment. Neither affects correctness in observed paths.

Position hints as of this revision
| Symbol | File | Line |
|---|---|---|
OR_MVCC_INSERT_ID_OFFSET | src/base/object_representation.h | 483 |
OR_MVCC_DELETE_ID_OFFSET | src/base/object_representation.h | 486 |
OR_MVCC_PREV_VERSION_LSA_OFFSET | src/base/object_representation.h | 490 |
OR_GET_MVCC_FLAG | src/base/object_representation.h | 548 |
OR_MVCC_MAX_HEADER_SIZE | src/base/object_representation_constants.h | 142 |
OR_MVCC_MIN_HEADER_SIZE | src/base/object_representation_constants.h | 145 |
OR_MVCC_FLAG_MASK | src/base/object_representation_constants.h | 160 |
OR_MVCC_FLAG_VALID_INSID | src/base/object_representation_constants.h | 165 |
OR_MVCC_FLAG_VALID_DELID | src/base/object_representation_constants.h | 168 |
OR_MVCC_FLAG_VALID_PREV_VERSION | src/base/object_representation_constants.h | 171 |
or_mvcc_get_header | src/base/object_representation_sr.c | 4237 |
or_mvcc_set_header | src/base/object_representation_sr.c | 4296 |
or_mvcc_add_header | src/base/object_representation_sr.c | 4381 |
or_mvcc_get_flag | src/base/object_representation_sr.c | 4473 |
or_mvcc_set_flag | src/base/object_representation_sr.c | 4488 |
or_mvcc_get_insid | src/base/object_representation_sr.c | 4517 |
or_mvcc_set_insid | src/base/object_representation_sr.c | 4544 |
or_mvcc_get_delid | src/base/object_representation_sr.c | 4564 |
or_mvcc_get_chn | src/base/object_representation_sr.c | 4592 |
or_mvcc_set_delid | src/base/object_representation_sr.c | 4617 |
or_mvcc_set_chn | src/base/object_representation_sr.c | 4638 |
or_mvcc_set_prev_version_lsa | src/base/object_representation_sr.c | 4654 |
or_mvcc_get_prev_version_lsa | src/base/object_representation_sr.c | 4680 |
PERF_SNAPSHOT_SATISFIES_SNAPSHOT | src/base/perf_monitor.h | 238 |
PERF_SNAPSHOT_RECORD_INSERTED_VACUUMED | src/base/perf_monitor.h | 246 |
PERF_SNAPSHOT_RECORD_INSERTED_COMMITED_LOST | src/base/perf_monitor.h | 250 |
PERF_SNAPSHOT_RECORD_INSERTED_DELETED | src/base/perf_monitor.h | 252 |
PERF_SNAPSHOT_RECORD_DELETED_COMMITTED_LOST | src/base/perf_monitor.h | 257 |
perfmon_mvcc_snapshot | src/base/perf_monitor.h | 1693 |
mvcc_header_size_lookup | src/object/object_representation.c | 70 |
vacuum_data_entry | src/query/vacuum.c | 104 |
vacuum_boot | src/query/vacuum.c | 1291 |
vacuum_heap_page | src/query/vacuum.c | 1577 |
vacuum_master_task::execute | src/query/vacuum.c | 3002 |
vacuum_master_task::is_cursor_entry_ready_to_vacuum | src/query/vacuum.c | 3106 |
vacuum_process_log_block | src/query/vacuum.c | 3251 |
is_not_vacuumed_and_lost | src/query/vacuum.c | 7379 |
vacuum_rv_check_at_undo | src/query/vacuum.c | 7627 |
vacuum_is_mvccid_vacuumed | src/query/vacuum.h | 271 |
heap_get_mvcc_header | src/storage/heap_file.c | 7747 |
heap_attrinfo_transform_header_to_disk | src/storage/heap_file.c | 11937 |
heap_mvcc_log_insert | src/storage/heap_file.c | 16371 |
heap_rv_mvcc_redo_insert | src/storage/heap_file.c | 16442 |
heap_get_mvcc_rec_header_from_overflow | src/storage/heap_file.c | 19541 |
heap_insert_adjust_recdes_header | src/storage/heap_file.c | 20540 |
NULL_CHN | src/storage/storage_common.h | 66 |
MVCCID | src/storage/storage_common.h | 186 |
MVCCID_NULL | src/storage/storage_common.h | 327 |
MVCCID_ALL_VISIBLE | src/storage/storage_common.h | 329 |
MVCCID_FIRST | src/storage/storage_common.h | 330 |
MVCCID_IS_NORMAL | src/storage/storage_common.h | 335 |
MVCCID_FORWARD | src/storage/storage_common.h | 343 |
xlocator_upgrade_instances_domain | src/transaction/locator_sr.c | 12126 |
log_Gl.mvcc_table | src/transaction/log_impl.h | 707 |
mvcc_next_id | src/transaction/log_storage.hpp | 131 |
logtb_expand_trantable | src/transaction/log_tran_table.c | 251 |
logtb_define_trantable | src/transaction/log_tran_table.c | 366 |
logtb_get_number_of_total_tran_indices | src/transaction/log_tran_table.c | 696 |
logtb_rv_assign_mvccid_for_undo_recovery | src/transaction/log_tran_table.c | 1115 |
logtb_invalidate_snapshot_data | src/transaction/log_tran_table.c | 3861 |
logtb_find_current_mvccid | src/transaction/log_tran_table.c | 3910 |
logtb_get_current_mvccid | src/transaction/log_tran_table.c | 3939 |
logtb_is_current_mvccid | src/transaction/log_tran_table.c | 3972 |
logtb_get_mvcc_snapshot | src/transaction/log_tran_table.c | 4007 |
logtb_complete_mvcc | src/transaction/log_tran_table.c | 4050 |
logtb_get_new_subtransaction_mvccid | src/transaction/log_tran_table.c | 4547 |
logtb_assign_subtransaction_mvccid | src/transaction/log_tran_table.c | 4578 |
logtb_complete_sub_mvcc | src/transaction/log_tran_table.c | 4593 |
log_tdes::lock_global_oldest_visible_mvccid | src/transaction/log_tran_table.c | 6220 |
log_tdes::unlock_global_oldest_visible_mvccid | src/transaction/log_tran_table.c | 6230 |
MVCC_IS_REC_INSERTER_ACTIVE | src/transaction/mvcc.c | 46 |
MVCC_IS_REC_DELETER_ACTIVE | src/transaction/mvcc.c | 49 |
MVCC_IS_REC_INSERTER_IN_SNAPSHOT | src/transaction/mvcc.c | 52 |
MVCC_IS_REC_DELETER_IN_SNAPSHOT | src/transaction/mvcc.c | 55 |
MVCC_IS_REC_INSERTED_SINCE_MVCCID | src/transaction/mvcc.c | 58 |
MVCC_IS_REC_DELETED_SINCE_MVCCID | src/transaction/mvcc.c | 61 |
mvcc_is_id_in_snapshot | src/transaction/mvcc.c | 90 |
mvcc_is_active_id | src/transaction/mvcc.c | 122 |
mvcc_satisfies_snapshot | src/transaction/mvcc.c | 156 |
mvcc_is_not_deleted_for_snapshot | src/transaction/mvcc.c | 280 |
mvcc_satisfies_vacuum | src/transaction/mvcc.c | 321 |
mvcc_satisfies_delete | src/transaction/mvcc.c | 389 |
mvcc_satisfies_dirty | src/transaction/mvcc.c | 513 |
mvcc_is_mvcc_disabled_class | src/transaction/mvcc.c | 628 |
mvcc_snapshot::copy_to | src/transaction/mvcc.c | 679 |
mvcc_info::copy_to | src/transaction/mvcc.c | 714 |
mvcc_rec_header | src/transaction/mvcc.h | 38 |
MVCC_REC_HEADER_INITIALIZER | src/transaction/mvcc.h | 47 |
MVCC_IS_HEADER_DELID_VALID | src/transaction/mvcc.h | 87 |
MVCC_IS_HEADER_INSID_NOT_ALL_VISIBLE | src/transaction/mvcc.h | 91 |
MVCC_IS_HEADER_ALL_VISIBLE | src/transaction/mvcc.h | 95 |
MVCC_IS_REC_INSERTED_BY_ME | src/transaction/mvcc.h | 118 |
MVCC_IS_REC_DELETED_BY_ME | src/transaction/mvcc.h | 122 |
MVCC_IS_REC_DELETED_BY | src/transaction/mvcc.h | 130 |
MVCC_ID_PRECEDES | src/transaction/mvcc.h | 141 |
MVCC_ID_FOLLOW_OR_EQUAL | src/transaction/mvcc.h | 142 |
MVCC_GET_PREV_VERSION_LSA | src/transaction/mvcc.h | 156 |
mvcc_satisfies_snapshot_result | src/transaction/mvcc.h | 159 |
MVCC_SNAPSHOT_FUNC | src/transaction/mvcc.h | 171 |
mvcc_snapshot | src/transaction/mvcc.h | 173 |
mvcc_info | src/transaction/mvcc.h | 196 |
mvcc_satisfies_delete_result | src/transaction/mvcc.h | 222 |
mvcc_satisfies_vacuum_result | src/transaction/mvcc.h | 232 |
mvcc_active_tran::mvcc_active_tran | src/transaction/mvcc_active_tran.cpp | 31 |
mvcc_active_tran::initialize | src/transaction/mvcc_active_tran.cpp | 47 |
mvcc_active_tran::finalize | src/transaction/mvcc_active_tran.cpp | 62 |
mvcc_active_tran::reset | src/transaction/mvcc_active_tran.cpp | 74 |
mvcc_active_tran::long_tran_max_size | src/transaction/mvcc_active_tran.cpp | 99 |
mvcc_active_tran::bit_size_to_unit_size | src/transaction/mvcc_active_tran.cpp | 105 |
mvcc_active_tran::units_to_bits | src/transaction/mvcc_active_tran.cpp | 111 |
mvcc_active_tran::units_to_bytes | src/transaction/mvcc_active_tran.cpp | 117 |
mvcc_active_tran::get_mask_of | src/transaction/mvcc_active_tran.cpp | 123 |
mvcc_active_tran::get_bit_offset | src/transaction/mvcc_active_tran.cpp | 129 |
mvcc_active_tran::get_mvccid | src/transaction/mvcc_active_tran.cpp | 135 |
mvcc_active_tran::get_unit_of | src/transaction/mvcc_active_tran.cpp | 141 |
mvcc_active_tran::is_set | src/transaction/mvcc_active_tran.cpp | 147 |
mvcc_active_tran::get_area_size | src/transaction/mvcc_active_tran.cpp | 153 |
mvcc_active_tran::get_bit_area_memsize | src/transaction/mvcc_active_tran.cpp | 159 |
mvcc_active_tran::compute_highest_completed_mvccid | src/transaction/mvcc_active_tran.cpp | 171 |
mvcc_active_tran::compute_lowest_active_mvccid | src/transaction/mvcc_active_tran.cpp | 220 |
mvcc_active_tran::copy_to | src/transaction/mvcc_active_tran.cpp | 280 |
mvcc_active_tran::is_active | src/transaction/mvcc_active_tran.cpp | 318 |
mvcc_active_tran::remove_long_transaction | src/transaction/mvcc_active_tran.cpp | 356 |
mvcc_active_tran::add_long_transaction | src/transaction/mvcc_active_tran.cpp | 377 |
mvcc_active_tran::ltrim_area | src/transaction/mvcc_active_tran.cpp | 386 |
mvcc_active_tran::set_bitarea_mvccid | src/transaction/mvcc_active_tran.cpp | 414 |
mvcc_active_tran::cleanup_migrate_to_long_transations | src/transaction/mvcc_active_tran.cpp | 462 |
mvcc_active_tran::set_inactive_mvccid | src/transaction/mvcc_active_tran.cpp | 493 |
mvcc_active_tran::reset_start_mvccid | src/transaction/mvcc_active_tran.cpp | 506 |
mvcc_active_tran::reset_active_transactions | src/transaction/mvcc_active_tran.cpp | 517 |
mvcc_active_tran::check_valid | src/transaction/mvcc_active_tran.cpp | 525 |
mvcc_active_tran | src/transaction/mvcc_active_tran.hpp | 31 |
mvcc_active_tran::unit_type | src/transaction/mvcc_active_tran.hpp | 63 |
BITAREA_MAX_SIZE | src/transaction/mvcc_active_tran.hpp | 65 |
mvcc_active_tran::BITAREA_MAX_SIZE | src/transaction/mvcc_active_tran.hpp | 65 |
UNIT_BIT_COUNT | src/transaction/mvcc_active_tran.hpp | 69 |
BITAREA_MAX_MEMSIZE | src/transaction/mvcc_active_tran.hpp | 71 |
BITAREA_MAX_BITS | src/transaction/mvcc_active_tran.hpp | 72 |
ALL_ACTIVE | src/transaction/mvcc_active_tran.hpp | 74 |
mvcc_active_tran::ALL_ACTIVE | src/transaction/mvcc_active_tran.hpp | 74 |
mvcc_active_tran::ALL_COMMITTED | src/transaction/mvcc_active_tran.hpp | 75 |
mvcc_active_tran::m_bit_area | src/transaction/mvcc_active_tran.hpp | 78 |
mvcc_active_tran::m_bit_area_start_mvccid | src/transaction/mvcc_active_tran.hpp | 80 |
mvcc_active_tran::m_bit_area_length | src/transaction/mvcc_active_tran.hpp | 82 |
mvcc_active_tran::m_long_tran_mvccids | src/transaction/mvcc_active_tran.hpp | 85 |
mvcc_active_tran::m_long_tran_mvccids_length | src/transaction/mvcc_active_tran.hpp | 87 |
mvcc_active_tran::m_initialized | src/transaction/mvcc_active_tran.hpp | 89 |
oldest_active_set | src/transaction/mvcc_table.cpp | 92 |
oldest_active_get | src/transaction/mvcc_table.cpp | 102 |
mvcc_trans_status::mvcc_trans_status | src/transaction/mvcc_table.cpp | 116 |
mvcc_trans_status::initialize | src/transaction/mvcc_table.cpp | 128 |
mvcc_trans_status::finalize | src/transaction/mvcc_table.cpp | 135 |
mvcctable::advance_oldest_active | src/transaction/mvcc_table.cpp | 142 |
mvcctable::mvcctable | src/transaction/mvcc_table.cpp | 164 |
mvcctable::initialize | src/transaction/mvcc_table.cpp | 184 |
mvcctable::alloc_transaction_lowest_active | src/transaction/mvcc_table.cpp | 199 |
mvcctable::finalize | src/transaction/mvcc_table.cpp | 212 |
mvcctable::build_mvcc_info | src/transaction/mvcc_table.cpp | 226 |
mvcctable::compute_oldest_visible_mvccid | src/transaction/mvcc_table.cpp | 355 |
mvcctable::is_active | src/transaction/mvcc_table.cpp | 422 |
mvcctable::next_trans_status_start | src/transaction/mvcc_table.cpp | 441 |
mvcctable::next_tran_status_finish | src/transaction/mvcc_table.cpp | 455 |
mvcctable::complete_mvcc | src/transaction/mvcc_table.cpp | 465 |
mvcctable::complete_sub_mvcc | src/transaction/mvcc_table.cpp | 541 |
mvcctable::get_new_mvccid | src/transaction/mvcc_table.cpp | 565 |
mvcctable::get_two_new_mvccid | src/transaction/mvcc_table.cpp | 579 |
mvcctable::reset_transaction_lowest_active | src/transaction/mvcc_table.cpp | 593 |
mvcctable::reset_start_mvccid | src/transaction/mvcc_table.cpp | 599 |
mvcctable::get_global_oldest_visible | src/transaction/mvcc_table.cpp | 611 |
mvcctable::update_global_oldest_visible | src/transaction/mvcc_table.cpp | 617 |
mvcctable::lock_global_oldest_visible | src/transaction/mvcc_table.cpp | 632 |
mvcctable::unlock_global_oldest_visible | src/transaction/mvcc_table.cpp | 638 |
mvcctable::is_global_oldest_visible_locked | src/transaction/mvcc_table.cpp | 645 |
mvcc_trans_status | src/transaction/mvcc_table.hpp | 40 |
mvcctable | src/transaction/mvcc_table.hpp | 64 |
HISTORY_MAX_SIZE | src/transaction/mvcc_table.hpp | 97 |
mvcctable::HISTORY_MAX_SIZE | src/transaction/mvcc_table.hpp | 97 |
HISTORY_INDEX_MASK | src/transaction/mvcc_table.hpp | 98 |
mvcctable::m_transaction_lowest_visible_mvccids | src/transaction/mvcc_table.hpp | 101 |
mvcctable::m_current_status_lowest_active_mvccid | src/transaction/mvcc_table.hpp | 104 |
mvcctable::m_current_trans_status | src/transaction/mvcc_table.hpp | 107 |
m_trans_status_history_position | src/transaction/mvcc_table.hpp | 110 |
mvcctable::m_trans_status_history_position | src/transaction/mvcc_table.hpp | 110 |
mvcctable::m_trans_status_history | src/transaction/mvcc_table.hpp | 111 |
mvcctable::m_oldest_visible | src/transaction/mvcc_table.hpp | 118 |
mvcctable::m_ov_lock_count | src/transaction/mvcc_table.hpp | 119 |
Sources
- cubrid-mvcc.md: the high-level companion (design intent, theory).
- Raw analyses under raw/code-analysis/cubrid/storage/mvcc/.
- Code: src/transaction/mvcc.{h,c}, mvcc_table.{hpp,cpp}, mvcc_active_tran.{hpp,cpp}; MVCC record headers in src/storage/heap_file.c; vacuum coordination in src/transaction/vacuum.c.
- Methodology: knowledge/methodology/code-analysis-detail-doc.md.