CUBRID Page Buffer Manager — Code-Level Deep Dive
Where this document fits: The high-level analysis cubrid-page-buffer-manager.md covers design intent and theoretical background. This document traces every branch and field at the code level. Each chapter is self-contained, but reading in order follows the full lifecycle of a single data page inside the buffer pool: fix, latch, dirty, flush, victimize.
Chapter 1: Data-Structure Map
Field-level reference for every struct the page buffer manager owns. The
high-level companion (cubrid-page-buffer-manager.md) explains why a
buffer manager needs a BCB, a VPID hash, a free list, and multi-zone LRUs —
see its ### Buffer Control Block — PGBUF_BCB and ### Three-zone LRU lists sections; we do not repeat that framing. What the companion
simplified — the BCB latch, drawn there as a pthread_mutex_t + latch_mode pair — is corrected here against the real source. All structs
are TU-private to src/storage/page_buffer.c except pgbuf_watcher /
PGBUF_LATCH_MODE (public, in page_buffer.h); the single global is
pgbuf_Pool of type PGBUF_BUFFER_POOL. Each table below uses one
Role & rationale column to keep field coverage exhaustive at low cost.
1.1 The packed-word vocabulary
Two BCB words are bit-packed; later chapters read them through accessor
macros. The flags word (volatile int, 32 bits): BCB-flag bits in the
high byte, the zone in bits 16-19, the LRU index in the low 16 bits.
```c
// PGBUF zone + index layout -- src/storage/page_buffer.c
#define PGBUF_LRU_NBITS 16
#define PGBUF_LRU_INDEX_MASK (PGBUF_LRU_LIST_MAX_COUNT - 1)   /* 0x0000FFFF */

  /* zone enumerators (enclosing declaration condensed) */
  PGBUF_LRU_1_ZONE = 1 << PGBUF_LRU_NBITS,              /* 0x00010000 */
  PGBUF_LRU_2_ZONE = 2 << PGBUF_LRU_NBITS,              /* 0x00020000 */
  PGBUF_LRU_3_ZONE = 3 << PGBUF_LRU_NBITS,              /* 0x00030000 */
  PGBUF_LRU_ZONE_MASK = PGBUF_LRU_1_ZONE | PGBUF_LRU_2_ZONE | PGBUF_LRU_3_ZONE,
  PGBUF_INVALID_ZONE = 1 << (PGBUF_LRU_NBITS + 2),      /* 0x00040000 */
  PGBUF_VOID_ZONE = 2 << (PGBUF_LRU_NBITS + 2),         /* 0x00080000 */
```

The two-bit skip is deliberate: the LRU zones use bits 16-17, and INVALID/VOID
jump to bits 18-19 so their masks never collide with an LRU zone value that sets
bit 16 or 17. The flag bits sit in the top byte (PGBUF_BCB_DIRTY_FLAG
0x80000000, then ..._FLUSHING_TO_DISK, ..._VICTIM_DIRECT,
..._INVALIDATE_DIRECT_VICTIM, ..._MOVE_TO_LRU_BOTTOM, ..._TO_VACUUM,
..._ASYNC_FLUSH_REQ, descending one bit each). Read high-to-low, the word is
[flag byte | reserved | zone 16-19 | lru index 0-15]. Flag semantics
belong to Chapter 6.
The second packed word, count_fix_and_avoid_dealloc, splits via
PGBUF_BCB_COUNT_FIX_SHIFT_BITS (16) and PGBUF_BCB_AVOID_DEALLOC_MASK
(0x0000FFFF): high 16 a saturating fix counter (hot-page detection), low 16
an atomically-mutated avoid-deallocation count. Fused into one int because
2-byte atomics are not portable — the field comment says so verbatim.
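The two packed words can be decoded with plain masks and shifts. The sketch below mirrors the layout described above, but every name in it (kLruNBits, zone_of, bump_fix_count, and so on) is hypothetical, and unsigned 32-bit words stand in for the int fields of the real source. The exact saturation rule of the fix counter is assumed to be "stop at 0xFFFF".

```cpp
#include <cassert>
#include <cstdint>

// Layout constants as described in 1.1. All identifiers here are hypothetical
// stand-ins for the PGBUF_* macros, not the real definitions.
constexpr uint32_t kLruNBits = 16;
constexpr uint32_t kLruIndexMask = (1u << kLruNBits) - 1;   // 0x0000FFFF
constexpr uint32_t kLru1Zone = 1u << kLruNBits;             // 0x00010000
constexpr uint32_t kLru2Zone = 2u << kLruNBits;             // 0x00020000
constexpr uint32_t kLru3Zone = 3u << kLruNBits;             // 0x00030000
constexpr uint32_t kLruZoneMask = kLru1Zone | kLru2Zone | kLru3Zone;
constexpr uint32_t kInvalidZone = 1u << (kLruNBits + 2);    // 0x00040000
constexpr uint32_t kVoidZone = 2u << (kLruNBits + 2);       // 0x00080000
constexpr uint32_t kZoneMask = kLruZoneMask | kInvalidZone | kVoidZone;
constexpr uint32_t kDirtyFlag = 0x80000000u;

// INVALID/VOID must not overlap the two LRU zone bits or the index bits.
static_assert((kInvalidZone & kLruZoneMask) == 0, "INVALID overlaps LRU zones");
static_assert((kVoidZone & kLruZoneMask) == 0, "VOID overlaps LRU zones");
static_assert((kZoneMask & kLruIndexMask) == 0, "zone overlaps LRU index");

constexpr uint32_t zone_of(uint32_t flags) { return flags & kZoneMask; }
constexpr bool in_lru(uint32_t flags) { return (flags & kLruZoneMask) != 0; }
constexpr uint32_t lru_index_of(uint32_t flags) { return flags & kLruIndexMask; }

// Second packed word: high 16 bits fix counter, low 16 bits avoid-dealloc count.
constexpr uint32_t kCountFixShift = 16;
constexpr uint32_t kAvoidDeallocMask = 0x0000FFFFu;

constexpr uint32_t fix_count_of(uint32_t w) { return w >> kCountFixShift; }
constexpr uint32_t avoid_dealloc_of(uint32_t w) { return w & kAvoidDeallocMask; }

// Saturating increment of the high half (assumed rule: never wrap past 0xFFFF).
constexpr uint32_t bump_fix_count(uint32_t w) {
  return fix_count_of(w) == 0xFFFFu ? w : w + (1u << kCountFixShift);
}
```

Because the low half is untouched by bump_fix_count, the avoid-dealloc count survives any number of fix-count bumps, which is the property that lets the two counters share one word.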
1.2 pgbuf_atomic_latch_impl — the real BCB latch
This is the single biggest correction to the high-level doc. The latch
is not a mutex plus a latch_mode int; it is a 64-bit atomic word
reinterpreted through a union:
```cpp
// PGBUF_ATOMIC_LATCH + union pgbuf_atomic_latch_impl -- page_buffer.h / page_buffer.c
typedef std::atomic<uint64_t> PGBUF_ATOMIC_LATCH;

union pgbuf_atomic_latch_impl
{
  uint64_t raw;
  struct
  {
    PGBUF_LATCH_MODE latch_mode;  /* uint16_t enum: NO/READ/WRITE/FLUSH/INVALID */
    uint16_t waiter_exists;       /* a thread is parked on next_wait_thrd */
    int32_t fcnt;                 /* current fix count under the latch */
  } impl;
};
```

The BCB’s atomic_latch field is the std::atomic<uint64_t> itself. Code
loads it (memory_order_acquire) into a stack PGBUF_ATOMIC_LATCH_IMPL,
edits the sub-fields, CAS-es the whole raw back (set_latch_and_fcnt,
set_latch_and_add_fcnt; get_latch does the read half).
| Field | Role & rationale |
|---|---|
raw | The whole 64-bit word; CAS updates mode+waiter+fcnt in one instruction, no hot-path mutex |
impl.latch_mode | Current mode (PGBUF_LATCH_MODE, uint16_t); separates shared-read from exclusive-write |
impl.waiter_exists | 1 if a thread is parked; tells the unlatcher to wake next_wait_thrd |
impl.fcnt | Fix count under the latch; read mode allows fcnt > 1, releases at 0 |
PGBUF_LATCH_MODE is uint16_t-backed to fit the union’s first two bytes:
PGBUF_NO_LATCH=0, _READ=1, _WRITE=2, _FLUSH=3 (block mode only — a page is
never fixed in flush mode), _INVALID=4.
Invariant: the union is only ever touched as a whole raw word. Writing latch_mode directly through the live atomic would tear the 64-bit word and lose fcnt/waiter_exists updates racing on the other halves. The BCB’s mutex guards list/flag transitions, not the latch. Chapter 5 traces the CAS loop branch-by-branch.
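The load, edit, CAS pattern on the whole word can be sketched as follows. This is a minimal illustration, not the real set_latch_and_fcnt: LatchView, pack, unpack and try_read_fix are hypothetical names, explicit shifts replace the union, and the grant rule (read fix allowed only when the mode is NO_LATCH or READ and nobody is waiting) is an assumption made for the example.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for PGBUF_LATCH_MODE (values taken from 1.2).
enum LatchMode : uint16_t { kNoLatch = 0, kRead = 1, kWrite = 2, kFlush = 3, kInvalid = 4 };

// Unpacked, stack-local view of the 64-bit latch word.
struct LatchView {
  LatchMode mode;
  uint16_t waiter_exists;
  int32_t fcnt;
};

// Pack/unpack with explicit shifts so no field is written through the live atomic.
inline uint64_t pack(LatchView v) {
  return static_cast<uint64_t>(v.mode)
       | (static_cast<uint64_t>(v.waiter_exists) << 16)
       | (static_cast<uint64_t>(static_cast<uint32_t>(v.fcnt)) << 32);
}

inline LatchView unpack(uint64_t raw) {
  LatchView v;
  v.mode = static_cast<LatchMode>(raw & 0xFFFFu);
  v.waiter_exists = static_cast<uint16_t>((raw >> 16) & 0xFFFFu);
  v.fcnt = static_cast<int32_t>(static_cast<uint32_t>(raw >> 32));
  return v;
}

// Try to add one read fix. Assumed policy: refuse if a writer/flusher holds the
// latch or a waiter is parked; otherwise bump fcnt and publish with one CAS.
inline bool try_read_fix(std::atomic<uint64_t>& latch) {
  uint64_t old_raw = latch.load(std::memory_order_acquire);
  for (;;) {
    LatchView v = unpack(old_raw);
    if ((v.mode != kNoLatch && v.mode != kRead) || v.waiter_exists != 0) {
      return false;
    }
    v.mode = kRead;
    v.fcnt += 1;
    // On failure compare_exchange_weak reloads old_raw, so the loop re-evaluates
    // the policy against the value another thread just installed.
    if (latch.compare_exchange_weak(old_raw, pack(v),
                                    std::memory_order_acq_rel,
                                    std::memory_order_acquire)) {
      return true;
    }
  }
}
```

The point of the sketch is that mode, waiter flag and fix count change together or not at all; a failed CAS costs one retry instead of a torn word.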
1.3 pgbuf_bcb — the buffer control block, field by field
```cpp
// struct pgbuf_bcb -- src/storage/page_buffer.c (condensed; SERVER_MODE fields elided)
struct pgbuf_bcb
{
  VPID vpid;
  PGBUF_ATOMIC_LATCH atomic_latch;           /* the 64-bit union latch from 1.2 */
  volatile int flags;                        /* flag byte | zone | lru index (1.1) */
  PGBUF_BCB *hash_next, *prev_BCB, *next_BCB;
  int tick_lru_list, tick_lru3;
  volatile int count_fix_and_avoid_dealloc;  /* two-purpose; see 1.1 */
  int hit_age;
  LOG_LSA oldest_unflush_lsa;
  PGBUF_IOPAGE_BUFFER *iopage_buffer;
  // ... condensed: mutex, owner_mutex, next_wait_thrd, latch_last_thread under #if SERVER_MODE ...
};
```

| Field | Role & rationale |
|---|---|
mutex (SM) | Per-BCB pthread_mutex_t; serializes list/flag transitions the latch doesn’t cover |
owner_mutex (SM) | Index of thread holding mutex; assert aid for wrong-owner/double unlock |
vpid | Volume+page id of resident page; the hash key |
atomic_latch | R/W page latch (union, 1.2); user-level latch off the kernel-mutex hot path |
flags | Packed flag-byte + zone + LRU index; one atomically-readable replacement/dirty word |
next_wait_thrd (SM) | FIFO head of threads blocked on the latch; waiter_exists points here |
latch_last_thread (SM) | Last thread that latched; diagnostic trail |
hash_next | Next BCB in hash bucket chain; collision chaining (1.5) |
prev_BCB | Previous LRU node; doubly-linked LRU gives O(1) unlink |
next_BCB | Next LRU node or free-list next; reused per the §1.3 invariant |
tick_lru_list | List tick when BCB entered its LRU; vs list tick_list to decide age boost |
tick_lru3 | Position stamp inside zone 3; tells victim-hint which zone-3 BCB is lowest |
count_fix_and_avoid_dealloc | Hi 16 fix count, lo 16 avoid-dealloc; hot-page detection fused with dealloc protection (1.1) |
hit_age | Age stamp of last hit; feeds activity/quota (Ch 10) |
oldest_unflush_lsa | Oldest LSA of unflushed change; WAL anchor — page not written until log durable (Ch 8) |
iopage_buffer | Pointer to this BCB’s payload slot; separates control metadata from aligned payload |
Invariant: next_BCB belongs to exactly one list at a time. A BCB is in an LRU list, the invalid free list, or transiently PGBUF_VOID_ZONE (neither). The flags zone field is the source of truth; zone change and next_BCB relink must be one critical section or a BCB appears in two lists. Chapter 7 traces the relink.
1.4 pgbuf_iopage_buffer — the page payload slot
```c
// struct pgbuf_iopage_buffer -- src/storage/page_buffer.c
struct pgbuf_iopage_buffer
{
  PGBUF_BCB *bcb;       /* back-pointer to owning BCB */
#if (__WORDSIZE == 32)
  int dummy;            /* pad so iopage starts 8-byte aligned */
#endif
  FILEIO_PAGE iopage;   /* the actual buffered IO page */
};
```

| Field | Role & rationale |
|---|---|
bcb | Back-pointer to owning pgbuf_bcb; a PAGE_PTR into iopage.page recovers the BCB via CAST_PGPTR_TO_BFPTR |
dummy (32-bit) | 4-byte filler; on 32-bit bcb is 4 bytes, so the pad pushes iopage to offset 8 to align LOG_LSA |
iopage | Embedded FILEIO_PAGE; one allocation holds control + on-disk image inline |
The first bytes of iopage are the header (prv), whose lsa and ptype
the WAL/recovery paths read directly:
```c
// struct fileio_page -- src/storage/file_io.h
struct fileio_page_reserved
{
  LOG_LSA lsa;
  INT32 pageid;
  INT16 volid;
  unsigned char ptype;
  unsigned char pflag;
  /* ... condensed ... */
};

struct fileio_page_watermark
{
  LOG_LSA lsa;                  /* duplicates prv.lsa */
};

struct fileio_page
{
  FILEIO_PAGE_RESERVED prv;     /* system area at start */
  char page[1];                 /* user area */
  FILEIO_PAGE_WATERMARK prv2;   /* end-of-page watermark, duplicates prv.lsa */
};
```

fileio_page is not header-only: a trailing prv2
(FILEIO_PAGE_WATERMARK) sits at end-of-page holding a copy of prv.lsa.
Because page[1] is a one-element placeholder for a user area that really
spans the rest of the page, the declared layout is logical, not literal:
fileio_get_page_watermark_pos computes prv2’s real address from the page
size rather than dereferencing the member.
Invariant: iopage_buffer->bcb round-trips and prv.lsa never moves backward. The back-pointer (bcb->iopage_buffer->bcb == bcb) is set once at init (Ch 2). prv.lsa is the page’s durable-recovery watermark: it advances as the page changes, is mirrored into prv2 at flush, and is what oldest_unflush_lsa and the WAL rule (Ch 8) compare against.
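The round-trip from a user-area pointer back to its BCB is ordinary offset arithmetic. The sketch below illustrates the idea behind CAST_PGPTR_TO_BFPTR without reproducing it: the struct and function names are hypothetical, PageHeader is a simplified stand-in for FILEIO_PAGE_RESERVED (an int64_t replaces LOG_LSA), and the pointer-width test replaces __WORDSIZE.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

struct Bcb;  // forward declaration of the control block stand-in

// Simplified stand-in for the reserved page header; lsa is 8 bytes wide.
struct PageHeader {
  int64_t lsa;
  int32_t pageid;
  int16_t volid;
  unsigned char ptype;
  unsigned char pflag;
};

// Header followed by the start of the user area.
struct IoPage {
  PageHeader prv;
  char page[1];
};

// Payload slot: back-pointer first, explicit pad on 32-bit targets.
struct IoPageBuffer {
  Bcb* bcb;
#if UINTPTR_MAX == 0xFFFFFFFFu
  int dummy;  // 4-byte pointer + 4-byte pad puts iopage at offset 8
#endif
  IoPage iopage;
};

struct Bcb {
  IoPageBuffer* iopage_buffer;
};

// The embedded page must start on an 8-byte boundary on every target.
static_assert(offsetof(IoPageBuffer, iopage) % 8 == 0, "iopage misaligned");

// Recover the owning BCB from a pointer to the user area by stepping back
// over the page header and the slot prefix, then following the back-pointer.
inline Bcb* bcb_from_page_ptr(char* pgptr) {
  const std::size_t back = offsetof(IoPageBuffer, iopage) + offsetof(IoPage, page);
  IoPageBuffer* slot = reinterpret_cast<IoPageBuffer*>(pgptr - back);
  return slot->bcb;
}
```

The static_assert is the compile-time form of the alignment argument in the table: remove the pad on a 32-bit build and the assertion is what would fail.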
1.5 The VPID hash: pgbuf_buffer_hash and pgbuf_buffer_lock
pgbuf_buffer_hash is { pthread_mutex_t hash_mutex; PGBUF_BCB *hash_next; PGBUF_BUFFER_LOCK *lock_next; }; pgbuf_buffer_lock is { VPID vpid; PGBUF_BUFFER_LOCK *lock_next; THREAD_ENTRY *next_wait_thrd; } (mutex/thread
fields under #if SERVER_MODE).
| Struct.Field | Role & rationale |
|---|---|
buffer_hash.hash_mutex | Bucket lock; protects both chains in the bucket |
buffer_hash.hash_next | Resident-BCB chain head; lookup walks it matching vpid (Ch 3) |
buffer_hash.lock_next | Pending PGBUF-lock chain head; VPIDs being read in, not yet a BCB |
buffer_lock.vpid | VPID reserved for read-in; a second fixer finds the in-flight read and waits |
buffer_lock.lock_next | Next lock record in bucket; chains concurrent in-flight reads |
buffer_lock.next_wait_thrd (SM) | Queue waiting on this read-in; woken when the page lands |
The buffer-lock table is fixed-size: one record per thread, because a thread has at most one outstanding read at a time. Chapter 4 traces how a miss claims a lock before allocating a BCB.
```mermaid
flowchart LR
  H["pgbuf_buffer_hash[bucket]"] -->|hash_next| B1["BCB"] -->|hash_next| B2["BCB"]
  H -->|lock_next| K1["pgbuf_buffer_lock vpid=A"] -->|lock_next| K2["pgbuf_buffer_lock vpid=B"]
  K1 -->|next_wait_thrd| T["waiting threads"]
```
Figure 1-1: one bucket anchors a resident-BCB chain and a pending-lock chain under one hash_mutex.
1.6 pgbuf_lru_list — one multi-zone LRU
struct pgbuf_lru_list holds, after #if SERVER_MODE pthread_mutex_t mutex:
PGBUF_BCB *top, *bottom, *bottom_1, *bottom_2; PGBUF_BCB *volatile victim_hint; int count_lru1/2/3, count_vict_cand, threshold_lru1/2, quota, tick_list, tick_lru3; volatile int flags; int index;.
| Field | Role & rationale |
|---|---|
mutex (SM) | List lock; protects link integrity |
top / bottom | Head (MRU) / tail (LRU); new/boosted link at top, eviction ends at bottom |
bottom_1 | Last BCB of zone 1 (NULL if empty); O(1) zone-1→2 boundary move |
bottom_2 | Last BCB of zone 2 (NULL if empty); zone 2/3 boundary marker |
victim_hint | Volatile victim-scan start; avoids re-walking pinned BCBs each search |
count_lru1/2/3 | Per-zone BCB counts; drive rebalancing and victim availability |
count_vict_cand | Victimizable BCB count; lets victim search skip empty lists (Ch 9) |
threshold_lru1/2 | Target sizes of zones 1/2; a BCB falls a zone when its zone exceeds threshold, zone 3 is the rest |
quota | Target size of a private list; adaptive per-session (Ch 10), unused for shared |
tick_list | Bumped on add/boost; BCB stores its entry tick, the difference gauges staleness |
tick_lru3 | Bumped when a BCB falls to zone 3; stamps bcb->tick_lru3 for victim-hint order |
flags | Per-list flag word; marks bulk/quota list state |
index | This list’s index in buf_LRU_list[]; stored into BCB flags low 16 so a BCB knows its home list |
Invariant: victim_hint may drift below the true first victim, and the code tolerates it. Everything below the hint should be dirty, but TPCC core dumps showed it sometimes sitting before the first victimizable BCB, a known unfixed bug flagged TODO. Consumers treat the hint as a start point and re-validate every candidate; trusting it as exact would skip valid victims or victimize a dirty page. Chapter 9 walks the scan.
1.7 pgbuf_invalid_list — the free pool
struct pgbuf_invalid_list is { pthread_mutex_t invalid_mutex (SERVER_MODE); PGBUF_BCB *invalid_top; int invalid_cnt; }.
| Field | Role & rationale |
|---|---|
invalid_mutex (SM) | Free-list lock; serializes push/pop of free BCBs |
invalid_top | Free-chain head (links via BCB next_BCB); a miss pops here before victimizing (Ch 4) |
invalid_cnt | Count of free BCBs; “pool empty → victimize” without walking the chain |
A free BCB is PGBUF_INVALID_ZONE and uses only next_BCB; prev_BCB is
unused (per §1.3 invariant).
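A singly linked free list threaded through next_BCB can be sketched in a few lines. This is an illustration only: the Zone enum, Bcb and InvalidList types and their members are hypothetical simplifications, locking is omitted (the real list is guarded by invalid_mutex in SERVER_MODE), and leaving a popped BCB in a "void" state is an assumption based on the transient PGBUF_VOID_ZONE described in §1.3.

```cpp
#include <cassert>

// Hypothetical three-way zone model; the real code packs this into flags.
enum class Zone { Invalid, Void, Lru };

struct Bcb {
  Bcb* next_bcb = nullptr;  // reused: LRU next or free-list next
  Zone zone = Zone::Void;
};

struct InvalidList {
  Bcb* invalid_top = nullptr;
  int invalid_cnt = 0;

  // Push a free BCB: zone change and relink happen together.
  void push(Bcb* b) {
    b->zone = Zone::Invalid;
    b->next_bcb = invalid_top;
    invalid_top = b;
    ++invalid_cnt;
  }

  // Pop a free BCB, or return nullptr so the caller can fall back to
  // victimization. The count answers "pool empty?" without walking the chain.
  Bcb* pop() {
    if (invalid_cnt == 0) {
      return nullptr;
    }
    Bcb* b = invalid_top;
    invalid_top = b->next_bcb;
    b->next_bcb = nullptr;
    b->zone = Zone::Void;  // in no list until the caller links it somewhere
    --invalid_cnt;
    return b;
  }
};
```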
1.8 The holder triad — per-thread fix bookkeeping
A thread records each fix in a pgbuf_holder it owns, not in the BCB.
Ownership chain: pgbuf_holder_anchor (per-thread head) → pgbuf_holder
(one per held page) → pgbuf_holder_set (the slab holders are carved from).
```c
// pgbuf_holder / _anchor / _set -- src/storage/page_buffer.c (condensed)
struct pgbuf_holder
{
  int fix_count;
  PGBUF_BCB *bufptr;
  PGBUF_HOLDER *thrd_link, *next_holder;   /* hold-list / free-list links */
  PGBUF_HOLDER_STAT perf_stat;
  /* #if !NDEBUG: char fixed_at[64*1024]; int fixed_at_size; */
  int watch_count;
  PGBUF_WATCHER *first_watcher, *last_watcher;
};
// pgbuf_holder_anchor: { int num_free_cnt, num_hold_cnt; PGBUF_HOLDER *thrd_free_list, *thrd_hold_list; }
// pgbuf_holder_set: { PGBUF_HOLDER element[PGBUF_NUM_ALLOC_HOLDER /*==10*/]; PGBUF_HOLDER_SET *next_set; }
```

| Struct.Field | Role & rationale |
|---|---|
holder.fix_count | Re-fix depth on this BCB; only the last unfix releases the latch |
holder.bufptr | The held BCB; links per-thread holder to shared BCB |
holder.thrd_link | Next in hold list; lets pgbuf_unfix_all walk every held page |
holder.next_holder | Next in free list; recycles slots without re-alloc |
holder.perf_stat | PGBUF_HOLDER_STAT flags; perf accounting of page usage |
holder.fixed_at/fixed_at_size (dbg) | Fix call-site capture; debug where-fixed tracing |
holder.watch_count | Watchers attached here; ordered-fix watchers (Ch 10) hang off the holder |
holder.first_watcher/last_watcher | Watcher list ends; O(1) append/detach during ordered fix |
anchor.num_free_cnt/num_hold_cnt | Free/used counters; fast “need a new slab?” decision |
anchor.thrd_free_list/thrd_hold_list | Free/hold list heads; thread’s private view of its fixes |
holder_set.element[10] | Slab of 10 pre-allocated holders; handed out in batches, never returned |
holder_set.next_set | Next slab; the global free-holder pool is a list of slabs |
1.9 pgbuf_watcher — the ordered-fix bit-field
```c
// struct pgbuf_watcher -- src/storage/page_buffer.h (debug magic/strings elided)
struct pgbuf_watcher
{
  PAGE_PTR pgptr;
  PGBUF_WATCHER *next, *prev;
  PGBUF_ORDERED_GROUP group_id;   /* VPID of group's HEAP header */
  unsigned latch_mode:7;          /* requested latch mode */
  unsigned page_was_unfixed:1;    /* set if any refix occurred */
  unsigned initial_rank:4;        /* rank at init */
  unsigned curr_rank:4;           /* rank after fix */
};
```

| Field | Role & rationale |
|---|---|
pgptr | The watched page handle; what the caller reads/writes |
next/prev | Links in the holder’s watcher list; a page may carry several, O(1) detach |
group_id | VPID of the grouping heap header; ordered fix orders pages within a group to avoid deadlock |
latch_mode:7 | Requested latch mode; 7 bits cover the small enum, packed tight |
page_was_unfixed:1 | Set on unfix+refix during reorder; cached pgptr may have moved, revalidate |
initial_rank:4 | Rank at watcher init; desired fix order before reorder |
curr_rank:4 | Rank after fixing; detects out-of-order fixes |
Chapter 10 traces the ordered-fix reorder loop driving these bits.
1.10 pgbuf_buffer_pool — the global root
pgbuf_Pool ties everything together. Fields a modifier must know:
| Field | Role & rationale |
|---|---|
num_buffers | Total BCB frames (from PRM_ID_PB_NBUFFERS, floored at MAX_NTRANS × 10, §2.1); fixed pool size bounding every table |
BCB_table | PGBUF_BCB[]; the control blocks |
buf_hash_table | PGBUF_BUFFER_HASH[]; the VPID hash (1.5) |
buf_lock_table | PGBUF_BUFFER_LOCK[]; one pending-read record per thread (1.5) |
iopage_table | PGBUF_IOPAGE_BUFFER[]; page payloads, parallel to BCB_table |
num_LRU_list | Number of shared LRU lists; first slice of buf_LRU_list |
ratio_lru1/ratio_lru2 | Zone-1/2 size ratios; seed each list’s threshold_lru1/2 |
buf_LRU_list | PGBUF_LRU_LIST[] shared+garbage+private; one backing array, index decides class |
buf_AOUT_list | PGBUF_AOUT_LIST victim history; the “Aout” half of 2Q (Ch 7) |
buf_invalid_list | PGBUF_INVALID_LIST free pool (1.7); source of fresh BCBs |
victim_cand_list | Victim-candidate array; flush daemon working set (Ch 8) |
seq_chkpt_flusher | PGBUF_SEQ_FLUSHER; rate-controlled checkpoint flush state |
monitor | PGBUF_PAGE_MONITOR; dirty count, per-LRU hits, victim/fix counters |
quota | PGBUF_PAGE_QUOTA; private-list quota tuning (Ch 10) |
thrd_holder_info | PGBUF_HOLDER_ANCHOR[] per thread; per-thread holder anchors (1.8) |
thrd_reserved_holder | Backing memory for all holders; pre-reserved holder space |
free_holder_set_mutex (SM) | Shared free-holder pool lock; serializes slab hand-out |
free_holder_set/free_index | First slab with free entries, first free slot; global holder allocator cursor |
check_for_interrupts | Set when interrupts must be checked; log mgr toggles under TR_TABLE_CS |
is_flushing_victims/is_checkpoint (SM) | Daemon-state flags; coordinate flush vs. checkpoint |
direct_victims (SM) | Victim array + two priority waiter LFCQs; direct victim hand-off (Ch 9) |
flushed_bcbs (SM) | LFCQ of post-flush BCBs; post-flush processing queue |
private/big_private/shared_lrus_with_victims | Three LFCQs of LRU indices with victims; victim search consults these vs scanning all (Ch 9) |
show_status/_old/_snapshot/_mutex | SHOW STATUS reporting state; statistics surfaced to SHOW queries |
1.11 The zone/index accessors
The only sanctioned readers of the packed flags word; the rest of the code
never masks flags by hand.
```c
// pgbuf_bcb_get_zone / _get_lru_index / PGBUF_IS_BCB_IN_LRU -- page_buffer.c
STATIC_INLINE PGBUF_ZONE
pgbuf_bcb_get_zone (const PGBUF_BCB * bcb)
{
  return PGBUF_GET_ZONE (bcb->flags);        /* (flags & PGBUF_ZONE_MASK) */
}

STATIC_INLINE int
pgbuf_bcb_get_lru_index (const PGBUF_BCB * bcb)
{
  assert (PGBUF_IS_BCB_IN_LRU (bcb));        /* <- precondition */
  return PGBUF_GET_LRU_INDEX (bcb->flags);   /* (flags & 0x0000FFFF) */
}

#define PGBUF_IS_BCB_IN_LRU(bcb) ((pgbuf_bcb_get_zone (bcb) & PGBUF_LRU_ZONE_MASK) != 0)
```

Branch analysis:

- pgbuf_bcb_get_zone: single unconditional return, no error path. Every legal flags yields exactly one of the five PGBUF_ZONE values; a corrupted flags falls to the caller’s switch default.
- pgbuf_bcb_get_lru_index: two branches via the assert. Debug, BCB in an LRU zone: assert passes, returns the low 16 bits (the home list’s index, §1.6). Debug, not in an LRU zone (INVALID/VOID): assert fires, because those bits are meaningless. In release the assert is compiled out and the function returns whatever the low 16 bits hold, so callers own the precondition; hence call sites guard with PGBUF_IS_BCB_IN_LRU.
- PGBUF_IS_BCB_IN_LRU: one boolean, two outcomes. ANDs the zone against PGBUF_LRU_ZONE_MASK (zones 1/2/3). INVALID/VOID set bits outside that mask → false; any LRU zone → true. This is the gate that makes the release-mode path of pgbuf_bcb_get_lru_index safe.
1.12 Pointer-relationship panorama
```mermaid
flowchart TB
  POOL["pgbuf_buffer_pool (pgbuf_Pool)"]
  POOL -->|BCB_table| BCB["pgbuf_bcb[]"]
  POOL -->|iopage_table| IOP["pgbuf_iopage_buffer[]"]
  POOL -->|buf_hash_table| HASH["pgbuf_buffer_hash[]"]
  POOL -->|buf_lock_table| LOCK["pgbuf_buffer_lock[]"]
  POOL -->|buf_LRU_list| LRU["pgbuf_lru_list[] shared+garbage+private"]
  POOL -->|buf_invalid_list| INV["pgbuf_invalid_list"]
  POOL -->|thrd_holder_info| ANC["pgbuf_holder_anchor[] per thread"]
  POOL -->|free_holder_set| SET["pgbuf_holder_set slabs"]
  HASH -->|hash_next| BCB
  HASH -->|lock_next| LOCK
  BCB -->|iopage_buffer| IOP
  IOP -->|bcb| BCB
  BCB -->|atomic_latch| LATCH["pgbuf_atomic_latch_impl union"]
  LRU -->|top/bottom/bottom_1/bottom_2/victim_hint| BCB
  INV -->|invalid_top via next_BCB| BCB
  ANC -->|thrd_hold_list| HLD["pgbuf_holder"]
  SET -->|element[10]| HLD
  HLD -->|bufptr| BCB
  HLD -->|first_watcher/last_watcher| WAT["pgbuf_watcher"]
```
Figure 1-2: the full pointer panorama. Every later chapter operates on a sub-graph of this picture.
1.13 Chapter summary — key takeaways
- The BCB latch is a 64-bit atomic union, not a mutex. pgbuf_atomic_latch_impl packs latch_mode + waiter_exists + fcnt into one std::atomic<uint64_t>, touched only as the whole raw word. The BCB’s mutex guards list/flag transitions, not the latch.
- flags is three things in one word: flag byte (24-31), zone (16-19, with a skip so INVALID/VOID don’t collide with the LRU zone bits), LRU index (0-15). Read it only through the §1.11 accessors.
- count_fix_and_avoid_dealloc is two fused counters: hi 16 a saturating fix count, lo 16 an atomic avoid-dealloc count, fused because 2-byte atomics aren’t portable.
- next_BCB is shared between the LRU and free lists; the zone is the single source of truth. Relink and zone change must be one critical section.
- victim_hint is advisory and can drift below the true first victim, a known unfixed bug; treat it as a start point and re-validate.
- The iopage is embedded with a back-pointer and an alignment pad, so PAGE_PTR → BCB round-trips and FILEIO_PAGE stays 8-byte aligned; its prv.lsa (mirrored at end-of-page in prv2) is the WAL/recovery watermark.
- Fixes are bookkept per-thread in the holder triad (anchor → holder → slab of 10).
- pgbuf_bcb_get_lru_index is valid only when PGBUF_IS_BCB_IN_LRU holds; the assert encodes that precondition.
Chapter 2: Initialization and Memory Layout
This chapter answers: where does every page-buffer structure come from at server start, and how is each table sized, allocated, and cross-wired before the first pgbuf_fix runs? The high-level companion (cubrid-page-buffer-manager.md) names the players in its CUBRID’s Approach section (BCB, page table, invalid list, three-zone LRU, private LRUs, Aout, LFCQs) and Ch. 1 gives the struct map; neither is re-derived below.
Everything lives in one file-scope object, pgbuf_Pool (type PGBUF_BUFFER_POOL). pgbuf_initialize is the orchestrator: it zeroes pgbuf_Pool field-by-field, derives sizes, then calls ten sub-initializers in a fixed order — quota parameters first (they fix the LRU count), then the dependent tables, then the quota/monitor arrays sized by PGBUF_TOTAL_LRU_COUNT.
2.1 pgbuf_initialize — the orchestrator and its size derivation
The function opens with per-field zeroing plus memset on embedded sub-structs. The std::atomic_int members of monitor cannot be memset, so they use .store(0). This manual reset is what makes pgbuf_finalize safe on a half-built pool: every pointer is NULL before any allocation (scalar sentinels like free_index are set to 0 here and re-set to -1 later in §2.9). num_buffers is read then floored; the two LRU-zone ratios are each clamped:
```c
// pgbuf_initialize -- src/storage/page_buffer.c
pgbuf_Pool.num_buffers = prm_get_integer_value (PRM_ID_PB_NBUFFERS);
if (pgbuf_Pool.num_buffers < PGBUF_MINIMUM_BUFFERS)   /* MAX_NTRANS * 10 */
  pgbuf_Pool.num_buffers = PGBUF_MINIMUM_BUFFERS;     /* <- silent floor, never error */

pgbuf_Pool.ratio_lru1 = prm_get_float_value (PRM_ID_PB_LRU_HOT_RATIO);
pgbuf_Pool.ratio_lru2 = prm_get_float_value (PRM_ID_PB_LRU_BUFFER_RATIO);
pgbuf_Pool.ratio_lru1 = MAX (pgbuf_Pool.ratio_lru1, PGBUF_LRU_ZONE_MIN_RATIO);   /* clamp lru1 into */
pgbuf_Pool.ratio_lru1 = MIN (pgbuf_Pool.ratio_lru1, PGBUF_LRU_ZONE_MAX_RATIO);   /* [0.05f, 0.90f] */
pgbuf_Pool.ratio_lru2 = MAX (pgbuf_Pool.ratio_lru2, PGBUF_LRU_ZONE_MIN_RATIO);   /* lru2 floor */
pgbuf_Pool.ratio_lru2 = MIN (pgbuf_Pool.ratio_lru2, 1.0f - PGBUF_LRU_ZONE_MIN_RATIO - pgbuf_Pool.ratio_lru1);
```

The ratios are only stored here; they govern LRU1/LRU2 thresholds in Ch. 6/7. Two asserts follow: ratio_lru2 stays in [0.05, 0.90] and the two-zone sum stays in [0.099, 0.951]. Each sub-initializer failure does goto error, which calls pgbuf_finalize.
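The clamping order matters: lru1 is fixed first, and lru2’s ceiling depends on the clamped lru1 so that zone 3 always keeps at least the minimum share. A small sketch of that arithmetic follows; ZoneRatios and clamp_zone_ratios are hypothetical names, and the 0.05 / 0.90 bounds are the values quoted in the source comment above.

```cpp
#include <algorithm>
#include <cassert>

// Bounds quoted in the pgbuf_initialize comment; constant names are hypothetical.
constexpr float kZoneMinRatio = 0.05f;
constexpr float kZoneMaxRatio = 0.90f;

struct ZoneRatios {
  float lru1;
  float lru2;
};

// Clamp lru1 into [min, max], then clamp lru2 into
// [min, 1 - min - lru1] so that zone 3 keeps at least the minimum share.
inline ZoneRatios clamp_zone_ratios(float hot_ratio, float buffer_ratio) {
  ZoneRatios r;
  r.lru1 = std::min(std::max(hot_ratio, kZoneMinRatio), kZoneMaxRatio);
  r.lru2 = std::max(buffer_ratio, kZoneMinRatio);
  r.lru2 = std::min(r.lru2, 1.0f - kZoneMinRatio - r.lru1);
  return r;
}
```

For example, a requested hot ratio of 0.95 is cut to 0.90, which leaves lru2 only about 0.05 no matter what was asked for; the two-zone sum then sits at the top of the asserted [0.099, 0.951] range.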
```mermaid
flowchart TD
  A["pgbuf_initialize"] --> B["pgbuf_initialize_page_quota_parameters\nfixes num_private_LRU_list"]
  B --> C["pgbuf_initialize_bcb_table\nBCB_table + iopage_table"]
  C --> D["pgbuf_initialize_hash_table\n2^20 buckets"]
  D --> E["pgbuf_initialize_lock_table\none record per thread"]
  E --> F["pgbuf_initialize_lru_list\nfixes num_LRU_list, builds shared+private"]
  F --> G["pgbuf_initialize_invalid_list\nall BCBs seeded here"]
  G --> H["pgbuf_initialize_aout_list"]
  H --> I["pgbuf_initialize_thrd_holder\npre-allocate holder sets"]
  I --> J["pgbuf_initialize_page_quota\narrays sized by TOTAL_LRU_COUNT"]
  J --> K["pgbuf_initialize_page_monitor\nlru_hits/lru_activity arrays"]
  K --> L["victim_cand_list + seq_chkpt_flusher\n+ SERVER_MODE LFCQs + show_status"]
  B -.error.-> Z["goto error -> pgbuf_finalize"]
  C -.error.-> Z
  F -.error.-> Z
  L --> M["NO_ERROR"]
```
Figure 2-1. The ten sub-initializers in call order. Quota parameters are first because they set num_private_LRU_list, which feeds PGBUF_TOTAL_LRU_COUNT, which sizes the LRU, quota, and monitor arrays. Note the invalid list (G) is seeded before the Aout list (H).
After the ten, the orchestrator allocates victim_cand_list (one per buffer), sizes the checkpoint flusher at MIN(0.25 * num_buffers, 65536), and under SERVER_MODE allocates the direct-victim array (bcb_victims, one per thread) and three lockfree::circular_queue objects (waiter_threads_high_priority, waiter_threads_low_priority, flushed_bcbs). The private/big_private_lrus_with_victims queues are created only if PGBUF_PAGE_QUOTA_IS_ENABLED; shared_lrus_with_victims always. Finally show_status (MAX_NTRANS + 1 records) is allocated and zeroed. These LFCQs and the daemons are Ch. 8–9.
Invariant — “every pointer NULL before first allocation.” Enforcement: the opening per-field reset NULLs every pointer member, so error: can call pgbuf_finalize at any point and finalize frees exactly what was allocated (every free is != NULL-guarded). What breaks: a pointer field added but not NULL-initialized here would feed garbage to free_and_init on a mid-init failure — a free of uninitialized memory.
2.2 pgbuf_initialize_bcb_table — BCB/iopage allocation, cross-linking, alignment
Two parallel arrays are allocated and each validated with MEM_SIZE_IS_VALID: BCB_table (metadata, num_buffers * PGBUF_BCB_SIZEOF) and iopage_table (page frames, num_buffers * PGBUF_IOPAGE_BUFFER_SIZE). Both iopage failure branches (bad-size and OOM) roll back BCB_table themselves (free_and_init, guarded != NULL) rather than relying on finalize, then return ER_PRM_BAD_VALUE / ER_OUT_OF_VIRTUAL_MEMORY. The per-BCB loop then initializes each BCB and cross-links it to its iopage frame symmetrically; next_BCB chains every BCB into one forward list (last NULL) that the invalid list inherits:
```cpp
// pgbuf_initialize_bcb_table -- src/storage/page_buffer.c
for (i = 0; i < pgbuf_Pool.num_buffers; i++)
  {
    bufptr = PGBUF_FIND_BCB_PTR (i);            /* base + i * sizeof(PGBUF_BCB) */
    pthread_mutex_init (&bufptr->mutex, NULL);
    VPID_SET_NULL (&bufptr->vpid);
    placement_new (&bufptr->atomic_latch, 0);   /* C++ atomic needs placement-new, not memset */
    bufptr->atomic_latch.store (impl.raw);      /* impl = {mode INVALID, no waiter, fcnt 0}; Ch.5 */
    bufptr->next_BCB = (i == pgbuf_Pool.num_buffers - 1) ? NULL : PGBUF_FIND_BCB_PTR (i + 1);   /* chain */
    bufptr->flags = PGBUF_BCB_INIT_FLAGS;       /* == PGBUF_INVALID_ZONE, no other flag */
    /* ... clear hash_next/prev_BCB/count_fix_and_avoid_dealloc/hit_age/oldest_unflush_lsa/ticks ... */
    ioptr = PGBUF_FIND_IOPAGE_PTR (i);          /* base + i * PGBUF_IOPAGE_BUFFER_SIZE */
    /* ... fileio_init_lsa_of_page; set iopage.prv pageid/volid = -1, ptype UNKNOWN ... */
    bufptr->iopage_buffer = ioptr;
    ioptr->bcb = bufptr;                        /* <- symmetric cross-link */
  }
```

```mermaid
graph LR
  subgraph BCB_table
    b0["BCB[0]"]
    b1["BCB[1]"]
  end
  subgraph iopage_table
    p0["iopage[0]"]
    p1["iopage[1]"]
  end
  b0 -->|iopage_buffer| p0
  p0 -->|bcb| b0
  b1 -->|iopage_buffer| p1
  p1 -->|bcb| b1
  b0 -->|next_BCB| b1
```
Figure 2-2. Parallel arrays, symmetric per-slot cross-link, and the next_BCB chain the invalid list inherits.
Invariant — “iopage is 8-byte aligned.” Enforcement: struct pgbuf_iopage_buffer places PGBUF_BCB *bcb first, then on 32-bit builds (__WORDSIZE == 32) inserts an explicit int dummy so the following FILEIO_PAGE iopage starts on an 8-byte boundary (an unsupported platform that is neither LINUX/WINDOWS/AIX trips a #error). PGBUF_IOPAGE_BUFFER_SIZE (offsetof(.., iopage) + SIZEOF_IOPAGE_PAGESIZE_AND_GUARD()) is the stride PGBUF_FIND_IOPAGE_PTR multiplies by i, so every frame stays aligned. What breaks: dropping the dummy yields misaligned buffers and undefined direct-I/O behavior.
2.3 pgbuf_initialize_hash_table — the fixed 2^20 bucket page table
The page table size is a compile-time constant, independent of num_buffers:
```c
// pgbuf_initialize_hash_table -- src/storage/page_buffer.c
hashsize = PGBUF_HASH_SIZE;   /* (1 << HASH_SIZE_BITS) == 1 << 20 == 1048576 */
pgbuf_Pool.buf_hash_table = (PGBUF_BUFFER_HASH *) malloc (hashsize * PGBUF_BUFFER_HASH_SIZEOF);
/* ... OOM check; loop: pthread_mutex_init each hash_mutex; hash_next = lock_next = NULL ... */
```

A power-of-two bucket count keeps the final masking step a single AND: pgbuf_hash_func_mirror finishes with hash_val & ((1 << HASH_SIZE_BITS) - 1), so no modulo/division is needed (the function still bit-reverses the 8 LSBs of volid into the high bits in a small loop before XOR-ing with pageid). Each bucket has its own hash_mutex (SERVER_MODE only), so there is no global page-table lock. The hash_next walk is Ch. 3.
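The mirror-and-mask scheme described above can be sketched as follows. This is a reconstruction from the prose, not a copy of pgbuf_hash_func_mirror: the function and constant names are hypothetical, and the assumption is that bit 0 of volid lands on the top bucket bit (bit 19), bit 1 on bit 18, and so on for 8 bits.

```cpp
#include <cassert>
#include <cstdint>

constexpr unsigned kHashSizeBits = 20;  // 2^20 buckets, as stated in 2.3
constexpr unsigned kVolidLsbBits = 8;   // only the 8 low volid bits are mirrored

// Mirror the 8 LSBs of volid into the top bits of the bucket range, XOR with
// pageid, then mask down to the table size with a single AND.
inline uint32_t hash_mirror(int16_t volid, int32_t pageid) {
  const uint32_t volid_lsb = static_cast<uint16_t>(volid);
  uint32_t reversed = 0;
  uint32_t lsb_mask = 1;
  uint32_t reverse_mask = 1u << (kHashSizeBits - 1);
  for (unsigned i = 0; i < kVolidLsbBits; ++i) {
    if (volid_lsb & lsb_mask) {
      reversed |= reverse_mask;
    }
    reverse_mask >>= 1;
    lsb_mask <<= 1;
  }
  const uint32_t hash_val = static_cast<uint32_t>(pageid) ^ reversed;
  return hash_val & ((1u << kHashSizeBits) - 1);
}
```

Under these assumptions, consecutive pages of one volume fall into consecutive buckets, while the same page id in neighbouring volumes is pushed far apart because the volume bits occupy the high end of the bucket index.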
2.4 pgbuf_initialize_lock_table — one buffer-lock record per thread
The buffer-lock table has one slot per server thread, indexed by thread index; it is the rendezvous used while a miss is being resolved (Ch. 4):
```c
// pgbuf_initialize_lock_table -- src/storage/page_buffer.c
thrd_num_total = thread_num_total_threads ();   /* SA mode asserts thrd_num_total == 1 */
pgbuf_Pool.buf_lock_table = (PGBUF_BUFFER_LOCK *) malloc (thrd_num_total * PGBUF_BUFFER_LOCK_SIZEOF);
/* ... OOM check; loop: VPID_SET_NULL(vpid); lock_next = NULL; (SERVER_MODE) next_wait_thrd = NULL ... */
```

Sizing by thread count works because a thread resolves at most one miss at a time; its record is reused for whichever VPID it is bringing in.
2.5 pgbuf_initialize_lru_list — shared + private list count and per-list reset
This initializer first fixes num_LRU_list (the shared count): a non-zero parameter is taken verbatim; zero is auto-derived:
```c
// pgbuf_initialize_lru_list -- src/storage/page_buffer.c
pgbuf_Pool.num_LRU_list = prm_get_integer_value (PRM_ID_PB_NUM_LRU_CHAINS);
if (pgbuf_Pool.num_LRU_list == 0)
  {
    pgbuf_Pool.num_LRU_list = (int) MAX_NTRANS;   /* default: one shared list per transaction slot */
    if (pgbuf_Pool.num_buffers / pgbuf_Pool.num_LRU_list < PGBUF_MIN_PAGES_IN_SHARED_LIST)   /* 1000 */
      pgbuf_Pool.num_LRU_list = pgbuf_Pool.num_buffers / PGBUF_MIN_PAGES_IN_SHARED_LIST;     /* coarsen */
    pgbuf_Pool.num_LRU_list = MAX (pgbuf_Pool.num_LRU_list, 4);   /* floor: at least 4 shared LRUs */
  }
```

Branch logic: one list per transaction; if that gives fewer than 1000 pages per list, coarsen; never below 4. The allocation covers shared + private (PGBUF_TOTAL_LRU_COUNT = PGBUF_SHARED_LRU_COUNT + PGBUF_PRIVATE_LRU_COUNT, the latter = num_private_LRU_list from §2.7). Private lists occupy the high index range; PGBUF_IS_PRIVATE_LRU_INDEX(i) is true for i >= PGBUF_SHARED_LRU_COUNT.
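A few worked inputs make the three branches concrete. The helper below restates the derivation as a pure function; derive_num_lru_lists and its parameter names are hypothetical, and the caller is assumed to pass a positive max_ntrans.

```cpp
#include <algorithm>
#include <cassert>

constexpr int kMinPagesInSharedList = 1000;  // PGBUF_MIN_PAGES_IN_SHARED_LIST per the comment
constexpr int kMinSharedLists = 4;           // floor applied by the source

// Returns the shared LRU list count for a given parameter value.
// param != 0  -> taken verbatim
// param == 0  -> one list per transaction slot, coarsened so each list has
//                at least 1000 pages, and never fewer than 4 lists.
inline int derive_num_lru_lists(int param, int max_ntrans, int num_buffers) {
  if (param != 0) {
    return param;
  }
  assert(max_ntrans > 0);
  int n = max_ntrans;
  if (num_buffers / n < kMinPagesInSharedList) {
    n = num_buffers / kMinPagesInSharedList;
  }
  return std::max(n, kMinSharedLists);
}
```

With 100 transaction slots and 50,000 buffers, one list per slot would give only 500 pages each, so the count coarsens to 50; with 10 slots and 1,000 buffers the coarsened value is 1, and the floor lifts it to 4.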
// pgbuf_initialize_lru_list -- src/storage/page_buffer.cpgbuf_Pool.buf_LRU_list = (PGBUF_LRU_LIST *) malloc (PGBUF_TOTAL_LRU_COUNT * PGBUF_LRU_LIST_SIZEOF);/* ... OOM check; loop over PGBUF_TOTAL_LRU_COUNT lists: ... */ pgbuf_Pool.buf_LRU_list[i].index = i; /* self-index, used to recover list from a BCB */ /* ... pthread_mutex_init; top/bottom/bottom_1/bottom_2 = NULL; counts/victim_hint/ticks cleared ... */ pgbuf_Pool.buf_LRU_list[i].threshold_lru1 = 0; /* <- initial threshold ZERO, set later */ pgbuf_Pool.buf_LRU_list[i].threshold_lru2 = 0; pgbuf_Pool.buf_LRU_list[i].quota = 0; pgbuf_Pool.buf_LRU_list[i].flags = 0;Both kinds of list use the same loop — they differ only by index range, not by struct. The thresholds and quota start at 0, not from the §2.1 ratios; they get real values from the quota machinery (Ch. 7/10) once num_buffers is distributed. At init every list is empty, so zero is correct.
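The auto-derivation of the shared list count can be sketched as a standalone function. This is a minimal illustration rather than the CUBRID source: shared_lru_count and its parameter names are hypothetical, and the two constants simply mirror the values quoted above (1000 pages per list, a floor of 4).

```c
#include <assert.h>

/* Values quoted in the text; the macro names are local stand-ins. */
#define MIN_PAGES_IN_SHARED_LIST 1000
#define MIN_SHARED_LRU_COUNT 4

/* Hypothetical helper: returns the shared-LRU count for a configuration.
 *   param_chains - configured chain count (0 requests auto-derivation)
 *   max_ntrans   - number of transaction slots
 *   num_buffers  - total number of BCBs in the pool */
static int
shared_lru_count (int param_chains, int max_ntrans, int num_buffers)
{
  int n;

  assert (max_ntrans > 0 && num_buffers > 0);

  if (param_chains != 0)
    {
      return param_chains;      /* an explicit setting is taken verbatim */
    }

  n = max_ntrans;               /* default: one list per transaction slot */
  if (num_buffers / n < MIN_PAGES_IN_SHARED_LIST)
    {
      n = num_buffers / MIN_PAGES_IN_SHARED_LIST;   /* coarsen */
    }
  if (n < MIN_SHARED_LRU_COUNT)
    {
      n = MIN_SHARED_LRU_COUNT; /* floor */
    }
  return n;
}
```

With 100 transaction slots, a 50000-page pool coarsens to 50 lists, while a 2000-page pool coarsens to 2 and is then lifted to the floor of 4.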
2.6 pgbuf_initialize_invalid_list and the Aout list
The invalid (free) list is the cheapest initializer: it points its head at BCB[0] and trusts the next_BCB chain from §2.2:

    // pgbuf_initialize_invalid_list -- src/storage/page_buffer.c
    pthread_mutex_init (&pgbuf_Pool.buf_invalid_list.invalid_mutex, NULL);
    pgbuf_Pool.buf_invalid_list.invalid_top = PGBUF_FIND_BCB_PTR (0);   /* head of the next_BCB chain */
    pgbuf_Pool.buf_invalid_list.invalid_cnt = pgbuf_Pool.num_buffers;   /* every BCB starts invalid */

Invariant: “all BCBs begin in the invalid list.” Enforcement: every BCB’s flag is PGBUF_INVALID_ZONE (§2.2) and invalid_cnt == num_buffers, the same truth stored twice. What breaks: the first num_buffers misses pop here before any eviction; if the count and the flags disagree, a popped BCB could be double-counted or skipped, so Ch. 7 keeps the two in sync on every move.
The Aout list (pgbuf_initialize_aout_list, struct pgbuf_aout_list) records eviction history to decide whether a re-faulted page was recently evicted. Capacity is num_buffers * aout_ratio (where aout_ratio = prm_get_float_value(PRM_ID_PB_AOUT_RATIO)), capped at PGBUF_LIMIT_AOUT_BUFFERS (32768); a non-positive ratio disables it (max_count = 0, early return NO_ERROR after Aout_mutex is already initialized). Otherwise it pre-allocates a bufarray of max_count PGBUF_AOUT_BUF nodes chained into a free list (Aout_free at bufarray[0]), then builds num_hashes = MAX(max_count / AOUT_HASH_DIVIDE_RATIO, 1) MHT tables. The error_return path nulls Aout_free, frees bufarray, and destroys the MHTs with a loop that stops at the first NULL slot (for (i = 0; list->aout_buf_ht[i] != NULL; i++)), so only the tables actually created are destroyed; the pgbuf_finalize loop, by contrast, iterates the full num_hashes. The path then frees aout_buf_ht, destroys Aout_mutex, and returns ER_FAILED.
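The Aout sizing rules can be written out as two small functions. This is a sketch under stated assumptions: aout_max_count and aout_num_hashes are hypothetical names, the product is truncated toward zero, and the value of AOUT_HASH_DIVIDE_RATIO is not given in this document, so it is passed in as a parameter instead of being guessed.

```c
#include <assert.h>

#define LIMIT_AOUT_BUFFERS 32768        /* PGBUF_LIMIT_AOUT_BUFFERS in the text */

/* Hypothetical helper: capacity of the Aout history list.
 * Returns 0 when a non-positive ratio disables the list. */
static int
aout_max_count (int num_buffers, float aout_ratio)
{
  double wanted;

  assert (num_buffers >= 0);
  if (aout_ratio <= 0.0f)
    {
      return 0;                 /* disabled */
    }
  wanted = (double) num_buffers * (double) aout_ratio;
  if (wanted > LIMIT_AOUT_BUFFERS)
    {
      return LIMIT_AOUT_BUFFERS;        /* cap */
    }
  return (int) wanted;          /* assumption: truncate toward zero */
}

/* Hypothetical helper: number of lookup tables, never fewer than one.
 * divide_ratio stands in for AOUT_HASH_DIVIDE_RATIO. */
static int
aout_num_hashes (int max_count, int divide_ratio)
{
  int n;

  assert (divide_ratio > 0);
  n = max_count / divide_ratio;
  return (n < 1) ? 1 : n;
}
```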
PGBUF_AOUT_LIST (the Aout container):
| Field | Role | Why it exists |
|---|---|---|
| Aout_mutex (SERVER_MODE) | guards the whole Aout list | history mutated on every eviction/refault |
| Aout_top | most-recently-evicted end | newest history entry |
| Aout_bottom | oldest end | the entry discarded when the list overflows |
| Aout_free | head of the free node list | nodes preallocated, never malloc’d per insert |
| bufarray | the single allocation of all nodes | one block beats per-node alloc |
| num_hashes | count of MHT lookup tables | shards the lookup to cut contention |
| aout_buf_ht | array of MHT tables, VPID to node | O(1) “was this page recently evicted?” |
| max_count | capacity; 0 means disabled | bounds memory and acts as the on/off switch |
PGBUF_PAGE_QUOTA (adaptive private-LRU sizing — populated in §2.7):
| Field | Role | Why it exists |
|---|---|---|
| num_private_LRU_list | number of private LRUs; 0 disables quota | master switch for the private-LRU feature |
| lru_victim_flush_priority_per_lru | per-LRU flush priority (TOTAL_LRU_COUNT floats) | tells flush daemons where dirty pressure is |
| private_lru_session_cnt | active sessions per private LRU | a list with 0 sessions can be reclaimed |
| private_pages_ratio | fraction of all BCBs that are private | target the quota adjuster steers toward |
| add_shared_lru_idx | round-robin cursor for relocating to shared | spreads BCBs evenly across shared lists |
| avoid_shared_lru_idx | shared LRU to skip when relocating | avoids piling onto an oversized list |
| last_adjust_time | timestamp of last quota adjustment | rate-limits the adjuster |
| adjust_age | monotonic adjustment counter | versions the quota state |
| is_adjusting | re-entrancy guard for the adjuster | only one thread adjusts quotas at a time |
PGBUF_PAGE_MONITOR (per-LRU statistics — populated in §2.8):
| Field | Role | Why it exists |
|---|---|---|
| dirties_cnt | count of dirty BCBs (INT64) | drives flush urgency |
| lru_hits | LRU1 hits per LRU (TOTAL_LRU_COUNT ints) | recency-quality signal for quota tuning |
| lru_activity | activity level per LRU | detects idle private lists for reclamation |
| lru_shared_pgs_cnt | BCBs across all shared LRUs (volatile) | complements private_pages_ratio |
| pg_unfix_cnt | unfix counter (std::atomic_int) | triggers periodic quota refresh |
| lru_victim_req_cnt | victim requests across all LRUs | victim-pressure gauge |
| fix_req_cnt | fix requests (std::atomic_int) | overall load gauge |
| bcb_locks (SERVER_MODE) | per-thread BCB-mutex usage tracking | lock-contention diagnostics |
| victim_rich | true when victims are plentiful | fast-path hint for the fix code |
2.7 Quota bootstrap — pgbuf_initialize_page_quota_parameters then _page_quota
The split is deliberate. Parameters runs before the BCB/LRU tables because it fixes num_private_LRU_list (a dependency of PGBUF_TOTAL_LRU_COUNT); the data initializer runs after them because it allocates arrays sized by that total.

    // pgbuf_initialize_page_quota_parameters -- src/storage/page_buffer.c
    quota = &(pgbuf_Pool.quota);
    memset (quota, 0, sizeof (PGBUF_PAGE_QUOTA));
    tsc_getticks (&quota->last_adjust_time);
    quota->adjust_age = 0;
    quota->is_adjusting = 0;
    #if defined (SERVER_MODE)
      quota->num_private_LRU_list = prm_get_integer_value (PRM_ID_PB_NUM_PRIVATE_CHAINS);
      if (quota->num_private_LRU_list == -1)
        quota->num_private_LRU_list = MAX_NTRANS + VACUUM_MAX_WORKER_COUNT;   /* auto: one per worker */
      else if (quota->num_private_LRU_list == 0)
        { /* disabled */ }                                                    /* <- explicit no-op branch */
      else if (quota->num_private_LRU_list < PGBUF_PRIVATE_LRU_MIN_COUNT)     /* 4 */
        quota->num_private_LRU_list = PGBUF_PRIVATE_LRU_MIN_COUNT;            /* floor when user-set */
    #else
      quota->num_private_LRU_list = 0;   /* SA_MODE: no private LRUs */
    #endif

Outcomes: -1 (auto) becomes MAX_NTRANS + VACUUM_MAX_WORKER_COUNT; 0 stays disabled; positive below 4 is raised to 4; SA-mode is always 0. This integer drives PGBUF_PAGE_QUOTA_IS_ENABLED (> 0) everywhere. The data initializer then allocates the two arrays and seeds the session counts:

    // pgbuf_initialize_page_quota -- src/storage/page_buffer.c
    quota->lru_victim_flush_priority_per_lru = (float *) malloc (PGBUF_TOTAL_LRU_COUNT * sizeof (float));   /* ALL lists */
    quota->private_lru_session_cnt = (int *) malloc (PGBUF_PRIVATE_LRU_COUNT * sizeof (int));               /* PRIVATE only */
    /* ... each OOM -> error_status, goto exit; loop zeros priority for all, session_cnt only where ... */
    /* ... PGBUF_IS_PRIVATE_LRU_INDEX(i) holds, indexed via PGBUF_PRIVATE_LIST_FROM_LRU_INDEX(i) ... */
    quota->private_pages_ratio = PGBUF_PAGE_QUOTA_IS_ENABLED ? 1.0f : 0;   /* start fully private if enabled */
    quota->add_shared_lru_idx = 0;
    quota->avoid_shared_lru_idx = -1;

Both failures land on a single exit: label (which returns error_status); the orchestrator’s goto error then runs finalize, which frees whatever was allocated.
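The parameter normalization performed by pgbuf_initialize_page_quota_parameters reduces to a pure function of the configured value. The sketch below is illustrative only: normalize_private_lru_count and quota_is_enabled are hypothetical names, and the build mode is passed as a flag instead of being selected by the preprocessor.

```c
#include <assert.h>
#include <stdbool.h>

#define PRIVATE_LRU_MIN_COUNT 4 /* PGBUF_PRIVATE_LRU_MIN_COUNT in the text */

/* Hypothetical helper: maps the configured private-chain count to the
 * effective num_private_LRU_list. */
static int
normalize_private_lru_count (int param, int max_ntrans, int vacuum_workers,
                             bool server_mode)
{
  assert (max_ntrans >= 0 && vacuum_workers >= 0);

  if (!server_mode)
    {
      return 0;                 /* SA mode: no private LRUs */
    }
  if (param == -1)
    {
      return max_ntrans + vacuum_workers;       /* auto: one per worker */
    }
  if (param == 0)
    {
      return 0;                 /* explicitly disabled */
    }
  if (param < PRIVATE_LRU_MIN_COUNT)
    {
      return PRIVATE_LRU_MIN_COUNT;     /* floor for user-set values */
    }
  return param;
}

/* Mirrors the "> 0" test described for PGBUF_PAGE_QUOTA_IS_ENABLED. */
static bool
quota_is_enabled (int num_private_lru_list)
{
  return num_private_lru_list > 0;
}
```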
2.8 pgbuf_initialize_page_monitor
Mirroring quota-data, the monitor first re-NULLs its pointer members, then allocates two per-LRU integer arrays sized by PGBUF_TOTAL_LRU_COUNT:

    // pgbuf_initialize_page_monitor -- src/storage/page_buffer.c
    monitor->lru_hits = (int *) malloc (PGBUF_TOTAL_LRU_COUNT * sizeof (int));
    monitor->lru_activity = (int *) malloc (PGBUF_TOTAL_LRU_COUNT * sizeof (int));
    /* ... each OOM -> goto exit; loop zeros both; lru_victim_req_cnt/lru_shared_pgs_cnt = 0 ... */
    monitor->fix_req_cnt.store (0);
    monitor->pg_unfix_cnt.store (0);   /* atomics: .store, not memset */
    #if defined (SERVER_MODE)
      if (pgbuf_Monitor_locks)         /* forced true in !NDEBUG; param-driven in NDEBUG */
        monitor->bcb_locks = (PGBUF_MONITOR_BCB_MUTEX *) calloc (count_threads, sizeof (PGBUF_MONITOR_BCB_MUTEX));
    #endif
    monitor->victim_rich = false;      /* no BCBs in lists yet, so no victims */

bcb_locks is per-thread (sized by thread_num_total_threads()), allocated only when lock monitoring is on (pgbuf_Monitor_locks is set in §2.1: forced true in debug builds, read from PRM_ID_PB_MONITOR_LOCKS in NDEBUG). All error paths funnel through exit:.
2.9 pgbuf_initialize_thrd_holder — pre-allocated per-thread holder pools
A holder records that a thread has a BCB fixed. Each thread gets a private free list of PGBUF_DEFAULT_FIX_COUNT (7) holders so the common fix path never allocates:

    // pgbuf_initialize_thrd_holder -- src/storage/page_buffer.c
    thrd_num_total = thread_num_total_threads ();
    pgbuf_Pool.thrd_holder_info = (PGBUF_HOLDER_ANCHOR *) malloc (thrd_num_total * PGBUF_HOLDER_ANCHOR_SIZEOF);
    pgbuf_Pool.thrd_reserved_holder =
      (PGBUF_HOLDER *) malloc (thrd_num_total * PGBUF_DEFAULT_FIX_COUNT * PGBUF_HOLDER_SIZEOF);
    /* ... each OOM check; per-thread anchor i: num_hold_cnt=0, num_free_cnt=7, thrd_hold_list=NULL ... */
      pgbuf_Pool.thrd_holder_info[i].thrd_free_list =
        &(pgbuf_Pool.thrd_reserved_holder[i * PGBUF_DEFAULT_FIX_COUNT]);
      /* ... inner loop chains the 7 reserved holders via next_holder, last == NULL ... */
    pthread_mutex_init (&pgbuf_Pool.free_holder_set_mutex, NULL);
    pgbuf_Pool.free_holder_set = NULL;
    pgbuf_Pool.free_index = -1;   /* -1 == no shared free holder; grow on demand */

The reserved holders are one flat array sliced per thread by i * PGBUF_DEFAULT_FIX_COUNT. When a thread exceeds 7 concurrent fixes, pgbuf_allocate_thrd_holder_entry falls back to the shared free_holder_set, malloc’d in PGBUF_HOLDER_SET blocks (PGBUF_NUM_ALLOC_HOLDER = 10 elements each) and never freed until finalize; free_index == -1 is the “pool empty, grow it” sentinel set here (it was a transient 0 from the §2.1 reset).
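The flat-array slicing can be reproduced in a few lines. This is a simplified sketch, not the real structures: HOLDER, HOLDER_ANCHOR and HOLDER_POOL are cut-down hypothetical types that keep only the fields needed to show how thread i's free list is threaded through its slice of the reserved array. The shared free_holder_set fallback is omitted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define DEFAULT_FIX_COUNT 7     /* PGBUF_DEFAULT_FIX_COUNT in the text */

typedef struct holder HOLDER;
struct holder
{
  HOLDER *next_holder;          /* link inside a thread's free list */
};

typedef struct
{
  int num_hold_cnt;
  int num_free_cnt;
  HOLDER *thrd_hold_list;
  HOLDER *thrd_free_list;
} HOLDER_ANCHOR;

typedef struct
{
  int nthreads;
  HOLDER_ANCHOR *anchors;       /* one anchor per thread */
  HOLDER *reserved;             /* nthreads * DEFAULT_FIX_COUNT holders, flat */
} HOLDER_POOL;

static void
holder_pool_destroy (HOLDER_POOL * pool)
{
  free (pool->anchors);
  free (pool->reserved);
  pool->anchors = NULL;
  pool->reserved = NULL;
}

/* Returns 0 on success, -1 on allocation failure. */
static int
holder_pool_init (HOLDER_POOL * pool, int nthreads)
{
  int i, j;

  assert (nthreads > 0);
  pool->nthreads = nthreads;
  pool->anchors = malloc ((size_t) nthreads * sizeof (HOLDER_ANCHOR));
  pool->reserved = malloc ((size_t) nthreads * DEFAULT_FIX_COUNT * sizeof (HOLDER));
  if (pool->anchors == NULL || pool->reserved == NULL)
    {
      holder_pool_destroy (pool);
      return -1;
    }
  for (i = 0; i < nthreads; i++)
    {
      HOLDER *slice = &pool->reserved[i * DEFAULT_FIX_COUNT];

      pool->anchors[i].num_hold_cnt = 0;
      pool->anchors[i].num_free_cnt = DEFAULT_FIX_COUNT;
      pool->anchors[i].thrd_hold_list = NULL;
      pool->anchors[i].thrd_free_list = slice;
      for (j = 0; j < DEFAULT_FIX_COUNT; j++)
        {
          /* chain the slice; the last holder terminates the list */
          slice[j].next_holder = (j + 1 < DEFAULT_FIX_COUNT) ? &slice[j + 1] : NULL;
        }
    }
  return 0;
}

/* Counts the holders reachable from an anchor's free list. */
static int
free_list_length (const HOLDER_ANCHOR * anchor)
{
  int n = 0;
  const HOLDER *h;

  for (h = anchor->thrd_free_list; h != NULL; h = h->next_holder)
    {
      n++;
    }
  return n;
}
```

Because each free list ends inside its own slice, two threads never touch each other's reserved holders, which is why this part of the fix path needs no lock.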
2.10 pgbuf_thread_variables_init — a worker claims its private LRU index
Called when a worker’s THREAD_ENTRY comes online, this hook wires the worker to its private LRU and holder anchor:

    // pgbuf_thread_variables_init -- src/storage/page_buffer.c
    if (!thread_p)
      return;
    if (pgbuf_Pool.quota.num_private_LRU_list > 0 && thread_p->private_lru_index != -1)
      thread_p->m_is_private_lru_enabled = true;    /* quota on AND this worker has a private slot */
    else
      thread_p->m_is_private_lru_enabled = false;
    if (!thread_p->m_holder_anchor)
      thread_p->m_holder_anchor = &pgbuf_Pool.thrd_holder_info[thread_p->index];   /* bind to its slice */

private_lru_index lives on THREAD_ENTRY (default -1), assigned elsewhere when a transaction acquires a private list. This function only interprets it: a worker uses a private LRU iff quota is enabled and its index != -1. The anchor bind is idempotent (guarded by !m_holder_anchor) and gives O(1) access to the §2.9 slice. Vacuum workers and SA-mode fall to false, using shared LRUs only.
2.11 pgbuf_finalize — teardown order
Teardown is not the strict reverse of init; it is a flat sequence of NULL-guarded frees, each safe because of the §2.1 invariant:

1. Hash table: destroy all hash_mutexes, free buf_hash_table.
2. Lock table: free buf_lock_table.
3. BCB table: destroy every BCB mutex, free BCB_table, set num_buffers = 0.
4. Free iopage_table.
5. LRU lists: destroy every list mutex, free buf_LRU_list.
6. Destroy invalid_mutex.
7. Thread holders: free thrd_holder_info/thrd_reserved_holder, destroy free_holder_set_mutex, walk and free every lazily-grown free_holder_set block.
8. Free victim_cand_list, then Aout (free bufarray, mht_destroy each of num_hashes slots, free aout_buf_ht, destroy Aout_mutex, zero fields), then seq_chkpt_flusher.flush_list.
9. Quota arrays.
10. Monitor arrays plus (SERVER_MODE) bcb_locks.
11. SERVER_MODE: free direct_victims.bcb_victims, delete the two waiter queues and flushed_bcbs.
12. Delete the three _lrus_with_victims queues.
13. Free show_status, destroy its mutex.
14. thread_clear_all_holder_anchor (), the symmetric undo of §2.10.

C++ objects (lockfree::circular_queue) use delete, not free_and_init, because they were new’d; mixing the two would corrupt the heap. num_buffers is zeroed early (step 3) so a racing reader sees an empty pool. With every free guarded by a != NULL check and every pointer NULL-initialized in §2.1, finalize is correct whether the pool is fully built or failed mid-init.
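The "safe at any partial-build point" property follows from two habits: NULL every pointer before allocating anything, and guard every free. A minimal sketch with a hypothetical three-table pool (MINI_POOL and its functions are illustrative names, and fail_after is a test-only knob that simulates an allocation failure):

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

typedef struct
{
  int *hash_table;
  int *lock_table;
  int *bcb_table;
  int num_buffers;
} MINI_POOL;

/* Step 0 of init: every pointer is NULL before any allocation. */
static void
mini_pool_reset (MINI_POOL * pool)
{
  pool->hash_table = NULL;
  pool->lock_table = NULL;
  pool->bcb_table = NULL;
  pool->num_buffers = 0;
}

/* Flat, NULL-guarded teardown: correct for a full or a partial build. */
static void
mini_pool_finalize (MINI_POOL * pool)
{
  if (pool->hash_table != NULL)
    {
      free (pool->hash_table);
      pool->hash_table = NULL;
    }
  if (pool->lock_table != NULL)
    {
      free (pool->lock_table);
      pool->lock_table = NULL;
    }
  if (pool->bcb_table != NULL)
    {
      free (pool->bcb_table);
      pool->bcb_table = NULL;
    }
  pool->num_buffers = 0;
}

/* fail_after: how many allocations may succeed before a simulated
 * failure; a negative value means "never fail". Returns 0 or -1. */
static int
mini_pool_initialize (MINI_POOL * pool, int n, int fail_after)
{
  assert (n > 0);
  mini_pool_reset (pool);

  if (fail_after == 0
      || (pool->hash_table = calloc ((size_t) n, sizeof (int))) == NULL)
    goto error;
  if (fail_after == 1
      || (pool->lock_table = calloc ((size_t) n, sizeof (int))) == NULL)
    goto error;
  if (fail_after == 2
      || (pool->bcb_table = calloc ((size_t) n, sizeof (int))) == NULL)
    goto error;

  pool->num_buffers = n;
  return 0;

error:
  mini_pool_finalize (pool);    /* frees only what was actually built */
  return -1;
}
```

Calling mini_pool_finalize a second time is harmless, because the first call leaves every pointer NULL again.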
2.12 Chapter summary — key takeaways
- Ten sub-initializers, fixed order. pgbuf_initialize zeroes pgbuf_Pool field-by-field (atomics via .store), then calls ten sub-initializers; quota parameters must run first (they fix num_private_LRU_list) and quota/monitor data must run last (sized by PGBUF_TOTAL_LRU_COUNT).
- num_buffers is floored, not validated (below MAX_NTRANS * 10 it is silently raised); each LRU zone ratio is independently clamped (lru1 into [0.05, 0.90], lru2 floored at 0.05 then capped so the sum leaves room) and only stored.
- BCB and iopage are parallel arrays, symmetrically cross-linked; the next_BCB chain is what the invalid list inherits; the int dummy padding enforces 8-byte iopage alignment on 32-bit builds.
- The page table is a fixed 2^20 buckets, each with its own hash_mutex (no global lock); a power-of-two size makes the final hash step a single AND. Lock and holder pools are sized by thread count, not buffer count.
- All BCBs start in the invalid list (invalid_cnt == num_buffers, every flag PGBUF_INVALID_ZONE): one truth stored twice. LRU thresholds start at 0 because every list is empty.
- Quota is one integer switch: num_private_LRU_list (-1 auto, 0 disabled, positive floored to 4, 0 in SA mode) drives PGBUF_PAGE_QUOTA_IS_ENABLED; a worker uses a private LRU iff quota is on and its THREAD_ENTRY.private_lru_index != -1.
- Finalize is a flat NULL-guarded sequence, safe at any partial-build point; it deletes C++ queues but frees C arrays, and ends by clearing per-thread holder-anchor back-pointers. See Ch. 8 for the daemon set this chapter only constructs.
Chapter 3: The Fix Entry Path and Page-Table Lookup
Every page access enters through pgbuf_fix (compiled as
pgbuf_fix_release in release builds, pgbuf_fix_debug under !NDEBUG).
This chapter dissects that function as a master state machine, from
argument validation to the moment a hit hands off to latching
(Chapter 5) or a miss hands off to BCB claim (Chapter 4). For the
big-picture flow and the meaning of the zones, flags, and BCB struct,
see ### How a page fix flows, ### Page table — VPID hash, and
### Buffer Control Block — PGBUF_BCB in cubrid-page-buffer-manager.md.
The fetch mode (PAGE_FETCH_MODE) is the biggest source of branching:
its seven values reappear at the lock-free fast path, the miss fork, the
page-VPID check, and the PAGE_UNKNOWN switch near the exit.
3.1 The seven PAGE_FETCH_MODE values
The enum as declared in page_buffer.h:

    // PAGE_FETCH_MODE -- src/storage/page_buffer.h
    typedef enum
    {
      OLD_PAGE = 0,                 /* must already exist on disk or in buffer */
      NEW_PAGE,                     /* newly allocated; may be created in buffer */
      OLD_PAGE_IF_IN_BUFFER,        /* return only if resident; never fix from disk */
      OLD_PAGE_PREVENT_DEALLOC,     /* fetch + mark to block dealloc */
      OLD_PAGE_DEALLOCATED,         /* deliberately fetch a deallocated page */
      OLD_PAGE_MAYBE_DEALLOCATED,   /* fetch, tolerate deallocated (warn) */
      RECOVERY_PAGE                 /* recovery: new/old/deallocated all valid */
    } PAGE_FETCH_MODE;

| Mode | Validation skipped? | Miss → claim from disk? | Behaviour on PAGE_UNKNOWN page at exit |
|---|---|---|---|
| OLD_PAGE | no | yes | assert(false) + ER_ERROR_SEVERITY ER_PB_BAD_PAGEID, unfix, return NULL |
| NEW_PAGE | no | yes (created in buffer) | accepted, returned |
| OLD_PAGE_IF_IN_BUFFER | suppresses errors in pgbuf_is_valid_page | no; returns NULL on miss | accepted, returned |
| OLD_PAGE_PREVENT_DEALLOC | no | yes | treated like OLD_PAGE: assert(false), unfix, NULL |
| OLD_PAGE_DEALLOCATED | no | yes | accepted, returned |
| OLD_PAGE_MAYBE_DEALLOCATED | no | yes | warning ER_PB_BAD_PAGEID, unfix, return NULL |
| RECOVERY_PAGE | bypasses the page-validation block entirely | yes | accepted, returned |
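The last column of the table amounts to a three-way policy keyed on the fetch mode. A sketch of that mapping follows; the UNKNOWN_PAGE_POLICY enum and unknown_page_policy are hypothetical names introduced only for illustration, while the PAGE_FETCH_MODE values are those listed above.

```c
#include <assert.h>

typedef enum
{
  OLD_PAGE = 0,
  NEW_PAGE,
  OLD_PAGE_IF_IN_BUFFER,
  OLD_PAGE_PREVENT_DEALLOC,
  OLD_PAGE_DEALLOCATED,
  OLD_PAGE_MAYBE_DEALLOCATED,
  RECOVERY_PAGE
} PAGE_FETCH_MODE;

/* Hypothetical summary of what the exit switch does with a PAGE_UNKNOWN page. */
typedef enum
{
  UNKNOWN_IS_ACCEPTED,          /* page is returned to the caller */
  UNKNOWN_IS_WARNING,           /* warning, unfix, return NULL */
  UNKNOWN_IS_ERROR              /* assert + error, unfix, return NULL */
} UNKNOWN_PAGE_POLICY;

static UNKNOWN_PAGE_POLICY
unknown_page_policy (PAGE_FETCH_MODE mode)
{
  switch (mode)
    {
    case OLD_PAGE:
    case OLD_PAGE_PREVENT_DEALLOC:
      return UNKNOWN_IS_ERROR;
    case OLD_PAGE_MAYBE_DEALLOCATED:
      return UNKNOWN_IS_WARNING;
    case NEW_PAGE:
    case OLD_PAGE_IF_IN_BUFFER:
    case OLD_PAGE_DEALLOCATED:
    case RECOVERY_PAGE:
      return UNKNOWN_IS_ACCEPTED;
    }
  assert (0);                   /* not reached for valid enum values */
  return UNKNOWN_IS_ERROR;
}
```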
3.2 Argument validation and the unconditional→conditional downgrade
pgbuf_fix_release validates before touching shared state. Four guards
fire in order, each an early return NULL. The first two are
assert_release (false) checks rejecting an illegal request_mode
(non-R/W) or condition; then pgbuf_Pool.monitor.fix_req_cnt is bumped,
then the page-validation and pageid guards:

    // pgbuf_fix_release -- src/storage/page_buffer.c
    if (pgbuf_get_check_page_validation_level (PGBUF_DEBUG_PAGE_VALIDATION_FETCH)
        && fetch_mode != RECOVERY_PAGE)   /* <- recovery skips validation */
      {
        if (pgbuf_is_valid_page (thread_p, vpid, fetch_mode == OLD_PAGE_IF_IN_BUFFER) != DISK_VALID)
          return NULL;                    /* IF_IN_BUFFER suppresses errors */
      }
    if (vpid->pageid < 0)                 /* <- always-on cheap check */
      {
        er_set (ER_FATAL_ERROR_SEVERITY, ARG_FILE_LINE, ER_PB_BAD_PAGEID, 2, ...);
        return NULL;                      /* fatal: ER_FATAL_ERROR_SEVERITY */
      }

The page-validation block runs only when debug validation is armed and
the mode is not RECOVERY_PAGE (recovery may legitimately fix pages that disk
metadata says do not exist). For OLD_PAGE_IF_IN_BUFFER the final
argument is true, suppressing the error log since “not valid” is normal.
The pivotal transformation comes next: if condition == PGBUF_UNCONDITIONAL_LATCH and pgbuf_find_current_wait_msecs (thread_p)
is LK_ZERO_WAIT or LK_FORCE_ZERO_WAIT, condition is silently set to
PGBUF_CONDITIONAL_LATCH.

Invariant: a zero-wait transaction never blocks on a page latch. The downgrade happens here, before any hashing or latching, and everything downstream keys off
condition. Skipping it would let a zero-wait transaction sleep indefinitely in pgbuf_latch_bcb_upon_fix.
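The downgrade is a pure function of the requested condition and the transaction's wait setting. In the sketch below, effective_condition is a hypothetical name, and the numeric values given to the LK_* constants are assumptions made so the example compiles; only their symbolic names appear in this document.

```c
#include <assert.h>

/* Assumed numeric values; the text only names these constants. */
enum
{
  LK_INFINITE_WAIT = -1,
  LK_FORCE_ZERO_WAIT = -2,
  LK_ZERO_WAIT = 0
};

typedef enum
{
  PGBUF_UNCONDITIONAL_LATCH,
  PGBUF_CONDITIONAL_LATCH
} PGBUF_LATCH_CONDITION;

/* Hypothetical helper: the condition the rest of the fix path will see. */
static PGBUF_LATCH_CONDITION
effective_condition (PGBUF_LATCH_CONDITION requested, int wait_msecs)
{
  if (requested == PGBUF_UNCONDITIONAL_LATCH
      && (wait_msecs == LK_ZERO_WAIT || wait_msecs == LK_FORCE_ZERO_WAIT))
    {
      return PGBUF_CONDITIONAL_LATCH;   /* zero-wait never blocks */
    }
  return requested;             /* everything else is left untouched */
}
```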
3.3 The try_again loop and the interrupt check
Perf tracking is sampled just before the try_again: label, the loop
re-entry point:

    // pgbuf_fix_release -- src/storage/page_buffer.c
    try_again:
      if (logtb_get_check_interrupt (thread_p) == true)
        if (logtb_is_interrupted (thread_p, true, &pgbuf_Pool.check_for_interrupts) == true)
          {
            er_set (ER_ERROR_SEVERITY, ARG_FILE_LINE, ER_INTERRUPTED, 0);
            PGBUF_BCB_CHECK_MUTEX_LEAKS ();   /* <- assert no mutex held on exit */
            return NULL;
          }

The interrupt check sits inside the loop, so every retry re-checks
for interruption. Exactly one statement jumps back to try_again: the
miss path’s pgbuf_claim_bcb_for_fix returning NULL with its retry
out-parameter set (a BCB-claim race; Chapter 4). The §3.2 guards sit
above the label and run once.

    // pgbuf_fix_release -- src/storage/page_buffer.c (miss fork, retry edge)
    bufptr = pgbuf_claim_bcb_for_fix (thread_p, vpid, fetch_mode, hash_anchor, &perf, &retry, false);
    if (bufptr == NULL)
      {
        if (retry)
          {
            retry = false;
            goto try_again;   /* <- the only re-entry */
          }
        ASSERT_ERROR ();
        return NULL;
      }

pgbuf_fix_with_retry is a thin wrapper around pgbuf_fix, not part of
the loop. It re-calls pgbuf_fix while it returns NULL, switching on
er_errid (): NO_ERROR/ER_INTERRUPTED retry without bumping i; the
three timeout errors (ER_LK_UNILATERALLY_ABORTED, ER_LK_PAGE_TIMEOUT,
ER_PAGE_LATCH_TIMEDOUT) do i++; anything else sets noretry. The loop
breaks (with ER_PAGE_LATCH_ABORTED) once noretry || i > retry, so
interrupts never consume retry budget and any other error exits at once.
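The retry accounting of the wrapper can be exercised in isolation by replacing pgbuf_fix with a scripted stand-in. Everything in this sketch is hypothetical scaffolding (SIM_ERROR, FIX_SCRIPT, scripted_fix, and the signed return convention); only the budget rule, where timeouts count, interrupts do not, and anything else aborts, is taken from the description above.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified error classes standing in for the er_errid () values. */
typedef enum
{
  SIM_NO_ERROR,
  SIM_INTERRUPTED,
  SIM_LATCH_TIMEOUT,            /* any of the three timeout errors */
  SIM_OTHER_ERROR
} SIM_ERROR;

/* A scripted "fix": fails with each listed error in turn, then succeeds. */
typedef struct
{
  const SIM_ERROR *failures;
  int count;
  int next;
} FIX_SCRIPT;

static int
scripted_fix (FIX_SCRIPT * script, SIM_ERROR * err)
{
  if (script->next < script->count)
    {
      *err = script->failures[script->next++];
      return 0;                 /* NULL page in the real code */
    }
  *err = SIM_NO_ERROR;
  return 1;
}

/* Returns the number of attempts made: positive if the fix finally
 * succeeded, negative if the wrapper gave up. */
static int
fix_with_retry (FIX_SCRIPT * script, int retry)
{
  int i = 0, attempts = 0, noretry = 0;
  SIM_ERROR err;

  for (;;)
    {
      attempts++;
      if (scripted_fix (script, &err))
        {
          return attempts;
        }
      switch (err)
        {
        case SIM_NO_ERROR:
        case SIM_INTERRUPTED:
          break;                /* retried without spending budget */
        case SIM_LATCH_TIMEOUT:
          i++;                  /* spends one unit of budget */
          break;
        default:
          noretry = 1;          /* any other error exits at once */
          break;
        }
      if (noretry || i > retry)
        {
          return -attempts;
        }
    }
}
```

Note that the exit test is i > retry, so a budget of retry tolerates retry timeouts and gives up on the next one.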
3.4 Hashing the VPID
The page table is indexed by PGBUF_HASH_VALUE, which calls
pgbuf_hash_func_mirror:

    // pgbuf_hash_func_mirror -- src/storage/page_buffer.c
    #define HASH_SIZE_BITS 20   /* 2^20 ~ 1M anchors, fixed */
    #define VOLID_LSB_BITS 8
      reverse_mask = 1 << (HASH_SIZE_BITS - 1);            /* top bit of the 20-bit space */
      for (i = VOLID_LSB_BITS; i > 0; i--)                 /* bit-reverse low 8 volid bits */
        {
          if (volid_lsb & lsb_mask)
            reversed_volid_lsb |= reverse_mask;
          reverse_mask >>= 1;
          lsb_mask <<= 1;
        }
      hash_val = vpid->pageid ^ reversed_volid_lsb;        /* XOR pageid with mirrored volid */
      hash_val = hash_val & ((1 << HASH_SIZE_BITS) - 1);   /* clamp to 2^20 buckets */

The “mirror” trick bit-reverses the low 8 volid bits into the top of the 20-bit space, then XORs with the pageid (which dominates the low bits), so different volumes get disjoint high-bit signatures and adjacent ids across volumes do not share chains.
Two sibling helpers serve the Aout victim-history mht table (Chapter 7),
not the main page table: pgbuf_hash_vpid is a generic modulo hash,
((vpid->pageid | ((unsigned int) vpid->volid) << 24) % htsize), and
pgbuf_compare_vpid is its ordering callback (same volume ⇒ pageid
difference, else volid difference). The main buf_hash_table uses
pgbuf_hash_func_mirror only, comparing via VPID_EQ.
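To see what the mirror hash produces, the excerpt from pgbuf_hash_func_mirror can be completed into a runnable function. This is a sketch: the VPID struct is reduced to two fields, hash_mirror is a hypothetical name, and the initialization of volid_lsb and lsb_mask (elided in the excerpt) is assumed to be the low volid bits and 1 respectively.

```c
#include <assert.h>
#include <stdint.h>

#define HASH_SIZE_BITS 20
#define VOLID_LSB_BITS 8

typedef struct
{
  int32_t pageid;
  int16_t volid;
} VPID;

static unsigned int
hash_mirror (const VPID * vpid)
{
  /* Assumed initial values; the excerpt does not show them. */
  unsigned int volid_lsb = ((unsigned int) vpid->volid) & ((1u << VOLID_LSB_BITS) - 1);
  unsigned int lsb_mask = 1u;
  unsigned int reverse_mask = 1u << (HASH_SIZE_BITS - 1);
  unsigned int reversed_volid_lsb = 0;
  unsigned int hash_val;
  int i;

  /* volid bit k (k = 0..7) lands on hash bit 19 - k */
  for (i = VOLID_LSB_BITS; i > 0; i--)
    {
      if (volid_lsb & lsb_mask)
        {
          reversed_volid_lsb |= reverse_mask;
        }
      reverse_mask >>= 1;
      lsb_mask <<= 1;
    }

  hash_val = ((unsigned int) vpid->pageid) ^ reversed_volid_lsb;
  return hash_val & ((1u << HASH_SIZE_BITS) - 1);
}
```

For example, page 5 of volume 0 hashes to bucket 5, while page 5 of volume 1 hashes to 0x80005: the same page id in neighbouring volumes lands half the table apart.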
3.5 The lock-free read-only fast path
Before grabbing any anchor mutex, a read fix of a present page tries to
fix without locking. The guard requires all of: request_mode == PGBUF_LATCH_READ, fetch_mode in the three eligible modes, and
condition == PGBUF_UNCONDITIONAL_LATCH, so after the §3.2 downgrade a
zero-wait transaction is ineligible. On a non-NULL pgbuf_lockfree_fix_ro
it bumps num_hit and goto fast_path, bypassing the hash walk and the
latch pass. The function does a lock-free chain walk, then a CAS on the
BCB latch word:

    // pgbuf_lockfree_fix_ro -- src/storage/page_buffer.c
    bufptr = pgbuf_search_hash_chain_no_bcb_lock (thread_p, &pgbuf_Pool.buf_hash_table[PGBUF_HASH_VALUE (vpid)], vpid);
    if (bufptr == NULL)
      return NULL;   /* not resident -> slow path */
    do
      {
        impl = get_impl (&bufptr->atomic_latch);
        new_impl = impl;
        if (impl.impl.latch_mode != PGBUF_LATCH_READ            /* must already be read-latched */
            || impl.impl.waiter_exists || impl.impl.fcnt == 0   /* no writer queued, still held */
            || bufptr->vpid.pageid != vpid->pageid              /* re-validate identity ... */
            || bufptr->vpid.volid != vpid->volid)               /* ... against ABA reuse */
          return NULL;          /* any failure -> slow path */
        new_impl.impl.fcnt++;   /* bump fix count */
      }
    while (!bufptr->atomic_latch.compare_exchange_weak (impl.raw, new_impl.raw, std::memory_order_acq_rel, std::memory_order_acquire));

Invariant: the fast path only adds a reader to an already-read-held BCB. The CAS refuses unless the latch is
PGBUF_LATCH_READ, fcnt != 0, and has no waiter, so it never upgrades from free/write, never starves a queued writer, and the in-loop VPID re-check defeats ABA. Any failure returns NULL to the slow path; never an error.

The chain walk it uses, pgbuf_search_hash_chain_no_bcb_lock, is bare: it
pointer-chases hash_anchor->hash_next returning the first VPID_EQ
match, with no mutex or trylock; the CAS above does all synchronization.
On a successful CAS the function still has holder bookkeeping to do before
returning the page: pgbuf_find_thrd_holder either finds the caller is
already a holder (bump holder->fix_count, set hold_has_read_latch) or,
in SERVER_MODE, allocates a fresh holder via
pgbuf_allocate_thrd_holder_entry (a NULL return there is assert(false)
followed by a NULL return). Only then does
CAST_BFPTR_TO_PGPTR produce the PAGE_PTR and the caller reaches fast_path:.
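The same "add a reader only if already read-held" CAS can be written with C11 atomics. This is a sketch under loud assumptions: the bit layout of the latch word (fix count in bits 0-15, mode in bits 16-17, waiter flag in bit 18), the mode encoding, the BCB struct, and the names pack_latch and try_add_reader are all hypothetical, and the fix-count overflow guard is an addition of this sketch. It only illustrates the validate-then-CAS loop, not the real packed representation.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Assumed mode encoding and bit layout (not taken from the source). */
enum { LATCH_NONE = 0, LATCH_READ = 1, LATCH_WRITE = 2 };
#define FCNT_MASK   0xFFFFu
#define MODE_SHIFT  16
#define MODE_MASK   (0x3u << MODE_SHIFT)
#define WAITER_BIT  (1u << 18)

typedef struct
{
  _Atomic uint32_t latch;       /* packed mode / waiter flag / fix count */
  int32_t pageid;
  int16_t volid;
} BCB;

static uint32_t
pack_latch (unsigned int mode, int waiter, unsigned int fcnt)
{
  return (fcnt & FCNT_MASK) | ((mode & 0x3u) << MODE_SHIFT) | (waiter ? WAITER_BIT : 0u);
}

static unsigned int
latch_fcnt (uint32_t word)
{
  return word & FCNT_MASK;
}

/* Returns 1 if a reader was added, 0 if the caller must take the slow path. */
static int
try_add_reader (BCB * bcb, int32_t pageid, int16_t volid)
{
  uint32_t old = atomic_load_explicit (&bcb->latch, memory_order_acquire);

  for (;;)
    {
      if (((old & MODE_MASK) >> MODE_SHIFT) != LATCH_READ   /* must be read-held */
          || (old & WAITER_BIT) != 0                        /* a waiter is queued */
          || (old & FCNT_MASK) == 0                         /* nobody holds it */
          || (old & FCNT_MASK) == FCNT_MASK                 /* sketch-only overflow guard */
          || bcb->pageid != pageid || bcb->volid != volid)  /* identity re-check */
        {
          return 0;
        }
      /* On failure the CAS reloads 'old', and the loop re-validates it. */
      if (atomic_compare_exchange_weak_explicit (&bcb->latch, &old, old + 1,
                                                 memory_order_acq_rel,
                                                 memory_order_acquire))
        {
          return 1;
        }
    }
}
```

Because every rejected case returns 0 without modifying the word, a caller can always fall back to the locked path without any cleanup.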
3.6 The locked hash-chain walk and the hit/miss fork
If the fast path is skipped or returns NULL, the slow path sets
hash_anchor, clears buf_lock_acquired, and calls
pgbuf_search_hash_chain. If that returns a direct-victim BCB
(pgbuf_bcb_is_direct_victim), pgbuf_bcb_update_flags (..., PGBUF_BCB_INVALIDATE_DIRECT_VICTIM_FLAG, ...) tells the victim-waiter it
cannot use this BCB.
The anchor it walks is one slot of the buf_hash_table[], a
PGBUF_BUFFER_HASH:
| Field | Role | Why |
|---|---|---|
| hash_mutex | per-bucket pthread_mutex_t (SERVER_MODE only) | Serializes chain insert/remove and the buffer-lock chain; the only mutex phase two holds while walking. Per-bucket, not global, so different buckets hash concurrently. |
| hash_next | head of the BCB hash chain (PGBUF_BCB *) | The chain pgbuf_search_hash_chain pointer-chases via each BCB’s own hash_next; resident pages for this bucket live here. |
| lock_next | head of the buffer-lock chain (PGBUF_BUFFER_LOCK *) | Records VPIDs a thread has claimed but not yet inserted (the miss path, Chapter 4), so a second fixer for the same VPID waits instead of double-claiming. Also protected by hash_mutex. |
pgbuf_search_hash_chain is the workhorse: a two-phase search with an
exact return contract — non-NULL ⇒ caller holds bufptr->mutex (not
the hash mutex); NULL ⇒ caller holds hash_anchor->hash_mutex.
Phase one (one_phase:) walks the chain without the hash mutex,
trying a non-blocking PGBUF_BCB_TRYLOCK on the matched BCB. The
load-bearing core, per matched bufptr:

    // pgbuf_search_hash_chain -- src/storage/page_buffer.c (one_phase core)
    rv = PGBUF_BCB_TRYLOCK (bufptr);
    if (rv != 0)
      {
        if (rv != EBUSY)
          goto two_phase;                   /* trylock error -> escalate */
        PGBUF_BCB_LOCK (bufptr);            /* EBUSY -> block on the bcb mutex */
      }
    if (!VPID_EQ (&(bufptr->vpid), vpid))   /* bcb reused under us? */
      {
        PGBUF_BCB_UNLOCK (bufptr);
        goto one_phase;                     /* <- restart phase 1 */
      }
    break;                                  /* matched + locked -> return bufptr */

Three branches leave phase one (Figure 3-1): clean trylock + VPID recheck
(return bufptr); EBUSY → blocking PGBUF_BCB_LOCK then recheck; and a
non-EBUSY error escalating via goto two_phase. The post-lock recheck
catches a slot repurposed between match and lock.
Phase two (two_phase:/try_again:) re-runs the same walk under the hash
mutex, differing in three points: on a clean trylock it unlocks the hash
mutex before returning; on EBUSY it unlocks the hash mutex before the
blocking PGBUF_BCB_LOCK and re-validates via goto try_again; and a
non-EBUSY failure is fatal — er_set_with_oserror (ER_CSS_PTHREAD_MUTEX_TRYLOCK) then return NULL.
Invariant: lock ordering is hash mutex then BCB mutex, never the reverse. Phase two always drops the hash mutex before a blocking
PGBUF_BCB_LOCK; inverting it would deadlock insert/remove paths. The ER_CSS_PTHREAD_MUTEX_TRYLOCK branch is the one place the function returns NULL while not holding the hash mutex: a fatal OS failure.
flowchart TD
A["pgbuf_search_hash_chain"] --> B["one_phase: walk chain, no hash mutex"]
B --> C{"VPID match?"}
C -- "no, end of chain" --> TP["two_phase"]
C -- "yes" --> D["PGBUF_BCB_TRYLOCK"]
D -- "rv==0" --> E{"VPID still equal?"}
D -- "EBUSY" --> F["PGBUF_BCB_LOCK block"]
D -- "other err" --> TP
F --> E
E -- "no, reused" --> B
E -- "yes" --> R1["return bufptr, holds bcb mutex"]
TP --> G["lock hash_mutex; walk chain"]
G --> H{"VPID match?"}
H -- "no, end" --> R2["return NULL, holds hash mutex"]
H -- "yes" --> I["PGBUF_BCB_TRYLOCK"]
I -- "rv==0 or EBUSY" --> JK["unlock hash_mutex; if EBUSY PGBUF_BCB_LOCK"]
I -- "other err" --> ERR["fatal: return NULL"]
JK --> L{"VPID still equal?"}
L -- "no" --> G
L -- "yes" --> R3["return bufptr, holds bcb mutex"]
Back in pgbuf_fix_release, the returned bufptr drives the hit/miss
fork into three outcomes:
- Hit (bufptr != NULL): increment num_hit; if NEW_PAGE, assert the page is clean-LSA or dirty (a NEW_PAGE re-using a buffered, invalidated page). Control falls through to pgbuf_bcb_register_fix and the latch pass (Chapter 5).
- OLD_PAGE_IF_IN_BUFFER miss: this mode never reads from disk, so unlock the hash mutex and return NULL. It is the only mode that short-circuits a miss.
- General miss: call pgbuf_claim_bcb_for_fix (Chapter 4). On NULL with retry, goto try_again; on NULL without retry, ASSERT_ERROR and return NULL; on success, set buf_lock_acquired = true and continue to the page-VPID check.
3.7 Post-claim VPID re-check and the maybe_deallocated branch
After a hit or a successful claim the caller holds bufptr->mutex;
pgbuf_bcb_register_fix and pgbuf_set_bcb_page_vpid run, then page
identity is re-validated:

    // pgbuf_fix_release -- src/storage/page_buffer.c
    maybe_deallocated = (fetch_mode == OLD_PAGE_MAYBE_DEALLOCATED);
    if (pgbuf_check_bcb_page_vpid (bufptr, maybe_deallocated) != true)
      {
        if (buf_lock_acquired)
          {
            pgbuf_put_bcb_into_invalid_list (thread_p, bufptr);   /* releases bcb mutex */
            (void) pgbuf_unlock_page (thread_p, hash_anchor, vpid, true);
          }
        else
          {
            PGBUF_BCB_UNLOCK (bufptr);                            /* hit case: just unlock */
          }
        PGBUF_BCB_CHECK_MUTEX_LEAKS ();
        return NULL;
      }
    if (fetch_mode == OLD_PAGE_PREVENT_DEALLOC)
      pgbuf_bcb_register_avoid_deallocation (bufptr);             /* pin against dealloc */

The maybe_deallocated flag relaxes pgbuf_check_bcb_page_vpid so a
deallocated VPID is not a failure for OLD_PAGE_MAYBE_DEALLOCATED. The
cleanup branch differs by ownership: a fresh claim (buf_lock_acquired)
is recycled to the invalid list and the page lock dropped; a hit only
unlocks the BCB. Past here the function enters the latch pass (Chapter 5)
and, on success, jumps to fast_path: where the §3.1 PAGE_UNKNOWN
switch runs — the last place fetch mode steers the result.
3.8 Chapter summary — key takeaways
- pgbuf_fix_release is a state machine: four early-return validations, then a try_again loop whose only re-entry edge is a BCB-claim race via pgbuf_claim_bcb_for_fix’s retry out-parameter.
- A zero-wait transaction (LK_ZERO_WAIT/LK_FORCE_ZERO_WAIT) has its unconditional fix rewritten to conditional before hashing.
- The lock-free fast path (pgbuf_lockfree_fix_ro) covers read latches in the three eligible modes under an unconditional request; its CAS only adds a reader to an already-read-held, waiter-free BCB, re-validating the VPID against ABA.
- pgbuf_hash_func_mirror bit-reverses the low 8 volid bits into the top of a 20-bit space and XORs with pageid; pgbuf_hash_vpid/pgbuf_compare_vpid belong to the separate Aout mht table.
- pgbuf_search_hash_chain is two-phase; its return contract (non-NULL ⇒ holds BCB mutex; NULL ⇒ holds hash mutex) and strict hash-then-BCB lock ordering are load-bearing invariants.
- The hit/miss fork: hit → latch pass (Ch. 5); OLD_PAGE_IF_IN_BUFFER miss → immediate NULL; general miss → BCB claim (Ch. 4) with a retry.
- Fetch mode steers four points (fast-path eligibility, the miss short-circuit, the maybe_deallocated re-check, and the final PAGE_UNKNOWN switch), so any fix bug starts with the caller’s mode.
Chapter 4: Miss Handling BCB Claim and the PGBUF Allocation Lock
Chapter 3 left us at the point where a pgbuf_fix lookup finds that the page is not in the page
table. This chapter answers: how does a thread reserve the VPID
against racing allocators, obtain a fresh BCB from the invalid list or a
victim, read the page bytes, and insert the BCB into the hash chain?
The companion (cubrid-page-buffer-manager.md §“How a page fix flows”,
§“PGBUF lock”) sketches Step 1 / Step 2; here we trace every branch,
assuming the reader knows the PGBUF_BCB layout, the five zones, and the
victim sources from Chapters 1-3 and the companion’s §“LFCQ”.
The miss path is a four-layer onion: pgbuf_claim_bcb_for_fix (outer
coordinator) takes the per-bucket VPID lock via pgbuf_lock_page, calls
pgbuf_allocate_bcb (source selector: invalid list, then victim, then
sleep on the direct-victim queue), whose cheapest source is
pgbuf_get_bcb_from_invalid_list. Victim search (pgbuf_get_victim) and
the direct-victim hand-off are Chapter 9 black boxes.
4.1 pgbuf_invalid_list — the free pool, every field
The invalid list is the pool of BCBs bound to no page (all BCBs at server start; error-rolled-back or invalidated BCBs at runtime). It is a LIFO stack guarded by one mutex.
// struct pgbuf_invalid_list -- src/storage/page_buffer.c
struct pgbuf_invalid_list
{
#if defined(SERVER_MODE)
  pthread_mutex_t invalid_mutex;   /* integrity of the singly-linked list */
#endif
  PGBUF_BCB *invalid_top;          /* head of the list (LIFO) */
  int invalid_cnt;                 /* # of entries */
};

| Field | Role | Why it exists |
|---|---|---|
| invalid_mutex | Serializes push/pop of the stack | Without it two poppers could grab the same head. SERVER_MODE only — SA mode is single-threaded. |
| invalid_top | Head pointer; chain runs through bufptr->next_BCB | An invalid BCB is on no LRU list, so it reuses its next_BCB LRU pointer as the invalid-chain link — no separate field. |
| invalid_cnt | Live count of free BCBs | Read by quota math (pgbuf_adjust_quotas, Ch 10); also a fast “is the pool exhausted?” probe before taking the mutex. |
Invariant — invalid_top chains exclusively through next_BCB, and a
BCB is on the invalid list iff its zone is PGBUF_INVALID_ZONE.
pgbuf_get_bcb_from_invalid_list flips the popped BCB to
PGBUF_VOID_ZONE; pgbuf_put_bcb_into_invalid_list flips it back and
asserts (bufptr->flags & PGBUF_BCB_FLAGS_MASK) == 0 — a BCB returning
to the pool must carry no dirty/flushing/victim flag. If violated, the
BCB re-enters the free pool still advertising a pending flush, and a
later claimer treats stale page bytes as a clean fresh page.
flowchart LR
  IT["invalid_top"] --> B1["BCB a"]
  B1 -->|next_BCB| B2["BCB b"]
  B2 -->|next_BCB| B3["BCB c"]
  B3 -->|next_BCB| NUL["NULL"]
Figure 4-1 — The invalid list is a LIFO stack threaded through each BCB’s next_BCB pointer; invalid_cnt tracks its length.
pgbuf_get_bcb_from_invalid_list — double-checked-locking pop
This is pgbuf_allocate_bcb’s cheapest source. It pops one BCB, with an
unlocked emptiness probe up front so the common “pool empty” case never
touches the mutex.
// pgbuf_get_bcb_from_invalid_list -- src/storage/page_buffer.c
  if (pgbuf_Pool.buf_invalid_list.invalid_top == NULL)   /* (1) fast path: empty */
    return NULL;                                         /* no mutex taken */
  rv = pthread_mutex_lock (&pgbuf_Pool.buf_invalid_list.invalid_mutex);
  if (pgbuf_Pool.buf_invalid_list.invalid_top == NULL)   /* (2) re-check under mutex */
    { pthread_mutex_unlock (...); return NULL; }         /* someone emptied it */
  else                                                   /* (3) pop the LIFO top */
    {
      bufptr = pgbuf_Pool.buf_invalid_list.invalid_top;
      pgbuf_Pool.buf_invalid_list.invalid_top = bufptr->next_BCB;   /* advance head */
      pgbuf_Pool.buf_invalid_list.invalid_cnt -= 1;
      pthread_mutex_unlock (...);
      PGBUF_BCB_LOCK (bufptr);                           /* now hold bufptr->mutex */
      bufptr->next_BCB = NULL;                           /* sever invalid-chain link */
      pgbuf_bcb_change_zone (thread_p, bufptr, 0, PGBUF_VOID_ZONE);   /* INVALID -> VOID */
      return bufptr;
    }

Three branches: (1) the unlocked empty check returns NULL (no mutex).
(2) The post-mutex re-check returns NULL if a racing popper drained the
list between the two reads. (3) The pop advances invalid_top,
decrements invalid_cnt, drops the list mutex, then locks the BCB, nulls
its chain link, and flips it to PGBUF_VOID_ZONE; the BCB is returned under
bufptr->mutex.
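The double-checked pop can be sketched in isolation as follows. This is a minimal model under stated assumptions, not the CUBRID code: Bcb, Zone and InvalidList are hypothetical stand-ins, the head pointer is made a std::atomic so the unlocked probe is well-defined in C++ (the excerpt reads a plain pointer), and the per-BCB mutex taken after the pop is omitted.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>

// Hypothetical stand-ins for PGBUF_BCB and two of its zones.
enum class Zone { Invalid, Void };

struct Bcb {
    Bcb *next = nullptr;        // plays the role of next_BCB
    Zone zone = Zone::Invalid;
};

struct InvalidList {
    std::mutex mtx;                   // plays the role of invalid_mutex
    std::atomic<Bcb *> top{nullptr};  // plays the role of invalid_top
    int count = 0;                    // plays the role of invalid_cnt; guarded by mtx

    void push(Bcb *b) {
        std::lock_guard<std::mutex> guard(mtx);
        b->zone = Zone::Invalid;
        b->next = top.load(std::memory_order_relaxed);
        top.store(b, std::memory_order_release);
        ++count;
    }

    // Returns nullptr when the list is empty, otherwise the former top.
    Bcb *pop() {
        if (top.load(std::memory_order_acquire) == nullptr) {
            return nullptr;           // (1) unlocked probe: list looks empty
        }
        Bcb *b = nullptr;
        {
            std::lock_guard<std::mutex> guard(mtx);
            b = top.load(std::memory_order_relaxed);
            if (b == nullptr) {
                return nullptr;       // (2) drained between the probe and the lock
            }
            top.store(b->next, std::memory_order_release);  // (3) advance the head
            --count;
        }
        assert(b->zone == Zone::Invalid);
        b->next = nullptr;            // sever the chain link
        b->zone = Zone::Void;         // INVALID -> VOID
        return b;
    }
};
```

The first probe may observe a stale value, which is harmless here: a false "empty" only sends the caller to the next source, and a false "non-empty" is corrected by the re-check under the mutex.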
4.2 pgbuf_claim_bcb_for_fix — the outer coordinator
The fix path calls this on a page-table miss. Its contract is unusual: it
is entered holding hash_anchor->hash_mutex and may exit having
released it, having set *try_again, or having returned a fully-loaded
BCB under bufptr->mutex. The four exit branches:
// pgbuf_claim_bcb_for_fix -- src/storage/page_buffer.c
  /* Branch A: a prior trylock on the bucket failed -> bail, no retry. */
  if (er_errid () == ER_CSS_PTHREAD_MUTEX_TRYLOCK)
    { pthread_mutex_unlock (&hash_anchor->hash_mutex); return NULL; }

  /* Branch B: take the VPID lock; hash_mutex is released inside. */
  if (!already_locked && pgbuf_lock_page (...) != PGBUF_LOCK_HOLDER)
    { *try_again = true; return NULL; }                /* <- LOSER of a same-VPID race */

  bufptr = pgbuf_allocate_bcb (thread_p, vpid);
  if (bufptr == NULL)                                  /* Branch C: pool dirty / interrupted */
    { ASSERT_ERROR (); (void) pgbuf_unlock_page (..., true); return NULL; }

  /* Branch D: success. Scrub the fresh BCB. */
  bufptr->vpid = *vpid;
  /* atomic_latch <- {PGBUF_NO_LATCH, waiter=false, fcnt=0}; clears stale victim latch */
  pgbuf_bcb_update_flags (..., 0, PGBUF_BCB_ASYNC_FLUSH_REQ);   /* clear stray flag */
  LSA_SET_NULL (&bufptr->oldest_unflush_lsa);                   /* nothing unflushed yet */

Branch A — a failed trylock on the bucket leaves
ER_CSS_PTHREAD_MUTEX_TRYLOCK in the error slot; drop the mutex and return
NULL without touching *try_again (the caller pre-initialized it to
false, so its goto try_again does not fire and pgbuf_fix’s own retry
loop re-drives the lookup).
Branch B is the race protocol: already_locked is true only for the
dealloc-aware caller. HOLDER → we own the VPID, fall through. WAITER →
another thread is allocating it; we already slept in pgbuf_lock_page,
so set *try_again = true and the caller’s goto try_again re-runs the
lookup and hits the BCB the winner inserted.
Branch C (allocation failure, §4.4) must undo the VPID lock: it
holds no mutex, so pgbuf_unlock_page(..., true) re-acquires
hash_anchor->hash_mutex to unlink the record.
Branch D scrubs the BCB; the atomic_latch reset clears a victim’s
stale PGBUF_LATCH_INVALID (the real latch is acquired in Chapter 5).
Bytes are then loaded (§4.5).
flowchart TD
S["enter holding hash_mutex"] --> A{"errid == TRYLOCK?"}
A -- yes --> AR["unlock hash_mutex; return NULL\ntry_again untouched, stays false"]
A -- no --> B{"pgbuf_lock_page\n== HOLDER?"}
B -- "WAITER" --> BR["try_again=true; return NULL\n-> caller re-looks-up, hits"]
B -- "HOLDER" --> C["pgbuf_allocate_bcb"]
C --> D{"bufptr == NULL?"}
D -- yes --> DR["unlock_page need_hash=true\nreturn NULL; propagate error"]
D -- no --> E["init BCB; load bytes -> 4.5"]
E --> G["return BCB under bufptr->mutex"]
Figure 4-2 — pgbuf_claim_bcb_for_fix branch map. Only B-WAITER and the success branch leave the caller a different state to act on. The two error branches unwind what they hold: Branch A drops hash_mutex, Branch C releases the VPID lock.
4.3 pgbuf_lock_page / pgbuf_unlock_page — the VPID race lock
The PGBUF lock is not the BCB latch and not the bucket mutex. It
is a logical lock keyed on the VPID, on a chain hanging off the same hash
bucket as the BCB chain: when no BCB exists yet for a VPID, it ensures
exactly one thread allocates it. The lock record pgbuf_buffer_lock is
statically pre-allocated one per thread (no malloc on the hot path):
// struct pgbuf_buffer_lock -- src/storage/page_buffer.c
struct pgbuf_buffer_lock
{
  VPID vpid;                       /* the VPID being allocated */
  PGBUF_BUFFER_LOCK *lock_next;    /* next record on this bucket's lock chain */
#if defined(SERVER_MODE)
  THREAD_ENTRY *next_wait_thrd;    /* list of threads blocked on this VPID */
#endif
};

pgbuf_lock_page is entered holding hash_anchor->hash_mutex and
always releases it before returning. Two branches:
// pgbuf_lock_page -- src/storage/page_buffer.c
  for (cur = hash_anchor->lock_next; cur != NULL; cur = cur->lock_next)
    if (VPID_EQ (&cur->vpid, vpid))                    /* LOSER: VPID already being allocated */
      {
        cur_thrd_entry->next_wait_thrd = cur->next_wait_thrd;
        cur->next_wait_thrd = cur_thrd_entry;          /* push onto head of waiter list */
        pgbuf_sleep (cur_thrd_entry, &hash_anchor->hash_mutex);   /* releases mutex, sleeps */
        if (cur_thrd_entry->resume_status != THREAD_PGBUF_RESUMED)
          { /* woke for interrupt: re-take mutex, splice self out of waiter list */ }
        return PGBUF_LOCK_WAITER;
      }
  /* WINNER: VPID absent. Claim this thread's static record. */
  cur = &pgbuf_Pool.buf_lock_table[cur_thrd_entry->index];
  cur->vpid = *vpid;
  cur->next_wait_thrd = NULL;
  cur->lock_next = hash_anchor->lock_next;
  hash_anchor->lock_next = cur;                        /* link at head */
  pthread_mutex_unlock (&hash_anchor->hash_mutex);
  return PGBUF_LOCK_HOLDER;

Loser branch (VPID on the chain): push onto the head of that record’s
next_wait_thrd list; pgbuf_sleep releases hash_mutex and suspends.
Because pgbuf_unlock_page wakes every waiter, the insertion order does not
decide who runs next.
The resume_status != THREAD_PGBUF_RESUMED sub-branch handles an
interrupt that woke the thread without the winner unlocking: it re-takes
hash_mutex and splices itself out of the waiter list so a later
pgbuf_unlock_page does not wake a departed thread. Result is WAITER.
Winner branch (VPID absent): claim this thread’s static record, link
at the chain head, return HOLDER.
Invariant — at most one BCB-less allocation per VPID is in flight.
Enforced because the winner installs its record under hash_mutex
before releasing it, so any later scanner under the same mutex sees the
record and becomes a waiter. Two winners would create two BCBs for one
VPID — the lookup would be nondeterministic and one copy’s writes lost.
pgbuf_unlock_page is the mirror. need_hash_mutex says whether to
acquire hash_anchor->hash_mutex itself (error paths, true) or whether
the caller already holds it (success path after
pgbuf_insert_into_hash_chain, false).
// pgbuf_unlock_page -- src/storage/page_buffer.c
  if (need_hash_mutex)
    pthread_mutex_lock (&hash_anchor->hash_mutex);
  /* find this VPID's record; if found, unlink it ... */
  if (cur != NULL)
    {
      /* splice out of lock_next chain */
      pthread_mutex_unlock (&hash_anchor->hash_mutex);
      while ((t = cur->next_wait_thrd) != NULL)        /* wake EVERY waiter */
        {
          cur->next_wait_thrd = t->next_wait_thrd;
          t->next_wait_thrd = NULL;
          pgbuf_wakeup_uncond (t);
        }
    }
  else
    pthread_mutex_unlock (&hash_anchor->hash_mutex);   /* record gone (error case) */

It unlinks the record, drops the mutex, then wakes all waiters. Each woken loser re-runs the fix, finds the BCB in the table (inserted before unlocking on the success path), and proceeds via the Chapter 3 hash-hit path. Waking after dropping the mutex avoids immediate re-contention.
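The winner/loser protocol of this section can be modeled without threads. The sketch below is a single-threaded illustration under stated assumptions: every name (Bucket, LockRecord, lock_page, unlock_page) is hypothetical, the caller is assumed to hold the bucket mutex, and "sleeping" is reduced to recording the waiter's thread id.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Vpid { int volid; int pageid; };

inline bool vpid_eq(const Vpid &a, const Vpid &b) {
    return a.volid == b.volid && a.pageid == b.pageid;
}

enum class LockResult { Holder, Waiter };

// Hypothetical per-thread lock record; 'waiters' stands in for the next_wait_thrd chain.
struct LockRecord {
    Vpid vpid{};
    LockRecord *lock_next = nullptr;
    std::vector<int> waiters;
};

struct Bucket {
    LockRecord *lock_chain = nullptr;  // plays the role of hash_anchor->lock_next
};

// 'records' holds one preallocated record per thread, indexed by thread id.
inline LockResult lock_page(Bucket &bucket, std::vector<LockRecord> &records,
                            int thread_id, const Vpid &vpid) {
    for (LockRecord *cur = bucket.lock_chain; cur != nullptr; cur = cur->lock_next) {
        if (vpid_eq(cur->vpid, vpid)) {
            cur->waiters.push_back(thread_id);  // loser: park on the winner's record
            return LockResult::Waiter;
        }
    }
    LockRecord &mine = records[static_cast<std::size_t>(thread_id)];
    mine.vpid = vpid;                           // winner: publish the record
    mine.waiters.clear();
    mine.lock_next = bucket.lock_chain;
    bucket.lock_chain = &mine;
    return LockResult::Holder;
}

// Unlinks the record for 'vpid' and returns the ids of the threads to wake.
inline std::vector<int> unlock_page(Bucket &bucket, const Vpid &vpid) {
    LockRecord **link = &bucket.lock_chain;
    while (*link != nullptr && !vpid_eq((*link)->vpid, vpid)) {
        link = &(*link)->lock_next;
    }
    if (*link == nullptr) {
        return {};                              // record already gone
    }
    LockRecord *rec = *link;
    *link = rec->lock_next;
    rec->lock_next = nullptr;
    std::vector<int> woken;
    woken.swap(rec->waiters);
    return woken;
}
```

Because the winner's record is linked before the bucket is released, any later caller scanning the same chain finds it and becomes a waiter, which is the single-winner invariant stated above.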
4.4 pgbuf_allocate_bcb — the source selector
The VPID-lock winner needs an actual BCB. The selector tries three sources in cost order.
// pgbuf_allocate_bcb -- src/storage/page_buffer.c
  bufptr = pgbuf_get_bcb_from_invalid_list (thread_p);   /* Source 1: free list, cheapest */
  if (bufptr != NULL)
    return bufptr;                                       /* short-circuit: SKIPS the 'end:' victimize */
  bufptr = pgbuf_get_victim (thread_p);                  /* Source 2: scan LFCQs */
  if (bufptr != NULL)
    goto end;                                            /* victim still needs pgbuf_victimize_bcb */

Source 1 short-circuits: a BCB off the invalid list (§4.1) is bound to
no page and has no flags, so it returns immediately — it does not reach
end: and so is not victimized. Source 2 is different: by the time
pgbuf_get_victim returns, the victim has already been unlinked from its
LRU list and flipped to PGBUF_VOID_ZONE — pgbuf_get_victim_from_lru_list
calls pgbuf_remove_from_lru_list (which does the unlink + zone flip)
before handing the BCB back (§4.6). What still remains is the BCB’s link
in the hash chain and its old latch. That is precisely why Source 2
must fall through to end:: pgbuf_victimize_bcb is what detaches the hash
chain and invalidates the latch. So the goto end is about hash-chain and
latch cleanup, not about LRU unlinking, which already happened inside
pgbuf_get_victim.
If both fail, behavior forks on build mode and daemon availability. In SERVER_MODE with the flush daemon up, the thread enqueues on a direct-victim waiter queue and suspends with a timeout:
// pgbuf_allocate_bcb -- src/storage/page_buffer.c (SERVER_MODE, flush daemon up)
retry:
  high_priority = high_priority || VACUUM_IS_THREAD_VACUUM (thread_p)
                  || pgbuf_is_thread_high_priority (thread_p);
  thread_lock_entry (thread_p);
  if (high_priority)
    waiter_threads_high_priority->produce (thread_p);
  else if (!waiter_threads_low_priority->produce (thread_p))   /* low queue jammed */
    {
      if (!waiter_threads_high_priority->produce (thread_p))
        { assert (false); goto end; }
    }
  pgbuf_wakeup_page_flush_daemon (thread_p);           /* ensure SOMEONE will feed us */
  r = thread_suspend_timeout_wakeup_and_unlock_entry (..., THREAD_ALLOC_BCB_SUSPENDED);

high_priority is true for vacuum threads, threads already holding a
hot-page latch, or on a retry. The low-priority produce-failure
sub-branch guards a preempted-consumer wedge: if the low queue cannot
accept, the thread is pushed to the high queue. After enqueueing it wakes
the flush daemon and suspends. On wake, four sub-branches:
- Normal handoff (THREAD_ALLOC_BCB_RESUMED): a producer put a BCB in this thread’s slot; pgbuf_get_direct_victim reads it and we goto end.
- Stolen-back: pgbuf_get_direct_victim returns NULL (the BCB was re-fixed between assign and get, companion §“Direct victim hand-off”); set high_priority and goto retry.
- Interrupt/shutdown (other resume_status): undo any half-assigned victim, then raise ER_INTERRUPTED so the claim path takes Branch C.
- r != NO_ERROR (the timeout else): asserts no timeout, re-stamps resume_status, and is a can’t-happen path under the assert.
The no-daemon else (SA mode or crash recovery) cannot sleep for a
producer, so it flushes via pgbuf_wakeup_page_flush_daemon, re-scans
with pgbuf_get_victim, and asserts a victim now exists. The shared tail
victimizes any acquired victim:
// pgbuf_allocate_bcb -- src/storage/page_buffer.c
end:
  if (bufptr != NULL)
    {
      if (pgbuf_victimize_bcb (thread_p, bufptr) != NO_ERROR)
        { assert (false); bufptr = NULL; }
    }
  else if (er_errid () == NO_ERROR)
    er_set (..., ER_PB_ALL_BUFFERS_DIRTY, ...);
  return bufptr;

pgbuf_victimize_bcb re-checks victimizability under the BCB mutex,
unlinks the BCB from its hash chain (pgbuf_delete_from_hash_chain), and
stamps PGBUF_LATCH_INVALID into the atomic latch. It does not change
the zone — the victim was already moved to VOID by
pgbuf_remove_from_lru_list during pgbuf_get_victim (§4.6).
Invalid-list BCBs skip this (never in a chain). If bufptr is still
NULL, ER_PB_ALL_BUFFERS_DIRTY is set so Branch C has an error to
propagate.
Invariant — every BCB leaving pgbuf_allocate_bcb non-NULL is detached
from any hash chain and LRU list, held under bufptr->mutex.
Invalid-list BCBs never attached; victims detach via
pgbuf_remove_from_lru_list + pgbuf_victimize_bcb. A still-linked BCB
leaking out would let the claimer bind a second VPID onto a slot still
reachable by its old VPID.
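The source order and the "only victims are victimized" rule can be sketched as a small selector. This is an illustration under stated assumptions: allocate_bcb_sketch, Source and the two injected callables are hypothetical, victimization is reduced to flipping two flags, and the direct-victim sleep is replaced by returning a null result.

```cpp
#include <cassert>
#include <functional>

// Hypothetical BCB reduced to the two properties victimization changes in the text.
struct Bcb {
    bool in_hash_chain = false;
    bool latch_invalid = false;
};

enum class Source { InvalidList, Victim, None };

struct AllocResult {
    Bcb *bcb;
    Source source;
};

// get_invalid and get_victim are injected stand-ins for the two BCB sources.
inline AllocResult allocate_bcb_sketch(const std::function<Bcb *()> &get_invalid,
                                       const std::function<Bcb *()> &get_victim) {
    if (Bcb *b = get_invalid()) {
        assert(!b->in_hash_chain);        // a free BCB was never linked
        return {b, Source::InvalidList};  // short-circuit: no victimization
    }
    if (Bcb *b = get_victim()) {
        b->in_hash_chain = false;         // victimize: detach from the hash chain
        b->latch_invalid = true;          // victimize: invalidate the old latch
        return {b, Source::Victim};
    }
    return {nullptr, Source::None};       // the traced code would wait or flush here
}
```

Either way the returned BCB is unreachable through the hash chain, which is the property the invariant above depends on.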
4.5 Loading the page bytes — NEW_PAGE vs read vs DWB
Back in pgbuf_claim_bcb_for_fix (Branch D), the initialized BCB needs
its page bytes. The fork is on fetch_mode.
// pgbuf_claim_bcb_for_fix -- src/storage/page_buffer.c (read branch)
  if (fetch_mode != NEW_PAGE)
    {
      /* DWB first: a torn-write copy may be fresher than the volume. */
      if (dwb_read_page (thread_p, vpid, &...iopage, &success) != NO_ERROR)
        { assert (false); return NULL; }               /* (1) DWB error: can't-happen */
      else if (success == true)
        { /* copied from DWB, no disk read */ }
      else if (fileio_read (...) == NULL)              /* (2) volume read failed */
        {
          ASSERT_ERROR ();
          pgbuf_put_bcb_into_invalid_list (thread_p, bufptr);   /* releases bufptr->mutex */
          (void) pgbuf_unlock_page (..., true);
          return NULL;
        }
      /* (3) decrypt if TDE-protected; on failure roll back like (2) */
      if (tde_algo != TDE_ALGORITHM_NONE && tde_decrypt_data_page (...) != NO_ERROR)
        {
          ASSERT_ERROR ();
          pgbuf_put_bcb_into_invalid_list (...);
          pgbuf_unlock_page (..., true);
          return NULL;
        }
      if (pgbuf_is_temporary_volume (vpid->volid) && !pgbuf_is_temp_lsa (...))   /* temp first-touch */
        {
          pgbuf_init_temp_page_lsa (...);
          pgbuf_set_dirty_buffer_ptr (thread_p, bufptr);
        }
    }

The read branch honors DWB-first ordering (companion §“Double
Write Buffer”): dwb_read_page sets success when the double-write
buffer holds a copy of this VPID, short-circuiting the disk read. Only on
a DWB miss does fileio_read hit the volume. There are three error
sub-branches. (1) The DWB-error guard is a can’t-happen path: in the
excerpt it asserts and returns NULL with no further cleanup. (2) The
volume-read failure and (3) the TDE-decrypt failure each roll the BCB back
via pgbuf_put_bcb_into_invalid_list (nulls the VPID, sets
PGBUF_LATCH_INVALID, flips the zone to INVALID, releases bufptr->mutex)
then pgbuf_unlock_page(..., true), so half-read bytes never linger in the
pool. The temp first-touch sub-branch stamps a sentinel temp LSA and marks
the page dirty.
// pgbuf_claim_bcb_for_fix -- src/storage/page_buffer.c (NEW_PAGE branch)
  else
    {
      if (pgbuf_is_temporary_volume (vpid->volid))
        pgbuf_init_temp_page_lsa (&...iopage, IO_PAGESIZE);
      else
        fileio_init_lsa_of_page (&...iopage, IO_PAGESIZE);
      if (bufptr->vpid.volid > NULL_VOLID)             /* perm: mark page immature */
        { ...iopage.prv.pageid = -1; ...iopage.prv.volid = -1; }
    }
  return bufptr;

The NEW_PAGE branch has nothing on disk to read, so it initializes the
in-page LSA and (for permanent volumes) stamps prv.pageid/volid = -1 to
mark the page immature — the real identity is written later by
pgbuf_set_bcb_page_vpid (§4.6). It cannot fail, so there is no rollback
sub-branch. Both branches return the loaded BCB under bufptr->mutex.
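The load ordering of this section (NEW_PAGE skips I/O, otherwise DWB before volume, rollback on a failed volume read) can be sketched with injected hooks. Everything here is hypothetical naming for illustration: load_page_sketch, LoadHooks and the enums are not CUBRID identifiers, and the TDE-decrypt and temp-LSA steps are left out.

```cpp
#include <cassert>
#include <functional>

enum class FetchMode { OldPage, NewPage };
enum class LoadResult { FromDwb, FromVolume, Initialized, Failed };

// Hypothetical hooks standing in for the calls named in the excerpt.
struct LoadHooks {
    std::function<bool()> dwb_has_copy;    // models the 'success' flag of dwb_read_page
    std::function<bool()> volume_read_ok;  // models fileio_read returning non-NULL
    std::function<void()> rollback;        // models put-into-invalid-list plus unlock_page
};

inline LoadResult load_page_sketch(FetchMode mode, const LoadHooks &hooks) {
    if (mode == FetchMode::NewPage) {
        return LoadResult::Initialized;    // nothing on disk to read
    }
    if (hooks.dwb_has_copy()) {
        return LoadResult::FromDwb;        // DWB hit: the volume is never touched
    }
    if (hooks.volume_read_ok()) {
        return LoadResult::FromVolume;     // DWB miss: read from the volume
    }
    hooks.rollback();                      // failed read: return the BCB, drop the VPID lock
    return LoadResult::Failed;
}
```

Passing the I/O steps in as callables keeps the ordering testable without a disk; the traced function calls them directly.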
4.6 pgbuf_set_bcb_page_vpid and the hash-chain insertion
The fix path (Chapter 5 territory, shown for the bind step) stamps the
page identity and inserts the BCB. pgbuf_set_bcb_page_vpid has three
branches:
// pgbuf_set_bcb_page_vpid -- src/storage/page_buffer.c
  if (bufptr == NULL || VPID_ISNULL (&bufptr->vpid))   /* (A) guard: nothing to stamp */
    { assert (bufptr != NULL); assert (!VPID_ISNULL (&bufptr->vpid)); return; }
  if (bufptr->vpid.volid > NULL_VOLID)                 /* permanent volume only */
    {
      if (prv.pageid == NULL_PAGEID && prv.volid == NULL_VOLID)   /* (B) first time */
        {
          prv.pageid = bufptr->vpid.pageid;            /* write identity into header */
          prv.volid = bufptr->vpid.volid;
          prv.ptype = PAGE_UNKNOWN;                    /* + p_reserve_1/2, tde_nonce zeroed */
        }
      else                                             /* (C) already stamped */
        {
          assert (prv.volid == bufptr->vpid.volid);    /* values not reset on dealloc */
          assert (prv.pageid == bufptr->vpid.pageid);
        }                                              /* identity must match -- no rewrite */
    }

(A) the top guard: a NULL BCB or null VPID is a caller bug — it
asserts and returns without touching bytes. (B) first-time path
(immature NEW_PAGE sentinel from §4.5): writes the VPID into the
in-page header, making the bytes self-identifying on disk. (C) the
else — a re-allocated/already-stamped perm page (the in-page identity
survives deallocation): the function leaves the bytes untouched and only
asserts that the stored prv.volid/pageid still equal the BCB’s VPID.
Temp pages (volid <= NULL_VOLID) fall through all three with no action.
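The stamping branches can be sketched as a pure function over a page header. This is a minimal model, not the CUBRID function: stamp_identity_sketch, PageHeader and StampOutcome are hypothetical, the -1 sentinels are assumed to equal NULL_PAGEID and NULL_VOLID as the NEW_PAGE branch in §4.5 implies, and guard (A) is omitted.

```cpp
#include <cassert>

constexpr int kNullPageId = -1;  // assumed value of NULL_PAGEID
constexpr int kNullVolId = -1;   // assumed value of NULL_VOLID

struct Vpid { int volid; int pageid; };

// Hypothetical in-page header reduced to the two identity fields.
struct PageHeader {
    int volid = kNullVolId;
    int pageid = kNullPageId;
};

enum class StampOutcome { Skipped, Stamped, AlreadyStamped };

inline StampOutcome stamp_identity_sketch(const Vpid &vpid, PageHeader &prv) {
    if (vpid.volid <= kNullVolId) {
        return StampOutcome::Skipped;          // temp volume: header left alone
    }
    if (prv.pageid == kNullPageId && prv.volid == kNullVolId) {
        prv.pageid = vpid.pageid;              // (B) first time: write the identity
        prv.volid = vpid.volid;
        return StampOutcome::Stamped;
    }
    assert(prv.volid == vpid.volid);           // (C) stored identity must still match
    assert(prv.pageid == vpid.pageid);
    return StampOutcome::AlreadyStamped;
}
```

Calling it twice with the same VPID shows the idempotence the text describes: the first call writes the header, the second only checks it.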
// pgbuf_insert_into_hash_chain -- src/storage/page_buffer.c
  pthread_mutex_lock (&hash_anchor->hash_mutex);
  bufptr->hash_next = hash_anchor->hash_next;
  hash_anchor->hash_next = bufptr;                     /* link at head */
  /* hash_mutex stays held; released by the following pgbuf_unlock_page (need_hash_mutex=false) */

pgbuf_insert_into_hash_chain links the BCB at the head of the bucket’s
BCB chain and deliberately keeps hash_mutex held — the
immediately-following pgbuf_unlock_page(..., false) unlinks the VPID
lock record and releases the same mutex. Holding it across both means a
racing loser cannot see a window where the BCB is in the chain but the
VPID lock is already gone (which would let it wrongly become a winner).
Where does the BCB land in the LRU? At this point the BCB is in
PGBUF_VOID_ZONE. An invalid-list BCB was put there by
pgbuf_get_bcb_from_invalid_list; a victim was moved there by
pgbuf_remove_from_lru_list (call chain pgbuf_get_victim -> pgbuf_get_victim_from_lru_list -> pgbuf_remove_from_lru_list, whose tail
does the pgbuf_bcb_change_zone (..., PGBUF_VOID_ZONE)), and
pgbuf_victimize_bcb afterward only detaches it from the hash chain and
invalidates the latch. VOID means “on no list yet.” The BCB does not
enter an LRU list during claim/insert; it lands in a zone only at the
matching unfix, where pgbuf_unfix routes it to LRU 2 normally,
or LRU 1 when the fixer is a vacuum worker or the VPID was an
Aout-history hit (boost) — Chapter 7’s subject.
stateDiagram-v2
  [*] --> INVALID: server start
  INVALID --> VOID: pop from invalid list
  Victim --> VOID: pgbuf_remove_from_lru_list during get_victim
  VOID --> LRU2: unfix normal -> Ch 7
  VOID --> LRU1: unfix vacuum or Aout-hit -> Ch 7
  VOID --> INVALID: read error rollback
Figure 4-3 — Zone trajectory of a claimed BCB. The claim path lands it in VOID; LRU placement happens at unfix (Chapter 7). Any read error returns it straight to INVALID.
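The arrows of Figure 4-3 can be written down as a transition check, which is handy when reading assertions about zones. The enum and function below are hypothetical, and the sketch makes one loud assumption: it treats every LRU zone as a possible victim source, because the figure only says "Victim" and does not name the zone.

```cpp
#include <cassert>

// Hypothetical zone enum covering the five zones named in this document.
enum class Zone { Invalid, Void, Lru1, Lru2, Lru3 };

// True when the move matches one of the arrows in Figure 4-3.
inline bool claim_path_allows(Zone from, Zone to) {
    switch (from) {
    case Zone::Invalid:
        return to == Zone::Void;      // pop from the invalid list
    case Zone::Void:
        return to == Zone::Lru1       // unfix by vacuum or after an Aout hit
            || to == Zone::Lru2       // normal unfix
            || to == Zone::Invalid;   // read-error rollback
    case Zone::Lru1:
    case Zone::Lru2:
    case Zone::Lru3:
        return to == Zone::Void;      // victim removal (assumed: any LRU zone)
    }
    return false;
}
```

Note what is absent: there is no edge from INVALID straight to an LRU zone, which matches the text's point that list placement is deferred to unfix.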
4.7 Chapter summary — key takeaways
- pgbuf_claim_bcb_for_fix has four exits: trylock-bail (Branch A), loser-retry with *try_again=true (B), allocation-failure unwind (C), and success returning a loaded BCB under bufptr->mutex (D).
- The PGBUF lock makes same-VPID allocation single-winner. pgbuf_lock_page returns HOLDER to one thread and parks the rest as WAITERs on a per-record waiter list; the winner inserts the BCB and pgbuf_unlock_page wakes everyone, who re-drive the fix and hit it.
- pgbuf_allocate_bcb short-circuits the invalid list (returns without victimizing); a victim falls through to end: for pgbuf_victimize_bcb. When both fail, server mode sleeps on a direct-victim queue with four wake sub-branches (handoff, stolen-back, interrupt, r != NO_ERROR timeout-assert); SA/recovery flushes inline.
- pgbuf_get_bcb_from_invalid_list uses double-checked locking: an unlocked empty-list fast return, a post-mutex re-check return, and the pop branch that advances invalid_top, decrements invalid_cnt, locks the BCB and flips it to VOID. The volume-read and TDE-decrypt failure paths funnel through pgbuf_put_bcb_into_invalid_list + pgbuf_unlock_page(..., true), so no half-bound slot leaks.
- DWB is consulted before disk on every read miss (dwb_read_page sets success, which skips fileio_read on a DWB hit); NEW_PAGE never reads or fails, marking permanent pages immature (prv.pageid/volid = -1). pgbuf_set_bcb_page_vpid then stamps identity only on the first-time branch; on a re-allocated perm page it leaves bytes untouched and only asserts the stored identity still matches the VPID.
- The claim path leaves the BCB in VOID. A victim reaches VOID via pgbuf_remove_from_lru_list during pgbuf_get_victim, not via pgbuf_victimize_bcb, which only detaches the hash chain and invalidates the latch. Hash-chain insertion holds hash_mutex across the VPID-lock release to close the winner/loser race window; LRU-zone placement is deferred to unfix in Chapter 7.
Chapter 5: The BCB Atomic Latch Acquire Block and Wake
Chapter 3 left us with a BCB in hand and its mutex held. The question:
given a fixer that owns the BCB mutex, how does the per-page read/write latch
decide compatibility, block an incompatible request on a per-page waiter list,
time it out, and wake waiters in order on release? For latch semantics at
design level see the high-level companion’s “Latch modes and the fix protocol”
section; here we trace every branch.
The subsystem rests on one 64-bit word — the atomic_latch — plus a singly
linked next_wait_thrd queue on the BCB. The BCB mutex serializes the queue;
the latch word is mutated by lock-free CAS so a waiter being woken can publish
its grant without re-taking the mutex.
5.1 The packed latch word: pgbuf_atomic_latch_impl
The latch is std::atomic<uint64_t> atomic_latch, accessed through a union:
// union pgbuf_atomic_latch_impl -- src/storage/page_buffer.c
union pgbuf_atomic_latch_impl
{
  uint64_t raw;                    /* the word actually CAS'd */
  struct
  {
    PGBUF_LATCH_MODE latch_mode;   /* enum:uint16_t NO_LATCH=0, READ=1, WRITE=2, FLUSH=3 */
    uint16_t waiter_exists;        /* 1 if next_wait_thrd has a R/W waiter */
    int32_t fcnt;                  /* number of granted fixes */
  } impl;
};

| Field | Role | Why it exists |
|---|---|---|
| raw | uint64_t payload compare_exchange_* operates on | Moves the three fields atomically in one CAS; a torn read is impossible only because they share this word. The layout (16 + 16 + 32 bits) exactly fills 64. |
| impl.latch_mode | Current grant mode. PGBUF_LATCH_FLUSH is only a block mode, never a grant — the header at PGBUF_LATCH_MODE says so. | The compatibility decision (5.3) reads this first; PGBUF_NO_LATCH doubles as the “idle” sentinel. |
| impl.waiter_exists | A hint, not a count: true once a R/W request is queued. | The writer-starvation guard (5.3, Case 1). |
| impl.fcnt | Total fix count across holders sharing the current mode | At 0 the page can switch mode or be victimized; compared against one holder’s fix_count for “am I the only reader”. |
get_impl snapshots with an acquire load. The helper mutators
(set_latch, add_fcnt, set_waiter_exists, set_latch_and_fcnt,
set_latch_and_add_fcnt) each run a load/compare_exchange_weak loop.
pgbuf_latch_bcb_upon_fix does not call those helpers: it computes the
whole new_impl from a fresh old_impl and retries a single
compare_exchange_strong on bufptr->atomic_latch — so its entire decision
tree is one atomic transition, recomputed on contention.
Invariant — the latch word transitions atomically; no partial publish. Every mutation is do { old = get_impl(); ...build new... } while (!CAS(old, new)). With two separate stores a concurrent fixer could observe latch_mode == READ with a stale fcnt, grant a compatible read, and corrupt holder accounting. The single-word CAS forbids that torn intermediate. The decision-tree CAS uses the strong form (no spurious failure); the per-field helpers and the error/wakeup repair loops use the weak form inside their retry loops.
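The snapshot, rebuild, CAS pattern described above can be sketched in a few lines. This is an assumption-laden illustration rather than the CUBRID helpers: LatchFields, pack, unpack and add_fix are hypothetical names, and std::memcpy is used in place of the union so the field access stays well-defined in standard C++.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical mirror of the latch modes; values follow the comment in the union excerpt.
enum LatchMode : std::uint16_t { NoLatch = 0, Read = 1, Write = 2, Flush = 3 };

struct LatchFields {
    std::uint16_t latch_mode;
    std::uint16_t waiter_exists;
    std::int32_t fcnt;
};
static_assert(sizeof(LatchFields) == sizeof(std::uint64_t),
              "the three fields must fill one 64-bit word");

inline std::uint64_t pack(const LatchFields &f) {
    std::uint64_t raw = 0;
    std::memcpy(&raw, &f, sizeof raw);
    return raw;
}

inline LatchFields unpack(std::uint64_t raw) {
    LatchFields f{};
    std::memcpy(&f, &raw, sizeof f);
    return f;
}

// Adds one fix in a single atomic transition and returns the published fcnt.
inline std::int32_t add_fix(std::atomic<std::uint64_t> &word) {
    std::uint64_t old_raw = word.load(std::memory_order_acquire);
    LatchFields next{};
    do {
        next = unpack(old_raw);  // rebuild from the freshest snapshot
        next.fcnt += 1;
    } while (!word.compare_exchange_weak(old_raw, pack(next),
                                         std::memory_order_acq_rel,
                                         std::memory_order_acquire));
    return next.fcnt;
}
```

On a failed exchange compare_exchange_weak refreshes old_raw with the current word, so each retry recomputes from up-to-date fields and no reader can see a mode paired with a stale count.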
flowchart LR
  BCB["pgbuf_bcb"] --> AL["atomic_latch\n(uint64_t word)"]
  BCB --> NWT["next_wait_thrd\n(THREAD_ENTRY*)"]
  AL --> IMPL["impl: latch_mode | waiter_exists | fcnt"]
  NWT --> T1["THREAD_ENTRY\nrequest_latch_mode\nrequest_fix_count\nwait_for_latch_promote"]
  T1 --> T2["THREAD_ENTRY ..."]
  TH["THREAD_ENTRY\nm_holder_anchor"] --> HL["thrd_hold_list"]
  HL --> H1["pgbuf_holder\nfix_count, bufptr"]
  H1 --> H2["pgbuf_holder (thrd_link)"]
Figure 5-1 — The latch word and the two lists on it: the per-BCB waiter queue
(next_wait_thrd) and the per-thread holder list (thrd_hold_list).
5.2 Holder bookkeeping: pgbuf_holder
A pgbuf_holder records this thread’s fixes on one BCB. The latch word
counts fixes globally; the holder counts my slice so promotion and unfix can
be reasoned about locally.
// struct pgbuf_holder -- src/storage/page_buffer.c
struct pgbuf_holder
{
  int fix_count;                   /* the count of fix by the holder */
  PGBUF_BCB *bufptr;               /* pointer to BCB */
  PGBUF_HOLDER *thrd_link;         /* next holder in this thread's hold list */
  PGBUF_HOLDER *next_holder;       /* next in this thread's *free* list */
  PGBUF_HOLDER_STAT perf_stat;
#if !defined(NDEBUG)
  char fixed_at[64 * 1024];        /* call-site trail for leak debugging */
  int fixed_at_size;
#endif
  int watch_count;                 /* number of PGBUF_WATCHERs on this holder */
  PGBUF_WATCHER *first_watcher;
  PGBUF_WATCHER *last_watcher;
};

| Field | Role | Why it exists |
|---|---|---|
| fix_count | How many times this thread fixed this BCB | old_impl.impl.fcnt == holder->fix_count (5.3) means “global == mine”, i.e. only fixer. At 0 the holder is recycled. |
| bufptr | Back-pointer to the recorded BCB | pgbuf_find_thrd_holder matches on this. |
| thrd_link | Next holder in this thread’s in-use list (thrd_hold_list) | Links the many pages a thread holds; next_holder must be NULL while on this list (asserted in pgbuf_find_thrd_holder). |
| next_holder | Next holder in this thread’s free list (thrd_free_list) | next_holder meaningful only while free, thrd_link only while in use. Never both. |
| perf_stat | PGBUF_HOLDER_STAT bitfield (dirty_before_hold, dirtied_by_holder, hold_has_write_latch, hold_has_read_latch) | Feeds perfmon; the hold_has_* bits are set when the latch is granted in any branch of pgbuf_latch_bcb_upon_fix. |
| fixed_at / fixed_at_size | Fixed-size buffer holding the concatenated call-site trail (file:line of each fix) and its length | Leak / double-fix debugging in non-NDEBUG builds only; absent from release builds. |
| watch_count / first_watcher / last_watcher | Ordered-fix watcher chain (Chapter 10) | Must be 0 / NULL before recycle — pgbuf_remove_thrd_holder asserts watch_count == 0. |
Invariant — a holder lives on exactly one of the two per-thread lists. While in use it is on thrd_hold_list with next_holder == NULL (pgbuf_find_thrd_holder asserts this on every node it walks); while free it is on thrd_free_list reached via next_holder, and thrd_link is dead. The free-list and hold-list links never both point somewhere at once.
Three helpers maintain the lists. pgbuf_allocate_thrd_holder_entry pops
thrd_free_list if non-empty (no global mutex); else takes
free_holder_set_mutex, carves the next element from the shared
free_holder_set, and grows it with a fresh malloced PGBUF_HOLDER_SET when
free_index == -1. Either way the holder is pushed onto thrd_hold_list:
// pgbuf_allocate_thrd_holder_entry -- src/storage/page_buffer.c
holder->next_holder = NULL;                              /* disconnect from free list */
holder->thrd_link = thrd_holder_info->thrd_hold_list;    /* push onto hold list */
thrd_holder_info->thrd_hold_list = holder;
thrd_holder_info->num_hold_cnt += 1;
holder->first_watcher = NULL;
holder->last_watcher = NULL;
holder->watch_count = 0;

pgbuf_find_thrd_holder walks thrd_hold_list for the holder whose bufptr
matches, else NULL; its assert (holder->next_holder == NULL) enforces the
“never on both lists” invariant. pgbuf_remove_thrd_holder asserts
fix_count == 0 and watch_count == 0, prepends the holder to thrd_free_list
first, then unlinks it from thrd_hold_list (head special-case, else walk to
the predecessor); a missing entry trips assert (false) and returns
ER_FAILED.
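The two-list discipline can be sketched with a pair of intrusive links. The names below (Holder, ThreadHolders, allocate, remove) are hypothetical, the shared free_holder_set fallback is reduced to returning null, and the sketch unlinks from the hold list before pushing onto the free list, which differs from the order described above but keeps the not-found case side-effect free.

```cpp
#include <cassert>

// Hypothetical holder reduced to the fields that matter for list membership.
struct Holder {
    int fix_count = 0;
    Holder *thrd_link = nullptr;    // link used while on the hold list
    Holder *next_holder = nullptr;  // link used while on the free list
};

struct ThreadHolders {
    Holder *hold_list = nullptr;
    Holder *free_list = nullptr;
    int num_hold_cnt = 0;

    // Pops a holder from the free list and pushes it onto the hold list.
    Holder *allocate() {
        Holder *h = free_list;
        if (h == nullptr) {
            return nullptr;         // the traced code would carve from a shared pool here
        }
        free_list = h->next_holder;
        h->next_holder = nullptr;   // disconnect from the free list
        h->thrd_link = hold_list;   // push onto the hold list
        hold_list = h;
        ++num_hold_cnt;
        return h;
    }

    // Moves a fully released holder back to the free list.
    // Returns false when the holder is not on the hold list.
    bool remove(Holder *h) {
        assert(h->fix_count == 0);
        Holder **link = &hold_list;
        while (*link != nullptr && *link != h) {
            link = &(*link)->thrd_link;
        }
        if (*link == nullptr) {
            return false;
        }
        *link = h->thrd_link;       // unlink from the hold list
        h->thrd_link = nullptr;
        h->next_holder = free_list; // prepend to the free list
        free_list = h;
        --num_hold_cnt;
        return true;
    }
};
```

At every point after a call returns, a holder has at most one live link, which is the "never on both lists" invariant in miniature.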
5.3 pgbuf_latch_bcb_upon_fix — the compatibility decision tree
The caller holds the BCB mutex; a scope_exit unlock_BCB guard releases it on
every exit unless a branch .release()s it. It looks up the caller’s holder
once, then runs a do { ...recompute new_impl... } while (!compare_exchange_strong)
loop. request_fcnt starts at 1 and is reset at the top of every retry.
flowchart TB
S["snapshot old_impl; new_impl = old_impl\nrequest_fcnt = 1"] --> IDLE{"buf_lock_acquired\nor latch_mode == NO_LATCH ?"}
IDLE -- yes --> SETIDLE["is_page_idle=true\nnormalize old to clean idle\nnew: mode=request, fcnt=1"]
IDLE -- no --> C1{"READ req on\nREAD-latched page ?"}
C1 -- yes --> W{"waiter_exists ?"}
W -- no --> GR1["can_latch=true; fcnt++"]
W -- yes --> OWN{"holder != NULL ?"}
OWN -- yes --> GR2["can_latch=true; fcnt++"]
OWN -- no --> BLK1["can_latch=false\nwriter-starvation guard"]
C1 -- no --> H{"holder != NULL ?"}
H -- no --> BLK2["Case 3: can_latch=false\nwaiter_exists=true"]
H -- yes --> WR{"latch_mode == WRITE ?"}
WR -- yes --> GR3["Sub 2-1: can_latch=true; fcnt++"]
WR -- no --> SOLE{"old fcnt == holder fix_count ?"}
SOLE -- yes --> GR4["Sub 2-2: in-place promote\nmode=WRITE, fcnt=1"]
SOLE -- no --> COND{"CONDITIONAL ?"}
COND -- yes --> CFAIL["can_latch=false\nwaiter_exists=true, then reject"]
COND -- no --> PROM["promote_needed=true\nfcnt -= holder fix_count\nwaiter_exists=true"]
Figure 5-2 — Every branch of the new_impl computation. The loop CASes
old_impl.raw to new_impl.raw with compare_exchange_strong; on failure it
re-snapshots and re-walks the whole tree.
Idle short-circuit. If buf_lock_acquired (fresh BCB, Chapter 4) or the
page is PGBUF_NO_LATCH, the code normalizes old_impl to a clean idle state
before building new_impl — this matters for the CAS expectation:
// pgbuf_latch_bcb_upon_fix -- src/storage/page_buffer.c
if (is_page_idle == true)
  {
    old_impl.impl.waiter_exists = false;               /* <- expect a clean word */
    old_impl.impl.latch_mode = PGBUF_NO_LATCH;
    old_impl.impl.fcnt = 0;
    new_impl = old_impl;
    new_impl.impl.latch_mode = request_mode;
    new_impl.impl.fcnt = 1;                            /* grant */
  }

(In SA_MODE only, a non-idle page with no holder is a leaked latch — the code
assert (0)s and treats it as idle.)
Case 1 — R on R. No waiter: grant (can_latch = true; fcnt++). Waiter
present: the reader may join only if already a holder (re-entrant); a
brand-new reader (holder == NULL) blocks.
Invariant: readers yield to a queued writer. Once waiter_exists is set, only
re-entrant readers may join an R-latch (the holder == NULL test). If violated,
a stream of fresh readers would indefinitely defer the queued writer.
Case 2, caller already a holder (not R-on-R, so the page is WRITE-latched or the caller is an R-holder asking for WRITE):

- Sub-2-1, page WRITE-latched: re-fix (R or W) is a pure passthrough, can_latch = true; fcnt++ (the W-holder shortcut).
- Sub-2-2, page READ-latched, requesting WRITE (in-place promotion): if old_impl.impl.fcnt == holder->fix_count the caller is the sole reader, so the latch flips to WRITE in place (mode = WRITE; fcnt = 1). If other readers exist, a PGBUF_CONDITIONAL_LATCH sets waiter_exists and falls to rejection; an unconditional one sets promote_needed, deducts its own fixes (new_impl.impl.fcnt -= holder->fix_count), sets waiter_exists, and blocks as a tail waiter (see 5.4).
The “unreachable in-place upgrade” the high-level companion mentions refers to a
historically-removed contended upgrade; the sole-reader in-place flip
(sub-2-2) is reachable today, alongside the dedicated promotion entry point
pgbuf_promote_read_latch_debug and the one-promoter assert in
pgbuf_block_bcb (see summary item 3).
Case 3 — caller not a holder, request incompatible (W on R, R/W on W by a
stranger): can_latch = false; waiter_exists = true; the thread blocks.
After the CAS succeeds the function dispatches on the outcome flags:
- is_page_idle or can_latch (granted): allocate a holder (idle / stranger paths) or bump the existing holder’s fix_count, set perf bits, update latch_last_thread, return NO_ERROR.
- promote_needed: roll the holder’s read fixes into request_fcnt (request_fcnt += holder->fix_count), zero the holder, pgbuf_remove_thrd_holder, fall into the block path.
- block/promote + PGBUF_CONDITIONAL_LATCH: reject with ER_FAILED (raise ER_LK_PAGE_TIMEOUT first if the txn’s wait_msec == LK_ZERO_WAIT).
- block/promote, unconditional: unlock_BCB.release(), call pgbuf_block_bcb(..., as_promote = false); on return the latch is held, so allocate a holder with fix_count = request_fcnt, set *is_latch_wait = true, return NO_ERROR.
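The branch tree above can be sketched as a single CAS retry loop over a packed word. This is a minimal illustration, not the CUBRID implementation: the `sk_` names, the bit positions chosen for the three fields, and the use of `holder_fix_count == 0` to mean "not a holder" are all assumptions, and the contended-promotion path is folded into the plain blocked outcome.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for the real latch word; names and bit positions are assumed. */
enum sk_mode { SK_NO_LATCH = 0, SK_READ = 1, SK_WRITE = 2 };

/* Pack mode (16 bits), waiter flag (16 bits) and fix count (32 bits) into one word. */
static uint64_t
sk_pack (enum sk_mode mode, bool waiter, uint32_t fcnt)
{
  return ((uint64_t) mode << 48) | ((uint64_t) (waiter ? 1u : 0u) << 32) | (uint64_t) fcnt;
}

static enum sk_mode sk_mode_of (uint64_t w) { return (enum sk_mode) (w >> 48); }
static bool sk_waiter_of (uint64_t w) { return ((w >> 32) & 0xFFFFu) != 0; }
static uint32_t sk_fcnt_of (uint64_t w) { return (uint32_t) (w & 0xFFFFFFFFu); }

/* Returns true when the latch is granted. holder_fix_count == 0 means the
 * caller holds no fix on the page. Every retry recomputes the whole decision. */
static bool
sk_try_latch (_Atomic uint64_t *latch, enum sk_mode request, uint32_t holder_fix_count)
{
  uint64_t old_w = atomic_load (latch);
  uint64_t new_w;
  bool granted;

  do
    {
      enum sk_mode mode = sk_mode_of (old_w);
      bool waiter = sk_waiter_of (old_w);
      uint32_t fcnt = sk_fcnt_of (old_w);

      if (mode == SK_NO_LATCH)
        {                       /* idle short-circuit */
          granted = true;
          new_w = sk_pack (request, false, 1);
        }
      else if (mode == SK_READ && request == SK_READ)
        {                       /* Case 1: fresh readers yield to a queued waiter */
          granted = !waiter || holder_fix_count > 0;
          new_w = granted ? sk_pack (SK_READ, waiter, fcnt + 1) : old_w;
        }
      else if (holder_fix_count > 0 && mode == SK_WRITE)
        {                       /* Sub-2-1: W-holder passthrough */
          granted = true;
          new_w = sk_pack (SK_WRITE, waiter, fcnt + 1);
        }
      else if (holder_fix_count > 0 && fcnt == holder_fix_count)
        {                       /* Sub-2-2: sole reader promotes in place */
          granted = true;
          new_w = sk_pack (SK_WRITE, waiter, 1);
        }
      else
        {                       /* Case 3 (and contended promotion, simplified) */
          granted = false;
          new_w = sk_pack (mode, true, fcnt);
        }
    }
  while (!atomic_compare_exchange_strong (latch, &old_w, new_w));

  return granted;
}
```

Because `atomic_compare_exchange_strong` writes the current value back into `old_w` on failure, the loop body always decides against a fresh snapshot, which is the property the text relies on.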
5.4 pgbuf_block_bcb — enqueue and sleep
The caller holds the BCB mutex with waiter_exists true (asserted). It stamps
request_latch_mode and request_fix_count (the count to credit fcnt with on
wake), then enqueues by the as_promote flag:
```c
// pgbuf_block_bcb -- src/storage/page_buffer.c
cur_thrd_entry->request_latch_mode = request_mode;
cur_thrd_entry->request_fix_count = request_fcnt;   /* SPECIAL_NOTE */
if (as_promote)
  {
    /* Safe guard: there can be only one promoter. */
    assert (bufptr->next_wait_thrd == NULL || !bufptr->next_wait_thrd->wait_for_latch_promote);
    cur_thrd_entry->next_wait_thrd = bufptr->next_wait_thrd;   /* head insert */
    bufptr->next_wait_thrd = cur_thrd_entry;
  }
else
  {
    cur_thrd_entry->next_wait_thrd = NULL;
    /* ... walk to tail, link ... */
  }
```

The as_promote flag is the caller’s choice, and it splits two distinct
callers:

- Head insert (as_promote = true) is used only by pgbuf_promote_read_latch_debug (the explicit pgbuf_promote_read_latch path): a promoter that already released its read latch must win the race against fresh waiters, so it jumps the queue. The assert enforces at most one promoter in the queue.
- Tail insert (as_promote = false, FIFO) is used by every other caller, including the promote_needed branch of pgbuf_latch_bcb_upon_fix (5.3), which calls pgbuf_block_bcb(..., false). So a promotion discovered during a fix enqueues at the tail, not the head; only the dedicated promote API head-inserts. The async-flush path (Chapter 8) also tail-inserts with PGBUF_LATCH_FLUSH.
Then it sleeps by mode:
- PGBUF_LATCH_FLUSH (flush-waiter, Chapter 8) sleeps infinitely via thread_suspend_wakeup_and_unlock_entry; on a non-RESUMED wake (interrupt) it re-locks the BCB, unlinks itself from next_wait_thrd, returns ER_FAILED.
- R/W waiters go through pgbuf_timed_sleep. CUBRID builds no wait-for graph across page latches; it relies on timeout: “When the request is waken up by timeout, the request is treated as a victim.” On a successful return the function sets bufptr->latch_last_thread = thread_p.
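The head-versus-tail rule can be sketched on a plain singly linked list. This is an illustrative model only: `sk2_waiter`, `sk2_enqueue`, and the `is_promoter` field are hypothetical names standing in for the thread entry, `pgbuf_block_bcb`, and `wait_for_latch_promote`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical waiter node; the real queue links thread entries via next_wait_thrd. */
struct sk2_waiter
{
  int id;
  bool is_promoter;
  struct sk2_waiter *next;
};

/* Enqueue at the head when as_promote is set, otherwise at the tail (FIFO). */
static void
sk2_enqueue (struct sk2_waiter **head, struct sk2_waiter *w, bool as_promote)
{
  if (as_promote)
    {
      /* At most one promoter may sit in the queue, and it sits at the head. */
      assert (*head == NULL || !(*head)->is_promoter);
      w->is_promoter = true;
      w->next = *head;
      *head = w;
      return;
    }

  w->is_promoter = false;
  w->next = NULL;

  /* Walk to the link that currently terminates the list and append there. */
  struct sk2_waiter **link = head;
  while (*link != NULL)
    {
      link = &(*link)->next;
    }
  *link = w;
}
```

Enqueuing two ordinary waiters and then one promoter leaves the promoter first, followed by the two ordinary waiters in arrival order.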
5.5 pgbuf_timed_sleep and pgbuf_timed_sleep_error_handling
pgbuf_timed_sleep locks the thread entry, then drops the BCB mutex (ordering:
thread entry inside BCB), computes the timeout, and suspends:
```c
// pgbuf_timed_sleep -- src/storage/page_buffer.c
thread_lock_entry (thread_p);
PGBUF_BCB_UNLOCK (bufptr);
old_wait_msecs = wait_secs = pgbuf_find_current_wait_msecs (thread_p);
/* LK_ZERO_WAIT/LK_FORCE_ZERO_WAIT -> 0, else wait_secs = pgbuf_latch_timeout */
try_again:
  to.tv_sec = (int) time (NULL) + wait_secs;
  thread_p->resume_status = THREAD_PGBUF_SUSPENDED;
  r = thread_suspend_timeout_wakeup_and_unlock_entry (thread_p, &to, THREAD_PGBUF_SUSPENDED);
```

pgbuf_latch_timeout defaults to 300 * 1000, reset from
PRM_ID_PAGE_LATCH_TIMEOUT at boot. Three return branches:

- NO_ERROR (signalled). Re-lock the entry. If resume_status == THREAD_PGBUF_RESUMED a waker granted the latch (5.6), so return NO_ERROR (the latch is already ours, fcnt bumped by the waker). Else an interrupt: set request_latch_mode = PGBUF_NO_LATCH, call the error handler, raise ER_INTERRUPTED, return ER_FAILED.
- ER_CSS_PTHREAD_COND_TIMEDOUT (timed out). If RESUMED in the race, return NO_ERROR. If the txn is no longer active (logtb_is_current_active false) loop to try_again (don’t time out a committing/aborting txn). Else a page-latch deadlock victim: save the mode, set request_latch_mode = PGBUF_NO_LATCH (the marker the waker uses to skip us), run the error handler, goto er_set_return.
- else (pthread_cond failure): er_set_with_oserror (ER_CSS_PTHREAD_COND_TIMEDWAIT), return ER_FAILED.
er_set_return formats by the original wait spec, then releases the BCB mutex
and returns ER_FAILED:
- LK_INFINITE_WAIT: ER_PAGE_LATCH_TIMEDOUT then ER_LK_UNILATERALLY_ABORTED (guarded by an assert (0) marked FIXME).
- positive old_wait_msecs: ER_PAGE_LATCH_TIMEDOUT + ER_LK_PAGE_TIMEOUT (the latter reports save_request_latch_mode).
- otherwise: just unlock.
pgbuf_timed_sleep_error_handling runs when a waiter abandons the queue,
re-locks the BCB, and unlinks the thread in three cases:
flowchart TB
L["PGBUF_BCB_LOCK"] --> E{"next_wait_thrd == NULL ?"}
E -- yes --> R0["case 1: already removed by waker, return"]
E -- no --> F{"head == thrd_entry ?"}
F -- no --> M["case 2: walk list, unlink thrd_entry, return"]
F -- yes --> H["case 3: pop head\nthen wake consecutive READ waiters\nuntil non-grantable or WRITE"]
Figure 5-3 — Three removal cases. Only case 3 (the abandoning thread at the
head) must repair the queue by waking the readers it was shadowing. In case 3
it pops the head, then loops: for each follower, if the page latch_mode == READ and the waiter wants READ it CASes (compare_exchange_weak)
fcnt += request_fix_count, locks the entry, unlinks it, and wakes it
(pgbuf_wakeup); a WRITE waiter or non-grantable state breaks the loop.
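The three removal cases can be sketched as follows. This is a simplified model under stated assumptions: `sk3_waiter` and `sk3_abandon` are hypothetical names, the page state is reduced to one boolean, and "waking" a reader is modelled as unlinking it and counting it, with no CAS on `fcnt` and no locking.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical waiter node for illustration only. */
struct sk3_waiter
{
  bool wants_read;
  struct sk3_waiter *next;
};

/* Remove `me` from the queue. Returns the case number taken (1, 2 or 3);
 * *woken receives the number of shadowed readers released in case 3. */
static int
sk3_abandon (struct sk3_waiter **head, struct sk3_waiter *me,
             bool page_read_latched, int *woken)
{
  *woken = 0;

  if (*head == NULL)
    {
      return 1;                 /* case 1: a waker already emptied the queue */
    }

  if (*head != me)
    {                           /* case 2: unlink from the middle, if present */
      for (struct sk3_waiter **link = head; *link != NULL; link = &(*link)->next)
        {
          if (*link == me)
            {
              *link = me->next;
              me->next = NULL;
              break;
            }
        }
      return 2;
    }

  /* case 3: pop the head, then release the readers it was shadowing */
  *head = me->next;
  me->next = NULL;
  while (*head != NULL && page_read_latched && (*head)->wants_read)
    {
      struct sk3_waiter *reader = *head;
      *head = reader->next;
      reader->next = NULL;
      (*woken)++;
    }
  return 3;
}
```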
5.6 pgbuf_wakeup_reader_writer — ordered wake on unlatch
When unfix drops fcnt to 0 and resets the mode to PGBUF_NO_LATCH (both
asserted on entry), this function walks next_wait_thrd once and grants what it
can. The caller holds the BCB mutex.
```c
// pgbuf_wakeup_reader_writer -- src/storage/page_buffer.c
for (thrd_entry = bufptr->next_wait_thrd; thrd_entry != NULL; thrd_entry = next_thrd_entry)
  {
    next_thrd_entry = thrd_entry->next_wait_thrd;
    if (thrd_entry->request_latch_mode == PGBUF_NO_LATCH)
      {
        /* unlink, continue -- corpse */
      }
    if (thrd_entry->request_latch_mode == PGBUF_LATCH_FLUSH)
      {
        assert (pgbuf_bcb_is_async_flush_request (bufptr) || pgbuf_bcb_is_flushing (bufptr));
        prev_thrd_entry = thrd_entry;
        continue;               /* skip -- leave in list, do NOT wake */
      }
    /* ... R/W grant via compare_exchange_strong loop ... */
  }
```

Branch by branch:

- PGBUF_NO_LATCH waiter: a thread that gave up (timed out / interrupted, not yet self-removed). Unlink and continue (“clean a timed-out waiter”).
- PGBUF_LATCH_FLUSH waiter: not a latch holder; flush wakes it separately. Advance prev_thrd_entry and continue, leaving it in place (the “skip the FLUSH waiter” rule). Advancing prev rather than unlinking keeps the FLUSH entry queued and followers reachable behind it.
- R/W waiter: enter the inner CAS loop (compare_exchange_strong) on a fresh impl:
  - latch_mode == NO_LATCH, or (latch_mode == READ and waiter wants READ): grantable. Lock the thread entry; re-check request_latch_mode == PGBUF_NO_LATCH (it may have timed out between the outer test and the lock). If so unlink, can_grant = false, break. Else can_grant = true, add request_fix_count to fcnt, set latch_mode to the waiter’s mode.
  - Else if latch_mode == READ (page already R-granted this pass, this waiter wants WRITE): set prev_thrd_entry = thrd_entry and break only the inner CAS loop. should_stop stays false. The writer is left in place and skipped; the outer walk continues to look for more READ waiters behind it (“Look for other readers.”). The walk truly stops only at the WRITE arm.
  - Else (latch_mode == WRITE): should_stop = true, break.

  After the inner loop, should_stop breaks the outer loop; otherwise can_grant unlinks the waiter and pgbuf_wakeups it.
Net effect (matching the header comment): each READ grant leaves
latch_mode == READ, so all consecutive READ waiters at the head are woken; a
WRITE waiter met while the page is R-granted is skipped and scanning continues
behind it; only a held WRITE latch (the WRITE arm via should_stop) ends the
pass, so at most one writer is granted.
Finally the function recomputes the hint:
```c
// pgbuf_wakeup_reader_writer -- src/storage/page_buffer.c
if (!pgbuf_is_exist_blocked_reader_writer (bufptr))
  set_waiter_exists (&bufptr->atomic_latch, false);   /* clear the guard */
```

Invariant: waiter_exists is true iff an R/W waiter is queued. pgbuf_is_exist_blocked_reader_writer walks next_wait_thrd and counts only PGBUF_LATCH_READ / PGBUF_LATCH_WRITE entries (FLUSH and NO_LATCH don’t count). A stale bit after the last R/W waiter was woken would make Case 1 of pgbuf_latch_bcb_upon_fix block fresh readers forever (phantom starvation).
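The single-pass wake policy can be simulated on an array instead of a linked list. This is a sketch under assumptions: the `sk4_` types are hypothetical, the queue is an array with an `unlinked` marker, and the CAS loop, thread-entry locks, and the timed-out re-check are omitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical modes and records for illustration only. */
enum sk4_mode { SK4_NO_LATCH, SK4_READ, SK4_WRITE, SK4_FLUSH };

struct sk4_waiter
{
  enum sk4_mode request;
  unsigned fix_count;
  bool woken;
  bool unlinked;
};

struct sk4_page
{
  enum sk4_mode mode;
  unsigned fcnt;
};

/* One pass over the queue; returns how many waiters were granted the latch. */
static int
sk4_wakeup_pass (struct sk4_page *pg, struct sk4_waiter *q, size_t n)
{
  int granted = 0;

  for (size_t i = 0; i < n; i++)
    {
      struct sk4_waiter *w = &q[i];

      if (w->unlinked)
        continue;
      if (w->request == SK4_NO_LATCH)
        {                       /* corpse: clean it up and move on */
          w->unlinked = true;
          continue;
        }
      if (w->request == SK4_FLUSH)
        continue;               /* flush waiter: leave queued, do not wake */

      if (pg->mode == SK4_NO_LATCH || (pg->mode == SK4_READ && w->request == SK4_READ))
        {                       /* grantable: credit the fix count, adopt the mode */
          pg->mode = w->request;
          pg->fcnt += w->fix_count;
          w->woken = true;
          w->unlinked = true;
          granted++;
        }
      else if (pg->mode == SK4_READ)
        continue;               /* writer behind granted readers: skip, keep scanning */
      else
        break;                  /* page is WRITE-latched: stop the walk */
    }
  return granted;
}
```

Starting from an idle page, a queue of reader, writer, reader grants both readers and leaves the writer queued; a queue that starts with a writer grants only that writer.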
5.7 Chapter summary — key takeaways
1. The latch is one uint64_t (pgbuf_atomic_latch_impl): latch_mode | waiter_exists | fcnt (16 + 16 + 32 bits). Single-word CASes are the only thing that makes the mode/count triple tear-free under concurrency.
2. pgbuf_latch_bcb_upon_fix is a branch-complete tree retried under one compare_exchange_strong: idle short-circuit; R-on-R (grant unless a waiter exists and the caller is not a holder, the writer-starvation guard); W-holder passthrough (sub-2-1); sole-reader in-place promotion (sub-2-2); and block paths for strangers (Case 3) and contended promotions.
3. The sole-reader in-place R-to-W flip (sub-2-2) is live and reachable; the source-anchored promotion API is pgbuf_promote_read_latch_debug plus the one-promoter assert in pgbuf_block_bcb. Any deprecation of a contended in-place upgrade is external project history, not visible in this source.
4. Holders (pgbuf_holder) record this thread’s per-BCB fix_count; comparing it against the latch word’s fcnt decides “am I the only fixer?”. allocate/find/remove keep a holder on exactly one of the free or hold lists, never both (the next_holder == NULL assert).
5. Blocking is timed, not graph-based: pgbuf_block_bcb enqueues on next_wait_thrd, at the tail for ordinary waiters and the promote_needed branch of the fix path, at the head only for the dedicated pgbuf_promote_read_latch_debug API (as_promote). pgbuf_timed_sleep waits pgbuf_latch_timeout; a timeout makes the waiter a deadlock victim and pgbuf_timed_sleep_error_handling removes it (repairing the head case by waking shadowed readers).
6. pgbuf_wakeup_reader_writer walks the queue once: clean NO_LATCH corpses, skip FLUSH waiters, wake all consecutive READ waiters, grant at most one WRITE. A WRITE waiter met mid-pass while readers are granted does not stop further reader grants; only a held WRITE latch (should_stop) ends the walk.
7. waiter_exists is a hint recomputed exactly by pgbuf_is_exist_blocked_reader_writer after every wake, so a stale bit cannot phantom-starve fresh readers.
Chapter 6: Dirtying a Page and the Packed Flags and Zone Word
A modification to a page is three small acts under the BCB’s write latch: stamp the page image with the redo LSA, mark the BCB dirty, and (the first time only) record its oldest unflushed LSA. None takes a mutex: the dirty bit lives in one 32-bit word, bcb->flags, which is mutated with a lock-free compare-and-swap retry loop, alongside one separately packed counter word. This chapter dissects that accessor layer, which every later chapter (unfix in Ch 7, flush under WAL in Ch 8, victim selection in Ch 9) reads or mutates, so its invariants are load-bearing. For the why (the WAL contract, the checkpoint horizon) see the high-level companion; this chapter does not re-derive that theory.
6.1 One word, three fields: flags, zone, lru index
PGBUF_BCB::flags is volatile int. The source’s own comment fixes the layout exactly (Figure 6-1): “(bcb flags + zone = 2 bytes) + (lru index = 2 bytes)”. So the 32-bit word splits at the half: the low 16 bits are the LRU list index (PGBUF_LRU_INDEX_MASK = 0x0000FFFF, since PGBUF_LRU_NBITS = 16), and the high 16 bits carry the flag bits and the zone selector together.

```c
// pgbuf_bcb (struct) -- src/storage/page_buffer.c
  volatile int flags;                         /* <- packed: flag bits + zone + lru index */
  // ... condensed ...
  volatile int count_fix_and_avoid_dealloc;   /* <- a SECOND packed word, see 6.8 */
  LOG_LSA oldest_unflush_lsa;                 /* <- WAL watermark, established once per dirty cycle */
```

Within that high half the two namespaces are bit-disjoint: the seven flag bits occupy the very top (0x80000000..0x02000000, bits 25-31), while the zone enum sits just above the index, at bits 16-19. Zone values are shifts of PGBUF_LRU_NBITS: the LRU sub-zones are 1<<16, 2<<16, 3<<16; the non-LRU zones skip two further bits (PGBUF_LRU_NBITS + 2 = 18) so they cannot collide with the LRU mask: PGBUF_INVALID_ZONE = 1<<18, PGBUF_VOID_ZONE = 2<<18. Because PGBUF_BCB_FLAGS_MASK and PGBUF_ZONE_MASK | PGBUF_LRU_INDEX_MASK never share a bit, pgbuf_bcb_update_flags can touch only flag bits and pgbuf_bcb_change_zone only zone+index, each preserving the other, both via CAS on the same word.
A BCB is born with PGBUF_BCB_INIT_FLAGS = PGBUF_INVALID_ZONE: no flag bits set, zone = INVALID, index 0. That is the start state of Figure 6-3.
The complete zone catalogue:
| Zone value | Bits | Numeric | Meaning | Set by (zone moves go through pgbuf_bcb_change_zone) |
|---|---|---|---|---|
| PGBUF_INVALID_ZONE | 1<<18 | 0x00040000 | free/uninitialized BCB (on invalid list) | PGBUF_BCB_INIT_FLAGS; reset on free |
| PGBUF_VOID_ZONE | 2<<18 | 0x00080000 | transient: read from disk before list insert, or removed from list before victimizing | read-from-disk path; victim extraction |
| PGBUF_LRU_1_ZONE | 1<<16 | 0x00010000 | hottest LRU sub-zone; never victimized | unfix/boost into zone 1 |
| PGBUF_LRU_2_ZONE | 2<<16 | 0x00020000 | buffer sub-zone between hot and victim; never victimized | LRU zone adjustment |
| PGBUF_LRU_3_ZONE | 3<<16 | 0x00030000 | victimization sub-zone; only zone with eligible candidates | LRU zone adjustment / fall from zone 2 |
Three masks decode it: PGBUF_LRU_ZONE_MASK (= PGBUF_LRU_1_ZONE | PGBUF_LRU_2_ZONE | PGBUF_LRU_3_ZONE = 0x00030000) covers bits 16-17, the two bits that encode the three LRU sub-zones; PGBUF_ZONE_MASK (= PGBUF_LRU_ZONE_MASK | PGBUF_INVALID_ZONE | PGBUF_VOID_ZONE = 0x000F0000) covers every zone; PGBUF_LRU_INDEX_MASK carries the low-16 list index. PGBUF_GET_ZONE(flags) is (PGBUF_ZONE)(flags & PGBUF_ZONE_MASK).
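A worked decode makes the layout concrete. The sketch below mirrors the numeric values from the table; the `SK5_`/`sk5_` names are hypothetical, and composing zone and index with a plain OR is an assumption about what PGBUF_MAKE_ZONE does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Values copied from the zone table; names are illustrative stand-ins. */
#define SK5_LRU_NBITS       16
#define SK5_LRU_INDEX_MASK  0x0000FFFFu
#define SK5_LRU_1_ZONE      (1u << SK5_LRU_NBITS)         /* 0x00010000 */
#define SK5_LRU_2_ZONE      (2u << SK5_LRU_NBITS)         /* 0x00020000 */
#define SK5_LRU_3_ZONE      (3u << SK5_LRU_NBITS)         /* 0x00030000 */
#define SK5_INVALID_ZONE    (1u << (SK5_LRU_NBITS + 2))   /* 0x00040000 */
#define SK5_VOID_ZONE       (2u << (SK5_LRU_NBITS + 2))   /* 0x00080000 */
#define SK5_LRU_ZONE_MASK   (SK5_LRU_1_ZONE | SK5_LRU_2_ZONE | SK5_LRU_3_ZONE)
#define SK5_ZONE_MASK       (SK5_LRU_ZONE_MASK | SK5_INVALID_ZONE | SK5_VOID_ZONE)

/* Assumed composition: zone bits ORed with the list index. */
static uint32_t sk5_make_zone (uint32_t lru_idx, uint32_t zone) { return zone | lru_idx; }

static uint32_t sk5_get_zone (uint32_t flags) { return flags & SK5_ZONE_MASK; }
static uint32_t sk5_get_lru_index (uint32_t flags) { return flags & SK5_LRU_INDEX_MASK; }

/* Mask test: true for zones 1/2/3, false for INVALID and VOID. */
static bool sk5_is_in_lru (uint32_t flags) { return (sk5_get_zone (flags) & SK5_LRU_ZONE_MASK) != 0; }

/* Exact-equality test: only zone 3 is victim-eligible. */
static bool sk5_is_in_victim_zone (uint32_t flags) { return sk5_get_zone (flags) == SK5_LRU_3_ZONE; }
```

For example, a dirty BCB on list 5 in zone 3 has the word `0x80000000 | 0x00030000 | 5 = 0x80030005`; masking recovers zone `0x00030000` and index 5 regardless of the flag bit.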
6.2 The flag catalogue: every bit, producer, and consumer
Seven flags live in the high bits of flags. The composite PGBUF_BCB_INVALID_VICTIM_CANDIDATE_MASK is the OR of the first four (the states that disqualify a BCB from being victimized); “Blocks victim?” marks membership in that mask. The table carries producer, clearer, and reader per flag.
| Flag | Bit | Producer | Cleared by | Reader | Blocks victim? |
|---|---|---|---|---|---|
| PGBUF_BCB_DIRTY_FLAG | 0x80000000 | pgbuf_bcb_set_dirty, _update_flags | _clear_dirty, _mark_is_flushing | pgbuf_bcb_is_dirty | yes |
| PGBUF_BCB_FLUSHING_TO_DISK_FLAG | 0x40000000 | pgbuf_bcb_mark_is_flushing | _mark_was_flushed / _was_not_flushed | pgbuf_bcb_is_flushing | yes |
| PGBUF_BCB_VICTIM_DIRECT_FLAG | 0x20000000 | direct-victim hand-off (Ch 9) | replaced by INVALIDATE | pgbuf_bcb_is_direct_victim | yes |
| PGBUF_BCB_INVALIDATE_DIRECT_VICTIM_FLAG | 0x10000000 | fixer grabbing a direct victim (Ch 4/5) | when the waiter re-requests | pgbuf_bcb_is_invalid_direct_victim | yes |
| PGBUF_BCB_MOVE_TO_LRU_BOTTOM_FLAG | 0x08000000 | dealloc path | unfix that moves it (Ch 7) | pgbuf_bcb_should_be_moved_to_bottom_lru | no |
| PGBUF_BCB_TO_VACUUM_FLAG | 0x04000000 | pgbuf_notify_vacuum_follows | vacuum routing | pgbuf_bcb_is_to_vacuum | no |
| PGBUF_BCB_ASYNC_FLUSH_REQ | 0x02000000 | async flush requesters | pgbuf_bcb_mark_is_flushing | pgbuf_bcb_is_async_flush_request | no |
mark_is_flushing (when the page is dirty) clears both DIRTY — the flush captured the image — and ASYNC_FLUSH_REQ — the request is now in flight — while setting FLUSHING: one transition swaps three bits.
```c
// PGBUF_BCB_INVALID_VICTIM_CANDIDATE_MASK -- src/storage/page_buffer.c
#define PGBUF_BCB_INVALID_VICTIM_CANDIDATE_MASK \
  (PGBUF_BCB_DIRTY_FLAG \
   | PGBUF_BCB_FLUSHING_TO_DISK_FLAG \
   | PGBUF_BCB_VICTIM_DIRECT_FLAG \
   | PGBUF_BCB_INVALIDATE_DIRECT_VICTIM_FLAG)   /* <- the 4 disqualifiers; the other 3 flags are victim-neutral */
```

Invariant: the victim-candidate count tracks exactly the BCBs in LRU zone 3 with none of the four disqualifier bits. Any transition adding/removing a disqualifier bit while the BCB sits in zone 3 must symmetrically add/remove it from candidacy, enforced in pgbuf_bcb_update_flags, pgbuf_bcb_change_zone, and the pgbuf_bcb_set_dirty fast path. Omit it in any one and the LRU victim counter drifts, so victimizers skip valid candidates or chase phantoms (Ch 9). pgbuf_bcb_avoid_victim is the read-side query of the same mask.
6.3 The shared CAS-loop shape: pgbuf_bcb_update_flags and pgbuf_bcb_change_zone
Both share one lock-free skeleton (read bcb->flags, compute the new word, CAS, retry), differing only in which half they recompute and what they reconcile afterward.
pgbuf_bcb_update_flags is the general flag mutator: set some bits, clear others, preserving zone and unnamed flags. Every flag transition except the dirty fast path goes through it.
```c
// pgbuf_bcb_update_flags -- src/storage/page_buffer.c
  assert ((set_flags & (~PGBUF_BCB_FLAGS_MASK)) == 0);     /* <- callers may only touch flag bits ... */
  assert ((clear_flags & (~PGBUF_BCB_FLAGS_MASK)) == 0);   /* <- ... never zone/index bits */
  do
    {
      old_flags = bcb->flags;
      new_flags = old_flags | set_flags;
      new_flags = new_flags & (~clear_flags);
      if (old_flags == new_flags)
        return;   /* <- no-op: bits already as desired, skip CAS + bookkeeping (contention saver) */
    }
  while (!ATOMIC_CAS_32 (&bcb->flags, old_flags, new_flags));
```

pgbuf_bcb_change_zone does the opposite: same loop, but the assignment recomputes zone+index, new_flags = (old_flags & PGBUF_BCB_FLAGS_MASK) | new_zone_idx; where new_zone_idx = PGBUF_MAKE_ZONE (new_lru_idx, new_zone). It preserves all flag bits and has no early no-op return.
After the CAS the two diverge. update_flags runs two fix-ups (Figure 6-2): a zone-3 victim-candidacy adjustment (only when PGBUF_GET_ZONE (old_flags) == PGBUF_LRU_3_ZONE — read from old_flags since the zone never changes here), and a dirties_cnt adjustment keyed on whether DIRTY toggled, closed by an assert pinning 0 <= dirties_cnt <= num_buffers.
flowchart TD
A["enter: set_flags, clear_flags — Figure 6-2"] --> B["old = bcb->flags<br/>new = (old | set) & ~clear"]
B --> C{"old == new?"}
C -->|yes| R["return (no-op)"]
C -->|no| D{"CAS(flags, old, new)?"}
D -->|fail| B
D -->|ok| E{"zone(old) == LRU_3?"}
E -->|yes| F{"victim candidacy changed?"}
F -->|"became valid"| G["lru_add_victim_candidate"]
F -->|"became invalid"| H["lru_remove_victim_candidate"]
F -->|"no change"| I
E -->|no| I["dirty bit toggled?"]
G --> I
H --> I
I -->|"set->clear"| J["dirties_cnt -= 1"]
I -->|"clear->set"| K["dirties_cnt += 1"]
I -->|"unchanged"| L["assert range; done"]
J --> L
K --> L
change_zone reconciles per-list zone counters and victim candidacy. Zone moves run under the LRU list mutex, so count_lru1/2/3 are plain increments, not atomics — only the CAS on flags is lock-free, since a concurrent pgbuf_set_dirty may flip a flag bit on the same word with no mutex. Branch map:
- is_valid_victim_candidate = (old_flags & INVALID_VICTIM_CANDIDATE_MASK) == 0: a flags property, unchanged by the move, so it holds on both sides.
- Leaving (old_flags & PGBUF_LRU_ZONE_MASK): decrement lru_shared_pgs_cnt if the old list was shared; switch on old zone decrementing the right count_lruN; in zone 3 and valid-candidate, pgbuf_lru_remove_victim_candidate; default: assert(false).
- Entering (new_zone & PGBUF_LRU_ZONE_MASK): symmetric increments, with pgbuf_lru_add_victim_candidate in zone 3 for a valid candidate; default: assert(false).
The default: assert(false) arms encode a totality invariant: an LRU-zone BCB is in exactly one of zones 1/2/3 (the zone field is a single value, not a mask of memberships). A second assert guards hint coherence: lru_list->victim_hint != bcb || zone(old) != LRU_3 — the hint must already have been retargeted before a zone-3 BCB leaves, unless checkpoint (via update_flags) is concurrently retargeting it.
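The way the two mutators share one word without clobbering each other can be sketched with C11 atomics. This is an illustration only: the `sk6_` names are hypothetical, the flags mask value `0xFE000000` is derived here from the seven flag bits listed in 6.2 rather than quoted from the source, and all counter and candidacy bookkeeping is omitted.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define SK6_FLAGS_MASK  0xFE000000u   /* assumed: OR of the seven flag bits */
#define SK6_DIRTY_FLAG  0x80000000u

/* Set/clear flag bits only. Returns false on the no-op early return. */
static bool
sk6_update_flags (_Atomic uint32_t *flags, uint32_t set_flags, uint32_t clear_flags)
{
  assert ((set_flags & ~SK6_FLAGS_MASK) == 0);
  assert ((clear_flags & ~SK6_FLAGS_MASK) == 0);

  uint32_t old_flags = atomic_load (flags);
  uint32_t new_flags;
  do
    {
      new_flags = (old_flags | set_flags) & ~clear_flags;
      if (new_flags == old_flags)
        return false;           /* bits already as desired: skip the CAS */
    }
  while (!atomic_compare_exchange_weak (flags, &old_flags, new_flags));
  return true;
}

/* Replace zone + LRU index only, preserving every flag bit. */
static void
sk6_change_zone (_Atomic uint32_t *flags, uint32_t new_zone_idx)
{
  assert ((new_zone_idx & SK6_FLAGS_MASK) == 0);

  uint32_t old_flags = atomic_load (flags);
  uint32_t new_flags;
  do
    {
      new_flags = (old_flags & SK6_FLAGS_MASK) | new_zone_idx;
    }
  while (!atomic_compare_exchange_weak (flags, &old_flags, new_flags));
}
```

Starting from `0x00040000` (INVALID zone, no flags), setting DIRTY yields `0x80040000`; moving to zone 3 on list 7 then yields `0x80030007`, with the DIRTY bit carried across the move.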
stateDiagram-v2
[*] --> INVALID: init flags = PGBUF_INVALID_ZONE
INVALID --> VOID: read from disk
VOID --> LRU: unfix inserts into list
LRU --> LRU: adjust zones 1 to 2 to 3
LRU --> VOID: selected as victim
VOID --> [*]: reused for new page
note right of LRU
only LRU_3 sub-zone
is victim-eligible
end note
Figure 6-3: zone transitions driven by pgbuf_bcb_change_zone. The flag namespace rides along untouched on every edge.
6.4 pgbuf_bcb_get_zone and the decode macros
pgbuf_bcb_get_zone is a pure decode: it masks the word and returns the zone enum:

```c
// pgbuf_bcb_get_zone -- src/storage/page_buffer.c
STATIC_INLINE PGBUF_ZONE
pgbuf_bcb_get_zone (const PGBUF_BCB * bcb)
{
  return PGBUF_GET_ZONE (bcb->flags);   /* <- (flags & PGBUF_ZONE_MASK) */
}
```

Two macros build on it to answer the two questions later chapters ask most:

```c
// PGBUF_IS_BCB_IN_LRU* -- src/storage/page_buffer.c
#define PGBUF_IS_BCB_IN_LRU_VICTIM_ZONE(bcb) (pgbuf_bcb_get_zone (bcb) == PGBUF_LRU_3_ZONE)
#define PGBUF_IS_BCB_IN_LRU(bcb) ((pgbuf_bcb_get_zone (bcb) & PGBUF_LRU_ZONE_MASK) != 0)
```

PGBUF_IS_BCB_IN_LRU_VICTIM_ZONE is exact-equality (only zone 3 is victim-eligible); PGBUF_IS_BCB_IN_LRU is a mask test: PGBUF_LRU_ZONE_MASK covers the bits of all three LRU sub-zones, so zones 1/2/3 match but VOID (2<<18) and INVALID (1<<18) do not, since their bits fall outside the mask. pgbuf_bcb_get_lru_index asserts PGBUF_IS_BCB_IN_LRU before returning the low-16 index.
6.5 Setting dirty: three entry points, one fast path
A modifier reaches dirty through a tiny call chain. The public pgbuf_set_dirty recovers the BCB via CAST_PGPTR_TO_BFPTR, validates vpid (debug only), delegates to pgbuf_set_dirty_buffer_ptr, then unfixes only if the caller passed free_page == FREE. pgbuf_set_dirty_buffer_ptr is the latch/perf layer over the real mutator:

```c
// pgbuf_set_dirty_buffer_ptr -- src/storage/page_buffer.c
  pgbuf_bcb_set_dirty (thread_p, bufptr);
  holder = pgbuf_find_thrd_holder (thread_p, bufptr);
  assert (get_latch (&bufptr->atomic_latch) == PGBUF_LATCH_WRITE);   /* <- dirtier MUST hold the write latch */
  assert (holder != NULL);
  // ... condensed: mark holder->perf_stat.dirtied_by_holder, perfmon PSTAT_PB_NUM_DIRTIES ...
```

Invariant: a page is only dirtied while its setter holds the BCB write latch. The assert (get_latch (...) == PGBUF_LATCH_WRITE) enforces it, serializing concurrent writers (Ch 5) so the DIRTY/LSA pair stays consistent though each is mutated lock-free. The CAS in pgbuf_bcb_set_dirty defends only against other threads racing on unrelated bits of the same word (a no-latch change_zone), not against two writers.

pgbuf_bcb_set_dirty is a hand-coded fast path that bypasses update_flags because dirtying is the hottest case (the source comment says so explicitly):

```c
// pgbuf_bcb_set_dirty -- src/storage/page_buffer.c
  do
    {
      old_flags = bcb->flags;
      if (old_flags & PGBUF_BCB_DIRTY_FLAG)
        return;   /* <- already dirty: skip CAS + counter (common case) */
    }
  while (!ATOMIC_CAS_32 (&bcb->flags, old_flags, old_flags | PGBUF_BCB_DIRTY_FLAG));
  ATOMIC_INC_64 (&pgbuf_Pool.monitor.dirties_cnt, 1);   /* <- dirties_cnt += 1; assert range follows */
  if (PGBUF_GET_ZONE (old_flags) == PGBUF_LRU_3_ZONE
      && (old_flags & PGBUF_BCB_INVALID_VICTIM_CANDIDATE_MASK) == 0)
    pgbuf_lru_remove_victim_candidate (thread_p, pgbuf_lru_list_from_bcb (bcb), bcb);   /* <- newly dirty -> drop candidacy */
```

Branch map: (1) already-dirty → early return; (2) CAS sets the bit; (3) dirties_cnt += 1 then assert range; (4) if the BCB was a valid zone-3 candidate before DIRTY (the test reads old_flags), the new bit disqualifies it, so remove it (the §6.2 invariant inlined for speed). Note this fast path only ever sets DIRTY, so unlike update_flags it never needs the dirty-cleared branch or the add-candidate branch.
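The four-step branch map can be modelled in isolation. This is a sketch under assumptions: `sk7_pool` and its two counters are hypothetical stand-ins for the pool monitor and the per-list victim-candidate bookkeeping, and the mask values are taken from the tables in 6.1 and 6.2.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define SK7_DIRTY_FLAG           0x80000000u
#define SK7_INVALID_VICTIM_MASK  0xF0000000u   /* the four disqualifier bits */
#define SK7_ZONE_MASK            0x000F0000u
#define SK7_LRU_3_ZONE           0x00030000u

/* Hypothetical counters standing in for the real monitor and LRU list state. */
struct sk7_pool
{
  _Atomic long dirties_cnt;
  _Atomic long victim_candidates;
};

/* Returns true only when this call flipped the page from clean to dirty. */
static bool
sk7_set_dirty (_Atomic uint32_t *flags, struct sk7_pool *pool)
{
  uint32_t old_flags = atomic_load (flags);
  do
    {
      if (old_flags & SK7_DIRTY_FLAG)
        return false;           /* (1) already dirty: no CAS, no counter */
    }
  while (!atomic_compare_exchange_weak (flags, &old_flags, old_flags | SK7_DIRTY_FLAG));   /* (2) */

  atomic_fetch_add (&pool->dirties_cnt, 1);   /* (3) */

  /* (4) the test reads old_flags: was it a valid zone-3 candidate before DIRTY? */
  if ((old_flags & SK7_ZONE_MASK) == SK7_LRU_3_ZONE
      && (old_flags & SK7_INVALID_VICTIM_MASK) == 0)
    {
      atomic_fetch_sub (&pool->victim_candidates, 1);
    }
  return true;
}
```

Dirtying a page twice changes both counters once: the second call takes the early return before touching anything.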
6.6 Recording the oldest unflushed LSA — pgbuf_set_lsa
pgbuf_set_lsa (log/recovery manager only) stamps the redo LSA and establishes oldest_unflush_lsa once per dirty cycle, with special branches for temporary and auxiliary volumes:

```c
// pgbuf_set_lsa -- src/storage/page_buffer.c
  // ... condensed: debug-gated page-pointer validation may return NULL; assert (lsa_ptr != NULL) ...
  if (pgbuf_is_temp_lsa (bufptr->iopage_buffer->iopage.prv.lsa)
      || PGBUF_IS_AUXILIARY_VOLUME (bufptr->vpid.volid) == true)
    return NULL;   /* <- branch 2: temp/aux pages are never WAL-tracked: bail */
  if (pgbuf_is_temporary_volume (bufptr->vpid.volid) == true)
    {
      pgbuf_init_temp_page_lsa (&bufptr->iopage_buffer->iopage, IO_PAGESIZE);   /* <- branch 3: force sentinel temp LSA */
      if (logtb_is_current_active (thread_p))
        return NULL;   /* <- active txn on temp page carries no real LSA */
    }
  fileio_set_page_lsa (&bufptr->iopage_buffer->iopage, lsa_ptr, IO_PAGESIZE);   /* <- branch 4: write redo LSA into image */
  if (LSA_ISNULL (&bufptr->oldest_unflush_lsa))   /* <- branch 5: FIRST dirty since last flush? */
    {
      if (LSA_LT (lsa_ptr, &log_Gl.chkpt_redo_lsa))
        {
          /* ... condensed: re-read chkpt_redo_lsa under chkpt_lsa_lock; if still older, raise ER_LOG_CHECKPOINT_SKIP_INVALID_PAGE + assert(false) ... */
        }
      LSA_COPY (&bufptr->oldest_unflush_lsa, lsa_ptr);   /* <- watermark established */
    }
  // ... condensed: branch 6, #if defined(NDEBUG) also calls pgbuf_set_dirty_buffer_ptr (safety net) ...
```

Two facts the comments compress: pgbuf_is_temp_lsa compares the stored LSA against sentinel PGBUF_TEMP_LSA = { NULL_LOG_PAGEID - 1, NULL_LOG_OFFSET - 1 } (i.e. (-2,-2)), and the watermark lives here, not in set-dirty, because pages can be dirtied before any LSA exists, so it anchors at the first LSA set. The release-build #if defined(NDEBUG) tail forcing pgbuf_set_dirty_buffer_ptr is a safety net for any missed set-dirty call, since an LSA was just written and must be flushed.
Invariant — oldest_unflush_lsa is the LSA of the earliest modification not yet on disk, set once on the first dirty after a clean state and cleared only on flush. The LSA_ISNULL guard makes later set_lsa calls in the same cycle leave it untouched; it never advances forward. Ch 8’s WAL rule reads it for log-flush ordering and resets it to NULL on a successful flush; checkpoint reads it to find the oldest dirty page.
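The once-per-cycle behaviour of the watermark can be sketched with a reduced model. The `sk8_` types are hypothetical, the null LSA is assumed here to be (-1, -1) with nullness tested on the page id, and the temp-volume, auxiliary-volume, and checkpoint-horizon branches are omitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical reduced LSA: (log page id, offset). */
struct sk8_lsa
{
  int64_t pageid;
  int16_t offset;
};

struct sk8_bcb
{
  struct sk8_lsa page_lsa;             /* redo LSA stamped into the page image */
  struct sk8_lsa oldest_unflush_lsa;   /* watermark for the current dirty cycle */
};

/* Assumed null representation: page id -1. */
static bool sk8_lsa_is_null (const struct sk8_lsa *lsa) { return lsa->pageid == -1; }

static void
sk8_lsa_set_null (struct sk8_lsa *lsa)
{
  lsa->pageid = -1;
  lsa->offset = -1;
}

/* Always stamp the page; establish the watermark only on the first call of a cycle. */
static void
sk8_set_lsa (struct sk8_bcb *bcb, const struct sk8_lsa *lsa)
{
  bcb->page_lsa = *lsa;
  if (sk8_lsa_is_null (&bcb->oldest_unflush_lsa))
    {
      bcb->oldest_unflush_lsa = *lsa;
    }
}

/* A successful flush ends the cycle and clears the watermark. */
static void
sk8_mark_flushed (struct sk8_bcb *bcb)
{
  sk8_lsa_set_null (&bcb->oldest_unflush_lsa);
}
```

Two updates at LSAs (10,0) and (12,8) leave the page stamped with (12,8) but the watermark at (10,0); after a flush, the next update starts a new cycle.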
6.7 The readers: pgbuf_bcb_is_dirty and pgbuf_bcb_avoid_victim
Both are single-mask predicates over the same word: no lock, just a volatile read:

```c
// pgbuf_bcb_is_dirty -- src/storage/page_buffer.c
STATIC_INLINE bool
pgbuf_bcb_is_dirty (const PGBUF_BCB * bcb)
{
  return (bcb->flags & PGBUF_BCB_DIRTY_FLAG) != 0;
}

// pgbuf_bcb_avoid_victim -- src/storage/page_buffer.c
STATIC_INLINE bool
pgbuf_bcb_avoid_victim (const PGBUF_BCB * bcb)
{
  return (bcb->flags & PGBUF_BCB_INVALID_VICTIM_CANDIDATE_MASK) != 0;   /* <- ANY of the 4 disqualifiers */
}
```

The relation is hierarchical: a dirty BCB always makes avoid_victim true (DIRTY ∈ the mask), but avoid_victim can also be true for a clean BCB that is mid-flush or an (invalidated) direct victim. Hence Ch 9’s victim scan calls avoid_victim, not is_dirty: dirtiness is only one of four ways to be ineligible. The sibling per-flag readers (pgbuf_bcb_is_flushing, _is_direct_victim, _is_invalid_direct_victim, _is_async_flush_request, _is_to_vacuum, _should_be_moved_to_bottom_lru) are each the same one-bit (flags & FLAG) != 0 test.
6.8 The dual-purpose counter — count_fix_and_avoid_dealloc
A separate volatile word, never overlapping flags, packs two 16-bit sub-counters into one 32-bit int so each can be mutated by a single atomic:
| Sub-field | Bits | Mask / shift | Mutators | Reader |
|---|---|---|---|---|
| avoid-dealloc count | low 16 | PGBUF_BCB_AVOID_DEALLOC_MASK = 0x0000FFFF | pgbuf_bcb_register_avoid_deallocation (+1), _unregister_ (-1, CAS) | pgbuf_bcb_should_avoid_deallocation |
| fix count | high 16 | << PGBUF_BCB_COUNT_FIX_SHIFT_BITS (16) | pgbuf_bcb_register_fix (+ 1<<16, capped) | pgbuf_bcb_is_hot |
They are merged (per the struct comment) because avoid-dealloc must change atomically yet 2-byte atomics are uncommon, so both ride in one CPU-native 4-byte word:
```c
// pgbuf_bcb_register_avoid_deallocation -- src/storage/page_buffer.c
  assert ((bcb->count_fix_and_avoid_dealloc & 0x00008000) == 0);   /* <- low-half top bit clear: overflow guard */
  (void) ATOMIC_INC_32 (&bcb->count_fix_and_avoid_dealloc, 1);     /* <- +1 touches only the low half */
```

register_fix adds 1 << 16 but only while below the cap PGBUF_FIX_COUNT_THRESHOLD << 16: once hot, it stops counting (hotness is a one-way latch, not a live count). pgbuf_bcb_is_hot compares against that same PGBUF_FIX_COUNT_THRESHOLD << 16 (fix count drives LRU hotness, Ch 7). The unregister path uses a CAS loop and tolerates an avoid-dealloc count already at zero (a pgbuf_ordered_fix corner case where the page was victimized and reloaded), logging via er_log_debug and breaking rather than underflowing. This counter is the second, orthogonal victim gate (fixed or dealloc-protected pages held out of reach), independent of the flag gate dirtiness rides; Ch 9 consumes both.
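The two half-word counters can be sketched as follows. Everything named `sk9_`/`SK9_` is hypothetical, and the threshold of 3 is an illustrative value only: the text does not give the real PGBUF_FIX_COUNT_THRESHOLD.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define SK9_FIX_SHIFT   16
#define SK9_AVOID_MASK  0x0000FFFFu
#define SK9_THRESHOLD   3u   /* illustrative; the real threshold is not stated here */

/* +1 on the low half; the assert guards against carrying into the high half. */
static void
sk9_register_avoid (_Atomic uint32_t *w)
{
  assert ((atomic_load (w) & 0x00008000u) == 0);
  atomic_fetch_add (w, 1u);
}

/* -1 on the low half via CAS; tolerate a count that is already zero. */
static bool
sk9_unregister_avoid (_Atomic uint32_t *w)
{
  uint32_t old_w = atomic_load (w);
  do
    {
      if ((old_w & SK9_AVOID_MASK) == 0)
        return false;           /* nothing to undo: do not underflow */
    }
  while (!atomic_compare_exchange_weak (w, &old_w, old_w - 1u));
  return true;
}

/* Count fixes in the high half, but stop once the cap is reached. */
static void
sk9_register_fix (_Atomic uint32_t *w)
{
  if (atomic_load (w) < (SK9_THRESHOLD << SK9_FIX_SHIFT))
    {
      atomic_fetch_add (w, 1u << SK9_FIX_SHIFT);
    }
}

static bool
sk9_is_hot (_Atomic uint32_t *w)
{
  return atomic_load (w) >= (SK9_THRESHOLD << SK9_FIX_SHIFT);
}
```

Comparing the whole word against `threshold << 16` works because the low half can never contribute enough to cross a high-half boundary: the word is below the cap exactly when the fix count is below the threshold.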
6.9 Chapter summary — key takeaways
- bcb->flags is one 32-bit word split at the half: low 16 bits = LRU index (PGBUF_LRU_INDEX_MASK), high 16 bits = flag bits (0x80000000..0x02000000) plus the zone selector (bits 16-19; LRU 1..3 << 16, INVALID 1<<18, VOID 2<<18). Flag and zone bits are disjoint, which lets the two mutators share the word without clobbering each other.
- pgbuf_bcb_update_flags and pgbuf_bcb_change_zone share one CAS retry loop: the former mutates flag bits (no-op early return) and reconciles dirties_cnt plus zone-3 candidacy; the latter mutates zone+index and reconciles per-list zone counters under the LRU mutex. Every reconciliation branch must run or the victim counter drifts; the default: assert(false) arms encode that an LRU BCB is in exactly one of zones 1/2/3.
- The four-bit PGBUF_BCB_INVALID_VICTIM_CANDIDATE_MASK (DIRTY, FLUSHING, VICTIM_DIRECT, INVALIDATE_DV) defines victim ineligibility; the other three flags (MOVE_TO_LRU_BOTTOM, TO_VACUUM, ASYNC_FLUSH_REQ) are victim-neutral. pgbuf_bcb_avoid_victim reads the whole mask; pgbuf_bcb_is_dirty reads one bit of it.
- Dirtying takes the hand-coded fast path pgbuf_bcb_set_dirty (set-DIRTY-only CAS) rather than the general mutator, maintaining the candidacy/dirty-counter invariants inline. The BCB write latch (asserted in pgbuf_set_dirty_buffer_ptr), not the CAS, serializes concurrent writers and keeps the DIRTY/LSA pair consistent.
- oldest_unflush_lsa is established exactly once per dirty cycle in pgbuf_set_lsa, guarded by LSA_ISNULL, never advanced forward, validated against the checkpoint redo horizon. Temp and auxiliary volumes are excluded (sentinel PGBUF_TEMP_LSA = (-2,-2)).
- count_fix_and_avoid_dealloc is a second packed word carrying fix-count (high 16 bits, capped, drives hotness via pgbuf_bcb_is_hot) and avoid-dealloc count (low 16 bits) so both fit one native atomic: the orthogonal fix/dealloc victim gate, independent of the flag gate that dirtiness rides.
Chapter 7: Unfix LRU Movement Aout History and Private to Shared Migration
This chapter answers: on unfix, how does a BCB move through the three LRU
zones, when does it boost to the top, when does it migrate from a private
to a shared list, and what role does the Aout 2Q ghost list play? Zone
model, private/shared split, and 2Q intent are in the high-level companion
(cubrid-page-buffer-manager.md); the BCB struct, packed flags/zone word,
and dirty bit are in Chapters 1 and 6 — reused here.
7.1 The unfix funnel — pgbuf_unfix to pgbuf_unlatch_bcb_upon_unfix

    // pgbuf_unfix -- src/storage/page_buffer.c
    CAST_PGPTR_TO_BFPTR (bufptr, pgptr);
    holder_status = pgbuf_unlatch_thrd_holder (thread_p, bufptr, &holder_perf_stat);
    // ... perf tracking (perfmon_pbx_unfix) elided ...
    if (pgbuf_lockfree_unfix_ro (thread_p, bufptr))    /* <- pure read latch: CAS-drop fcnt, no mutex */
      return;                                          /* <- never touches LRU */
    PGBUF_BCB_LOCK (bufptr);
    (void) pgbuf_unlatch_bcb_upon_unfix (thread_p, bufptr, holder_status);   /* releases mutex inside */

Invariant: the read-only fast path never reorders LRU. A shared latch
being dropped does not change zones; pgbuf_lockfree_unfix_ro returns
true after a CAS, so reordering is reserved for the last unfixer or a
writer; otherwise every reader would contend on the list mutex.
pgbuf_unlatch_bcb_upon_unfix is the decision engine; its prologue CASes
the fix count down:

    // pgbuf_unlatch_bcb_upon_unfix -- src/storage/page_buffer.c
    do
      {
        blocked_reader_writer = false;
        is_zero_fcnt = false;
        impl_orig = get_impl (&bufptr->atomic_latch);
        impl_new = impl_orig;
        impl_new.impl.fcnt--;                              /* <- drop one fix */
        blocked_reader_writer = impl_orig.impl.waiter_exists;
        if (impl_new.impl.fcnt == 0)
          {
            is_zero_fcnt = true;
            impl_new.impl.latch_mode = PGBUF_NO_LATCH;     /* <- last unfixer drops latch */
          }
        if (impl_new.impl.fcnt < 0)
          {                                                /* <- "freed too much": defensive reset */
            assert (false);
            er_set (...);
            impl_new.impl.fcnt = 0;
            impl_new.impl.waiter_exists = false;
            impl_new.impl.latch_mode = PGBUF_NO_LATCH;
            is_zero_fcnt = true;
            break;
          }
      }
    while (!bufptr->atomic_latch.compare_exchange_weak (impl_orig.raw, impl_new.raw, ...));

The CAS (Chapter 5) yields is_zero_fcnt (last holder) and
blocked_reader_writer (a latch waiter queued). Reordering runs only
when is_zero_fcnt && !blocked_reader_writer: a queued waiter re-latches
the BCB immediately, so moving it would be wasted work.
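The fix-count drop can be sketched as a self-contained C11 CAS loop. The layout of the latch word below is hypothetical (this chapter does not show the real field widths), and unlike the excerpt, which breaks out of the loop on a negative count, this sketch publishes the reset value through the CAS.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

enum latch_mode { NO_LATCH = 0, LATCH_READ = 1, LATCH_WRITE = 2 };

/* Hypothetical packing: mode, waiter flag and fix count in one 64-bit word. */
typedef union
{
  struct
  {
    uint16_t latch_mode;
    uint16_t waiter_exists;
    int32_t fcnt;
  } impl;
  uint64_t raw;
} latch_word;

_Static_assert (sizeof (latch_word) == sizeof (uint64_t), "latch word must be 8 bytes");

typedef struct
{
  bool is_zero_fcnt;            /* caller was the last holder */
  bool blocked_reader_writer;   /* a latch waiter was queued */
} unfix_result;

static uint64_t make_latch (enum latch_mode mode, bool waiter, int32_t fcnt)
{
  latch_word w;
  w.raw = 0;
  w.impl.latch_mode = (uint16_t) mode;
  w.impl.waiter_exists = waiter ? 1 : 0;
  w.impl.fcnt = fcnt;
  return w.raw;
}

static unfix_result drop_one_fix (_Atomic uint64_t *latch)
{
  unfix_result r;
  latch_word cur, next;

  cur.raw = atomic_load (latch);
  do
    {
      next = cur;
      next.impl.fcnt--;
      r.blocked_reader_writer = (cur.impl.waiter_exists != 0);
      r.is_zero_fcnt = false;
      if (next.impl.fcnt == 0)
        {
          r.is_zero_fcnt = true;
          next.impl.latch_mode = NO_LATCH;
        }
      else if (next.impl.fcnt < 0)
        {
          /* "freed too much": reset defensively instead of going negative */
          next.impl.fcnt = 0;
          next.impl.waiter_exists = 0;
          next.impl.latch_mode = NO_LATCH;
          r.is_zero_fcnt = true;
        }
    }
  while (!atomic_compare_exchange_weak (latch, &cur.raw, next.raw));
  return r;
}

/* LRU reordering is attempted only by the last holder with no waiter. */
static bool should_reorder_lru (unfix_result r)
{
  return r.is_zero_fcnt && !r.blocked_reader_writer;
}
```

On a failed CAS the expected value is refreshed, so both result flags are recomputed from the word that finally won.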
flowchart TD
A["pgbuf_unlatch_bcb_upon_unfix\nCAS: fcnt--"] --> B{"is_zero_fcnt?"}
B -->|no| W["wakeup reader/writer\nrelease mutex"]
B -->|yes| C{"MOVE_TO_LRU_BOTTOM?"}
C -->|yes| D["pgbuf_move_bcb_to_bottom_lru\ndealloc shortcut"]
C -->|no| E{"blocked_reader_writer?"}
E -->|yes| W
E -->|no| F["switch on zone"]
F --> Z0["VOID -> pgbuf_unlatch_void_zone_bcb"]
F --> Z1["LRU_1 -> keep or prv->shr"]
F --> Z2["LRU_2 -> keep, boost if old, or prv->shr"]
F --> Z3["LRU_3 -> boost or prv->shr or direct-victim"]
Z0 --> W
Z1 --> W
Z2 --> W
Z3 --> W
D --> W
Figure 7-1 — Branch structure of pgbuf_unlatch_bcb_upon_unfix. Only the
zero-fcnt, no-waiter path reaches the zone switch.
7.2 The dealloc shortcut and the zone switch

    // pgbuf_unlatch_bcb_upon_unfix -- src/storage/page_buffer.c
    if (is_zero_fcnt)
      {
        assert (LSA_ISNULL (&bufptr->oldest_unflush_lsa) || pgbuf_bcb_is_dirty (bufptr));
        if (pgbuf_bcb_should_be_moved_to_bottom_lru (bufptr))     /* <- MOVE_TO_LRU_BOTTOM flag */
          pgbuf_move_bcb_to_bottom_lru (thread_p, bufptr);        /* dealloc shortcut */
        else if (blocked_reader_writer == false)
          {
            th_lru_idx = PGBUF_THREAD_HAS_PRIVATE_LRU (thread_p)
              ? PGBUF_LRU_INDEX_FROM_PRIVATE (PGBUF_PRIVATE_LRU_FROM_THREAD (thread_p))
              : -1;                                               /* own list or none */
            switch (pgbuf_bcb_get_zone (bufptr)) { /* ... see 7.3 ... */ }
          }
      }

pgbuf_bcb_should_be_moved_to_bottom_lru tests the
PGBUF_BCB_MOVE_TO_LRU_BOTTOM_FLAG bit, set on the dealloc path: a
deallocated page is worthless hot, so it is shoved to the bottom for first
reclamation. th_lru_idx (own private list or -1) is the reference point
for every private/shared decision below.
Invariant — oldest_unflush_lsa implies the dirty bit. Chapter 6’s WAL
invariant at unfix: a page with a pending flush LSA must stay dirty, or the
flush daemons (Chapter 8) skip it and break WAL.
7.3 Zone-by-zone branch trace
Two guards recur in every LRU case, quoted once and referenced for all three zones:

    // pgbuf_unlatch_bcb_upon_unfix (per-case prologue) -- src/storage/page_buffer.c
    if (PGBUF_SHOULD_IGNORE_UNFIX (thread_p, bufptr))
      { ...KEEP_VAC stat...; break; }                                          /* <- don't warm cache */
    if (pgbuf_should_move_private_to_shared (thread_p, bufptr, th_lru_idx))    /* <- see 7.5 */
      {
        pgbuf_lru_move_from_private_to_shared (thread_p, bufptr);
        ...PRV_TO_SHR_MID stat...;
        break;
      }

PGBUF_SHOULD_IGNORE_UNFIX is not vacuum-only: its real definition is
VACUUM_IS_THREAD_VACUUM_WORKER (th) || pgbuf_is_temporary_volume (buf->vpid.volid)
(SERVER_MODE; false otherwise). It fires for vacuum workers and for
pages on temporary volumes; neither should warm the cache or promote a BCB
to hot (the source comment also names the checkpoint thread as a logical
member of this set). pgbuf_should_move_private_to_shared (7.5) escalates
contended pages. Only the default action after these guards differs by
zone, with one exception: in the LRU_3 case the ignore-unfix branch first
tries to hand the BCB out as a direct victim instead of simply breaking
(see below).
VOID (Chapter 4): delegated to pgbuf_unlatch_void_zone_bcb (7.4).
LRU_1 (hottest): after the guards, do nothing but register a hit; zone 1 is never reordered:

    /* after the per-case prologue, plus a PRV_KEEP/SHR_KEEP stat: */
    pgbuf_bcb_register_hit_for_lru (bufptr);
    break;                                       /* <- never boost zone 1 */

LRU_2 (boost-eligible): boost only if aged:

    if (PGBUF_IS_BCB_OLD_ENOUGH (bufptr, pgbuf_lru_list_from_bcb (bufptr)))
      pgbuf_lru_boost_bcb (thread_p, bufptr);    /* <- aged enough -> promote to top */
    else
      { ...PRV_KEEP / SHR_KEEP stat... }         /* <- too new: leave in place */
    pgbuf_bcb_register_hit_for_lru (bufptr);
    break;

LRU_3 (victim zone): a real unfix always boosts, but its
PGBUF_SHOULD_IGNORE_UNFIX branch may instead hand the BCB out as a direct
victim:

    case PGBUF_LRU_3_ZONE:
      if (PGBUF_SHOULD_IGNORE_UNFIX (...))
        {
          if (!pgbuf_bcb_avoid_victim (bufptr) && pgbuf_assign_direct_victim (thread_p, bufptr))
            { ...DIRECT_VACUUM_LRU stat... }     /* <- give it straight to a waiter */
          else
            { ...THREE_KEEP_VAC stat... }
          break;
        }
      if (pgbuf_should_move_private_to_shared (...))
        { ...move; THREE_PRV_TO_SHR_MID...; break; }
      pgbuf_lru_boost_bcb (thread_p, bufptr);    /* <- rule 3: always boost from zone 3 */
      pgbuf_bcb_register_hit_for_lru (bufptr);
      break;

After the switch the function wakes any latch waiter
(pgbuf_wakeup_reader_writer); on a requested async flush
(pgbuf_bcb_is_async_flush_request) it uses
pgbuf_bcb_safe_flush_force_unlock (which unlocks), else unlocks directly.
assert (... != PGBUF_LATCH_FLUSH) guards that unfix never sees a flush
latch — flushing is the daemons’ job (Chapter 8).
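The per-zone rules above reduce to a small pure function. The sketch below is a summary aid with hypothetical enum and function names; it takes the three guard results as booleans and ignores statistics, locking and the actual list surgery.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names; this mirrors the branch order described in 7.3 only. */
enum zone { ZONE_LRU_1, ZONE_LRU_2, ZONE_LRU_3 };

enum action
{
  ACT_KEEP,                 /* leave the BCB where it is */
  ACT_MOVE_TO_SHARED,       /* private -> shared migration */
  ACT_BOOST,                /* remove + add to top */
  ACT_TRY_DIRECT_VICTIM     /* zone 3 under ignore-unfix; falls back to keep */
};

static enum action decide_unfix_action (enum zone z, bool ignore_unfix,
                                        bool move_private_to_shared, bool old_enough)
{
  if (ignore_unfix)
    return z == ZONE_LRU_3 ? ACT_TRY_DIRECT_VICTIM : ACT_KEEP;
  if (move_private_to_shared)
    return ACT_MOVE_TO_SHARED;

  switch (z)
    {
    case ZONE_LRU_1:
      return ACT_KEEP;                           /* zone 1 is never boosted */
    case ZONE_LRU_2:
      return old_enough ? ACT_BOOST : ACT_KEEP;  /* boost only if aged */
    case ZONE_LRU_3:
      return ACT_BOOST;                          /* a real unfix always boosts */
    }
  assert (false);
  return ACT_KEEP;
}
```

For example, `decide_unfix_action (ZONE_LRU_2, false, false, false)` yields `ACT_KEEP`: a zone-2 page that was inserted too recently stays in place.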
7.4 VOID-zone landing — pgbuf_unlatch_void_zone_bcb and the Aout hit
A VOID BCB was just claimed for a non-resident page. It first removes the VPID from Aout (recording a re-fix as a hit), then branches on private-list ownership and Aout membership:

    // pgbuf_unlatch_void_zone_bcb -- src/storage/page_buffer.c
    if (pgbuf_Pool.buf_AOUT_list.max_count > 0)
      {
        aout_enabled = true;
        aout_list_id = pgbuf_remove_vpid_from_aout_list (thread_p, &bcb->vpid);   /* <- 2Q lookup+remove */
      }
    if (PGBUF_VACUUM_SHOULD_IGNORE_UNFIX (thread_p))                               /* vacuum worker only here */
      {
        if (!pgbuf_bcb_avoid_victim (bcb) && pgbuf_assign_direct_victim (thread_p, bcb))
          {
            // ... if Aout on: pgbuf_add_vpid_to_aout_list (..., aout_list_id) ... <- re-ghost
            return;
          }
        aout_list_id = PGBUF_AOUT_NOT_FOUND;                                       /* <- vacuum never gets Aout-boost */
      }
    if (thread_private_lru_index != -1)
      {
        if (PGBUF_VACUUM_SHOULD_IGNORE_UNFIX (thread_p))                           /* <- vacuum: top, no hit */
          {
            pgbuf_lru_add_new_bcb_to_top (thread_p, bcb, thread_private_lru_index);
            return;
          }
        if (!aout_enabled || thread_private_lru_index == aout_list_id)             /* <- Aout HIT -> top of LRU 1 */
          {
            pgbuf_lru_add_new_bcb_to_top (thread_p, bcb, thread_private_lru_index);
            pgbuf_bcb_register_hit_for_lru (bcb);
            return;
          }
        if (aout_list_id == PGBUF_AOUT_NOT_FOUND)                                  /* <- cold miss -> middle */
          {
            pgbuf_lru_add_new_bcb_to_middle (thread_p, bcb, thread_private_lru_index);
            pgbuf_bcb_register_hit_for_lru (bcb);
            return;
          }
        /* fall through: ghosted in a *different* private list -> shared */
      }
    pgbuf_lru_add_new_bcb_to_middle (thread_p, bcb, pgbuf_get_shared_lru_index_for_add ());   /* <- shared middle */
    if (!PGBUF_VACUUM_SHOULD_IGNORE_UNFIX (thread_p))
      pgbuf_bcb_register_hit_for_lru (bcb);

Note this branch gates on PGBUF_VACUUM_SHOULD_IGNORE_UNFIX (vacuum worker
only): the temp-volume arm of PGBUF_SHOULD_IGNORE_UNFIX from 7.3 is not
applied in the VOID path. Placement for a non-vacuum thread that owns a
private LRU list:
| Aout result | Placement | Meaning |
|---|---|---|
| == aout_list_id | top of own private list (LRU 1) | evicted from my list: 2Q second-touch, promote hot |
| !aout_enabled | top of own private list | no history; first unfix treated as warm |
| PGBUF_AOUT_NOT_FOUND | middle of own private list | never seen: cold, lands at the LRU-1/2 boundary |
| different list | middle of shared list | shared across workers; quotas (Ch 10) apply |
Invariant — Aout removal precedes placement. aout_list_id is captured
once before any insertion and drives the whole branch. If remove came after
placement, two threads re-fixing the same page could both “hit” and
double-promote; Aout_mutex (7.6) serializes lookup-and-remove so exactly
one thread consumes the ghost.
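The placement rules for a non-vacuum thread can be written as a pure function. This is a sketch with hypothetical names; only the value -2 for the not-found result and -1 for "no private list" are taken from the text, and the vacuum branches are deliberately left out.

```c
#include <assert.h>
#include <stdbool.h>

#define AOUT_NOT_FOUND  (-2)   /* value quoted for PGBUF_AOUT_NOT_FOUND */
#define NO_PRIVATE_LRU  (-1)   /* thread has no private list */

enum placement { PRIVATE_TOP, PRIVATE_MIDDLE, SHARED_MIDDLE };

/* aout_list_id is the result of the single lookup-and-remove done up front. */
static enum placement place_void_bcb (int thread_private_lru_index,
                                      bool aout_enabled, int aout_list_id)
{
  if (thread_private_lru_index != NO_PRIVATE_LRU)
    {
      /* no history at all, or ghosted from my own list: promote */
      if (!aout_enabled || thread_private_lru_index == aout_list_id)
        return PRIVATE_TOP;
      /* never seen before: cold insert */
      if (aout_list_id == AOUT_NOT_FOUND)
        return PRIVATE_MIDDLE;
      /* ghosted from someone else's private list: fall through to shared */
    }
  return SHARED_MIDDLE;
}
```

Passing the lookup result in as a parameter reflects the invariant above: the ghost is consumed once, before any insertion decision is made.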
7.5 pgbuf_should_move_private_to_shared — the migration test

    // pgbuf_should_move_private_to_shared -- src/storage/page_buffer.c
    int bcb_lru_idx = pgbuf_bcb_get_lru_index (bcb);
    if (PGBUF_IS_SHARED_LRU_INDEX (bcb_lru_idx))
      return false;                                  /* <- already shared */
    if (thread_private_lru_index != bcb_lru_idx)
      return true;                                   /* cond 1: foreign-thread unfix */
    if (!pgbuf_bcb_is_hot (bcb))
      return false;                                  /* cond 2a: must be hot */
    if (!PGBUF_IS_BCB_OLD_ENOUGH (bcb, PGBUF_GET_LRU_LIST (bcb_lru_idx)))
      return false;                                  /* cond 2b: and old */
    return true;                                     /* hot + aged -> escalate to shared */

Two triggers: (1) foreign unfix: the BCB lives in private list X but
the unfixer's own list is Y (or -1); a page touched by more than one worker
goes shared. (2) hot and old: same list, but both hot (pgbuf_bcb_is_hot)
and old enough.

    // pgbuf_bcb_is_hot / pgbuf_bcb_register_fix -- src/storage/page_buffer.c
    // hot: count_fix_and_avoid_dealloc >= (PGBUF_FIX_COUNT_THRESHOLD << PGBUF_BCB_COUNT_FIX_SHIFT_BITS)
    //      == 64 << 16 (fix count lives in the high 16 bits)
    // register_fix saturates: stops incrementing once the threshold bit is set.

count_fix_and_avoid_dealloc packs the fix count (high 16 bits, bumped by
pgbuf_bcb_register_fix and saturating at the 64-fix threshold) and the
avoid-dealloc count (low 16 bits, PGBUF_BCB_AVOID_DEALLOC_MASK); see
Chapter 6.

    // PGBUF_IS_BCB_OLD_ENOUGH -- src/storage/page_buffer.c
    #define PGBUF_IS_BCB_OLD_ENOUGH(bcb, lru_list) \
      (PGBUF_AGE_DIFF ((bcb)->tick_lru_list, (lru_list)->tick_list) >= ((lru_list)->count_lru2 / 2))

A BCB stamps tick_lru_list from tick_list on insert; tick_list bumps
on every add-to-top/middle. "Old enough" = passed by at least half of
zone 2's worth (count_lru2 / 2) of newer inserts, so a page fixed twice
in quick succession is not boosted on the second unfix. PGBUF_AGE_DIFF
handles the wraparound of the 31-bit tick.
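The age test can be illustrated with a wraparound-aware difference. The definition of PGBUF_AGE_DIFF is not quoted in this chapter, so the `age_diff` helper below is an assumption about how a tick that wraps to 0 at INT32_MAX could be compared; the function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed semantics: ticks live in [0, INT32_MAX) and wrap back to 0. */
static int32_t age_diff (int32_t bcb_tick, int32_t list_tick)
{
  assert (bcb_tick >= 0 && list_tick >= 0);
  return list_tick >= bcb_tick
    ? list_tick - bcb_tick
    : INT32_MAX - (bcb_tick - list_tick);    /* list tick wrapped since the stamp */
}

/* Old enough = at least half of zone 2's population was inserted after us. */
static bool is_old_enough (int32_t bcb_tick, int32_t list_tick, int32_t count_lru2)
{
  return age_diff (bcb_tick, list_tick) >= count_lru2 / 2;
}
```

With `count_lru2 == 10`, a BCB stamped at tick 100 is not old enough at list tick 103 (age 3 < 5) but is at tick 105 (age 5 >= 5).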
7.6 The Aout 2Q ghost list — pgbuf_aout_buf and pgbuf_aout_list
Aout holds VPIDs only (not BCBs) for recently victimized pages: a FIFO fronted by per-shard hash tables for O(1) lookup.

    // struct pgbuf_aout_buf -- src/storage/page_buffer.c
    struct pgbuf_aout_buf
    {
      VPID vpid;                /* page VPID */
      int lru_idx;              /* which LRU list it was evicted from */
      PGBUF_AOUT_BUF *next;     /* next element in list */
      PGBUF_AOUT_BUF *prev;     /* prev element in list */
    };

| Field | Role |
|---|---|
| vpid | ghosted identity / hash key; VPID_SET_NULL marks a free node |
| lru_idx | LRU list it was evicted from; re-fix re-enters the same list (7.4) |
| next / prev | FIFO links, doubling as the free-list link when recycled; prev gives O(1) middle unlink on a hit |

    // struct pgbuf_aout_list -- src/storage/page_buffer.c
    struct pgbuf_aout_list
    {
      pthread_mutex_t Aout_mutex;     /* integrity of the whole list (SERVER_MODE) */
      PGBUF_AOUT_BUF *Aout_top;       /* top of the queue (most recent) */
      PGBUF_AOUT_BUF *Aout_bottom;    /* bottom of the queue (oldest) */
      PGBUF_AOUT_BUF *Aout_free;      /* free list of recycled nodes */
      PGBUF_AOUT_BUF *bufarray;       /* preallocated node array */
      int num_hashes;                 /* number of hash shards */
      MHT_TABLE **aout_buf_ht;        /* per-shard VPID -> node hash */
      int max_count;                  /* capacity; <= 0 disables Aout */
    };

| Field | Role |
|---|---|
| Aout_mutex | global list+hash lock; serializes the 7.4 lookup-remove |
| Aout_top / Aout_bottom | newest / oldest ghost; insertion vs eviction points |
| Aout_free | recycled nodes; avoids malloc on the victim path |
| bufarray | preallocated backing storage; no runtime alloc |
| num_hashes / aout_buf_ht | shard count and per-shard MHT_TABLE* for O(1) VPID lookup over the FIFO (AOUT_HASH_IDX) |
| max_count | capacity / enable switch; <= 0 disables Aout entirely |
The LRU list struct these BCBs move within (pgbuf_lru_list) carries the
boundary pointers and tick clocks 7.5/7.7 lean on:
| Field | Role | Used by |
|---|---|---|
| top / bottom | list endpoints | add-to-top, add-to-bottom |
| bottom_1 | last BCB of zone 1 | add-to-middle inserts after it |
| bottom_2 | last BCB of zone 2 | repaired on removal (zone-2 care, 7.7) |
| victim_hint | where victim search starts | advanced on every remove |
| count_lru1/2/3 | per-zone populations | count_lru2/2 = old-enough threshold |
| threshold_lru1/2 | zone-size targets | drive pgbuf_lru_adjust_zone* |
| tick_list | bumped on add-to-top/middle | boost-age clock (PGBUF_IS_BCB_OLD_ENOUGH) |
| tick_lru3 | bumped on fall-to-zone-3 | victim-hint ordering |
| index | list id | private vs shared classification |
flowchart LR
subgraph AOUT["pgbuf_aout_list (FIFO + hash)"]
T["Aout_top\n(newest)"] --> N1["node"] --> N2["node"] --> B["Aout_bottom\n(oldest)"]
HT["aout_buf_ht[shard]\nVPID -> node"] -.-> N1
FR["Aout_free\n(recycled)"]
end
VICT["victimization\npgbuf_add_vpid_to_aout_list"] --> T
B -->|"full -> drop oldest"| FR
REFIX["re-fix\npgbuf_remove_vpid_from_aout_list"] -.->|hit| FR
Figure 7-2 — Aout as a fixed-capacity FIFO with a hash index.
pgbuf_add_vpid_to_aout_list (from the direct-victim branches of 7.4 and
pgbuf_lru_fall_bcb_to_zone_3): under Aout_mutex, if Aout_free is empty
it evicts Aout_bottom (mht_rem), else pops a free node; stamps
lru_idx/vpid, mht_puts, links at Aout_top.
pgbuf_remove_vpid_from_aout_list: mht_get; if absent returns
PGBUF_AOUT_NOT_FOUND (-2, a true fault); if present it captures
aout_list_id = aout_buf->lru_idx, unlinks, mht_rems, nulls the VPID,
resets lru_idx, pushes the node onto Aout_free, and returns lru_idx
— the value 7.4 compares against thread_private_lru_index.
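The add and remove operations can be modelled with a tiny fixed-capacity ghost FIFO. This sketch is single-threaded and uses a linear scan where the real list uses per-shard hash tables under Aout_mutex; every name in it is hypothetical, and the capacity of 4 is chosen only to make eviction easy to observe.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define AOUT_CAP        4
#define AOUT_NOT_FOUND  (-2)

struct vpid { int32_t pageid; int16_t volid; };

struct ghost
{
  struct vpid vpid;
  int lru_idx;                    /* list the page was evicted from */
  struct ghost *next, *prev;      /* FIFO links; next doubles as free-list link */
};

struct aout
{
  struct ghost nodes[AOUT_CAP];   /* preallocated backing storage */
  struct ghost *top, *bottom, *free_list;
};

static struct vpid mk_vpid (int32_t pageid, int16_t volid)
{
  struct vpid v = { pageid, volid };
  return v;
}

static void aout_init (struct aout *a)
{
  a->top = a->bottom = a->free_list = NULL;
  for (int i = 0; i < AOUT_CAP; i++)
    {
      a->nodes[i].prev = NULL;
      a->nodes[i].next = a->free_list;
      a->free_list = &a->nodes[i];
    }
}

static void aout_unlink (struct aout *a, struct ghost *g)
{
  if (g->prev != NULL) g->prev->next = g->next; else a->top = g->next;
  if (g->next != NULL) g->next->prev = g->prev; else a->bottom = g->prev;
}

/* Ghost a victimized page at the top; recycle the oldest ghost when full. */
static void aout_add (struct aout *a, struct vpid v, int lru_idx)
{
  struct ghost *g = a->free_list;
  if (g != NULL)
    a->free_list = g->next;
  else
    {
      g = a->bottom;
      aout_unlink (a, g);
    }
  g->vpid = v;
  g->lru_idx = lru_idx;
  g->prev = NULL;
  g->next = a->top;
  if (a->top != NULL) a->top->prev = g; else a->bottom = g;
  a->top = g;
}

/* Lookup-and-remove: returns the remembered list id, or AOUT_NOT_FOUND. */
static int aout_remove (struct aout *a, struct vpid v)
{
  for (struct ghost *g = a->top; g != NULL; g = g->next)
    if (g->vpid.pageid == v.pageid && g->vpid.volid == v.volid)
      {
        int idx = g->lru_idx;
        aout_unlink (a, g);
        g->prev = NULL;
        g->next = a->free_list;
        a->free_list = g;
        return idx;
      }
  return AOUT_NOT_FOUND;
}
```

A second `aout_remove` for the same page returns `AOUT_NOT_FOUND`, which is the property that keeps one ghost from promoting two re-fixes.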
7.7 LRU insertion primitives and the boost
Zone-2 and zone-3 boosts route through pgbuf_lru_boost_bcb:

    // pgbuf_lru_boost_bcb -- src/storage/page_buffer.c
    assert (zone != PGBUF_LRU_1_ZONE);                        /* <- never called on zone 1 */
    pthread_mutex_lock (&lru_list->mutex);
    pgbuf_remove_from_lru_list (thread_p, bcb, lru_list);     /* unlink */
    pgbuf_lru_add_bcb_to_top (thread_p, bcb, lru_list);       /* relink at top of zone 1 */
    if (zone == PGBUF_LRU_2_ZONE)
      pgbuf_lru_adjust_zone1 (thread_p, lru_list, true);      /* only zone 1 grew */
    else
      pgbuf_lru_adjust_zones (thread_p, lru_list, true);      /* zone 3: rebalance all */
    pthread_mutex_unlock (&lru_list->mutex);

pgbuf_lru_add_bcb_to_top patches the links, sets top (and bottom/
bottom_1 if empty), increments tick_list (the clock that ages every
other BCB for PGBUF_IS_BCB_OLD_ENOUGH), then change_zone marks it
PGBUF_LRU_1_ZONE. pgbuf_lru_add_bcb_to_middle inserts after bottom_1
(the zone-1 bottom), also bumps tick_list, and marks zone 2;
pgbuf_lru_add_bcb_to_bottom appends at bottom, stamps tick_lru3, and
marks zone 3.
pgbuf_remove_from_lru_list is the inverse and repairs every boundary
pointer before moving the BCB to VOID:

    // pgbuf_remove_from_lru_list -- src/storage/page_buffer.c
    if (lru_list->top == bufptr)
      lru_list->top = bufptr->next_BCB;
    if (lru_list->bottom == bufptr)
      lru_list->bottom = bufptr->prev_BCB;
    if (lru_list->bottom_1 == bufptr)
      lru_list->bottom_1 = bufptr->prev_BCB;
    if (lru_list->bottom_2 == bufptr)
      {                                                     /* <- zone-2 boundary needs care */
        if (bufptr->prev_BCB != NULL && pgbuf_bcb_get_zone (bufptr->prev_BCB) == PGBUF_LRU_2_ZONE)
          lru_list->bottom_2 = bufptr->prev_BCB;
        else
          {
            assert (lru_list->count_lru2 == 1);
            lru_list->bottom_2 = NULL;
          }
      }
    /* splice neighbors, null this bcb's links */
    pgbuf_lru_advance_victim_hint (thread_p, lru_list, bufptr, bcb_prev, false);
    pgbuf_bcb_change_zone (thread_p, bufptr, 0, PGBUF_VOID_ZONE);   /* <- now belongs to no zone */

Invariant: a removed BCB's zone matches its links. It lands in VOID, and
the victim hint advances before the zone change. The function ends with
change_zone(..., PGBUF_VOID_ZONE) so a BCB unlinked from a list never keeps
an LRU zone tag. In the excerpt, pgbuf_lru_advance_victim_hint runs after
the splice, using the saved bcb_prev, and before the zone change; if the
hint were left pointing at the unlinked BCB, a victimizer could chase its
nulled links. Boost = remove + add-to-top leaves the BCB momentarily
in VOID, but the whole sequence runs under lru_list->mutex, so no thread
observes the gap.
pgbuf_lru_fall_bcb_to_zone_3 is the demotion counterpart, run by the
zone-adjust functions when zones 1/2 exceed their thresholds:

    // pgbuf_lru_fall_bcb_to_zone_3 -- src/storage/page_buffer.c
    assert (pgbuf_bcb_get_zone (bcb) == PGBUF_LRU_1_ZONE || pgbuf_bcb_get_zone (bcb) == PGBUF_LRU_2_ZONE);
    #if defined (SERVER_MODE)
    if (pgbuf_is_bcb_victimizable (bcb, false) && pgbuf_is_any_thread_waiting_for_direct_victim ())
      {
        if (pgbuf_bcb_is_to_vacuum (bcb))
          { /* ...stat; fall through... */ }
        else if (PGBUF_BCB_TRYLOCK (bcb) == 0)          /* <- conditional: avoid list/bcb lock-order deadlock */
          {
            VPID vpid_copy = bcb->vpid;
            if (pgbuf_is_bcb_victimizable (bcb, true) && pgbuf_assign_direct_victim (thread_p, bcb))
              {
                pgbuf_remove_from_lru_list (thread_p, bcb, lru_list);
                PGBUF_BCB_UNLOCK (bcb);
                pgbuf_add_vpid_to_aout_list (thread_p, &vpid_copy, lru_list->index);   /* <- ghost on the way out */
                return;
              }
            PGBUF_BCB_UNLOCK (bcb);                     /* not assigned; fall through */
          }
      }
    #endif
    bcb->tick_lru3 = lru_list->tick_lru3;               /* stamp zone-3 position */
    if (++lru_list->tick_lru3 >= DB_INT32_MAX)
      lru_list->tick_lru3 = 0;
    pgbuf_bcb_change_zone (thread_p, bcb, lru_list->index, PGBUF_LRU_3_ZONE);

PGBUF_BCB_TRYLOCK is conditional because lock order is normally
bcb-then-list but we already hold the list mutex; rather than deadlock it
gives up and lets the BCB be victimized later (Chapter 9). The direct-victim
branch ghosts the VPID into Aout on the way out, closing the 2Q loop
with 7.4's lookup. tick_lru3 (small at the bottom) feeds the victim hint,
distinct from tick_lru_list which feeds the boost age.
7.8 Chapter summary — key takeaways
- LRU movement is gated on is_zero_fcnt && !blocked_reader_writer. Shared-read unfixes take pgbuf_lockfree_unfix_ro and never touch the list; only the last unfixer with no waiter reorders the BCB.
- MOVE_TO_LRU_BOTTOM is a dealloc shortcut that bypasses the zone switch and shoves a deallocated page to the bottom for fast reclamation.
- Zone sets the default action: zone 1 never boosts; zone 2 boosts only when PGBUF_IS_BCB_OLD_ENOUGH (past half of count_lru2); zone 3 always boosts on a real unfix, or hands out a direct victim under the ignore-unfix branch.
- PGBUF_SHOULD_IGNORE_UNFIX is vacuum-OR-temp-volume, not vacuum-only: pages on temporary volumes are also kept from warming the cache, and the VOID path narrows this to vacuum workers (PGBUF_VACUUM_SHOULD_IGNORE_UNFIX).
- Boost = pgbuf_remove_from_lru_list + pgbuf_lru_add_bcb_to_top under the list mutex; add-to-top bumps tick_list, the clock that ages every other BCB for the old-enough test. A removed BCB always lands in VOID, keeping the zone field consistent with the list it is linked in.
- pgbuf_should_move_private_to_shared fires on two triggers: a foreign-thread unfix (immediate), or a same-list page that is both hot (>= 64 packed, saturating fixes) and old enough, escalating contended pages to the shared pool.
- Aout is a fixed-capacity VPID-only ghost FIFO with a hash index, and its lookup-and-remove is serialized under Aout_mutex and precedes placement, so one thread consumes a ghost and no page is double-promoted: a re-fix found for the same private list lands at the top of LRU 1, a different list goes shared, not-found lands cold in the middle; victimization re-ghosts the outgoing VPID to keep the loop closed.
Chapter 8: Flushing Under the WAL Rule and the Flush Daemons
A dirty page may not reach disk until the log record for its most recent
change is durable. This chapter traces how the page buffer enforces that
log-before-page ordering inside pgbuf_bcb_flush_with_wal, how the
three flush daemons pace and batch their writes, and where the
double-write buffer (DWB) intercepts the write. For the why of WAL and
the high-level picture, see the companion cubrid-page-buffer-manager.md
(“Write-Ahead Logging”, “Flushing and the daemons”). The DWB’s block
geometry and crash-recovery rationale live in
cubrid-double-write-buffer.md; the durability semantics of
logpb_flush_log_for_wal and the flushed-LSA bookkeeping live in
cubrid-log-manager-detail.md — both are referenced, not re-derived,
here. The flushing/dirty flags and the atomic-latch FLUSH mode come from
Chapters 6 and 5; this chapter shows how the flush path consumes them.
8.1 Single-page entry points
Four public entry points push one page toward disk; all funnel into
pgbuf_bcb_safe_flush_internal, which decides whether the flush happens
now, is delegated, or is awaited.
flowchart TB
F["pgbuf_flush\n(optionally unfix after)"] --> FW["pgbuf_flush_with_wal"]
FIR["pgbuf_flush_if_requested\n(permanently write-latched page)"] -->|ASYNC_FLUSH_REQ set| SFFU
FW --> SFFU["pgbuf_bcb_safe_flush_force_unlock"]
SFFU --> SFI["pgbuf_bcb_safe_flush_internal"]
SFI -->|immediate_flush| BFW["pgbuf_bcb_flush_with_wal"]
SFI -->|page write-latched by other| REQ["set ASYNC_FLUSH_REQ\nlet holder flush on unfix"]
SFI -->|already flushing / latched, synchronous| BLK["pgbuf_block_bcb\nPGBUF_LATCH_FLUSH wait"]
Figure 8-1 — Single-page flush entry points converging on
pgbuf_bcb_safe_flush_internal.
pgbuf_flush_with_wal is the canonical caller — it asserts a READ+ latch
held by the calling thread, locks the BCB mutex, and delegates
synchronously:

    // pgbuf_flush_with_wal -- src/storage/page_buffer.c
    CAST_PGPTR_TO_BFPTR (bufptr, pgptr);
    /* In CUBRID, the caller is holding WRITE page latch */
    assert (get_latch (&bufptr->atomic_latch) >= PGBUF_LATCH_READ
            && pgbuf_find_thrd_holder (thread_p, bufptr) != NULL);
    PGBUF_BCB_LOCK (bufptr);
    if (pgbuf_bcb_safe_flush_force_unlock (thread_p, bufptr, true) != NO_ERROR)   /* <- synchronous=true */
      {
        ASSERT_ERROR ();
        return NULL;
      }

pgbuf_flush wraps this and unfixes afterward when free_page == FREE;
its header warns it does not guarantee the page reached disk, so callers
needing durability use pgbuf_flush_with_wal and check the return.
pgbuf_flush_if_requested serves a thread holding a page permanently
write-latched (it can never unfix to trigger a normal flush): it asserts
a WRITE latch held by the caller, checks
pgbuf_bcb_is_async_flush_request (bcb), and only when set locks and
flushes with synchronous=false — the consumer side of the
PGBUF_BCB_ASYNC_FLUSH_REQ flag the daemon/checkpoint sets on a
write-latched victim.
8.2 The decision core: pgbuf_bcb_safe_flush_internal
The caller holds the BCB mutex. The function short-circuits clean pages, then runs a CAS loop choosing among the outcomes below. A flush cannot happen immediately for exactly two reasons, both spelled out in the source: the page is write-latched by another thread (its contents could change mid-write), or another thread is already flushing it (two writers could reorder an old version over a new one).

    // pgbuf_bcb_safe_flush_internal -- src/storage/page_buffer.c
    if (!pgbuf_bcb_is_dirty (bufptr))
      return NO_ERROR;                       /* <- clean: nothing to do, stays locked */
    do
      {
        immediate_flush = false;
        block = false;
        is_flushing = false;
        impl = get_impl (&bufptr->atomic_latch);
        impl_new = impl;
        is_flushing = pgbuf_bcb_is_flushing (bufptr);
        if (!is_flushing
            && (impl.impl.latch_mode == PGBUF_NO_LATCH
                || impl.impl.latch_mode == PGBUF_LATCH_READ
                || (impl.impl.latch_mode == PGBUF_LATCH_WRITE
                    && pgbuf_find_thrd_holder (thread_p, bufptr) != NULL)))   /* <- I am the writer */
          immediate_flush = true;
        else
          {
            assert (is_flushing || impl.impl.latch_mode == PGBUF_LATCH_WRITE);   /* <- only these reach else */
            if (synchronous)
              {
                block = true;
                impl_new.impl.waiter_exists = true;                              /* <- publish waiter into latch word */
              }
          }
      }
    while (!bufptr->atomic_latch.compare_exchange_strong (impl.raw, impl_new.raw, ...));

| Outcome | Condition | Action |
|---|---|---|
| immediate_flush | not flushing; unlatched, read-latched, or write-latched by me | pgbuf_bcb_flush_with_wal (..., false, locked): flush now |
| async request | not flushing, write-latched by another | pgbuf_bcb_update_flags (..., PGBUF_BCB_ASYNC_FLUSH_REQ, 0): holder flushes on unfix |
| block | flushing or foreign write-latch, synchronous==true | *locked=false; pgbuf_block_bcb (..., PGBUF_LATCH_FLUSH, ...): sleep |
| no-wait return | foreign latch/flush, synchronous==false | return NO_ERROR without flushing |
Note the async-request flag is set whenever the immediate path was not
taken and the BCB is not already flushing (i.e. a foreign write-latch),
regardless of synchronous — the synchronous caller then also blocks.
Invariant — at most one flusher per BCB. A page flushes only while
pgbuf_bcb_is_flushing is false when the writer commits the flushing flag
(set inside pgbuf_bcb_flush_with_wal). A second thread sees
is_flushing == true, cannot take immediate_flush, and blocks
(PGBUF_LATCH_FLUSH) or returns. Violate it and two fileio_write calls
race, landing an older image after a newer one and corrupting the page.
The force_unlock/force_lock wrappers only normalize the locked
out-parameter, since the internal function may drop the mutex when it
blocks.
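The outcome table can be condensed into a pure decision function. The sketch below uses hypothetical names and leaves out the CAS on the latch word, the mutex handling and the actual flush; it only shows which outcome each combination of inputs selects and when the async-request flag would be set.

```c
#include <assert.h>
#include <stdbool.h>

enum latch { NO_LATCH, LATCH_READ, LATCH_WRITE };

enum outcome { NOTHING_TO_DO, FLUSH_NOW, BLOCK_FOR_FLUSH, RETURN_NO_WAIT };

struct decision
{
  enum outcome outcome;
  bool set_async_flush_req;   /* ask the foreign write-latch holder to flush */
};

static struct decision decide_flush (bool dirty, bool is_flushing, enum latch mode,
                                     bool held_by_me, bool synchronous)
{
  struct decision d = { NOTHING_TO_DO, false };

  if (!dirty)
    return d;                               /* clean page: short-circuit */

  if (!is_flushing && (mode != LATCH_WRITE || held_by_me))
    {
      d.outcome = FLUSH_NOW;                /* unlatched, read-latched, or my own write latch */
      return d;
    }

  /* Only "already flushing" or "foreign write latch" reach this point. */
  assert (is_flushing || mode == LATCH_WRITE);
  d.set_async_flush_req = !is_flushing;     /* foreign write latch: delegate */
  d.outcome = synchronous ? BLOCK_FOR_FLUSH : RETURN_NO_WAIT;
  return d;
}
```

A synchronous caller facing a foreign write latch gets both effects: the flag is set for the holder and the caller blocks until the flush completes.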
8.3 pgbuf_bcb_flush_with_wal — the durable write
The heart of the chapter. The caller holds the mutex; the function copies
the page, enforces WAL, writes through the DWB or directly, and on success
clears FLUSHING and wakes waiters; on failure it reverts DIRTY and
oldest_unflush_lsa.
flowchart TB
A["mark_is_flushing\nset FLUSHING, clear DIRTY"] --> C["copy_unflushed_lsa\nsave lsa+oldest_unflush\nNULL oldest_unflush, UNLOCK"]
C --> D{oldest_unflush_lsa\nnon-null?}
D -->|yes| E["logpb_flush_log_for_wal"]
D -->|no| F["debug: changed not logged"]
E --> G{uses_dwb?}
F --> G
G -->|yes| H["dwb_add_page"]
G -->|no| I["fileio_write"]
H --> J{error?}
I --> J
J -->|fail| K["mark_was_not_flushed\nrestore DIRTY+lsa, wake, ER_FAILED"]
J -->|ok, flush thread + waiter| L["queue to flushed_bcbs\nwake post-flush daemon"]
J -->|ok| M["mark_was_flushed\nclear FLUSHING, wake"]
Figure 8-2 — pgbuf_bcb_flush_with_wal branch map.
Step 1 — claim the flush, clear DIRTY.
was_dirty = pgbuf_bcb_mark_is_flushing (...) sets FLUSHING_TO_DISK and
atomically clears DIRTY and ASYNC_FLUSH_REQ:

    // pgbuf_bcb_mark_is_flushing -- src/storage/page_buffer.c
    if (pgbuf_bcb_is_dirty (bcb))
      {
        pgbuf_bcb_update_flags (thread_p, bcb, PGBUF_BCB_FLUSHING_TO_DISK_FLAG,
                                PGBUF_BCB_DIRTY_FLAG | PGBUF_BCB_ASYNC_FLUSH_REQ);   /* <- set | clear */
        return true;
      }

Invariant: DIRTY clears at flush start, not end. A concurrent
re-dirty during the long copy-to-write window re-sets DIRTY and is not
lost; the page just flushes again. On write failure,
pgbuf_bcb_mark_was_not_flushed re-sets DIRTY (when was_dirty).
Step 2 — copy the image. At start_copy_page the page is copied into
a stack iopage (via tde_encrypt_data_page if TDE-encrypted, else a
memcpy of IO_PAGESIZE). If uses_dwb, the copy is staged into a DWB
slot by dwb_set_data_on_next_slot; on a granted slot the local iopage
is nulled and control jumps to copy_unflushed_lsa.
Step 3 — WAL enforcement. At copy_unflushed_lsa it saves the page
LSA and oldest_unflush_lsa, NULLs bufptr->oldest_unflush_lsa, drops
the mutex, and — if the saved oldest_unflush_lsa is non-null — forces the
log:

    // pgbuf_bcb_flush_with_wal -- src/storage/page_buffer.c
    LSA_COPY (&lsa, &(bufptr->iopage_buffer->iopage.prv.lsa));
    LSA_COPY (&oldest_unflush_lsa, &bufptr->oldest_unflush_lsa);
    LSA_SET_NULL (&bufptr->oldest_unflush_lsa);
    PGBUF_BCB_UNLOCK (bufptr);
    *is_bcb_locked = false;
    if (!LSA_ISNULL (&oldest_unflush_lsa))
      logpb_flush_log_for_wal (thread_p, &lsa);   /* <- log-before-page: force log up to page LSA */

WAL INVARIANT: log up to the page LSA is durable before the write. The page is never handed to fileio_write / dwb_add_page until the log tail is forced through lsa (the page's own prv.lsa). Enforcement is structural: logpb_flush_log_for_wal (thread_p, &lsa) sits between the mutex drop and the write, and lsa is read before the mutex drops so it cannot be re-stamped underneath. The trigger gate is oldest_unflush_lsa != NULL, but the force targets lsa (the newest change), not oldest_unflush_lsa. What breaks if skipped: a crash after the page write but before the log write leaves a persisted change whose redo/undo never reached disk; recovery cannot reconstruct or roll it back, so the page is silently corrupt. See cubrid-log-manager-detail.md for how logpb_flush_log_for_wal guarantees durability to a given LSA. The else branch (null oldest_unflush_lsa) is the rare "changed but not logged" case (temporary volumes) and only emits a debug note.

Null-ing bufptr->oldest_unflush_lsa lets a re-dirty during the write
window re-stamp a fresh value that a later flush re-forces; on write
failure the saved value is restored (Step 5a).
Step 4 — the write. DWB use is gated by
uses_dwb = dwb_is_created () && !is_temp (temp volumes always bypass it).
If uses_dwb, dwb_add_page registers the page’s VPID into the staged
slot; the DWB batches pages and flushes a full block to the double-write
area and then the data files (block geometry and the torn-page recovery
argument are in cubrid-double-write-buffer.md). Subtle branch: if DWB was
disabled between staging and adding, dwb_add_page returns
dwb_slot == NULL, so the code clears uses_dwb, re-locks, and
goto start_copy_page to retry direct. The direct path does a plain
fileio_write (bumping num_pages_written, PSTAT_PB_NUM_IOWRITES) with
mode FILEIO_WRITE_NO_COMPENSATE_WRITE when a DWB exists globally
(double-write makes torn-page compensation redundant), else
FILEIO_WRITE_DEFAULT_WRITE.
Step 5a — write failure. Re-lock; pgbuf_bcb_mark_was_not_flushed (.., was_dirty) clears FLUSHING and restores DIRTY. Then restore the saved
oldest_unflush_lsa, wake PGBUF_LATCH_FLUSH waiters (only if
next_wait_thrd != NULL), and return ER_FAILED.
Step 5b — success, daemon hand-off. This path is taken only when all of the
following hold: the caller is the page flush daemon (is_page_flush_thread),
the post-flush daemon exists, a thread is waiting for a direct victim, and
the BCB is accepted into pgbuf_Pool.flushed_bcbs (via produce). The BCB is
then left unlocked but un-cleared for the post-flush daemon to assign as a
victim (Chapter 9), and that daemon is woken; mark_was_flushed is
deliberately not called on this path. Step 5c (otherwise) re-locks, calls
pgbuf_bcb_mark_was_flushed (clears FLUSHING), and wakes flush waiters when
any are queued.
8.4 Waking the FLUSH waiters: pgbuf_wake_flush_waiters
Threads that took the block branch in 8.2 park on the BCB’s
next_wait_thrd list with request_latch_mode == PGBUF_LATCH_FLUSH. The
waker unlinks only FLUSH waiters, leaving READ/WRITE latch waiters in
place:
```c
// pgbuf_wake_flush_waiters -- src/storage/page_buffer.c
for (crt_waiter = bcb->next_wait_thrd; crt_waiter != NULL; crt_waiter = save_next_waiter)
  {
    save_next_waiter = crt_waiter->next_wait_thrd;
    if (crt_waiter->request_latch_mode == PGBUF_LATCH_FLUSH)
      {
        if (prev_waiter != NULL)
          prev_waiter->next_wait_thrd = save_next_waiter;
        else
          bcb->next_wait_thrd = save_next_waiter;  /* <- unlink only FLUSH waiters */
        crt_waiter->next_wait_thrd = NULL;
        pgbuf_wakeup_uncond (crt_waiter);
      }
    else
      {
        prev_waiter = crt_waiter;  /* <- keep latch waiters threaded */
      }
  }
```

The caller must hold the BCB mutex. Both the failure and success paths of
8.3 call it, but only when next_wait_thrd != NULL. Mixing FLUSH and latch
waiters on one list is why the loop tracks prev_waiter instead of
truncating.
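
The filtered unlink is easy to exercise in isolation. The sketch below reuses the loop shape on a hypothetical struct waiter; the enum values, the woken field (standing in for the real wakeup call), and wake_flush_waiters are illustrative names, not CUBRID definitions.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the thread entry and its requested latch mode. */
enum latch_mode { LATCH_READ, LATCH_WRITE, LATCH_FLUSH };

struct waiter
{
  struct waiter *next;
  enum latch_mode mode;
  int woken;                    /* set instead of calling a real wakeup */
};

/* Unlink and "wake" every LATCH_FLUSH waiter, keeping all other waiters
   linked in their original order. Returns the number of waiters woken. */
static int
wake_flush_waiters (struct waiter **head)
{
  struct waiter *prev = NULL, *crt, *next;
  int woken = 0;

  for (crt = *head; crt != NULL; crt = next)
    {
      next = crt->next;         /* saved first: crt->next is cleared below */
      if (crt->mode == LATCH_FLUSH)
        {
          if (prev != NULL)
            prev->next = next;
          else
            *head = next;
          crt->next = NULL;
          crt->woken = 1;
          woken++;
        }
      else
        {
          prev = crt;           /* survivor becomes the new splice point */
        }
    }
  return woken;
}
```

On a list FLUSH, READ, FLUSH the call wakes two waiters and leaves the READ waiter as the sole list member, which is the behaviour the prev_waiter bookkeeping exists to guarantee.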
8.5 The Page Flush Daemon: candidate collection and flushing
pgbuf_flush_victim_candidates is the daemon body: size the scan, collect
dirty candidates from the LRUs, force the log, flush each survivor.
Adaptive scan width. It reads/resets lru_victim_req_cnt and
fix_req_cnt for lru_miss_rate, then boosts flush_ratio * num_buffers
by up to PGBUF_FLUSH_VICTIM_BOOST_MULT (=10) when misses are high — but
only when not in checkpoint (checkpoint already flushes, so boosting
would double-flush). The result caps at ~200 MB of pages.
```c
// pgbuf_flush_victim_candidates -- src/storage/page_buffer.c
if (pgbuf_Pool.is_checkpoint == false)
  {
    lru_dynamic_flush_adj = MAX (1.0f, 1 + (PGBUF_FLUSH_VICTIM_BOOST_MULT - 1) * lru_miss_rate);
    lru_dynamic_flush_adj = MIN (PGBUF_FLUSH_VICTIM_BOOST_MULT, lru_dynamic_flush_adj);
  }
else
  lru_dynamic_flush_adj = 1.0f;
check_count_lru = (int) (cfg_check_cnt * lru_dynamic_flush_adj);
check_count_lru = MIN (check_count_lru, (200 * 1024 * 1024) / DB_PAGESIZE);
```

Branches after collection. If victim_count == 0 there is nothing to flush:
the function sets *stop to true (so the daemon loop in 8.8 breaks) only when
scanning was actually attempted
(check_count_lru > 0 && lru_sum_flush_priority > 0), then jumps to end.
Otherwise it wakes the log flush daemon (WAL needs the log current:
log_wakeup_log_flush_daemon, or logpb_force_flush_pages if no daemon),
optionally qsorts the list by VPID under
PRM_ID_PB_SEQUENTIAL_VICTIM_FLUSH, and sets is_flushing_victims = true.
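
The adaptive scan-width arithmetic from the start of this section can be checked with concrete numbers. The function below is a standalone sketch: scan_width and its parameters are hypothetical names, BOOST_MULT mirrors PGBUF_FLUSH_VICTIM_BOOST_MULT, and the page size is passed in rather than read from DB_PAGESIZE.

```c
#define BOOST_MULT 10                        /* mirrors PGBUF_FLUSH_VICTIM_BOOST_MULT */
#define SCAN_CAP_BYTES (200 * 1024 * 1024)   /* the ~200 MB ceiling */

/* Pages to examine this round: the configured count, boosted up to 10x by
   the LRU miss rate (never during checkpoint), then capped at 200 MB worth
   of pages. */
static int
scan_width (int cfg_check_cnt, float lru_miss_rate, int is_checkpoint, int page_size)
{
  float adj = 1.0f;
  int count, cap;

  if (!is_checkpoint)
    {
      adj = 1.0f + (BOOST_MULT - 1) * lru_miss_rate;
      if (adj < 1.0f)
        adj = 1.0f;
      if (adj > BOOST_MULT)
        adj = BOOST_MULT;
    }
  count = (int) (cfg_check_cnt * adj);
  cap = SCAN_CAP_BYTES / page_size;
  return count < cap ? count : cap;
}
```

Assuming a 16 KiB page, the cap is 12800 pages. A configured count of 1000 with a miss rate of 0.5 yields 5500; the same count during checkpoint stays at 1000; a configured count of 5000 with a miss rate of 1.0 would reach 50000 and is clipped to 12800.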
Per-candidate loop. For each candidate it locks the BCB and applies four guards:
- VPID changed / not dirty / already flushing -> num_skipped_already_flushed, unlock, continue.
- left the LRU victim zone or got fixed/hot -> num_skipped_fixed_or_hot, unlock, continue.
- logpb_need_wal (page LSA beyond flushed log) -> record max lsa_need_wal, bump count_need_wal, wake log flush daemon, num_skipped_need_wal, unlock, continue.
- else flush: pgbuf_flush_page_and_neighbors_fb when PGBUF_NEIGHBOR_PAGES > 1, else pgbuf_bcb_flush_with_wal (..., true, &is_bcb_locked) (is_page_flush_thread=true; the loop unlocks the BCB if it stayed locked). On error -> goto end.

The repeat retry. At end, if every candidate was skipped purely
for WAL (count_need_wal == victim_count) and a thread still waits for a
direct victim, the daemon forces the log itself (logpb_flush_log_for_wal)
and jumps to repeat exactly once (the second pass asserts LSAs advanced),
then clears is_flushing_victims.
Neighbor batching: pgbuf_flush_page_and_neighbors_fb. When
PGBUF_NEIGHBOR_PAGES > 1, branch 4 calls this function, which grows a
contiguous-VPID window around the anchor so a run of physically adjacent
pages is written in one sequential sweep. The window state lives in a
static file-scope global, pgbuf_Flush_helper (type
pgbuf_batch_flush_helper) — not a per-call stack object; a single
shared instance is safe because the dedicated page-flush daemon is the only
caller, and each invocation zeroes the counters at entry.
```c
// pgbuf_batch_flush_helper -- src/storage/page_buffer.c
struct pgbuf_batch_flush_helper
{
  int npages;       /* <- pages currently staged in the window */
  int fwd_offset;   /* <- pages added forward (higher pageid) of anchor */
  int back_offset;  /* <- pages added backward (lower pageid) of anchor */
  PGBUF_BCB *pages_bufptr[2 * PGBUF_MAX_NEIGHBOR_PAGES - 1];  /* <- window BCBs */
  VPID vpids[2 * PGBUF_MAX_NEIGHBOR_PAGES - 1];               /* <- their VPIDs */
};
// static PGBUF_BATCH_FLUSH_HELPER pgbuf_Flush_helper; <- the single shared instance
```

| Field | Role | Why it exists |
|---|---|---|
| npages | pages staged in the window | end bound of the per-window flush loop; trimmed when a tail/head neighbor is clean |
| fwd_offset | forward reach (higher pageid) from anchor | next forward candidate is anchor + fwd_offset + 1 |
| back_offset | backward reach (lower pageid) from anchor | next backward candidate is anchor - back_offset - 1 |
| pages_bufptr[2*MAX-1] | BCB handles for every window member | the BCBs flushed; sized to reach PGBUF_MAX_NEIGHBOR_PAGES-1 (=31) each way around the anchor |
| vpids[2*MAX-1] | VPID snapshot per member, parallel to pages_bufptr[] | validate key: pgbuf_flush_neighbor_safe re-checks it so a member whose VPID changed before its write is skipped |

PGBUF_NEIGHBOR_POS (off) indexes the arrays relative to the anchor
(PGBUF_NEIGHBOR_PAGES - 1 + off). The window is not strictly
dirty-only: when PGBUF_NEIGHBOR_FLUSH_NONDIRTY is enabled the probe
deliberately admits interior clean pages to keep the on-disk run
contiguous, abandoning the batch only on two consecutive non-dirties
(NEIGHBOR_ABORT_TWO_CONSECTIVE_NONDIRTIES) or when non-dirties exceed
half the window past a small threshold
(NEIGHBOR_ABORT_TOO_MANY_NONDIRTIES). A clean page at the very tail or
head is then trimmed (decrement the offset and npages) so the run does
not end on a wasted write. Before the sweep the neighbor path enforces WAL
once for the whole window:
```c
// pgbuf_flush_page_and_neighbors_fb -- src/storage/page_buffer.c
/* WAL protocol: force log record to disk */
logpb_flush_log_for_wal (thread_p, &log_newest_oldest_unflush_lsa);
for (pos = PGBUF_NEIGHBOR_POS (-helper->back_offset); pos <= PGBUF_NEIGHBOR_POS (helper->fwd_offset); pos++)
  error = pgbuf_flush_neighbor_safe (thread_p, helper->pages_bufptr[pos], &helper->vpids[pos], &was_page_flushed);
```

pgbuf_flush_neighbor_safe re-routes each member through the single-page
path (re-validating its VPID), so per-page WAL still holds; the batch force
just guarantees the log is current before the contiguous write begins. A
single-page window (npages <= 1) skips the batch force and flushes the
lone page directly.
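
The anchor-relative indexing and the end trim can be tried on a toy window. In this sketch, struct window, neighbor_pos, and trim_clean_ends are hypothetical names, the BCB array is replaced by a plain dirty[] flag per slot, and the trim loops until each end is dirty, which is an assumption that may differ from how often the source trims.

```c
#define MAX_NEIGHBOR_PAGES 32   /* assumed: gives 31 slots each way of the anchor */

/* Anchor-relative offset -> array slot, as PGBUF_NEIGHBOR_POS is described. */
static int
neighbor_pos (int neighbor_pages, int off)
{
  return neighbor_pages - 1 + off;
}

struct window
{
  int npages;                   /* pages staged, anchor included */
  int fwd_offset;               /* reach above the anchor */
  int back_offset;              /* reach below the anchor */
  int dirty[2 * MAX_NEIGHBOR_PAGES - 1];        /* 1 = dirty, 0 = clean */
};

/* Drop clean pages from both ends so the run neither starts nor ends on a
   wasted write. The anchor itself (offset 0) is never trimmed. */
static void
trim_clean_ends (struct window *w, int neighbor_pages)
{
  while (w->fwd_offset > 0 && !w->dirty[neighbor_pos (neighbor_pages, w->fwd_offset)])
    {
      w->fwd_offset--;
      w->npages--;
    }
  while (w->back_offset > 0 && !w->dirty[neighbor_pos (neighbor_pages, -w->back_offset)])
    {
      w->back_offset--;
      w->npages--;
    }
}
```

With a neighbor setting of 4, offsets -3..3 map to slots 0..6 and the anchor sits in slot 3. A window reaching two pages forward and one back, whose outermost pages on both sides are clean, shrinks from four pages to two.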
8.6 pgbuf_get_victim_candidates_from_lru
Called from 8.5, it walks each LRU from the bottom through the victim
zone, budgeting by each list’s lru_victim_flush_priority_per_lru:
```c
// pgbuf_get_victim_candidates_from_lru -- src/storage/page_buffer.c
for (bufptr = pgbuf_Pool.buf_LRU_list[lru_idx].bottom;
     bufptr != NULL && PGBUF_IS_BCB_IN_LRU_VICTIM_ZONE (bufptr) && i > 0;
     bufptr = bufptr->prev_BCB, i--)
  {
    if (pgbuf_bcb_is_dirty (bufptr))
      {
        pgbuf_Pool.victim_cand_list[victim_cand_count].bufptr = bufptr;
        pgbuf_Pool.victim_cand_list[victim_cand_count].vpid = bufptr->vpid;
        victim_cand_count++;  /* <- dirty -> flush before victimization */
      }
#if defined (SERVER_MODE)
    else if (try_direct_assign && pgbuf_is_any_thread_waiting_for_direct_victim ()
             && pgbuf_is_bcb_victimizable (bufptr, false) && PGBUF_BCB_TRYLOCK (bufptr) == 0)
      {
        if (pgbuf_is_bcb_victimizable (bufptr, true) && pgbuf_assign_direct_victim (thread_p, bufptr))
          {
            try_direct_assign = false;
            *assigned_directly = true;  /* <- clean bcb handed to a waiter */
          }
        PGBUF_BCB_UNLOCK (bufptr);
      }
#endif
  }
```

Two outputs: dirty BCBs go to the candidate list (they need a flush
before victimization), while a single clean victimizable BCB may be
handed straight to a starving waiter (assigned_directly) under trylock so
the scan never blocks. Candidate VPIDs are snapshot so the flush loop in
8.5 can detect a reassigned BCB (guard 1 there). The whole walk runs under
the per-LRU mutex.
8.7 The seq-flusher and pgbuf_flush_seq_list pacing
Checkpoint flushing is rate-controlled by a PGBUF_SEQ_FLUSHER: unlike the
victim daemon, it spreads writes across one-second “super-intervals” so
checkpoint I/O does not starve the foreground.
struct pgbuf_seq_flusher — every field:
| Field | Role | Why it exists |
|---|---|---|
| flush_list | array of PGBUF_VICTIM_CANDIDATE_LIST (bufptr+vpid) | working set for this pass |
| flush_upto_lsa | newest oldest-LSA over all listed pages | WAL gate; pages beyond it are skipped |
| control_intervals_cnt | intervals elapsed this 1 s super-interval | feeds the flush_per_interval math |
| control_flushed | pages flushed so far this super-interval | lets a slow interval be compensated next |
| interval_msec | duration of one pacing interval | computed in pgbuf_flush_chkpt_seq_list as 1000 * PGBUF_CHKPT_BURST_PAGES / chkpt_flush_rate, where chkpt_flush_rate = 1000 / PRM_ID_LOG_CHECKPOINT_SLEEP_MSECS; it is not derived from the struct flush_rate field |
| flush_max_size | capacity of flush_list, set at init | batch-size bound; checkpoint refills when full |
| flush_cnt | live element count | end bound of the flush loop |
| flush_idx | index of next element to flush | resumes across interval boundaries |
| flushed_pages | pages flushed this call (return param) | accumulated by the caller |
| flush_rate | max pages/sec (negative = unlimited) | target the pacing math converges to; set to chkpt_flush_rate each interval |
| burst_mode | flush a chunk ASAP vs one page then sleep | burst keeps data I/O sequential |

pgbuf_initialize_seq_flusher zeroes the struct, sets flush_max_size,
allocates flush_list, and defaults burst_mode = true.
pgbuf_flush_seq_list derives flush_per_interval from the control
counters: with control_intervals_cnt > 0 it targets
flush_rate * (control_intervals_cnt+1) / control_total_cnt_intervals
minus what was already flushed (compensation), floored at
PGBUF_CHKPT_MIN_FLUSH_RATE (=50) scaled by the interval. The loop runs
while flush_idx < flush_cnt && flushed_pages < flush_per_interval:
```c
// pgbuf_flush_seq_list -- src/storage/page_buffer.c
PGBUF_BCB_LOCK (bufptr);
locked_bcb = true;
if (!VPID_EQ (&bufptr->vpid, &f_list[seq_flusher->flush_idx].vpid)
    || !pgbuf_bcb_is_dirty (bufptr)
    || (flush_if_already_flushed == false && !LSA_ISNULL (&bufptr->oldest_unflush_lsa)
        && LSA_GT (&bufptr->oldest_unflush_lsa, &seq_flusher->flush_upto_lsa)))
  {
    PGBUF_BCB_UNLOCK (bufptr);
    dropped_pages++;
    continue;
  }  /* <- stale / beyond chkpt horizon */
if (pgbuf_bcb_safe_flush_force_lock (thread_p, bufptr, true) == NO_ERROR)
  {
    /* ... done_flush = true */
  }
```

The flush_if_already_flushed heuristic re-flushes an already-flushed page
only if its VPID is contiguous with the next list entry — preferring write
sequentiality over avoiding a redundant write. After each page, non-burst
mode sleeps time_remaining / pages_remaining ms (skipped when below
1000 / PGBUF_CHKPT_MAX_FLUSH_RATE) to spread the writes; burst mode only
checks the absolute limit_time and breaks (*time_rem = -1) when
exceeded. flush_upto_lsa is the WAL gate: only pages whose
oldest-unflush LSA is at or below it flush in this checkpoint.
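
The per-interval quota with compensation can be approximated as follows. This is a loose sketch of the arithmetic described above, not the source formula: flush_quota and its parameters are hypothetical, the number of intervals per super-interval is assumed to be 1000 / interval_msec, and the floor is assumed to be the minimum rate prorated to one interval.

```c
#define CHKPT_MIN_FLUSH_RATE 50 /* mirrors PGBUF_CHKPT_MIN_FLUSH_RATE, pages/sec */

/* Pages allowed in the current interval: the cumulative target through the
   end of this interval, minus what the super-interval already flushed, but
   never below the prorated minimum rate. interval_msec must be positive. */
static int
flush_quota (int flush_rate, int interval_msec, int intervals_done, int flushed_so_far)
{
  int total_intervals = 1000 / interval_msec;   /* assumption: 1 s super-interval */
  int min_quota = CHKPT_MIN_FLUSH_RATE * interval_msec / 1000;
  int quota;

  if (total_intervals < 1)
    total_intervals = 1;
  quota = flush_rate * (intervals_done + 1) / total_intervals - flushed_so_far;
  return quota > min_quota ? quota : min_quota;
}
```

At 1000 pages/sec with 100 ms intervals, a flusher that is three intervals in and has written 250 pages is 150 pages behind the four-interval target of 400, so it may write 150 now. Had it already written 400, the computed quota would be zero and the floor of 5 pages applies instead.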
8.8 The three daemons and checkpoint flush
Three SERVER_MODE daemons register via REGISTER_DAEMON:
| Daemon | Task type | Looper | Body |
|---|---|---|---|
| pgbuf_Page_flush_daemon | dedicated pgbuf_page_flush_daemon_task (subclass of cubthread::entry_task) | pgbuf_get_page_flush_interval (timed if PRM_ID_PAGE_BG_FLUSH_INTERVAL_MSECS > 0, else infinite wait) | loop pgbuf_flush_victim_candidates |
| pgbuf_Page_post_flush_daemon | entry_callable_task (pgbuf_page_post_flush_execute) | 3-tier looper {1,10,100} ms | pgbuf_assign_flushed_pages: drain flushed_bcbs, assign direct victims |
| pgbuf_Page_maintenance_daemon | entry_callable_task (pgbuf_page_maintenance_execute) | fixed 100 ms | pgbuf_adjust_quotas + pgbuf_direct_victims_maintenance |

The flush daemon runs at least once if explicitly woken (was_woken_up),
then loops while pgbuf_keep_victim_flush_thread_running or until
pgbuf_flush_victim_candidates sets stop_iteration:
```cpp
// pgbuf_page_flush_daemon_task::execute -- src/storage/page_buffer.c
bool force_one_run = pgbuf_Page_flush_daemon->was_woken_up ();
while (force_one_run || pgbuf_keep_victim_flush_thread_running ())
  {
    pgbuf_flush_victim_candidates (&thread_ref, prm_get_float_value (PRM_ID_PB_BUFFER_FLUSH_RATIO),
                                   &m_perf_track, &stop_iteration);
    force_one_run = false;
    if (stop_iteration)
      break;
  }
```

It is the only class-based dedicated task; post-flush and maintenance are
callable functions in entry_callable_task. Foreground threads nudge it
via pgbuf_wakeup_page_flush_daemon when no victim is found.
pgbuf_flush_control_from_dirty_ratio adds a separate adaptive signal — a
rate bump that grows quadratically as dirties_cnt exceeds
num_buffers/2, plus the dirty growth rate — to flush harder before the
pool saturates.
Checkpoint flush. pgbuf_flush_checkpoint forces the log to
flush_upto_lsa, sets is_checkpoint=true, then scans all BCBs. Each
dirty non-temporary page with oldest_unflush_lsa <= flush_upto_lsa is
appended to the shared seq_chkpt_flusher.flush_list; when full
(>= flush_max_size) it is qsorted by VPID and drained through
pgbuf_flush_chkpt_seq_list (which calls the paced pgbuf_flush_seq_list),
then refilled. A page older than prev_chkpt_redo_lsa asserts
(ER_LOG_CHECKPOINT_SKIP_INVALID_PAGE) — it should have flushed in the
previous checkpoint. The smallest unflushed LSA among skipped pages returns
in smallest_lsa to advance the redo horizon. The flush_all family
(pgbuf_flush_all, _all_unfixed, _all_unfixed_and_set_lsa_as_null) is
an unpaced sweep over all BCBs via pgbuf_flush_all_helper, used only by
the log/recovery manager.
8.9 Chapter summary — key takeaways
- All single-page flushes funnel through pgbuf_bcb_safe_flush_internal, whose CAS loop on the atomic latch picks immediate flush, async-request-on-unfix, or block-on-PGBUF_LATCH_FLUSH.
- pgbuf_bcb_flush_with_wal enforces the WAL invariant by saving the page LSA, NULLing oldest_unflush_lsa under the mutex, dropping it, and calling logpb_flush_log_for_wal (.., &lsa) before any fileio_write/dwb_add_page; skipping the force would lose redo for a persisted page.
- DIRTY clears at flush start (pgbuf_bcb_mark_is_flushing) so a concurrent re-dirty is never lost; a failed write restores DIRTY and the saved oldest_unflush_lsa via pgbuf_bcb_mark_was_not_flushed.
- At most one flusher per BCB: FLUSHING_TO_DISK plus the blocking FLUSH-latch path serialize writers, preventing an old image from overwriting a newer one.
- The Page Flush Daemon (pgbuf_flush_victim_candidates + pgbuf_get_victim_candidates_from_lru) collects dirty bottom-of-LRU candidates, skips fixed/hot/need-WAL pages, may batch neighbors through the shared-global pgbuf_Flush_helper window (which forces WAL once for the whole run and can include interior clean pages for sequentiality), and retries once when all candidates were WAL-blocked.
- Checkpoint uses a rate-controlled PGBUF_SEQ_FLUSHER (pgbuf_flush_seq_list) with burst/spread pacing and a flush_upto_lsa WAL gate; flush_all* is an unpaced sweep.
- Three daemons exist: one dedicated class (page flush) plus two callable tasks (post-flush, maintenance). The DWB, when created, stages every non-temp write into a block before the data files (see cubrid-double-write-buffer.md).
Chapter 9: Victim Selection, the LFCQs, and Direct Victim Hand-off
This chapter answers: when the invalid (free) list is empty, how does a thread find an evictable BCB, and when none is found, how is a freed BCB handed straight to a sleeping waiter? The high-level companion sketched “LFCQ — victim selection” and “Direct victim hand-off” at altitude; here we trace every branch.
The two paths are duals. The pull path (pgbuf_get_victim) scans the layered lock-free queues (LFCQs) for a clean BCB to claim. The push path (pgbuf_assign_direct_victim / pgbuf_get_direct_victim) is the inverse: a producer that already cleaned a BCB wakes a waiter, writes the BCB into the waiter’s mailbox slot, and skips the LRU. A thread that fails the pull path becomes a waiter (suspend/wake is Ch. 5).
9.1 The two structs: mailbox and candidate slot
pgbuf_direct_victim is the global mailbox-and-queues record (pgbuf_Pool.direct_victims), SERVER_MODE-only. pgbuf_victim_candidate_list is the scratch array the flush daemon (Ch. 8) fills; in scope only because the spec names it.
```cpp
// pgbuf_direct_victim -- src/storage/page_buffer.c
struct pgbuf_direct_victim
{
  PGBUF_BCB **bcb_victims;  /* per-thread mailbox: bcb_victims[tid] = BCB handed to thread tid */
  lockfree::circular_queue<THREAD_ENTRY *> *waiter_threads_high_priority;
  lockfree::circular_queue<THREAD_ENTRY *> *waiter_threads_low_priority;
};

// pgbuf_victim_candidate_list -- src/storage/page_buffer.c
struct pgbuf_victim_candidate_list
{
  PGBUF_BCB *bufptr;  /* selected BCB as victim candidate */
  VPID vpid;          /* page id of the page managed by the BCB */
};
```

| Struct.Field | Role | Why it exists |
|---|---|---|
| direct_victim.bcb_victims | Array of num_total_threads BCB ptrs by thread_p->index; slot [tid] = tid’s BCB or NULL. | The mailbox. Producer writes a slot under the waiter’s entry lock, waiter reads+NULLs its own on wake: one slot/thread, no contention. |
| direct_victim.waiter_threads_high_priority | LFCQ of threads blocking the system on a victim. | Drained first: latency-critical fixers jump the queue. |
| direct_victim.waiter_threads_low_priority | LFCQ of threads that tolerate waiting. | Drained 1-in-4 ahead of high: the 75/25 weighting (§9.5). |
| victim_candidate_list.bufptr | BCB the flush pass selected to clean. | Lets the flusher re-lock+flush in a second pass without re-scanning. |
| victim_candidate_list.vpid | Snapshot of bufptr->vpid at selection. | Detects reassignment before flush; a stale bufptr whose vpid no longer matches is skipped. |

9.2 pgbuf_get_victim — the staged LFCQ scan
The queues hold list indices, never BCBs; an index sits in a queue iff its list has count_vict_cand > 0 and PGBUF_LRU_VICTIM_LFCQ_FLAG set. The function walks four stages, returning the first locked BCB it claims.
Stage 1 — own private, only with a private LRU that is over quota:
```c
// pgbuf_get_victim -- src/storage/page_buffer.c
if (PGBUF_THREAD_HAS_PRIVATE_LRU (thread_p))
  {
    private_lru_idx = PGBUF_LRU_INDEX_FROM_PRIVATE (PGBUF_PRIVATE_LRU_FROM_THREAD (thread_p));
    lru_list = PGBUF_GET_LRU_LIST (private_lru_idx);
    if (PGBUF_LRU_LIST_IS_ONE_TWO_OVER_QUOTA (lru_list)  /* zone1+2 exceeds quota */
        || (PGBUF_LRU_LIST_IS_OVER_QUOTA (lru_list) && lru_list->count_vict_cand > 0))
      {
        victim = pgbuf_get_victim_from_lru_list (thread_p, private_lru_idx);
        if (victim != NULL) { return victim; }  /* <- happy path */
        if (!PGBUF_VACUUM_SHOULD_IGNORE_UNFIX (thread_p))
          restrict_other = PGBUF_LRU_LIST_IS_OVER_QUOTA_WITH_BUFFER (lru_list);  /* gate stage 2 */
        searched_own = true;
      }
  }
```

restrict_other is set only for a non-vacuum thread comfortably over quota (quota + MAX(10, quota*0.01) buffer); it confines stage 2 to big-private lists. searched_own stops stage 4 repeating stage 1.
Stage 2 — other private, entered only when PGBUF_PAGE_QUOTA_IS_ENABLED && has_flush_thread; it calls pgbuf_lfcq_get_victim_from_private_lru (thread_p, restrict_other) (§9.4) and returns on the first claim.
Stage 3 — shared, in a guarded loop — the only looping stage, and only without a flush daemon to refill candidates:
```c
do
  {
    victim = pgbuf_lfcq_get_victim_from_shared_lru (thread_p, has_flush_thread);
    if (victim != NULL) { return victim; }  /* <- happy path */
    current_consume_cursor = pgbuf_Pool.shared_lrus_with_victims->get_consumer_cursor ();
  }
while (!has_flush_thread && !pgbuf_Pool.shared_lrus_with_victims->is_empty ()
       && ((int) (current_consume_cursor - initial_consume_cursor) <= pgbuf_Pool.num_LRU_list)
       && (++nloops <= pgbuf_Pool.num_LRU_list));
```

The four while conditions each stop the spin: a flush daemon present, the queue empty, more indices consumed than there are shared lists, or nloops exceeding num_LRU_list (a paranoia guard). With a flush daemon the body runs exactly once.
Stage 4 — last-resort own private, ignoring quota. Only if stages 1-3 failed and stage 1 never ran (PGBUF_THREAD_HAS_PRIVATE_LRU && !searched_own), it re-calls pgbuf_get_victim_from_lru_list on the own list and returns the result; otherwise it falls through to return victim (NULL). This guards against the deadlock documented in the source: all private lists just below quota, shared lists with no zone-3, nothing victimizable or flushable. A NULL return tells the caller to enqueue as a waiter and sleep (Ch. 5). Figure 9-1 traces all four stages.
Figure 9-1 — pgbuf_get_victim staged scan.
```mermaid
flowchart TD
    B{"own private over quota?"} -- yes --> C["victim_from_lru own"]
    C -->|found| R["return victim"]
    C -->|fail| D["restrict_other, searched_own"]
    B -- no --> E
    D --> E{"quota+flush thread?"}
    E -- yes --> F["lfcq private: big then ordinary"]
    F -->|found| R
    E -- no --> G
    F -->|fail| G["loop: lfcq shared"]
    G -->|found| R
    G -->|exhausted| H{"not searched_own?"}
    H -- yes --> I["victim_from_lru own, no quota"]
    I -->|found| R
    H -- no --> J["NULL: wait"]
    I -->|fail| J
```
9.3 pgbuf_get_victim_from_lru_list — bottom-up scan, four exclusions
Where a BCB is actually claimed: scan upward from victim_hint (or from the list bottom when the hint is NULL) through zone 3, apply the exclusion mask, and on success remove the BCB and return it locked. Three early NULL returns precede the scan, then the hint is resynced:
```c
// pgbuf_get_victim_from_lru_list -- src/storage/page_buffer.c
if (lru_list->count_vict_cand == 0) { return NULL; }  /* <- 1: no candidates, no mutex */
pthread_mutex_lock (&lru_list->mutex);
if (lru_list->bottom == NULL || !PGBUF_IS_BCB_IN_LRU_VICTIM_ZONE (lru_list->bottom))
  {
    pthread_mutex_unlock (&lru_list->mutex);
    return NULL;
  }  /* <- 2: no zone-3 */
if (PGBUF_IS_PRIVATE_LRU_ONE_TWO_OVER_QUOTA (lru_idx))
  pgbuf_lru_adjust_zones (thread_p, lru_list, false);  /* shrink zone1 so zone3 grows */
lru_victim_cnt = lru_list->count_vict_cand;
if (lru_victim_cnt <= 0)
  {
    pthread_mutex_unlock (&lru_list->mutex);
    return NULL;
  }  /* <- 3: race emptied it */
if (!pgbuf_bcb_is_dirty (lru_list->bottom) && lru_list->victim_hint != lru_list->bottom)
  (void) ATOMIC_TAS_ADDR (&lru_list->victim_hint,  /* resync drifted hint */
                          PGBUF_IS_BCB_IN_LRU_VICTIM_ZONE (lru_list->bottom) ? lru_list->bottom : (PGBUF_BCB *) NULL);
bufptr_start = (lru_list->victim_hint == NULL) ? lru_list->bottom : lru_list->victim_hint;
```

The scan loop. Walk prev_BCB upward from bufptr_start, in zone 3, capped at MAX_DEPTH (1000). Per BCB:
- Excl. 1 — avoid-victim flag. pgbuf_bcb_avoid_victim tests PGBUF_BCB_INVALID_VICTIM_CANDIDATE_MASK = DIRTY | FLUSHING_TO_DISK | VICTIM_DIRECT | INVALIDATE_DIRECT_VICTIM (the four exclusions: dirty, mid-flush, already-assigned, invalidation-pending). Any bit → continue.
- Excl. 2 — fixed/has waiters. pgbuf_is_bcb_fixed_by_any (bufptr, false): if fcnt > 0, next_wait_thrd != NULL, or latch held, it is valid-but-busy. Record as bufptr_victimizable (first becomes the hint via CAS), count it, continue; break when found_victim_cnt reaches lru_victim_cnt.
- Claim — PGBUF_BCB_TRYLOCK (conditional: we hold the list mutex and must not block on the BCB mutex; lock-order rule, Ch. 5):
  - Trylock ok + pgbuf_is_bcb_victimizable (bufptr, true): the win. Advance hint to bufptr->prev_BCB, pgbuf_remove_from_lru_list, then panic-assign via pgbuf_panic_assign_direct_victims_from_lru iff waiter_threads_low_priority->size() >= 5 + num_total_threads/20 (the low-priority backlog drain), wake the flush daemon if the new bottom is dirty, unlock, push VPID to Aout (Ch. 7), return locked BCB.
  - Trylock ok but not victimizable (flag flipped under us): PGBUF_BCB_UNLOCK, next iteration.
  - Trylock fails: the BCB mutex is held elsewhere, which is only possible with a flush daemon, so the code asserts pgbuf_is_page_flush_daemon_available(). Record + hint, count it, honor the early-out.

TO_VACUUM note: PGBUF_BCB_TO_VACUUM_FLAG is not in the mask, so a to-vacuum BCB is still victimizable here; its forcing to the LRU bottom happens at unfix/direct-assign time, not in this scan.
Failure tail. No claim, and a stale hint with no candidate found (bufptr_victimizable == NULL && victim_hint != NULL) → reset hint to bottom (if candidates remain) or NULL via CAS, unlock, wake flush daemon, return NULL.
Invariant — victim_hint marks the lowest point worth scanning; count_vict_cand counts clean zone-3 BCBs. The scan walks only upward from the hint and trusts count_vict_cand (kept by the LRU bookkeeping helpers as BCBs enter/leave zone 3 clean) as the early-out ceiling. The hint stays honest via the CAS-advance on every claim/record and the resync above. The documented TPCC drift (hint sits before the first victimizable BCB) only wastes scan steps: each candidate is re-validated under its own BCB lock before being claimed, so the hint is a performance hint, never a safety property; drift is tolerated, not fixed.
9.4 Quota gating in the private LFCQ helper
pgbuf_lfcq_get_victim_from_private_lru picks which private list to scan and whether to re-enqueue it:
```c
// pgbuf_lfcq_get_victim_from_private_lru -- src/storage/page_buffer.c
if (pgbuf_Pool.big_private_lrus_with_victims->consume (lru_idx)) { /* big first */ }
else
  {
    if (restricted) { return NULL; }  /* <- restricted: big only */
    if (!pgbuf_Pool.private_lrus_with_victims->consume (lru_idx)) { return NULL; }  /* <- both empty */
  }
lru_list = PGBUF_GET_LRU_LIST (lru_idx);
if (PGBUF_LRU_LIST_COUNT (lru_list) > PBGUF_BIG_PRIVATE_MIN_SIZE  /* big: >100 ... */
    && PGBUF_LRU_LIST_COUNT (lru_list) > 2 * lru_list->quota && lru_list->count_vict_cand > 1)
  {
    if (pgbuf_Pool.big_private_lrus_with_victims->produce (lru_idx))
      added_back = true;
  }  /* re-queue BIG before scan */
victim = pgbuf_get_victim_from_lru_list (thread_p, lru_idx);
if (added_back)
  return victim;
if (lru_list->count_vict_cand > 0 && PGBUF_LRU_LIST_IS_OVER_QUOTA (lru_list))
  {
    if (pgbuf_Pool.private_lrus_with_victims->produce (lru_idx))
      return victim;
  }
lru_list->flags &= ~PGBUF_LRU_VICTIM_LFCQ_FLAG;  /* not re-queued: clear so next candidate re-adds it */
return victim;
```

Invariant — a private list is victimizable only while over quota. “Big-private” = count > 100 and count > 2*quota and >1 candidate; re-queued before scanning so peers drain it in parallel. A non-big list is re-queued only while still over quota with candidates; otherwise its LFCQ flag is cleared and it leaves rotation until pgbuf_adjust_quotas (§9.8) or a new candidate re-adds it. A list at/below quota is never poached. The shared sibling pgbuf_lfcq_get_victim_from_shared_lru has no quota, so it simply re-enqueues while count_vict_cand > 0 and (single-threaded) retries the same list once.
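
The quota tests that gate stage 1, stage 2, and the re-enqueue decision reduce to a few integer comparisons. The predicates below are a sketch under stated assumptions: struct lru_state and the function names are hypothetical, "over quota" is taken to mean count > quota, and quota*0.01 is approximated with integer division.

```c
#define BIG_PRIVATE_MIN_SIZE 100        /* the ">100" threshold quoted above */

/* Hypothetical snapshot of the three list fields the tests read. */
struct lru_state
{
  int count;                    /* BCBs currently in the list */
  int quota;                    /* the list's quota */
  int count_vict_cand;          /* clean zone-3 candidates */
};

static int
over_quota (const struct lru_state *l)
{
  return l->count > l->quota;   /* assumption: strict comparison */
}

/* "Comfortably" over quota: beyond quota + MAX (10, quota * 0.01). */
static int
over_quota_with_buffer (const struct lru_state *l)
{
  int buffer = l->quota / 100;

  if (buffer < 10)
    buffer = 10;
  return l->count > l->quota + buffer;
}

/* Big-private: large in absolute terms, more than twice its quota, and
   holding more than one candidate. */
static int
is_big_private (const struct lru_state *l)
{
  return l->count > BIG_PRIVATE_MIN_SIZE && l->count > 2 * l->quota && l->count_vict_cand > 1;
}

/* A non-big list stays in the private LFCQ only while both conditions hold. */
static int
stays_in_rotation (const struct lru_state *l)
{
  return l->count_vict_cand > 0 && over_quota (l);
}
```

A list of 250 BCBs against a quota of 100 with five candidates is big-private and comfortably over quota. A list of 105 against the same quota is over quota and stays in rotation, but is neither big nor past the buffer, so its owner would not restrict stage 2.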
9.5 pgbuf_assign_direct_victim — producer side
When a BCB becomes clean+free (end of flush, panic-assign in §9.3, or last-unfix), its owner may hand it to a waiter. The producer holds the BCB mutex; the only invalidating flag tolerated is FLUSHING_TO_DISK (flush itself calls this):
```c
// pgbuf_assign_direct_victim -- src/storage/page_buffer.c
while (pgbuf_get_thread_waiting_for_direct_victim (waiter_thread))  /* 75/25: low 1-in-4, else high */
  {
    thread_lock_entry (waiter_thread);
    if (waiter_thread->resume_status != THREAD_ALLOC_BCB_SUSPENDED)
      {
        thread_unlock_entry (waiter_thread);
        continue;
      }  /* <- waiter gone, try next */
    thread_wakeup_already_had_mutex (waiter_thread, THREAD_ALLOC_BCB_RESUMED);
    pgbuf_bcb_update_flags (thread_p, bcb, PGBUF_BCB_VICTIM_DIRECT_FLAG, PGBUF_BCB_FLUSHING_TO_DISK_FLAG);
    pgbuf_Pool.direct_victims.bcb_victims[waiter_thread->index] = bcb;  /* <- write mailbox before unlock */
    thread_unlock_entry (waiter_thread);
    return true;
  }  /* <- assigned */
return false;  /* <- no waiters */
```

pgbuf_get_thread_waiting_for_direct_victim holds the 75/25 weighting (low queue 1-in-4, else high), skipping dead queue entries. The while skips any waiter no longer THREAD_ALLOC_BCB_SUSPENDED; the BCB pointer is written before the entry lock releases, so the waiter never wakes to an empty slot. Empty queues → false, and the caller disposes of the BCB normally.
Invariant — a handed-off victim is exclusively owned, so no other thread can claim it. The producer enters with the BCB mutex held (PGBUF_BCB_CHECK_OWN) and stamps PGBUF_BCB_VICTIM_DIRECT_FLAG while still holding it. That flag is one of the four bits in PGBUF_BCB_INVALID_VICTIM_CANDIDATE_MASK, so from that instant pgbuf_bcb_avoid_victim returns true and the §9.3 scan, the §9.4 helpers, and pgbuf_assign_flushed_pages all skip the BCB. The only writer of bcb_victims[tid] is the producer; the only reader is thread tid via the ATOMIC_TAS_ADDR in §9.6: one slot per thread, single producer, single consumer. Even pgbuf_invalidate_bcb defers (§9.7). Thus between assignment and collection the BCB is logically owned by exactly one waiter; a concurrent re-fixer cannot steal it, only mark INVALIDATE_DIRECT_VICTIM to release it back (§9.6).
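
One way to realize a "low 1-in-4, else high" pick is a rotating counter. The sketch below is an assumption about the mechanism, not a transcription of pgbuf_get_thread_waiting_for_direct_victim: struct two_queues, pick_waiter, the counter, and the fall-back to the other queue when the preferred one is empty are all illustrative, and plain counts stand in for the lock-free queues.

```c
/* Hypothetical model: queue contents are reduced to element counts. */
struct two_queues
{
  int high;                     /* waiters in the high-priority queue */
  int low;                      /* waiters in the low-priority queue */
  unsigned turn;                /* rotating pick counter */
};

/* Every fourth pick prefers the low queue; all others prefer high. If the
   preferred queue is empty the other one is tried (assumed behaviour).
   Returns 'H' or 'L' for the queue served, or 0 when both are empty. */
static char
pick_waiter (struct two_queues *q)
{
  int prefer_low = (q->turn++ % 4 == 0);

  if (prefer_low && q->low > 0)
    {
      q->low--;
      return 'L';
    }
  if (q->high > 0)
    {
      q->high--;
      return 'H';
    }
  if (q->low > 0)
    {
      q->low--;
      return 'L';
    }
  return 0;
}
```

Starting from six high and two low waiters, the first five picks come out L, H, H, H, L, so low-priority waiters are served at a quarter of the rate without ever being starved.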
9.6 pgbuf_get_direct_victim — consumer side and the invalidation retry
The slot read is a TAS that clears the slot atomically:
```c
// pgbuf_get_direct_victim -- src/storage/page_buffer.c
PGBUF_BCB *bcb =
  (PGBUF_BCB *) ATOMIC_TAS_ADDR (&pgbuf_Pool.direct_victims.bcb_victims[thread_p->index], NULL);
PGBUF_BCB_LOCK (bcb);
if (pgbuf_bcb_is_invalid_direct_victim (bcb))  /* <- re-fix race */
  {
    pgbuf_bcb_update_flags (thread_p, bcb, 0, PGBUF_BCB_INVALIDATE_DIRECT_VICTIM_FLAG);  /* clear it */
    PGBUF_BCB_UNLOCK (bcb);
    return NULL;
  }  /* <- caller re-sleeps */
pgbuf_bcb_update_flags (thread_p, bcb, 0, PGBUF_BCB_VICTIM_DIRECT_FLAG);  /* clear VICTIM_DIRECT */
if (!pgbuf_is_bcb_victimizable (bcb, true))
  {
    assert (false);
    PGBUF_BCB_UNLOCK (bcb);
    return NULL;
  }
switch (pgbuf_bcb_get_zone (bcb))
  {
  case PGBUF_VOID_ZONE:
    break;  /* already detached (from flush) */
  case PGBUF_INVALID_ZONE:
    assert (false);
    break;  /* impossible */
  default:  /* still in an LRU: detach + Aout */
    lru_idx = pgbuf_bcb_get_lru_index (bcb);
    pgbuf_lru_remove_bcb (thread_p, bcb);
    pgbuf_add_vpid_to_aout_list (thread_p, &bcb->vpid, lru_idx);
    break;
  }
return bcb;  /* locked, in VOID zone */
```

The invalidation retry. Between assignment and collection a re-fixer may find the BCB on a hash hit (Ch. 3). It cannot steal a VICTIM_DIRECT BCB, so it sets INVALIDATE_DIRECT_VICTIM; the waiter observes it, clears the flag (releasing ownership), unlocks, and returns NULL, which the caller treats like a failed pgbuf_get_victim (re-enqueue and sleep). The BCB is left in place: this is the “puts it back and re-sleeps” path. Otherwise the zone switch detaches an in-LRU BCB (Aout-recorded) or no-ops a VOID one; post-condition: locked, VOID_ZONE, ready for reuse.
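
The one-slot-per-thread mailbox can be illustrated with C11 atomics, where atomic_exchange plays the role of the test-and-set read. Everything here is a simplified stand-in: struct bcb, the flag constants, mailbox, post_victim, and collect_victim are hypothetical, the BCB and thread-entry mutexes are omitted, and returning NULL for an empty slot is an assumption of the sketch.

```c
#include <stdatomic.h>
#include <stddef.h>

#define FLAG_VICTIM_DIRECT            0x1
#define FLAG_INVALIDATE_DIRECT_VICTIM 0x2
#define MAX_THREADS 8

struct bcb { int flags; };      /* hypothetical, flags only */

/* One slot per thread index; static storage starts out all NULL. */
static _Atomic (struct bcb *) mailbox[MAX_THREADS];

/* Producer side: stamp the BCB as handed off, then publish it. */
static void
post_victim (int tid, struct bcb *b)
{
  b->flags |= FLAG_VICTIM_DIRECT;
  atomic_store (&mailbox[tid], b);
}

/* Consumer side: read and clear the slot in one atomic step. A BCB marked
   invalid in the meantime is released (flag cleared) and NULL is returned,
   so the caller goes back to waiting. */
static struct bcb *
collect_victim (int tid)
{
  struct bcb *b = atomic_exchange (&mailbox[tid], (struct bcb *) NULL);

  if (b == NULL)
    return NULL;
  if (b->flags & FLAG_INVALIDATE_DIRECT_VICTIM)
    {
      b->flags &= ~FLAG_INVALIDATE_DIRECT_VICTIM;
      return NULL;
    }
  b->flags &= ~FLAG_VICTIM_DIRECT;
  return b;
}
```

Because the exchange empties the slot as it reads it, a second collect on the same index sees NULL, and a BCB flagged for invalidation between post and collect is never returned to the waiter.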
Figure 9-2 — direct hand-off.
```mermaid
stateDiagram-v2
    [*] --> Clean: bcb flushed or unfixed clean
    Clean --> Assigning: assign direct victim, hold bcb mutex
    Assigning --> NoWaiter: queues drained, return false
    Assigning --> Assigned: live waiter, set VICTIM_DIRECT, write mailbox, wake
    Assigned --> Collected: waiter TAS reads slot, VICTIM_DIRECT seen
    Collected --> Detached: clear VICTIM_DIRECT, remove from LRU, push Aout
    Assigned --> Invalidated: re-fixer sets INVALIDATE_DIRECT_VICTIM
    Invalidated --> ReSleep: waiter clears flag, returns NULL, re-enqueues
    NoWaiter --> [*]
    Detached --> [*]
    ReSleep --> [*]
```
9.7 pgbuf_invalidate_bcb and the already-assigned victim
pgbuf_invalidate_bcb tears a BCB out of the page table when its page is gone (dealloc, volume removal). It is in scope here for exactly one branch: an already-assigned direct victim is left alone. If pgbuf_bcb_is_direct_victim (bufptr) is true, the function unlocks and returns NO_ERROR, since the waiting thread will victimize it momentarily and racing to invalidate it would corrupt the hand-off. (The remaining branches, namely the LATCH_INVALID early no-op, the clear-dirty plus zone removal, and the NO_LATCH hash-chain delete onto the invalid list versus the unexpected assert(false) tail, are the ordinary tear-down path and belong to the BCB-lifecycle chapters.)
9.8 pgbuf_adjust_quotas as the supplier
pgbuf_adjust_quotas (full logic in Ch. 10) keeps everything above viable: it recomputes each private list’s quota and zone thresholds, and re-adds to the LFCQs any over-quota list with candidates that fell out of rotation. The quota/threshold values read by §9.2’s stage gates and §9.4’s re-enqueue test all originate here.
9.9 Chapter summary — key takeaways
- pgbuf_get_victim is a four-stage priority scan: own private (over quota) → other private (big first, then ordinary unless restricted) → shared (loops only without a flush daemon) → own private ignoring quota as a deadlock-avoidance last resort. NULL means “sleep and wait for a direct victim.”
- The LFCQs hold list indices, not BCBs. A list is enqueued iff it has candidates and PGBUF_LRU_VICTIM_LFCQ_FLAG is set; consumers re-enqueue while over quota with candidates, else clear the flag so it re-enters lazily.
- Quota gating protects working sets. A private list is victimizable only while over quota; big-private (>100, >2*quota, >1 cand) lists are re-queued before scanning so they drain in parallel.
- pgbuf_get_victim_from_lru_list re-validates every candidate under its own BCB lock, applying the four exclusions in PGBUF_BCB_INVALID_VICTIM_CANDIDATE_MASK. TO_VACUUM is deliberately not an exclusion here.
- victim_hint is a performance hint, not a safety property. Its documented drift only wastes scan steps; correctness comes entirely from per-BCB re-validation under lock.
- Direct hand-off is a mailbox protocol. The producer picks a waiter (High/Low at 75/25), stamps VICTIM_DIRECT, writes bcb_victims[tid] under the waiter’s entry lock, and wakes it.
- The consumer handles the re-fix race: observing INVALIDATE_DIRECT_VICTIM it clears the flag, leaves the BCB in place, returns NULL so the caller re-sleeps; pgbuf_invalidate_bcb likewise leaves an already-assigned direct victim untouched.
Chapter 10: Adaptive Quotas Ordered Fix and Special Paths
Three families sit outside the single-page lifecycle of Chapters 3-9: the
adaptive quota daemon (100 ms) re-sizing private LRU lists, ordered
fix (multi-page deadlock avoidance over pgbuf_fix), and special fix
paths that each bypass part of the normal path. For private/shared lists,
victim zones, and LFCQ queues see the companion cubrid-page-buffer-manager.md
and Chapter 9 — not re-derived here.
10.1 The two structs: pgbuf_page_quota and pgbuf_watcher
pgbuf_page_quota (one instance, pgbuf_Pool.quota) holds the global state
pgbuf_adjust_quotas reads/writes. Per-list outputs (quota,
threshold_lru1/2) live on each PGBUF_LRU_LIST (Chapter 1), not here.
| Field | Role | Why it exists |
|---|---|---|
num_private_LRU_list | private-list count; PGBUF_PAGE_QUOTA_IS_ENABLED is > 0 | master enable switch; 0 makes the subsystem inert |
lru_victim_flush_priority_per_lru | per-list float, flush priority | flush daemon (Ch 8) biases flushing toward over-quota lists |
private_lru_session_cnt | active sessions per private list | pgbuf_assign_private_lru picks the zero-session list first |
private_pages_ratio | fraction of BCBs that should be private | smoothed target driving all_private_quota |
add_shared_lru_idx | circular index for shared-list relocation | round-robins shared-LRU assignment on private-to-shared migration |
avoid_shared_lru_idx | shared list to avoid when relocating | steers traffic off the fattest list so it drains via victimization |
last_adjust_time | TSC_TICKS of last adjust | gates the 1 ms / 500 ms cadence checks |
adjust_age | monotonic counter, bumped each adjust | generation stamp other code compares against |
is_adjusting | re-entrancy guard | only one thread runs the adjust body at a time |
pgbuf_watcher is the caller-owned ordered-fix handle: stack-allocated,
init’d with PGBUF_INIT_WATCHER(w, rank, hfid), passed to pgbuf_ordered_fix,
then threaded onto the holder’s watcher list so the machinery can re-fix it.
| Field | Role | Why it exists |
|---|---|---|
pgptr | fixed page, or NULL | fix output; also the “is watcher live” test (PGBUF_IS_CLEAN_WATCHER) |
next / prev | links in the holder’s watcher list | one holder (one fixed BCB) may carry several watchers |
group_id | VPID of the group’s heap-header page | deadlock key: pages of one heap share a group |
latch_mode (7 bits) | latch held by this watcher | re-fix restores the same mode; WRITE on any watcher promotes the page |
page_was_unfixed (1 bit) | true if ordered fix had to unfix-and-refix this page | tells caller “in-page pointers moved; re-read” |
initial_rank (4 bits) | rank set at init time | caller’s declared rank before any fix |
curr_rank (4 bits) | effective rank after fix | promoted to PGBUF_ORDERED_HEAP_HDR if this page is its own group header |
magic (debug) | 0x12345678 | catches an uninitialized/garbage watcher |
watched_at / init_at (debug) | source location strings | leak / double-fix diagnostics |
Invariant (watcher rank monotonicity within a group): every watcher on the
same physical page must carry the same curr_rank and group_id.
pgbuf_ordered_fix_release enforces this while scanning a holder’s watcher
list; a mismatch raises ER_PB_ORDERED_INCONSISTENCY (fatal). If violated, the
VPID sort below is ill-defined and the deadlock guarantee collapses.
The rank ordering is the enum PGBUF_ORDERED_RANK (in page_buffer.h):
PGBUF_ORDERED_HEAP_HDR = 0 (fixed first) < PGBUF_ORDERED_HEAP_NORMAL <
PGBUF_ORDERED_HEAP_OVERFLOW (fixed last) < PGBUF_ORDERED_RANK_UNDEFINED
(sentinel). A pgbuf_watcher hangs off a PGBUF_HOLDER’s
first_watcher..last_watcher chain (whose bufptr is the fixed PGBUF_BCB)
and tags the page via group_id with the heap-header VPID defining its group.
10.2 pgbuf_adjust_quotas — recomputing private quotas every 100 ms
The Page Maintenance Daemon (100 ms cubthread::looper) calls
pgbuf_page_maintenance_execute, which after a boot guard
(BO_IS_FLUSH_DAEMON_AVAILABLE) calls pgbuf_adjust_quotas then
pgbuf_direct_victims_maintenance (10.3). Cadence gates and exits:
1. Disabled / already running. if (!PGBUF_PAGE_QUOTA_IS_ENABLED || quota->is_adjusting) return; else is_adjusting = 1.
2. Too soon (< 1 ms). If diff_usec < 1000, clear the guard and return.
3. Low activity and < 500 ms. If pg_unfix_cnt < PGBUF_TRAN_THRESHOLD_ACTIVITY (num_buffers/4) and < 500 ms elapsed, bail. A busy pool can adjust as often as every 1 ms; an idle pool waits 500 ms.
4. Very low activity flag. pg_unfix_cnt.exchange(0) reads-and-resets; if the prior value < THRESHOLD/100, set low_overall_activity = true.
Then it stamps last_adjust_time and bumps adjust_age.
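The four cadence gates can be read as one small decision function. The following is a minimal sketch, assuming a flattened hypothetical state struct (`sk_gate_state`) in place of the real pool and quota fields, and a plain read-then-zero in place of the atomic exchange.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum { SK_SKIP, SK_ADJUST } sk_decision;

/* Hypothetical, flattened view of the state the gates consult. */
typedef struct
{
  bool enabled;                 /* stands in for PGBUF_PAGE_QUOTA_IS_ENABLED */
  bool is_adjusting;            /* re-entrancy guard */
  long diff_usec;               /* time since last_adjust_time */
  int pg_unfix_cnt;             /* unfixes since the last adjust */
  int threshold;                /* stands in for num_buffers / 4 */
} sk_gate_state;

static sk_decision
sk_cadence_gates (sk_gate_state *s, bool *low_overall_activity)
{
  *low_overall_activity = false;
  if (!s->enabled || s->is_adjusting)
    {
      return SK_SKIP;           /* disabled or already running: guard untouched */
    }
  s->is_adjusting = true;
  if (s->diff_usec < 1000)
    {
      s->is_adjusting = false;  /* too soon: clear the guard before leaving */
      return SK_SKIP;
    }
  if (s->pg_unfix_cnt < s->threshold && s->diff_usec < 500000)
    {
      s->is_adjusting = false;  /* low activity and under 500 ms */
      return SK_SKIP;
    }
  int prior = s->pg_unfix_cnt;  /* plain stand-in for exchange(0) */
  s->pg_unfix_cnt = 0;
  if (prior < s->threshold / 100)
    {
      *low_overall_activity = true;
    }
  return SK_ADJUST;             /* guard stays set until the tail clears it */
}
```

Every skip path that runs after the guard is set clears it again, which is the property the single-writer invariant depends on.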
Phase A — per-list hits. One pass over PGBUF_TOTAL_LRU_COUNT lists.
lru_hits = ATOMIC_TAS_32(&monitor->lru_hits[i], 0) is read-and-reset, scaled
to hits/sec, accumulated into lru_private_hits/lru_shared_hits, and
total_victims += PGBUF_GET_LRU_LIST(i)->count_vict_cand. Each private list’s
activity sample is history-smoothed: if diff_usec >= tensec_usec (at least 10 s)
monitor->lru_activity[i] = lru_hits (old sample stale, replace); else it is
the time-weighted blend ((tensec_usec - diff_usec) * old + diff_usec * lru_hits) / tensec_usec.
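The smoothing rule is a linear blend whose weight is the elapsed fraction of a ten-second window. A minimal sketch, assuming double arithmetic and a hypothetical helper name (`sk_smooth_activity`); the real field types may differ:

```c
#include <assert.h>

#define SK_TENSEC_USEC 10000000.0

/* Returns the new activity sample for one private list. */
static double
sk_smooth_activity (double old_activity, double lru_hits, double diff_usec)
{
  assert (diff_usec >= 0.0);
  if (diff_usec >= SK_TENSEC_USEC)
    {
      return lru_hits;          /* old sample is stale: replace it */
    }
  return ((SK_TENSEC_USEC - diff_usec) * old_activity + diff_usec * lru_hits) / SK_TENSEC_USEC;
}
```

For example, with an old sample of 100, a fresh sample of 200, and 2.5 s elapsed, the fresh sample carries a quarter of the weight and the result is 125.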
Phase B — private ratio. If low_overall_activity, force private_ratio = MIN_PRIVATE_RATIO (starve privates); else lru_private_hits / (private + shared) clamped to [0.01, 0.998] (shared floored to 1), then 10 s-smoothed
into quota->private_pages_ratio.
Phase C — redistribute (two mutually exclusive branches):
- No private activity (sum_private_lru_activity_total == 0): all_private_quota = 0; every private list gets quota = threshold_lru1 = threshold_lru2 = 0, pgbuf_lru_adjust_zones under the list mutex if it still holds pages, and a push onto the victim LFCQ (pgbuf_lfcq_add_lru_with_victims) if it has over-quota candidates.
- Some private activity (else): the budget is all_private_quota = (int)((num_buffers - invalid_cnt) * quota->private_pages_ratio), split per list proportional to activity:

```c
// pgbuf_adjust_quotas (phase C, active) -- src/storage/page_buffer.c
new_quota = (int) ((float) lru_activity[i] / sum_private_lru_activity_total * all_private_quota);
new_quota = MIN (new_quota, PGBUF_PRIVATE_LRU_MAX_HARD_QUOTA);  /* absolute cap */
new_quota = MIN (new_quota, num_buffers / 2);                   /* half-pool cap */
lru_list->threshold_lru1 = lru_list->threshold_lru2 = (int) (new_quota * PGBUF_LRU_ZONE_MIN_RATIO);
```

The two caps stop a single list monopolizing the pool; threshold_lru1/2 are
the zone sizes the Chapter 7 unfix path reads.
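To see the proportional split and the two caps interact, here is a standalone sketch. The function name and the cap values passed in are hypothetical; the real code reads them from macros and the pool, and this version leaves out the threshold assignment.

```c
#include <assert.h>

#define SK_MIN(a, b) ((a) < (b) ? (a) : (b))

/* Quota for one private list, given its share of the total private activity. */
static int
sk_private_quota (float activity, float activity_total, int all_private_quota,
                  int hard_cap, int num_buffers)
{
  assert (activity_total > 0.0f);       /* the zero case takes the other branch */
  int new_quota = (int) (activity / activity_total * all_private_quota);
  new_quota = SK_MIN (new_quota, hard_cap);             /* absolute cap */
  new_quota = SK_MIN (new_quota, num_buffers / 2);      /* half-pool cap */
  return new_quota;
}
```

A list holding a quarter of the activity against a budget of 4000 gets 1000 pages; a list holding all of it is trimmed by whichever cap is smaller.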
Phase D — shared lists. Leftover budget spreads evenly:
avg_shared_lru_size = (num_buffers - all_private_quota) / num_LRU_list,
floored at PGBUF_MIN_SHARED_LIST_ADJUST_SIZE; threshold_lru1/2 from the
configured ratio_lru1/2. Each over-threshold shared list is re-zoned and, if
it has candidates, queued for victims.
Phase E: victim_rich and release. `monitor.victim_rich = total_victims >= (int)(0.1 * num_buffers); quota->is_adjusting = 0;`
`.victim_rich` is Chapter 9’s cheap “push hard on victimization?” hint, true once candidates reach 10% of the pool.
Invariant (single-writer adjust): is_adjusting is set on entry and
cleared on every exit (all four early returns and the tail). Not a mutex — a
best-effort flag for a single-threaded daemon. An early return that forgets
to clear it freezes the subsystem forever (every later call hits gate 1).
```mermaid
flowchart TD
B{"enabled and\nnot adjusting?"} -->|no| Z["return"]
B -->|yes| C["is_adjusting=1"]
C --> D{"diff<1ms?"}
D -->|yes| Y["is_adjusting=0;\nreturn"]
D -->|no| E{"low activity\nand diff<500ms?"}
E -->|yes| Y
E -->|no| K{"sum activity==0?"}
K -->|yes| J["all private quota=0"]
K -->|no| S["split by activity,\ncap abs and pool/2"]
J --> L["shared thresholds;\nvictim_rich; is_adjusting=0"]
S --> L
```
Figure 10-1. pgbuf_adjust_quotas: cadence gates (1-4) and the two Phase C redistribution branches.
10.3 pgbuf_direct_victims_maintenance — the backup victim hand-off
The fast path assigns victims as a side effect of unfix/flush; on an idle system that never fires, so a blocked thread could starve. This backup walks lists round-robin and hands victims out directly, once over private lists and once over shared:

```c
// pgbuf_direct_victims_maintenance -- src/storage/page_buffer.c
static int prv_index = 0;       /* round-robin cursors, single-threaded use only */
static int shr_index = 0;

for (index = prv_index, restarted = false;
     pgbuf_is_any_thread_waiting_for_direct_victim () && nassigns > 0 && index != prv_index && !restarted;
     (index == PGBUF_PRIVATE_LRU_COUNT - 1) ? index = 0, restarted = true : index++)
  {
    pgbuf_lfcq_assign_direct_victims (thread_p, PGBUF_LRU_INDEX_FROM_PRIVATE (index), &nassigns);
  }
prv_index = index;              /* persist cursor for next tick */
// ... a second, structurally identical loop over shared lists, then shr_index = index ...
```

The cursor starts at prv_index, and index != prv_index is meant as the
wrap-around terminator once the iterator has advanced past it. Read literally,
the excerpt evaluates that test on entry too, where index == prv_index makes it
false; the sweep described here assumes the first list is still visited. Each
loop stops when (a) no thread waits, (b) the per-iteration budget
DEFAULT_ASSIGNS_PER_ITERATION (5) is spent, or (c) it wrapped once
(restarted). The prv_index = index / shr_index = index write-backs are
what make the static cursors persist across ticks so each tick sweeps
different lists, hence single-threaded use only.
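The intended sweep (a persisted cursor, a per-tick budget, and at most one lap) can be sketched on its own. This is a simplified model rather than the quoted loop: the list count, the callback type, and the helper names are hypothetical, and the "is any thread waiting" test is left out.

```c
#include <assert.h>

#define SK_NLISTS 4

static int sk_cursor = 0;       /* persists across ticks; single-threaded use only */
static int sk_last_list = -1;   /* records the most recently visited list */

/* A visitor hands out victims from one list and spends some of the budget. */
typedef void (*sk_visit_fn) (int list, int *budget);

/* Example visitor: pretends every list yields exactly one victim. */
static void
sk_take_one (int list, int *budget)
{
  sk_last_list = list;
  (*budget)--;
}

/* One tick: start at the saved cursor, stop on spent budget or a full lap. */
static int
sk_sweep (sk_visit_fn visit, int budget)
{
  int visited = 0;
  int index = sk_cursor;
  while (budget > 0 && visited < SK_NLISTS)
    {
      visit (index, &budget);
      index = (index + 1) % SK_NLISTS;
      visited++;
    }
  sk_cursor = index;            /* next tick resumes where this one stopped */
  return visited;
}
```

With a budget of 3 the first tick visits lists 0 to 2 and parks the cursor at 3, so the next tick starts from a different list.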
pgbuf_lfcq_assign_direct_victims retries from lru_list->bottom if the
cached victim_hint is stale (a CAS resets it), self-healing the hint.
10.4 pgbuf_ordered_fix_release — multi-page deadlock avoidance
Heap ops hold several pages at once; fixing them in different orders
deadlocks. The heap header must stay fixed first, so plain VPID ordering is
insufficient. Ordered fix keeps a rank (header < normal < overflow), sorts
by VPID within rank, and — if a new request violates that order against held
pages — unfixes the offenders, sorts, re-fixes in order. Entry contract:
req_watcher->pgptr must be NULL else ER_FAILED_ASSERTION; curr_rank
becomes PGBUF_ORDERED_HEAP_HDR if the requested VPID is the group header,
else initial_rank.
Branch 1 — conditional first attempt. If the thread holds no other page
(holder == NULL, or holder->thrd_link == NULL && VPID_EQ(req_vpid, &holder->bufptr->vpid) — only this one), use PGBUF_UNCONDITIONAL_LATCH; else
PGBUF_CONDITIONAL_LATCH so a would-be deadlock fails fast. Then
ret_pgptr = pgbuf_fix_release(...).
- Got it: find the holder, resolve group id (existing watcher, or fix the heap header via pgbuf_get_groupid_and_unfix if PAGE_HEAP), attach via pgbuf_add_watch_instance_internal, goto exit. Common no-reorder case.
- Did not get it, branch on error:
  - ER_PB_BAD_PAGEID / ER_INTERRUPTED → exit;
  - OLD_PAGE_MAYBE_DEALLOCATED + ER_PB_BAD_PAGEID → treat deallocated, exit;
  - LK_ZERO_WAIT / LK_FORCE_ZERO_WAIT → ER_LK_PAGE_TIMEOUT (no error set for force, scans continue), exit;
  - UNCONDITIONAL → already blocked and failed, exit with the error;
  - else (conditional failed) → fall through to reorder, clearing er_status.
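The choice between an unconditional and a conditional first attempt in Branch 1 reduces to a small predicate over the thread's holder list. A sketch with hypothetical types (`sk_holder` flattens the VPID into two integers):

```c
#include <assert.h>
#include <stddef.h>

typedef enum { SK_UNCONDITIONAL_LATCH, SK_CONDITIONAL_LATCH } sk_latch_cond;

/* Hypothetical holder: one entry per page the thread currently has fixed. */
typedef struct sk_holder
{
  struct sk_holder *thrd_link;  /* next page held by the same thread */
  int volid;
  int pageid;
} sk_holder;

static sk_latch_cond
sk_first_attempt (const sk_holder *holder, int req_volid, int req_pageid)
{
  if (holder == NULL)
    {
      return SK_UNCONDITIONAL_LATCH;    /* nothing held: blocking cannot deadlock */
    }
  if (holder->thrd_link == NULL && holder->volid == req_volid && holder->pageid == req_pageid)
    {
      return SK_UNCONDITIONAL_LATCH;    /* the only held page is the requested one */
    }
  return SK_CONDITIONAL_LATCH;          /* other pages held: fail fast instead */
}
```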
Branch 2 — classify held pages. Walk holders, skipping watch_count <= 0
(no watcher; assumed deadlock-safe). Gather each watched holder’s watchers into
ordered_holders_info[], verifying the 10.1 invariant.
diff = pgbuf_compare_hold_vpid_for_sort(req, held): diff < 0 (held sorts
after req) → save for unfix; diff == 0 (same page) → ER_FAILED_ASSERTION;
diff > 0 (held sorts before) → leave fixed. If the request has no group yet
(req_page_has_group == false), diff is forced -1 so all watched pages are
unfixed and re-fixed.
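The classification rule, including the forced diff for a request without a group, fits in a few lines. A sketch with hypothetical names; `diff` is assumed to be the comparator result for (request, held page):

```c
#include <assert.h>
#include <stdbool.h>

typedef enum
{
  SK_LEAVE_FIXED,               /* held page sorts before the request */
  SK_SAVE_FOR_UNFIX,            /* held page sorts after the request */
  SK_SAME_PAGE_ERROR            /* stands in for ER_FAILED_ASSERTION */
} sk_action;

static sk_action
sk_classify_held_page (int diff, bool req_page_has_group)
{
  if (!req_page_has_group)
    {
      diff = -1;                /* no group yet: every watched page is unfixed */
    }
  if (diff < 0)
    {
      return SK_SAVE_FOR_UNFIX;
    }
  if (diff == 0)
    {
      return SK_SAME_PAGE_ERROR;
    }
  return SK_LEAVE_FIXED;
}
```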
Branch 3 — unfix the out-of-order pages. For each saved entry,
pgbuf_bcb_register_avoid_deallocation(holder->bufptr) pins it across the gap,
pgbuf_unfix runs fix_count times, then each watcher is PGBUF_CLEAR_WATCHER’d
and gets pg_watcher->page_was_unfixed = true.
Branch 4 — resolve missing group, then sort. If req had no group, re-fix
it unconditionally, derive its group id (clear dealloc-prevent if
OLD_PAGE_PREVENT_DEALLOC downgraded to OLD_PAGE), append the requested
page, qsort by pgbuf_compare_hold_vpid_for_sort (rank, volid, pageid).
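The sort key named above (rank, then volid, then pageid) can be illustrated with a plain `qsort` comparator. The entry struct and function names are hypothetical, and the real comparator works on holder-info records rather than bare triples.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical flattened sort record for one page to (re-)fix. */
typedef struct
{
  int rank;                     /* 0 = heap header, then normal, then overflow */
  int volid;
  int pageid;
} sk_entry;

static int
sk_compare_entry (const void *p1, const void *p2)
{
  const sk_entry *a = (const sk_entry *) p1;
  const sk_entry *b = (const sk_entry *) p2;
  if (a->rank != b->rank)
    {
      return (a->rank < b->rank) ? -1 : 1;
    }
  if (a->volid != b->volid)
    {
      return (a->volid < b->volid) ? -1 : 1;
    }
  if (a->pageid != b->pageid)
    {
      return (a->pageid < b->pageid) ? -1 : 1;
    }
  return 0;
}

static void
sk_sort_entries (sk_entry *entries, size_t count)
{
  qsort (entries, count, sizeof (entries[0]), sk_compare_entry);
}
```

Comparing with explicit branches instead of subtraction avoids integer overflow on extreme page ids.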
Branch 5 — re-fix in sorted order. Requested page uses caller’s
request_mode/fetch_mode; restored pages use saved latch_mode+OLD_PAGE.
All PGBUF_UNCONDITIONAL_LATCH now — order guaranteed, blocking deadlock-free.
Failures: ER_INTERRUPTED exits; ER_PB_BAD_PAGEID on the requested page
under OLD_PAGE_MAYBE_DEALLOCATED is tolerated; failure restoring a held
page → ER_PB_ORDERED_REFIX_FAILED (serious — watchers partially live).
Invariant (caller must honor page_was_unfixed): any unfixed-and-refixed
watcher has page_was_unfixed == true. Pointers cached into that page may now
be invalid — another thread could have modified it during the gap. Reusing a
stale pointer reads corrupt data — the single most important contract of
ordered fix.
10.5 pgbuf_ordered_unfix — watcher-aware unfix
pgbuf_ordered_unfix is the counterpart: if watcher_object->pgptr == NULL it
calls assert_release(false) and returns; otherwise pgbuf_get_holder finds the
holder, a for (watcher = holder->last_watcher; ...; watcher = watcher->prev)
scan finds the exact watcher, then after the invariant assert
(assert(holder->fix_count >= holder->watch_count)) it calls
pgbuf_remove_watcher and one pgbuf_unfix.
Invariant (fix_count >= watch_count): a holder may be fixed more than
watched (a plain fix adds no watcher) but never the reverse. Asserted here and
in 10.4’s classification loop. Violation means a watcher outlived its fix — a
use-after-unfix.
10.6 pgbuf_promote_read_latch_release — R-to-W promotion
Converts a held READ latch to WRITE without fully unfixing, per a
PGBUF_PROMOTE_CONDITION; a CAS loop on the packed atomic_latch (Chapter 5):
- Sole reader (holder->fix_count == impl.fcnt): unless the next waiter is a promoter (then fail ER_PAGE_LATCH_PROMOTE_FAIL), flip impl_new.latch_mode = PGBUF_LATCH_WRITE in place. No blocking.
- Other readers present (fix_count != fcnt): PGBUF_PROMOTE_ONLY_READER or next waiter is a promoter → fail ER_PAGE_LATCH_PROMOTE_FAIL (CASE #1/#2). PGBUF_PROMOTE_SHARED_READER → subtract our fixes from fcnt, mark waiter_exists, set need_block, leave the loop.
If need_block, it effectively unfixes (holder->fix_count = 0, remove
holder), sets thread_p->wait_for_latch_promote = true, blocks via
pgbuf_block_bcb for WRITE, and on wake re-allocates a holder with the saved
fix_count.
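The promotion outcome depends on three inputs only: whether this thread is the sole reader, whether the next waiter is already a promoter, and the promote condition. A decision-table sketch with hypothetical enum and function names (the CAS loop and the blocking itself are not modeled):

```c
#include <assert.h>
#include <stdbool.h>

typedef enum { SK_PROMOTE_ONLY_READER, SK_PROMOTE_SHARED_READER } sk_promote_cond;

typedef enum
{
  SK_PROMOTED_IN_PLACE,         /* latch mode flipped to WRITE, no blocking */
  SK_MUST_BLOCK,                /* give up our fixes and wait for WRITE */
  SK_PROMOTE_FAIL               /* stands in for ER_PAGE_LATCH_PROMOTE_FAIL */
} sk_promote_outcome;

static sk_promote_outcome
sk_promote_decision (int holder_fix_count, int latch_fcnt,
                     bool next_waiter_is_promoter, sk_promote_cond cond)
{
  assert (holder_fix_count <= latch_fcnt);
  if (next_waiter_is_promoter)
    {
      return SK_PROMOTE_FAIL;   /* never queue behind another promoter */
    }
  if (holder_fix_count == latch_fcnt)
    {
      return SK_PROMOTED_IN_PLACE;      /* sole reader */
    }
  if (cond == SK_PROMOTE_ONLY_READER)
    {
      return SK_PROMOTE_FAIL;   /* other readers present and caller will not wait */
    }
  return SK_MUST_BLOCK;         /* shared reader: block until the others leave */
}
```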
Invariant (promoter mutual exclusion): at most one waiter on a BCB carries
wait_for_latch_promote; both branches that detect a promoter waiter abort
rather than queue behind it. Two blocked promoters would each wait for the
other to drop its read latch — a deadlock. The abort returns
ER_PAGE_LATCH_PROMOTE_FAIL; callers retry by unfix + fix WRITE.
10.7 Remaining special fix paths
- pgbuf_simple_fix / pgbuf_simple_unfix: temp files only. Latchless, LRU-mutexless; only add_fcnt(&bufptr->atomic_latch, 1), never latches (“Cannot be mixed with general FIX(LATCH)”). Resident → if a direct-victim claim is pending, invalidate it when need_fix, else NULL. Absent → if need_fix: lock hash, pgbuf_claim_bcb_for_fix, insert, add to private/shared LRU; else NULL. Unfix = lock, add_fcnt(..., -1), unlock.
- pgbuf_fix_if_not_deallocated: vacuum’s dealloc-aware fix. disk_is_page_sector_reserved first: DISK_INVALID → NO_ERROR, *page = NULL (deallocated, not error); DISK_ERROR → propagate; DISK_VALID → real fix with OLD_PAGE_MAYBE_DEALLOCATED, then a NULL + ER_PB_BAD_PAGEID is swallowed (raced) unless mid recovery-redo.
- pgbuf_invalidate: drop a page (caller holds WRITE). fcnt > 1 → just unfix (pgbuf_unlatch_thrd_holder + pgbuf_unlatch_bcb_upon_unfix), no invalidation. fcnt == 1 → flush if dirty (pgbuf_bcb_safe_flush_force_lock), record VPID, unfix, re-lock, re-check; if the BCB was reused, re-fixed, or avoid-victim → skip, else pgbuf_invalidate_bcb detaches it. Persistent pages run this as a post-commit postpone; temp pages unconditionally. pgbuf_invalidate_all iterates a volume.
- pgbuf_notify_vacuum_follows / pgbuf_bcb_is_to_vacuum: vacuum hint. Sets PGBUF_BCB_TO_VACUUM_FLAG via pgbuf_bcb_update_flags(thread_p, bcb, PGBUF_BCB_TO_VACUUM_FLAG, 0) (“vacuum will revisit, prefer not to victimize”). pgbuf_bcb_is_to_vacuum tests it; the victim-flush path clears it on commit-to-flush, so the hint is one-shot.
- TDE hook (out-of-scope). pgbuf_set_tde_algorithm early-returns if the algorithm is unchanged, else clears the existing bits (pflag &= ~FILEIO_PAGE_FLAG_ENCRYPTED_MASK) and ORs in the new encryption bit in iopage->prv.pflag (FILEIO_PAGE_FLAG_ENCRYPTED_AES / _ARIA), logs undoredo unless skip_logging, marks dirty. The page buffer only carries these bits through dirty/flush (Ch 6); en/decryption is the TDE module’s. Noted only so a maintainer knows pflag has a TDE tenant.
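Of the paths above, pgbuf_fix_if_not_deallocated has the least obvious result mapping, since "deallocated" is a success with a NULL page. The sketch below models only that mapping. All enum and function names are hypothetical, and treating a bad page id during recovery redo as an error is this sketch's reading of "unless mid recovery-redo".

```c
#include <assert.h>
#include <stdbool.h>

typedef enum { SK_DISK_INVALID, SK_DISK_ERROR, SK_DISK_VALID } sk_disk_state;
typedef enum { SK_FIX_OK, SK_FIX_BAD_PAGEID, SK_FIX_OTHER_ERROR } sk_fix_result;

typedef enum
{
  SK_PAGE_RETURNED,             /* NO_ERROR with a fixed page */
  SK_NULL_NO_ERROR,             /* NO_ERROR with *page = NULL: page is deallocated */
  SK_ERROR                      /* error propagated to the caller */
} sk_outcome;

static sk_outcome
sk_fix_if_not_deallocated (sk_disk_state state, sk_fix_result fix, bool in_recovery_redo)
{
  switch (state)
    {
    case SK_DISK_INVALID:
      return SK_NULL_NO_ERROR;  /* sector not reserved: deallocated, not an error */
    case SK_DISK_ERROR:
      return SK_ERROR;
    case SK_DISK_VALID:
      break;                    /* go on to the real fix */
    }
  if (fix == SK_FIX_OK)
    {
      return SK_PAGE_RETURNED;
    }
  if (fix == SK_FIX_BAD_PAGEID && !in_recovery_redo)
    {
      return SK_NULL_NO_ERROR;  /* raced with a deallocation: swallow it */
    }
  return SK_ERROR;
}
```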
10.8 Chapter summary — key takeaways
- The 100 ms Page Maintenance Daemon runs pgbuf_adjust_quotas then pgbuf_direct_victims_maintenance (idle victim backup); the former is gated by a 1 ms / 500 ms cadence and num_buffers/4 activity, guarded by the single-writer is_adjusting flag that must clear on every exit, and sets per-list quota/threshold_lru1/2 (capped at PGBUF_PRIVATE_LRU_MAX_HARD_QUOTA and half the pool) plus victim_rich (>10% of pool).
- Ordered fix ranks pages (heap-hdr < normal < overflow) then sorts by VPID; held pages sorting after the request are unfixed, the set qsorted, all re-fixed unconditionally. Its load-bearing output page_was_unfixed means the caller must re-read that page (cached pointers may be stale).
- Watcher invariants (same rank/group per page; fix_count >= watch_count per holder) are fatal-enforced. pgbuf_promote_read_latch_release flips R-to-W in place as sole reader, blocks under PGBUF_PROMOTE_SHARED_READER, aborts with ER_PAGE_LATCH_PROMOTE_FAIL if another promoter waits.
- Each special path bypasses one step: pgbuf_simple_fix (latchless temp), pgbuf_fix_if_not_deallocated (deallocated = NULL non-error), pgbuf_invalidate (detach a singly-fixed page), pgbuf_notify_vacuum_follows (one-shot anti-victim hint); the TDE pflag tenant is the out-of-scope boundary.
Position hints as of this revision
The following are line numbers as observed on 2026-06-17; symbols are the canonical anchor and line numbers are hints that decay.
| Symbol | File | Line |
|---|---|---|
dwb_set_data_on_next_slot | src/storage/double_write_buffer.cpp | 2686 |
dwb_add_page | src/storage/double_write_buffer.cpp | 2726 |
dwb_is_created | src/storage/double_write_buffer.cpp | 2909 |
fileio_page_reserved | src/storage/file_io.h | 166 |
fileio_page_watermark | src/storage/file_io.h | 179 |
fileio_page | src/storage/file_io.h | 186 |
PGBUF_DEFAULT_FIX_COUNT | src/storage/page_buffer.c | 90 |
PGBUF_NUM_ALLOC_HOLDER | src/storage/page_buffer.c | 94 |
PGBUF_FIX_COUNT_THRESHOLD | src/storage/page_buffer.c | 106 |
pgbuf_latch_timeout | src/storage/page_buffer.c | 107 |
PGBUF_IOPAGE_BUFFER_SIZE | src/storage/page_buffer.c | 118 |
PGBUF_FIND_BCB_PTR | src/storage/page_buffer.c | 135 |
PGBUF_LRU_NBITS | src/storage/page_buffer.c | 148 |
PGBUF_LRU_INDEX_MASK | src/storage/page_buffer.c | 150 |
PGBUF_LRU_INDEX_MASK | src/storage/page_buffer.c | 182 |
PGBUF_LRU_1_ZONE | src/storage/page_buffer.c | 197 |
PGBUF_LRU_ZONE_MASK | src/storage/page_buffer.c | 201 |
PGBUF_INVALID_ZONE | src/storage/page_buffer.c | 205 |
PGBUF_VOID_ZONE | src/storage/page_buffer.c | 206 |
PGBUF_ZONE_MASK | src/storage/page_buffer.c | 211 |
PGBUF_GET_ZONE | src/storage/page_buffer.c | 215 |
PGBUF_GET_LRU_INDEX | src/storage/page_buffer.c | 216 |
PGBUF_BCB_DIRTY_FLAG | src/storage/page_buffer.c | 224 |
PGBUF_BCB_FLUSHING_TO_DISK_FLAG | src/storage/page_buffer.c | 227 |
PGBUF_BCB_VICTIM_DIRECT_FLAG | src/storage/page_buffer.c | 234 |
PGBUF_BCB_INVALIDATE_DIRECT_VICTIM_FLAG | src/storage/page_buffer.c | 235 |
PGBUF_BCB_MOVE_TO_LRU_BOTTOM_FLAG | src/storage/page_buffer.c | 237 |
PGBUF_BCB_TO_VACUUM_FLAG | src/storage/page_buffer.c | 239 |
PGBUF_BCB_ASYNC_FLUSH_REQ | src/storage/page_buffer.c | 241 |
PGBUF_BCB_FLAGS_MASK | src/storage/page_buffer.c | 244 |
PGBUF_BCB_INVALID_VICTIM_CANDIDATE_MASK | src/storage/page_buffer.c | 258 |
PGBUF_BCB_INIT_FLAGS | src/storage/page_buffer.c | 265 |
PGBUF_BCB_COUNT_FIX_SHIFT_BITS | src/storage/page_buffer.c | 268 |
PGBUF_BCB_AVOID_DEALLOC_MASK | src/storage/page_buffer.c | 269 |
PGBUF_TRAN_THRESHOLD_ACTIVITY | src/storage/page_buffer.c | 276 |
PGBUF_AOUT_NOT_FOUND | src/storage/page_buffer.c | 279 |
PGBUF_SHOULD_IGNORE_UNFIX | src/storage/page_buffer.c | 290 |
HASH_SIZE_BITS | src/storage/page_buffer.c | 295 |
PGBUF_HASH_SIZE | src/storage/page_buffer.c | 296 |
PGBUF_HASH_VALUE | src/storage/page_buffer.c | 300 |
PGBUF_FLUSH_VICTIM_BOOST_MULT | src/storage/page_buffer.c | 305 |
PGBUF_NEIGHBOR_FLUSH_NONDIRTY | src/storage/page_buffer.c | 307 |
PGBUF_MAX_NEIGHBOR_PAGES | src/storage/page_buffer.c | 310 |
PGBUF_NEIGHBOR_POS | src/storage/page_buffer.c | 314 |
PGBUF_CHKPT_MAX_FLUSH_RATE | src/storage/page_buffer.c | 322 |
PGBUF_CHKPT_MIN_FLUSH_RATE | src/storage/page_buffer.c | 323 |
PGBUF_CHKPT_BURST_PAGES | src/storage/page_buffer.c | 326 |
PGBUF_LRU_ZONE_MIN_RATIO | src/storage/page_buffer.c | 342 |
PGBUF_LOCK_HOLDER | src/storage/page_buffer.c | 348 |
pgbuf_holder_stat | src/storage/page_buffer.c | 441 |
pgbuf_batch_flush_helper | src/storage/page_buffer.c | 451 |
pgbuf_holder | src/storage/page_buffer.c | 461 |
pgbuf_holder_anchor | src/storage/page_buffer.c | 479 |
pgbuf_holder_set | src/storage/page_buffer.c | 488 |
pgbuf_atomic_latch_impl | src/storage/page_buffer.c | 494 |
pgbuf_bcb | src/storage/page_buffer.c | 506 |
atomic_latch | src/storage/page_buffer.c | 513 |
flags | src/storage/page_buffer.c | 514 |
next_wait_thrd | src/storage/page_buffer.c | 516 |
count_fix_and_avoid_dealloc | src/storage/page_buffer.c | 528 |
oldest_unflush_lsa | src/storage/page_buffer.c | 536 |
pgbuf_iopage_buffer | src/storage/page_buffer.c | 541 |
struct pgbuf_iopage_buffer | src/storage/page_buffer.c | 541 |
pgbuf_buffer_lock | src/storage/page_buffer.c | 557 |
struct pgbuf_buffer_lock | src/storage/page_buffer.c | 557 |
pgbuf_buffer_hash | src/storage/page_buffer.c | 570 |
pgbuf_lru_list | src/storage/page_buffer.c | 580 |
victim_hint | src/storage/page_buffer.c | 589 |
count_vict_cand | src/storage/page_buffer.c | 602 |
pgbuf_invalid_list | src/storage/page_buffer.c | 621 |
struct pgbuf_invalid_list | src/storage/page_buffer.c | 621 |
pgbuf_aout_buf | src/storage/page_buffer.c | 636 |
struct pgbuf_aout_list | src/storage/page_buffer.c | 645 |
pgbuf_aout_list | src/storage/page_buffer.c | 645 |
pgbuf_seq_flusher | src/storage/page_buffer.c | 669 |
struct pgbuf_page_monitor | src/storage/page_buffer.c | 688 |
struct pgbuf_page_quota | src/storage/page_buffer.c | 710 |
pgbuf_page_quota | src/storage/page_buffer.c | 710 |
struct pgbuf_direct_victim | src/storage/page_buffer.c | 737 |
pgbuf_buffer_pool | src/storage/page_buffer.c | 749 |
struct pgbuf_victim_candidate_list | src/storage/page_buffer.c | 833 |
pgbuf_Flush_helper | src/storage/page_buffer.c | 840 |
AOUT_HASH_IDX | src/storage/page_buffer.c | 854 |
PGBUF_BCB_LOCK | src/storage/page_buffer.c | 869 |
PGBUF_BCB_TRYLOCK | src/storage/page_buffer.c | 871 |
PGBUF_IS_BCB_IN_LRU_VICTIM_ZONE | src/storage/page_buffer.c | 919 |
PGBUF_IS_BCB_IN_LRU | src/storage/page_buffer.c | 920 |
PGBUF_IS_BCB_OLD_ENOUGH | src/storage/page_buffer.c | 927 |
PGBUF_PRIVATE_LRU_MAX_HARD_QUOTA | src/storage/page_buffer.c | 943 |
PGBUF_MIN_PAGES_IN_SHARED_LIST | src/storage/page_buffer.c | 946 |
PGBUF_TOTAL_LRU_COUNT | src/storage/page_buffer.c | 969 |
PGBUF_IS_PRIVATE_LRU_INDEX | src/storage/page_buffer.c | 975 |
PGBUF_LRU_LIST_IS_OVER_QUOTA | src/storage/page_buffer.c | 977 |
PGBUF_LRU_LIST_IS_OVER_QUOTA_WITH_BUFFER | src/storage/page_buffer.c | 987 |
set_latch | src/storage/page_buffer.c | 1310 |
add_fcnt | src/storage/page_buffer.c | 1324 |
set_waiter_exists | src/storage/page_buffer.c | 1368 |
get_latch | src/storage/page_buffer.c | 1398 |
get_impl | src/storage/page_buffer.c | 1406 |
pgbuf_thread_variables_init | src/storage/page_buffer.c | 1415 |
pgbuf_hash_func_mirror | src/storage/page_buffer.c | 1441 |
pgbuf_hash_vpid | src/storage/page_buffer.c | 1480 |
pgbuf_compare_vpid | src/storage/page_buffer.c | 1494 |
pgbuf_initialize | src/storage/page_buffer.c | 1518 |
pgbuf_finalize | src/storage/page_buffer.c | 1796 |
pgbuf_fix_with_retry | src/storage/page_buffer.c | 1993 |
pgbuf_fix_release | src/storage/page_buffer.c | 2041 |
pgbuf_simple_fix | src/storage/page_buffer.c | 2475 |
pgbuf_simple_unfix | src/storage/page_buffer.c | 2569 |
pgbuf_promote_read_latch_debug | src/storage/page_buffer.c | 2624 |
pgbuf_promote_read_latch_release | src/storage/page_buffer.c | 2628 |
pgbuf_unfix | src/storage/page_buffer.c | 2850 |
pgbuf_invalidate | src/storage/page_buffer.c | 3158 |
pgbuf_flush | src/storage/page_buffer.c | 3341 |
pgbuf_flush_with_wal | src/storage/page_buffer.c | 3364 |
pgbuf_flush_if_requested | src/storage/page_buffer.c | 3404 |
pgbuf_flush_all_helper | src/storage/page_buffer.c | 3438 |
pgbuf_get_victim_candidates_from_lru | src/storage/page_buffer.c | 3564 |
pgbuf_flush_victim_candidates | src/storage/page_buffer.c | 3645 |
pgbuf_flush_checkpoint | src/storage/page_buffer.c | 3960 |
pgbuf_flush_chkpt_seq_list | src/storage/page_buffer.c | 4102 |
pgbuf_flush_seq_list | src/storage/page_buffer.c | 4210 |
pgbuf_set_dirty | src/storage/page_buffer.c | 4700 |
pgbuf_set_lsa | src/storage/page_buffer.c | 4771 |
pgbuf_set_tde_algorithm | src/storage/page_buffer.c | 4881 |
pgbuf_set_bcb_page_vpid | src/storage/page_buffer.c | 5214 |
pgbuf_initialize_bcb_table | src/storage/page_buffer.c | 5334 |
pgbuf_initialize_hash_table | src/storage/page_buffer.c | 5452 |
pgbuf_initialize_lock_table | src/storage/page_buffer.c | 5481 |
pgbuf_initialize_lru_list | src/storage/page_buffer.c | 5519 |
pgbuf_initialize_aout_list | src/storage/page_buffer.c | 5582 |
pgbuf_initialize_invalid_list | src/storage/page_buffer.c | 5686 |
pgbuf_initialize_thrd_holder | src/storage/page_buffer.c | 5701 |
pgbuf_allocate_thrd_holder_entry | src/storage/page_buffer.c | 5783 |
pgbuf_find_thrd_holder | src/storage/page_buffer.c | 5870 |
pgbuf_remove_thrd_holder | src/storage/page_buffer.c | 5971 |
pgbuf_latch_bcb_upon_fix | src/storage/page_buffer.c | 6073 |
pgbuf_unlatch_bcb_upon_unfix | src/storage/page_buffer.c | 6417 |
pgbuf_unlatch_void_zone_bcb | src/storage/page_buffer.c | 6652 |
pgbuf_should_move_private_to_shared | src/storage/page_buffer.c | 6758 |
pgbuf_block_bcb | src/storage/page_buffer.c | 6803 |
pgbuf_timed_sleep_error_handling | src/storage/page_buffer.c | 6925 |
pgbuf_timed_sleep | src/storage/page_buffer.c | 7014 |
pgbuf_wakeup_reader_writer | src/storage/page_buffer.c | 7186 |
pgbuf_search_hash_chain | src/storage/page_buffer.c | 7327 |
pgbuf_lockfree_fix_ro | src/storage/page_buffer.c | 7452 |
pgbuf_search_hash_chain_no_bcb_lock | src/storage/page_buffer.c | 7517 |
pgbuf_insert_into_hash_chain | src/storage/page_buffer.c | 7569 |
pgbuf_lock_page | src/storage/page_buffer.c | 7718 |
pgbuf_unlock_page | src/storage/page_buffer.c | 7831 |
pgbuf_allocate_bcb | src/storage/page_buffer.c | 7916 |
pgbuf_claim_bcb_for_fix | src/storage/page_buffer.c | 8133 |
pgbuf_victimize_bcb | src/storage/page_buffer.c | 8372 |
pgbuf_invalidate_bcb | src/storage/page_buffer.c | 8424 |
pgbuf_bcb_safe_flush_force_unlock | src/storage/page_buffer.c | 8494 |
pgbuf_bcb_safe_flush_force_lock | src/storage/page_buffer.c | 8517 |
pgbuf_bcb_safe_flush_internal | src/storage/page_buffer.c | 8550 |
pgbuf_get_bcb_from_invalid_list | src/storage/page_buffer.c | 8644 |
pgbuf_put_bcb_into_invalid_list | src/storage/page_buffer.c | 8693 |
pgbuf_get_victim | src/storage/page_buffer.c | 8805 |
pgbuf_is_bcb_fixed_by_any | src/storage/page_buffer.c | 8995 |
pgbuf_is_bcb_victimizable | src/storage/page_buffer.c | 9023 |
pgbuf_get_victim_from_lru_list | src/storage/page_buffer.c | 9053 |
pgbuf_panic_assign_direct_victims_from_lru | src/storage/page_buffer.c | 9279 |
pgbuf_direct_victims_maintenance | src/storage/page_buffer.c | 9346 |
pgbuf_lfcq_assign_direct_victims | src/storage/page_buffer.c | 9388 |
pgbuf_lru_add_bcb_to_top | src/storage/page_buffer.c | 9432 |
pgbuf_lru_add_bcb_to_middle | src/storage/page_buffer.c | 9482 |
pgbuf_lru_add_bcb_to_bottom | src/storage/page_buffer.c | 9570 |
pgbuf_lru_fall_bcb_to_zone_3 | src/storage/page_buffer.c | 9788 |
pgbuf_lru_boost_bcb | src/storage/page_buffer.c | 9858 |
pgbuf_lru_move_from_private_to_shared | src/storage/page_buffer.c | 10064 |
pgbuf_remove_from_lru_list | src/storage/page_buffer.c | 10089 |
pgbuf_move_bcb_to_bottom_lru | src/storage/page_buffer.c | 10157 |
pgbuf_add_vpid_to_aout_list | src/storage/page_buffer.c | 10201 |
pgbuf_remove_vpid_from_aout_list | src/storage/page_buffer.c | 10282 |
pgbuf_bcb_flush_with_wal | src/storage/page_buffer.c | 10456 |
pgbuf_wake_flush_waiters | src/storage/page_buffer.c | 10694 |
pgbuf_is_exist_blocked_reader_writer | src/storage/page_buffer.c | 10741 |
pgbuf_wakeup | src/storage/page_buffer.c | 11319 |
pgbuf_set_dirty_buffer_ptr | src/storage/page_buffer.c | 11369 |
pgbuf_flush_page_and_neighbors_fb | src/storage/page_buffer.c | 11527 |
pgbuf_flush_neighbor_safe | src/storage/page_buffer.c | 11762 |
pgbuf_add_bufptr_to_batch | src/storage/page_buffer.c | 11820 |
pgbuf_ordered_fix_release | src/storage/page_buffer.c | 11985 |
pgbuf_ordered_unfix | src/storage/page_buffer.c | 12860 |
pgbuf_add_watch_instance_internal | src/storage/page_buffer.c | 12927 |
pgbuf_initialize_page_quota_parameters | src/storage/page_buffer.c | 13326 |
pgbuf_initialize_page_quota | src/storage/page_buffer.c | 13370 |
pgbuf_initialize_page_monitor | src/storage/page_buffer.c | 13430 |
pgbuf_adjust_quotas | src/storage/page_buffer.c | 13639 |
pgbuf_initialize_seq_flusher | src/storage/page_buffer.c | 14016 |
pgbuf_flush_control_from_dirty_ratio | src/storage/page_buffer.c | 14233 |
pgbuf_fix_if_not_deallocated_with_caller | src/storage/page_buffer.c | 14735 |
pgbuf_assign_direct_victim | src/storage/page_buffer.c | 14809 |
pgbuf_assign_flushed_pages | src/storage/page_buffer.c | 14876 |
pgbuf_get_thread_waiting_for_direct_victim | src/storage/page_buffer.c | 14946 |
pgbuf_get_direct_victim | src/storage/page_buffer.c | 14978 |
pgbuf_lru_advance_victim_hint | src/storage/page_buffer.c | 15131 |
pgbuf_bcb_update_flags | src/storage/page_buffer.c | 15171 |
pgbuf_bcb_change_zone | src/storage/page_buffer.c | 15269 |
pgbuf_bcb_get_zone | src/storage/page_buffer.c | 15374 |
pgbuf_bcb_get_zone | src/storage/page_buffer.c | 15375 |
pgbuf_bcb_get_lru_index | src/storage/page_buffer.c | 15386 |
pgbuf_bcb_get_lru_index | src/storage/page_buffer.c | 15387 |
pgbuf_bcb_is_dirty | src/storage/page_buffer.c | 15400 |
pgbuf_bcb_set_dirty | src/storage/page_buffer.c | 15412 |
pgbuf_bcb_mark_is_flushing | src/storage/page_buffer.c | 15463 |
pgbuf_bcb_mark_was_flushed | src/storage/page_buffer.c | 15486 |
pgbuf_bcb_mark_was_not_flushed | src/storage/page_buffer.c | 15500 |
pgbuf_bcb_is_flushing | src/storage/page_buffer.c | 15513 |
pgbuf_bcb_should_be_moved_to_bottom_lru | src/storage/page_buffer.c | 15561 |
pgbuf_notify_vacuum_follows | src/storage/page_buffer.c | 15574 |
pgbuf_bcb_is_to_vacuum | src/storage/page_buffer.c | 15589 |
pgbuf_bcb_avoid_victim | src/storage/page_buffer.c | 15603 |
pgbuf_bcb_register_avoid_deallocation | src/storage/page_buffer.c | 15627 |
pgbuf_bcb_unregister_avoid_deallocation | src/storage/page_buffer.c | 15640 |
pgbuf_bcb_should_avoid_deallocation | src/storage/page_buffer.c | 15684 |
pgbuf_bcb_register_fix | src/storage/page_buffer.c | 15720 |
pgbuf_bcb_is_hot | src/storage/page_buffer.c | 15741 |
pgbuf_lfcq_get_victim_from_private_lru | src/storage/page_buffer.c | 15802 |
pgbuf_lfcq_get_victim_from_shared_lru | src/storage/page_buffer.c | 15894 |
pgbuf_bcb_register_hit_for_lru | src/storage/page_buffer.c | 15979 |
pgbuf_get_page_flush_interval | src/storage/page_buffer.c | 16353 |
pgbuf_page_maintenance_execute | src/storage/page_buffer.c | 16375 |
pgbuf_page_flush_daemon_task | src/storage/page_buffer.c | 16396 |
pgbuf_page_maintenance_daemon_init | src/storage/page_buffer.c | 16531 |
pgbuf_page_flush_daemon_init | src/storage/page_buffer.c | 16549 |
pgbuf_page_post_flush_daemon_init | src/storage/page_buffer.c | 16567 |
pgbuf_is_page_flush_daemon_available | src/storage/page_buffer.c | 16673 |
pgbuf_is_temp_lsa | src/storage/page_buffer.c | 16683 |
pgbuf_init_temp_page_lsa | src/storage/page_buffer.c | 16689 |
PAGE_FETCH_MODE | src/storage/page_buffer.h | 172 |
PGBUF_LATCH_MODE | src/storage/page_buffer.h | 190 |
PGBUF_ORDERED_RANK | src/storage/page_buffer.h | 222 |
pgbuf_watcher | src/storage/page_buffer.h | 234 |
PGBUF_TEMP_LSA | src/storage/page_buffer.h | 258 |
PGBUF_ATOMIC_LATCH | src/storage/page_buffer.h | 365 |
Sources
- cubrid-page-buffer-manager.md: the high-level companion. See also cubrid-double-write-buffer.md (the flush path below) and cubrid-log-manager-detail.md (the WAL rule the flush obeys).
- Raw analyses under raw/code-analysis/cubrid/storage/buffer_manager/.
- Code: src/storage/page_buffer.{c,h}.
- Methodology: knowledge/methodology/code-analysis-detail-doc.md.