
CUBRID Page Buffer Manager — Code-Level Deep Dive

Where this document fits: The high-level analysis cubrid-page-buffer-manager.md covers design intent and theoretical background. This document traces every branch and field at the code level. Each chapter is self-contained, but reading in order follows the full lifecycle of a single data page inside the buffer pool — fix, latch, dirty, flush, victimize.

Contents:

| Ch | Title | Status |
|----|-------|--------|
| 1 | Data-Structure Map | |
| 2 | Initialization and Memory Layout | |
| 3 | The Fix Entry Path and Page-Table Lookup | |
| 4 | Miss Handling BCB Claim and the PGBUF Allocation Lock | |
| 5 | The BCB Atomic Latch Acquire Block and Wake | |
| 6 | Dirtying a Page and the Packed Flags and Zone Word | |
| 7 | Unfix LRU Movement Aout History and Private to Shared Migration | |
| 8 | Flushing Under the WAL Rule and the Flush Daemons | |
| 9 | Victim Selection the LFCQs and Direct Victim Hand-off | |
| 10 | Adaptive Quotas Ordered Fix and Special Paths | |

Chapter 1: Data-Structure Map

Field-level reference for every struct the page buffer manager owns. The high-level companion (cubrid-page-buffer-manager.md) explains why a buffer manager needs a BCB, a VPID hash, a free list, and multi-zone LRUs (see its ### Buffer Control Block — PGBUF_BCB and ### Three-zone LRU lists sections); we do not repeat that framing. What the companion simplified (the BCB latch, drawn there as a pthread_mutex_t + latch_mode pair) is corrected here against the real source. All structs are private to the translation unit src/storage/page_buffer.c except pgbuf_watcher / PGBUF_LATCH_MODE (public, in page_buffer.h); the single global is pgbuf_Pool of type PGBUF_BUFFER_POOL. Each table below uses one Role & rationale column to keep field coverage exhaustive at low cost.

Two BCB words are bit-packed; later chapters read them through accessor macros. The flags word (volatile int, 32 bits): BCB-flag bits in the high byte, the zone in bits 16-19, the LRU index in the low 16 bits.

// PGBUF zone + index layout -- src/storage/page_buffer.c
#define PGBUF_LRU_NBITS 16
#define PGBUF_LRU_INDEX_MASK (PGBUF_LRU_LIST_MAX_COUNT - 1) /* 0x0000FFFF */
PGBUF_LRU_1_ZONE = 1 << PGBUF_LRU_NBITS, /* 0x00010000 */
PGBUF_LRU_2_ZONE = 2 << PGBUF_LRU_NBITS, /* 0x00020000 */
PGBUF_LRU_3_ZONE = 3 << PGBUF_LRU_NBITS, /* 0x00030000 */
PGBUF_LRU_ZONE_MASK = PGBUF_LRU_1_ZONE | PGBUF_LRU_2_ZONE | PGBUF_LRU_3_ZONE,
PGBUF_INVALID_ZONE = 1 << (PGBUF_LRU_NBITS + 2), /* 0x00040000 */
PGBUF_VOID_ZONE = 2 << (PGBUF_LRU_NBITS + 2), /* 0x00080000 */

The two-bit skip is deliberate: LRU zones use bits 16-17, and INVALID/VOID jump to bits 18-19 so their values never overlap an LRU zone value (which sets bit 16, bit 17, or both). The flag bits sit in the top byte (PGBUF_BCB_DIRTY_FLAG 0x80000000, then ..._FLUSHING_TO_DISK, ..._VICTIM_DIRECT, ..._INVALIDATE_DIRECT_VICTIM, ..._MOVE_TO_LRU_BOTTOM, ..._TO_VACUUM, ..._ASYNC_FLUSH_REQ, descending one bit each). Read high-to-low, the word is [flag byte | reserved | zone 16-19 | lru index 0-15]. Flag semantics belong to Chapter 6.
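
The layout above can be exercised in isolation. The following is a minimal sketch, not the CUBRID accessors: the constants mirror the excerpt, but the helper names (zone_of, in_lru, lru_index_of), the use of uint32_t instead of volatile int, and the composition of ZONE_MASK from the five zone values are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Constants mirrored from the zone + index layout above.
constexpr std::uint32_t LRU_NBITS = 16;
constexpr std::uint32_t LRU_INDEX_MASK = (1u << LRU_NBITS) - 1u;   // 0x0000FFFF
constexpr std::uint32_t LRU_1_ZONE = 1u << LRU_NBITS;              // 0x00010000
constexpr std::uint32_t LRU_2_ZONE = 2u << LRU_NBITS;              // 0x00020000
constexpr std::uint32_t LRU_3_ZONE = 3u << LRU_NBITS;              // 0x00030000
constexpr std::uint32_t LRU_ZONE_MASK = LRU_1_ZONE | LRU_2_ZONE | LRU_3_ZONE;
constexpr std::uint32_t INVALID_ZONE = 1u << (LRU_NBITS + 2);      // 0x00040000
constexpr std::uint32_t VOID_ZONE = 2u << (LRU_NBITS + 2);         // 0x00080000
// Assumption: the full zone mask is the union of all five zone values.
constexpr std::uint32_t ZONE_MASK = LRU_ZONE_MASK | INVALID_ZONE | VOID_ZONE;
constexpr std::uint32_t DIRTY_FLAG = 0x80000000u;                  // top bit of the flag byte

// The two-bit skip: INVALID/VOID share no bit with any LRU zone value.
static_assert((INVALID_ZONE & LRU_ZONE_MASK) == 0, "INVALID overlaps an LRU zone");
static_assert((VOID_ZONE & LRU_ZONE_MASK) == 0, "VOID overlaps an LRU zone");

// Hypothetical helpers for illustration only.
constexpr std::uint32_t zone_of(std::uint32_t flags) { return flags & ZONE_MASK; }
constexpr bool in_lru(std::uint32_t flags) { return (flags & LRU_ZONE_MASK) != 0; }

inline std::uint32_t lru_index_of(std::uint32_t flags) {
  assert(in_lru(flags));            // the low 16 bits mean nothing outside an LRU zone
  return flags & LRU_INDEX_MASK;
}
```

A dirty BCB in zone 2 of list 7 would carry DIRTY_FLAG | LRU_2_ZONE | 7, and each helper recovers its own field without disturbing the others.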

The second packed word, count_fix_and_avoid_dealloc, splits via PGBUF_BCB_COUNT_FIX_SHIFT_BITS (16) and PGBUF_BCB_AVOID_DEALLOC_MASK (0x0000FFFF): high 16 a saturating fix counter (hot-page detection), low 16 an atomically-mutated avoid-deallocation count. Fused into one int because 2-byte atomics are not portable — the field comment says so verbatim.
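
The split of count_fix_and_avoid_dealloc can be sketched the same way. This is an illustrative model under stated assumptions: the shift and mask values come from the text, while the helper names, the unsigned representation, and the saturation cap of 0xFFFF are hypothetical (the real cap and the atomic update path are not shown in this chapter).

```cpp
#include <cassert>
#include <cstdint>

constexpr int COUNT_FIX_SHIFT_BITS = 16;                  // from the text
constexpr std::uint32_t AVOID_DEALLOC_MASK = 0x0000FFFFu; // from the text

// High half: fix counter. Low half: avoid-deallocation count.
constexpr std::uint32_t fix_count_of(std::uint32_t word) { return word >> COUNT_FIX_SHIFT_BITS; }
constexpr std::uint32_t avoid_dealloc_of(std::uint32_t word) { return word & AVOID_DEALLOC_MASK; }

// Saturating bump of the high half; the 0xFFFF cap is an assumption of this sketch.
constexpr std::uint32_t bump_fix_count(std::uint32_t word) {
  return fix_count_of(word) == 0xFFFFu ? word : word + (1u << COUNT_FIX_SHIFT_BITS);
}

// Bump of the low half; a carry out of bit 15 would corrupt the fix counter.
inline std::uint32_t add_avoid_dealloc(std::uint32_t word) {
  assert(avoid_dealloc_of(word) < AVOID_DEALLOC_MASK);
  return word + 1u;
}
```

Because both halves live in one 32-bit word, a single full-width atomic operation can update either half, which is the portability argument the field comment makes.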

1.2 pgbuf_atomic_latch_impl — the real BCB latch


This is the single biggest correction to the high-level doc. The latch is not a mutex plus a latch_mode int; it is a 64-bit atomic word reinterpreted through a union:

// PGBUF_ATOMIC_LATCH + union pgbuf_atomic_latch_impl -- page_buffer.h / page_buffer.c
typedef std::atomic<uint64_t> PGBUF_ATOMIC_LATCH;
union pgbuf_atomic_latch_impl
{
  uint64_t raw;
  struct
  {
    PGBUF_LATCH_MODE latch_mode; /* uint16_t enum: NO/READ/WRITE/FLUSH/INVALID */
    uint16_t waiter_exists; /* a thread is parked on next_wait_thrd */
    int32_t fcnt; /* current fix count under the latch */
  } impl;
};

The BCB’s atomic_latch field is the std::atomic<uint64_t> itself. Code loads it (memory_order_acquire) into a stack PGBUF_ATOMIC_LATCH_IMPL, edits the sub-fields, CAS-es the whole raw back (set_latch_and_fcnt, set_latch_and_add_fcnt; get_latch does the read half).

| Field | Role & rationale |
|-------|------------------|
| raw | The whole 64-bit word; CAS updates mode+waiter+fcnt in one instruction, no hot-path mutex |
| impl.latch_mode | Current mode (PGBUF_LATCH_MODE, uint16_t); separates shared-read from exclusive-write |
| impl.waiter_exists | 1 if a thread is parked; tells the unlatcher to wake next_wait_thrd |
| impl.fcnt | Fix count under the latch; read mode allows fcnt > 1, releases at 0 |

PGBUF_LATCH_MODE is uint16_t-backed to fit the union’s first two bytes: PGBUF_NO_LATCH=0, _READ=1, _WRITE=2, _FLUSH=3 (block mode only — a page is never fixed in flush mode), _INVALID=4.

Invariant — the union is only ever touched as a whole raw word. Writing latch_mode directly through the live atomic would tear the 64-bit word and lose fcnt/waiter_exists updates racing on the other halves. The BCB’s mutex guards list/flag transitions, not the latch. Chapter 5 traces the CAS loop branch-by-branch.
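
The load, edit, CAS-back pattern can be sketched as follows. This is a simplified model, not set_latch_and_add_fcnt: the admission policy (refuse when a waiter exists or a non-read mode is held) and the names LatchImpl, pack, unpack, and try_read_fix are assumptions, and the sketch uses memcpy rather than a union so that it stays within standard C++.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <cstring>

enum LatchMode : std::uint16_t {
  NO_LATCH = 0, LATCH_READ = 1, LATCH_WRITE = 2, LATCH_FLUSH = 3, LATCH_INVALID = 4
};

// Same three sub-fields as the union's impl struct, 8 bytes in total.
struct LatchImpl {
  std::uint16_t latch_mode;
  std::uint16_t waiter_exists;
  std::int32_t fcnt;
};
static_assert(sizeof(LatchImpl) == sizeof(std::uint64_t), "must fit one atomic word");

inline LatchImpl unpack(std::uint64_t raw) {
  LatchImpl v;
  std::memcpy(&v, &raw, sizeof v);
  return v;
}

inline std::uint64_t pack(const LatchImpl &v) {
  std::uint64_t raw;
  std::memcpy(&raw, &v, sizeof raw);
  return raw;
}

// Hypothetical helper: add one read fix unless a writer holds the latch or a waiter is queued.
inline bool try_read_fix(std::atomic<std::uint64_t> &latch) {
  std::uint64_t old_raw = latch.load(std::memory_order_acquire);
  for (;;) {
    LatchImpl v = unpack(old_raw);            // edit a private copy, never the live word
    if (v.waiter_exists != 0) return false;
    if (v.latch_mode != NO_LATCH && v.latch_mode != LATCH_READ) return false;
    v.latch_mode = LATCH_READ;
    v.fcnt += 1;
    // Publish mode and fcnt together; on failure old_raw is refreshed and we retry.
    if (latch.compare_exchange_weak(old_raw, pack(v),
                                    std::memory_order_acq_rel,
                                    std::memory_order_acquire)) {
      return true;
    }
  }
}
```

The point of the sketch is the invariant: every sub-field change is computed on a stack copy and lands through one compare-exchange of the whole word, so a concurrent update to another sub-field forces a retry instead of being lost.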

1.3 pgbuf_bcb — the buffer control block, field by field

// struct pgbuf_bcb -- src/storage/page_buffer.c (condensed; SERVER_MODE fields elided)
struct pgbuf_bcb
{
  VPID vpid;
  PGBUF_ATOMIC_LATCH atomic_latch; /* the 64-bit union latch from 1.2 */
  volatile int flags; /* flag byte | zone | lru index (1.1) */
  PGBUF_BCB *hash_next, *prev_BCB, *next_BCB;
  int tick_lru_list, tick_lru3;
  volatile int count_fix_and_avoid_dealloc; /* two-purpose; see 1.1 */
  int hit_age;
  LOG_LSA oldest_unflush_lsa;
  PGBUF_IOPAGE_BUFFER *iopage_buffer;
  // ... condensed: mutex, owner_mutex, next_wait_thrd, latch_last_thread under #if SERVER_MODE ...
};

| Field | Role & rationale |
|-------|------------------|
| mutex (SM) | Per-BCB pthread_mutex_t; serializes list/flag transitions the latch doesn’t cover |
| owner_mutex (SM) | Index of thread holding mutex; assert aid for wrong-owner/double unlock |
| vpid | Volume+page id of resident page; the hash key |
| atomic_latch | R/W page latch (union, 1.2); user-level latch off the kernel-mutex hot path |
| flags | Packed flag-byte + zone + LRU index; one atomically-readable replacement/dirty word |
| next_wait_thrd (SM) | FIFO head of threads blocked on the latch; waiter_exists signals that this queue is non-empty |
| latch_last_thread (SM) | Last thread that latched; diagnostic trail |
| hash_next | Next BCB in hash bucket chain; collision chaining (1.5) |
| prev_BCB | Previous LRU node; doubly-linked LRU gives O(1) unlink |
| next_BCB | Next LRU node or free-list next; reused per the §1.3 invariant |
| tick_lru_list | List tick when BCB entered its LRU; compared against the list’s tick_list to decide age boost |
| tick_lru3 | Position stamp inside zone 3; tells victim-hint which zone-3 BCB is lowest |
| count_fix_and_avoid_dealloc | Hi 16 fix count, lo 16 avoid-dealloc; hot-page detection fused with dealloc protection (1.1) |
| hit_age | Age stamp of last hit; feeds activity/quota (Ch 10) |
| oldest_unflush_lsa | Oldest LSA of unflushed change; WAL anchor: page not written until log durable (Ch 8) |
| iopage_buffer | Pointer to this BCB’s payload slot; separates control metadata from aligned payload |

Invariant — next_BCB belongs to exactly one list at a time. A BCB is in an LRU list, the invalid free list, or transiently PGBUF_VOID_ZONE (neither). The flags zone field is the source of truth; zone change and next_BCB relink must be one critical section or a BCB appears in two lists. Chapter 7 traces the relink.

1.4 pgbuf_iopage_buffer — the page payload slot

// struct pgbuf_iopage_buffer -- src/storage/page_buffer.c
struct pgbuf_iopage_buffer
{
PGBUF_BCB *bcb; /* back-pointer to owning BCB */
#if (__WORDSIZE == 32)
int dummy; /* pad so iopage starts 8-byte aligned */
#endif
FILEIO_PAGE iopage; /* the actual buffered IO page */
};

| Field | Role & rationale |
|-------|------------------|
| bcb | Back-pointer to owning pgbuf_bcb; a PAGE_PTR into iopage.page recovers the BCB via CAST_PGPTR_TO_BFPTR |
| dummy (32-bit) | 4-byte filler; on 32-bit bcb is 4 bytes, so the pad pushes iopage to offset 8 to align LOG_LSA |
| iopage | Embedded FILEIO_PAGE; one allocation holds control + on-disk image inline |

The first bytes of iopage are the header (prv), whose lsa and ptype the WAL/recovery paths read directly:

// struct fileio_page -- src/storage/file_io.h
struct fileio_page_reserved { LOG_LSA lsa; INT32 pageid; INT16 volid;
unsigned char ptype; unsigned char pflag; /* ... condensed ... */ };
struct fileio_page_watermark { LOG_LSA lsa; /* duplicates prv.lsa */ };
struct fileio_page
{
FILEIO_PAGE_RESERVED prv; /* system area at start */
char page[1]; /* user area */
FILEIO_PAGE_WATERMARK prv2; /* end-of-page watermark, duplicates prv.lsa */
};

fileio_page is not header-only: a trailing prv2 (FILEIO_PAGE_WATERMARK) sits at end-of-page holding a copy of prv.lsa. Because page[1] is a one-element placeholder for the variable-length user area, the declared layout is logical, not literal: fileio_get_page_watermark_pos computes prv2’s real address from the page size rather than dereferencing the member.

Invariant — iopage_buffer->bcb round-trips and prv.lsa never moves backward. The back-pointer (bcb->iopage_buffer->bcb == bcb) is set once at init (Ch 2). prv.lsa is the page’s durable-recovery watermark — it advances as the page changes, is mirrored into prv2 at flush, and is what oldest_unflush_lsa and the WAL rule (Ch 8) compare against.
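
The back-pointer round trip can be illustrated with stand-in types. This sketch assumes that the recovery works by subtracting the fixed offset of the user area from the page pointer; the struct names and the helpers slot_from_page_ptr and bcb_from_page_ptr are hypothetical and only model what CAST_PGPTR_TO_BFPTR is described as doing.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

struct Bcb;  // stand-in for PGBUF_BCB

// Stand-ins for the page header and the embedded IO page (trailing watermark omitted).
struct PageReserved {
  std::int64_t lsa;
  std::int32_t pageid;
  std::int16_t volid;
  unsigned char ptype;
  unsigned char pflag;
};
struct IoPage {
  PageReserved prv;  // system area at the start of the page
  char page[1];      // first byte of the user area
};
struct IoPageBuffer {
  Bcb *bcb;          // back-pointer to the owning BCB
  IoPage iopage;
};
struct Bcb {
  IoPageBuffer *iopage_buffer;
};

// Step back from a user-area pointer to the slot that contains it.
inline IoPageBuffer *slot_from_page_ptr(char *pgptr) {
  const std::size_t off = offsetof(IoPageBuffer, iopage) + offsetof(IoPage, page);
  return reinterpret_cast<IoPageBuffer *>(pgptr - off);
}

// Follow the slot's back-pointer to reach the control block.
inline Bcb *bcb_from_page_ptr(char *pgptr) {
  return slot_from_page_ptr(pgptr)->bcb;
}
```

Callers only ever hold the user-area pointer, so this constant-offset step plus one pointer load is what lets an unfix or set-dirty call find the BCB without a hash lookup.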

1.5 The VPID hash: pgbuf_buffer_hash and pgbuf_buffer_lock


pgbuf_buffer_hash is { pthread_mutex_t hash_mutex; PGBUF_BCB *hash_next; PGBUF_BUFFER_LOCK *lock_next; }; pgbuf_buffer_lock is { VPID vpid; PGBUF_BUFFER_LOCK *lock_next; THREAD_ENTRY *next_wait_thrd; } (mutex/thread fields under #if SERVER_MODE).

| Struct.Field | Role & rationale |
|--------------|------------------|
| buffer_hash.hash_mutex | Bucket lock; protects both chains in the bucket |
| buffer_hash.hash_next | Resident-BCB chain head; lookup walks it matching vpid (Ch 3) |
| buffer_hash.lock_next | Pending PGBUF-lock chain head; VPIDs being read in, not yet a BCB |
| buffer_lock.vpid | VPID reserved for read-in; a second fixer finds the in-flight read and waits |
| buffer_lock.lock_next | Next lock record in bucket; chains concurrent in-flight reads |
| buffer_lock.next_wait_thrd (SM) | Queue waiting on this read-in; woken when the page lands |

The buffer-lock table is fixed-size — one record per thread, since at most one outstanding read per thread. Chapter 4 traces how a miss claims a lock before allocating a BCB.

flowchart LR
  H["pgbuf_buffer_hash[bucket]"] -->|hash_next| B1["BCB"] -->|hash_next| B2["BCB"]
  H -->|lock_next| K1["pgbuf_buffer_lock vpid=A"] -->|lock_next| K2["pgbuf_buffer_lock vpid=B"]
  K1 -->|next_wait_thrd| T["waiting threads"]

Figure 1-1: one bucket anchors a resident-BCB chain and a pending-lock chain under one hash_mutex.

struct pgbuf_lru_list holds, after #if SERVER_MODE pthread_mutex_t mutex: PGBUF_BCB *top, *bottom, *bottom_1, *bottom_2; PGBUF_BCB *volatile victim_hint; int count_lru1/2/3, count_vict_cand, threshold_lru1/2, quota, tick_list, tick_lru3; volatile int flags; int index;.

| Field | Role & rationale |
|-------|------------------|
| mutex (SM) | List lock; protects link integrity |
| top / bottom | Head (MRU) / tail (LRU); new/boosted link at top, eviction ends at bottom |
| bottom_1 | Last BCB of zone 1 (NULL if empty); O(1) zone-1→2 boundary move |
| bottom_2 | Last BCB of zone 2 (NULL if empty); zone 2/3 boundary marker |
| victim_hint | Volatile victim-scan start; avoids re-walking pinned BCBs each search |
| count_lru1/2/3 | Per-zone BCB counts; drive rebalancing and victim availability |
| count_vict_cand | Victimizable BCB count; lets victim search skip empty lists (Ch 9) |
| threshold_lru1/2 | Target sizes of zones 1/2; a BCB falls a zone when its zone exceeds threshold, zone 3 is the rest |
| quota | Target size of a private list; adaptive per-session (Ch 10), unused for shared |
| tick_list | Bumped on add/boost; BCB stores its entry tick, the difference gauges staleness |
| tick_lru3 | Bumped when a BCB falls to zone 3; stamps bcb->tick_lru3 for victim-hint order |
| flags | Per-list flag word; marks bulk/quota list state |
| index | This list’s index in buf_LRU_list[]; stored into BCB flags low 16 so a BCB knows its home list |

Invariant — victim_hint may drift below the true first victim, and the code tolerates it. Everything below the hint should be dirty, but TPCC core dumps showed it sometimes sitting before the first victimizable BCB — a known unfixed bug flagged TODO. Consumers treat the hint as a start point and re-validate every candidate; trusting it as exact would skip valid victims or victimize a dirty page. Chapter 9 walks the scan.

struct pgbuf_invalid_list is { pthread_mutex_t invalid_mutex (SERVER_MODE); PGBUF_BCB *invalid_top; int invalid_cnt; }.

| Field | Role & rationale |
|-------|------------------|
| invalid_mutex (SM) | Free-list lock; serializes push/pop of free BCBs |
| invalid_top | Free-chain head (links via BCB next_BCB); a miss pops here before victimizing (Ch 4) |
| invalid_cnt | Count of free BCBs; “pool empty → victimize” without walking the chain |

A free BCB is PGBUF_INVALID_ZONE and uses only next_BCB; prev_BCB is unused (per §1.3 invariant).

1.8 The holder triad — per-thread fix bookkeeping


A thread records each fix in a pgbuf_holder it owns, not in the BCB. Ownership chain: pgbuf_holder_anchor (per-thread head) → pgbuf_holder (one per held page) → pgbuf_holder_set (the slab holders are carved from).

// pgbuf_holder / _anchor / _set -- src/storage/page_buffer.c (condensed)
struct pgbuf_holder {
int fix_count; PGBUF_BCB *bufptr;
PGBUF_HOLDER *thrd_link, *next_holder; /* hold-list / free-list links */
PGBUF_HOLDER_STAT perf_stat; /* #if !NDEBUG: char fixed_at[64*1024]; int fixed_at_size; */
int watch_count; PGBUF_WATCHER *first_watcher, *last_watcher; };
// pgbuf_holder_anchor: { int num_free_cnt, num_hold_cnt; PGBUF_HOLDER *thrd_free_list, *thrd_hold_list; }
// pgbuf_holder_set: { PGBUF_HOLDER element[PGBUF_NUM_ALLOC_HOLDER /*==10*/]; PGBUF_HOLDER_SET *next_set; }

| Struct.Field | Role & rationale |
|--------------|------------------|
| holder.fix_count | Re-fix depth on this BCB; only the last unfix releases the latch |
| holder.bufptr | The held BCB; links per-thread holder to shared BCB |
| holder.thrd_link | Next in hold list; lets pgbuf_unfix_all walk every held page |
| holder.next_holder | Next in free list; recycles slots without re-alloc |
| holder.perf_stat | PGBUF_HOLDER_STAT flags; perf accounting of page usage |
| holder.fixed_at/fixed_at_size (dbg) | Fix call-site capture; debug where-fixed tracing |
| holder.watch_count | Watchers attached here; ordered-fix watchers (Ch 10) hang off the holder |
| holder.first_watcher/last_watcher | Watcher list ends; O(1) append/detach during ordered fix |
| anchor.num_free_cnt/num_hold_cnt | Free/used counters; fast “need a new slab?” decision |
| anchor.thrd_free_list/thrd_hold_list | Free/hold list heads; thread’s private view of its fixes |
| holder_set.element[10] | Slab of 10 pre-allocated holders; handed out in batches, never returned |
| holder_set.next_set | Next slab; the global free-holder pool is a list of slabs |

1.9 pgbuf_watcher — the ordered-fix bit-field

// struct pgbuf_watcher -- src/storage/page_buffer.h (debug magic/strings elided)
struct pgbuf_watcher {
PAGE_PTR pgptr; PGBUF_WATCHER *next, *prev;
PGBUF_ORDERED_GROUP group_id; /* VPID of group's HEAP header */
unsigned latch_mode:7; /* requested latch mode */
unsigned page_was_unfixed:1; /* set if any refix occurred */
unsigned initial_rank:4; /* rank at init */
unsigned curr_rank:4; }; /* rank after fix */

| Field | Role & rationale |
|-------|------------------|
| pgptr | The watched page handle; what the caller reads/writes |
| next/prev | Links in the holder’s watcher list; a page may carry several, O(1) detach |
| group_id | VPID of the grouping heap header; ordered fix orders pages within a group to avoid deadlock |
| latch_mode:7 | Requested latch mode; 7 bits cover the small enum, packed tight |
| page_was_unfixed:1 | Set on unfix+refix during reorder; cached pgptr may have moved, revalidate |
| initial_rank:4 | Rank at watcher init; desired fix order before reorder |
| curr_rank:4 | Rank after fixing; detects out-of-order fixes |

Chapter 10 traces the ordered-fix reorder loop driving these bits.

1.10 pgbuf_buffer_pool — the global root


pgbuf_Pool ties everything together. Fields a modifier must know:

| Field | Role & rationale |
|-------|------------------|
| num_buffers | Total BCB frames (floored at 10 × MAX_NTRANS, see §2.1); fixed pool size bounding every table |
| BCB_table | PGBUF_BCB[]; the control blocks |
| buf_hash_table | PGBUF_BUFFER_HASH[]; the VPID hash (1.5) |
| buf_lock_table | PGBUF_BUFFER_LOCK[]; one pending-read record per thread (1.5) |
| iopage_table | PGBUF_IOPAGE_BUFFER[]; page payloads, parallel to BCB_table |
| num_LRU_list | Number of shared LRU lists; first slice of buf_LRU_list |
| ratio_lru1/ratio_lru2 | Zone-1/2 size ratios; seed each list’s threshold_lru1/2 |
| buf_LRU_list | PGBUF_LRU_LIST[] shared+garbage+private; one backing array, index decides class |
| buf_AOUT_list | PGBUF_AOUT_LIST victim history; the “Aout” half of 2Q (Ch 7) |
| buf_invalid_list | PGBUF_INVALID_LIST free pool (1.7); source of fresh BCBs |
| victim_cand_list | Victim-candidate array; flush daemon working set (Ch 8) |
| seq_chkpt_flusher | PGBUF_SEQ_FLUSHER; rate-controlled checkpoint flush state |
| monitor | PGBUF_PAGE_MONITOR; dirty count, per-LRU hits, victim/fix counters |
| quota | PGBUF_PAGE_QUOTA; private-list quota tuning (Ch 10) |
| thrd_holder_info | PGBUF_HOLDER_ANCHOR[] per thread; per-thread holder anchors (1.8) |
| thrd_reserved_holder | Backing memory for all holders; pre-reserved holder space |
| free_holder_set_mutex (SM) | Shared free-holder pool lock; serializes slab hand-out |
| free_holder_set/free_index | First slab with free entries, first free slot; global holder allocator cursor |
| check_for_interrupts | Set when interrupts must be checked; log mgr toggles under TR_TABLE_CS |
| is_flushing_victims/is_checkpoint (SM) | Daemon-state flags; coordinate flush vs. checkpoint |
| direct_victims (SM) | Victim array + two priority waiter LFCQs; direct victim hand-off (Ch 9) |
| flushed_bcbs (SM) | LFCQ of post-flush BCBs; post-flush processing queue |
| private/big_private/shared_lrus_with_victims | Three LFCQs of LRU indices with victims; victim search consults these vs scanning all (Ch 9) |
| show_status/_old/_snapshot/_mutex | SHOW STATUS reporting state; statistics surfaced to SHOW queries |

The accessors below are the only sanctioned readers of the packed flags word; the rest of the code never masks flags by hand.

// pgbuf_bcb_get_zone / _get_lru_index / PGBUF_IS_BCB_IN_LRU -- page_buffer.c
STATIC_INLINE PGBUF_ZONE pgbuf_bcb_get_zone (const PGBUF_BCB * bcb)
{ return PGBUF_GET_ZONE (bcb->flags); } /* (flags & PGBUF_ZONE_MASK) */
STATIC_INLINE int pgbuf_bcb_get_lru_index (const PGBUF_BCB * bcb)
{ assert (PGBUF_IS_BCB_IN_LRU (bcb)); /* <- precondition */
return PGBUF_GET_LRU_INDEX (bcb->flags); } /* (flags & 0x0000FFFF) */
#define PGBUF_IS_BCB_IN_LRU(bcb) ((pgbuf_bcb_get_zone (bcb) & PGBUF_LRU_ZONE_MASK) != 0)

Branch analysis:

  1. pgbuf_bcb_get_zone — single unconditional return, no error path. Every legal flags yields exactly one of the five PGBUF_ZONE values; a corrupted flags falls to the caller’s switch default.
  2. pgbuf_bcb_get_lru_index — two branches via the assert. Debug, BCB in an LRU zone: assert passes, returns the low 16 bits (the home list’s index, §1.6). Debug, not in an LRU zone (INVALID/VOID): assert fires — those bits are meaningless. In release the assert is compiled out and the function returns whatever the low 16 bits hold, so callers own the precondition; hence call sites guard with PGBUF_IS_BCB_IN_LRU.
  3. PGBUF_IS_BCB_IN_LRU — one boolean, two outcomes. ANDs the zone against PGBUF_LRU_ZONE_MASK (zones 1/2/3). INVALID/VOID set bits outside that mask → false; any LRU zone → true. The gate making branch 2 safe.
flowchart TB
  POOL["pgbuf_buffer_pool (pgbuf_Pool)"]
  POOL -->|BCB_table| BCB["pgbuf_bcb[]"]
  POOL -->|iopage_table| IOP["pgbuf_iopage_buffer[]"]
  POOL -->|buf_hash_table| HASH["pgbuf_buffer_hash[]"]
  POOL -->|buf_lock_table| LOCK["pgbuf_buffer_lock[]"]
  POOL -->|buf_LRU_list| LRU["pgbuf_lru_list[] shared+garbage+private"]
  POOL -->|buf_invalid_list| INV["pgbuf_invalid_list"]
  POOL -->|thrd_holder_info| ANC["pgbuf_holder_anchor[] per thread"]
  POOL -->|free_holder_set| SET["pgbuf_holder_set slabs"]
  HASH -->|hash_next| BCB
  HASH -->|lock_next| LOCK
  BCB -->|iopage_buffer| IOP
  IOP -->|bcb| BCB
  BCB -->|atomic_latch| LATCH["pgbuf_atomic_latch_impl union"]
  LRU -->|top/bottom/bottom_1/bottom_2/victim_hint| BCB
  INV -->|invalid_top via next_BCB| BCB
  ANC -->|thrd_hold_list| HLD["pgbuf_holder"]
  SET -->|element[10]| HLD
  HLD -->|bufptr| BCB
  HLD -->|first_watcher/last_watcher| WAT["pgbuf_watcher"]

Figure 1-2: the full pointer panorama. Every later chapter operates on a sub-graph of this picture.

  1. The BCB latch is a 64-bit atomic union, not a mutex. pgbuf_atomic_latch_impl packs latch_mode+waiter_exists+fcnt into one std::atomic<uint64_t>, touched only as the whole raw word. The BCB’s mutex guards list/flag transitions, not the latch.
  2. flags is three things in one word: flag byte (bits 24-31), zone (bits 16-19, with INVALID/VOID skipping two bits so they never overlap the LRU zone values in bits 16-17), LRU index (bits 0-15). Read it only through the §1.11 accessors.
  3. count_fix_and_avoid_dealloc is two fused counters — hi 16 a saturating fix count, lo 16 an atomic avoid-dealloc count, fused because 2-byte atomics aren’t portable.
  4. next_BCB is shared between the LRU and free lists; the zone is the single source of truth. Relink and zone change must be one critical section.
  5. victim_hint is advisory and can drift below the true first victim — a known unfixed bug; treat it as a start point and re-validate.
  6. The iopage is embedded with a back-pointer and an alignment pad, so PAGE_PTR → BCB round-trips and FILEIO_PAGE stays 8-byte aligned; its prv.lsa (mirrored at end-of-page in prv2) is the WAL/recovery watermark.
  7. Fixes are bookkept per-thread in the holder triad (anchor → holder → slab of 10); pgbuf_bcb_get_lru_index is valid only when PGBUF_IS_BCB_IN_LRU holds — the assert encodes that precondition.

Chapter 2: Initialization and Memory Layout


This chapter answers: where does every page-buffer structure come from at server start, and how is each table sized, allocated, and cross-wired before the first pgbuf_fix runs? The high-level companion (cubrid-page-buffer-manager.md) names the players in its CUBRID’s Approach section (BCB, page table, invalid list, three-zone LRU, private LRUs, Aout, LFCQs) and Ch. 1 gives the struct map; neither is re-derived below.

Everything lives in one file-scope object, pgbuf_Pool (type PGBUF_BUFFER_POOL). pgbuf_initialize is the orchestrator: it zeroes pgbuf_Pool field-by-field, derives sizes, then calls ten sub-initializers in a fixed order — quota parameters first (they fix the LRU count), then the dependent tables, then the quota/monitor arrays sized by PGBUF_TOTAL_LRU_COUNT.

2.1 pgbuf_initialize — the orchestrator and its size derivation


The function opens with per-field zeroing plus memset on embedded sub-structs. The std::atomic_int members of monitor cannot be memset, so they use .store(0). This manual reset is what makes pgbuf_finalize safe on a half-built pool: every pointer is NULL before any allocation (scalar sentinels like free_index are set to 0 here and re-set to -1 later in §2.9). num_buffers is read then floored; the two LRU-zone ratios are each clamped:

// pgbuf_initialize -- src/storage/page_buffer.c
pgbuf_Pool.num_buffers = prm_get_integer_value (PRM_ID_PB_NBUFFERS);
if (pgbuf_Pool.num_buffers < PGBUF_MINIMUM_BUFFERS) /* MAX_NTRANS * 10 */
  pgbuf_Pool.num_buffers = PGBUF_MINIMUM_BUFFERS; /* <- silent floor, never error */
pgbuf_Pool.ratio_lru1 = prm_get_float_value (PRM_ID_PB_LRU_HOT_RATIO);
pgbuf_Pool.ratio_lru2 = prm_get_float_value (PRM_ID_PB_LRU_BUFFER_RATIO);
pgbuf_Pool.ratio_lru1 = MAX (pgbuf_Pool.ratio_lru1, PGBUF_LRU_ZONE_MIN_RATIO); /* clamp lru1 into */
pgbuf_Pool.ratio_lru1 = MIN (pgbuf_Pool.ratio_lru1, PGBUF_LRU_ZONE_MAX_RATIO); /* [0.05f, 0.90f] */
pgbuf_Pool.ratio_lru2 = MAX (pgbuf_Pool.ratio_lru2, PGBUF_LRU_ZONE_MIN_RATIO); /* lru2 floor */
pgbuf_Pool.ratio_lru2 = MIN (pgbuf_Pool.ratio_lru2, 1.0f - PGBUF_LRU_ZONE_MIN_RATIO - pgbuf_Pool.ratio_lru1);

The ratios are only stored here; they govern LRU1/LRU2 thresholds in Ch. 6/7. Two asserts follow: ratio_lru2 stays in [0.05, 0.90] and the two-zone sum stays in [0.099, 0.951]. Each sub-initializer failure does goto error, which calls pgbuf_finalize.
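
The clamping arithmetic is easy to check in isolation. The sketch below is an illustrative restatement, assuming the 0.05 and 0.90 bounds shown in the excerpt's comments; the ZoneRatios type and clamp_zone_ratios function are hypothetical names, not part of page_buffer.c.

```cpp
#include <algorithm>
#include <cassert>

constexpr float ZONE_MIN_RATIO = 0.05f;  // bound taken from the excerpt's comment
constexpr float ZONE_MAX_RATIO = 0.90f;  // bound taken from the excerpt's comment

struct ZoneRatios {
  float lru1;
  float lru2;
};

// Clamp lru1 into [min, max]; floor lru2 at min, then cap it so that
// lru1 + lru2 leaves at least ZONE_MIN_RATIO of the list for zone 3.
inline ZoneRatios clamp_zone_ratios(float lru1, float lru2) {
  lru1 = std::min(std::max(lru1, ZONE_MIN_RATIO), ZONE_MAX_RATIO);
  lru2 = std::max(lru2, ZONE_MIN_RATIO);
  lru2 = std::min(lru2, 1.0f - ZONE_MIN_RATIO - lru1);
  assert(lru1 + lru2 <= 0.951f);  // same upper bound the text cites for the two-zone sum
  return {lru1, lru2};
}
```

For example, a requested (0.95, 0.50) collapses to roughly (0.90, 0.05): lru1 hits its ceiling, and lru2 is capped by what remains after reserving the zone-3 minimum.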

flowchart TD
  A["pgbuf_initialize"] --> B["pgbuf_initialize_page_quota_parameters\nfixes num_private_LRU_list"]
  B --> C["pgbuf_initialize_bcb_table\nBCB_table + iopage_table"]
  C --> D["pgbuf_initialize_hash_table\n2^20 buckets"]
  D --> E["pgbuf_initialize_lock_table\none record per thread"]
  E --> F["pgbuf_initialize_lru_list\nfixes num_LRU_list, builds shared+private"]
  F --> G["pgbuf_initialize_invalid_list\nall BCBs seeded here"]
  G --> H["pgbuf_initialize_aout_list"]
  H --> I["pgbuf_initialize_thrd_holder\npre-allocate holder sets"]
  I --> J["pgbuf_initialize_page_quota\narrays sized by TOTAL_LRU_COUNT"]
  J --> K["pgbuf_initialize_page_monitor\nlru_hits/lru_activity arrays"]
  K --> L["victim_cand_list + seq_chkpt_flusher\n+ SERVER_MODE LFCQs + show_status"]
  B -.error.-> Z["goto error -> pgbuf_finalize"]
  C -.error.-> Z
  F -.error.-> Z
  L --> M["NO_ERROR"]

Figure 2-1. The ten sub-initializers in call order. Quota parameters are first because they set num_private_LRU_list, which feeds PGBUF_TOTAL_LRU_COUNT, which sizes the LRU, quota, and monitor arrays. Note the invalid list (G) is seeded before the Aout list (H).

After the ten, the orchestrator allocates victim_cand_list (one per buffer), sizes the checkpoint flusher at MIN(0.25 * num_buffers, 65536), and under SERVER_MODE allocates the direct-victim array (bcb_victims, one per thread) and three lockfree::circular_queue objects (waiter_threads_high_priority, waiter_threads_low_priority, flushed_bcbs). The private/big_private_lrus_with_victims queues are created only if PGBUF_PAGE_QUOTA_IS_ENABLED; shared_lrus_with_victims always. Finally show_status (MAX_NTRANS + 1 records) is allocated and zeroed. These LFCQs and the daemons are Ch. 8–9.

Invariant — “every pointer NULL before first allocation.” Enforcement: the opening per-field reset NULLs every pointer member, so error: can call pgbuf_finalize at any point and finalize frees exactly what was allocated (every free is != NULL-guarded). What breaks: a pointer field added but not NULL-initialized here would feed garbage to free_and_init on a mid-init failure — a free of uninitialized memory.

2.2 pgbuf_initialize_bcb_table — BCB/iopage allocation, cross-linking, alignment


Two parallel arrays are allocated and each validated with MEM_SIZE_IS_VALID: BCB_table (metadata, num_buffers * PGBUF_BCB_SIZEOF) and iopage_table (page frames, num_buffers * PGBUF_IOPAGE_BUFFER_SIZE). Both iopage failure branches (bad-size and OOM) roll back BCB_table themselves (free_and_init, guarded != NULL) rather than relying on finalize, then return ER_PRM_BAD_VALUE / ER_OUT_OF_VIRTUAL_MEMORY. The per-BCB loop then initializes each BCB and cross-links it to its iopage frame symmetrically; next_BCB chains every BCB into one forward list (last NULL) that the invalid list inherits:

// pgbuf_initialize_bcb_table -- src/storage/page_buffer.c
for (i = 0; i < pgbuf_Pool.num_buffers; i++)
  {
    bufptr = PGBUF_FIND_BCB_PTR (i); /* base + i * sizeof(PGBUF_BCB) */
    pthread_mutex_init (&bufptr->mutex, NULL);
    VPID_SET_NULL (&bufptr->vpid);
    placement_new (&bufptr->atomic_latch, 0); /* C++ atomic needs placement-new, not memset */
    bufptr->atomic_latch.store (impl.raw); /* impl = {mode INVALID, no waiter, fcnt 0}; Ch.5 */
    bufptr->next_BCB = (i == pgbuf_Pool.num_buffers - 1) ? NULL : PGBUF_FIND_BCB_PTR (i + 1); /* chain */
    bufptr->flags = PGBUF_BCB_INIT_FLAGS; /* == PGBUF_INVALID_ZONE, no other flag */
    /* ... clear hash_next/prev_BCB/count_fix_and_avoid_dealloc/hit_age/oldest_unflush_lsa/ticks ... */
    ioptr = PGBUF_FIND_IOPAGE_PTR (i); /* base + i * PGBUF_IOPAGE_BUFFER_SIZE */
    /* ... fileio_init_lsa_of_page; set iopage.prv pageid/volid = -1, ptype UNKNOWN ... */
    bufptr->iopage_buffer = ioptr; ioptr->bcb = bufptr; /* <- symmetric cross-link */
  }
graph LR
  subgraph BCB_table
    b0["BCB[0]"]
    b1["BCB[1]"]
  end
  subgraph iopage_table
    p0["iopage[0]"]
    p1["iopage[1]"]
  end
  b0 -->|iopage_buffer| p0
  p0 -->|bcb| b0
  b1 -->|iopage_buffer| p1
  p1 -->|bcb| b1
  b0 -->|next_BCB| b1

Figure 2-2. Parallel arrays, symmetric per-slot cross-link, and the next_BCB chain the invalid list inherits.

Invariant: “iopage is 8-byte aligned.” Enforcement: struct pgbuf_iopage_buffer places PGBUF_BCB *bcb first, then on 32-bit builds (__WORDSIZE == 32) inserts an explicit int dummy so the following FILEIO_PAGE iopage starts on an 8-byte boundary (a platform that is none of LINUX, WINDOWS, or AIX trips a #error). PGBUF_IOPAGE_BUFFER_SIZE (offsetof(.., iopage) + SIZEOF_IOPAGE_PAGESIZE_AND_GUARD()) is the stride PGBUF_FIND_IOPAGE_PTR multiplies by i, so every frame stays aligned. What breaks: dropping the dummy on a 32-bit build yields misaligned buffers and undefined direct-I/O behavior.

2.3 pgbuf_initialize_hash_table — the fixed 2^20 bucket page table


The page table size is a compile-time constant, independent of num_buffers:

// pgbuf_initialize_hash_table -- src/storage/page_buffer.c
hashsize = PGBUF_HASH_SIZE; /* (1 << HASH_SIZE_BITS) == 1 << 20 == 1048576 */
pgbuf_Pool.buf_hash_table = (PGBUF_BUFFER_HASH *) malloc (hashsize * PGBUF_BUFFER_HASH_SIZEOF);
/* ... OOM check; loop: pthread_mutex_init each hash_mutex; hash_next = lock_next = NULL ... */

A power-of-two bucket count keeps the final masking step a single AND: pgbuf_hash_func_mirror finishes with hash_val & ((1 << HASH_SIZE_BITS) - 1), so no modulo/division is needed (the function still bit-reverses the 8 LSBs of volid into the high bits in a small loop before XOR-ing with pageid). Each bucket has its own hash_mutex (SERVER_MODE only), so there is no global page-table lock. The hash_next walk is Ch. 3.
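
The mirror-and-mask idea can be sketched from the prose alone. The code below is a reconstruction under assumptions, not a copy of pgbuf_hash_func_mirror: the function name hash_mirror, the constant VOLID_LSB_BITS, and the exact placement of the reversed bits (volid bit 0 landing on bit 19) are inferred from the description above.

```cpp
#include <cassert>
#include <cstdint>

constexpr unsigned HASH_SIZE_BITS = 20;  // 2^20 buckets, as in the text
constexpr unsigned VOLID_LSB_BITS = 8;   // assumed name for "the 8 LSBs of volid"

// Mirror the low 8 volid bits into the top of the 20-bit range, XOR with pageid,
// then keep the low 20 bits with a single AND.
inline std::uint32_t hash_mirror(std::int16_t volid, std::int32_t pageid) {
  const std::uint32_t volid_bits = static_cast<std::uint32_t>(volid);
  std::uint32_t reversed = 0;
  std::uint32_t src_mask = 1u;                          // walks volid bits 0..7 upward
  std::uint32_t dst_mask = 1u << (HASH_SIZE_BITS - 1);  // walks hash bits 19..12 downward
  for (unsigned i = 0; i < VOLID_LSB_BITS; ++i) {
    if (volid_bits & src_mask) {
      reversed |= dst_mask;
    }
    src_mask <<= 1;
    dst_mask >>= 1;
  }
  const std::uint32_t h = static_cast<std::uint32_t>(pageid) ^ reversed;
  return h & ((1u << HASH_SIZE_BITS) - 1u);             // power-of-two table: mask, not modulo
}
```

Mirroring puts the fast-changing low volid bits at the opposite end of the word from the fast-changing low pageid bits, so consecutive pages of different volumes tend to land in different bucket ranges.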

2.4 pgbuf_initialize_lock_table — one buffer-lock record per thread


The buffer-lock table has one slot per server thread, indexed by thread index; it is the rendezvous used while a miss is being resolved (Ch. 4):

// pgbuf_initialize_lock_table -- src/storage/page_buffer.c
thrd_num_total = thread_num_total_threads (); /* SA mode asserts thrd_num_total == 1 */
pgbuf_Pool.buf_lock_table = (PGBUF_BUFFER_LOCK *) malloc (thrd_num_total * PGBUF_BUFFER_LOCK_SIZEOF);
/* ... OOM check; loop: VPID_SET_NULL(vpid); lock_next = NULL; (SERVER_MODE) next_wait_thrd = NULL ... */

Sizing by thread count works because a thread resolves at most one miss at a time; its record is reused for whichever VPID it is bringing in.

2.5 pgbuf_initialize_lru_list — shared + private list count and per-list reset


This initializer first fixes num_LRU_list (the shared count): a non-zero parameter is taken verbatim; zero is auto-derived:

// pgbuf_initialize_lru_list -- src/storage/page_buffer.c
pgbuf_Pool.num_LRU_list = prm_get_integer_value (PRM_ID_PB_NUM_LRU_CHAINS);
if (pgbuf_Pool.num_LRU_list == 0)
  {
    pgbuf_Pool.num_LRU_list = (int) MAX_NTRANS; /* default: one shared list per transaction slot */
    if (pgbuf_Pool.num_buffers / pgbuf_Pool.num_LRU_list < PGBUF_MIN_PAGES_IN_SHARED_LIST) /* 1000 */
      pgbuf_Pool.num_LRU_list = pgbuf_Pool.num_buffers / PGBUF_MIN_PAGES_IN_SHARED_LIST; /* coarsen */
    pgbuf_Pool.num_LRU_list = MAX (pgbuf_Pool.num_LRU_list, 4); /* floor: at least 4 shared LRUs */
  }

Branch logic: one list per transaction; if that gives fewer than 1000 pages per list, coarsen; never below 4. The allocation covers shared + private (PGBUF_TOTAL_LRU_COUNT = PGBUF_SHARED_LRU_COUNT + PGBUF_PRIVATE_LRU_COUNT, the latter = num_private_LRU_list from §2.7). Private lists occupy the high index range; PGBUF_IS_PRIVATE_LRU_INDEX(i) is true for i >= PGBUF_SHARED_LRU_COUNT.

// pgbuf_initialize_lru_list -- src/storage/page_buffer.c
pgbuf_Pool.buf_LRU_list = (PGBUF_LRU_LIST *) malloc (PGBUF_TOTAL_LRU_COUNT * PGBUF_LRU_LIST_SIZEOF);
/* ... OOM check; loop over PGBUF_TOTAL_LRU_COUNT lists: ... */
pgbuf_Pool.buf_LRU_list[i].index = i; /* self-index, used to recover list from a BCB */
/* ... pthread_mutex_init; top/bottom/bottom_1/bottom_2 = NULL; counts/victim_hint/ticks cleared ... */
pgbuf_Pool.buf_LRU_list[i].threshold_lru1 = 0; /* <- initial threshold ZERO, set later */
pgbuf_Pool.buf_LRU_list[i].threshold_lru2 = 0; pgbuf_Pool.buf_LRU_list[i].quota = 0;
pgbuf_Pool.buf_LRU_list[i].flags = 0;

Both kinds of list use the same loop — they differ only by index range, not by struct. The thresholds and quota start at 0, not from the §2.1 ratios; they get real values from the quota machinery (Ch. 7/10) once num_buffers is distributed. At init every list is empty, so zero is correct.
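
The shared-count derivation described at the start of this section can be sketched as a pure function. This is a minimal illustration, not the CUBRID source: `derive_shared_lru_count` and the two constants are hypothetical names, and the sketch assumes `max_ntrans` is positive.

```cpp
#include <algorithm>

// Stand-in for PGBUF_MIN_PAGES_IN_SHARED_LIST (1000 in the excerpt above).
constexpr int kMinPagesInSharedList = 1000;
// Floor applied by the auto-derive branch.
constexpr int kMinSharedLists = 4;

// Hypothetical helper: returns the shared LRU list count.
// param == 0 means "auto-derive"; any other value is taken verbatim.
int derive_shared_lru_count (int param, int num_buffers, int max_ntrans)
{
  if (param != 0)
    {
      return param;
    }
  int n = max_ntrans;				// one list per transaction slot
  if (num_buffers / n < kMinPagesInSharedList)
    {
      n = num_buffers / kMinPagesInSharedList;	// coarsen: too few pages per list
    }
  return std::max (n, kMinSharedLists);		// never below the floor
}
```

With 100,000 buffers and 200 transaction slots, one list per slot would leave only 500 pages per list, so the count is coarsened to 100.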

2.6 pgbuf_initialize_invalid_list and the Aout list

The invalid (free) list is the cheapest initializer — it points its head at BCB[0] and trusts the next_BCB chain from §2.2:

// pgbuf_initialize_invalid_list -- src/storage/page_buffer.c
pthread_mutex_init (&pgbuf_Pool.buf_invalid_list.invalid_mutex, NULL);
pgbuf_Pool.buf_invalid_list.invalid_top = PGBUF_FIND_BCB_PTR (0); /* head of the next_BCB chain */
pgbuf_Pool.buf_invalid_list.invalid_cnt = pgbuf_Pool.num_buffers; /* every BCB starts invalid */

Invariant — “all BCBs begin in the invalid list.” Enforcement: every BCB’s flag is PGBUF_INVALID_ZONE (§2.2) and invalid_cnt == num_buffers — the same truth stored twice. What breaks: the first num_buffers misses pop here before any eviction; if the count and the flags disagree, a popped BCB could be double-counted or skipped, so Ch. 7 keeps the two in sync on every move.

The Aout list (pgbuf_initialize_aout_list, struct pgbuf_aout_list) records eviction history to decide whether a re-faulted page was recently evicted. Capacity is num_buffers * aout_ratio (where aout_ratio = prm_get_float_value(PRM_ID_PB_AOUT_RATIO)), capped at PGBUF_LIMIT_AOUT_BUFFERS (32768); a non-positive ratio disables it (max_count = 0, early return NO_ERROR after Aout_mutex is already initialized). Otherwise it pre-allocates a bufarray of max_count PGBUF_AOUT_BUF nodes chained into a free list (Aout_free at bufarray[0]), then builds num_hashes = MAX(max_count / AOUT_HASH_DIVIDE_RATIO, 1) MHT tables. The error_return path nulls Aout_free, frees bufarray, destroys the MHTs, frees aout_buf_ht, destroys Aout_mutex, and returns ER_FAILED. Its MHT loop stops at the first NULL slot (for (i = 0; list->aout_buf_ht[i] != NULL; i++)), so only the tables actually created are destroyed; the pgbuf_finalize loop, by contrast, iterates the full num_hashes.
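
The capacity arithmetic in the paragraph above can be sketched as follows. The names `AoutSizing` and `size_aout` are hypothetical, the value of AOUT_HASH_DIVIDE_RATIO is not given in this document so it is passed in as a parameter, and the float-to-int truncation is an assumption about how the product is rounded.

```cpp
#include <algorithm>

// PGBUF_LIMIT_AOUT_BUFFERS as stated in the text.
constexpr int kLimitAoutBuffers = 32768;

struct AoutSizing
{
  int max_count;	// 0 means the Aout list is disabled
  int num_hashes;	// number of MHT tables to build (0 when disabled)
};

// Hypothetical helper mirroring the sizing rules described above.
AoutSizing size_aout (int num_buffers, float aout_ratio, int hash_divide_ratio)
{
  if (aout_ratio <= 0.0f)
    {
      return AoutSizing { 0, 0 };	// non-positive ratio: disabled
    }
  // Assumption: the product is truncated toward zero.
  int max_count = static_cast<int> (num_buffers * aout_ratio);
  max_count = std::min (max_count, kLimitAoutBuffers);
  int num_hashes = std::max (max_count / hash_divide_ratio, 1);
  return AoutSizing { max_count, num_hashes };
}
```

A small pool still gets one hash table because of the MAX(..., 1) floor, while a very large pool is capped at 32768 history nodes.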

PGBUF_AOUT_LIST (the Aout container):

| Field | Role | Why it exists |
| --- | --- | --- |
| Aout_mutex (SERVER_MODE) | guards the whole Aout list | history mutated on every eviction/refault |
| Aout_top | most-recently-evicted end | newest history entry |
| Aout_bottom | oldest end | the entry discarded when the list overflows |
| Aout_free | head of the free node list | nodes preallocated, never malloc’d per insert |
| bufarray | the single allocation of all nodes | one block beats per-node alloc |
| num_hashes | count of MHT lookup tables | shards the lookup to cut contention |
| aout_buf_ht | array of MHT tables, VPID to node | O(1) “was this page recently evicted?” |
| max_count | capacity; 0 means disabled | bounds memory and acts as the on/off switch |

PGBUF_PAGE_QUOTA (adaptive private-LRU sizing — populated in §2.7):

| Field | Role | Why it exists |
| --- | --- | --- |
| num_private_LRU_list | number of private LRUs; 0 disables quota | master switch for the private-LRU feature |
| lru_victim_flush_priority_per_lru | per-LRU flush priority (TOTAL_LRU_COUNT floats) | tells flush daemons where dirty pressure is |
| private_lru_session_cnt | active sessions per private LRU | a list with 0 sessions can be reclaimed |
| private_pages_ratio | fraction of all BCBs that are private | target the quota adjuster steers toward |
| add_shared_lru_idx | round-robin cursor for relocating to shared | spreads BCBs evenly across shared lists |
| avoid_shared_lru_idx | shared LRU to skip when relocating | avoids piling onto an oversized list |
| last_adjust_time | timestamp of last quota adjustment | rate-limits the adjuster |
| adjust_age | monotonic adjustment counter | versions the quota state |
| is_adjusting | re-entrancy guard for the adjuster | only one thread adjusts quotas at a time |

PGBUF_PAGE_MONITOR (per-LRU statistics — populated in §2.8):

| Field | Role | Why it exists |
| --- | --- | --- |
| dirties_cnt | count of dirty BCBs (INT64) | drives flush urgency |
| lru_hits | LRU1 hits per LRU (TOTAL_LRU_COUNT ints) | recency-quality signal for quota tuning |
| lru_activity | activity level per LRU | detects idle private lists for reclamation |
| lru_shared_pgs_cnt | BCBs across all shared LRUs (volatile) | complements private_pages_ratio |
| pg_unfix_cnt | unfix counter (std::atomic_int) | triggers periodic quota refresh |
| lru_victim_req_cnt | victim requests across all LRUs | victim-pressure gauge |
| fix_req_cnt | fix requests (std::atomic_int) | overall load gauge |
| bcb_locks (SERVER_MODE) | per-thread BCB-mutex usage tracking | lock-contention diagnostics |
| victim_rich | true when victims are plentiful | fast-path hint for the fix code |

2.7 Quota bootstrap — pgbuf_initialize_page_quota_parameters then _page_quota

The split is deliberate. Parameters runs before the BCB/LRU tables because it fixes num_private_LRU_list (a dependency of PGBUF_TOTAL_LRU_COUNT); the data initializer runs after them because it allocates arrays sized by that total.

// pgbuf_initialize_page_quota_parameters -- src/storage/page_buffer.c
quota = &(pgbuf_Pool.quota); memset (quota, 0, sizeof (PGBUF_PAGE_QUOTA));
tsc_getticks (&quota->last_adjust_time); quota->adjust_age = 0; quota->is_adjusting = 0;
#if defined (SERVER_MODE)
quota->num_private_LRU_list = prm_get_integer_value (PRM_ID_PB_NUM_PRIVATE_CHAINS);
if (quota->num_private_LRU_list == -1)
quota->num_private_LRU_list = MAX_NTRANS + VACUUM_MAX_WORKER_COUNT; /* auto: one per worker */
else if (quota->num_private_LRU_list == 0)
{ /* disabled */ } /* <- explicit no-op branch */
else if (quota->num_private_LRU_list < PGBUF_PRIVATE_LRU_MIN_COUNT) /* 4 */
quota->num_private_LRU_list = PGBUF_PRIVATE_LRU_MIN_COUNT; /* floor when user-set */
#else
quota->num_private_LRU_list = 0; /* SA_MODE: no private LRUs */
#endif

Outcomes: -1 (auto) becomes MAX_NTRANS + VACUUM_MAX_WORKER_COUNT; 0 stays disabled; positive below 4 is raised to 4; SA-mode is always 0. This integer drives PGBUF_PAGE_QUOTA_IS_ENABLED (> 0) everywhere. The data initializer then allocates the two arrays and seeds the session counts:

// pgbuf_initialize_page_quota -- src/storage/page_buffer.c
quota->lru_victim_flush_priority_per_lru = (float *) malloc (PGBUF_TOTAL_LRU_COUNT * sizeof (float)); /* ALL lists */
quota->private_lru_session_cnt = (int *) malloc (PGBUF_PRIVATE_LRU_COUNT * sizeof (int)); /* PRIVATE only */
/* ... each OOM -> error_status, goto exit; loop zeros priority for all, session_cnt only where ... */
/* ... PGBUF_IS_PRIVATE_LRU_INDEX(i) holds, indexed via PGBUF_PRIVATE_LIST_FROM_LRU_INDEX(i) ... */
quota->private_pages_ratio = PGBUF_PAGE_QUOTA_IS_ENABLED ? 1.0f : 0; /* start fully private if enabled */
quota->add_shared_lru_idx = 0; quota->avoid_shared_lru_idx = -1;

Both failures land on a single exit: (which returns error_status); the orchestrator’s goto error then runs finalize, which frees whatever was allocated.
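
The outcomes that pgbuf_initialize_page_quota_parameters produces for num_private_LRU_list can be condensed into one function. This is a sketch under stated assumptions: `resolve_private_lru_count` is a hypothetical name, the server/standalone split is modelled with a bool rather than the SERVER_MODE preprocessor switch, and the worker counts are passed in rather than read from MAX_NTRANS and VACUUM_MAX_WORKER_COUNT.

```cpp
// PGBUF_PRIVATE_LRU_MIN_COUNT as stated in the excerpt above.
constexpr int kPrivateLruMinCount = 4;

// Hypothetical helper: maps the raw parameter value to the effective
// private LRU list count, following the branch order shown above.
int resolve_private_lru_count (int param, int max_ntrans, int vacuum_workers, bool server_mode)
{
  if (!server_mode)
    {
      return 0;					// standalone mode: no private LRUs
    }
  if (param == -1)
    {
      return max_ntrans + vacuum_workers;	// auto: one per worker
    }
  if (param == 0)
    {
      return 0;					// explicitly disabled
    }
  if (param < kPrivateLruMinCount)
    {
      return kPrivateLruMinCount;		// user-set but too small: raise to floor
    }
  return param;
}
```

A result greater than zero corresponds to PGBUF_PAGE_QUOTA_IS_ENABLED being true.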

Mirroring the quota data initializer, pgbuf_initialize_page_monitor (§2.8) first re-NULLs its pointer members, then allocates two per-LRU integer arrays sized by PGBUF_TOTAL_LRU_COUNT:

// pgbuf_initialize_page_monitor -- src/storage/page_buffer.c
monitor->lru_hits = (int *) malloc (PGBUF_TOTAL_LRU_COUNT * sizeof (int));
monitor->lru_activity = (int *) malloc (PGBUF_TOTAL_LRU_COUNT * sizeof (int));
/* ... each OOM -> goto exit; loop zeros both; lru_victim_req_cnt/lru_shared_pgs_cnt = 0 ... */
monitor->fix_req_cnt.store (0); monitor->pg_unfix_cnt.store (0); /* atomics: .store, not memset */
#if defined (SERVER_MODE)
if (pgbuf_Monitor_locks) /* forced true in !NDEBUG; param-driven in NDEBUG */
monitor->bcb_locks = (PGBUF_MONITOR_BCB_MUTEX *) calloc (count_threads, sizeof (PGBUF_MONITOR_BCB_MUTEX));
#endif
monitor->victim_rich = false; /* no BCBs in lists yet, so no victims */

bcb_locks is per-thread (sized by thread_num_total_threads()), allocated only when lock monitoring is on (pgbuf_Monitor_locks is set in §2.1: forced true in debug builds, read from PRM_ID_PB_MONITOR_LOCKS in NDEBUG). All error paths funnel through exit:.

2.9 pgbuf_initialize_thrd_holder — pre-allocated per-thread holder pools

A holder records that a thread has a BCB fixed. Each thread gets a private free list of PGBUF_DEFAULT_FIX_COUNT (7) holders so the common fix path never allocates:

// pgbuf_initialize_thrd_holder -- src/storage/page_buffer.c
thrd_num_total = thread_num_total_threads ();
pgbuf_Pool.thrd_holder_info = (PGBUF_HOLDER_ANCHOR *) malloc (thrd_num_total * PGBUF_HOLDER_ANCHOR_SIZEOF);
pgbuf_Pool.thrd_reserved_holder = (PGBUF_HOLDER *) malloc (thrd_num_total * PGBUF_DEFAULT_FIX_COUNT * PGBUF_HOLDER_SIZEOF);
/* ... each OOM check; per-thread anchor i: num_hold_cnt=0, num_free_cnt=7, thrd_hold_list=NULL ... */
pgbuf_Pool.thrd_holder_info[i].thrd_free_list = &(pgbuf_Pool.thrd_reserved_holder[i * PGBUF_DEFAULT_FIX_COUNT]);
/* ... inner loop chains the 7 reserved holders via next_holder, last == NULL ... */
pthread_mutex_init (&pgbuf_Pool.free_holder_set_mutex, NULL);
pgbuf_Pool.free_holder_set = NULL; pgbuf_Pool.free_index = -1; /* -1 == no shared free holder; grow on demand */

The reserved holders are one flat array sliced per thread by i * PGBUF_DEFAULT_FIX_COUNT. When a thread exceeds 7 concurrent fixes, pgbuf_allocate_thrd_holder_entry falls back to the shared free_holder_set, malloc’d in PGBUF_HOLDER_SET blocks (PGBUF_NUM_ALLOC_HOLDER = 10 elements each) and never freed until finalize; free_index == -1 is the “pool empty, grow it” sentinel set here (it was a transient 0 from the §2.1 reset).
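
The "one flat array sliced per thread" layout can be illustrated with a small model. The types below (`Holder`, `HolderAnchor`, `HolderPool`) are simplified stand-ins invented for this sketch; they keep only the fields needed to show the slicing and the per-thread free-list chaining, and they omit the shared free_holder_set fallback.

```cpp
#include <cstddef>
#include <vector>

// PGBUF_DEFAULT_FIX_COUNT as stated in the text.
constexpr int kDefaultFixCount = 7;

struct Holder
{
  Holder *next_holder = nullptr;
};

struct HolderAnchor
{
  int num_hold_cnt = 0;
  int num_free_cnt = 0;
  Holder *thrd_free_list = nullptr;
};

struct HolderPool
{
  std::vector<HolderAnchor> anchors;	// one anchor per thread
  std::vector<Holder> reserved;		// one flat block for all threads

  explicit HolderPool (int thread_count)
    : anchors (thread_count),
      reserved (static_cast<std::size_t> (thread_count) * kDefaultFixCount)
  {
    for (int i = 0; i < thread_count; i++)
      {
	// Thread i owns the slice starting at i * kDefaultFixCount.
	Holder *slice = &reserved[static_cast<std::size_t> (i) * kDefaultFixCount];
	for (int j = 0; j < kDefaultFixCount; j++)
	  {
	    slice[j].next_holder = (j + 1 < kDefaultFixCount) ? &slice[j + 1] : nullptr;
	  }
	anchors[i].num_free_cnt = kDefaultFixCount;
	anchors[i].thrd_free_list = slice;
      }
  }
};

// Counts the holders reachable from an anchor's free list.
int free_list_length (const HolderAnchor &anchor)
{
  int n = 0;
  for (const Holder *h = anchor.thrd_free_list; h != nullptr; h = h->next_holder)
    {
      n++;
    }
  return n;
}
```

Because each slice is contiguous and chained only within itself, a thread can pop and push holders on its own free list without touching any other thread's slice.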

2.10 pgbuf_thread_variables_init — a worker claims its private LRU index

Called when a worker’s THREAD_ENTRY comes online, this hook wires the worker to its private LRU and holder anchor:

// pgbuf_thread_variables_init -- src/storage/page_buffer.c
if (!thread_p) return;
if (pgbuf_Pool.quota.num_private_LRU_list > 0 && thread_p->private_lru_index != -1)
thread_p->m_is_private_lru_enabled = true; /* quota on AND this worker has a private slot */
else
thread_p->m_is_private_lru_enabled = false;
if (!thread_p->m_holder_anchor)
thread_p->m_holder_anchor = &pgbuf_Pool.thrd_holder_info[thread_p->index]; /* bind to its slice */

private_lru_index lives on THREAD_ENTRY (default -1), assigned elsewhere when a transaction acquires a private list. This function only interprets it: a worker uses a private LRU iff quota is enabled and its index != -1. The anchor bind is idempotent (guarded by !m_holder_anchor) and gives O(1) access to the §2.9 slice. Vacuum workers and SA-mode fall to false, using shared LRUs only.

Teardown is not the strict reverse of init; it is a flat sequence of NULL-guarded frees, each safe because of the §2.1 invariant:

  1. Hash table: destroy all hash_mutexes, free buf_hash_table.
  2. Lock table: free buf_lock_table.
  3. BCB table: destroy every BCB mutex, free BCB_table, set num_buffers = 0.
  4. Free iopage_table.
  5. LRU lists: destroy every list mutex, free buf_LRU_list.
  6. Destroy invalid_mutex.
  7. Thread holders: free thrd_holder_info/thrd_reserved_holder, destroy free_holder_set_mutex, walk and free every lazily-grown free_holder_set block.
  8. victim_cand_list, then Aout (free bufarray, mht_destroy each of num_hashes slots, free aout_buf_ht, destroy Aout_mutex, zero fields), then seq_chkpt_flusher.flush_list.
  9. Quota arrays.
  10. Monitor arrays plus (SERVER_MODE) bcb_locks.
  11. SERVER_MODE: free direct_victims.bcb_victims, delete the two waiter queues and flushed_bcbs.
  12. Delete the three _lrus_with_victims queues.
  13. Free show_status, destroy its mutex.
  14. thread_clear_all_holder_anchor (), the symmetric undo of §2.10.

C++ objects (lockfree::circular_queue) use delete, not free_and_init, because they were new’d; mixing would corrupt the heap. num_buffers is zeroed early (step 3) so a racing reader sees an empty pool. Because every free is guarded by a != NULL check and every pointer is NULL-initialized in §2.1, finalize is correct whether the pool is fully built or failed mid-init.

  1. Ten sub-initializers, fixed order. pgbuf_initialize zeroes pgbuf_Pool field-by-field (atomics via .store), then calls ten sub-initializers; quota parameters must run first (they fix num_private_LRU_list) and quota/monitor data must run last (sized by PGBUF_TOTAL_LRU_COUNT).
  2. num_buffers is floored, not validated (below MAX_NTRANS * 10 it is silently raised); each LRU zone ratio is independently clamped (lru1 into [0.05, 0.90], lru2 floored at 0.05 then capped so the sum leaves room) and only stored.
  3. BCB and iopage are parallel arrays, symmetrically cross-linked; the next_BCB chain is what the invalid list inherits; the int dummy padding enforces 8-byte iopage alignment on 32-bit builds.
  4. The page table is a fixed 2^20 buckets, each with its own hash_mutex (no global lock); a power-of-two size makes the final hash step a single AND. Lock and holder pools are sized by thread count, not buffer count.
  5. All BCBs start in the invalid list (invalid_cnt == num_buffers, every flag PGBUF_INVALID_ZONE) — one truth stored twice. LRU thresholds start at 0 because every list is empty.
  6. Quota is one integer switch: num_private_LRU_list (-1 auto, 0 disabled, positive floored to 4, 0 in SA mode) drives PGBUF_PAGE_QUOTA_IS_ENABLED; a worker uses a private LRU iff quota is on and its THREAD_ENTRY.private_lru_index != -1.
  7. Finalize is a flat NULL-guarded sequence, safe at any partial-build point; it deletes C++ queues but frees C arrays, and ends by clearing per-thread holder-anchor back-pointers — see Ch. 8 for the daemon set this chapter only constructs.

Chapter 3: The Fix Entry Path and Page-Table Lookup

Every page access enters through pgbuf_fix (compiled as pgbuf_fix_release in release builds, pgbuf_fix_debug under !NDEBUG). This chapter dissects that function as a master state machine, from argument validation to the moment a hit hands off to latching (Chapter 5) or a miss hands off to BCB claim (Chapter 4). For the big-picture flow and the meaning of the zones, flags, and BCB struct, see ### How a page fix flows, ### Page table — VPID hash, and ### Buffer Control Block — PGBUF_BCB in cubrid-page-buffer-manager.md. The fetch mode (PAGE_FETCH_MODE) is the biggest source of branching: its seven values reappear at the lock-free fast path, the miss fork, the page-VPID check, and the PAGE_UNKNOWN switch near the exit.

// PAGE_FETCH_MODE -- src/storage/page_buffer.h
typedef enum
{
OLD_PAGE = 0, /* must already exist on disk or in buffer */
NEW_PAGE, /* newly allocated; may be created in buffer */
OLD_PAGE_IF_IN_BUFFER, /* return only if resident; never fix from disk */
OLD_PAGE_PREVENT_DEALLOC, /* fetch + mark to block dealloc */
OLD_PAGE_DEALLOCATED, /* deliberately fetch a deallocated page */
OLD_PAGE_MAYBE_DEALLOCATED, /* fetch, tolerate deallocated (warn) */
RECOVERY_PAGE /* recovery: new/old/deallocated all valid */
} PAGE_FETCH_MODE;

| Mode | Validation skipped? | Miss → claim from disk? | Behaviour on PAGE_UNKNOWN page at exit |
| --- | --- | --- | --- |
| OLD_PAGE | no | yes | assert(false) + ER_ERROR_SEVERITY ER_PB_BAD_PAGEID, unfix, return NULL |
| NEW_PAGE | no | yes (created in buffer) | accepted, returned |
| OLD_PAGE_IF_IN_BUFFER | suppresses errors in pgbuf_is_valid_page | no; returns NULL on miss | accepted, returned |
| OLD_PAGE_PREVENT_DEALLOC | no | yes | treated like OLD_PAGE: assert(false), unfix, NULL |
| OLD_PAGE_DEALLOCATED | no | yes | accepted, returned |
| OLD_PAGE_MAYBE_DEALLOCATED | no | yes | warning ER_PB_BAD_PAGEID, unfix, return NULL |
| RECOVERY_PAGE | bypasses the page-validation block entirely | yes | accepted, returned |
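
The last column of the table can be read as a switch over the fetch mode. The sketch below restates that column only; `FetchMode`, `UnknownPageOutcome`, and `outcome_for_unknown_page` are hypothetical names chosen for illustration, and the real exit switch also performs the unfix and error reporting that are reduced here to an outcome tag.

```cpp
// Illustrative mirror of PAGE_FETCH_MODE (same order as the enum above).
enum class FetchMode
{
  OldPage,
  NewPage,
  OldPageIfInBuffer,
  OldPagePreventDealloc,
  OldPageDeallocated,
  OldPageMaybeDeallocated,
  RecoveryPage
};

enum class UnknownPageOutcome
{
  Accept,		// page is returned to the caller
  ErrorReturnNull,	// error-severity bad-pageid report, unfix, NULL
  WarnReturnNull	// warning-severity bad-pageid report, unfix, NULL
};

// What the exit path does when the fixed page turns out to be PAGE_UNKNOWN.
UnknownPageOutcome outcome_for_unknown_page (FetchMode mode)
{
  switch (mode)
    {
    case FetchMode::OldPage:
    case FetchMode::OldPagePreventDealloc:
      return UnknownPageOutcome::ErrorReturnNull;
    case FetchMode::OldPageMaybeDeallocated:
      return UnknownPageOutcome::WarnReturnNull;
    case FetchMode::NewPage:
    case FetchMode::OldPageIfInBuffer:
    case FetchMode::OldPageDeallocated:
    case FetchMode::RecoveryPage:
      return UnknownPageOutcome::Accept;
    }
  return UnknownPageOutcome::ErrorReturnNull;	// unreachable for valid modes
}
```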

3.2 Argument validation and the unconditional→conditional downgrade

pgbuf_fix_release validates before touching shared state. Four guards fire in order, each an early return NULL. The first two are assert_release (false) checks rejecting an illegal request_mode (non-R/W) or condition; then pgbuf_Pool.monitor.fix_req_cnt is bumped, then the page-validation and pageid guards:

// pgbuf_fix_release -- src/storage/page_buffer.c
if (pgbuf_get_check_page_validation_level (PGBUF_DEBUG_PAGE_VALIDATION_FETCH)
&& fetch_mode != RECOVERY_PAGE) /* <- recovery skips validation */
{
if (pgbuf_is_valid_page (thread_p, vpid, fetch_mode == OLD_PAGE_IF_IN_BUFFER) != DISK_VALID)
return NULL; /* IF_IN_BUFFER suppresses errors */
}
if (vpid->pageid < 0) /* <- always-on cheap check */
{
er_set (ER_FATAL_ERROR_SEVERITY, ARG_FILE_LINE, ER_PB_BAD_PAGEID, 2, ...);
return NULL; /* fatal: ER_FATAL_ERROR_SEVERITY */
}

The page-validation block runs only when debug validation is armed and the mode is not RECOVERY_PAGE (recovery may legitimately fix pages disk metadata says do not exist). For OLD_PAGE_IF_IN_BUFFER the second argument is true, suppressing the error log since “not valid” is normal. The pivotal transformation comes next: if condition == PGBUF_UNCONDITIONAL_LATCH and pgbuf_find_current_wait_msecs (thread_p) is LK_ZERO_WAIT or LK_FORCE_ZERO_WAIT, condition is silently set to PGBUF_CONDITIONAL_LATCH.

Invariant — a zero-wait transaction never blocks on a page latch. The downgrade happens here, before any hashing or latching, and everything downstream keys off condition. Skipping it would let a zero-wait transaction sleep indefinitely in pgbuf_latch_bcb_upon_fix.
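
The downgrade rule is small enough to state as a function. In this sketch `LatchCondition` and `effective_condition` are hypothetical names, and the numeric values given to the two zero-wait constants are placeholders chosen for the example; the real LK_ZERO_WAIT and LK_FORCE_ZERO_WAIT definitions should be taken from the CUBRID headers.

```cpp
enum class LatchCondition
{
  Unconditional,
  Conditional
};

// Placeholder values, assumed for this sketch only.
constexpr int kZeroWait = 0;		// stands in for LK_ZERO_WAIT
constexpr int kForceZeroWait = -2;	// stands in for LK_FORCE_ZERO_WAIT

// Returns the condition that the rest of the fix path should use.
LatchCondition effective_condition (LatchCondition requested, int wait_msecs)
{
  if (requested == LatchCondition::Unconditional
      && (wait_msecs == kZeroWait || wait_msecs == kForceZeroWait))
    {
      return LatchCondition::Conditional;	// zero-wait transactions never block
    }
  return requested;
}
```

A request that is already conditional is never changed, and a transaction with a positive wait time keeps its unconditional latch.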

3.3 The try_again loop and the interrupt check

Perf tracking is sampled just before the try_again: label, the loop re-entry point:

// pgbuf_fix_release -- src/storage/page_buffer.c
try_again:
if (logtb_get_check_interrupt (thread_p) == true)
if (logtb_is_interrupted (thread_p, true, &pgbuf_Pool.check_for_interrupts) == true)
{
er_set (ER_ERROR_SEVERITY, ARG_FILE_LINE, ER_INTERRUPTED, 0);
PGBUF_BCB_CHECK_MUTEX_LEAKS (); /* <- assert no mutex held on exit */
return NULL;
}

The interrupt check sits inside the loop, so every retry re-checks for interruption. Exactly one statement jumps back to try_again: the miss path’s pgbuf_claim_bcb_for_fix returning NULL with its retry out-parameter set (a BCB-claim race; Chapter 4). The §3.2 guards sit above the label and run once.

// pgbuf_fix_release -- src/storage/page_buffer.c (miss fork, retry edge)
bufptr = pgbuf_claim_bcb_for_fix (thread_p, vpid, fetch_mode, hash_anchor, &perf, &retry, false);
if (bufptr == NULL)
{
if (retry) { retry = false; goto try_again; } /* <- the only re-entry */
ASSERT_ERROR ();
return NULL;
}

pgbuf_fix_with_retry is a thin wrapper around pgbuf_fix, not part of the loop. It re-calls pgbuf_fix while it returns NULL, switching on er_errid (): NO_ERROR/ER_INTERRUPTED retry without bumping i; the three timeout errors (ER_LK_UNILATERALLY_ABORTED, ER_LK_PAGE_TIMEOUT, ER_PAGE_LATCH_TIMEDOUT) do i++; anything else sets noretry. The loop breaks (with ER_PAGE_LATCH_ABORTED) once noretry || i > retry — so interrupts never consume retry budget and any other error exits at once.
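
The retry-budget accounting described for pgbuf_fix_with_retry can be modelled with a callback standing in for pgbuf_fix. Everything here is a sketch: `FixError`, `FixResult`, and `fix_with_retry` are hypothetical names, and the error tags only label the categories named in the paragraph above rather than reproducing real error codes.

```cpp
#include <functional>

enum FixError
{
  kNoError,
  kInterrupted,
  kUnilaterallyAborted,
  kPageTimeout,
  kLatchTimedOut,
  kLatchAborted,
  kOtherError
};

struct FixResult
{
  void *page;		// non-null on success
  FixError err;		// meaningful only when page is null
};

// Calls fix_once until it succeeds, the retry budget is spent,
// or a non-retryable error appears.
void *fix_with_retry (const std::function<FixResult ()> &fix_once, int retry, FixError *final_err)
{
  int i = 0;
  bool noretry = false;
  for (;;)
    {
      FixResult r = fix_once ();
      if (r.page != nullptr)
	{
	  *final_err = kNoError;
	  return r.page;
	}
      switch (r.err)
	{
	case kNoError:
	case kInterrupted:
	  break;			// retry without spending budget
	case kUnilaterallyAborted:
	case kPageTimeout:
	case kLatchTimedOut:
	  i++;				// timeouts consume budget
	  break;
	default:
	  noretry = true;		// anything else exits at once
	  break;
	}
      if (noretry || i > retry)
	{
	  *final_err = kLatchAborted;
	  return nullptr;
	}
    }
}
```

Note that, as in the description, a callback that reports interruption forever would keep this loop spinning; the budget only bounds timeout-class failures.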

The page table is indexed by PGBUF_HASH_VALUE, calling pgbuf_hash_func_mirror:

// pgbuf_hash_func_mirror -- src/storage/page_buffer.c
#define HASH_SIZE_BITS 20 /* 2^20 ~ 1M anchors, fixed */
#define VOLID_LSB_BITS 8
reverse_mask = 1 << (HASH_SIZE_BITS - 1); /* top bit of the 20-bit space */
for (i = VOLID_LSB_BITS; i > 0; i--) /* bit-reverse low 8 volid bits */
{ if (volid_lsb & lsb_mask) reversed_volid_lsb |= reverse_mask;
reverse_mask >>= 1; lsb_mask <<= 1; }
hash_val = vpid->pageid ^ reversed_volid_lsb; /* XOR pageid with mirrored volid */
hash_val = hash_val & ((1 << HASH_SIZE_BITS) - 1); /* clamp to 2^20 buckets */

The “mirror” trick bit-reverses the low 8 volid bits into the top of the 20-bit space, then XORs with the pageid (which dominates the low bits), so different volumes get disjoint high-bit signatures and adjacent ids across volumes do not share chains.
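
To make the mirroring concrete, here is a self-contained version of the hash following the excerpt above. The function name `hash_mirror` and the plain `int`/`short` parameters are simplifications for this sketch; the real function takes a VPID pointer.

```cpp
constexpr int kHashSizeBits = 20;	// HASH_SIZE_BITS
constexpr int kVolidLsbBits = 8;	// VOLID_LSB_BITS

// Bit-reverses the low 8 bits of volid into bits 19..12,
// XORs with pageid, and clamps to the 2^20 bucket space.
unsigned int hash_mirror (int pageid, short volid)
{
  unsigned int volid_lsb = static_cast<unsigned int> (volid);
  unsigned int lsb_mask = 1;
  unsigned int reverse_mask = 1u << (kHashSizeBits - 1);
  unsigned int reversed_volid_lsb = 0;

  for (int i = kVolidLsbBits; i > 0; i--)
    {
      if (volid_lsb & lsb_mask)
	{
	  reversed_volid_lsb |= reverse_mask;
	}
      reverse_mask >>= 1;
      lsb_mask <<= 1;
    }

  unsigned int hash_val = static_cast<unsigned int> (pageid) ^ reversed_volid_lsb;
  return hash_val & ((1u << kHashSizeBits) - 1);
}
```

Volume 1 sets bit 19 and volume 2 sets bit 18, so page 5 lands in bucket 5 for volume 0 and in bucket 0x80005 for volume 1. Page ids at or above 2^19 can still overlap with the mirrored volume bits, which is the usual cost of an XOR mix.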

Two sibling helpers serve the Aout victim-history mht table (Chapter 7), not the main page table: pgbuf_hash_vpid is a generic modulo hash, ((vpid->pageid | ((unsigned int) vpid->volid) << 24) % htsize), and pgbuf_compare_vpid is its ordering callback (same volume ⇒ pageid difference, else volid difference). The main buf_hash_table uses pgbuf_hash_func_mirror only, comparing via VPID_EQ.

Before grabbing any anchor mutex, a read fix of a present page tries to fix without locking. The guard requires all of the following: request_mode == PGBUF_LATCH_READ, fetch_mode in one of the three eligible modes, and condition == PGBUF_UNCONDITIONAL_LATCH. After the §3.2 downgrade a zero-wait transaction is therefore ineligible. On a non-NULL return from pgbuf_lockfree_fix_ro the caller bumps num_hit and jumps to fast_path, bypassing the hash walk and the latch pass. The function does a lock-free chain walk, then a CAS on the BCB latch word:

// pgbuf_lockfree_fix_ro -- src/storage/page_buffer.c
bufptr = pgbuf_search_hash_chain_no_bcb_lock (thread_p,
&pgbuf_Pool.buf_hash_table[PGBUF_HASH_VALUE (vpid)], vpid);
if (bufptr == NULL) return NULL; /* not resident -> slow path */
do {
impl = get_impl (&bufptr->atomic_latch); new_impl = impl;
if (impl.impl.latch_mode != PGBUF_LATCH_READ /* must already be read-latched */
|| impl.impl.waiter_exists || impl.impl.fcnt == 0 /* no writer queued, still held */
|| bufptr->vpid.pageid != vpid->pageid /* re-validate identity ... */
|| bufptr->vpid.volid != vpid->volid) /* ... against ABA reuse */
return NULL; /* any failure -> slow path */
new_impl.impl.fcnt++; /* bump fix count */
} while (!bufptr->atomic_latch.compare_exchange_weak (impl.raw, new_impl.raw,
std::memory_order_acq_rel, std::memory_order_acquire));

Invariant — the fast path only adds a reader to an already-read-held BCB. The CAS refuses unless the latch is PGBUF_LATCH_READ, fcnt != 0, and has no waiter — so it never upgrades from free/write, never starves a queued writer, and the in-loop VPID re-check defeats ABA. Any failure returns NULL to the slow path; never an error.
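
The "only add a reader" CAS loop can be reproduced in isolation with std::atomic. This is a sketch under stated assumptions: the bit layout of the latch word, the `Latch`/`Bcb` types, and the helper names are invented for the example and do not reflect the real atomic_latch encoding. The plain reads of the identity fields also ignore concurrent writers, which the real code has to consider.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical latch modes; the real PGBUF_LATCH_MODE values may differ.
enum LatchMode : std::uint64_t { kNoLatch = 0, kLatchRead = 1, kLatchWrite = 2 };

// Unpacked view of the latch word (layout assumed for this sketch).
struct Latch
{
  LatchMode mode;
  bool waiter_exists;
  std::uint32_t fcnt;
};

std::uint64_t pack (const Latch &l)
{
  return static_cast<std::uint64_t> (l.mode)
	 | (static_cast<std::uint64_t> (l.waiter_exists ? 1 : 0) << 8)
	 | (static_cast<std::uint64_t> (l.fcnt) << 32);
}

Latch unpack (std::uint64_t raw)
{
  return Latch { static_cast<LatchMode> (raw & 0xFF), ((raw >> 8) & 1) != 0,
		 static_cast<std::uint32_t> (raw >> 32) };
}

struct Bcb
{
  std::atomic<std::uint64_t> atomic_latch { 0 };
  int volid = -1;
  int pageid = -1;
};

// Returns true if a reader was added; false means "take the slow path".
bool try_lockfree_read_fix (Bcb &bcb, int volid, int pageid)
{
  std::uint64_t raw = bcb.atomic_latch.load (std::memory_order_acquire);
  for (;;)
    {
      Latch cur = unpack (raw);
      if (cur.mode != kLatchRead || cur.waiter_exists || cur.fcnt == 0
	  || bcb.pageid != pageid || bcb.volid != volid)
	{
	  return false;		// never upgrades, never jumps a queued waiter
	}
      Latch next = cur;
      next.fcnt++;
      if (bcb.atomic_latch.compare_exchange_weak (raw, pack (next),
						  std::memory_order_acq_rel,
						  std::memory_order_acquire))
	{
	  return true;
	}
      // On failure, raw now holds the current value; re-validate and retry.
    }
}
```

Every rejection is a plain false rather than an error, which matches the fallback-to-slow-path behaviour described above.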

The chain walk it uses, pgbuf_search_hash_chain_no_bcb_lock, is bare: it pointer-chases hash_anchor->hash_next returning the first VPID_EQ match, with no mutex or trylock — the CAS above does all synchronization.

On a successful CAS the function still has holder bookkeeping to do before returning the page: pgbuf_find_thrd_holder either finds the caller is already a holder (bump holder->fix_count, set hold_has_read_latch) or, in SERVER_MODE, allocates a fresh holder via pgbuf_allocate_thrd_holder_entry (a NULL return there is assert(false) + NULL). Only then does CAST_BFPTR_TO_PGPTR produce the PAGE_PTR and the caller reaches fast_path:.

3.6 The locked hash-chain walk and the hit/miss fork

If the fast path is skipped or returns NULL, the slow path sets hash_anchor, clears buf_lock_acquired, and calls pgbuf_search_hash_chain. If that returns a direct-victim BCB (pgbuf_bcb_is_direct_victim), pgbuf_bcb_update_flags (..., PGBUF_BCB_INVALIDATE_DIRECT_VICTIM_FLAG, ...) tells the victim-waiter it cannot use this BCB.

The anchor it walks is one slot of the buf_hash_table[], a PGBUF_BUFFER_HASH:

| Field | Role | Why |
| --- | --- | --- |
| hash_mutex | per-bucket pthread_mutex_t (SERVER_MODE only) | Serializes chain insert/remove and the buffer-lock chain; the only mutex phase two holds while walking. Per-bucket, not global, so different buckets hash concurrently. |
| hash_next | head of the BCB hash chain (PGBUF_BCB *) | The chain pgbuf_search_hash_chain pointer-chases via each BCB’s own hash_next; resident pages for this bucket live here. |
| lock_next | head of the buffer-lock chain (PGBUF_BUFFER_LOCK *) | Records VPIDs a thread has claimed but not yet inserted (the miss path, Chapter 4), so a second fixer for the same VPID waits instead of double-claiming. Also protected by hash_mutex. |

pgbuf_search_hash_chain is the workhorse: a two-phase search with an exact return contract — non-NULL ⇒ caller holds bufptr->mutex (not the hash mutex); NULL ⇒ caller holds hash_anchor->hash_mutex.

Phase one (one_phase:) walks the chain without the hash mutex, trying a non-blocking PGBUF_BCB_TRYLOCK on the matched BCB. The load-bearing core, per matched bufptr:

// pgbuf_search_hash_chain -- src/storage/page_buffer.c (one_phase core)
rv = PGBUF_BCB_TRYLOCK (bufptr);
if (rv != 0)
{ if (rv != EBUSY) goto two_phase; /* trylock error -> escalate */
PGBUF_BCB_LOCK (bufptr); } /* EBUSY -> block on the bcb mutex */
if (!VPID_EQ (&(bufptr->vpid), vpid)) /* bcb reused under us? */
{ PGBUF_BCB_UNLOCK (bufptr); goto one_phase; } /* <- restart phase 1 */
break; /* matched + locked -> return bufptr */

Three branches leave phase one (Figure 3-1): clean trylock + VPID recheck (return bufptr); EBUSY → blocking PGBUF_BCB_LOCK then recheck; and a non-EBUSY error escalating via goto two_phase. The post-lock recheck catches a slot repurposed between match and lock.

Phase two (two_phase:/try_again:) re-runs the same walk under the hash mutex, differing in three points: on a clean trylock it unlocks the hash mutex before returning; on EBUSY it unlocks the hash mutex before the blocking PGBUF_BCB_LOCK and re-validates via goto try_again; and a non-EBUSY failure is fatal: er_set_with_oserror (ER_CSS_PTHREAD_MUTEX_TRYLOCK) then return NULL.

Invariant — lock ordering is hash mutex then BCB mutex, never the reverse. Phase two always drops the hash mutex before a blocking PGBUF_BCB_LOCK; inverting it would deadlock insert/remove paths. The ER_CSS_PTHREAD_MUTEX_TRYLOCK branch is the one place the function returns NULL while not holding the hash mutex — a fatal OS failure.

flowchart TD
  A["pgbuf_search_hash_chain"] --> B["one_phase: walk chain, no hash mutex"]
  B --> C{"VPID match?"}
  C -- "no, end of chain" --> TP["two_phase"]
  C -- "yes" --> D["PGBUF_BCB_TRYLOCK"]
  D -- "rv==0" --> E{"VPID still equal?"}
  D -- "EBUSY" --> F["PGBUF_BCB_LOCK block"]
  D -- "other err" --> TP
  F --> E
  E -- "no, reused" --> B
  E -- "yes" --> R1["return bufptr, holds bcb mutex"]
  TP --> G["lock hash_mutex; walk chain"]
  G --> H{"VPID match?"}
  H -- "no, end" --> R2["return NULL, holds hash mutex"]
  H -- "yes" --> I["PGBUF_BCB_TRYLOCK"]
  I -- "rv==0 or EBUSY" --> JK["unlock hash_mutex; if EBUSY PGBUF_BCB_LOCK"]
  I -- "other err" --> ERR["fatal: return NULL"]
  JK --> L{"VPID still equal?"}
  L -- "no" --> G
  L -- "yes" --> R3["return bufptr, holds bcb mutex"]
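
The flow above can be approximated with std::mutex to show the return contract and the lock ordering. This is a simplified sketch, not the CUBRID code: the `Bcb` and `HashAnchor` types are cut down to what the walk needs, std::mutex::try_lock has no error result so the "other err" edges are not modelled, and the unlocked phase-one walk is shown only for illustration (it is exercised single-threaded here).

```cpp
#include <mutex>

struct Bcb
{
  std::mutex mutex;
  int volid = -1;
  int pageid = -1;
  Bcb *hash_next = nullptr;
};

struct HashAnchor
{
  std::mutex hash_mutex;
  Bcb *hash_next = nullptr;
};

static Bcb *walk_chain (HashAnchor &anchor, int volid, int pageid)
{
  Bcb *b = anchor.hash_next;
  while (b != nullptr && !(b->volid == volid && b->pageid == pageid))
    {
      b = b->hash_next;
    }
  return b;
}

// Contract: non-null result => caller holds result->mutex;
//           null result     => caller holds anchor.hash_mutex.
Bcb *search_hash_chain (HashAnchor &anchor, int volid, int pageid)
{
  // Phase one: walk without the hash mutex.
  for (;;)
    {
      Bcb *b = walk_chain (anchor, volid, pageid);
      if (b == nullptr)
	{
	  break;			// not found: escalate to phase two
	}
      if (!b->mutex.try_lock ())
	{
	  b->mutex.lock ();		// busy: block on the BCB mutex
	}
      if (b->volid == volid && b->pageid == pageid)
	{
	  return b;			// still the same page
	}
      b->mutex.unlock ();		// slot was reused: restart phase one
    }

  // Phase two: walk under the hash mutex.
  for (;;)
    {
      anchor.hash_mutex.lock ();
      Bcb *b = walk_chain (anchor, volid, pageid);
      if (b == nullptr)
	{
	  return nullptr;		// miss: hash mutex stays held
	}
      bool locked = b->mutex.try_lock ();
      anchor.hash_mutex.unlock ();	// always drop before blocking
      if (!locked)
	{
	  b->mutex.lock ();
	}
      if (b->volid == volid && b->pageid == pageid)
	{
	  return b;
	}
      b->mutex.unlock ();		// reused: retry phase two
    }
}
```

The key design point carried over from the description is that the hash mutex is released before any blocking acquisition of a BCB mutex, so the two are never held in the reverse order.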

Back in pgbuf_fix_release, the returned bufptr drives the hit/miss fork into three outcomes:

  1. Hit (bufptr != NULL): increment num_hit; if NEW_PAGE, assert the page is clean-LSA or dirty (a NEW_PAGE re-using a buffered, invalidated page). Control falls through to pgbuf_bcb_register_fix and the latch pass (Chapter 5).

  2. OLD_PAGE_IF_IN_BUFFER miss: this mode never reads from disk, so unlock the hash mutex and return NULL — the only mode that short-circuits a miss.

  3. General miss: call pgbuf_claim_bcb_for_fix (Chapter 4). On NULL with retry, goto try_again; on NULL without retry, ASSERT_ERROR and return NULL; on success, set buf_lock_acquired = true and continue to the page-VPID check.

3.7 Post-claim VPID re-check and the maybe_deallocated branch

After a hit or a successful claim the caller holds bufptr->mutex; pgbuf_bcb_register_fix and pgbuf_set_bcb_page_vpid run, then page identity is re-validated:

// pgbuf_fix_release -- src/storage/page_buffer.c
maybe_deallocated = (fetch_mode == OLD_PAGE_MAYBE_DEALLOCATED);
if (pgbuf_check_bcb_page_vpid (bufptr, maybe_deallocated) != true)
{
if (buf_lock_acquired)
{ pgbuf_put_bcb_into_invalid_list (thread_p, bufptr); /* releases bcb mutex */
(void) pgbuf_unlock_page (thread_p, hash_anchor, vpid, true); }
else
{ PGBUF_BCB_UNLOCK (bufptr); } /* hit case: just unlock */
PGBUF_BCB_CHECK_MUTEX_LEAKS ();
return NULL;
}
if (fetch_mode == OLD_PAGE_PREVENT_DEALLOC)
pgbuf_bcb_register_avoid_deallocation (bufptr); /* pin against dealloc */

The maybe_deallocated flag relaxes pgbuf_check_bcb_page_vpid so a deallocated VPID is not a failure for OLD_PAGE_MAYBE_DEALLOCATED. The cleanup branch differs by ownership: a fresh claim (buf_lock_acquired) is recycled to the invalid list and the page lock dropped; a hit only unlocks the BCB. Past here the function enters the latch pass (Chapter 5) and, on success, jumps to fast_path: where the §3.1 PAGE_UNKNOWN switch runs — the last place fetch mode steers the result.

  1. pgbuf_fix_release is a state machine: four early-return validations, then a try_again loop whose only re-entry edge is a BCB-claim race via pgbuf_claim_bcb_for_fix’s retry out-parameter.
  2. A zero-wait transaction (LK_ZERO_WAIT/LK_FORCE_ZERO_WAIT) has its unconditional fix rewritten to conditional before hashing.
  3. The lock-free fast path (pgbuf_lockfree_fix_ro) covers read latches in the three eligible modes under an unconditional request; its CAS only adds a reader to an already-read-held, waiter-free BCB, re-validating the VPID against ABA.
  4. pgbuf_hash_func_mirror bit-reverses the low 8 volid bits into the top of a 20-bit space and XORs with pageid; pgbuf_hash_vpid/ pgbuf_compare_vpid belong to the separate Aout mht table.
  5. pgbuf_search_hash_chain is two-phase; its return contract (non-NULL ⇒ holds BCB mutex; NULL ⇒ holds hash mutex) and strict hash-then-BCB lock ordering are load-bearing invariants.
  6. The hit/miss fork: hit → latch pass (Ch. 5); OLD_PAGE_IF_IN_BUFFER miss → immediate NULL; general miss → BCB claim (Ch. 4) with a retry.
  7. Fetch mode steers four points — fast-path eligibility, the miss short-circuit, the maybe_deallocated re-check, and the final PAGE_UNKNOWN switch — so any fix bug starts with the caller’s mode.

Chapter 4: Miss Handling BCB Claim and the PGBUF Allocation Lock

Chapter 3 left us at the point where a pgbuf_fix lookup finds that the page is not in the page table. This chapter answers: how does a thread reserve the VPID against racing allocators, obtain a fresh BCB from the invalid list or a victim, read the page bytes, and insert the BCB into the hash chain? The companion (cubrid-page-buffer-manager.md §“How a page fix flows”, §“PGBUF lock”) sketches Step 1 / Step 2; here we trace every branch, assuming the reader knows the PGBUF_BCB layout, the five zones, and the victim sources from Chapters 1-3 and the companion’s §“LFCQ”.

The miss path is a four-layer onion: pgbuf_claim_bcb_for_fix (outer coordinator) takes the per-bucket VPID lock via pgbuf_lock_page, calls pgbuf_allocate_bcb (source selector: invalid list, then victim, then sleep on the direct-victim queue), whose cheapest source is pgbuf_get_bcb_from_invalid_list. Victim search (pgbuf_get_victim) and the direct-victim hand-off are Chapter 9 black boxes.

4.1 pgbuf_invalid_list — the free pool, every field

The invalid list is the pool of BCBs bound to no page (all BCBs at server start; error-rolled-back or invalidated BCBs at runtime). It is a LIFO stack guarded by one mutex.

// struct pgbuf_invalid_list -- src/storage/page_buffer.c
struct pgbuf_invalid_list
{
#if defined(SERVER_MODE)
pthread_mutex_t invalid_mutex; /* integrity of the singly-linked list */
#endif
PGBUF_BCB *invalid_top; /* head of the list (LIFO) */
int invalid_cnt; /* # of entries */
};
| Field | Role | Why it exists |
| --- | --- | --- |
| invalid_mutex | Serializes push/pop of the stack | Without it two poppers could grab the same head. SERVER_MODE only; SA mode is single-threaded. |
| invalid_top | Head pointer; chain runs through bufptr->next_BCB | An invalid BCB is on no LRU list, so it reuses its next_BCB LRU pointer as the invalid-chain link; no separate field is needed. |
| invalid_cnt | Live count of free BCBs | Read by quota math (pgbuf_adjust_quotas, Ch 10); also a fast “is the pool exhausted?” probe before taking the mutex. |

Invariant — invalid_top chains exclusively through next_BCB, and a BCB is on the invalid list iff its zone is PGBUF_INVALID_ZONE. pgbuf_get_bcb_from_invalid_list flips the popped BCB to PGBUF_VOID_ZONE; pgbuf_put_bcb_into_invalid_list flips it back and asserts (bufptr->flags & PGBUF_BCB_FLAGS_MASK) == 0 — a BCB returning to the pool must carry no dirty/flushing/victim flag. If violated, the BCB re-enters the free pool still advertising a pending flush, and a later claimer treats stale page bytes as a clean fresh page.

flowchart LR
  IT["invalid_top"] --> B1["BCB a"]
  B1 -->|next_BCB| B2["BCB b"]
  B2 -->|next_BCB| B3["BCB c"]
  B3 -->|next_BCB| NUL["NULL"]

Figure 4-1 — The invalid list is a LIFO stack threaded through each BCB’s next_BCB pointer; invalid_cnt tracks its length.

pgbuf_get_bcb_from_invalid_list — double-checked-locking pop

This is pgbuf_allocate_bcb’s cheapest source. It pops one BCB with a lock-free fast path so the common “pool empty” case never touches the mutex.

// pgbuf_get_bcb_from_invalid_list -- src/storage/page_buffer.c
if (pgbuf_Pool.buf_invalid_list.invalid_top == NULL) /* (1) fast path: empty */
return NULL; /* no mutex taken */
rv = pthread_mutex_lock (&pgbuf_Pool.buf_invalid_list.invalid_mutex);
if (pgbuf_Pool.buf_invalid_list.invalid_top == NULL) /* (2) re-check under mutex */
{ pthread_mutex_unlock (...); return NULL; } /* someone emptied it */
else /* (3) pop the LIFO top */
{ bufptr = pgbuf_Pool.buf_invalid_list.invalid_top;
pgbuf_Pool.buf_invalid_list.invalid_top = bufptr->next_BCB; /* advance head */
pgbuf_Pool.buf_invalid_list.invalid_cnt -= 1;
pthread_mutex_unlock (...);
PGBUF_BCB_LOCK (bufptr); /* now hold bufptr->mutex */
bufptr->next_BCB = NULL; /* sever invalid-chain link */
pgbuf_bcb_change_zone (thread_p, bufptr, 0, PGBUF_VOID_ZONE); /* INVALID -> VOID */
return bufptr; }

Three branches: (1) unlocked empty check returns NULL (no mutex). (2) post-mutex re-check returns NULL if a racing popper drained the list between the two reads. (3) the pop advances invalid_top, decrements invalid_cnt, drops the list mutex, then locks the BCB, nulls its chain link, and flips it to PGBUF_VOID_ZONE, returned under bufptr->mutex.
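
The three branches can be exercised in isolation. The following is a minimal sketch, not the CUBRID code: `Node`, `FreeStack` and `in_void_zone` are hypothetical stand-ins for the BCB, the invalid list and the zone flip, and the head pointer is made a `std::atomic` (an assumption of this sketch) so that the unlocked probe is well-defined in portable C++. The per-BCB mutex taken by the real function is omitted.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>

// Hypothetical stand-in for a BCB: only the chain link and zone matter here.
struct Node {
  Node *next = nullptr;
  bool in_void_zone = false;  // models the INVALID -> VOID flip
};

// Hypothetical free stack modeled on the invalid list described above.
class FreeStack {
public:
  void push(Node *n) {
    assert(n != nullptr);
    std::lock_guard<std::mutex> guard(mutex_);
    n->in_void_zone = false;
    n->next = top_.load(std::memory_order_relaxed);
    top_.store(n, std::memory_order_release);
    ++count_;
  }

  // Returns nullptr when the stack is empty, otherwise the popped node.
  Node *pop() {
    // (1) Unlocked probe: an empty stack returns without touching the mutex.
    if (top_.load(std::memory_order_acquire) == nullptr) return nullptr;

    Node *n = nullptr;
    {
      std::lock_guard<std::mutex> guard(mutex_);
      // (2) Re-check: a racing popper may have drained the stack meanwhile.
      n = top_.load(std::memory_order_relaxed);
      if (n == nullptr) return nullptr;
      // (3) Pop the LIFO top and keep the count in step.
      top_.store(n->next, std::memory_order_release);
      --count_;
    }
    n->next = nullptr;       // sever the chain link after the list mutex is dropped
    n->in_void_zone = true;  // the node now belongs to the caller
    return n;
  }

  int count() const {
    std::lock_guard<std::mutex> guard(mutex_);
    return count_;
  }

private:
  mutable std::mutex mutex_;
  std::atomic<Node *> top_{nullptr};
  int count_ = 0;
};
```

The unlocked probe can only produce a false "empty" or a false "non-empty" answer; the second check under the mutex is what makes the pop itself safe.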

4.2 pgbuf_claim_bcb_for_fix — the outer coordinator

The fix path calls this on a page-table miss. Its contract is unusual: it is entered holding hash_anchor->hash_mutex and may exit having released it, having set *try_again, or having returned a fully-loaded BCB under bufptr->mutex. The four exit branches:

// pgbuf_claim_bcb_for_fix -- src/storage/page_buffer.c
/* Branch A: a prior trylock on the bucket failed -> bail, no retry. */
if (er_errid () == ER_CSS_PTHREAD_MUTEX_TRYLOCK)
{ pthread_mutex_unlock (&hash_anchor->hash_mutex); return NULL; }
/* Branch B: take the VPID lock; hash_mutex is released inside. */
if (!already_locked && pgbuf_lock_page (...) != PGBUF_LOCK_HOLDER)
{ *try_again = true; return NULL; } /* <- LOSER of a same-VPID race */
bufptr = pgbuf_allocate_bcb (thread_p, vpid);
if (bufptr == NULL) /* Branch C: pool dirty / interrupted */
{ ASSERT_ERROR (); (void) pgbuf_unlock_page (..., true); return NULL; }
/* Branch D: success. Scrub the fresh BCB. */
bufptr->vpid = *vpid;
/* atomic_latch <- {PGBUF_NO_LATCH, waiter=false, fcnt=0}; clears stale victim latch */
pgbuf_bcb_update_flags (..., 0, PGBUF_BCB_ASYNC_FLUSH_REQ); /* clear stray flag */
LSA_SET_NULL (&bufptr->oldest_unflush_lsa); /* nothing unflushed yet */

Branch A — a failed trylock on the bucket leaves ER_CSS_PTHREAD_MUTEX_TRYLOCK in the error slot; drop the mutex and return NULL without touching *try_again (the caller pre-initialized it to false, so its goto try_again does not fire and pgbuf_fix’s own retry loop re-drives the lookup). Branch B is the race protocol: already_locked is true only for the dealloc-aware caller. HOLDER → we own the VPID, fall through. WAITER → another thread is allocating it; we already slept in pgbuf_lock_page, so set *try_again = true and the caller’s goto try_again re-runs the lookup and hits the BCB the winner inserted. Branch C (allocation failure, §4.4) must undo the VPID lock: it holds no mutex, so pgbuf_unlock_page(..., true) re-acquires hash_anchor->hash_mutex to unlink the record. Branch D scrubs the BCB; the atomic_latch reset clears a victim’s stale PGBUF_LATCH_INVALID (the real latch is acquired in Chapter 5). Bytes are then loaded (§4.5).

flowchart TD
  S["enter holding hash_mutex"] --> A{"errid == TRYLOCK?"}
  A -- yes --> AR["unlock hash_mutex; return NULL\ntry_again untouched, stays false"]
  A -- no --> B{"pgbuf_lock_page\n== HOLDER?"}
  B -- "WAITER" --> BR["try_again=true; return NULL\n-> caller re-looks-up, hits"]
  B -- "HOLDER" --> C["pgbuf_allocate_bcb"]
  C --> D{"bufptr == NULL?"}
  D -- yes --> DR["unlock_page need_hash=true\nreturn NULL; propagate error"]
  D -- no --> E["init BCB; load bytes -> 4.5"]
  E --> G["return BCB under bufptr->mutex"]

Figure 4-2 — pgbuf_claim_bcb_for_fix branch map. Only B-WAITER and the success branch leave the caller a different state to act on. Of the two error branches, A releases only hash_mutex (the VPID lock was never taken) and C unwinds the VPID lock.

4.3 pgbuf_lock_page / pgbuf_unlock_page — the VPID race lock

The PGBUF lock is not the BCB latch and not the bucket mutex. It is a logical lock keyed on the VPID, on a chain hanging off the same hash bucket as the BCB chain: when no BCB exists yet for a VPID, it ensures exactly one thread allocates it. The lock record pgbuf_buffer_lock is statically pre-allocated one per thread (no malloc on the hot path):

// struct pgbuf_buffer_lock -- src/storage/page_buffer.c
struct pgbuf_buffer_lock
{
VPID vpid; /* the VPID being allocated */
PGBUF_BUFFER_LOCK *lock_next; /* next record on this bucket's lock chain */
#if defined(SERVER_MODE)
THREAD_ENTRY *next_wait_thrd; /* list of threads blocked on this VPID */
#endif
};

pgbuf_lock_page is entered holding hash_anchor->hash_mutex and always releases it before returning. Two branches:

// pgbuf_lock_page -- src/storage/page_buffer.c
for (cur = hash_anchor->lock_next; cur != NULL; cur = cur->lock_next)
if (VPID_EQ (&cur->vpid, vpid)) /* LOSER: VPID already being allocated */
{
cur_thrd_entry->next_wait_thrd = cur->next_wait_thrd;
cur->next_wait_thrd = cur_thrd_entry; /* push at the head of the waiter list */
pgbuf_sleep (cur_thrd_entry, &hash_anchor->hash_mutex); /* releases mutex, sleeps */
if (cur_thrd_entry->resume_status != THREAD_PGBUF_RESUMED)
{ /* woke for interrupt: re-take mutex, splice self out of waiter list */ }
return PGBUF_LOCK_WAITER;
}
/* WINNER: VPID absent. Claim this thread's static record. */
cur = &pgbuf_Pool.buf_lock_table[cur_thrd_entry->index];
cur->vpid = *vpid; cur->next_wait_thrd = NULL;
cur->lock_next = hash_anchor->lock_next; hash_anchor->lock_next = cur; /* link at head */
pthread_mutex_unlock (&hash_anchor->hash_mutex);
return PGBUF_LOCK_HOLDER;

Loser branch (VPID on the chain): the thread pushes itself onto the head of that record’s next_wait_thrd list (order is immaterial, since pgbuf_unlock_page wakes every waiter); pgbuf_sleep releases hash_mutex and suspends. The resume_status != THREAD_PGBUF_RESUMED sub-branch handles an interrupt that woke the thread without the winner unlocking: it re-takes hash_mutex and splices itself out of the waiter list so a later pgbuf_unlock_page does not wake a departed thread. The result is WAITER. Winner branch (VPID absent): claim this thread’s static record, link it at the chain head, and return HOLDER.

Invariant — at most one BCB-less allocation per VPID is in flight. Enforced because the winner installs its record under hash_mutex before releasing it, so any later scanner under the same mutex sees the record and becomes a waiter. Two winners would create two BCBs for one VPID — the lookup would be nondeterministic and one copy’s writes lost.

pgbuf_unlock_page is the mirror. need_hash_mutex says whether to acquire hash_anchor->hash_mutex itself (error paths, true) or whether the caller already holds it (success path after pgbuf_insert_into_hash_chain, false).

// pgbuf_unlock_page -- src/storage/page_buffer.c
if (need_hash_mutex) pthread_mutex_lock (&hash_anchor->hash_mutex);
/* find this VPID's record; if found, unlink it ... */
if (cur != NULL)
{
/* splice out of lock_next chain */
pthread_mutex_unlock (&hash_anchor->hash_mutex);
while ((t = cur->next_wait_thrd) != NULL) /* wake EVERY waiter */
{ cur->next_wait_thrd = t->next_wait_thrd; t->next_wait_thrd = NULL;
pgbuf_wakeup_uncond (t); }
}
else pthread_mutex_unlock (&hash_anchor->hash_mutex); /* record gone (error case) */

It unlinks the record, drops the mutex, then wakes all waiters. Each woken loser re-runs the fix, finds the BCB in the table (inserted before unlocking on the success path), and proceeds via the Chapter 3 hash-hit path. Waking after dropping the mutex avoids immediate re-contention.
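
The chain bookkeeping behind the single-winner rule can be modeled without any threads. The sketch below is a single-threaded illustration under stated assumptions: `BucketLockChain`, `LockRecord` and the integer thread ids are hypothetical, the sleep in the loser branch is replaced by simply recording the waiter, and the bucket mutex is left out because nothing runs concurrently.

```cpp
#include <cassert>
#include <utility>
#include <vector>

struct Vpid { int volid; int pageid; };

inline bool vpid_eq(const Vpid &a, const Vpid &b) {
  return a.volid == b.volid && a.pageid == b.pageid;
}

enum class LockResult { Holder, Waiter };

// Hypothetical lock record: one per in-flight allocation on a bucket.
struct LockRecord {
  Vpid vpid;
  int holder_thread;
  std::vector<int> waiters;  // ids of threads parked on this VPID
};

// Single-threaded model of one bucket's lock chain.
class BucketLockChain {
public:
  LockResult lock_page(int thread_id, const Vpid &vpid) {
    assert(thread_id >= 0);
    for (LockRecord &rec : chain_) {
      if (vpid_eq(rec.vpid, vpid)) {
        rec.waiters.push_back(thread_id);  // loser: the real code sleeps here
        return LockResult::Waiter;
      }
    }
    // Winner: install the record before anyone else can scan the chain.
    chain_.insert(chain_.begin(), LockRecord{vpid, thread_id, {}});
    return LockResult::Holder;
  }

  // Unlinks the record and returns every waiter that must be woken.
  std::vector<int> unlock_page(const Vpid &vpid) {
    for (auto it = chain_.begin(); it != chain_.end(); ++it) {
      if (vpid_eq(it->vpid, vpid)) {
        std::vector<int> woken = std::move(it->waiters);
        chain_.erase(it);
        return woken;
      }
    }
    return {};  // record already gone (error case)
  }

private:
  std::vector<LockRecord> chain_;
};
```

In the model, a second request for the same VPID can never become a holder while the first record is still on the chain, which is the invariant stated above.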

4.4 pgbuf_allocate_bcb — the source selector

The VPID-lock winner needs an actual BCB. The selector tries three sources in cost order.

// pgbuf_allocate_bcb -- src/storage/page_buffer.c
bufptr = pgbuf_get_bcb_from_invalid_list (thread_p); /* Source 1: free list, cheapest */
if (bufptr != NULL) return bufptr; /* short-circuit: SKIPS the 'end:' victimize */
bufptr = pgbuf_get_victim (thread_p); /* Source 2: scan LFCQs */
if (bufptr != NULL) goto end; /* victim still needs pgbuf_victimize_bcb */

Source 1 short-circuits: a BCB off the invalid list (§4.1) is bound to no page and has no flags, so it returns immediately; it does not reach end: and so is not victimized. Source 2 is different. By the time pgbuf_get_victim returns, the victim has already been unlinked from its LRU list and flipped to PGBUF_VOID_ZONE: pgbuf_get_victim_from_lru_list calls pgbuf_remove_from_lru_list (which does the unlink + zone flip) before handing the BCB back (§4.6). What still remains is the BCB’s link in the hash chain and its old latch. That is precisely why Source 2 must fall through to the end: label, where pgbuf_victimize_bcb detaches the hash chain and invalidates the latch. So the goto end is about hash-chain and latch cleanup, not about LRU unlinking, which already happened inside pgbuf_get_victim.

If both fail, behavior forks on build mode and daemon availability. In SERVER_MODE with the flush daemon up, the thread enqueues on a direct-victim waiter queue and suspends with a timeout:

// pgbuf_allocate_bcb -- src/storage/page_buffer.c (SERVER_MODE, flush daemon up)
retry:
high_priority = high_priority || VACUUM_IS_THREAD_VACUUM (thread_p)
|| pgbuf_is_thread_high_priority (thread_p);
thread_lock_entry (thread_p);
if (high_priority) waiter_threads_high_priority->produce (thread_p);
else if (!waiter_threads_low_priority->produce (thread_p)) /* low queue jammed */
{ if (!waiter_threads_high_priority->produce (thread_p)) { assert(false); goto end; } }
pgbuf_wakeup_page_flush_daemon (thread_p); /* ensure SOMEONE will feed us */
r = thread_suspend_timeout_wakeup_and_unlock_entry (..., THREAD_ALLOC_BCB_SUSPENDED);

high_priority is true for vacuum threads, threads already holding a hot-page latch, or on a retry. The low-priority produce-failure sub-branch guards a preempted-consumer wedge: if the low queue cannot accept, the thread is pushed to the high queue. After enqueueing it wakes the flush daemon and suspends. On wake, four sub-branches:

  • Normal handoff (THREAD_ALLOC_BCB_RESUMED): a producer put a BCB in this thread’s slot; pgbuf_get_direct_victim reads it and we goto end.
  • Stolen-back: pgbuf_get_direct_victim returns NULL (the BCB was re-fixed between assign and get, companion §“Direct victim hand-off”); set high_priority and goto retry.
  • Interrupt/shutdown (other resume_status): undo any half-assigned victim, then raise ER_INTERRUPTED so the claim path takes Branch C.
  • r != NO_ERROR (the timeout else): asserts no timeout, re-stamps resume_status, and is a can’t-happen path under the assert.

The no-daemon else (SA mode or crash recovery) cannot sleep for a producer, so it flushes via pgbuf_wakeup_page_flush_daemon, re-scans with pgbuf_get_victim, and asserts a victim now exists. The shared tail victimizes any acquired victim:

// pgbuf_allocate_bcb -- src/storage/page_buffer.c
end:
if (bufptr != NULL)
{ if (pgbuf_victimize_bcb (thread_p, bufptr) != NO_ERROR) { assert (false); bufptr = NULL; } }
else if (er_errid () == NO_ERROR)
er_set (..., ER_PB_ALL_BUFFERS_DIRTY, ...);
return bufptr;

pgbuf_victimize_bcb re-checks victimizability under the BCB mutex, unlinks the BCB from its hash chain (pgbuf_delete_from_hash_chain), and stamps PGBUF_LATCH_INVALID into the atomic latch. It does not change the zone — the victim was already moved to VOID by pgbuf_remove_from_lru_list during pgbuf_get_victim (§4.6). Invalid-list BCBs skip this (never in a chain). If bufptr is still NULL, ER_PB_ALL_BUFFERS_DIRTY is set so Branch C has an error to propagate.

Invariant — every BCB leaving pgbuf_allocate_bcb non-NULL is detached from any hash chain and LRU list, held under bufptr->mutex. Invalid-list BCBs never attached; victims detach via pgbuf_remove_from_lru_list + pgbuf_victimize_bcb. A still-linked BCB leaking out would let the claimer bind a second VPID onto a slot still reachable by its old VPID.
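
The source ordering and the "only victims are victimized" rule can be sketched as a small selector. This is an illustration under assumptions, not the real function: the `Sources` callbacks, `AllocError` and `allocate` are hypothetical names, and the direct-victim sleep path is left out entirely, so a double miss goes straight to the error.

```cpp
#include <cassert>
#include <functional>

struct Bcb { bool in_hash_chain = false; };

enum class AllocError { None, AllBuffersDirty, VictimizeFailed };

struct AllocResult {
  Bcb *bcb;
  AllocError error;
};

// Hypothetical callbacks standing in for the three collaborators.
struct Sources {
  std::function<Bcb *()> from_invalid_list;   // cheapest source
  std::function<Bcb *()> from_victim_search;  // scans for a victim
  std::function<bool(Bcb *)> victimize;       // detaches the hash chain; false on failure
};

AllocResult allocate(const Sources &src) {
  assert(src.from_invalid_list && src.from_victim_search && src.victimize);

  // Source 1: a free BCB was never in a hash chain, so it skips victimize.
  if (Bcb *free_bcb = src.from_invalid_list()) {
    return {free_bcb, AllocError::None};
  }

  // Source 2: a victim still carries its hash-chain link and must be detached.
  Bcb *victim = src.from_victim_search();
  if (victim != nullptr) {
    if (!src.victimize(victim)) return {nullptr, AllocError::VictimizeFailed};
    return {victim, AllocError::None};
  }

  // Both sources failed: report an error so the caller has something to propagate.
  return {nullptr, AllocError::AllBuffersDirty};
}
```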

4.5 Loading the page bytes — NEW_PAGE vs read vs DWB

Back in pgbuf_claim_bcb_for_fix (Branch D), the initialized BCB needs its page bytes. The fork is on fetch_mode.

// pgbuf_claim_bcb_for_fix -- src/storage/page_buffer.c (read branch)
if (fetch_mode != NEW_PAGE)
{
/* DWB first: a torn-write copy may be fresher than the volume. */
if (dwb_read_page (thread_p, vpid, &...iopage, &success) != NO_ERROR)
{ assert (false); return NULL; } /* (1) DWB error: can't-happen */
else if (success == true) { /* copied from DWB, no disk read */ }
else if (fileio_read (...) == NULL) /* (2) volume read failed */
{ ASSERT_ERROR ();
pgbuf_put_bcb_into_invalid_list (thread_p, bufptr); /* releases bufptr->mutex */
(void) pgbuf_unlock_page (..., true); return NULL; }
/* (3) decrypt if TDE-protected; on failure roll back like (2) */
if (tde_algo != TDE_ALGORITHM_NONE && tde_decrypt_data_page (...) != NO_ERROR)
{ ASSERT_ERROR (); pgbuf_put_bcb_into_invalid_list (...); pgbuf_unlock_page (..., true); return NULL; }
if (pgbuf_is_temporary_volume (vpid->volid) && !pgbuf_is_temp_lsa (...)) /* temp first-touch */
{ pgbuf_init_temp_page_lsa (...); pgbuf_set_dirty_buffer_ptr (thread_p, bufptr); }
}

The read branch honors DWB-first ordering (companion §“Double Write Buffer”): dwb_read_page sets success when the double-write buffer holds a copy of this VPID, short-circuiting the disk read. Only on a DWB miss does fileio_read hit the volume. There are three error sub-branches. (1) The DWB-error guard is a can’t-happen path: it asserts and returns NULL with no rollback. (2) The volume-read failure and (3) the TDE-decrypt failure each roll the BCB back via pgbuf_put_bcb_into_invalid_list (nulls the VPID, sets PGBUF_LATCH_INVALID, flips the zone to INVALID, releases bufptr->mutex) and then call pgbuf_unlock_page(..., true), so half-read bytes never linger in the pool. The temp first-touch sub-branch stamps a sentinel temp LSA and marks the page dirty.
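
The ordering of the read branch (DWB, then volume, then decrypt, with rollback on the two recoverable failures) can be sketched as follows. All names here (`PageSources`, `LoadStatus`, `load_page`) are hypothetical, the I/O steps are injected as callbacks, and the can't-happen DWB-error guard is not modeled.

```cpp
#include <cassert>
#include <functional>

enum class LoadStatus { FromDwb, FromVolume, ReadFailed, DecryptFailed };

// Hypothetical callbacks standing in for the I/O steps of the read branch.
struct PageSources {
  std::function<bool()> dwb_has_copy;  // true: bytes were copied from the DWB
  std::function<bool()> read_volume;   // true on a successful volume read
  std::function<bool()> decrypt;       // true on success or when nothing is encrypted
  std::function<void()> rollback;      // return the BCB to the free pool, drop the VPID lock
};

LoadStatus load_page(const PageSources &s) {
  assert(s.dwb_has_copy && s.read_volume && s.decrypt && s.rollback);

  // DWB first: only a DWB miss falls through to the volume read.
  const bool from_dwb = s.dwb_has_copy();
  if (!from_dwb && !s.read_volume()) {
    s.rollback();
    return LoadStatus::ReadFailed;
  }

  // Decryption applies to bytes from either source.
  if (!s.decrypt()) {
    s.rollback();
    return LoadStatus::DecryptFailed;
  }
  return from_dwb ? LoadStatus::FromDwb : LoadStatus::FromVolume;
}
```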

// pgbuf_claim_bcb_for_fix -- src/storage/page_buffer.c (NEW_PAGE branch)
else
{
if (pgbuf_is_temporary_volume (vpid->volid))
pgbuf_init_temp_page_lsa (&...iopage, IO_PAGESIZE);
else fileio_init_lsa_of_page (&...iopage, IO_PAGESIZE);
if (bufptr->vpid.volid > NULL_VOLID) /* perm: mark page immature */
{ ...iopage.prv.pageid = -1; ...iopage.prv.volid = -1; }
}
return bufptr;

The NEW_PAGE branch has nothing on disk to read, so it initializes the in-page LSA and (for permanent volumes) stamps prv.pageid/volid = -1 to mark the page immature — the real identity is written later by pgbuf_set_bcb_page_vpid (§4.6). It cannot fail, so there is no rollback sub-branch. Both branches return the loaded BCB under bufptr->mutex.

4.6 pgbuf_set_bcb_page_vpid and the hash-chain insertion

The fix path (Chapter 5 territory, shown for the bind step) stamps the page identity and inserts the BCB. pgbuf_set_bcb_page_vpid has three branches:

// pgbuf_set_bcb_page_vpid -- src/storage/page_buffer.c
if (bufptr == NULL || VPID_ISNULL (&bufptr->vpid)) /* (A) guard: nothing to stamp */
{ assert (bufptr != NULL); assert (!VPID_ISNULL (&bufptr->vpid)); return; }
if (bufptr->vpid.volid > NULL_VOLID) /* permanent volume only */
{
if (prv.pageid == NULL_PAGEID && prv.volid == NULL_VOLID) /* (B) first time */
{ prv.pageid = bufptr->vpid.pageid; /* write identity into header */
prv.volid = bufptr->vpid.volid;
prv.ptype = PAGE_UNKNOWN; /* + p_reserve_1/2, tde_nonce zeroed */ }
else /* (C) already stamped */
{ assert (prv.volid == bufptr->vpid.volid); /* values not reset on dealloc */
assert (prv.pageid == bufptr->vpid.pageid); } /* identity must match -- no rewrite */
}

(A) the top guard: a NULL BCB or null VPID is a caller bug — it asserts and returns without touching bytes. (B) first-time path (immature NEW_PAGE sentinel from §4.5): writes the VPID into the in-page header, making the bytes self-identifying on disk. (C) the else — a re-allocated/already-stamped perm page (the in-page identity survives deallocation): the function leaves the bytes untouched and only asserts that the stored prv.volid/pageid still equal the BCB’s VPID. Temp pages (volid <= NULL_VOLID) fall through all three with no action.
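
As a compact restatement of the branches, here is a minimal sketch. It assumes the sentinel values are both -1 (consistent with the immature-page stamp in §4.5), treats a null VPID as "pageid is the sentinel", and returns an outcome code where the real function asserts; `Stamp`, `PageHeader` and `stamp_identity` are hypothetical names.

```cpp
#include <cassert>

constexpr int kNullPageId = -1;  // assumed sentinel, matching the immature-page stamp
constexpr int kNullVolId = -1;

struct Vpid { int pageid; int volid; };
struct PageHeader { int pageid; int volid; };

enum class Stamp { Skipped, Written, Verified, Mismatch };

// Writes the identity into the header once, and only checks it afterwards.
Stamp stamp_identity(const Vpid &vpid, PageHeader &prv) {
  if (vpid.pageid == kNullPageId) return Stamp::Skipped;  // (A) nothing to stamp
  if (vpid.volid <= kNullVolId) return Stamp::Skipped;    // temp page: no action
  if (prv.pageid == kNullPageId && prv.volid == kNullVolId) {
    prv.pageid = vpid.pageid;                             // (B) first time
    prv.volid = vpid.volid;
    return Stamp::Written;
  }
  // (C) already stamped: bytes stay untouched, identity must match.
  const bool same = prv.pageid == vpid.pageid && prv.volid == vpid.volid;
  return same ? Stamp::Verified : Stamp::Mismatch;
}
```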

// pgbuf_insert_into_hash_chain -- src/storage/page_buffer.c
pthread_mutex_lock (&hash_anchor->hash_mutex);
bufptr->hash_next = hash_anchor->hash_next; hash_anchor->hash_next = bufptr; /* link at head */
/* hash_mutex stays held; released by the following pgbuf_unlock_page (need_hash_mutex=false) */

pgbuf_insert_into_hash_chain links the BCB at the head of the bucket’s BCB chain and deliberately keeps hash_mutex held — the immediately-following pgbuf_unlock_page(..., false) unlinks the VPID lock record and releases the same mutex. Holding it across both means a racing loser cannot see a window where the BCB is in the chain but the VPID lock is already gone (which would let it wrongly become a winner).

Where does the BCB land in the LRU? At this point the BCB is in PGBUF_VOID_ZONE. An invalid-list BCB was put there by pgbuf_get_bcb_from_invalid_list; a victim was moved there by pgbuf_remove_from_lru_list (call chain pgbuf_get_victim -> pgbuf_get_victim_from_lru_list -> pgbuf_remove_from_lru_list, whose tail does the pgbuf_bcb_change_zone (..., PGBUF_VOID_ZONE)), and pgbuf_victimize_bcb afterward only detaches it from the hash chain and invalidates the latch. VOID means “on no list yet.” The BCB does not enter an LRU list during claim/insert; it lands in a zone only at the matching unfix, where pgbuf_unfix routes it to LRU 2 normally, or LRU 1 when the fixer is a vacuum worker or the VPID was an Aout-history hit (boost) — Chapter 7’s subject.

stateDiagram-v2
  [*] --> INVALID: server start
  INVALID --> VOID: pop from invalid list
  Victim --> VOID: pgbuf_remove_from_lru_list during get_victim
  VOID --> LRU2: unfix normal -> Ch 7
  VOID --> LRU1: unfix vacuum or Aout-hit -> Ch 7
  VOID --> INVALID: read error rollback

Figure 4-3 — Zone trajectory of a claimed BCB. The claim path lands it in VOID; LRU placement happens at unfix (Chapter 7). Any read error returns it straight to INVALID.

  1. pgbuf_claim_bcb_for_fix has four exits: trylock-bail (Branch A), loser-retry with *try_again=true (B), allocation-failure unwind (C), and success returning a loaded BCB under bufptr->mutex (D).

  2. The PGBUF lock makes same-VPID allocation single-winner. pgbuf_lock_page returns HOLDER to one thread and parks the rest as WAITERs on a per-record wait list; the winner inserts the BCB and pgbuf_unlock_page wakes every waiter, each of which re-drives the fix and hits it.

  3. pgbuf_allocate_bcb short-circuits the invalid list (returns without victimizing); a victim falls through to end: for pgbuf_victimize_bcb. When both fail, server mode sleeps on a direct-victim queue with four wake sub-branches (handoff, stolen-back, interrupt, r != NO_ERROR timeout-assert); SA/recovery flushes inline.

  4. pgbuf_get_bcb_from_invalid_list uses double-checked locking: an unlocked empty-list fast return, a post-mutex re-check return, and the pop branch that advances invalid_top, decrements invalid_cnt, locks the BCB and flips it to VOID. The volume-read and TDE-decrypt failure paths both funnel through pgbuf_put_bcb_into_invalid_list + pgbuf_unlock_page(...,true), so no half-bound slot leaks; the DWB-error guard is an assert-only path.

  5. DWB is consulted before disk on every read miss (dwb_read_page sets success, eliminating fileio_read); NEW_PAGE never reads or fails, marking permanent pages immature (prv.pageid/volid = -1). pgbuf_set_bcb_page_vpid then stamps identity only on the first-time branch; on a re-allocated perm page it leaves bytes untouched and only asserts the stored identity still matches the VPID.

  6. The claim path leaves the BCB in VOID. A victim reaches VOID via pgbuf_remove_from_lru_list during pgbuf_get_victim — not via pgbuf_victimize_bcb, which only detaches the hash chain and invalidates the latch. Hash-chain insertion holds hash_mutex across the VPID-lock release to close the winner/loser race window; LRU-zone placement is deferred to unfix in Chapter 7.

Chapter 5: The BCB Atomic Latch Acquire Block and Wake

Chapter 3 left us with a BCB in hand and its mutex held. The question: given a fixer that owns the BCB mutex, how does the per-page read/write latch decide compatibility, block an incompatible request on a per-page waiter list, time it out, and wake waiters in order on release? For latch semantics at design level see the high-level companion’s “Latch modes and the fix protocol” section; here we trace every branch.

The subsystem rests on one 64-bit word — the atomic_latch — plus a singly linked next_wait_thrd queue on the BCB. The BCB mutex serializes the queue; the latch word is mutated by lock-free CAS so a waiter being woken can publish its grant without re-taking the mutex.

5.1 The packed latch word: pgbuf_atomic_latch_impl

The latch is std::atomic<uint64_t> atomic_latch, accessed through a union:

// union pgbuf_atomic_latch_impl -- src/storage/page_buffer.c
union pgbuf_atomic_latch_impl {
uint64_t raw; /* the word actually CAS'd */
struct {
PGBUF_LATCH_MODE latch_mode; /* enum:uint16_t NO_LATCH=0, READ=1, WRITE=2, FLUSH=3 */
uint16_t waiter_exists; /* 1 if next_wait_thrd has a R/W waiter */
int32_t fcnt; /* number of granted fixes */
} impl;
};
| Field | Role | Why it exists |
| --- | --- | --- |
| raw | uint64_t payload compare_exchange_* operates on | Moves the three fields atomically in one CAS; a torn read is impossible only because they share this word. The layout (16 + 16 + 32 bits) exactly fills 64. |
| impl.latch_mode | Current grant mode. PGBUF_LATCH_FLUSH is only a block mode, never a grant; the header at PGBUF_LATCH_MODE says so. | The compatibility decision (5.3) reads this first; PGBUF_NO_LATCH doubles as the “idle” sentinel. |
| impl.waiter_exists | A hint, not a count: true once a R/W request is queued. | The writer-starvation guard (5.3, Case 1). |
| impl.fcnt | Total fix count across holders sharing the current mode | At 0 the page can switch mode or be victimized; compared against one holder’s fix_count for “am I the only reader”. |

get_impl snapshots with an acquire load. The single-field mutators (set_latch, add_fcnt, set_waiter_exists, set_latch_and_fcnt, set_latch_and_add_fcnt) each run a load/compare_exchange_weak loop. pgbuf_latch_bcb_upon_fix does not call those helpers: it computes the whole new_impl from a fresh old_impl and retries a single compare_exchange_strong on bufptr->atomic_latch — so its entire decision tree is one atomic transition, recomputed on contention.

Invariant — the latch word transitions atomically; no partial publish. Every mutation is do { old = get_impl(); ...build new... } while (!CAS(old, new)). With two separate stores a concurrent fixer could observe latch_mode == READ with a stale fcnt, grant a compatible read, and corrupt holder accounting. The single-word CAS forbids that torn intermediate. The decision-tree CAS uses the strong form (no spurious failure); the per-field helpers and the error/wakeup repair loops use the weak form inside their retry loops.
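
The "rebuild from a fresh snapshot, then CAS the whole word" pattern can be sketched as below. This is an illustration, not the CUBRID type: it packs the fields with shifts instead of a union, the bit positions are an assumption of the sketch, and `AtomicLatch`, `LatchFields`, `pack` and `unpack` are hypothetical names.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

enum LatchMode : uint16_t { NoLatch = 0, Read = 1, Write = 2 };

struct LatchFields {
  LatchMode mode;
  bool waiter_exists;
  int32_t fcnt;
};

// Illustrative layout: mode in bits 0-15, waiter flag in 16-31, fcnt in 32-63.
inline uint64_t pack(const LatchFields &f) {
  return static_cast<uint64_t>(f.mode)
       | (static_cast<uint64_t>(f.waiter_exists ? 1 : 0) << 16)
       | (static_cast<uint64_t>(static_cast<uint32_t>(f.fcnt)) << 32);
}

inline LatchFields unpack(uint64_t raw) {
  LatchFields f;
  f.mode = static_cast<LatchMode>(raw & 0xFFFF);
  f.waiter_exists = ((raw >> 16) & 0xFFFF) != 0;
  f.fcnt = static_cast<int32_t>(static_cast<uint32_t>(raw >> 32));
  return f;
}

class AtomicLatch {
public:
  LatchFields get() const { return unpack(word_.load(std::memory_order_acquire)); }

  // Applies fn to a fresh snapshot and publishes all three fields in one CAS.
  template <typename Update>
  LatchFields update(Update fn) {
    uint64_t old_raw = word_.load(std::memory_order_acquire);
    for (;;) {
      LatchFields next = unpack(old_raw);
      fn(next);
      // On failure old_raw is refreshed with the current word and fn is re-applied.
      if (word_.compare_exchange_weak(old_raw, pack(next),
                                      std::memory_order_acq_rel,
                                      std::memory_order_acquire)) {
        return next;
      }
    }
  }

  void add_fcnt(int32_t delta) {
    update([delta](LatchFields &f) { f.fcnt += delta; });
  }

  void set_latch_and_fcnt(LatchMode mode, int32_t fcnt) {
    update([mode, fcnt](LatchFields &f) { f.mode = mode; f.fcnt = fcnt; });
  }

private:
  std::atomic<uint64_t> word_{0};
};
```

Because every mutator goes through `update`, no reader of `get()` can observe a mode from one transition paired with a count from another.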

flowchart LR
  BCB["pgbuf_bcb"] --> AL["atomic_latch\n(uint64_t word)"]
  BCB --> NWT["next_wait_thrd\n(THREAD_ENTRY*)"]
  AL --> IMPL["impl: latch_mode | waiter_exists | fcnt"]
  NWT --> T1["THREAD_ENTRY\nrequest_latch_mode\nrequest_fix_count\nwait_for_latch_promote"]
  T1 --> T2["THREAD_ENTRY ..."]
  TH["THREAD_ENTRY\nm_holder_anchor"] --> HL["thrd_hold_list"]
  HL --> H1["pgbuf_holder\nfix_count, bufptr"]
  H1 --> H2["pgbuf_holder (thrd_link)"]

Figure 5-1 — The latch word and the two lists on it: the per-BCB waiter queue (next_wait_thrd) and the per-thread holder list (thrd_hold_list).

A pgbuf_holder records this thread’s fixes on one BCB. The latch word counts fixes globally; the holder counts my slice so promotion and unfix can be reasoned about locally.

// struct pgbuf_holder -- src/storage/page_buffer.c
struct pgbuf_holder {
int fix_count; /* the count of fix by the holder */
PGBUF_BCB *bufptr; /* pointer to BCB */
PGBUF_HOLDER *thrd_link; /* next holder in this thread's hold list */
PGBUF_HOLDER *next_holder;/* next in this thread's *free* list */
PGBUF_HOLDER_STAT perf_stat;
#if !defined(NDEBUG)
char fixed_at[64 * 1024]; /* call-site trail for leak debugging */
int fixed_at_size;
#endif
int watch_count; /* number of PGBUF_WATCHERs on this holder */
PGBUF_WATCHER *first_watcher;
PGBUF_WATCHER *last_watcher;
};
| Field | Role | Why it exists |
| --- | --- | --- |
| fix_count | How many times this thread fixed this BCB | old_impl.impl.fcnt == holder->fix_count (5.3) means “global == mine”, i.e. only fixer. At 0 the holder is recycled. |
| bufptr | Back-pointer to the recorded BCB | pgbuf_find_thrd_holder matches on this. |
| thrd_link | Next holder in this thread’s in-use list (thrd_hold_list) | Links the many pages a thread holds; next_holder must be NULL while on this list (asserted in pgbuf_find_thrd_holder). |
| next_holder | Next holder in this thread’s free list (thrd_free_list) | next_holder meaningful only while free, thrd_link only while in use. Never both. |
| perf_stat | PGBUF_HOLDER_STAT bitfield (dirty_before_hold, dirtied_by_holder, hold_has_write_latch, hold_has_read_latch) | Feeds perfmon; the hold_has_* bits are set when the latch is granted in any branch of pgbuf_latch_bcb_upon_fix. |
| fixed_at / fixed_at_size | Fixed-size buffer holding the concatenated call-site trail (file:line of each fix) and its length | Leak / double-fix debugging in non-NDEBUG builds only; absent from release builds. |
| watch_count / first_watcher / last_watcher | Ordered-fix watcher chain (Chapter 10) | Must be 0 / NULL before recycle; pgbuf_remove_thrd_holder asserts watch_count == 0. |

Invariant — a holder lives on exactly one of the two per-thread lists. While in use it is on thrd_hold_list with next_holder == NULL (pgbuf_find_thrd_holder asserts this on every node it walks); while free it is on thrd_free_list reached via next_holder, and thrd_link is dead. The free-list and hold-list links never both point somewhere at once.

Three helpers maintain the lists. pgbuf_allocate_thrd_holder_entry pops thrd_free_list if non-empty (no global mutex); else takes free_holder_set_mutex, carves the next element from the shared free_holder_set, and grows it with a fresh malloced PGBUF_HOLDER_SET when free_index == -1. Either way the holder is pushed onto thrd_hold_list:

// pgbuf_allocate_thrd_holder_entry -- src/storage/page_buffer.c
holder->next_holder = NULL; /* disconnect from free list */
holder->thrd_link = thrd_holder_info->thrd_hold_list; /* push onto hold list */
thrd_holder_info->thrd_hold_list = holder;
thrd_holder_info->num_hold_cnt += 1;
holder->first_watcher = NULL; holder->last_watcher = NULL; holder->watch_count = 0;

pgbuf_find_thrd_holder walks thrd_hold_list for the holder whose bufptr matches, else NULL; its assert (holder->next_holder == NULL) enforces the “never on both lists” invariant. pgbuf_remove_thrd_holder asserts fix_count == 0 and watch_count == 0, prepends the holder to thrd_free_list first, then unlinks it from thrd_hold_list (head special-case, else walk to the predecessor); a missing entry trips assert (false) and returns ER_FAILED.
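
The two-list discipline can be sketched with a stripped-down holder. This is a simplified model under assumptions: `Holder` keeps only the link fields, `ThreadHolders` is a hypothetical stand-in for the per-thread anchor, the shared free_holder_set fallback is omitted (an empty free list just returns nullptr), and removal unlinks before pushing onto the free list, which is the reverse of the order described above but leaves the same end state.

```cpp
#include <cassert>

struct Holder {
  int fix_count = 0;
  const void *bufptr = nullptr;
  Holder *thrd_link = nullptr;    // link while on the in-use list
  Holder *next_holder = nullptr;  // link while on the free list
};

struct ThreadHolders {
  Holder *hold_list = nullptr;
  Holder *free_list = nullptr;
  int num_hold = 0;
};

// Pops the per-thread free list and pushes the holder onto the hold list.
Holder *allocate_holder(ThreadHolders &t, const void *bufptr) {
  Holder *h = t.free_list;
  if (h == nullptr) return nullptr;  // real code falls back to a shared pool
  t.free_list = h->next_holder;
  h->next_holder = nullptr;          // disconnect from the free list
  h->thrd_link = t.hold_list;
  t.hold_list = h;
  h->bufptr = bufptr;
  h->fix_count = 1;
  ++t.num_hold;
  return h;
}

Holder *find_holder(const ThreadHolders &t, const void *bufptr) {
  for (Holder *h = t.hold_list; h != nullptr; h = h->thrd_link) {
    assert(h->next_holder == nullptr);  // never on both lists
    if (h->bufptr == bufptr) return h;
  }
  return nullptr;
}

// Moves a fully unfixed holder back to the free list; false if it was not held.
bool remove_holder(ThreadHolders &t, Holder *h) {
  assert(h->fix_count == 0);
  Holder **link = &t.hold_list;
  while (*link != nullptr && *link != h) link = &(*link)->thrd_link;
  if (*link == nullptr) return false;
  *link = h->thrd_link;
  h->thrd_link = nullptr;
  h->next_holder = t.free_list;
  t.free_list = h;
  --t.num_hold;
  return true;
}
```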

5.3 pgbuf_latch_bcb_upon_fix — the compatibility decision tree

The caller holds the BCB mutex; a scope_exit unlock_BCB guard releases it on every exit unless a branch .release()s it. The function looks up the caller’s holder once, then runs a do { ...recompute new_impl... } while (!compare_exchange_strong) loop. request_fcnt starts at 1 and is reset at the top of every retry.

flowchart TB
  S["snapshot old_impl; new_impl = old_impl\nrequest_fcnt = 1"] --> IDLE{"buf_lock_acquired\nor latch_mode == NO_LATCH ?"}
  IDLE -- yes --> SETIDLE["is_page_idle=true\nnormalize old to clean idle\nnew: mode=request, fcnt=1"]
  IDLE -- no --> C1{"READ req on\nREAD-latched page ?"}
  C1 -- yes --> W{"waiter_exists ?"}
  W -- no --> GR1["can_latch=true; fcnt++"]
  W -- yes --> OWN{"holder != NULL ?"}
  OWN -- yes --> GR2["can_latch=true; fcnt++"]
  OWN -- no --> BLK1["can_latch=false\nwriter-starvation guard"]
  C1 -- no --> H{"holder != NULL ?"}
  H -- no --> BLK2["Case 3: can_latch=false\nwaiter_exists=true"]
  H -- yes --> WR{"latch_mode == WRITE ?"}
  WR -- yes --> GR3["Sub 2-1: can_latch=true; fcnt++"]
  WR -- no --> SOLE{"old fcnt == holder fix_count ?"}
  SOLE -- yes --> GR4["Sub 2-2: in-place promote\nmode=WRITE, fcnt=1"]
  SOLE -- no --> COND{"CONDITIONAL ?"}
  COND -- yes --> CFAIL["can_latch=false\nwaiter_exists=true, then reject"]
  COND -- no --> PROM["promote_needed=true\nfcnt -= holder fix_count\nwaiter_exists=true"]

Figure 5-2 — Every branch of the new_impl computation. The loop CASes old_impl.raw to new_impl.raw with compare_exchange_strong; on failure it re-snapshots and re-walks the whole tree.

Idle short-circuit. If buf_lock_acquired (fresh BCB, Chapter 4) or the page is PGBUF_NO_LATCH, the code normalizes old_impl to a clean idle state before building new_impl — this matters for the CAS expectation:

// pgbuf_latch_bcb_upon_fix -- src/storage/page_buffer.c
if (is_page_idle == true) {
old_impl.impl.waiter_exists = false; /* <- expect a clean word */
old_impl.impl.latch_mode = PGBUF_NO_LATCH; old_impl.impl.fcnt = 0;
new_impl = old_impl;
new_impl.impl.latch_mode = request_mode; new_impl.impl.fcnt = 1; /* grant */
}

(In SA_MODE only, a non-idle page with no holder is a leaked latch — the code assert (0)s and treats it as idle.)

Case 1 — R on R. No waiter: grant (can_latch = true; fcnt++). Waiter present: the reader may join only if already a holder (re-entrant); a brand-new reader (holder == NULL) blocks.

Invariant — readers yield to a queued writer. Once waiter_exists is set, only re-entrant readers may join an R-latch (the holder == NULL test). If violated, a stream of fresh readers would indefinitely defer the queued writer.

Case 2 — caller already a holder (not R-on-R, so the page is WRITE-latched or the caller is a R-holder asking for WRITE):

  • Sub-2-1, page WRITE-latched: re-fix (R or W) is a pure passthrough, can_latch = true; fcnt++ — the W-holder shortcut.
  • Sub-2-2, page READ-latched requesting WRITE (in-place promotion): if old_impl.impl.fcnt == holder->fix_count the caller is the sole reader, so the latch flips to WRITE in place (mode = WRITE; fcnt = 1). If other readers exist, a PGBUF_CONDITIONAL_LATCH sets waiter_exists and falls to rejection; an unconditional one sets promote_needed, deducts its own fixes (new_impl.impl.fcnt -= holder->fix_count), sets waiter_exists, and blocks as a tail waiter (see 5.4).

The “unreachable in-place upgrade” the high-level companion mentions refers to a historically-removed contended upgrade; the sole-reader in-place flip (sub-2-2) is reachable today, alongside the dedicated promotion entry point pgbuf_promote_read_latch_debug and the one-promoter assert in pgbuf_block_bcb (see summary item 3).

Case 3 — caller not a holder, request incompatible (W on R, R/W on W by a stranger): can_latch = false; waiter_exists = true; the thread blocks.
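
The idle short-circuit and Cases 1 to 3 can be condensed into a pure function that maps an old latch word to the word the CAS would try to install. This is a minimal single-threaded sketch, not the CUBRID code: every `sk_` name is hypothetical, the latch word is modelled as a plain struct rather than a packed `uint64_t`, `holder_fix_count == 0` stands in for `holder == NULL`, and the idle test ignores the `buf_lock_acquired` condition.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for illustration; not the CUBRID definitions. */
typedef enum { SK_NO_LATCH, SK_READ, SK_WRITE } sk_mode;

typedef struct
{
  sk_mode latch_mode;
  bool waiter_exists;
  int32_t fcnt;
} sk_latch;

typedef struct
{
  sk_latch next;          /* word the CAS would try to install */
  bool can_latch;
  bool promote_needed;
} sk_decision;

/* holder_fix_count == 0 models holder == NULL (the caller has no fix yet). */
sk_decision
sk_decide (sk_latch old, sk_mode request, int32_t holder_fix_count, bool conditional)
{
  sk_decision d;
  bool is_holder = holder_fix_count > 0;

  assert (request == SK_READ || request == SK_WRITE);
  d.next = old;
  d.can_latch = false;
  d.promote_needed = false;

  if (old.latch_mode == SK_NO_LATCH)
    {
      /* Idle short-circuit: install a clean word and grant. */
      d.next.latch_mode = request;
      d.next.waiter_exists = false;
      d.next.fcnt = 1;
      d.can_latch = true;
    }
  else if (old.latch_mode == SK_READ && request == SK_READ)
    {
      /* Case 1: a fresh reader yields to a queued waiter; a holder may re-enter. */
      if (!old.waiter_exists || is_holder)
        {
          d.can_latch = true;
          d.next.fcnt++;
        }
    }
  else if (is_holder && old.latch_mode == SK_WRITE)
    {
      /* Sub-2-1: the W-holder re-fixes freely. */
      d.can_latch = true;
      d.next.fcnt++;
    }
  else if (is_holder && old.fcnt == holder_fix_count)
    {
      /* Sub-2-2, sole reader: flip to WRITE in place. */
      d.can_latch = true;
      d.next.latch_mode = SK_WRITE;
      d.next.fcnt = 1;
    }
  else if (is_holder)
    {
      /* Sub-2-2, other readers present: reject (conditional) or queue a promotion. */
      d.next.waiter_exists = true;
      if (!conditional)
        {
          d.promote_needed = true;
          d.next.fcnt -= holder_fix_count;
        }
    }
  else
    {
      /* Case 3: an incompatible request from a stranger blocks. */
      d.next.waiter_exists = true;
    }
  return d;
}
```

In the real function this computation sits inside the CAS retry loop, so a failed exchange re-runs it against a fresh snapshot.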

After the CAS succeeds the function dispatches on the outcome flags:

  • is_page_idle or can_latch — granted: allocate a holder (idle / stranger paths) or bump the existing holder’s fix_count, set perf bits, update latch_last_thread, return NO_ERROR.
  • promote_needed — roll the holder’s read fixes into request_fcnt (request_fcnt += holder->fix_count), zero the holder, pgbuf_remove_thrd_holder, fall into the block path.
  • block/promote + PGBUF_CONDITIONAL_LATCH — reject ER_FAILED (raise ER_LK_PAGE_TIMEOUT first if the txn’s wait_msec == LK_ZERO_WAIT).
  • block/promote, unconditional: unlock_BCB.release(), call pgbuf_block_bcb(..., as_promote = false); on return the latch is held, so allocate a holder with fix_count = request_fcnt, set *is_latch_wait = true, return NO_ERROR.

pgbuf_block_bcb (5.4) is entered with the BCB mutex held and waiter_exists true (asserted). It stamps request_latch_mode and request_fix_count (the count to credit fcnt with on wake) on the thread entry, then enqueues according to the as_promote flag:

// pgbuf_block_bcb -- src/storage/page_buffer.c
cur_thrd_entry->request_latch_mode = request_mode;
cur_thrd_entry->request_fix_count = request_fcnt; /* SPECIAL_NOTE */
if (as_promote) {
/* Safe guard: there can be only one promoter. */
assert (bufptr->next_wait_thrd == NULL
|| !bufptr->next_wait_thrd->wait_for_latch_promote);
cur_thrd_entry->next_wait_thrd = bufptr->next_wait_thrd; /* head insert */
bufptr->next_wait_thrd = cur_thrd_entry;
} else { cur_thrd_entry->next_wait_thrd = NULL; /* ... walk to tail, link ... */ }

The as_promote flag is the caller’s choice, and it splits two distinct callers:

  • Head insert (as_promote = true) is used only by pgbuf_promote_read_latch_debug (the explicit pgbuf_promote_read_latch path): a promoter that already released its read latch must win the race against fresh waiters, so it jumps the queue. The assert enforces at most one promoter in the queue.
  • Tail insert (as_promote = false, FIFO) is used by every other caller — including the promote_needed branch of pgbuf_latch_bcb_upon_fix (5.3), which calls pgbuf_block_bcb(..., false). So a promotion discovered during a fix enqueues at the tail, not the head; only the dedicated promote API head-inserts. The async-flush path (Chapter 8) also tail-inserts with PGBUF_LATCH_FLUSH.
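
The two insertion disciplines can be shown on a bare singly linked list. This is an illustrative sketch only: `sk_waiter` and `sk_enqueue` are hypothetical names, and the real queue links thread entries through next_wait_thrd under the BCB mutex.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical waiter node; the real queue links thread entries. */
typedef struct sk_waiter
{
  int id;
  struct sk_waiter *next;
} sk_waiter;

/* Head insert for a promoter, tail insert (FIFO) for everyone else. */
void
sk_enqueue (sk_waiter **head, sk_waiter *w, bool as_promote)
{
  sk_waiter **link = head;

  assert (head != NULL && w != NULL);
  if (as_promote)
    {
      w->next = *head;          /* jump the queue */
      *head = w;
      return;
    }
  w->next = NULL;
  while (*link != NULL)         /* walk to the tail */
    link = &(*link)->next;
  *link = w;
}
```

Enqueueing two ordinary waiters and then one promoter leaves the promoter first and the ordinary waiters in arrival order behind it.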

Then it sleeps by mode:

  • PGBUF_LATCH_FLUSH (flush-waiter, Chapter 8) sleeps with no timeout via thread_suspend_wakeup_and_unlock_entry; on a non-RESUMED wake (interrupt) it re-locks the BCB, unlinks itself from next_wait_thrd, and returns ER_FAILED.
  • R/W waiters go through pgbuf_timed_sleep. CUBRID builds no wait-for graph across page latches; it relies on timeout — “When the request is waken up by timeout, the request is treated as a victim.” On a successful return the function sets bufptr->latch_last_thread = thread_p.

5.5 pgbuf_timed_sleep and pgbuf_timed_sleep_error_handling


pgbuf_timed_sleep locks the thread entry, then drops the BCB mutex (ordering: thread entry inside BCB), computes the timeout, and suspends:

// pgbuf_timed_sleep -- src/storage/page_buffer.c
thread_lock_entry (thread_p); PGBUF_BCB_UNLOCK (bufptr);
old_wait_msecs = wait_secs = pgbuf_find_current_wait_msecs (thread_p);
/* LK_ZERO_WAIT/LK_FORCE_ZERO_WAIT -> 0, else wait_secs = pgbuf_latch_timeout */
try_again:
to.tv_sec = (int) time (NULL) + wait_secs;
thread_p->resume_status = THREAD_PGBUF_SUSPENDED;
r = thread_suspend_timeout_wakeup_and_unlock_entry (thread_p, &to, THREAD_PGBUF_SUSPENDED);

pgbuf_latch_timeout defaults to 300 * 1000, reset from PRM_ID_PAGE_LATCH_TIMEOUT at boot. Three return branches:

  1. NO_ERROR — signalled. Re-lock the entry. If resume_status == THREAD_PGBUF_RESUMED a waker granted the latch (5.6) — return NO_ERROR (the latch is already ours, fcnt bumped by the waker). Else an interrupt: set request_latch_mode = PGBUF_NO_LATCH, call the error handler, raise ER_INTERRUPTED, return ER_FAILED.
  2. ER_CSS_PTHREAD_COND_TIMEDOUT — timed out. If RESUMED in the race, return NO_ERROR. If the txn is no longer active (logtb_is_current_active false) loop to try_again (don’t time out a committing/aborting txn). Else a page-latch deadlock victim: save the mode, set request_latch_mode = PGBUF_NO_LATCH (the marker the waker uses to skip us), run the error handler, goto er_set_return.
  3. else (a pthread_cond failure): er_set_with_oserror (ER_CSS_PTHREAD_COND_TIMEDWAIT), return ER_FAILED.
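
The three return branches reduce to a small decision table. The sketch below restates that table as a pure function; the enums and `sk_classify_wake` are hypothetical names introduced for illustration, and the real function also sets errors and mutates the thread entry, which is omitted here.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical labels for the suspend call's return value. */
typedef enum { SK_RC_SIGNALLED, SK_RC_TIMEDOUT, SK_RC_OTHER } sk_rc;

/* Hypothetical labels for what the sleeper does next. */
typedef enum
{
  SK_GRANTED,           /* a waker handed over the latch */
  SK_INTERRUPTED,       /* signalled without a grant */
  SK_RETRY_SLEEP,       /* timed out, but the txn is not active: sleep again */
  SK_DEADLOCK_VICTIM,   /* timed out in an active txn */
  SK_OS_ERROR           /* condition-variable failure */
} sk_outcome;

sk_outcome
sk_classify_wake (sk_rc rc, bool resumed, bool txn_active)
{
  switch (rc)
    {
    case SK_RC_SIGNALLED:
      return resumed ? SK_GRANTED : SK_INTERRUPTED;
    case SK_RC_TIMEDOUT:
      if (resumed)
        return SK_GRANTED;      /* grant raced with the timeout */
      return txn_active ? SK_DEADLOCK_VICTIM : SK_RETRY_SLEEP;
    default:
      return SK_OS_ERROR;
    }
}
```

Checking the resumed flag in both the signalled and the timed-out branch is what keeps a grant that races with the timeout from being discarded.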

er_set_return formats by the original wait spec, then releases the BCB mutex and returns ER_FAILED:

  • LK_INFINITE_WAIT: ER_PAGE_LATCH_TIMEDOUT then ER_LK_UNILATERALLY_ABORTED (guarded by an assert (0) marked FIXME).
  • positive old_wait_msecs: ER_PAGE_LATCH_TIMEDOUT + ER_LK_PAGE_TIMEOUT (the latter reports save_request_latch_mode).
  • otherwise: just unlock.

pgbuf_timed_sleep_error_handling runs when a waiter abandons the queue, re-locks the BCB, and unlinks the thread in three cases:

flowchart TB
  L["PGBUF_BCB_LOCK"] --> E{"next_wait_thrd == NULL ?"}
  E -- yes --> R0["case 1: already removed by waker, return"]
  E -- no --> F{"head == thrd_entry ?"}
  F -- no --> M["case 2: walk list, unlink thrd_entry, return"]
  F -- yes --> H["case 3: pop head\nthen wake consecutive READ waiters\nuntil non-grantable or WRITE"]

Figure 5-3 — Three removal cases. Only case 3 (the abandoning thread at the head) must repair the queue by waking the readers it was shadowing. In case 3 it pops the head, then loops: for each follower, if the page latch_mode == READ and the waiter wants READ it CASes (compare_exchange_weak) fcnt += request_fix_count, locks the entry, unlinks it, and wakes it (pgbuf_wakeup); a WRITE waiter or non-grantable state breaks the loop.
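
The three removal cases can be sketched on a plain linked list. This is a single-threaded illustration under stated assumptions: all `sk_` names are hypothetical, the CAS on fcnt is replaced by a plain addition, locking and the actual wake call are reduced to a `woken` flag, and FLUSH or abandoned followers are not modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum { SK_NO_LATCH, SK_READ, SK_WRITE } sk_mode;

typedef struct sk_waiter
{
  sk_mode request_mode;
  int request_fix_count;
  bool woken;
  struct sk_waiter *next;
} sk_waiter;

typedef struct
{
  sk_mode latch_mode;
  int fcnt;
  sk_waiter *head;      /* plays the role of next_wait_thrd */
} sk_page;

/* Returns which of the three cases applied. */
int
sk_abandon (sk_page *page, sk_waiter *me)
{
  assert (page != NULL && me != NULL);
  if (page->head == NULL)
    return 1;                   /* case 1: a waker already removed us */

  if (page->head != me)
    {
      /* case 2: unlink from the middle or tail, if still linked */
      for (sk_waiter *p = page->head; p->next != NULL; p = p->next)
        if (p->next == me)
          {
            p->next = me->next;
            break;
          }
      me->next = NULL;
      return 2;
    }

  /* case 3: pop the head, then wake the READ waiters it was shadowing */
  page->head = me->next;
  me->next = NULL;
  while (page->head != NULL && page->latch_mode == SK_READ
         && page->head->request_mode == SK_READ)
    {
      sk_waiter *w = page->head;
      page->fcnt += w->request_fix_count;   /* the real code CASes this */
      page->head = w->next;
      w->next = NULL;
      w->woken = true;
    }
  return 3;
}
```

Only case 3 touches the latch word, because only a departing head can have been the reason grantable readers were left waiting.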

5.6 pgbuf_wakeup_reader_writer — ordered wake on unlatch


When unfix drops fcnt to 0 and resets the mode to PGBUF_NO_LATCH (both asserted on entry), this function walks next_wait_thrd once and grants what it can. The caller holds the BCB mutex.

// pgbuf_wakeup_reader_writer -- src/storage/page_buffer.c
for (thrd_entry = bufptr->next_wait_thrd; thrd_entry != NULL; thrd_entry = next_thrd_entry) {
next_thrd_entry = thrd_entry->next_wait_thrd;
if (thrd_entry->request_latch_mode == PGBUF_NO_LATCH) { /* unlink, continue -- corpse */ }
if (thrd_entry->request_latch_mode == PGBUF_LATCH_FLUSH) {
assert (pgbuf_bcb_is_async_flush_request (bufptr) || pgbuf_bcb_is_flushing (bufptr));
prev_thrd_entry = thrd_entry; continue; /* skip -- leave in list, do NOT wake */
}
/* ... R/W grant via compare_exchange_strong loop ... */
}

Branch by branch:

  1. PGBUF_NO_LATCH waiter — a thread that gave up (timed out / interrupted, not yet self-removed). Unlink and continue — “clean a timed-out waiter”.

  2. PGBUF_LATCH_FLUSH waiter — not a latch holder; flush wakes it separately. Advance prev_thrd_entry and continue, leaving it in place — the “skip the FLUSH waiter” rule. Advancing prev rather than unlinking keeps the FLUSH entry queued and followers reachable behind it.

  3. R/W waiter — enter the inner CAS loop (compare_exchange_strong) on a fresh impl:

    • latch_mode == NO_LATCH, or (latch_mode == READ and waiter wants READ): grantable. Lock the thread entry; re-check request_latch_mode == PGBUF_NO_LATCH (it may have timed out between the outer test and the lock) — if so unlink, can_grant = false, break. Else can_grant = true, add request_fix_count to fcnt, set latch_mode to the waiter’s mode.
    • Else if latch_mode == READ (page already R-granted this pass, this waiter wants WRITE): set prev_thrd_entry = thrd_entry and break only the inner CAS loop. should_stop stays false. The writer is left in place and skipped; the outer walk continues to look for more READ waiters behind it (“Look for other readers.”). The walk truly stops only at the WRITE arm.
    • Else (latch_mode == WRITE): should_stop = true, break.

    After the inner loop, should_stop breaks the outer loop; otherwise can_grant unlinks the waiter and pgbuf_wakeups it.

Net effect (matching the header comment): each READ grant leaves latch_mode == READ, so all consecutive READ waiters at the head are woken; a WRITE waiter met while the page is R-granted is skipped and scanning continues behind it; only a held WRITE latch (the WRITE arm via should_stop) ends the pass, so at most one writer is granted.

Finally the function recomputes the hint:

// pgbuf_wakeup_reader_writer -- src/storage/page_buffer.c
if (!pgbuf_is_exist_blocked_reader_writer (bufptr))
set_waiter_exists (&bufptr->atomic_latch, false); /* clear the guard */

Invariant — waiter_exists is true iff a R/W waiter is queued. pgbuf_is_exist_blocked_reader_writer walks next_wait_thrd and counts only PGBUF_LATCH_READ/PGBUF_LATCH_WRITE entries (FLUSH and NO_LATCH don’t count). A stale bit after the last R/W waiter was woken would make Case 1 of pgbuf_latch_bcb_upon_fix block fresh readers forever (phantom starvation).
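
The wake pass and the hint recomputation described in 5.6 can be simulated over an array of waiters. This is a hedged, single-threaded sketch: the `sk_` types are hypothetical, the queue is an array instead of a linked list, and the inner CAS loop and the thread-entry re-check are collapsed into plain assignments.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum { SK_NO_LATCH, SK_READ, SK_WRITE, SK_FLUSH } sk_mode;
typedef enum { SK_QUEUED, SK_WOKEN, SK_REMOVED } sk_state;

typedef struct
{
  sk_mode request_mode;
  int request_fix_count;
  sk_state state;
} sk_waiter;

typedef struct
{
  sk_mode latch_mode;
  int fcnt;
  bool waiter_exists;
} sk_page;

/* Walks the queue once in order; returns how many waiters were woken. */
int
sk_wakeup_reader_writer (sk_page *page, sk_waiter *q, size_t n)
{
  int woken = 0;

  assert (page->latch_mode == SK_NO_LATCH && page->fcnt == 0);
  for (size_t i = 0; i < n; i++)
    {
      sk_waiter *w = &q[i];
      if (w->request_mode == SK_NO_LATCH)
        {
          w->state = SK_REMOVED;        /* waiter that gave up: unlink */
          continue;
        }
      if (w->request_mode == SK_FLUSH)
        continue;                       /* left queued, not woken here */
      if (page->latch_mode == SK_NO_LATCH
          || (page->latch_mode == SK_READ && w->request_mode == SK_READ))
        {
          page->fcnt += w->request_fix_count;
          page->latch_mode = w->request_mode;
          w->state = SK_WOKEN;
          woken++;
        }
      else if (page->latch_mode == SK_READ)
        continue;                       /* writer skipped, keep looking for readers */
      else
        break;                          /* WRITE is held: stop the pass */
    }

  /* Recompute the hint: only queued READ/WRITE waiters count. */
  page->waiter_exists = false;
  for (size_t i = 0; i < n; i++)
    if (q[i].state == SK_QUEUED
        && (q[i].request_mode == SK_READ || q[i].request_mode == SK_WRITE))
      page->waiter_exists = true;
  return woken;
}
```

With the queue R, W, R, FLUSH plus one abandoned entry on an idle page, both readers are granted, the writer and the FLUSH waiter stay queued, and waiter_exists stays true because the writer is still waiting.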

  1. The latch is one uint64_t (pgbuf_atomic_latch_impl): latch_mode | waiter_exists | fcnt (16 + 16 + 32 bits). Single-word CASes are the only thing that makes the mode/count triple tear-free under concurrency.
  2. pgbuf_latch_bcb_upon_fix is a branch-complete tree retried under one compare_exchange_strong: idle short-circuit; R-on-R (grant unless a waiter exists and the caller is not a holder — the writer-starvation guard); W-holder passthrough (sub-2-1); sole-reader in-place promotion (sub-2-2); and block paths for strangers (Case 3) and contended promotions.
  3. The sole-reader in-place R-to-W flip (sub-2-2) is live and reachable; the source-anchored promotion API is pgbuf_promote_read_latch_debug plus the one-promoter assert in pgbuf_block_bcb. Any deprecation of a contended in-place upgrade is external project history, not visible in this source.
  4. Holders (pgbuf_holder) record this thread’s per-BCB fix_count; comparing it against the latch word’s fcnt decides “am I the only fixer?”. allocate/find/remove keep a holder on exactly one of the free or hold lists, never both (the next_holder == NULL assert).
  5. Blocking is timed, not graph-based: pgbuf_block_bcb enqueues on next_wait_thrd, at the tail for ordinary waiters and the promote_needed branch of the fix path, at the head only for the dedicated pgbuf_promote_read_latch_debug API (as_promote). pgbuf_timed_sleep waits pgbuf_latch_timeout; a timeout makes the waiter a deadlock victim and pgbuf_timed_sleep_error_handling removes it (repairing the head case by waking shadowed readers).
  6. pgbuf_wakeup_reader_writer walks the queue once: clean NO_LATCH corpses, skip FLUSH waiters, wake all consecutive READ waiters, grant at most one WRITE. A WRITE waiter met mid-pass while readers are granted does not stop further reader grants — only a held WRITE latch (should_stop) ends the walk.
  7. waiter_exists is a hint recomputed exactly by pgbuf_is_exist_blocked_reader_writer after every wake, so a stale bit cannot phantom-starve fresh readers.

Chapter 6: Dirtying a Page and the Packed Flags and Zone Word


A modification to a page is three small acts under the BCB’s write latch: stamp the page image with the redo LSA, mark the BCB dirty, and — the first time only — record its oldest unflushed LSA. None takes a lock; they mutate one 32-bit word, bcb->flags, with a lock-free compare-and-swap retry loop, plus one separately-packed counter word. This chapter dissects that accessor layer, which every later chapter — unfix (Ch 7), flush under WAL (Ch 8), victim selection (Ch 9) — reads or mutates, so its invariants are load-bearing. For the why — the WAL contract, the checkpoint horizon — see the high-level companion; this chapter does not re-derive that theory.

6.1 One word, three fields: flags, zone, lru index


PGBUF_BCB::flags is volatile int. The source’s own comment fixes the layout exactly (Figure 6-1): “(bcb flags + zone = 2 bytes) + (lru index = 2 bytes)”. So the 32-bit word splits at the half: the low 16 bits are the LRU list index (PGBUF_LRU_INDEX_MASK = 0x0000FFFF, since PGBUF_LRU_NBITS = 16), and the high 16 bits carry the flag bits and the zone selector together.

// pgbuf_bcb (struct) -- src/storage/page_buffer.c
volatile int flags; /* <- packed: flag bits + zone + lru index */
// ... condensed ...
volatile int count_fix_and_avoid_dealloc; /* <- a SECOND packed word, see 6.8 */
LOG_LSA oldest_unflush_lsa; /* <- WAL watermark, established once per dirty cycle */

Within that high half the two namespaces are bit-disjoint: the seven flag bits occupy the very top (0x80000000..0x02000000, bits 25-31), while the zone enum sits just above the index, at bits 16-19. Zone values are shifts of PGBUF_LRU_NBITS: the LRU sub-zones are 1<<16, 2<<16, 3<<16; the non-LRU zones skip two further bits (PGBUF_LRU_NBITS + 2 = 18) so they cannot collide with the LRU mask — PGBUF_INVALID_ZONE = 1<<18, PGBUF_VOID_ZONE = 2<<18. Because PGBUF_BCB_FLAGS_MASK and PGBUF_ZONE_MASK | PGBUF_LRU_INDEX_MASK never share a bit, pgbuf_bcb_update_flags can touch only flag bits and pgbuf_bcb_change_zone only zone+index — each preserving the other — both via CAS on the same word.

A BCB is born with PGBUF_BCB_INIT_FLAGS = PGBUF_INVALID_ZONE: no flag bits set, zone = INVALID, index 0. That is the start state of Figure 6-3.

The complete zone catalogue:

| Zone value | Bits | Numeric | Meaning | Set by (zone moves go through pgbuf_bcb_change_zone) |
| --- | --- | --- | --- | --- |
| PGBUF_INVALID_ZONE | 1<<18 | 0x00040000 | free/uninitialized BCB (on invalid list) | PGBUF_BCB_INIT_FLAGS; reset on free |
| PGBUF_VOID_ZONE | 2<<18 | 0x00080000 | transient: read from disk before list insert, or removed from list before victimizing | read-from-disk path; victim extraction |
| PGBUF_LRU_1_ZONE | 1<<16 | 0x00010000 | hottest LRU sub-zone; never victimized | unfix/boost into zone 1 |
| PGBUF_LRU_2_ZONE | 2<<16 | 0x00020000 | buffer sub-zone between hot and victim; never victimized | LRU zone adjustment |
| PGBUF_LRU_3_ZONE | 3<<16 | 0x00030000 | victimization sub-zone; only zone with eligible candidates | LRU zone adjustment / fall from zone 2 |

Three masks decode it: PGBUF_LRU_ZONE_MASK (= PGBUF_LRU_1_ZONE | PGBUF_LRU_2_ZONE | PGBUF_LRU_3_ZONE = 0x00030000) ORs the three LRU sub-zone values; PGBUF_ZONE_MASK (= PGBUF_LRU_ZONE_MASK | PGBUF_INVALID_ZONE | PGBUF_VOID_ZONE = 0x000F0000) covers every zone; PGBUF_LRU_INDEX_MASK carries the low-16 list index. PGBUF_GET_ZONE(flags) is (PGBUF_ZONE)(flags & PGBUF_ZONE_MASK).
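
The layout arithmetic in 6.1 can be checked with a few macros. The sketch below mirrors the values quoted above under `SK_` names that are hypothetical stand-ins; it uses `uint32_t` where the source uses `volatile int`, and `sk_change_zone_word` only reproduces the word computation, not the CAS or the bookkeeping.

```c
#include <assert.h>
#include <stdint.h>

#define SK_LRU_NBITS       16
#define SK_LRU_INDEX_MASK  0x0000FFFFu
#define SK_LRU_1_ZONE      (1u << SK_LRU_NBITS)
#define SK_LRU_2_ZONE      (2u << SK_LRU_NBITS)
#define SK_LRU_3_ZONE      (3u << SK_LRU_NBITS)
#define SK_LRU_ZONE_MASK   (SK_LRU_1_ZONE | SK_LRU_2_ZONE | SK_LRU_3_ZONE)
#define SK_INVALID_ZONE    (1u << (SK_LRU_NBITS + 2))
#define SK_VOID_ZONE       (2u << (SK_LRU_NBITS + 2))
#define SK_ZONE_MASK       (SK_LRU_ZONE_MASK | SK_INVALID_ZONE | SK_VOID_ZONE)
#define SK_FLAGS_MASK      0xFE000000u  /* the seven flag bits, 25..31 */

uint32_t
sk_get_zone (uint32_t flags)
{
  return flags & SK_ZONE_MASK;
}

uint32_t
sk_get_lru_index (uint32_t flags)
{
  return flags & SK_LRU_INDEX_MASK;
}

int
sk_is_in_lru (uint32_t flags)
{
  return (sk_get_zone (flags) & SK_LRU_ZONE_MASK) != 0;
}

/* Word computed by a zone move: keep the flag bits, replace zone + index. */
uint32_t
sk_change_zone_word (uint32_t old_flags, uint32_t lru_idx, uint32_t zone)
{
  assert ((lru_idx & ~SK_LRU_INDEX_MASK) == 0);
  assert ((zone & ~SK_ZONE_MASK) == 0);
  return (old_flags & SK_FLAGS_MASK) | zone | lru_idx;
}
```

Because the flag mask and the zone-plus-index mask share no bit, a flag update and a zone move can each rebuild their own part of the word and copy the other part through unchanged.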

6.2 The flag catalogue: every bit, producer, and consumer


Seven flags live in the high bits of flags. The composite PGBUF_BCB_INVALID_VICTIM_CANDIDATE_MASK is the OR of the first four — the states that disqualify a BCB from being victimized; “Blocks victim?” marks membership in that mask. The table carries producer, clearer, and reader per flag.

| Flag | Bit | Producer | Cleared by | Reader | Blocks victim? |
| --- | --- | --- | --- | --- | --- |
| PGBUF_BCB_DIRTY_FLAG | 0x80000000 | pgbuf_bcb_set_dirty, _update_flags | _clear_dirty, _mark_is_flushing | pgbuf_bcb_is_dirty | yes |
| PGBUF_BCB_FLUSHING_TO_DISK_FLAG | 0x40000000 | pgbuf_bcb_mark_is_flushing | _mark_was_flushed / _was_not_flushed | pgbuf_bcb_is_flushing | yes |
| PGBUF_BCB_VICTIM_DIRECT_FLAG | 0x20000000 | direct-victim hand-off (Ch 9) | replaced by INVALIDATE | pgbuf_bcb_is_direct_victim | yes |
| PGBUF_BCB_INVALIDATE_DIRECT_VICTIM_FLAG | 0x10000000 | fixer grabbing a direct victim (Ch 4/5) | when the waiter re-requests | pgbuf_bcb_is_invalid_direct_victim | yes |
| PGBUF_BCB_MOVE_TO_LRU_BOTTOM_FLAG | 0x08000000 | dealloc path | unfix that moves it (Ch 7) | pgbuf_bcb_should_be_moved_to_bottom_lru | no |
| PGBUF_BCB_TO_VACUUM_FLAG | 0x04000000 | pgbuf_notify_vacuum_follows | vacuum routing | pgbuf_bcb_is_to_vacuum | no |
| PGBUF_BCB_ASYNC_FLUSH_REQ | 0x02000000 | async flush requesters | pgbuf_bcb_mark_is_flushing | pgbuf_bcb_is_async_flush_request | no |

mark_is_flushing (when the page is dirty) clears both DIRTY — the flush captured the image — and ASYNC_FLUSH_REQ — the request is now in flight — while setting FLUSHING: one transition swaps three bits.

// PGBUF_BCB_INVALID_VICTIM_CANDIDATE_MASK -- src/storage/page_buffer.c
#define PGBUF_BCB_INVALID_VICTIM_CANDIDATE_MASK \
(PGBUF_BCB_DIRTY_FLAG \
| PGBUF_BCB_FLUSHING_TO_DISK_FLAG \
| PGBUF_BCB_VICTIM_DIRECT_FLAG \
| PGBUF_BCB_INVALIDATE_DIRECT_VICTIM_FLAG) /* <- the 4 disqualifiers; the other 3 flags are victim-neutral */

Invariant — the victim-candidate count tracks exactly the BCBs in LRU zone 3 with none of the four disqualifier bits. Any transition adding/removing a disqualifier bit while the BCB sits in zone 3 must symmetrically add/remove it from candidacy, enforced in pgbuf_bcb_update_flags, pgbuf_bcb_change_zone, and the pgbuf_bcb_set_dirty fast path. Omit it in any one and the LRU victim counter drifts, so victimizers skip valid candidates or chase phantoms (Ch 9). pgbuf_bcb_avoid_victim is the read-side query of the same mask.

6.3 The shared CAS-loop shape: pgbuf_bcb_update_flags and pgbuf_bcb_change_zone


Both share one lock-free skeleton — read bcb->flags, compute the new word, CAS, retry — differing only in which half they recompute and what they reconcile afterward.

pgbuf_bcb_update_flags is the general flag mutator: set some bits, clear others, preserving zone and unnamed flags. Every flag transition except the dirty fast path goes through it.

// pgbuf_bcb_update_flags -- src/storage/page_buffer.c
assert ((set_flags & (~PGBUF_BCB_FLAGS_MASK)) == 0); /* <- callers may only touch flag bits ... */
assert ((clear_flags & (~PGBUF_BCB_FLAGS_MASK)) == 0); /* <- ... never zone/index bits */
do
{
old_flags = bcb->flags;
new_flags = old_flags | set_flags;
new_flags = new_flags & (~clear_flags);
if (old_flags == new_flags)
return; /* <- no-op: bits already as desired, skip CAS + bookkeeping (contention saver) */
}
while (!ATOMIC_CAS_32 (&bcb->flags, old_flags, new_flags));

pgbuf_bcb_change_zone does the opposite: same loop, but the assignment recomputes zone+index — new_flags = (old_flags & PGBUF_BCB_FLAGS_MASK) | new_zone_idx; where new_zone_idx = PGBUF_MAKE_ZONE (new_lru_idx, new_zone) — preserving all flag bits, and it has no early no-op return.

After the CAS the two diverge. update_flags runs two fix-ups (Figure 6-2): a zone-3 victim-candidacy adjustment (only when PGBUF_GET_ZONE (old_flags) == PGBUF_LRU_3_ZONE — read from old_flags since the zone never changes here), and a dirties_cnt adjustment keyed on whether DIRTY toggled, closed by an assert pinning 0 <= dirties_cnt <= num_buffers.

flowchart TD
  A["enter: set_flags, clear_flags — Figure 6-2"] --> B["old = bcb->flags<br/>new = (old | set) & ~clear"]
  B --> C{"old == new?"}
  C -->|yes| R["return (no-op)"]
  C -->|no| D{"CAS(flags, old, new)?"}
  D -->|fail| B
  D -->|ok| E{"zone(old) == LRU_3?"}
  E -->|yes| F{"victim candidacy changed?"}
  F -->|"became valid"| G["lru_add_victim_candidate"]
  F -->|"became invalid"| H["lru_remove_victim_candidate"]
  F -->|"no change"| I
  E -->|no| I["dirty bit toggled?"]
  G --> I
  H --> I
  I -->|"set->clear"| J["dirties_cnt -= 1"]
  I -->|"clear->set"| K["dirties_cnt += 1"]
  I -->|"unchanged"| L["assert range; done"]
  J --> L
  K --> L
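
The flow in Figure 6-2 can be sketched end to end with C11 atomics. This is an approximation under stated assumptions: `sk_bcb`, `sk_counters`, and the `SK_` masks are hypothetical, `atomic_compare_exchange_weak` stands in for ATOMIC_CAS_32, and the two counters are plain fields updated directly, where the real code calls the add/remove victim-candidate routines and updates a shared dirties_cnt.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define SK_DIRTY_FLAG           0x80000000u
#define SK_INVALID_VICTIM_MASK  0xF0000000u   /* the four disqualifier bits */
#define SK_FLAGS_MASK           0xFE000000u   /* all seven flag bits */
#define SK_ZONE_MASK            0x000F0000u
#define SK_LRU_3_ZONE           0x00030000u

typedef struct
{
  _Atomic uint32_t flags;
} sk_bcb;

/* Plain counters keep the sketch single-threaded. */
typedef struct
{
  long dirties_cnt;
  long victim_candidates;
} sk_counters;

void
sk_update_flags (sk_bcb *bcb, sk_counters *cnt, uint32_t set_flags, uint32_t clear_flags)
{
  uint32_t old_flags, new_flags;

  assert ((set_flags & ~SK_FLAGS_MASK) == 0);
  assert ((clear_flags & ~SK_FLAGS_MASK) == 0);

  old_flags = atomic_load (&bcb->flags);
  do
    {
      new_flags = (old_flags | set_flags) & ~clear_flags;
      if (new_flags == old_flags)
        return;                 /* no-op: skip the CAS and all bookkeeping */
    }
  while (!atomic_compare_exchange_weak (&bcb->flags, &old_flags, new_flags));

  /* Fix-up 1: candidacy only matters for a BCB sitting in LRU zone 3. */
  if ((old_flags & SK_ZONE_MASK) == SK_LRU_3_ZONE)
    {
      int was_valid = (old_flags & SK_INVALID_VICTIM_MASK) == 0;
      int is_valid = (new_flags & SK_INVALID_VICTIM_MASK) == 0;
      cnt->victim_candidates += is_valid - was_valid;
    }

  /* Fix-up 2: the dirty counter moves only when the DIRTY bit toggled. */
  cnt->dirties_cnt += ((new_flags & SK_DIRTY_FLAG) != 0) - ((old_flags & SK_DIRTY_FLAG) != 0);
  assert (cnt->dirties_cnt >= 0);
}
```

On a failed exchange `atomic_compare_exchange_weak` reloads `old_flags`, so the loop recomputes `new_flags` from the fresh value, which is the same retry shape as the source loop.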

change_zone reconciles per-list zone counters and victim candidacy. Zone moves run under the LRU list mutex, so count_lru1/2/3 are plain increments, not atomics — only the CAS on flags is lock-free, since a concurrent pgbuf_set_dirty may flip a flag bit on the same word with no mutex. Branch map:

  1. is_valid_victim_candidate = (old_flags & INVALID_VICTIM_CANDIDATE_MASK) == 0 — a flags property, unchanged by the move, so it holds on both sides.
  2. Leaving (old_flags & PGBUF_LRU_ZONE_MASK): decrement lru_shared_pgs_cnt if the old list was shared; switch on old zone decrementing the right count_lruN; in zone 3 and valid-candidate, pgbuf_lru_remove_victim_candidate; default: assert(false).
  3. Entering (new_zone & PGBUF_LRU_ZONE_MASK): symmetric increments, with pgbuf_lru_add_victim_candidate in zone 3 for a valid candidate; default: assert(false).

The default: assert(false) arms encode a totality invariant: an LRU-zone BCB is in exactly one of zones 1/2/3 (the zone field is a single value, not a mask of memberships). A second assert guards hint coherence: lru_list->victim_hint != bcb || zone(old) != LRU_3 — the hint must already have been retargeted before a zone-3 BCB leaves, unless checkpoint (via update_flags) is concurrently retargeting it.

stateDiagram-v2
  [*] --> INVALID: init flags = PGBUF_INVALID_ZONE
  INVALID --> VOID: read from disk
  VOID --> LRU: unfix inserts into list
  LRU --> LRU: adjust zones 1 to 2 to 3
  LRU --> VOID: selected as victim
  VOID --> [*]: reused for new page
  note right of LRU
    only LRU_3 sub-zone
    is victim-eligible
  end note

Figure 6-3: zone transitions driven by pgbuf_bcb_change_zone. The flag namespace rides along untouched on every edge.

6.4 pgbuf_bcb_get_zone and the decode macros


pgbuf_bcb_get_zone is a pure decode — it masks the word and returns the zone enum:

// pgbuf_bcb_get_zone -- src/storage/page_buffer.c
STATIC_INLINE PGBUF_ZONE
pgbuf_bcb_get_zone (const PGBUF_BCB * bcb)
{
return PGBUF_GET_ZONE (bcb->flags); /* <- (flags & PGBUF_ZONE_MASK) */
}

Two macros build on it to answer the two questions later chapters ask most:

// PGBUF_IS_BCB_IN_LRU* -- src/storage/page_buffer.c
#define PGBUF_IS_BCB_IN_LRU_VICTIM_ZONE(bcb) (pgbuf_bcb_get_zone (bcb) == PGBUF_LRU_3_ZONE)
#define PGBUF_IS_BCB_IN_LRU(bcb) ((pgbuf_bcb_get_zone (bcb) & PGBUF_LRU_ZONE_MASK) != 0)

PGBUF_IS_BCB_IN_LRU_VICTIM_ZONE is exact-equality (only zone 3 is victim-eligible); PGBUF_IS_BCB_IN_LRU is a mask test — PGBUF_LRU_ZONE_MASK ORs all three LRU sub-zone bits, so zones 1/2/3 match but VOID (2<<18) and INVALID (1<<18) do not, since their bits fall outside the mask. pgbuf_bcb_get_lru_index asserts PGBUF_IS_BCB_IN_LRU before returning the low-16 index.

6.5 Setting dirty: three entry points, one fast path


A modifier reaches dirty through a tiny call chain. The public pgbuf_set_dirty recovers the BCB via CAST_PGPTR_TO_BFPTR, validates vpid (debug only), delegates to pgbuf_set_dirty_buffer_ptr, then unfixes only if the caller passed free_page == FREE. pgbuf_set_dirty_buffer_ptr is the latch/perf layer over the real mutator:

// pgbuf_set_dirty_buffer_ptr -- src/storage/page_buffer.c
pgbuf_bcb_set_dirty (thread_p, bufptr);
holder = pgbuf_find_thrd_holder (thread_p, bufptr);
assert (get_latch (&bufptr->atomic_latch) == PGBUF_LATCH_WRITE); /* <- dirtier MUST hold the write latch */
assert (holder != NULL);
// ... condensed: mark holder->perf_stat.dirtied_by_holder, perfmon PSTAT_PB_NUM_DIRTIES ...

Invariant — a page is only dirtied while its setter holds the BCB write latch. The assert (get_latch (...) == PGBUF_LATCH_WRITE) enforces it, serializing concurrent writers (Ch 5) so the DIRTY/LSA pair stays consistent though each is mutated lock-free. The CAS in pgbuf_bcb_set_dirty defends only against other threads racing on unrelated bits of the same word (a no-latch change_zone), not against two writers.

pgbuf_bcb_set_dirty is a hand-coded fast path that bypasses update_flags because dirtying is the hottest case (the source comment says so explicitly):

// pgbuf_bcb_set_dirty -- src/storage/page_buffer.c
do
{
old_flags = bcb->flags;
if (old_flags & PGBUF_BCB_DIRTY_FLAG)
return; /* <- already dirty: skip CAS + counter (common case) */
}
while (!ATOMIC_CAS_32 (&bcb->flags, old_flags, old_flags | PGBUF_BCB_DIRTY_FLAG));
ATOMIC_INC_64 (&pgbuf_Pool.monitor.dirties_cnt, 1); /* <- dirties_cnt += 1; assert range follows */
if (PGBUF_GET_ZONE (old_flags) == PGBUF_LRU_3_ZONE
&& (old_flags & PGBUF_BCB_INVALID_VICTIM_CANDIDATE_MASK) == 0)
pgbuf_lru_remove_victim_candidate (thread_p, pgbuf_lru_list_from_bcb (bcb), bcb); /* <- newly dirty -> drop candidacy */

Branch map: (1) already-dirty → early return; (2) CAS sets the bit; (3) dirties_cnt += 1 then assert range; (4) if the BCB was a valid zone-3 candidate before DIRTY (the test reads old_flags), the new bit disqualifies it, so remove it — the §6.2 invariant inlined for speed. Note this fast path only ever sets DIRTY, so unlike update_flags it never needs the dirty-cleared branch or the add-candidate branch.

6.6 Recording the oldest unflushed LSA — pgbuf_set_lsa


pgbuf_set_lsa (log/recovery manager only) stamps the redo LSA and establishes oldest_unflush_lsa once per dirty cycle, with special branches for temporary and auxiliary volumes:

// pgbuf_set_lsa -- src/storage/page_buffer.c
// ... condensed: debug-gated page-pointer validation may return NULL; assert (lsa_ptr != NULL) ...
if (pgbuf_is_temp_lsa (bufptr->iopage_buffer->iopage.prv.lsa)
|| PGBUF_IS_AUXILIARY_VOLUME (bufptr->vpid.volid) == true)
return NULL; /* <- branch 2: temp/aux pages are never WAL-tracked: bail */
if (pgbuf_is_temporary_volume (bufptr->vpid.volid) == true)
{
pgbuf_init_temp_page_lsa (&bufptr->iopage_buffer->iopage, IO_PAGESIZE); /* <- branch 3: force sentinel temp LSA */
if (logtb_is_current_active (thread_p))
return NULL; /* <- active txn on temp page carries no real LSA */
}
fileio_set_page_lsa (&bufptr->iopage_buffer->iopage, lsa_ptr, IO_PAGESIZE); /* <- branch 4: write redo LSA into image */
if (LSA_ISNULL (&bufptr->oldest_unflush_lsa)) /* <- branch 5: FIRST dirty since last flush? */
{
if (LSA_LT (lsa_ptr, &log_Gl.chkpt_redo_lsa))
{ /* ... condensed: re-read chkpt_redo_lsa under chkpt_lsa_lock; if still older,
raise ER_LOG_CHECKPOINT_SKIP_INVALID_PAGE + assert(false) ... */ }
LSA_COPY (&bufptr->oldest_unflush_lsa, lsa_ptr); /* <- watermark established */
}
// ... condensed: branch 6, #if defined(NDEBUG) also calls pgbuf_set_dirty_buffer_ptr (safety net) ...

Two facts the comments compress: pgbuf_is_temp_lsa compares the stored LSA against sentinel PGBUF_TEMP_LSA = { NULL_LOG_PAGEID - 1, NULL_LOG_OFFSET - 1 } (i.e. (-2,-2)), and the watermark lives here, not in set-dirty, because pages can be dirtied before any LSA exists — so it anchors at the first LSA set. The release-build #if defined(NDEBUG) tail forcing pgbuf_set_dirty_buffer_ptr is a safety net for any missed set-dirty call, since an LSA was just written and must be flushed.

Invariant — oldest_unflush_lsa is the LSA of the earliest modification not yet on disk, set once on the first dirty after a clean state and cleared only on flush. The LSA_ISNULL guard makes later set_lsa calls in the same cycle leave it untouched; it never advances forward. Ch 8’s WAL rule reads it for log-flush ordering and resets it to NULL on a successful flush; checkpoint reads it to find the oldest dirty page.
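
The set-once behaviour of the watermark can be shown in isolation. This is a minimal sketch with hypothetical names: `sk_lsa` is a simplified two-field stand-in for LOG_LSA, a pageid of -1 is assumed to mean NULL, and the temp-volume and checkpoint checks are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-in for LOG_LSA; pageid == -1 is assumed to mean NULL. */
typedef struct
{
  int64_t pageid;
  int32_t offset;
} sk_lsa;

bool
sk_lsa_isnull (const sk_lsa *lsa)
{
  return lsa->pageid == -1;
}

/* Stamp the page image, and anchor the watermark only if none is set yet. */
void
sk_set_lsa (sk_lsa *page_lsa, sk_lsa *oldest_unflush, sk_lsa lsa)
{
  assert (!sk_lsa_isnull (&lsa));
  *page_lsa = lsa;
  if (sk_lsa_isnull (oldest_unflush))
    *oldest_unflush = lsa;      /* first LSA of this dirty cycle */
}

/* A successful flush ends the cycle and clears the watermark. */
void
sk_on_flushed (sk_lsa *oldest_unflush)
{
  oldest_unflush->pageid = -1;
  oldest_unflush->offset = -1;
}
```

Three successive modifications leave the watermark at the first LSA while the page LSA follows the latest one; only a flush lets the next modification anchor a new watermark.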

6.7 The readers: pgbuf_bcb_is_dirty and pgbuf_bcb_avoid_victim


Both are single-mask predicates over the same word — no lock, just a volatile read:

// pgbuf_bcb_is_dirty -- src/storage/page_buffer.c
STATIC_INLINE bool
pgbuf_bcb_is_dirty (const PGBUF_BCB * bcb)
{
return (bcb->flags & PGBUF_BCB_DIRTY_FLAG) != 0;
}
// pgbuf_bcb_avoid_victim -- src/storage/page_buffer.c
STATIC_INLINE bool
pgbuf_bcb_avoid_victim (const PGBUF_BCB * bcb)
{
return (bcb->flags & PGBUF_BCB_INVALID_VICTIM_CANDIDATE_MASK) != 0; /* <- ANY of the 4 disqualifiers */
}

The relation is hierarchical: a dirty BCB always makes avoid_victim true (DIRTY ∈ the mask), but avoid_victim can also be true for a clean BCB that is mid-flush or a (invalidated) direct victim. Hence Ch 9’s victim scan calls avoid_victim, not is_dirty — dirtiness is only one of four ways to be ineligible. The sibling per-flag readers (pgbuf_bcb_is_flushing, _is_direct_victim, _is_invalid_direct_victim, _is_async_flush_request, _is_to_vacuum, _should_be_moved_to_bottom_lru) are each the same one-bit (flags & FLAG) != 0 test.

6.8 The dual-purpose counter — count_fix_and_avoid_dealloc


A separate volatile word, never overlapping flags, packs two 16-bit sub-counters into one 32-bit int so each can be mutated by a single atomic:

| Sub-field | Bits | Mask / shift | Mutators | Reader |
| --- | --- | --- | --- | --- |
| avoid-dealloc count | low 16 | PGBUF_BCB_AVOID_DEALLOC_MASK = 0x0000FFFF | pgbuf_bcb_register_avoid_deallocation (+1), _unregister_ (-1, CAS) | pgbuf_bcb_should_avoid_deallocation |
| fix count | high 16 | << PGBUF_BCB_COUNT_FIX_SHIFT_BITS (16) | pgbuf_bcb_register_fix (+ 1<<16, capped) | pgbuf_bcb_is_hot |

They are merged (per the struct comment) because avoid-dealloc must change atomically yet 2-byte atomics are uncommon, so both ride in one CPU-native 4-byte word:

// pgbuf_bcb_register_avoid_deallocation -- src/storage/page_buffer.c
assert ((bcb->count_fix_and_avoid_dealloc & 0x00008000) == 0); /* <- low-half top bit clear: overflow guard */
(void) ATOMIC_INC_32 (&bcb->count_fix_and_avoid_dealloc, 1); /* <- +1 touches only the low half */

register_fix adds 1 << 16 but only while below the cap PGBUF_FIX_COUNT_THRESHOLD << 16 — once hot, it stops counting (hotness is a one-way latch, not a live count). pgbuf_bcb_is_hot compares against that same PGBUF_FIX_COUNT_THRESHOLD << 16 (fix count drives LRU hotness, Ch 7). The unregister path uses a CAS loop and tolerates an avoid-dealloc count already at zero (a pgbuf_ordered_fix corner case where the page was victimized and reloaded), logging via er_log_debug and breaking rather than underflowing. This counter is the second, orthogonal victim gate — fixed or dealloc-protected pages held out of reach — independent of the flag gate dirtiness rides; Ch 9 consumes both.
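
The packed counter's three behaviours (capped fix count, plain increment of the low half, zero-tolerant CAS decrement) can be sketched with C11 atomics. The `sk_` names are hypothetical, the threshold value 64 is an assumption chosen for illustration, and `uint32_t` replaces the source's `volatile int`.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define SK_AVOID_DEALLOC_MASK    0x0000FFFFu
#define SK_COUNT_FIX_SHIFT_BITS  16
#define SK_FIX_COUNT_THRESHOLD   64u    /* assumed value, for illustration only */
#define SK_HOT_LIMIT             (SK_FIX_COUNT_THRESHOLD << SK_COUNT_FIX_SHIFT_BITS)

typedef _Atomic uint32_t sk_counter;

/* Add one fix to the high half, but stop counting once the cap is reached. */
void
sk_register_fix (sk_counter *w)
{
  if (atomic_load (w) < SK_HOT_LIMIT)
    atomic_fetch_add (w, 1u << SK_COUNT_FIX_SHIFT_BITS);
}

bool
sk_is_hot (sk_counter *w)
{
  return atomic_load (w) >= SK_HOT_LIMIT;
}

uint32_t
sk_fix_count (sk_counter *w)
{
  return atomic_load (w) >> SK_COUNT_FIX_SHIFT_BITS;
}

/* +1 on the whole word touches only the low half while its top bit is clear. */
void
sk_register_avoid_dealloc (sk_counter *w)
{
  assert ((atomic_load (w) & 0x00008000u) == 0);
  atomic_fetch_add (w, 1u);
}

/* CAS decrement that tolerates a count already at zero; false means tolerated. */
bool
sk_unregister_avoid_dealloc (sk_counter *w)
{
  uint32_t old_val = atomic_load (w);

  do
    {
      if ((old_val & SK_AVOID_DEALLOC_MASK) == 0)
        return false;           /* would underflow into the fix count: bail */
    }
  while (!atomic_compare_exchange_weak (w, &old_val, old_val - 1u));
  return true;
}
```

The decrement needs a CAS rather than a blind subtract because subtracting from a zero low half would borrow from the fix count in the high half. The check-then-add in `sk_register_fix` is not atomic as a pair, so under real concurrency the cap is approximate.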

  1. bcb->flags is one 32-bit word split at the half: low 16 bits = LRU index (PGBUF_LRU_INDEX_MASK), high 16 bits = flag bits (0x80000000..0x02000000) plus the zone selector (bits 16-19; LRU 1..3<<16, INVALID 1<<18, VOID 2<<18). Flag and zone bits are disjoint, which lets the two mutators share the word without clobbering each other.
  2. pgbuf_bcb_update_flags and pgbuf_bcb_change_zone share one CAS retry loop: the former mutates flag bits (no-op early return) and reconciles dirties_cnt plus zone-3 candidacy; the latter mutates zone+index and reconciles per-list zone counters under the LRU mutex. Every reconciliation branch must run or the victim counter drifts; the default: assert(false) arms encode that an LRU BCB is in exactly one of zones 1/2/3.
  3. The four-bit PGBUF_BCB_INVALID_VICTIM_CANDIDATE_MASK (DIRTY, FLUSHING, VICTIM_DIRECT, INVALIDATE_DV) defines victim ineligibility; the other three flags (MOVE_TO_LRU_BOTTOM, TO_VACUUM, ASYNC_FLUSH_REQ) are victim-neutral. pgbuf_bcb_avoid_victim reads the whole mask; pgbuf_bcb_is_dirty reads one bit of it.
  4. Dirtying takes the hand-coded fast path pgbuf_bcb_set_dirty (set-DIRTY-only CAS) rather than the general mutator, maintaining the candidacy/dirty-counter invariants inline. The BCB write latch — asserted in pgbuf_set_dirty_buffer_ptr, not the CAS — serializes concurrent writers and keeps the DIRTY/LSA pair consistent.
  5. oldest_unflush_lsa is established exactly once per dirty cycle in pgbuf_set_lsa, guarded by LSA_ISNULL, never advanced forward, validated against the checkpoint redo horizon. Temp and auxiliary volumes are excluded (sentinel PGBUF_TEMP_LSA = (-2,-2)).
  6. count_fix_and_avoid_dealloc is a second packed word carrying fix-count (high 16 bits, capped, drives hotness via pgbuf_bcb_is_hot) and avoid-dealloc count (low 16 bits) so both fit one native atomic — the orthogonal fix/dealloc victim gate, independent of the flag gate that dirtiness rides.
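
The word layout in item 1 can be sketched with a handful of masks. This is a minimal illustration, not the source: every SKETCH_ name is hypothetical, the zone encodings and the 16-bit index width follow the text above, and the specific bit used for the dirty flag is an assumption.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names; only the bit ranges described in the text are assumed. */
#define SKETCH_LRU_NBITS       16
#define SKETCH_LRU_INDEX_MASK  0x0000FFFFu  /* low 16 bits: LRU index */
#define SKETCH_ZONE_MASK       0x000F0000u  /* bits 16-19: zone selector */
#define SKETCH_FLAGS_MASK      0xFE000000u  /* 0x80000000 .. 0x02000000 */

#define SKETCH_LRU_1_ZONE      (1u << SKETCH_LRU_NBITS)
#define SKETCH_LRU_2_ZONE      (2u << SKETCH_LRU_NBITS)
#define SKETCH_LRU_3_ZONE      (3u << SKETCH_LRU_NBITS)
#define SKETCH_INVALID_ZONE    (1u << (SKETCH_LRU_NBITS + 2))
#define SKETCH_VOID_ZONE       (2u << (SKETCH_LRU_NBITS + 2))

#define SKETCH_DIRTY_FLAG      0x80000000u  /* assumed bit assignment */

/* Extract the zone selector from the packed word. */
static uint32_t sketch_get_zone (uint32_t flags)
{
  return flags & SKETCH_ZONE_MASK;
}

/* Extract the LRU index from the packed word. */
static uint32_t sketch_get_lru_index (uint32_t flags)
{
  return flags & SKETCH_LRU_INDEX_MASK;
}

/* Replace zone and index while leaving every flag bit untouched. */
static uint32_t sketch_change_zone (uint32_t flags, uint32_t zone, uint32_t lru_index)
{
  return (flags & SKETCH_FLAGS_MASK) | zone | (lru_index & SKETCH_LRU_INDEX_MASK);
}
```

Because the three masks are disjoint, a zone change computed this way cannot clear a flag, which is the property that lets the two mutators share one word.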

Chapter 7: Unfix LRU Movement Aout History and Private to Shared Migration


This chapter answers: on unfix, how does a BCB move through the three LRU zones, when does it boost to the top, when does it migrate from a private to a shared list, and what role does the Aout 2Q ghost list play? Zone model, private/shared split, and 2Q intent are in the high-level companion (cubrid-page-buffer-manager.md); the BCB struct, packed flags/zone word, and dirty bit are in Chapters 1 and 6 — reused here.

7.1 The unfix funnel — pgbuf_unfix to pgbuf_unlatch_bcb_upon_unfix

// pgbuf_unfix -- src/storage/page_buffer.c
CAST_PGPTR_TO_BFPTR (bufptr, pgptr);
holder_status = pgbuf_unlatch_thrd_holder (thread_p, bufptr, &holder_perf_stat);
// ... perf tracking (perfmon_pbx_unfix) elided ...
if (pgbuf_lockfree_unfix_ro (thread_p, bufptr)) /* <- pure read latch: CAS-drop fcnt, no mutex */
return; /* <- never touches LRU */
PGBUF_BCB_LOCK (bufptr);
(void) pgbuf_unlatch_bcb_upon_unfix (thread_p, bufptr, holder_status); /* releases mutex inside */

Invariant — the read-only fast path never reorders LRU. A shared latch being dropped does not change zones; pgbuf_lockfree_unfix_ro returns true after a CAS, so reordering is reserved for the last unfixer or a writer — otherwise every reader would contend on the list mutex.

pgbuf_unlatch_bcb_upon_unfix is the decision engine; its prologue CASes the fix count down:

// pgbuf_unlatch_bcb_upon_unfix -- src/storage/page_buffer.c
do {
blocked_reader_writer = false; is_zero_fcnt = false;
impl_orig = get_impl (&bufptr->atomic_latch); impl_new = impl_orig;
impl_new.impl.fcnt--; /* <- drop one fix */
blocked_reader_writer = impl_orig.impl.waiter_exists;
if (impl_new.impl.fcnt == 0) {
is_zero_fcnt = true; impl_new.impl.latch_mode = PGBUF_NO_LATCH; /* <- last unfixer drops latch */
}
if (impl_new.impl.fcnt < 0) { /* <- "freed too much": defensive reset */
assert (false); er_set (...); impl_new.impl.fcnt = 0;
impl_new.impl.waiter_exists = false;
impl_new.impl.latch_mode = PGBUF_NO_LATCH; is_zero_fcnt = true; break;
}
} while (!bufptr->atomic_latch.compare_exchange_weak (impl_orig.raw, impl_new.raw, ...));

The CAS (Chapter 5) yields is_zero_fcnt (last holder) and blocked_reader_writer (a latch waiter queued). The zone switch runs only when is_zero_fcnt && !blocked_reader_writer: a queued waiter re-latches the BCB immediately, so moving it would be wasted work. The MOVE_TO_LRU_BOTTOM shortcut (7.2) is tested first and needs only is_zero_fcnt.

flowchart TD
  A["pgbuf_unlatch_bcb_upon_unfix\nCAS: fcnt--"] --> B{"is_zero_fcnt?"}
  B -->|no| W["wakeup reader/writer\nrelease mutex"]
  B -->|yes| C{"MOVE_TO_LRU_BOTTOM?"}
  C -->|yes| D["pgbuf_move_bcb_to_bottom_lru\ndealloc shortcut"]
  C -->|no| E{"blocked_reader_writer?"}
  E -->|yes| W
  E -->|no| F["switch on zone"]
  F --> Z0["VOID -> pgbuf_unlatch_void_zone_bcb"]
  F --> Z1["LRU_1 -> keep or prv->shr"]
  F --> Z2["LRU_2 -> keep, boost if old, or prv->shr"]
  F --> Z3["LRU_3 -> boost or prv->shr or direct-victim"]
  Z0 --> W
  Z1 --> W
  Z2 --> W
  Z3 --> W
  D --> W

Figure 7-1 — Branch structure of pgbuf_unlatch_bcb_upon_unfix. Only the zero-fcnt, no-waiter path reaches the zone switch.

7.2 The dealloc shortcut and the zone switch

// pgbuf_unlatch_bcb_upon_unfix -- src/storage/page_buffer.c
if (is_zero_fcnt) {
assert (LSA_ISNULL (&bufptr->oldest_unflush_lsa) || pgbuf_bcb_is_dirty (bufptr));
if (pgbuf_bcb_should_be_moved_to_bottom_lru (bufptr)) /* <- MOVE_TO_LRU_BOTTOM flag */
pgbuf_move_bcb_to_bottom_lru (thread_p, bufptr); /* dealloc shortcut */
else if (blocked_reader_writer == false) {
th_lru_idx = PGBUF_THREAD_HAS_PRIVATE_LRU (thread_p)
? PGBUF_LRU_INDEX_FROM_PRIVATE (PGBUF_PRIVATE_LRU_FROM_THREAD (thread_p)) : -1; /* own list or none */
switch (pgbuf_bcb_get_zone (bufptr)) { /* ... see 7.3 ... */ }
}
}

pgbuf_bcb_should_be_moved_to_bottom_lru tests the PGBUF_BCB_MOVE_TO_LRU_BOTTOM_FLAG bit, set on the dealloc path: a deallocated page is worthless hot, so it is shoved to the bottom for first reclamation. th_lru_idx (own private list or -1) is the reference point for every private/shared decision below.

Invariant — oldest_unflush_lsa implies the dirty bit. Chapter 6’s WAL invariant at unfix: a page with a pending flush LSA must stay dirty, or the flush daemons (Chapter 8) skip it and break WAL.

Two guards recur in every LRU case, quoted once and referenced for all three zones:

// pgbuf_unlatch_bcb_upon_unfix (per-case prologue) -- src/storage/page_buffer.c
if (PGBUF_SHOULD_IGNORE_UNFIX (thread_p, bufptr)) { ...KEEP_VAC stat...; break; } /* <- don't warm cache */
if (pgbuf_should_move_private_to_shared (thread_p, bufptr, th_lru_idx)) { /* <- see 7.5 */
pgbuf_lru_move_from_private_to_shared (thread_p, bufptr); ...PRV_TO_SHR_MID stat...; break;
}

PGBUF_SHOULD_IGNORE_UNFIX is not vacuum-only: its real definition is VACUUM_IS_THREAD_VACUUM_WORKER (th) || pgbuf_is_temporary_volume (buf->vpid.volid) (SERVER_MODE; false otherwise). It fires for vacuum workers and for pages on temporary volumes; neither should warm the cache or promote a BCB to hot (the source comment also names the checkpoint thread as a logical member of this set). pgbuf_should_move_private_to_shared (7.5) escalates contended pages. Only the default action after these guards differs by zone. In the LRU_3 case the PGBUF_SHOULD_IGNORE_UNFIX branch does more than break: it first tries to hand the BCB out as a direct victim (see below).

VOID (Chapter 4): delegated to pgbuf_unlatch_void_zone_bcb (7.4).

LRU_1 (hottest): after the guards, do nothing but register a hit — zone 1 is never reordered:

/* after the per-case prologue, plus a PRV_KEEP/SHR_KEEP stat: */
pgbuf_bcb_register_hit_for_lru (bufptr); break; /* <- never boost zone 1 */

LRU_2 (boost-eligible): boost only if aged:

if (PGBUF_IS_BCB_OLD_ENOUGH (bufptr, pgbuf_lru_list_from_bcb (bufptr)))
pgbuf_lru_boost_bcb (thread_p, bufptr); /* <- aged enough -> promote to top */
else { ...PRV_KEEP / SHR_KEEP stat... } /* <- too new: leave in place */
pgbuf_bcb_register_hit_for_lru (bufptr); break;

LRU_3 (victim zone): a real unfix that does not migrate to shared always boosts, while the PGBUF_SHOULD_IGNORE_UNFIX branch may instead hand the BCB out as a direct victim:

case PGBUF_LRU_3_ZONE:
if (PGBUF_SHOULD_IGNORE_UNFIX (...)) {
if (!pgbuf_bcb_avoid_victim (bufptr) && pgbuf_assign_direct_victim (thread_p, bufptr))
{ ...DIRECT_VACUUM_LRU stat... } /* <- give it straight to a waiter */
else { ...THREE_KEEP_VAC stat... }
break;
}
if (pgbuf_should_move_private_to_shared (...)) { ...move; THREE_PRV_TO_SHR_MID...; break; }
pgbuf_lru_boost_bcb (thread_p, bufptr); /* <- rule 3: always boost from zone 3 */
pgbuf_bcb_register_hit_for_lru (bufptr); break;

After the switch the function wakes any latch waiter (pgbuf_wakeup_reader_writer); on a requested async flush (pgbuf_bcb_is_async_flush_request) it uses pgbuf_bcb_safe_flush_force_unlock (which unlocks), else unlocks directly. assert (... != PGBUF_LATCH_FLUSH) guards that unfix never sees a flush latch — flushing is the daemons’ job (Chapter 8).
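
The branch structure of 7.1 through 7.3 can be condensed into one pure decision function. This is a hedged sketch: the enum, struct, and function names are hypothetical, the boolean inputs stand in for the real predicates (PGBUF_SHOULD_IGNORE_UNFIX, pgbuf_should_move_private_to_shared, PGBUF_IS_BCB_OLD_ENOUGH), and statistics, locking, and wake-ups are left out.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum { SK_ZONE_VOID, SK_ZONE_LRU_1, SK_ZONE_LRU_2, SK_ZONE_LRU_3 } sketch_zone;

typedef enum
{
  SK_ACT_NONE,              /* no LRU work: not last holder, or a waiter is queued */
  SK_ACT_MOVE_TO_BOTTOM,    /* dealloc shortcut */
  SK_ACT_VOID_LANDING,      /* delegate to the VOID-zone placement */
  SK_ACT_KEEP,              /* leave the BCB where it is */
  SK_ACT_MOVE_TO_SHARED,    /* private -> shared migration */
  SK_ACT_BOOST,             /* remove + add to top */
  SK_ACT_TRY_DIRECT_VICTIM  /* zone 3 under the ignore-unfix branch */
} sketch_action;

typedef struct
{
  bool is_zero_fcnt;        /* this unfix dropped the last fix */
  bool waiter_exists;       /* a latch waiter is queued */
  bool move_to_bottom;      /* MOVE_TO_LRU_BOTTOM flag is set */
  bool ignore_unfix;        /* vacuum worker or temporary volume */
  bool move_prv_to_shr;     /* result of the migration test */
  bool old_enough;          /* result of the age test */
  sketch_zone zone;
} sketch_unfix_state;

static sketch_action sketch_unfix_action (const sketch_unfix_state *s)
{
  if (!s->is_zero_fcnt)
    return SK_ACT_NONE;
  if (s->move_to_bottom)            /* checked before the waiter test */
    return SK_ACT_MOVE_TO_BOTTOM;
  if (s->waiter_exists)
    return SK_ACT_NONE;
  if (s->zone == SK_ZONE_VOID)
    return SK_ACT_VOID_LANDING;

  /* The two recurring guards, in the order the cases apply them. */
  if (s->ignore_unfix)
    return (s->zone == SK_ZONE_LRU_3) ? SK_ACT_TRY_DIRECT_VICTIM : SK_ACT_KEEP;
  if (s->move_prv_to_shr)
    return SK_ACT_MOVE_TO_SHARED;

  /* Zone-specific default action. */
  switch (s->zone)
    {
    case SK_ZONE_LRU_1: return SK_ACT_KEEP;
    case SK_ZONE_LRU_2: return s->old_enough ? SK_ACT_BOOST : SK_ACT_KEEP;
    default:            return SK_ACT_BOOST;   /* SK_ZONE_LRU_3 */
    }
}
```

Keeping the guards ahead of the zone default mirrors the per-case prologue: only the last step differs between zones.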

7.4 VOID-zone landing — pgbuf_unlatch_void_zone_bcb and the Aout hit


A VOID BCB was just claimed for a non-resident page. pgbuf_unlatch_void_zone_bcb first removes the VPID from Aout (a ghost found there records the re-fix as a hit), then branches on private-list ownership and Aout membership:

// pgbuf_unlatch_void_zone_bcb -- src/storage/page_buffer.c
if (pgbuf_Pool.buf_AOUT_list.max_count > 0) { aout_enabled = true;
aout_list_id = pgbuf_remove_vpid_from_aout_list (thread_p, &bcb->vpid); } /* <- 2Q lookup+remove */
if (PGBUF_VACUUM_SHOULD_IGNORE_UNFIX (thread_p)) { /* vacuum worker only here */
if (!pgbuf_bcb_avoid_victim (bcb) && pgbuf_assign_direct_victim (thread_p, bcb)) {
// ... if Aout on: pgbuf_add_vpid_to_aout_list (..., aout_list_id) ... <- re-ghost
return; }
aout_list_id = PGBUF_AOUT_NOT_FOUND; /* <- vacuum never gets Aout-boost */
}
if (thread_private_lru_index != -1) {
if (PGBUF_VACUUM_SHOULD_IGNORE_UNFIX (thread_p)) { /* <- vacuum: top, no hit */
pgbuf_lru_add_new_bcb_to_top (thread_p, bcb, thread_private_lru_index); return; }
if (!aout_enabled || thread_private_lru_index == aout_list_id) { /* <- Aout HIT -> top of LRU 1 */
pgbuf_lru_add_new_bcb_to_top (thread_p, bcb, thread_private_lru_index);
pgbuf_bcb_register_hit_for_lru (bcb); return; }
if (aout_list_id == PGBUF_AOUT_NOT_FOUND) { /* <- cold miss -> middle */
pgbuf_lru_add_new_bcb_to_middle (thread_p, bcb, thread_private_lru_index);
pgbuf_bcb_register_hit_for_lru (bcb); return; }
/* fall through: ghosted in a *different* private list -> shared */
}
pgbuf_lru_add_new_bcb_to_middle (thread_p, bcb, pgbuf_get_shared_lru_index_for_add ()); /* <- shared middle */
if (!PGBUF_VACUUM_SHOULD_IGNORE_UNFIX (thread_p)) pgbuf_bcb_register_hit_for_lru (bcb);

Note this branch gates on PGBUF_VACUUM_SHOULD_IGNORE_UNFIX (vacuum worker only) — the temp-volume arm of PGBUF_SHOULD_IGNORE_UNFIX from 7.3 is not applied in the VOID path. Placement (private-LRU thread, Aout enabled, non-vacuum):

| Aout result | Placement | Meaning |
| --- | --- | --- |
| own index == aout_list_id | top of own private list (LRU 1) | evicted from my list: 2Q second touch, promote hot |
| !aout_enabled | top of own private list | no history; first unfix treated as warm |
| PGBUF_AOUT_NOT_FOUND | middle of own private list | never seen: cold, lands at the LRU-1/2 boundary |
| different list | middle of shared list | shared across workers; quotas (Ch 10) apply |

Invariant — Aout removal precedes placement. aout_list_id is captured once before any insertion and drives the whole branch. If remove came after placement, two threads re-fixing the same page could both “hit” and double-promote; Aout_mutex (7.6) serializes lookup-and-remove so exactly one thread consumes the ghost.
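
The placement rules for a non-vacuum thread reduce to a small pure function. The sketch below is illustrative only: the enum and function names are hypothetical, -1 stands for "no private list" as in the text, and the not-found sentinel uses the -2 value quoted in 7.6.

```c
#include <assert.h>
#include <stdbool.h>

#define SKETCH_AOUT_NOT_FOUND (-2)   /* sentinel value quoted in 7.6 */

typedef enum
{
  SK_PLACE_PRIVATE_TOP,     /* top of the thread's own private list */
  SK_PLACE_PRIVATE_MIDDLE,  /* middle of the thread's own private list */
  SK_PLACE_SHARED_MIDDLE    /* middle of a shared list */
} sketch_placement;

/* private_idx: the unfixing thread's private LRU index, or -1 if it has none.
 * aout_list_id: result of the Aout lookup-and-remove, captured before placement. */
static sketch_placement
sketch_void_placement (int private_idx, bool aout_enabled, int aout_list_id)
{
  if (private_idx != -1)
    {
      if (!aout_enabled || private_idx == aout_list_id)
        return SK_PLACE_PRIVATE_TOP;       /* no history, or ghost hit in own list */
      if (aout_list_id == SKETCH_AOUT_NOT_FOUND)
        return SK_PLACE_PRIVATE_MIDDLE;    /* cold miss */
      /* ghosted in a different private list: fall through to shared */
    }
  return SK_PLACE_SHARED_MIDDLE;
}
```

Since aout_list_id is an input rather than something looked up inside, the function makes the invariant visible: the ghost is consumed once, up front, and the result alone drives placement.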

7.5 pgbuf_should_move_private_to_shared — the migration test

// pgbuf_should_move_private_to_shared -- src/storage/page_buffer.c
int bcb_lru_idx = pgbuf_bcb_get_lru_index (bcb);
if (PGBUF_IS_SHARED_LRU_INDEX (bcb_lru_idx)) return false; /* <- already shared */
if (thread_private_lru_index != bcb_lru_idx) return true; /* cond 1: foreign-thread unfix */
if (!pgbuf_bcb_is_hot (bcb)) return false; /* cond 2a: must be hot */
if (!PGBUF_IS_BCB_OLD_ENOUGH (bcb, PGBUF_GET_LRU_LIST (bcb_lru_idx))) return false; /* cond 2b: and old */
return true; /* hot + aged -> escalate to shared */

Two triggers: (1) foreign unfix — the BCB lives in private list X but the unfixer’s own list is Y (or -1); a page touched by >1 worker goes shared. (2) hot and old — same list, but both hot (pgbuf_bcb_is_hot) and old enough.

// pgbuf_bcb_is_hot / pgbuf_bcb_register_fix -- src/storage/page_buffer.c
// hot: count_fix_and_avoid_dealloc >= (PGBUF_FIX_COUNT_THRESHOLD << PGBUF_BCB_COUNT_FIX_SHIFT_BITS)
// == 64 << 16 (fix count lives in the high 16 bits)
// register_fix saturates: stops incrementing once the threshold bit is set.

count_fix_and_avoid_dealloc packs the fix count (high 16 bits, bumped by pgbuf_bcb_register_fix and saturating at the 64-fix threshold) and the avoid-dealloc count (low 16 bits, PGBUF_BCB_AVOID_DEALLOC_MASK); see Chapter 6.
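
A saturating packed counter of this shape could look like the sketch below. It is an illustration under assumptions: the SKETCH_ names are hypothetical, the 64-fix threshold and the 16-bit split come from the text, and the CAS loop is this sketch's choice for keeping the check and the increment atomic, not a claim about how the source does it.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_FIX_SHIFT           16
#define SKETCH_FIX_THRESHOLD       64
#define SKETCH_AVOID_DEALLOC_MASK  0x0000FFFF
#define SKETCH_HOT_LIMIT           ((int32_t) SKETCH_FIX_THRESHOLD << SKETCH_FIX_SHIFT)

/* Add one fix to the high half, but stop once the threshold is reached. */
static void sketch_register_fix (_Atomic int32_t *word)
{
  int32_t cur = atomic_load (word);

  /* On CAS failure cur is refreshed, so the cap is re-checked every round. */
  while (cur < SKETCH_HOT_LIMIT
         && !atomic_compare_exchange_weak (word, &cur, cur + (1 << SKETCH_FIX_SHIFT)))
    ;
}

/* Hot means the fix count in the high half has reached the threshold. */
static bool sketch_is_hot (_Atomic int32_t *word)
{
  return atomic_load (word) >= SKETCH_HOT_LIMIT;
}

/* The avoid-dealloc count lives in the low half and is untouched by fixes. */
static int32_t sketch_avoid_dealloc_count (_Atomic int32_t *word)
{
  return atomic_load (word) & SKETCH_AVOID_DEALLOC_MASK;
}
```

Comparing the whole word against the shifted threshold works because the low half can never carry the value past the next multiple of 1 << 16.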

// PGBUF_IS_BCB_OLD_ENOUGH -- src/storage/page_buffer.c
#define PGBUF_IS_BCB_OLD_ENOUGH(bcb, lru_list) \
(PGBUF_AGE_DIFF ((bcb)->tick_lru_list, (lru_list)->tick_list) >= ((lru_list)->count_lru2 / 2))

A BCB stamps tick_lru_list from tick_list on insert; tick_list bumps on every add-to-top/middle. “Old enough” = passed by at least half of zone 2’s worth (count_lru2 / 2) of newer inserts — so a page fixed twice in quick succession is not boosted on the second unfix. PGBUF_AGE_DIFF handles the wraparound of the 31-bit tick.
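
The age test can be tried out in isolation. The sketch below is a guess at the shape of the wraparound handling, not a quotation: the function names are hypothetical, and it assumes the list tick counts up and wraps to 0 at INT_MAX, so a BCB stamp larger than the list tick means the clock has wrapped since the stamp.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* Distance from the BCB's stamp to the list's current tick, assuming the
 * tick wraps to 0 at INT_MAX (an assumption of this sketch). */
static int sketch_age_diff (int bcb_tick, int list_tick)
{
  return (list_tick >= bcb_tick)
    ? (list_tick - bcb_tick)
    : (INT_MAX - (bcb_tick - list_tick));
}

/* Old enough: at least half of zone 2's population was inserted since. */
static bool sketch_is_old_enough (int bcb_tick, int list_tick, int count_lru2)
{
  return sketch_age_diff (bcb_tick, list_tick) >= count_lru2 / 2;
}
```

With count_lru2 = 10, a BCB stamped at tick 10 becomes boost-eligible once the list tick reaches 15, and not before.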

7.6 The Aout 2Q ghost list — pgbuf_aout_buf and pgbuf_aout_list


Aout holds VPIDs only (not BCBs) for recently victimized pages — a FIFO fronted by per-shard hash tables for O(1) lookup.

// struct pgbuf_aout_buf -- src/storage/page_buffer.c
struct pgbuf_aout_buf {
VPID vpid; /* page VPID */
int lru_idx; /* which LRU list it was evicted from */
PGBUF_AOUT_BUF *next; /* next element in list */
PGBUF_AOUT_BUF *prev; /* prev element in list */
};

| Field | Role |
| --- | --- |
| vpid | ghosted identity / hash key; VPID_SET_NULL marks a free node |
| lru_idx | LRU list it was evicted from; a re-fix re-enters the same list (7.4) |
| next / prev | FIFO links, doubling as the free-list link when recycled; prev gives O(1) middle unlink on a hit |

// struct pgbuf_aout_list -- src/storage/page_buffer.c
struct pgbuf_aout_list {
pthread_mutex_t Aout_mutex; /* integrity of the whole list (SERVER_MODE) */
PGBUF_AOUT_BUF *Aout_top; /* top of the queue (most recent) */
PGBUF_AOUT_BUF *Aout_bottom; /* bottom of the queue (oldest) */
PGBUF_AOUT_BUF *Aout_free; /* free list of recycled nodes */
PGBUF_AOUT_BUF *bufarray; /* preallocated node array */
int num_hashes; /* number of hash shards */
MHT_TABLE **aout_buf_ht; /* per-shard VPID -> node hash */
int max_count; /* capacity; <= 0 disables Aout */
};

| Field | Role |
| --- | --- |
| Aout_mutex | global list+hash lock; serializes the 7.4 lookup-remove |
| Aout_top / Aout_bottom | newest / oldest ghost: insertion vs eviction points |
| Aout_free | recycled nodes; avoids malloc on the victim path |
| bufarray | preallocated backing storage; no runtime alloc |
| num_hashes / aout_buf_ht | shard count and per-shard MHT_TABLE* for O(1) VPID lookup over the FIFO (AOUT_HASH_IDX) |
| max_count | capacity / enable switch; <= 0 disables Aout entirely |

The LRU list struct these BCBs move within (pgbuf_lru_list) carries the boundary pointers and tick clocks 7.5/7.7 lean on:

| Field | Role | Used by |
| --- | --- | --- |
| top / bottom | list endpoints | add-to-top, add-to-bottom |
| bottom_1 | last BCB of zone 1 | add-to-middle inserts after it |
| bottom_2 | last BCB of zone 2 | repaired on removal (zone-2 care, 7.7) |
| victim_hint | where victim search starts | advanced on every remove |
| count_lru1/2/3 | per-zone populations | count_lru2/2 = old-enough threshold |
| threshold_lru1/2 | zone-size targets | drive pgbuf_lru_adjust_zone* |
| tick_list | bumped on add-to-top/middle | boost-age clock (PGBUF_IS_BCB_OLD_ENOUGH) |
| tick_lru3 | bumped on fall-to-zone-3 | victim-hint ordering |
| index | list id | private vs shared classification |

flowchart LR
  subgraph AOUT["pgbuf_aout_list (FIFO + hash)"]
    T["Aout_top\n(newest)"] --> N1["node"] --> N2["node"] --> B["Aout_bottom\n(oldest)"]
    HT["aout_buf_ht[shard]\nVPID -> node"] -.-> N1
    FR["Aout_free\n(recycled)"]
  end
  VICT["victimization\npgbuf_add_vpid_to_aout_list"] --> T
  B -->|"full -> drop oldest"| FR
  REFIX["re-fix\npgbuf_remove_vpid_from_aout_list"] -.->|hit| FR

Figure 7-2 — Aout as a fixed-capacity FIFO with a hash index.

pgbuf_add_vpid_to_aout_list (called from the direct-victim branches of 7.4 and from pgbuf_lru_fall_bcb_to_zone_3) runs under Aout_mutex: if Aout_free is empty it evicts Aout_bottom (mht_rem), otherwise it pops a free node; it then stamps lru_idx and vpid, inserts the node with mht_put, and links it at Aout_top. pgbuf_remove_vpid_from_aout_list looks the VPID up with mht_get. If it is absent, the function returns PGBUF_AOUT_NOT_FOUND (-2): the page has no ghost, so this is a genuine cold miss. If it is present, the function captures aout_list_id = aout_buf->lru_idx, unlinks the node, removes it with mht_rem, nulls the VPID, resets lru_idx, pushes the node onto Aout_free, and returns the captured aout_list_id, the value 7.4 compares against thread_private_lru_index.
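
The add and remove paths can be exercised with a much smaller stand-in. The sketch below is a simplification under stated assumptions: all names are hypothetical, an int page id replaces the VPID, a linear scan replaces the per-shard hash tables, the capacity is fixed at three nodes, and there is no mutex because the sketch is single-threaded.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_AOUT_NOT_FOUND (-2)
#define SKETCH_AOUT_CAP       3

typedef struct sketch_ghost
{
  int page_id;                        /* stands in for the VPID; -1 marks a free node */
  int lru_idx;                        /* list the page was evicted from */
  struct sketch_ghost *next, *prev;
} sketch_ghost;

typedef struct
{
  sketch_ghost nodes[SKETCH_AOUT_CAP];  /* preallocated storage */
  sketch_ghost *top, *bottom, *free_list;
} sketch_aout;

static void sketch_aout_init (sketch_aout *a)
{
  a->top = a->bottom = a->free_list = NULL;
  for (int i = 0; i < SKETCH_AOUT_CAP; i++)
    {
      a->nodes[i].page_id = -1;
      a->nodes[i].lru_idx = -1;
      a->nodes[i].prev = NULL;
      a->nodes[i].next = a->free_list;  /* thread every node onto the free list */
      a->free_list = &a->nodes[i];
    }
}

static void sketch_aout_unlink (sketch_aout *a, sketch_ghost *g)
{
  if (g->prev != NULL) g->prev->next = g->next; else a->top = g->next;
  if (g->next != NULL) g->next->prev = g->prev; else a->bottom = g->prev;
  g->next = g->prev = NULL;
}

/* Ghost a page: reuse a free node, or recycle the oldest ghost when full. */
static void sketch_aout_add (sketch_aout *a, int page_id, int lru_idx)
{
  sketch_ghost *g = a->free_list;
  if (g != NULL)
    a->free_list = g->next;
  else
    {
      g = a->bottom;
      sketch_aout_unlink (a, g);
    }
  g->page_id = page_id;
  g->lru_idx = lru_idx;
  g->prev = NULL;
  g->next = a->top;
  if (a->top != NULL) a->top->prev = g; else a->bottom = g;
  a->top = g;
}

/* Look up and consume a ghost; returns its lru_idx or the not-found sentinel. */
static int sketch_aout_remove (sketch_aout *a, int page_id)
{
  for (sketch_ghost *g = a->top; g != NULL; g = g->next)
    if (g->page_id == page_id)
      {
        int idx = g->lru_idx;           /* capture before the node is reset */
        sketch_aout_unlink (a, g);
        g->page_id = -1;
        g->lru_idx = -1;
        g->next = a->free_list;
        a->free_list = g;
        return idx;
      }
  return SKETCH_AOUT_NOT_FOUND;
}
```

A second remove of the same page returns the sentinel, which is the single-threaded analogue of exactly one thread consuming the ghost.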

7.7 LRU insertion primitives and the boost


Zone-2 and zone-3 boosts route through pgbuf_lru_boost_bcb:

// pgbuf_lru_boost_bcb -- src/storage/page_buffer.c
assert (zone != PGBUF_LRU_1_ZONE); /* <- never called on zone 1 */
pthread_mutex_lock (&lru_list->mutex);
pgbuf_remove_from_lru_list (thread_p, bcb, lru_list);/* unlink */
pgbuf_lru_add_bcb_to_top (thread_p, bcb, lru_list); /* relink at top of zone 1 */
if (zone == PGBUF_LRU_2_ZONE) pgbuf_lru_adjust_zone1 (thread_p, lru_list, true); /* only zone 1 grew */
else pgbuf_lru_adjust_zones (thread_p, lru_list, true); /* zone 3: rebalance all */
pthread_mutex_unlock (&lru_list->mutex);

pgbuf_lru_add_bcb_to_top patches the links, sets top (and bottom/bottom_1 if the list was empty), increments tick_list (the clock that ages every other BCB for PGBUF_IS_BCB_OLD_ENOUGH), then marks the BCB PGBUF_LRU_1_ZONE through the zone-change mutator. pgbuf_lru_add_bcb_to_middle inserts after bottom_1 (the zone-1 bottom), also bumps tick_list, and marks zone 2; pgbuf_lru_add_bcb_to_bottom appends at bottom, stamps tick_lru3, and marks zone 3.

pgbuf_remove_from_lru_list is the inverse and repairs every boundary pointer before moving the BCB to VOID:

// pgbuf_remove_from_lru_list -- src/storage/page_buffer.c
if (lru_list->top == bufptr) lru_list->top = bufptr->next_BCB;
if (lru_list->bottom == bufptr) lru_list->bottom = bufptr->prev_BCB;
if (lru_list->bottom_1 == bufptr) lru_list->bottom_1 = bufptr->prev_BCB;
if (lru_list->bottom_2 == bufptr) { /* <- zone-2 boundary needs care */
if (bufptr->prev_BCB != NULL && pgbuf_bcb_get_zone (bufptr->prev_BCB) == PGBUF_LRU_2_ZONE)
lru_list->bottom_2 = bufptr->prev_BCB;
else { assert (lru_list->count_lru2 == 1); lru_list->bottom_2 = NULL; }
}
/* splice neighbors, null this bcb's links */
pgbuf_lru_advance_victim_hint (thread_p, lru_list, bufptr, bcb_prev, false);
pgbuf_bcb_change_zone (thread_p, bufptr, 0, PGBUF_VOID_ZONE); /* <- now belongs to no zone */

Invariant: a removed BCB's zone matches its links. It lands in VOID, and the victim hint moves off it before the zone changes. The function ends with change_zone(..., PGBUF_VOID_ZONE) so a BCB unlinked from a list never keeps an LRU zone tag. As the snippet shows, pgbuf_lru_advance_victim_hint runs after the splice, using the saved bcb_prev, and before the zone change; if the hint were left on the removed BCB, a victimizer could start its search from a BCB that is no longer linked. Boost = remove + add-to-top leaves the BCB momentarily in VOID, but the whole sequence runs under lru_list->mutex, so no thread observes the gap.

pgbuf_lru_fall_bcb_to_zone_3 is the demotion counterpart, run by the zone-adjust functions when zones 1/2 exceed their thresholds:

// pgbuf_lru_fall_bcb_to_zone_3 -- src/storage/page_buffer.c
assert (pgbuf_bcb_get_zone (bcb) == PGBUF_LRU_1_ZONE || pgbuf_bcb_get_zone (bcb) == PGBUF_LRU_2_ZONE);
#if defined (SERVER_MODE)
if (pgbuf_is_bcb_victimizable (bcb, false) && pgbuf_is_any_thread_waiting_for_direct_victim ()) {
if (pgbuf_bcb_is_to_vacuum (bcb)) { /* ...stat; fall through... */ }
else if (PGBUF_BCB_TRYLOCK (bcb) == 0) { /* <- conditional: avoid list/bcb lock-order deadlock */
VPID vpid_copy = bcb->vpid;
if (pgbuf_is_bcb_victimizable (bcb, true) && pgbuf_assign_direct_victim (thread_p, bcb)) {
pgbuf_remove_from_lru_list (thread_p, bcb, lru_list); PGBUF_BCB_UNLOCK (bcb);
pgbuf_add_vpid_to_aout_list (thread_p, &vpid_copy, lru_list->index); /* <- ghost on the way out */
return; }
PGBUF_BCB_UNLOCK (bcb); /* not assigned; fall through */
} }
#endif
bcb->tick_lru3 = lru_list->tick_lru3; /* stamp zone-3 position */
if (++lru_list->tick_lru3 >= DB_INT32_MAX) lru_list->tick_lru3 = 0;
pgbuf_bcb_change_zone (thread_p, bcb, lru_list->index, PGBUF_LRU_3_ZONE);

PGBUF_BCB_TRYLOCK is conditional because lock order is normally bcb-then-list but we already hold the list mutex; rather than deadlock it gives up and lets the BCB be victimized later (Chapter 9). The direct-victim branch ghosts the VPID into Aout on the way out, closing the 2Q loop with 7.4’s lookup. tick_lru3 (small at the bottom) feeds the victim hint, distinct from tick_lru_list which feeds the boost age.

  1. The zone switch is gated on is_zero_fcnt && !blocked_reader_writer. Shared-read unfixes take pgbuf_lockfree_unfix_ro and never touch the list; only the last unfixer with no waiter reorders the BCB by zone.
  2. MOVE_TO_LRU_BOTTOM is a dealloc shortcut that bypasses the zone switch and shoves a deallocated page to the bottom for fast reclamation.
  3. Zone sets the default action: zone 1 never boosts; zone 2 boosts only when PGBUF_IS_BCB_OLD_ENOUGH (past half of count_lru2); zone 3 always boosts on a real unfix, or hands out a direct victim under the ignore-unfix branch.
  4. PGBUF_SHOULD_IGNORE_UNFIX is vacuum-OR-temp-volume, not vacuum-only: pages on temporary volumes are also kept from warming the cache, and the VOID path narrows this to vacuum workers (PGBUF_VACUUM_SHOULD_IGNORE_UNFIX).
  5. Boost = pgbuf_remove_from_lru_list + pgbuf_lru_add_bcb_to_top under the list mutex; add-to-top bumps tick_list, the clock that ages every other BCB for the old-enough test. A removed BCB always lands in VOID, keeping the zone field consistent with the list it is linked in.
  6. pgbuf_should_move_private_to_shared fires on two triggers: a foreign-thread unfix (immediate), or a same-list page that is both hot (>= 64 packed, saturating fixes) and old enough — escalating contended pages to the shared pool.
  7. Aout is a fixed-capacity VPID-only ghost FIFO with a hash index, and its lookup-and-remove is serialized under Aout_mutex and precedes placement, so one thread consumes a ghost and no page is double-promoted: a re-fix found for the same private list lands at the top of LRU 1, a different list goes shared, not-found lands cold in the middle; victimization re-ghosts the outgoing VPID to keep the loop closed.

Chapter 8: Flushing Under the WAL Rule and the Flush Daemons


A dirty page may not reach disk until the log record for its most recent change is durable. This chapter traces how the page buffer enforces that log-before-page ordering inside pgbuf_bcb_flush_with_wal, how the three flush daemons pace and batch their writes, and where the double-write buffer (DWB) intercepts the write. For the why of WAL and the high-level picture, see the companion cubrid-page-buffer-manager.md (“Write-Ahead Logging”, “Flushing and the daemons”). The DWB’s block geometry and crash-recovery rationale live in cubrid-double-write-buffer.md; the durability semantics of logpb_flush_log_for_wal and the flushed-LSA bookkeeping live in cubrid-log-manager-detail.md — both are referenced, not re-derived, here. The flushing/dirty flags and the atomic-latch FLUSH mode come from Chapters 6 and 5; this chapter shows how the flush path consumes them.

Four public entry points push one page toward disk; all funnel into pgbuf_bcb_safe_flush_internal, which decides whether the flush happens now, is delegated, or is awaited.

flowchart TB
  F["pgbuf_flush\n(optionally unfix after)"] --> FW["pgbuf_flush_with_wal"]
  FIR["pgbuf_flush_if_requested\n(permanently write-latched page)"] -->|ASYNC_FLUSH_REQ set| SFFU
  FW --> SFFU["pgbuf_bcb_safe_flush_force_unlock"]
  SFFU --> SFI["pgbuf_bcb_safe_flush_internal"]
  SFI -->|immediate_flush| BFW["pgbuf_bcb_flush_with_wal"]
  SFI -->|page write-latched by other| REQ["set ASYNC_FLUSH_REQ\nlet holder flush on unfix"]
  SFI -->|already flushing / latched, synchronous| BLK["pgbuf_block_bcb\nPGBUF_LATCH_FLUSH wait"]

Figure 8-1 — Single-page flush entry points converging on pgbuf_bcb_safe_flush_internal.

pgbuf_flush_with_wal is the canonical caller: it asserts that the calling thread holds a latch of at least READ mode, locks the BCB mutex, and delegates synchronously:

// pgbuf_flush_with_wal -- src/storage/page_buffer.c
CAST_PGPTR_TO_BFPTR (bufptr, pgptr);
/* In CUBRID, the caller is holding WRITE page latch */
assert (get_latch (&bufptr->atomic_latch) >= PGBUF_LATCH_READ
&& pgbuf_find_thrd_holder (thread_p, bufptr) != NULL);
PGBUF_BCB_LOCK (bufptr);
if (pgbuf_bcb_safe_flush_force_unlock (thread_p, bufptr, true) != NO_ERROR) /* <- synchronous=true */
{ ASSERT_ERROR (); return NULL; }

pgbuf_flush wraps this and unfixes afterward when free_page == FREE; its header warns it does not guarantee the page reached disk, so callers needing durability use pgbuf_flush_with_wal and check the return. pgbuf_flush_if_requested serves a thread holding a page permanently write-latched (it can never unfix to trigger a normal flush): it asserts a WRITE latch held by the caller, checks pgbuf_bcb_is_async_flush_request (bcb), and only when set locks and flushes with synchronous=false — the consumer side of the PGBUF_BCB_ASYNC_FLUSH_REQ flag the daemon/checkpoint sets on a write-latched victim.

8.2 The decision core: pgbuf_bcb_safe_flush_internal


The caller holds the BCB mutex. The function short-circuits clean pages, then runs a CAS loop choosing among the outcomes below. A flush cannot happen immediately for exactly two reasons, both spelled out in the source: the page is write-latched by another thread (its contents could change mid-write), or another thread is already flushing it (two writers could reorder an old version over a new one).

// pgbuf_bcb_safe_flush_internal -- src/storage/page_buffer.c
if (!pgbuf_bcb_is_dirty (bufptr))
return NO_ERROR; /* <- clean: nothing to do, stays locked */
do {
immediate_flush = false; block = false; is_flushing = false;
impl = get_impl (&bufptr->atomic_latch); impl_new = impl;
is_flushing = pgbuf_bcb_is_flushing (bufptr);
if (!is_flushing
&& (impl.impl.latch_mode == PGBUF_NO_LATCH || impl.impl.latch_mode == PGBUF_LATCH_READ
|| (impl.impl.latch_mode == PGBUF_LATCH_WRITE
&& pgbuf_find_thrd_holder (thread_p, bufptr) != NULL))) /* <- I am the writer */
immediate_flush = true;
else {
assert (is_flushing || impl.impl.latch_mode == PGBUF_LATCH_WRITE); /* <- only these reach else */
if (synchronous)
{ block = true; impl_new.impl.waiter_exists = true; } /* <- publish waiter into latch word */
}
} while (!bufptr->atomic_latch.compare_exchange_strong (impl.raw, impl_new.raw, ...));

| Outcome | Condition | Action |
| --- | --- | --- |
| immediate_flush | not flushing; unlatched, read-latched, or write-latched by me | pgbuf_bcb_flush_with_wal (..., false, locked): flush now |
| async request | not flushing, write-latched by another | pgbuf_bcb_update_flags (..., PGBUF_BCB_ASYNC_FLUSH_REQ, 0): holder flushes on unfix |
| block | flushing or foreign write-latch, synchronous==true | *locked=false; pgbuf_block_bcb (..., PGBUF_LATCH_FLUSH, ...): sleep |
| no-wait return | foreign latch/flush, synchronous==false | return NO_ERROR without flushing |

Note the async-request flag is set whenever the immediate path was not taken and the BCB is not already flushing (i.e. a foreign write-latch), regardless of synchronous — the synchronous caller then also blocks.
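
The outcome table and the note above can be restated as a pure function that returns a set of actions. This is a sketch with hypothetical names; the real function also runs a CAS on the latch word and may drop the BCB mutex, none of which is modeled here.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum { SK_NO_LATCH, SK_LATCH_READ, SK_LATCH_WRITE } sketch_latch_mode;

/* Actions are bit flags because a synchronous caller can both request an
 * async flush and block. */
enum { SK_FLUSH_NOW = 1, SK_SET_ASYNC_REQ = 2, SK_BLOCK = 4 };

static int
sketch_flush_decision (bool dirty, bool is_flushing, sketch_latch_mode mode,
                       bool i_hold_write_latch, bool synchronous)
{
  int actions = 0;

  if (!dirty)
    return 0;                               /* clean page: nothing to do */

  if (!is_flushing && (mode != SK_LATCH_WRITE || i_hold_write_latch))
    return SK_FLUSH_NOW;                    /* immediate path */

  if (!is_flushing)
    actions |= SK_SET_ASYNC_REQ;            /* foreign write latch: ask the holder */
  if (synchronous)
    actions |= SK_BLOCK;                    /* wait with a FLUSH-mode request */
  return actions;
}
```

A return value of 0 for a dirty page corresponds to the no-wait return row: someone else is already flushing and the caller did not ask to wait.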

Invariant: at most one flusher per BCB. A thread starts a flush only if pgbuf_bcb_is_flushing is false at the moment it commits the flushing flag (set inside pgbuf_bcb_flush_with_wal). A second thread sees is_flushing == true, cannot take immediate_flush, and blocks (PGBUF_LATCH_FLUSH) or returns. If the invariant is violated, two fileio_write calls race and can land an older image after a newer one, corrupting the page. The force_unlock/force_lock wrappers only normalize the locked out-parameter, since the internal function may drop the mutex when it blocks.

8.3 pgbuf_bcb_flush_with_wal — the durable write


The heart of the chapter. The caller holds the mutex; the function copies the page, enforces WAL, writes through the DWB or directly, and on success clears FLUSHING and wakes waiters; on failure it reverts DIRTY and oldest_unflush_lsa.

flowchart TB
  A["mark_is_flushing\nset FLUSHING, clear DIRTY"] --> C["copy_unflushed_lsa\nsave lsa+oldest_unflush\nNULL oldest_unflush, UNLOCK"]
  C --> D{oldest_unflush_lsa\nnon-null?}
  D -->|yes| E["logpb_flush_log_for_wal"]
  D -->|no| F["debug: changed not logged"]
  E --> G{uses_dwb?}
  F --> G
  G -->|yes| H["dwb_add_page"]
  G -->|no| I["fileio_write"]
  H --> J{error?}
  I --> J
  J -->|fail| K["mark_was_not_flushed\nrestore DIRTY+lsa, wake, ER_FAILED"]
  J -->|ok, flush thread + waiter| L["queue to flushed_bcbs\nwake post-flush daemon"]
  J -->|ok| M["mark_was_flushed\nclear FLUSHING, wake"]

Figure 8-2 — pgbuf_bcb_flush_with_wal branch map.

Step 1 — claim the flush, clear DIRTY. was_dirty = pgbuf_bcb_mark_is_flushing (...) sets FLUSHING_TO_DISK and atomically clears DIRTY and ASYNC_FLUSH_REQ:

// pgbuf_bcb_mark_is_flushing -- src/storage/page_buffer.c
if (pgbuf_bcb_is_dirty (bcb)) {
pgbuf_bcb_update_flags (thread_p, bcb, PGBUF_BCB_FLUSHING_TO_DISK_FLAG,
PGBUF_BCB_DIRTY_FLAG | PGBUF_BCB_ASYNC_FLUSH_REQ); /* <- set | clear */
return true;
}

Invariant — DIRTY clears at flush start, not end. A concurrent re-dirty during the long copy-to-write window re-sets DIRTY and is not lost; the page just flushes again. On write failure, pgbuf_bcb_mark_was_not_flushed re-sets DIRTY (when was_dirty).
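
The set-and-clear in one CAS, and why a concurrent re-dirty survives, can be shown with a short sketch. The names and the bit assignments are assumptions of this sketch, and the not-dirty branch of the mark function (elided in the excerpt above) is filled in here with a guess that only the flushing bit is set.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define SK_DIRTY      0x80000000u   /* assumed bit assignments */
#define SK_FLUSHING   0x40000000u
#define SK_ASYNC_REQ  0x02000000u

/* Set and clear flag bits in one CAS; returns the value seen before the change. */
static uint32_t sketch_update_flags (_Atomic uint32_t *flags, uint32_t set, uint32_t clear)
{
  uint32_t old = atomic_load (flags), updated;
  do
    {
      updated = (old | set) & ~clear;
      if (updated == old)
        return old;                 /* no-op: nothing to publish */
    }
  while (!atomic_compare_exchange_weak (flags, &old, updated));
  return old;
}

/* Claim the flush. The caller is assumed to hold the BCB mutex, as in 8.3. */
static bool sketch_mark_is_flushing (_Atomic uint32_t *flags)
{
  if (atomic_load (flags) & SK_DIRTY)
    {
      sketch_update_flags (flags, SK_FLUSHING, SK_DIRTY | SK_ASYNC_REQ);
      return true;
    }
  sketch_update_flags (flags, SK_FLUSHING, 0);   /* assumption: elided in the excerpt */
  return false;
}

/* Failure path: drop FLUSHING and put DIRTY back if the page was dirty. */
static void sketch_mark_was_not_flushed (_Atomic uint32_t *flags, bool was_dirty)
{
  sketch_update_flags (flags, was_dirty ? SK_DIRTY : 0, SK_FLUSHING);
}
```

Because DIRTY is cleared up front, a writer that sets it again during the write window leaves a bit that the end of the flush never touches, so the page is simply picked up by a later flush.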

Step 2 — copy the image. At start_copy_page the page is copied into a stack iopage (via tde_encrypt_data_page if TDE-encrypted, else a memcpy of IO_PAGESIZE). If uses_dwb, the copy is staged into a DWB slot by dwb_set_data_on_next_slot; on a granted slot the local iopage is nulled and control jumps to copy_unflushed_lsa.

Step 3 — WAL enforcement. At copy_unflushed_lsa it saves the page LSA and oldest_unflush_lsa, NULLs bufptr->oldest_unflush_lsa, drops the mutex, and — if the saved oldest_unflush_lsa is non-null — forces the log:

// pgbuf_bcb_flush_with_wal -- src/storage/page_buffer.c
LSA_COPY (&lsa, &(bufptr->iopage_buffer->iopage.prv.lsa));
LSA_COPY (&oldest_unflush_lsa, &bufptr->oldest_unflush_lsa);
LSA_SET_NULL (&bufptr->oldest_unflush_lsa);
PGBUF_BCB_UNLOCK (bufptr); *is_bcb_locked = false;
if (!LSA_ISNULL (&oldest_unflush_lsa))
logpb_flush_log_for_wal (thread_p, &lsa); /* <- log-before-page: force log up to page LSA */

WAL INVARIANT — log up to the page LSA is durable before the write. The page is never handed to fileio_write/dwb_add_page until the log tail is forced through lsa (the page’s own prv.lsa). Enforcement is structural: logpb_flush_log_for_wal (thread_p, &lsa) sits between the mutex drop and the write, and lsa is read before the mutex drops so it cannot be re-stamped underneath. The trigger gate is oldest_unflush_lsa != NULL, but the force targets lsa (the newest change), not oldest_unflush_lsa. What breaks if skipped: a crash after the page write but before the log write leaves a persisted change whose redo/undo never reached disk — recovery cannot reconstruct or roll it back, so the page is silently corrupt. See cubrid-log-manager-detail.md for how logpb_flush_log_for_wal guarantees durability to a given LSA. The else branch (null oldest_unflush_lsa) is the rare “changed but not logged” case (temporary volumes) and only emits a debug note.

Null-ing bufptr->oldest_unflush_lsa lets a re-dirty during the write window re-stamp a fresh value that a later flush re-forces; on write failure the saved value is restored (Step 5a).
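
The ordering in Step 3 can be modeled with a toy log that only tracks how far it has been flushed. Everything here is hypothetical scaffolding (the sketch_ types, the (-1, -1) null LSA, the comparison helper); only the sequence of save, null, force, write follows the text.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct { int64_t pageid; int32_t offset; } sketch_lsa;

static const sketch_lsa SK_NULL_LSA = { -1, -1 };   /* assumed null encoding */

static bool sk_lsa_is_null (sketch_lsa a)
{
  return a.pageid == -1;
}

static bool sk_lsa_le (sketch_lsa a, sketch_lsa b)
{
  return a.pageid < b.pageid || (a.pageid == b.pageid && a.offset <= b.offset);
}

typedef struct { sketch_lsa flushed; } sketch_log;                  /* durable up to here */
typedef struct { sketch_lsa lsa; sketch_lsa oldest_unflush; } sketch_page;

/* Toy stand-in for forcing the log: advance the durable point if needed. */
static void sk_log_flush_to (sketch_log *log, sketch_lsa upto)
{
  if (!sk_lsa_le (upto, log->flushed))
    log->flushed = upto;
}

/* Returns true if the page write that would follow is WAL-safe. */
static bool sketch_flush_page (sketch_log *log, sketch_page *page)
{
  sketch_lsa lsa = page->lsa;                   /* newest change on the page */
  sketch_lsa oldest = page->oldest_unflush;     /* saved for a failure rollback */

  page->oldest_unflush = SK_NULL_LSA;           /* a re-dirty may now re-stamp it */
  if (!sk_lsa_is_null (oldest))
    sk_log_flush_to (log, lsa);                 /* gate on oldest, force up to lsa */

  /* The page write would happen here, after the log force. */
  return sk_lsa_is_null (oldest) || sk_lsa_le (lsa, log->flushed);
}
```

The gate and the target are deliberately different values: whether to force depends on oldest_unflush_lsa, but how far to force depends on the page's own LSA.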

Step 4 — the write. DWB use is gated by uses_dwb = dwb_is_created () && !is_temp (temp volumes always bypass it). If uses_dwb, dwb_add_page registers the page’s VPID into the staged slot; the DWB batches pages and flushes a full block to the double-write area and then the data files (block geometry and the torn-page recovery argument are in cubrid-double-write-buffer.md). Subtle branch: if DWB was disabled between staging and adding, dwb_add_page returns dwb_slot == NULL, so the code clears uses_dwb, re-locks, and goto start_copy_page to retry direct. The direct path does a plain fileio_write (bumping num_pages_written, PSTAT_PB_NUM_IOWRITES) with mode FILEIO_WRITE_NO_COMPENSATE_WRITE when a DWB exists globally (double-write makes torn-page compensation redundant), else FILEIO_WRITE_DEFAULT_WRITE.

Step 5a — write failure. Re-lock, pgbuf_bcb_mark_was_not_flushed (.., was_dirty) clears FLUSHING and restores DIRTY, restore the saved oldest_unflush_lsa, wake PGBUF_LATCH_FLUSH waiters (only if next_wait_thrd != NULL), return ER_FAILED.

Step 5b — success, daemon hand-off. The hand-off is taken only when all four conditions hold: the caller is the page flush daemon (is_page_flush_thread), the post-flush daemon exists, a thread is waiting for a direct victim, and pgbuf_Pool.flushed_bcbs accepts the BCB (via produce). The BCB is then left unlocked but un-cleared, and the post-flush daemon is woken to assign it as a victim (Chapter 9); mark_was_flushed is deliberately not called on this path. Step 5c (otherwise) re-locks, calls pgbuf_bcb_mark_was_flushed (clears FLUSHING), and wakes flush waiters when any are queued.

8.4 Waking the FLUSH waiters: pgbuf_wake_flush_waiters

Threads that took the block branch in 8.2 park on the BCB’s next_wait_thrd list with request_latch_mode == PGBUF_LATCH_FLUSH. The waker unlinks only FLUSH waiters, leaving READ/WRITE latch waiters in place:

// pgbuf_wake_flush_waiters -- src/storage/page_buffer.c
for (crt_waiter = bcb->next_wait_thrd; crt_waiter != NULL; crt_waiter = save_next_waiter) {
  save_next_waiter = crt_waiter->next_wait_thrd;
  if (crt_waiter->request_latch_mode == PGBUF_LATCH_FLUSH) {
    if (prev_waiter != NULL) prev_waiter->next_wait_thrd = save_next_waiter;
    else bcb->next_wait_thrd = save_next_waiter; /* <- unlink only FLUSH waiters */
    crt_waiter->next_wait_thrd = NULL;
    pgbuf_wakeup_uncond (crt_waiter);
  } else {
    prev_waiter = crt_waiter; /* <- keep latch waiters threaded */
  }
}

The caller must hold the BCB mutex. Both the failure and success paths of 8.3 call it, but only when next_wait_thrd != NULL. Mixing FLUSH and latch waiters on one list is why the loop tracks prev_waiter instead of truncating.
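The selective unlink can be exercised on a standalone list. The sketch below is a simplified model under stated assumptions: `Waiter`, `LatchMode`, and `wake_flush_waiters` are hypothetical names, the wake-up call is replaced by recording the waiter id, and no mutex is modeled.

```cpp
#include <cassert>
#include <vector>

enum class LatchMode { Read, Write, Flush };

// Hypothetical waiter node; `next` plays the role of next_wait_thrd.
struct Waiter {
  int id;
  LatchMode mode;
  Waiter *next;
  Waiter(int i, LatchMode m) : id(i), mode(m), next(nullptr) {}
};

// Unlinks every Flush waiter from the list headed by `head`, returns their ids
// in wake order, and leaves Read/Write waiters linked in their original order.
inline std::vector<int> wake_flush_waiters(Waiter *&head) {
  std::vector<int> woken;
  Waiter *prev = nullptr;
  for (Waiter *cur = head, *next = nullptr; cur != nullptr; cur = next) {
    next = cur->next;  // save the successor before unlinking
    if (cur->mode == LatchMode::Flush) {
      if (prev != nullptr) prev->next = next;
      else head = next;
      cur->next = nullptr;
      woken.push_back(cur->id);  // stands in for the real wake-up call
    } else {
      prev = cur;  // latch waiters stay threaded
    }
  }
  return woken;
}
```

On a list FLUSH(1), READ(2), FLUSH(3), WRITE(4), the function wakes 1 and 3 and leaves READ(2) linked directly to WRITE(4).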

8.5 The Page Flush Daemon: candidate collection and flushing

pgbuf_flush_victim_candidates is the daemon body: size the scan, collect dirty candidates from the LRUs, force the log, flush each survivor.

Adaptive scan width. It reads/resets lru_victim_req_cnt and fix_req_cnt for lru_miss_rate, then boosts flush_ratio * num_buffers by up to PGBUF_FLUSH_VICTIM_BOOST_MULT (=10) when misses are high — but only when not in checkpoint (checkpoint already flushes, so boosting would double-flush). The result caps at ~200 MB of pages.

// pgbuf_flush_victim_candidates -- src/storage/page_buffer.c
if (pgbuf_Pool.is_checkpoint == false) {
lru_dynamic_flush_adj = MAX (1.0f, 1 + (PGBUF_FLUSH_VICTIM_BOOST_MULT - 1) * lru_miss_rate);
lru_dynamic_flush_adj = MIN (PGBUF_FLUSH_VICTIM_BOOST_MULT, lru_dynamic_flush_adj);
} else lru_dynamic_flush_adj = 1.0f;
check_count_lru = (int) (cfg_check_cnt * lru_dynamic_flush_adj);
check_count_lru = MIN (check_count_lru, (200 * 1024 * 1024) / DB_PAGESIZE);
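
To make the boost arithmetic concrete, the excerpt above can be restated as a pure function. This is a sketch only: `scan_width` and its parameter names are hypothetical, and the page size is passed in rather than read from DB_PAGESIZE.

```cpp
#include <algorithm>
#include <cassert>

constexpr float kBoostMult = 10.0f;             // PGBUF_FLUSH_VICTIM_BOOST_MULT
constexpr long kCapBytes = 200L * 1024 * 1024;  // the ~200 MB ceiling

// cfg_check_cnt is flush_ratio * num_buffers, already computed by the caller.
inline int scan_width(int cfg_check_cnt, float lru_miss_rate, bool is_checkpoint,
                      int page_size) {
  assert(page_size > 0);
  float adj = 1.0f;  // checkpoint: no boost
  if (!is_checkpoint) {
    adj = std::max(1.0f, 1.0f + (kBoostMult - 1.0f) * lru_miss_rate);
    adj = std::min(kBoostMult, adj);
  }
  const int width = static_cast<int>(cfg_check_cnt * adj);
  return std::min(width, static_cast<int>(kCapBytes / page_size));
}
```

With 16 KB pages the cap is 12800 pages. A base width of 1000 at a 50% miss rate is boosted by 1 + 9 * 0.5 = 5.5 to 5500; the same inputs during a checkpoint stay at 1000.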

Branches after collection. If victim_count == 0, nothing to flush; sets *stop true (so the daemon loop in 8.8 breaks) only when scanning was actually attempted (check_count_lru > 0 && lru_sum_flush_priority > 0), then goto end. Otherwise it wakes the log flush daemon (WAL needs the log current — log_wakeup_log_flush_daemon, or logpb_force_flush_pages if no daemon), optionally qsorts the list by VPID under PRM_ID_PB_SEQUENTIAL_VICTIM_FLUSH, and sets is_flushing_victims = true.

Per-candidate loop. For each candidate it locks the BCB and runs a four-way decision, three skip guards followed by the flush itself:

  1. VPID changed / not dirty / already flushing -> num_skipped_already_flushed, unlock, continue.
  2. left the LRU victim zone or got fixed/hot -> num_skipped_fixed_or_hot, unlock, continue.
  3. logpb_need_wal (page LSA beyond flushed log) -> record max lsa_need_wal, bump count_need_wal, wake log flush daemon, num_skipped_need_wal, unlock, continue.
  4. else flush: pgbuf_flush_page_and_neighbors_fb when PGBUF_NEIGHBOR_PAGES > 1, else pgbuf_bcb_flush_with_wal (..., true, &is_bcb_locked) (is_page_flush_thread=true; the loop unlocks the BCB if it stayed locked). On error -> goto end.

The repeat retry. At end, if every candidate was skipped purely for WAL (count_need_wal == victim_count) and a thread still waits for a direct victim, the daemon forces the log itself (logpb_flush_log_for_wal) and jumps to repeat exactly once (the second pass asserts LSAs advanced), then clears is_flushing_victims.
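
The guard order and the retry condition can be summarized as two small predicates. This is a hedged sketch: `CandidateState`, `classify`, and `should_retry_after_log_force` are hypothetical names, the real checks read BCB flags and zones under the BCB mutex, and logpb_need_wal is reduced to an integer comparison.

```cpp
#include <cassert>
#include <cstdint>

enum class Verdict { SkipAlreadyFlushed, SkipFixedOrHot, SkipNeedWal, Flush };

// Hypothetical snapshot of what the loop inspects once the BCB is locked.
struct CandidateState {
  bool vpid_matches_snapshot;
  bool dirty;
  bool flushing;
  bool in_victim_zone;
  bool fixed;
  std::uint64_t page_lsa;
};

// Guards are evaluated in the order listed above; the first hit wins.
inline Verdict classify(const CandidateState &c, std::uint64_t flushed_log_lsa) {
  if (!c.vpid_matches_snapshot || !c.dirty || c.flushing) return Verdict::SkipAlreadyFlushed;
  if (!c.in_victim_zone || c.fixed) return Verdict::SkipFixedOrHot;
  if (c.page_lsa > flushed_log_lsa) return Verdict::SkipNeedWal;  // models logpb_need_wal
  return Verdict::Flush;
}

// The single repeat: every candidate was WAL-blocked and someone still waits.
inline bool should_retry_after_log_force(int count_need_wal, int victim_count,
                                         bool waiter_present, bool already_repeated) {
  return !already_repeated && victim_count > 0 && count_need_wal == victim_count &&
         waiter_present;
}
```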

Neighbor batching: pgbuf_flush_page_and_neighbors_fb. When PGBUF_NEIGHBOR_PAGES > 1, branch 4 calls this function, which grows a contiguous-VPID window around the anchor so a run of physically adjacent pages is written in one sequential sweep. The window state lives in a static file-scope global, pgbuf_Flush_helper (type pgbuf_batch_flush_helper) — not a per-call stack object; a single shared instance is safe because the dedicated page-flush daemon is the only caller, and each invocation zeroes the counters at entry.

// pgbuf_batch_flush_helper -- src/storage/page_buffer.c
struct pgbuf_batch_flush_helper
{
int npages; /* <- pages currently staged in the window */
int fwd_offset; /* <- pages added forward (higher pageid) of anchor */
int back_offset; /* <- pages added backward (lower pageid) of anchor */
PGBUF_BCB *pages_bufptr[2 * PGBUF_MAX_NEIGHBOR_PAGES - 1]; /* <- window BCBs */
VPID vpids[2 * PGBUF_MAX_NEIGHBOR_PAGES - 1]; /* <- their VPIDs */
};
// static PGBUF_BATCH_FLUSH_HELPER pgbuf_Flush_helper; <- the single shared instance

| Field | Role | Why it exists |
| --- | --- | --- |
| npages | pages staged in the window | end bound of the per-window flush loop; trimmed when a tail/head neighbor is clean |
| fwd_offset | forward reach (higher pageid) from anchor | next forward candidate is anchor + fwd_offset + 1 |
| back_offset | backward reach (lower pageid) from anchor | next backward candidate is anchor - back_offset - 1 |
| pages_bufptr[2*MAX-1] | BCB handles for every window member | the BCBs flushed; sized to reach PGBUF_MAX_NEIGHBOR_PAGES-1 (=31) each way around the anchor |
| vpids[2*MAX-1] | VPID snapshot per member, parallel to pages_bufptr[] | validate key: pgbuf_flush_neighbor_safe re-checks it so a member whose VPID changed before its write is skipped |

PGBUF_NEIGHBOR_POS (off) indexes the arrays relative to the anchor (PGBUF_NEIGHBOR_PAGES - 1 + off). The window is not strictly dirty-only: when PGBUF_NEIGHBOR_FLUSH_NONDIRTY is enabled the probe deliberately admits interior clean pages to keep the on-disk run contiguous, abandoning the batch only on two consecutive non-dirties (NEIGHBOR_ABORT_TWO_CONSECTIVE_NONDIRTIES) or when non-dirties exceed half the window past a small threshold (NEIGHBOR_ABORT_TOO_MANY_NONDIRTIES). A clean page at the very tail or head is then trimmed (decrement the offset and npages) so the run does not end on a wasted write. Before the sweep the neighbor path enforces WAL once for the whole window:

// pgbuf_flush_page_and_neighbors_fb -- src/storage/page_buffer.c
/* WAL protocol: force log record to disk */
logpb_flush_log_for_wal (thread_p, &log_newest_oldest_unflush_lsa);
for (pos = PGBUF_NEIGHBOR_POS (-helper->back_offset); pos <= PGBUF_NEIGHBOR_POS (helper->fwd_offset); pos++)
error = pgbuf_flush_neighbor_safe (thread_p, helper->pages_bufptr[pos], &helper->vpids[pos], &was_page_flushed);

pgbuf_flush_neighbor_safe re-routes each member through the single-page path (re-validating its VPID), so per-page WAL still holds; the batch force just guarantees the log is current before the contiguous write begins. A single-page window (npages <= 1) skips the batch force and flushes the lone page directly.
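
The anchor-relative indexing and the end trim can be sketched as follows. This is an illustration under assumptions: `kNeighborPages` is a hypothetical configured value of 4, `Window` keeps only a dirty bit per slot instead of BCB pointers, and the trim removes at most one clean page per end because the two-consecutive-non-dirty rule already aborts longer clean runs.

```cpp
#include <cassert>
#include <vector>

constexpr int kNeighborPages = 4;  // hypothetical configured PGBUF_NEIGHBOR_PAGES

// Array index of the slot `off` pages away from the anchor (off may be negative).
constexpr int neighbor_pos(int off) { return kNeighborPages - 1 + off; }

// Hypothetical reduced window: one dirty bit per slot.
struct Window {
  int npages = 1;  // the anchor itself
  int fwd_offset = 0;
  int back_offset = 0;
  std::vector<bool> dirty = std::vector<bool>(2 * kNeighborPages - 1, false);
};

// Drop a clean page at either end so the run does not start or end on a wasted write.
inline void trim_clean_ends(Window &w) {
  if (w.fwd_offset > 0 && !w.dirty[neighbor_pos(w.fwd_offset)]) {
    --w.fwd_offset;
    --w.npages;
  }
  if (w.back_offset > 0 && !w.dirty[neighbor_pos(-w.back_offset)]) {
    --w.back_offset;
    --w.npages;
  }
}
```

With the anchor at index 3, a window reaching one page back and two forward whose last forward page is clean shrinks from four pages to three.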

pgbuf_get_victim_candidates_from_lru is the collector that 8.5 calls. It walks each LRU from the bottom through the victim zone, budgeting by each list’s lru_victim_flush_priority_per_lru:

// pgbuf_get_victim_candidates_from_lru -- src/storage/page_buffer.c
for (bufptr = pgbuf_Pool.buf_LRU_list[lru_idx].bottom;
     bufptr != NULL && PGBUF_IS_BCB_IN_LRU_VICTIM_ZONE (bufptr) && i > 0;
     bufptr = bufptr->prev_BCB, i--) {
  if (pgbuf_bcb_is_dirty (bufptr)) {
    pgbuf_Pool.victim_cand_list[victim_cand_count].bufptr = bufptr;
    pgbuf_Pool.victim_cand_list[victim_cand_count].vpid = bufptr->vpid;
    victim_cand_count++; /* <- dirty -> flush before victimization */
  }
#if defined (SERVER_MODE)
  else if (try_direct_assign && pgbuf_is_any_thread_waiting_for_direct_victim ()
           && pgbuf_is_bcb_victimizable (bufptr, false) && PGBUF_BCB_TRYLOCK (bufptr) == 0) {
    if (pgbuf_is_bcb_victimizable (bufptr, true) && pgbuf_assign_direct_victim (thread_p, bufptr)) {
      try_direct_assign = false; *assigned_directly = true; /* <- clean bcb handed to a waiter */
    }
    PGBUF_BCB_UNLOCK (bufptr);
  }
#endif
}

Two outputs: dirty BCBs go to the candidate list (they need a flush before victimization), while a single clean victimizable BCB may be handed straight to a starving waiter (assigned_directly) under trylock so the scan never blocks. Candidate VPIDs are snapshotted so the flush loop in 8.5 can detect a reassigned BCB (guard 1 there). The whole walk runs under the per-LRU mutex.

8.7 The seq-flusher and pgbuf_flush_seq_list pacing

Checkpoint flushing is rate-controlled by a PGBUF_SEQ_FLUSHER: unlike the victim daemon, it spreads writes across one-second “super-intervals” so checkpoint I/O does not starve the foreground.

struct pgbuf_seq_flusher — every field:

| Field | Role | Why it exists |
| --- | --- | --- |
| flush_list | array of PGBUF_VICTIM_CANDIDATE_LIST (bufptr+vpid) | working set for this pass |
| flush_upto_lsa | newest oldest-LSA over all listed pages | WAL gate; pages beyond it are skipped |
| control_intervals_cnt | intervals elapsed this 1 s super-interval | feeds the flush_per_interval math |
| control_flushed | pages flushed so far this super-interval | lets a slow interval be compensated next |
| interval_msec | duration of one pacing interval | computed in pgbuf_flush_chkpt_seq_list as 1000 * PGBUF_CHKPT_BURST_PAGES / chkpt_flush_rate, where chkpt_flush_rate = 1000 / PRM_ID_LOG_CHECKPOINT_SLEEP_MSECS (not from the struct flush_rate field) |
| flush_max_size | capacity of flush_list, set at init | batch-size bound; checkpoint refills when full |
| flush_cnt | live element count | end bound of the flush loop |
| flush_idx | index of next element to flush | resumes across interval boundaries |
| flushed_pages | pages flushed this call (return param) | accumulated by the caller |
| flush_rate | max pages/sec (negative = unlimited) | target the pacing math converges to; set to chkpt_flush_rate each interval |
| burst_mode | flush a chunk ASAP vs one page then sleep | burst keeps data I/O sequential |

pgbuf_initialize_seq_flusher zeroes the struct, sets flush_max_size, allocates flush_list, and defaults burst_mode = true. pgbuf_flush_seq_list derives flush_per_interval from the control counters: with control_intervals_cnt > 0 it targets flush_rate * (control_intervals_cnt+1) / control_total_cnt_intervals minus what was already flushed (compensation), floored at PGBUF_CHKPT_MIN_FLUSH_RATE (=50) scaled by the interval. The loop runs while flush_idx < flush_cnt && flushed_pages < flush_per_interval:

// pgbuf_flush_seq_list -- src/storage/page_buffer.c
PGBUF_BCB_LOCK (bufptr); locked_bcb = true;
if (!VPID_EQ (&bufptr->vpid, &f_list[seq_flusher->flush_idx].vpid) || !pgbuf_bcb_is_dirty (bufptr)
|| (flush_if_already_flushed == false && !LSA_ISNULL (&bufptr->oldest_unflush_lsa)
&& LSA_GT (&bufptr->oldest_unflush_lsa, &seq_flusher->flush_upto_lsa)))
{ PGBUF_BCB_UNLOCK (bufptr); dropped_pages++; continue; } /* <- stale / beyond chkpt horizon */
if (pgbuf_bcb_safe_flush_force_lock (thread_p, bufptr, true) == NO_ERROR) { /* ... done_flush = true */ }

The flush_if_already_flushed heuristic re-flushes an already-flushed page only if its VPID is contiguous with the next list entry — preferring write sequentiality over avoiding a redundant write. After each page, non-burst mode sleeps time_remaining / pages_remaining ms (skipped when below 1000 / PGBUF_CHKPT_MAX_FLUSH_RATE) to spread the writes; burst mode only checks the absolute limit_time and breaks (*time_rem = -1) when exceeded. flush_upto_lsa is the WAL gate: only pages whose oldest-unflush LSA is at or below it flush in this checkpoint.
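
The two pacing formulas quoted in this section reduce to a few lines of arithmetic. The sketch below is illustrative only: `interval_msec`, `sleep_per_page_msec`, and their parameters are hypothetical names, the burst size is passed in because the value of PGBUF_CHKPT_BURST_PAGES is not stated here, and `min_sleep_msec` stands in for 1000 / PGBUF_CHKPT_MAX_FLUSH_RATE.

```cpp
#include <cassert>

// interval_msec = 1000 * burst_pages / rate, with rate = 1000 / sleep_msecs (pages/sec).
inline int interval_msec(int burst_pages, int checkpoint_sleep_msecs) {
  assert(checkpoint_sleep_msecs > 0);
  const float rate = 1000.0f / static_cast<float>(checkpoint_sleep_msecs);
  return static_cast<int>(1000.0f * burst_pages / rate);
}

// Non-burst pacing: spread the remaining time evenly over the remaining pages.
// Returns 0 (no sleep) when the slice is below the minimum worth sleeping for.
inline int sleep_per_page_msec(int time_remaining_msec, int pages_remaining,
                               int min_sleep_msec) {
  if (pages_remaining <= 0 || time_remaining_msec <= 0) return 0;
  const int slice = time_remaining_msec / pages_remaining;
  return slice < min_sleep_msec ? 0 : slice;
}
```

Since the rate is 1000 / sleep_msecs, the interval simplifies to burst_pages * sleep_msecs: an assumed burst of 16 pages with a 10 ms checkpoint sleep gives a 160 ms interval.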

8.8 The three daemons and checkpoint flush

Three SERVER_MODE daemons register via REGISTER_DAEMON:

| Daemon | Task type | Looper | Body |
| --- | --- | --- | --- |
| pgbuf_Page_flush_daemon | dedicated pgbuf_page_flush_daemon_task (subclass of cubthread::entry_task) | pgbuf_get_page_flush_interval (timed if PRM_ID_PAGE_BG_FLUSH_INTERVAL_MSECS > 0, else infinite wait) | loop pgbuf_flush_victim_candidates |
| pgbuf_Page_post_flush_daemon | entry_callable_task (pgbuf_page_post_flush_execute) | 3-tier looper {1,10,100} ms | pgbuf_assign_flushed_pages: drain flushed_bcbs, assign direct victims |
| pgbuf_Page_maintenance_daemon | entry_callable_task (pgbuf_page_maintenance_execute) | fixed 100 ms | pgbuf_adjust_quotas + pgbuf_direct_victims_maintenance |

The flush daemon runs at least once if explicitly woken (was_woken_up), then loops while pgbuf_keep_victim_flush_thread_running or until pgbuf_flush_victim_candidates sets stop_iteration:

// pgbuf_page_flush_daemon_task::execute -- src/storage/page_buffer.c
bool force_one_run = pgbuf_Page_flush_daemon->was_woken_up ();
while (force_one_run || pgbuf_keep_victim_flush_thread_running ()) {
pgbuf_flush_victim_candidates (&thread_ref, prm_get_float_value (PRM_ID_PB_BUFFER_FLUSH_RATIO),
&m_perf_track, &stop_iteration);
force_one_run = false;
if (stop_iteration) break;
}

It is the only class-based dedicated task; post-flush and maintenance are plain functions wrapped in entry_callable_task. Foreground threads nudge it via pgbuf_wakeup_page_flush_daemon when no victim is found. pgbuf_flush_control_from_dirty_ratio adds a separate adaptive signal to flush harder before the pool saturates: a rate bump that grows quadratically as dirties_cnt exceeds num_buffers/2, plus the dirty growth rate.

Checkpoint flush. pgbuf_flush_checkpoint forces the log to flush_upto_lsa, sets is_checkpoint=true, then scans all BCBs. Each dirty non-temporary page with oldest_unflush_lsa <= flush_upto_lsa is appended to the shared seq_chkpt_flusher.flush_list; when full (>= flush_max_size) it is qsorted by VPID and drained through pgbuf_flush_chkpt_seq_list (which calls the paced pgbuf_flush_seq_list), then refilled. A page older than prev_chkpt_redo_lsa asserts (ER_LOG_CHECKPOINT_SKIP_INVALID_PAGE) — it should have flushed in the previous checkpoint. The smallest unflushed LSA among skipped pages returns in smallest_lsa to advance the redo horizon. The flush_all family (pgbuf_flush_all, _all_unfixed, _all_unfixed_and_set_lsa_as_null) is an unpaced sweep over all BCBs via pgbuf_flush_all_helper, used only by the log/recovery manager.
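
The eligibility filter and the smallest_lsa bookkeeping can be sketched as one pass over a toy pool. Everything here is hypothetical scaffolding (`ChkptPage`, `ChkptPlan`, `plan_checkpoint`); LSAs are plain integers with 0 modeling a null LSA, and treating a dirty page with a null oldest_unflush_lsa as "not a checkpoint candidate" is an assumption of this sketch rather than something the text above states.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct ChkptPage {
  bool dirty;
  bool is_temp;
  std::uint64_t oldest_unflush_lsa;  // 0 models a null LSA
};

struct ChkptPlan {
  std::vector<std::size_t> to_flush;       // indices appended to the flush list
  std::uint64_t smallest_skipped_lsa = 0;  // 0 = nothing skipped
};

inline ChkptPlan plan_checkpoint(const std::vector<ChkptPage> &pool,
                                 std::uint64_t flush_upto_lsa) {
  ChkptPlan plan;
  for (std::size_t i = 0; i < pool.size(); ++i) {
    const ChkptPage &p = pool[i];
    if (!p.dirty || p.is_temp || p.oldest_unflush_lsa == 0) continue;  // not a candidate
    if (p.oldest_unflush_lsa <= flush_upto_lsa) {
      plan.to_flush.push_back(i);  // within the checkpoint horizon
      continue;
    }
    // Beyond the horizon: remember the smallest such LSA for the redo horizon.
    if (plan.smallest_skipped_lsa == 0 || p.oldest_unflush_lsa < plan.smallest_skipped_lsa)
      plan.smallest_skipped_lsa = p.oldest_unflush_lsa;
  }
  return plan;
}
```

Sorting by VPID and the paced drain are left out; the sketch only shows which pages enter the list and which LSA is reported back.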

  1. All single-page flushes funnel through pgbuf_bcb_safe_flush_internal, whose CAS loop on the atomic latch picks immediate flush, async-request-on-unfix, or block-on-PGBUF_LATCH_FLUSH.
  2. pgbuf_bcb_flush_with_wal enforces the WAL invariant by saving the page LSA, NULLing oldest_unflush_lsa under the mutex, dropping it, and calling logpb_flush_log_for_wal (.., &lsa) before any fileio_write / dwb_add_page; skipping the force would lose redo for a persisted page.
  3. DIRTY clears at flush start (pgbuf_bcb_mark_is_flushing) so a concurrent re-dirty is never lost; a failed write restores DIRTY and the saved oldest_unflush_lsa via pgbuf_bcb_mark_was_not_flushed.
  4. At most one flusher per BCB: FLUSHING_TO_DISK plus the blocking FLUSH-latch path serialize writers, preventing an old image from overwriting a newer one.
  5. The Page Flush Daemon (pgbuf_flush_victim_candidates + pgbuf_get_victim_candidates_from_lru) collects dirty bottom-of-LRU candidates, skips fixed/hot/need-WAL pages, may batch neighbors through the shared-global pgbuf_Flush_helper window (which forces WAL once for the whole run and can include interior clean pages for sequentiality), and retries once when all candidates were WAL-blocked.
  6. Checkpoint uses a rate-controlled PGBUF_SEQ_FLUSHER (pgbuf_flush_seq_list) with burst/spread pacing and a flush_upto_lsa WAL gate; flush_all* is an unpaced sweep.
  7. Three daemons exist — one dedicated class (page flush) plus two callable tasks (post-flush, maintenance) — and the DWB, when created, stages every non-temp write into a block before the data files (see cubrid-double-write-buffer.md).

Chapter 9: Victim Selection the LFCQs and Direct Victim Hand-off

This chapter answers: when the invalid (free) list is empty, how does a thread find an evictable BCB, and when none is found, how is a freed BCB handed straight to a sleeping waiter? The high-level companion sketched “LFCQ — victim selection” and “Direct victim hand-off” at altitude; here we trace every branch.

The two paths are duals. The pull path (pgbuf_get_victim) scans the layered lock-free queues (LFCQs) for a clean BCB to claim. The push path (pgbuf_assign_direct_victim / pgbuf_get_direct_victim) is the inverse: a producer that already cleaned a BCB wakes a waiter, writes the BCB into the waiter’s mailbox slot, and skips the LRU. A thread that fails the pull path becomes a waiter (suspend/wake is Ch. 5).

9.1 The two structs: mailbox and candidate slot

pgbuf_direct_victim is the global mailbox-and-queues record (pgbuf_Pool.direct_victims), SERVER_MODE-only. pgbuf_victim_candidate_list is the scratch array the flush daemon (Ch. 8) fills; it is listed here only for completeness.

// pgbuf_direct_victim -- src/storage/page_buffer.c
struct pgbuf_direct_victim {
PGBUF_BCB **bcb_victims; /* per-thread mailbox: bcb_victims[tid] = BCB handed to thread tid */
lockfree::circular_queue<THREAD_ENTRY *> *waiter_threads_high_priority;
lockfree::circular_queue<THREAD_ENTRY *> *waiter_threads_low_priority;
};
// pgbuf_victim_candidate_list -- src/storage/page_buffer.c
struct pgbuf_victim_candidate_list {
PGBUF_BCB *bufptr; /* selected BCB as victim candidate */
VPID vpid; /* page id of the page managed by the BCB */
};

| Struct.Field | Role | Why it exists |
| --- | --- | --- |
| direct_victim.bcb_victims | Array of num_total_threads BCB ptrs by thread_p->index; slot [tid] = tid’s BCB or NULL. | The mailbox. Producer writes a slot under the waiter’s entry lock, waiter reads+NULLs its own on wake: one slot/thread, no contention. |
| direct_victim.waiter_threads_high_priority | LFCQ of threads blocking the system on a victim. | Drained first: latency-critical fixers jump the queue. |
| direct_victim.waiter_threads_low_priority | LFCQ of threads that tolerate waiting. | Drained 1-in-4 ahead of high: the 75/25 weighting (§9.5). |
| victim_candidate_list.bufptr | BCB the flush pass selected to clean. | Lets the flusher re-lock+flush in a second pass without re-scanning. |
| victim_candidate_list.vpid | Snapshot of bufptr->vpid at selection. | Detects reassignment before flush; a stale bufptr whose vpid no longer matches is skipped. |

9.2 pgbuf_get_victim — the staged LFCQ scan

The queues hold list indices, never BCBs; an index sits in a queue iff its list has count_vict_cand > 0 and PGBUF_LRU_VICTIM_LFCQ_FLAG set. The function walks four stages, returning the first locked BCB it claims.

Stage 1 — own private, only with a private LRU that is over quota:

// pgbuf_get_victim -- src/storage/page_buffer.c
if (PGBUF_THREAD_HAS_PRIVATE_LRU (thread_p)) {
private_lru_idx = PGBUF_LRU_INDEX_FROM_PRIVATE (PGBUF_PRIVATE_LRU_FROM_THREAD (thread_p));
lru_list = PGBUF_GET_LRU_LIST (private_lru_idx);
if (PGBUF_LRU_LIST_IS_ONE_TWO_OVER_QUOTA (lru_list) /* zone1+2 exceeds quota */
|| (PGBUF_LRU_LIST_IS_OVER_QUOTA (lru_list) && lru_list->count_vict_cand > 0)) {
victim = pgbuf_get_victim_from_lru_list (thread_p, private_lru_idx);
if (victim != NULL) { return victim; } /* <- happy path */
if (!PGBUF_VACUUM_SHOULD_IGNORE_UNFIX (thread_p))
restrict_other = PGBUF_LRU_LIST_IS_OVER_QUOTA_WITH_BUFFER (lru_list); /* gate stage 2 */
searched_own = true;
} }

restrict_other is set only for a non-vacuum thread comfortably over quota (quota + MAX(10, quota*0.01) buffer); it confines stage 2 to big-private lists. searched_own stops stage 4 repeating stage 1.

Stage 2 — other private, entered only when PGBUF_PAGE_QUOTA_IS_ENABLED && has_flush_thread; it calls pgbuf_lfcq_get_victim_from_private_lru (thread_p, restrict_other) (§9.4) and returns on the first claim.

Stage 3 — shared, in a guarded loop — the only looping stage, and only without a flush daemon to refill candidates:

do {
victim = pgbuf_lfcq_get_victim_from_shared_lru (thread_p, has_flush_thread);
if (victim != NULL) { return victim; } /* <- happy path */
current_consume_cursor = pgbuf_Pool.shared_lrus_with_victims->get_consumer_cursor ();
}
while (!has_flush_thread && !pgbuf_Pool.shared_lrus_with_victims->is_empty ()
&& ((int) (current_consume_cursor - initial_consume_cursor) <= pgbuf_Pool.num_LRU_list)
&& (++nloops <= pgbuf_Pool.num_LRU_list));

The four while conditions each stop the spin: a flush daemon present, the queue empty, more indices consumed than there are shared lists, or nloops exceeding num_LRU_list (a paranoia guard). With a flush daemon the body runs exactly once.

Stage 4 — last-resort own private, ignoring quota. Only if stages 1-3 failed and stage 1 never ran (PGBUF_THREAD_HAS_PRIVATE_LRU && !searched_own), it re-calls pgbuf_get_victim_from_lru_list on the own list and returns the result; otherwise it falls through to return victim (NULL). This guards the source-documented deadlock: all private lists just below quota, shared lists with no zone-3, nothing victimizable or flushable. A NULL return tells the caller to enqueue as a waiter and sleep (Ch. 5). Figure 9-1 traces all four stages.

Figure 9-1 — pgbuf_get_victim staged scan.

flowchart TD
  B{"own private over quota?"} -- yes --> C["victim_from_lru own"]
  C -->|found| R["return victim"]
  C -->|fail| D["restrict_other, searched_own"]
  B -- no --> E
  D --> E{"quota+flush thread?"}
  E -- yes --> F["lfcq private: big then ordinary"]
  F -->|found| R
  E -- no --> G
  F -->|fail| G["loop: lfcq shared"]
  G -->|found| R
  G -->|exhausted| H{"not searched_own?"}
  H -- yes --> I["victim_from_lru own, no quota"]
  I -->|found| R
  H -- no --> J["NULL: wait"]
  I -->|fail| J
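
The control flow of the figure can be traced with a small driver. This is a sketch, not the real function: `ScanInputs` and `staged_scan` are hypothetical, each stage is reduced to a callback that reports whether it claimed a victim, the restrict_other gate is omitted, and stage 3 is modeled as a single attempt (the case with a flush daemon present).

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

struct ScanInputs {
  bool has_private_lru;
  bool own_over_quota;
  bool quota_enabled_and_flush_thread;
  std::function<bool(const std::string &)> try_stage;  // true when the stage claims a BCB
};

// Returns the name of the stage that produced a victim, or "" (NULL: caller waits).
inline std::string staged_scan(const ScanInputs &in, std::vector<std::string> &trace) {
  bool searched_own = false;
  auto attempt = [&](const std::string &stage) -> bool {
    trace.push_back(stage);
    return in.try_stage(stage);
  };
  if (in.has_private_lru && in.own_over_quota) {  // stage 1
    if (attempt("own")) return "own";
    searched_own = true;
  }
  if (in.quota_enabled_and_flush_thread && attempt("other-private")) return "other-private";
  if (attempt("shared")) return "shared";  // stage 3
  if (in.has_private_lru && !searched_own && attempt("own-no-quota")) return "own-no-quota";
  return "";
}
```

A thread whose private list is under quota skips stage 1, so after stages 2 and 3 fail it still gets the stage 4 attempt on its own list.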

9.3 pgbuf_get_victim_from_lru_list — bottom-up scan, four exclusions

Where a BCB is actually claimed: scan upward (via prev_BCB) through zone 3, starting at victim_hint or at the list bottom when the hint is NULL, apply the exclusion mask, and on success remove the BCB and return it locked. Three early NULL returns precede the scan, then the hint is resynced:

// pgbuf_get_victim_from_lru_list -- src/storage/page_buffer.c
if (lru_list->count_vict_cand == 0) { return NULL; } /* <- 1: no candidates, no mutex */
pthread_mutex_lock (&lru_list->mutex);
if (lru_list->bottom == NULL || !PGBUF_IS_BCB_IN_LRU_VICTIM_ZONE (lru_list->bottom))
{ pthread_mutex_unlock (&lru_list->mutex); return NULL; } /* <- 2: no zone-3 */
if (PGBUF_IS_PRIVATE_LRU_ONE_TWO_OVER_QUOTA (lru_idx))
pgbuf_lru_adjust_zones (thread_p, lru_list, false); /* shrink zone1 so zone3 grows */
lru_victim_cnt = lru_list->count_vict_cand;
if (lru_victim_cnt <= 0) { pthread_mutex_unlock (&lru_list->mutex); return NULL; } /* <- 3: race emptied it */
if (!pgbuf_bcb_is_dirty (lru_list->bottom) && lru_list->victim_hint != lru_list->bottom)
(void) ATOMIC_TAS_ADDR (&lru_list->victim_hint, /* resync drifted hint */
PGBUF_IS_BCB_IN_LRU_VICTIM_ZONE (lru_list->bottom) ? lru_list->bottom : (PGBUF_BCB *) NULL);
bufptr_start = (lru_list->victim_hint == NULL) ? lru_list->bottom : lru_list->victim_hint;

The scan loop. Walk prev_BCB upward from bufptr_start, in zone 3, capped at MAX_DEPTH (1000). Per BCB:

  1. Excl. 1 — avoid-victim flag. pgbuf_bcb_avoid_victim tests PGBUF_BCB_INVALID_VICTIM_CANDIDATE_MASK = DIRTY | FLUSHING_TO_DISK | VICTIM_DIRECT | INVALIDATE_DIRECT_VICTIM (the four exclusions: dirty, mid-flush, already-assigned, invalidation-pending). Any bit → continue.
  2. Excl. 2 — fixed/has waiters. pgbuf_is_bcb_fixed_by_any (bufptr, false): if fcnt > 0, next_wait_thrd != NULL, or latch held, it is valid-but-busy. Record as bufptr_victimizable (first becomes the hint via CAS), count it, continue; break when found_victim_cnt reaches lru_victim_cnt.
  3. Claim — PGBUF_BCB_TRYLOCK (conditional: we hold the list mutex and must not block on the BCB mutex — lock-order rule, Ch. 5):
    • Trylock ok + pgbuf_is_bcb_victimizable(bufptr, true): the win. Advance the hint to bufptr->prev_BCB, call pgbuf_remove_from_lru_list, then panic-assign via pgbuf_panic_assign_direct_victims_from_lru iff waiter_threads_low_priority->size() >= 5 + num_total_threads/20 (the low-priority backlog drain), wake the flush daemon if the new bottom is dirty, unlock the list mutex, push the VPID to Aout (Ch. 7), and return the BCB still locked.
    • Trylock ok but not victimizable (flag flipped under us): PGBUF_BCB_UNLOCK, next iteration.
    • Trylock fails: the BCB mutex is held elsewhere, only possible with a flush daemon — asserts pgbuf_is_page_flush_daemon_available(). Record + hint, count it, honor the early-out.

TO_VACUUM note: PGBUF_BCB_TO_VACUUM_FLAG is not in the mask, so a to-vacuum BCB is still victimizable here; its forcing to the LRU bottom happens at unfix/direct-assign time, not in this scan.

Failure tail. No claim, and a stale hint with no candidate found (bufptr_victimizable == NULL && victim_hint != NULL) → reset hint to bottom (if candidates remain) or NULL via CAS, unlock, wake flush daemon, return NULL.

Invariant — victim_hint marks the lowest point worth scanning; count_vict_cand counts clean zone-3 BCBs. The scan walks only upward from the hint and trusts count_vict_cand (kept by the LRU bookkeeping helpers as BCBs enter/leave zone 3 clean) as the early-out ceiling. The hint stays honest via the CAS-advance on every claim/record and the resync above. The documented TPCC drift (hint sits before the first victimizable BCB) only wastes scan steps — each candidate is re-validated under its own BCB lock before being claimed, so the hint is a performance hint, never a safety property; drift is tolerated, not fixed.
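
A reduced model of the hinted scan shows how the mask and the busy check interact. The names and bit values below are hypothetical (only the composition of the mask follows the text), zone 3 is a vector ordered bottom-first so that index + 1 is the prev_BCB direction, and trylock failures and the stale-hint reset are not modeled.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical bit values; the real flags live in the packed BCB flags word.
constexpr std::uint32_t kDirty = 1u << 0;
constexpr std::uint32_t kFlushing = 1u << 1;
constexpr std::uint32_t kVictimDirect = 1u << 2;
constexpr std::uint32_t kInvalidateDirect = 1u << 3;
constexpr std::uint32_t kToVacuum = 1u << 4;  // deliberately not part of the mask
constexpr std::uint32_t kAvoidMask = kDirty | kFlushing | kVictimDirect | kInvalidateDirect;

struct ToyBcb {
  std::uint32_t flags;
  int fix_count;
};

// Returns the index of the claimed BCB or -1. On a claim the hint moves just
// above it; otherwise it moves to the first valid-but-busy BCB, if any.
inline int scan_from_hint(const std::vector<ToyBcb> &zone3, int &hint, int max_depth) {
  int first_busy = -1;
  const int n = static_cast<int>(zone3.size());
  for (int i = hint, depth = 0; i < n && depth < max_depth; ++i, ++depth) {
    if (zone3[i].flags & kAvoidMask) continue;  // exclusion 1: flagged
    if (zone3[i].fix_count > 0) {               // exclusion 2: fixed, record and move on
      if (first_busy < 0) first_busy = i;
      continue;
    }
    hint = i + 1;
    return i;
  }
  if (first_busy >= 0) hint = first_busy;
  return -1;
}
```

A to-vacuum BCB is claimed like any clean one, matching the TO_VACUUM note above.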

9.4 Quota gating in the private LFCQ helper

pgbuf_lfcq_get_victim_from_private_lru picks which private list to scan and whether to re-enqueue it:

// pgbuf_lfcq_get_victim_from_private_lru -- src/storage/page_buffer.c
if (pgbuf_Pool.big_private_lrus_with_victims->consume (lru_idx)) { /* big first */ }
else {
if (restricted) { return NULL; } /* <- restricted: big only */
if (!pgbuf_Pool.private_lrus_with_victims->consume (lru_idx)) { return NULL; } /* <- both empty */ }
lru_list = PGBUF_GET_LRU_LIST (lru_idx);
if (PGBUF_LRU_LIST_COUNT (lru_list) > PBGUF_BIG_PRIVATE_MIN_SIZE /* big: >100 ... */
&& PGBUF_LRU_LIST_COUNT (lru_list) > 2 * lru_list->quota && lru_list->count_vict_cand > 1) {
if (pgbuf_Pool.big_private_lrus_with_victims->produce (lru_idx)) added_back = true; } /* re-queue BIG before scan */
victim = pgbuf_get_victim_from_lru_list (thread_p, lru_idx);
if (added_back) return victim;
if (lru_list->count_vict_cand > 0 && PGBUF_LRU_LIST_IS_OVER_QUOTA (lru_list))
{ if (pgbuf_Pool.private_lrus_with_victims->produce (lru_idx)) return victim; }
lru_list->flags &= ~PGBUF_LRU_VICTIM_LFCQ_FLAG; /* not re-queued: clear so next candidate re-adds it */
return victim;

Invariant — a private list is victimizable only while over quota. “Big-private” = count > 100 and count > 2*quota and >1 candidate; re-queued before scanning so peers drain it in parallel. A non-big list is re-queued only while still over quota with candidates; otherwise its LFCQ flag is cleared and it leaves rotation until pgbuf_adjust_quotas (§9.8) or a new candidate re-adds it. A list at/below quota is never poached. The shared sibling pgbuf_lfcq_get_victim_from_shared_lru has no quota, so it simply re-enqueues while count_vict_cand > 0 and (single-threaded) retries the same list once.
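
The re-enqueue rule can be written as a three-way decision. This is a sketch with assumptions called out: `ToyLru`, `is_big_private`, and `requeue_decision` are hypothetical, "over quota" is taken to mean count > quota, and a failed produce on a full queue is ignored.

```cpp
#include <cassert>

constexpr int kBigPrivateMinSize = 100;  // the ">100" threshold quoted above

struct ToyLru {
  int count;
  int quota;
  int count_vict_cand;
};

inline bool is_big_private(const ToyLru &l) {
  return l.count > kBigPrivateMinSize && l.count > 2 * l.quota && l.count_vict_cand > 1;
}

// Assumption: over quota means strictly more BCBs than the quota allows.
inline bool is_over_quota(const ToyLru &l) { return l.count > l.quota; }

enum class Requeue { Big, Ordinary, Dropped };

// `before_scan` is the list state when its index is consumed; `after_scan` is
// the state once the victim search on it has finished.
inline Requeue requeue_decision(const ToyLru &before_scan, const ToyLru &after_scan) {
  if (is_big_private(before_scan)) return Requeue::Big;  // re-queued before scanning
  if (after_scan.count_vict_cand > 0 && is_over_quota(after_scan)) return Requeue::Ordinary;
  return Requeue::Dropped;  // LFCQ flag cleared, list leaves rotation
}
```

A list of 150 BCBs with quota 100 is over quota but not big (150 is not above 2 * 100), so it goes back on the ordinary queue while it still has candidates.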

9.5 pgbuf_assign_direct_victim — producer side

When a BCB becomes clean+free (end of flush, panic-assign in §9.3, or last-unfix), its owner may hand it to a waiter. The producer holds the BCB mutex; the only invalidating flag tolerated is FLUSHING_TO_DISK (flush itself calls this):

// pgbuf_assign_direct_victim -- src/storage/page_buffer.c
while (pgbuf_get_thread_waiting_for_direct_victim (waiter_thread)) { /* 75/25: low 1-in-4, else high */
thread_lock_entry (waiter_thread);
if (waiter_thread->resume_status != THREAD_ALLOC_BCB_SUSPENDED)
{ thread_unlock_entry (waiter_thread); continue; } /* <- waiter gone, try next */
thread_wakeup_already_had_mutex (waiter_thread, THREAD_ALLOC_BCB_RESUMED);
pgbuf_bcb_update_flags (thread_p, bcb, PGBUF_BCB_VICTIM_DIRECT_FLAG, PGBUF_BCB_FLUSHING_TO_DISK_FLAG);
pgbuf_Pool.direct_victims.bcb_victims[waiter_thread->index] = bcb; /* <- write mailbox before unlock */
thread_unlock_entry (waiter_thread);
return true; } /* <- assigned */
return false; /* <- no waiters */

pgbuf_get_thread_waiting_for_direct_victim holds the 75/25 weighting (low queue 1-in-4, else high), skipping dead queue entries. The while skips any waiter no longer THREAD_ALLOC_BCB_SUSPENDED; the BCB pointer is written before the entry lock releases, so the waiter never wakes to an empty slot. Empty queues → false, and the caller disposes of the BCB normally.
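
One way to realize a 1-in-4 weighting is a pick counter, sketched below. This is an assumption-laden illustration, not the CUBRID implementation: `WaiterQueues` is hypothetical, std::deque replaces the lock-free queues, and falling back to the other queue when the preferred one is empty is a guess about the intended behavior.

```cpp
#include <cassert>
#include <deque>

struct WaiterQueues {
  std::deque<int> high;  // thread ids waiting at high priority
  std::deque<int> low;   // thread ids waiting at low priority
  unsigned long picks = 0;

  // Returns true and sets `tid` when a waiter was taken from either queue.
  bool next_waiter(int &tid) {
    const bool low_turn = (++picks % 4 == 0);  // every fourth pick prefers low
    std::deque<int> *order[2] = {low_turn ? &low : &high, low_turn ? &high : &low};
    for (std::deque<int> *q : order) {
      if (!q->empty()) {
        tid = q->front();
        q->pop_front();
        return true;
      }
    }
    return false;  // both queues empty: nobody to hand the BCB to
  }
};
```

With four high-priority waiters and two low-priority ones, the fourth pick serves a low-priority thread ahead of the remaining high-priority one.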

Invariant — a handed-off victim is exclusively owned, so no other thread can claim it. The producer enters with the BCB mutex held (PGBUF_BCB_CHECK_OWN) and stamps PGBUF_BCB_VICTIM_DIRECT_FLAG while still holding it — and that flag is one of the four bits in PGBUF_BCB_INVALID_VICTIM_CANDIDATE_MASK, so from that instant pgbuf_bcb_avoid_victim returns true and the §9.3 scan, the §9.4 helpers, and pgbuf_assign_flushed_pages all skip the BCB. The only writer of bcb_victims[tid] is the producer; the only reader is thread tid via the ATOMIC_TAS_ADDR in §9.6 — one slot per thread, single producer, single consumer. Even pgbuf_invalidate_bcb defers (§9.7). Thus between assignment and collection the BCB is logically owned by exactly one waiter; a concurrent re-fixer cannot steal it, only mark INVALIDATE_DIRECT_VICTIM to release it back (§9.6).

9.6 pgbuf_get_direct_victim — consumer side and the invalidation retry

The slot read is a TAS that clears the slot atomically:

// pgbuf_get_direct_victim -- src/storage/page_buffer.c
PGBUF_BCB *bcb = (PGBUF_BCB *) ATOMIC_TAS_ADDR (&pgbuf_Pool.direct_victims.bcb_victims[thread_p->index], NULL);
PGBUF_BCB_LOCK (bcb);
if (pgbuf_bcb_is_invalid_direct_victim (bcb)) { /* <- re-fix race */
pgbuf_bcb_update_flags (thread_p, bcb, 0, PGBUF_BCB_INVALIDATE_DIRECT_VICTIM_FLAG); /* clear it */
PGBUF_BCB_UNLOCK (bcb);
return NULL; } /* <- caller re-sleeps */
pgbuf_bcb_update_flags (thread_p, bcb, 0, PGBUF_BCB_VICTIM_DIRECT_FLAG); /* clear VICTIM_DIRECT */
if (!pgbuf_is_bcb_victimizable (bcb, true)) { assert (false); PGBUF_BCB_UNLOCK (bcb); return NULL; }
switch (pgbuf_bcb_get_zone (bcb)) {
case PGBUF_VOID_ZONE: break; /* already detached (from flush) */
case PGBUF_INVALID_ZONE: assert (false); break; /* impossible */
default: /* still in an LRU: detach + Aout */
lru_idx = pgbuf_bcb_get_lru_index (bcb);
pgbuf_lru_remove_bcb (thread_p, bcb);
pgbuf_add_vpid_to_aout_list (thread_p, &bcb->vpid, lru_idx); break; }
return bcb; /* locked, in VOID zone */

The invalidation retry. Between assignment and collection a re-fixer may find the BCB on a hash hit (Ch. 3). It cannot steal a VICTIM_DIRECT BCB, so it sets INVALIDATE_DIRECT_VICTIM; the waiter observes it, clears the flag (releasing ownership), unlocks, and returns NULL — which the caller treats like a failed pgbuf_get_victim (re-enqueue and sleep). The BCB is left in place — the “puts it back and re-sleeps” path. Otherwise the zone switch detaches an in-LRU BCB (Aout-recorded) or no-ops a VOID one; post-condition: locked, VOID_ZONE, ready for reuse.

Figure 9-2 — direct hand-off.

stateDiagram-v2
  [*] --> Clean: bcb flushed or unfixed clean
  Clean --> Assigning: assign direct victim, hold bcb mutex
  Assigning --> NoWaiter: queues drained, return false
  Assigning --> Assigned: live waiter, set VICTIM_DIRECT, write mailbox, wake
  Assigned --> Collected: waiter TAS reads slot, VICTIM_DIRECT seen
  Collected --> Detached: clear VICTIM_DIRECT, remove from LRU, push Aout
  Assigned --> Invalidated: re-fixer sets INVALIDATE_DIRECT_VICTIM
  Invalidated --> ReSleep: waiter clears flag, returns NULL, re-enqueues
  NoWaiter --> [*]
  Detached --> [*]
  ReSleep --> [*]

9.7 pgbuf_invalidate_bcb and the already-assigned victim

pgbuf_invalidate_bcb tears a BCB out of the page table when its page is gone (dealloc, volume removal). It is in scope here for exactly one branch: an already-assigned direct victim is left alone — if pgbuf_bcb_is_direct_victim (bufptr) is true the function unlocks and returns NO_ERROR, since the waiting thread will victimize it momentarily and racing to invalidate it would corrupt the hand-off. (The remaining branches — the LATCH_INVALID early no-op, the clear-dirty plus zone removal, and the NO_LATCH hash-chain delete onto the invalid list versus the unexpected assert(false) tail — are the ordinary tear-down path and belong to the BCB-lifecycle chapters.)

pgbuf_adjust_quotas (full logic in Ch. 10) keeps everything above viable: it recomputes each private list’s quota and zone thresholds, and re-adds to the LFCQs any over-quota list with candidates that fell out of rotation. The quota/threshold values read by §9.2’s stage gates and §9.4’s re-enqueue test all originate here.

  1. pgbuf_get_victim is a four-stage priority scan: own private (over quota) → other private (big first, then ordinary unless restricted) → shared (loops only without a flush daemon) → own private ignoring quota as a deadlock-avoidance last resort. NULL means “sleep and wait for a direct victim.”
  2. The LFCQs hold list indices, not BCBs. A list is enqueued iff it has candidates and PGBUF_LRU_VICTIM_LFCQ_FLAG is set; consumers re-enqueue while over quota with candidates, else clear the flag so it re-enters lazily.
  3. Quota gating protects working sets. A private list is victimizable only while over quota; big-private (>100, >2*quota, >1 cand) lists are re-queued before scanning so they drain in parallel.
  4. pgbuf_get_victim_from_lru_list re-validates every candidate under its own BCB lock, applying the four exclusions in PGBUF_BCB_INVALID_VICTIM_CANDIDATE_MASK. TO_VACUUM is deliberately not an exclusion here.
  5. victim_hint is a performance hint, not a safety property. Its documented drift only wastes scan steps; correctness comes entirely from per-BCB re-validation under lock.
  6. Direct hand-off is a mailbox protocol. The producer picks a waiter (High/Low at 75/25), stamps VICTIM_DIRECT, writes bcb_victims[tid] under the waiter’s entry lock, and wakes it.
  7. The consumer handles the re-fix race: observing INVALIDATE_DIRECT_VICTIM it clears the flag, leaves the BCB in place, returns NULL so the caller re-sleeps; pgbuf_invalidate_bcb likewise leaves an already-assigned direct victim untouched.

Chapter 10: Adaptive Quotas Ordered Fix and Special Paths

Three families sit outside the single-page lifecycle of Chapters 3-9: the adaptive quota daemon (100 ms) re-sizing private LRU lists, ordered fix (multi-page deadlock avoidance over pgbuf_fix), and special fix paths that each bypass part of the normal path. For private/shared lists, victim zones, and LFCQ queues see the companion cubrid-page-buffer-manager.md and Chapter 9 — not re-derived here.

10.1 The two structs: pgbuf_page_quota and pgbuf_watcher

pgbuf_page_quota (one instance, pgbuf_Pool.quota) holds the global state pgbuf_adjust_quotas reads/writes. Per-list outputs (quota, threshold_lru1/2) live on each PGBUF_LRU_LIST (Chapter 1), not here.

| Field | Role | Why it exists |
| --- | --- | --- |
| num_private_LRU_list | private-list count; PGBUF_PAGE_QUOTA_IS_ENABLED is > 0 | master enable switch; 0 makes the subsystem inert |
| lru_victim_flush_priority_per_lru | per-list float, flush priority | flush daemon (Ch 8) biases flushing toward over-quota lists |
| private_lru_session_cnt | active sessions per private list | pgbuf_assign_private_lru picks the zero-session list first |
| private_pages_ratio | fraction of BCBs that should be private | smoothed target driving all_private_quota |
| add_shared_lru_idx | circular index for shared-list relocation | round-robins shared-LRU assignment on private-to-shared migration |
| avoid_shared_lru_idx | shared list to avoid when relocating | steers traffic off the fattest list so it drains via victimization |
| last_adjust_time | TSC_TICKS of last adjust | gates the 1 ms / 500 ms cadence checks |
| adjust_age | monotonic counter, bumped each adjust | generation stamp other code compares against |
| is_adjusting | re-entrancy guard | only one thread runs the adjust body at a time |

pgbuf_watcher is the caller-owned ordered-fix handle: stack-allocated, init’d with PGBUF_INIT_WATCHER(w, rank, hfid), passed to pgbuf_ordered_fix, then threaded onto the holder’s watcher list so the machinery can re-fix it.

| Field | Role | Why it exists |
| --- | --- | --- |
| pgptr | fixed page, or NULL | fix output; also the “is watcher live” test (PGBUF_IS_CLEAN_WATCHER) |
| next / prev | links in the holder’s watcher list | one holder (one fixed BCB) may carry several watchers |
| group_id | VPID of the group’s heap-header page | deadlock key: pages of one heap share a group |
| latch_mode (7 bits) | latch held by this watcher | re-fix restores the same mode; WRITE on any watcher promotes the page |
| page_was_unfixed (1 bit) | true if ordered fix had to unfix-and-refix this page | tells caller “in-page pointers moved; re-read” |
| initial_rank (4 bits) | rank set at init time | caller’s declared rank before any fix |
| curr_rank (4 bits) | effective rank after fix | promoted to PGBUF_ORDERED_HEAP_HDR if this page is its own group header |
| magic (debug) | 0x12345678 | catches an uninitialized/garbage watcher |
| watched_at / init_at (debug) | source location strings | leak / double-fix diagnostics |

Invariant (watcher rank monotonicity within a group): every watcher on the same physical page must carry the same curr_rank and group_id. pgbuf_ordered_fix_release enforces this while scanning a holder’s watcher list; a mismatch raises ER_PB_ORDERED_INCONSISTENCY (fatal). If violated, the VPID sort below is ill-defined and the deadlock guarantee collapses.

The rank ordering is the enum PGBUF_ORDERED_RANK (in page_buffer.h): PGBUF_ORDERED_HEAP_HDR = 0 (fixed first) < PGBUF_ORDERED_HEAP_NORMAL < PGBUF_ORDERED_HEAP_OVERFLOW (fixed last) < PGBUF_ORDERED_RANK_UNDEFINED (sentinel). A pgbuf_watcher hangs off the first_watcher..last_watcher chain of a PGBUF_HOLDER (the holder's bufptr is the fixed PGBUF_BCB), and its group_id tags the page with the heap-header VPID that defines its group.

10.2 pgbuf_adjust_quotas — recomputing private quotas every 100 ms

The Page Maintenance Daemon (100 ms cubthread::looper) calls pgbuf_page_maintenance_execute, which after a boot guard (BO_IS_FLUSH_DAEMON_AVAILABLE) calls pgbuf_adjust_quotas then pgbuf_direct_victims_maintenance (10.3). Cadence gates and exits:

  1. Disabled / already running. if (!PGBUF_PAGE_QUOTA_IS_ENABLED || quota->is_adjusting) return; else is_adjusting = 1.
  2. Too soon (< 1 ms). if diff_usec < 1000, clear the guard, return.
  3. Low activity and < 500 ms. if pg_unfix_cnt < PGBUF_TRAN_THRESHOLD_ACTIVITY (num_buffers/4) and < 500 ms elapsed, bail. Busy pool adjusts ~every 1 ms; idle pool waits 500 ms.
  4. Very low activity flag. pg_unfix_cnt.exchange(0) reads-and-resets; if the prior value < THRESHOLD/100, set low_overall_activity = true.

Then it stamps last_adjust_time and bumps adjust_age.
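The gate sequence above can be modeled as a small decision function. This is a sketch under stated assumptions: `toy_quota`, `toy_gate_result`, and `toy_adjust_gates` are hypothetical names, `activity_threshold` stands in for PGBUF_TRAN_THRESHOLD_ACTIVITY, and the phases that follow the gates are reduced to releasing the guard.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum {
    TOY_SKIP_BUSY_OR_DISABLED,  /* gate 1: returns before the guard is set */
    TOY_SKIP_TOO_SOON,          /* gate 2: less than 1 ms since last adjust */
    TOY_SKIP_LOW_ACTIVITY,      /* gate 3: quiet pool and less than 500 ms */
    TOY_ADJUST                  /* all gates passed */
} toy_gate_result;

typedef struct {
    bool enabled;       /* models PGBUF_PAGE_QUOTA_IS_ENABLED */
    int is_adjusting;   /* re-entrancy guard */
} toy_quota;

static toy_gate_result toy_adjust_gates(toy_quota *q, long long diff_usec,
                                        int unfix_cnt, int activity_threshold,
                                        bool *low_overall_activity)
{
    assert(q != NULL && low_overall_activity != NULL);
    *low_overall_activity = false;

    if (!q->enabled || q->is_adjusting)
        return TOY_SKIP_BUSY_OR_DISABLED;   /* guard untouched */
    q->is_adjusting = 1;

    if (diff_usec < 1000LL) {               /* gate 2 */
        q->is_adjusting = 0;
        return TOY_SKIP_TOO_SOON;
    }
    if (unfix_cnt < activity_threshold && diff_usec < 500000LL) {  /* gate 3 */
        q->is_adjusting = 0;
        return TOY_SKIP_LOW_ACTIVITY;
    }
    if (unfix_cnt < activity_threshold / 100)   /* gate 4 only sets a flag */
        *low_overall_activity = true;

    /* The real function would run its phases here; the model just
     * releases the guard at the tail. */
    q->is_adjusting = 0;
    return TOY_ADJUST;
}
```

Every path that sets the guard also clears it, which is the property the model is meant to make easy to check by inspection.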

Phase A — per-list hits. One pass over PGBUF_TOTAL_LRU_COUNT lists. lru_hits = ATOMIC_TAS_32(&monitor->lru_hits[i], 0) reads and resets the counter; the value is scaled to hits/sec and accumulated into lru_private_hits or lru_shared_hits, and total_victims += PGBUF_GET_LRU_LIST(i)->count_vict_cand. Each private list’s activity sample is history-smoothed: if diff_usec >= tensec_usec (10 s or more since the last adjust), monitor->lru_activity[i] = lru_hits (the old sample is stale, so it is replaced); otherwise it is the time-weighted blend ((tensec_usec - diff_usec) * old + diff_usec * lru_hits) / tensec_usec.
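The time-weighted blend is easiest to see with numbers. The sketch below restates the formula from the paragraph above as a standalone function; `toy_smooth_activity` and `TOY_TENSEC_USEC` are hypothetical names, and 64-bit integer arithmetic is an assumption made here to keep the example exact.

```c
#include <assert.h>

#define TOY_TENSEC_USEC 10000000LL  /* 10 s expressed in microseconds */

/* Blend the previous activity sample with the new hit rate, weighting the
 * new sample by how much of the 10 s window has elapsed. */
static long long toy_smooth_activity(long long old_activity,
                                     long long lru_hits,
                                     long long diff_usec)
{
    assert(diff_usec >= 0);
    if (diff_usec >= TOY_TENSEC_USEC)
        return lru_hits;    /* old sample is stale: replace it outright */
    return ((TOY_TENSEC_USEC - diff_usec) * old_activity
            + diff_usec * lru_hits) / TOY_TENSEC_USEC;
}
```

With an old sample of 1000, a new rate of 2000, and 2.5 s elapsed, the new sample carries a quarter of the weight and the result is 1250.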

Phase B — private ratio. If low_overall_activity, force private_ratio = MIN_PRIVATE_RATIO (starve privates); else lru_private_hits / (private + shared) clamped to [0.01, 0.998] (shared floored to 1), then 10 s-smoothed into quota->private_pages_ratio.

Phase C — redistribute (two mutually exclusive branches):

  • No private activity (sum_private_lru_activity_total == 0): all_private_quota = 0; every private list gets quota = threshold_lru1 = threshold_lru2 = 0, pgbuf_lru_adjust_zones under the list mutex if it still holds pages, and a push onto the victim LFCQ (pgbuf_lfcq_add_lru_with_victims) if it has over-quota candidates.
  • Some private activity (else): the budget is all_private_quota = (int)((num_buffers - invalid_cnt) * quota->private_pages_ratio), split per list proportional to activity:
// pgbuf_adjust_quotas (phase C, active) -- src/storage/page_buffer.c
new_quota = (int) ((float) lru_activity[i] / sum_private_lru_activity_total * all_private_quota);
new_quota = MIN (new_quota, PGBUF_PRIVATE_LRU_MAX_HARD_QUOTA); /* absolute cap */
new_quota = MIN (new_quota, num_buffers / 2); /* half-pool cap */
lru_list->threshold_lru1 = lru_list->threshold_lru2 = (int) (new_quota * PGBUF_LRU_ZONE_MIN_RATIO);

The two caps stop a single list monopolizing the pool; threshold_lru1/2 are the zone sizes the Chapter 7 unfix path reads.
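A worked instance of the proportional split and its two caps, as a standalone sketch. `toy_private_quota` and `TOY_MIN` are hypothetical names, and `hard_cap` is passed in as a parameter because the value of PGBUF_PRIVATE_LRU_MAX_HARD_QUOTA is not stated here.

```c
#include <assert.h>

#define TOY_MIN(a, b) ((a) < (b) ? (a) : (b))

/* Share of the private budget for one list, proportional to its activity,
 * then clamped by an absolute cap and by half of the pool. */
static int toy_private_quota(int activity, int activity_total,
                             int all_private_quota, int hard_cap,
                             int num_buffers)
{
    int quota;

    assert(activity >= 0 && activity_total > 0 && activity <= activity_total);
    quota = (int) ((float) activity / activity_total * all_private_quota);
    quota = TOY_MIN(quota, hard_cap);          /* absolute cap */
    quota = TOY_MIN(quota, num_buffers / 2);   /* half-pool cap */
    return quota;
}
```

A list with a quarter of the activity and a budget of 8000 gets 2000. A list with all the activity is cut to the hard cap, or to half the pool when that is smaller.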

Phase D — shared lists. Leftover budget spreads evenly: avg_shared_lru_size = (num_buffers - all_private_quota) / num_LRU_list, floored at PGBUF_MIN_SHARED_LIST_ADJUST_SIZE; threshold_lru1/2 from the configured ratio_lru1/2. Each over-threshold shared list is re-zoned and, if it has candidates, queued for victims.

Phase E — victim_rich and release. `monitor.victim_rich = total_victims >= (int)(0.1 * num_buffers); quota->is_adjusting = 0;`. victim_rich is Chapter 9’s cheap “push hard on victimization?” hint: true once victim candidates reach 10% of the pool.

Invariant (single-writer adjust): is_adjusting is set once gate 1 passes and cleared on every later exit, namely the two early returns at gates 2 and 3 and the tail. Gate 1 itself returns before the flag is set, so it has nothing to clear. The flag is not a mutex; it is a best-effort guard for a single-threaded daemon. An early return that forgets to clear it freezes the subsystem forever, because every later call exits at gate 1.

flowchart TD
  B{"enabled and\nnot adjusting?"} -->|no| Z["return"]
  B -->|yes| C["is_adjusting=1"]
  C --> D{"diff<1ms?"}
  D -->|yes| Y["is_adjusting=0;\nreturn"]
  D -->|no| E{"low activity\nand diff<500ms?"}
  E -->|yes| Y
  E -->|no| K{"sum activity==0?"}
  K -->|yes| J["all private quota=0"]
  K -->|no| S["split by activity,\ncap abs and pool/2"]
  J --> L["shared thresholds;\nvictim_rich; is_adjusting=0"]
  S --> L

Figure 10-1. pgbuf_adjust_quotas: cadence gates (1-4) and the two Phase C redistribution branches.

10.3 pgbuf_direct_victims_maintenance — the backup victim hand-off

The fast path assigns victims as a side effect of unfix/flush; on an idle system that never fires, so a blocked thread could starve. This backup walks lists round-robin and hands victims out directly, once over private lists and once over shared:

// pgbuf_direct_victims_maintenance -- src/storage/page_buffer.c
static int prv_index = 0; /* round-robin cursors, single-threaded use only */
static int shr_index = 0;
for (index = prv_index, restarted = false;
pgbuf_is_any_thread_waiting_for_direct_victim () && nassigns > 0
&& (index != prv_index || !restarted); /* stop only once the sweep wraps back to its start */
(index == PGBUF_PRIVATE_LRU_COUNT - 1) ? (index = 0, restarted = true) : index++)
pgbuf_lfcq_assign_direct_victims (thread_p, PGBUF_LRU_INDEX_FROM_PRIVATE (index), &nassigns);
prv_index = index; /* persist cursor for next tick */
// ... a second, structurally identical loop over shared lists, then shr_index = index ...

The cursor starts at prv_index, so index == prv_index already holds on the first iteration. The test (index != prv_index || !restarted) lets that first iteration run and turns index == prv_index into the terminator only after the iterator has wrapped (restarted is true). Each loop stops when (a) no thread waits, (b) the per-iteration budget DEFAULT_ASSIGNS_PER_ITERATION (5) is spent, or (c) the sweep has wrapped and come back to its starting list. The prv_index = index / shr_index = index write-backs are what make the static cursors persist across ticks so each tick sweeps different lists, hence single-threaded use only. pgbuf_lfcq_assign_direct_victims retries from lru_list->bottom if the cached victim_hint is stale (a CAS resets it), self-healing the hint.
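The round-robin sweep with a persistent cursor can be modeled on its own. This is a toy under stated assumptions: `toy_cursor`, `toy_visits`, and `toy_sweep` are hypothetical, the "any thread waiting" test is omitted, and each visit hands out a fixed `victims_per_list` rather than whatever the real helper finds.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_LIST_COUNT 4

static int toy_cursor = 0;              /* persists across ticks, like prv_index */
static int toy_visits[TOY_LIST_COUNT];  /* how many times each list was visited */

/* Visit lists round-robin starting at the saved cursor. Stops when the
 * budget is spent or when the sweep has wrapped back to where it began.
 * Returns the number of victims handed out. */
static int toy_sweep(int budget, int victims_per_list)
{
    int start = toy_cursor;
    int remaining = budget;
    int index;
    bool restarted = false;

    assert(budget >= 0 && victims_per_list >= 0);
    for (index = start;
         remaining > 0 && (index != start || !restarted);
         (index == TOY_LIST_COUNT - 1) ? (index = 0, restarted = true) : index++)
    {
        toy_visits[index]++;
        remaining -= (victims_per_list < remaining) ? victims_per_list : remaining;
    }
    toy_cursor = index;   /* next tick resumes from here */
    return budget - remaining;
}
```

A sweep that finds no victims visits every list exactly once and leaves the cursor where it started. A sweep that exhausts its budget leaves the cursor on the next unvisited list, so the following tick starts with different lists.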

10.4 pgbuf_ordered_fix_release — multi-page deadlock avoidance

Heap ops hold several pages at once; fixing them in different orders deadlocks. The heap header must stay fixed first, so plain VPID ordering is insufficient. Ordered fix keeps a rank (header < normal < overflow), sorts by VPID within rank, and — if a new request violates that order against held pages — unfixes the offenders, sorts, re-fixes in order. Entry contract: req_watcher->pgptr must be NULL else ER_FAILED_ASSERTION; curr_rank becomes PGBUF_ORDERED_HEAP_HDR if the requested VPID is the group header, else initial_rank.

Branch 1 — conditional first attempt. If the thread holds no other page (holder == NULL, or holder->thrd_link == NULL && VPID_EQ(req_vpid, &holder->bufptr->vpid) — only this one), use PGBUF_UNCONDITIONAL_LATCH; else PGBUF_CONDITIONAL_LATCH so a would-be deadlock fails fast. Then ret_pgptr = pgbuf_fix_release(...).

  • Got it: find the holder, resolve group id (existing watcher, or fix the heap header via pgbuf_get_groupid_and_unfix if PAGE_HEAP), attach via pgbuf_add_watch_instance_internal, goto exit. Common no-reorder case.
  • Did not get it, branch on error: ER_PB_BAD_PAGEID/ER_INTERRUPTED → exit; OLD_PAGE_MAYBE_DEALLOCATED+ER_PB_BAD_PAGEID → treat deallocated, exit; LK_ZERO_WAIT/LK_FORCE_ZERO_WAIT → ER_LK_PAGE_TIMEOUT (no error set for force, scans continue), exit; UNCONDITIONAL → already blocked and failed, exit with the error; else (conditional failed) → fall through to reorder, clearing er_status.

Branch 2 — classify held pages. Walk holders, skipping watch_count <= 0 (no watcher; assumed deadlock-safe). Gather each watched holder’s watchers into ordered_holders_info[], verifying the 10.1 invariant. diff = pgbuf_compare_hold_vpid_for_sort(req, held): diff < 0 (held sorts after req) → save for unfix; diff == 0 (same page) → ER_FAILED_ASSERTION; diff > 0 (held sorts before) → leave fixed. If the request has no group yet (req_page_has_group == false), diff is forced -1 so all watched pages are unfixed and re-fixed.

Branch 3 — unfix the out-of-order pages. For each saved entry, pgbuf_bcb_register_avoid_deallocation(holder->bufptr) pins it across the gap, pgbuf_unfix runs fix_count times, then each watcher is PGBUF_CLEAR_WATCHER’d and gets pg_watcher->page_was_unfixed = true.

Branch 4 — resolve missing group, then sort. If req had no group, re-fix it unconditionally, derive its group id (clear dealloc-prevent if OLD_PAGE_PREVENT_DEALLOC downgraded to OLD_PAGE), append the requested page, qsort by pgbuf_compare_hold_vpid_for_sort (rank, volid, pageid).

Branch 5 — re-fix in sorted order. Requested page uses caller’s request_mode/fetch_mode; restored pages use saved latch_mode+OLD_PAGE. All PGBUF_UNCONDITIONAL_LATCH now — order guaranteed, blocking deadlock-free. Failures: ER_INTERRUPTED exits; ER_PB_BAD_PAGEID on the requested page under OLD_PAGE_MAYBE_DEALLOCATED is tolerated; failure restoring a held page → ER_PB_ORDERED_REFIX_FAILED (serious — watchers partially live).

Invariant (caller must honor page_was_unfixed): any unfixed-and-refixed watcher has page_was_unfixed == true. Pointers cached into that page may now be invalid — another thread could have modified it during the gap. Reusing a stale pointer reads corrupt data — the single most important contract of ordered fix.
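The (rank, volid, pageid) ordering named in Branch 4 can be sketched as a plain qsort comparator. This is a simplified model, not pgbuf_compare_hold_vpid_for_sort itself: `toy_hold`, `toy_rank`, `toy_compare_hold`, and `toy_sort_holds` are hypothetical, and the real comparator may consult further keys (for example the group id) that are left out here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Mirrors the rank order described in 10.1: header first, overflow last. */
typedef enum {
    TOY_RANK_HEAP_HDR = 0,
    TOY_RANK_HEAP_NORMAL,
    TOY_RANK_HEAP_OVERFLOW
} toy_rank;

typedef struct {
    toy_rank rank;
    short volid;
    int pageid;
} toy_hold;

/* Order by rank, then volume id, then page id. Explicit comparisons avoid
 * the overflow risk of returning a subtraction. */
static int toy_compare_hold(const void *p1, const void *p2)
{
    const toy_hold *a = p1;
    const toy_hold *b = p2;

    if (a->rank != b->rank)
        return (a->rank < b->rank) ? -1 : 1;
    if (a->volid != b->volid)
        return (a->volid < b->volid) ? -1 : 1;
    if (a->pageid != b->pageid)
        return (a->pageid < b->pageid) ? -1 : 1;
    return 0;
}

/* Sort the pages that must be re-fixed into a deadlock-free fix order. */
static void toy_sort_holds(toy_hold *holds, size_t n)
{
    assert(holds != NULL);
    qsort(holds, n, sizeof holds[0], toy_compare_hold);
}
```

Because every thread sorts with the same total order before blocking, two threads can never wait on each other's pages in opposite orders.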

10.5 pgbuf_ordered_unfix — watcher-aware unfix

pgbuf_ordered_unfix is the counterpart. If watcher_object->pgptr == NULL it trips assert_release(false) and returns. Otherwise pgbuf_get_holder finds the holder, a for (watcher = holder->last_watcher; ...; watcher = watcher->prev) scan finds the exact watcher, and after the invariant check assert(holder->fix_count >= holder->watch_count) it calls pgbuf_remove_watcher followed by a single pgbuf_unfix.

Invariant (fix_count >= watch_count): a holder may be fixed more than watched (a plain fix adds no watcher) but never the reverse. Asserted here and in 10.4’s classification loop. Violation means a watcher outlived its fix — a use-after-unfix.

10.6 pgbuf_promote_read_latch_release — R-to-W promotion

Converts a held READ latch to WRITE without fully unfixing, per a PGBUF_PROMOTE_CONDITION; a CAS loop on the packed atomic_latch (Chapter 5):

  • Sole reader (holder->fix_count == impl.fcnt): unless the next waiter is a promoter (then fail ER_PAGE_LATCH_PROMOTE_FAIL), flip impl_new.latch_mode = PGBUF_LATCH_WRITE in place. No blocking.
  • Other readers present (fix_count != fcnt): PGBUF_PROMOTE_ONLY_READER or next waiter is a promoter → fail ER_PAGE_LATCH_PROMOTE_FAIL (CASE #1/#2). PGBUF_PROMOTE_SHARED_READER → subtract our fixes from fcnt, mark waiter_exists, set need_block, leave the loop.

If need_block, it effectively unfixes (holder->fix_count = 0, remove holder), sets thread_p->wait_for_latch_promote = true, blocks via pgbuf_block_bcb for WRITE, and on wake re-allocates a holder with the saved fix_count.

Invariant (promoter mutual exclusion): at most one waiter on a BCB carries wait_for_latch_promote; both branches that detect a promoter waiter abort rather than queue behind it. Two blocked promoters would each wait for the other to drop its read latch — a deadlock. The abort returns ER_PAGE_LATCH_PROMOTE_FAIL; callers retry by unfix + fix WRITE.
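The promotion outcomes described above reduce to a three-way decision. The sketch below captures only that decision, not the CAS loop or the blocking; `toy_promote_decide` and the `TOY_*` enumerators are hypothetical names modeled on the conditions in the text.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum {
    TOY_PROMOTE_ONLY_READER,    /* caller accepts promotion only as sole reader */
    TOY_PROMOTE_SHARED_READER   /* caller is willing to block behind other readers */
} toy_promote_condition;

typedef enum {
    TOY_PROMOTE_IN_PLACE,   /* flip READ to WRITE without blocking */
    TOY_PROMOTE_BLOCK,      /* drop our fixes and wait for WRITE */
    TOY_PROMOTE_FAIL        /* models ER_PAGE_LATCH_PROMOTE_FAIL */
} toy_promote_outcome;

/* holder_fix_count: fixes held by the promoting thread.
 * latch_fcnt: total fix count recorded on the latch. */
static toy_promote_outcome toy_promote_decide(int holder_fix_count,
                                              int latch_fcnt,
                                              bool next_waiter_is_promoter,
                                              toy_promote_condition cond)
{
    assert(holder_fix_count > 0 && latch_fcnt >= holder_fix_count);

    if (next_waiter_is_promoter)
        return TOY_PROMOTE_FAIL;        /* never queue behind another promoter */
    if (holder_fix_count == latch_fcnt)
        return TOY_PROMOTE_IN_PLACE;    /* sole reader */
    return (cond == TOY_PROMOTE_SHARED_READER) ? TOY_PROMOTE_BLOCK
                                               : TOY_PROMOTE_FAIL;
}
```

Checking the promoter waiter first makes the mutual-exclusion invariant visible: no combination of the other inputs can lead to a second blocked promoter.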

  • pgbuf_simple_fix / pgbuf_simple_unfix — temp files only. Latchless, LRU-mutexless; only add_fcnt(&bufptr->atomic_latch, 1), never latches (“Cannot be mixed with general FIX(LATCH)”). Resident → if a direct-victim claim is pending, invalidate it when need_fix, else NULL. Absent → if need_fix: lock hash, pgbuf_claim_bcb_for_fix, insert, add to private/shared LRU; else NULL. Unfix = lock, add_fcnt(..., -1), unlock.
  • pgbuf_fix_if_not_deallocated — vacuum’s dealloc-aware fix. disk_is_page_sector_reserved first: DISK_INVALID → NO_ERROR, *page = NULL (deallocated, not error); DISK_ERROR → propagate; DISK_VALID → real fix with OLD_PAGE_MAYBE_DEALLOCATED, then a NULL + ER_PB_BAD_PAGEID is swallowed (raced) unless mid recovery-redo.
  • pgbuf_invalidate — drop a page (caller holds WRITE). fcnt > 1 → just unfix (pgbuf_unlatch_thrd_holder + pgbuf_unlatch_bcb_upon_unfix), no invalidation. fcnt == 1 → flush if dirty (pgbuf_bcb_safe_flush_force_lock), record VPID, unfix, re-lock, re-check; if the BCB was reused, re-fixed, or avoid-victim → skip, else pgbuf_invalidate_bcb detaches it. Persistent pages run this as a post-commit postpone; temp pages unconditionally. pgbuf_invalidate_all iterates a volume.
  • pgbuf_notify_vacuum_follows / pgbuf_bcb_is_to_vacuum — vacuum hint. Sets PGBUF_BCB_TO_VACUUM_FLAG via pgbuf_bcb_update_flags(thread_p, bcb, PGBUF_BCB_TO_VACUUM_FLAG, 0) (“vacuum will revisit, prefer not to victimize”). pgbuf_bcb_is_to_vacuum tests it; the victim-flush path clears it on commit-to-flush, so the hint is one-shot.
  • TDE hook (out-of-scope). pgbuf_set_tde_algorithm early-returns if the algorithm is unchanged, else clears the existing bits (pflag &= ~FILEIO_PAGE_FLAG_ENCRYPTED_MASK) and ORs in the new encryption bit in iopage->prv.pflag (FILEIO_PAGE_FLAG_ENCRYPTED_AES/_ARIA), logs undoredo unless skip_logging, marks dirty. The page buffer only carries these bits through dirty/flush (Ch 6); en/decryption is the TDE module’s. Noted only so a maintainer knows pflag has a TDE tenant.
  1. The 100 ms Page Maintenance Daemon runs pgbuf_adjust_quotas then pgbuf_direct_victims_maintenance (idle victim backup); the former is gated by a 1 ms / 500 ms cadence and num_buffers/4 activity, guarded by the single-writer is_adjusting flag that must clear on every exit, and sets per-list quota/threshold_lru1/2 (capped at PGBUF_PRIVATE_LRU_MAX_HARD_QUOTA and half the pool) plus victim_rich (>10% of pool).
  2. Ordered fix ranks pages (heap-hdr < normal < overflow) then sorts by VPID; held pages sorting after the request are unfixed, the set qsorted, all re-fixed unconditionally. Its load-bearing output page_was_unfixed means the caller must re-read that page (cached pointers may be stale).
  3. Watcher invariants (same rank/group per page; fix_count >= watch_count per holder) are fatal-enforced. pgbuf_promote_read_latch_release flips R-to-W in place as sole reader, blocks under PGBUF_PROMOTE_SHARED_READER, aborts with ER_PAGE_LATCH_PROMOTE_FAIL if another promoter waits.
  4. Each special path bypasses one step: pgbuf_simple_fix (latchless temp), pgbuf_fix_if_not_deallocated (deallocated = NULL non-error), pgbuf_invalidate (detach a singly-fixed page), pgbuf_notify_vacuum_follows (one-shot anti-victim hint); the TDE pflag tenant is the out-of-scope boundary.

The following are line numbers as observed on 2026-06-17; symbols are the canonical anchor and line numbers are hints that decay.

| Symbol | File | Line |
| --- | --- | --- |
| dwb_set_data_on_next_slot | src/storage/double_write_buffer.cpp | 2686 |
| dwb_add_page | src/storage/double_write_buffer.cpp | 2726 |
| dwb_is_created | src/storage/double_write_buffer.cpp | 2909 |
| fileio_page_reserved | src/storage/file_io.h | 166 |
| fileio_page_watermark | src/storage/file_io.h | 179 |
| fileio_page | src/storage/file_io.h | 186 |
| PGBUF_DEFAULT_FIX_COUNT | src/storage/page_buffer.c | 90 |
| PGBUF_NUM_ALLOC_HOLDER | src/storage/page_buffer.c | 94 |
| PGBUF_FIX_COUNT_THRESHOLD | src/storage/page_buffer.c | 106 |
| pgbuf_latch_timeout | src/storage/page_buffer.c | 107 |
| PGBUF_IOPAGE_BUFFER_SIZE | src/storage/page_buffer.c | 118 |
| PGBUF_FIND_BCB_PTR | src/storage/page_buffer.c | 135 |
| PGBUF_LRU_NBITS | src/storage/page_buffer.c | 148 |
| PGBUF_LRU_INDEX_MASK | src/storage/page_buffer.c | 150 |
| PGBUF_LRU_INDEX_MASK | src/storage/page_buffer.c | 182 |
| PGBUF_LRU_1_ZONE | src/storage/page_buffer.c | 197 |
| PGBUF_LRU_ZONE_MASK | src/storage/page_buffer.c | 201 |
| PGBUF_INVALID_ZONE | src/storage/page_buffer.c | 205 |
| PGBUF_VOID_ZONE | src/storage/page_buffer.c | 206 |
| PGBUF_ZONE_MASK | src/storage/page_buffer.c | 211 |
| PGBUF_GET_ZONE | src/storage/page_buffer.c | 215 |
| PGBUF_GET_LRU_INDEX | src/storage/page_buffer.c | 216 |
| PGBUF_BCB_DIRTY_FLAG | src/storage/page_buffer.c | 224 |
| PGBUF_BCB_FLUSHING_TO_DISK_FLAG | src/storage/page_buffer.c | 227 |
| PGBUF_BCB_VICTIM_DIRECT_FLAG | src/storage/page_buffer.c | 234 |
| PGBUF_BCB_INVALIDATE_DIRECT_VICTIM_FLAG | src/storage/page_buffer.c | 235 |
| PGBUF_BCB_MOVE_TO_LRU_BOTTOM_FLAG | src/storage/page_buffer.c | 237 |
| PGBUF_BCB_TO_VACUUM_FLAG | src/storage/page_buffer.c | 239 |
| PGBUF_BCB_ASYNC_FLUSH_REQ | src/storage/page_buffer.c | 241 |
| PGBUF_BCB_FLAGS_MASK | src/storage/page_buffer.c | 244 |
| PGBUF_BCB_INVALID_VICTIM_CANDIDATE_MASK | src/storage/page_buffer.c | 258 |
| PGBUF_BCB_INIT_FLAGS | src/storage/page_buffer.c | 265 |
| PGBUF_BCB_COUNT_FIX_SHIFT_BITS | src/storage/page_buffer.c | 268 |
| PGBUF_BCB_AVOID_DEALLOC_MASK | src/storage/page_buffer.c | 269 |
| PGBUF_TRAN_THRESHOLD_ACTIVITY | src/storage/page_buffer.c | 276 |
| PGBUF_AOUT_NOT_FOUND | src/storage/page_buffer.c | 279 |
| PGBUF_SHOULD_IGNORE_UNFIX | src/storage/page_buffer.c | 290 |
| HASH_SIZE_BITS | src/storage/page_buffer.c | 295 |
| PGBUF_HASH_SIZE | src/storage/page_buffer.c | 296 |
| PGBUF_HASH_VALUE | src/storage/page_buffer.c | 300 |
| PGBUF_FLUSH_VICTIM_BOOST_MULT | src/storage/page_buffer.c | 305 |
| PGBUF_NEIGHBOR_FLUSH_NONDIRTY | src/storage/page_buffer.c | 307 |
| PGBUF_MAX_NEIGHBOR_PAGES | src/storage/page_buffer.c | 310 |
| PGBUF_NEIGHBOR_POS | src/storage/page_buffer.c | 314 |
| PGBUF_CHKPT_MAX_FLUSH_RATE | src/storage/page_buffer.c | 322 |
| PGBUF_CHKPT_MIN_FLUSH_RATE | src/storage/page_buffer.c | 323 |
| PGBUF_CHKPT_BURST_PAGES | src/storage/page_buffer.c | 326 |
| PGBUF_LRU_ZONE_MIN_RATIO | src/storage/page_buffer.c | 342 |
| PGBUF_LOCK_HOLDER | src/storage/page_buffer.c | 348 |
| pgbuf_holder_stat | src/storage/page_buffer.c | 441 |
| pgbuf_batch_flush_helper | src/storage/page_buffer.c | 451 |
| pgbuf_holder | src/storage/page_buffer.c | 461 |
| pgbuf_holder_anchor | src/storage/page_buffer.c | 479 |
| pgbuf_holder_set | src/storage/page_buffer.c | 488 |
| pgbuf_atomic_latch_impl | src/storage/page_buffer.c | 494 |
| pgbuf_bcb | src/storage/page_buffer.c | 506 |
| atomic_latch | src/storage/page_buffer.c | 513 |
| flags | src/storage/page_buffer.c | 514 |
| next_wait_thrd | src/storage/page_buffer.c | 516 |
| count_fix_and_avoid_dealloc | src/storage/page_buffer.c | 528 |
| oldest_unflush_lsa | src/storage/page_buffer.c | 536 |
| pgbuf_iopage_buffer | src/storage/page_buffer.c | 541 |
| struct pgbuf_iopage_buffer | src/storage/page_buffer.c | 541 |
| pgbuf_buffer_lock | src/storage/page_buffer.c | 557 |
| struct pgbuf_buffer_lock | src/storage/page_buffer.c | 557 |
| pgbuf_buffer_hash | src/storage/page_buffer.c | 570 |
| pgbuf_lru_list | src/storage/page_buffer.c | 580 |
| victim_hint | src/storage/page_buffer.c | 589 |
| count_vict_cand | src/storage/page_buffer.c | 602 |
| pgbuf_invalid_list | src/storage/page_buffer.c | 621 |
| struct pgbuf_invalid_list | src/storage/page_buffer.c | 621 |
| pgbuf_aout_buf | src/storage/page_buffer.c | 636 |
| struct pgbuf_aout_list | src/storage/page_buffer.c | 645 |
| pgbuf_aout_list | src/storage/page_buffer.c | 645 |
| pgbuf_seq_flusher | src/storage/page_buffer.c | 669 |
| struct pgbuf_page_monitor | src/storage/page_buffer.c | 688 |
| struct pgbuf_page_quota | src/storage/page_buffer.c | 710 |
| pgbuf_page_quota | src/storage/page_buffer.c | 710 |
| struct pgbuf_direct_victim | src/storage/page_buffer.c | 737 |
| pgbuf_buffer_pool | src/storage/page_buffer.c | 749 |
| struct pgbuf_victim_candidate_list | src/storage/page_buffer.c | 833 |
| pgbuf_Flush_helper | src/storage/page_buffer.c | 840 |
| AOUT_HASH_IDX | src/storage/page_buffer.c | 854 |
| PGBUF_BCB_LOCK | src/storage/page_buffer.c | 869 |
| PGBUF_BCB_TRYLOCK | src/storage/page_buffer.c | 871 |
| PGBUF_IS_BCB_IN_LRU_VICTIM_ZONE | src/storage/page_buffer.c | 919 |
| PGBUF_IS_BCB_IN_LRU | src/storage/page_buffer.c | 920 |
| PGBUF_IS_BCB_OLD_ENOUGH | src/storage/page_buffer.c | 927 |
| PGBUF_PRIVATE_LRU_MAX_HARD_QUOTA | src/storage/page_buffer.c | 943 |
| PGBUF_MIN_PAGES_IN_SHARED_LIST | src/storage/page_buffer.c | 946 |
| PGBUF_TOTAL_LRU_COUNT | src/storage/page_buffer.c | 969 |
| PGBUF_IS_PRIVATE_LRU_INDEX | src/storage/page_buffer.c | 975 |
| PGBUF_LRU_LIST_IS_OVER_QUOTA | src/storage/page_buffer.c | 977 |
| PGBUF_LRU_LIST_IS_OVER_QUOTA_WITH_BUFFER | src/storage/page_buffer.c | 987 |
| set_latch | src/storage/page_buffer.c | 1310 |
| add_fcnt | src/storage/page_buffer.c | 1324 |
| set_waiter_exists | src/storage/page_buffer.c | 1368 |
| get_latch | src/storage/page_buffer.c | 1398 |
| get_impl | src/storage/page_buffer.c | 1406 |
| pgbuf_thread_variables_init | src/storage/page_buffer.c | 1415 |
| pgbuf_hash_func_mirror | src/storage/page_buffer.c | 1441 |
| pgbuf_hash_vpid | src/storage/page_buffer.c | 1480 |
| pgbuf_compare_vpid | src/storage/page_buffer.c | 1494 |
| pgbuf_initialize | src/storage/page_buffer.c | 1518 |
| pgbuf_finalize | src/storage/page_buffer.c | 1796 |
| pgbuf_fix_with_retry | src/storage/page_buffer.c | 1993 |
| pgbuf_fix_release | src/storage/page_buffer.c | 2041 |
| pgbuf_simple_fix | src/storage/page_buffer.c | 2475 |
| pgbuf_simple_unfix | src/storage/page_buffer.c | 2569 |
| pgbuf_promote_read_latch_debug | src/storage/page_buffer.c | 2624 |
| pgbuf_promote_read_latch_release | src/storage/page_buffer.c | 2628 |
| pgbuf_unfix | src/storage/page_buffer.c | 2850 |
| pgbuf_invalidate | src/storage/page_buffer.c | 3158 |
| pgbuf_flush | src/storage/page_buffer.c | 3341 |
| pgbuf_flush_with_wal | src/storage/page_buffer.c | 3364 |
| pgbuf_flush_if_requested | src/storage/page_buffer.c | 3404 |
| pgbuf_flush_all_helper | src/storage/page_buffer.c | 3438 |
| pgbuf_get_victim_candidates_from_lru | src/storage/page_buffer.c | 3564 |
| pgbuf_flush_victim_candidates | src/storage/page_buffer.c | 3645 |
| pgbuf_flush_checkpoint | src/storage/page_buffer.c | 3960 |
| pgbuf_flush_chkpt_seq_list | src/storage/page_buffer.c | 4102 |
| pgbuf_flush_seq_list | src/storage/page_buffer.c | 4210 |
| pgbuf_set_dirty | src/storage/page_buffer.c | 4700 |
| pgbuf_set_lsa | src/storage/page_buffer.c | 4771 |
| pgbuf_set_tde_algorithm | src/storage/page_buffer.c | 4881 |
| pgbuf_set_bcb_page_vpid | src/storage/page_buffer.c | 5214 |
| pgbuf_initialize_bcb_table | src/storage/page_buffer.c | 5334 |
| pgbuf_initialize_hash_table | src/storage/page_buffer.c | 5452 |
| pgbuf_initialize_lock_table | src/storage/page_buffer.c | 5481 |
| pgbuf_initialize_lru_list | src/storage/page_buffer.c | 5519 |
| pgbuf_initialize_aout_list | src/storage/page_buffer.c | 5582 |
| pgbuf_initialize_invalid_list | src/storage/page_buffer.c | 5686 |
| pgbuf_initialize_thrd_holder | src/storage/page_buffer.c | 5701 |
| pgbuf_allocate_thrd_holder_entry | src/storage/page_buffer.c | 5783 |
| pgbuf_find_thrd_holder | src/storage/page_buffer.c | 5870 |
| pgbuf_remove_thrd_holder | src/storage/page_buffer.c | 5971 |
| pgbuf_latch_bcb_upon_fix | src/storage/page_buffer.c | 6073 |
| pgbuf_unlatch_bcb_upon_unfix | src/storage/page_buffer.c | 6417 |
| pgbuf_unlatch_void_zone_bcb | src/storage/page_buffer.c | 6652 |
| pgbuf_should_move_private_to_shared | src/storage/page_buffer.c | 6758 |
| pgbuf_block_bcb | src/storage/page_buffer.c | 6803 |
| pgbuf_timed_sleep_error_handling | src/storage/page_buffer.c | 6925 |
| pgbuf_timed_sleep | src/storage/page_buffer.c | 7014 |
| pgbuf_wakeup_reader_writer | src/storage/page_buffer.c | 7186 |
| pgbuf_search_hash_chain | src/storage/page_buffer.c | 7327 |
| pgbuf_lockfree_fix_ro | src/storage/page_buffer.c | 7452 |
| pgbuf_search_hash_chain_no_bcb_lock | src/storage/page_buffer.c | 7517 |
| pgbuf_insert_into_hash_chain | src/storage/page_buffer.c | 7569 |
| pgbuf_lock_page | src/storage/page_buffer.c | 7718 |
| pgbuf_unlock_page | src/storage/page_buffer.c | 7831 |
| pgbuf_allocate_bcb | src/storage/page_buffer.c | 7916 |
| pgbuf_claim_bcb_for_fix | src/storage/page_buffer.c | 8133 |
| pgbuf_victimize_bcb | src/storage/page_buffer.c | 8372 |
| pgbuf_invalidate_bcb | src/storage/page_buffer.c | 8424 |
| pgbuf_bcb_safe_flush_force_unlock | src/storage/page_buffer.c | 8494 |
| pgbuf_bcb_safe_flush_force_lock | src/storage/page_buffer.c | 8517 |
| pgbuf_bcb_safe_flush_internal | src/storage/page_buffer.c | 8550 |
| pgbuf_get_bcb_from_invalid_list | src/storage/page_buffer.c | 8644 |
| pgbuf_put_bcb_into_invalid_list | src/storage/page_buffer.c | 8693 |
| pgbuf_get_victim | src/storage/page_buffer.c | 8805 |
| pgbuf_is_bcb_fixed_by_any | src/storage/page_buffer.c | 8995 |
| pgbuf_is_bcb_victimizable | src/storage/page_buffer.c | 9023 |
| pgbuf_get_victim_from_lru_list | src/storage/page_buffer.c | 9053 |
| pgbuf_panic_assign_direct_victims_from_lru | src/storage/page_buffer.c | 9279 |
| pgbuf_direct_victims_maintenance | src/storage/page_buffer.c | 9346 |
| pgbuf_lfcq_assign_direct_victims | src/storage/page_buffer.c | 9388 |
| pgbuf_lru_add_bcb_to_top | src/storage/page_buffer.c | 9432 |
| pgbuf_lru_add_bcb_to_middle | src/storage/page_buffer.c | 9482 |
| pgbuf_lru_add_bcb_to_bottom | src/storage/page_buffer.c | 9570 |
| pgbuf_lru_fall_bcb_to_zone_3 | src/storage/page_buffer.c | 9788 |
| pgbuf_lru_boost_bcb | src/storage/page_buffer.c | 9858 |
| pgbuf_lru_move_from_private_to_shared | src/storage/page_buffer.c | 10064 |
| pgbuf_remove_from_lru_list | src/storage/page_buffer.c | 10089 |
| pgbuf_move_bcb_to_bottom_lru | src/storage/page_buffer.c | 10157 |
| pgbuf_add_vpid_to_aout_list | src/storage/page_buffer.c | 10201 |
| pgbuf_remove_vpid_from_aout_list | src/storage/page_buffer.c | 10282 |
| pgbuf_bcb_flush_with_wal | src/storage/page_buffer.c | 10456 |
| pgbuf_wake_flush_waiters | src/storage/page_buffer.c | 10694 |
| pgbuf_is_exist_blocked_reader_writer | src/storage/page_buffer.c | 10741 |
| pgbuf_wakeup | src/storage/page_buffer.c | 11319 |
| pgbuf_set_dirty_buffer_ptr | src/storage/page_buffer.c | 11369 |
| pgbuf_flush_page_and_neighbors_fb | src/storage/page_buffer.c | 11527 |
| pgbuf_flush_neighbor_safe | src/storage/page_buffer.c | 11762 |
| pgbuf_add_bufptr_to_batch | src/storage/page_buffer.c | 11820 |
| pgbuf_ordered_fix_release | src/storage/page_buffer.c | 11985 |
| pgbuf_ordered_unfix | src/storage/page_buffer.c | 12860 |
| pgbuf_add_watch_instance_internal | src/storage/page_buffer.c | 12927 |
| pgbuf_initialize_page_quota_parameters | src/storage/page_buffer.c | 13326 |
| pgbuf_initialize_page_quota | src/storage/page_buffer.c | 13370 |
| pgbuf_initialize_page_monitor | src/storage/page_buffer.c | 13430 |
| pgbuf_adjust_quotas | src/storage/page_buffer.c | 13639 |
| pgbuf_initialize_seq_flusher | src/storage/page_buffer.c | 14016 |
| pgbuf_flush_control_from_dirty_ratio | src/storage/page_buffer.c | 14233 |
| pgbuf_fix_if_not_deallocated_with_caller | src/storage/page_buffer.c | 14735 |
| pgbuf_assign_direct_victim | src/storage/page_buffer.c | 14809 |
| pgbuf_assign_flushed_pages | src/storage/page_buffer.c | 14876 |
| pgbuf_get_thread_waiting_for_direct_victim | src/storage/page_buffer.c | 14946 |
| pgbuf_get_direct_victim | src/storage/page_buffer.c | 14978 |
| pgbuf_lru_advance_victim_hint | src/storage/page_buffer.c | 15131 |
| pgbuf_bcb_update_flags | src/storage/page_buffer.c | 15171 |
| pgbuf_bcb_change_zone | src/storage/page_buffer.c | 15269 |
| pgbuf_bcb_get_zone | src/storage/page_buffer.c | 15374 |
| pgbuf_bcb_get_zone | src/storage/page_buffer.c | 15375 |
| pgbuf_bcb_get_lru_index | src/storage/page_buffer.c | 15386 |
| pgbuf_bcb_get_lru_index | src/storage/page_buffer.c | 15387 |
| pgbuf_bcb_is_dirty | src/storage/page_buffer.c | 15400 |
| pgbuf_bcb_set_dirty | src/storage/page_buffer.c | 15412 |
| pgbuf_bcb_mark_is_flushing | src/storage/page_buffer.c | 15463 |
| pgbuf_bcb_mark_was_flushed | src/storage/page_buffer.c | 15486 |
| pgbuf_bcb_mark_was_not_flushed | src/storage/page_buffer.c | 15500 |
| pgbuf_bcb_is_flushing | src/storage/page_buffer.c | 15513 |
| pgbuf_bcb_should_be_moved_to_bottom_lru | src/storage/page_buffer.c | 15561 |
| pgbuf_notify_vacuum_follows | src/storage/page_buffer.c | 15574 |
| pgbuf_bcb_is_to_vacuum | src/storage/page_buffer.c | 15589 |
| pgbuf_bcb_avoid_victim | src/storage/page_buffer.c | 15603 |
| pgbuf_bcb_register_avoid_deallocation | src/storage/page_buffer.c | 15627 |
| pgbuf_bcb_unregister_avoid_deallocation | src/storage/page_buffer.c | 15640 |
| pgbuf_bcb_should_avoid_deallocation | src/storage/page_buffer.c | 15684 |
| pgbuf_bcb_register_fix | src/storage/page_buffer.c | 15720 |
| pgbuf_bcb_is_hot | src/storage/page_buffer.c | 15741 |
| pgbuf_lfcq_get_victim_from_private_lru | src/storage/page_buffer.c | 15802 |
| pgbuf_lfcq_get_victim_from_shared_lru | src/storage/page_buffer.c | 15894 |
| pgbuf_bcb_register_hit_for_lru | src/storage/page_buffer.c | 15979 |
| pgbuf_get_page_flush_interval | src/storage/page_buffer.c | 16353 |
| pgbuf_page_maintenance_execute | src/storage/page_buffer.c | 16375 |
| pgbuf_page_flush_daemon_task | src/storage/page_buffer.c | 16396 |
| pgbuf_page_maintenance_daemon_init | src/storage/page_buffer.c | 16531 |
| pgbuf_page_flush_daemon_init | src/storage/page_buffer.c | 16549 |
| pgbuf_page_post_flush_daemon_init | src/storage/page_buffer.c | 16567 |
| pgbuf_is_page_flush_daemon_available | src/storage/page_buffer.c | 16673 |
| pgbuf_is_temp_lsa | src/storage/page_buffer.c | 16683 |
| pgbuf_init_temp_page_lsa | src/storage/page_buffer.c | 16689 |
PAGE_FETCH_MODEsrc/storage/page_buffer.h172
PGBUF_LATCH_MODEsrc/storage/page_buffer.h190
PGBUF_ORDERED_RANKsrc/storage/page_buffer.h222
pgbuf_watchersrc/storage/page_buffer.h234
PGBUF_TEMP_LSAsrc/storage/page_buffer.h258
PGBUF_ATOMIC_LATCHsrc/storage/page_buffer.h365
  • cubrid-page-buffer-manager.md — the high-level companion. See also cubrid-double-write-buffer.md (the flush path below) and cubrid-log-manager-detail.md (the WAL rule the flush obeys).
  • Raw analyses under raw/code-analysis/cubrid/storage/buffer_manager/.
  • Code: src/storage/page_buffer.{c,h}.
  • Methodology: knowledge/methodology/code-analysis-detail-doc.md.