CUBRID Page Buffer Manager — Code-Level Deep Dive

Where this document fits: The high-level analysis cubrid-page-buffer-manager.md covers design intent and theoretical background. This document traces every branch and field at the code level. Each chapter is self-contained, but reading in order follows the full lifecycle of a single data page inside the buffer pool — fix, latch, dirty, flush, victimize.

Contents:

Ch	Title	Status
1	Data-Structure Map	✅
2	Initialization and Memory Layout	✅
3	The Fix Entry Path and Page-Table Lookup	✅
4	Miss Handling, BCB Claim, and the PGBUF Allocation Lock	✅
5	The BCB Atomic Latch: Acquire, Block, and Wake	✅
6	Dirtying a Page and the Packed Flags and Zone Word	✅
7	Unfix: LRU Movement, Aout History, and Private-to-Shared Migration	✅
8	Flushing Under the WAL Rule and the Flush Daemons	✅
9	Victim Selection, the LFCQs, and Direct Victim Hand-off	✅
10	Adaptive Quotas, Ordered Fix, and Special Paths	✅

Chapter 1: Data-Structure Map

Field-level reference for every struct the page buffer manager owns. The high-level companion (cubrid-page-buffer-manager.md) explains why a buffer manager needs a BCB, a VPID hash, a free list, and multi-zone LRUs; see its sections on the Buffer Control Block (PGBUF_BCB) and the three-zone LRU lists. That framing is not repeated here. What the companion simplified (the BCB latch, drawn there as a pthread_mutex_t + latch_mode pair) is corrected here against the real source. All structs are private to the translation unit src/storage/page_buffer.c except pgbuf_watcher and PGBUF_LATCH_MODE, which are public in page_buffer.h; the single global is pgbuf_Pool of type PGBUF_BUFFER_POOL. Each table below uses a single Role & rationale column so that every field is covered compactly. Fields marked (SM) exist only under #if SERVER_MODE.

1.1 The packed-word vocabulary

Two BCB words are bit-packed; later chapters read them through accessor macros. The flags word (volatile int, 32 bits): BCB-flag bits in the high byte, the zone in bits 16-19, the LRU index in the low 16 bits.

// PGBUF zone + index layout -- src/storage/page_buffer.c
#define PGBUF_LRU_NBITS 16
#define PGBUF_LRU_INDEX_MASK (PGBUF_LRU_LIST_MAX_COUNT - 1)     /* 0x0000FFFF */
  PGBUF_LRU_1_ZONE = 1 << PGBUF_LRU_NBITS,     /* 0x00010000 */
  PGBUF_LRU_2_ZONE = 2 << PGBUF_LRU_NBITS,     /* 0x00020000 */
  PGBUF_LRU_3_ZONE = 3 << PGBUF_LRU_NBITS,     /* 0x00030000 */
  PGBUF_LRU_ZONE_MASK = PGBUF_LRU_1_ZONE | PGBUF_LRU_2_ZONE | PGBUF_LRU_3_ZONE,
  PGBUF_INVALID_ZONE = 1 << (PGBUF_LRU_NBITS + 2),  /* 0x00040000 */
  PGBUF_VOID_ZONE    = 2 << (PGBUF_LRU_NBITS + 2),  /* 0x00080000 */

The two-bit skip is deliberate: the LRU zones are encoded as the values 1-3 in bits 16-17, while INVALID and VOID each get their own bit (18 and 19), so neither can be mistaken for an LRU zone value, and no zone bit overlaps the LRU index in bits 0-15. The flag bits sit in the top byte (PGBUF_BCB_DIRTY_FLAG 0x80000000, then ..._FLUSHING_TO_DISK, ..._VICTIM_DIRECT, ..._INVALIDATE_DIRECT_VICTIM, ..._MOVE_TO_LRU_BOTTOM, ..._TO_VACUUM, ..._ASYNC_FLUSH_REQ, descending one bit each). Read high-to-low, the word is: [flag byte 24-31 | reserved 20-23 | zone 16-19 | lru index 0-15]. Flag semantics belong to Chapter 6.
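The layout can be checked mechanically. The following is a minimal sketch, not the CUBRID source: the numeric values are copied from the listing above, while the k-prefixed names, kZoneMask (assumed here to be the union of all five zone encodings), and the three helper functions are hypothetical names chosen for this example.

```cpp
#include <cassert>
#include <cstdint>

// Values taken from the listing above; names are this sketch's own.
constexpr uint32_t kLruNBits     = 16;
constexpr uint32_t kLruIndexMask = 0x0000FFFFu;                        // bits 0-15
constexpr uint32_t kLru1Zone     = 1u << kLruNBits;                    // 0x00010000
constexpr uint32_t kLru2Zone     = 2u << kLruNBits;                    // 0x00020000
constexpr uint32_t kLru3Zone     = 3u << kLruNBits;                    // 0x00030000
constexpr uint32_t kLruZoneMask  = kLru1Zone | kLru2Zone | kLru3Zone;  // bits 16-17
constexpr uint32_t kInvalidZone  = 1u << (kLruNBits + 2);              // bit 18
constexpr uint32_t kVoidZone     = 2u << (kLruNBits + 2);              // bit 19
// Assumption: the full zone mask covers every zone encoding (bits 16-19).
constexpr uint32_t kZoneMask     = kLruZoneMask | kInvalidZone | kVoidZone;
constexpr uint32_t kDirtyFlag    = 0x80000000u;                        // top bit of the flag byte

// The three regions of the word must not overlap.
static_assert((kZoneMask & kLruIndexMask) == 0, "zone bits overlap the LRU index");
static_assert((kInvalidZone & kLruZoneMask) == 0, "INVALID aliases an LRU zone");
static_assert((kVoidZone & kLruZoneMask) == 0, "VOID aliases an LRU zone");
static_assert((kDirtyFlag & (kZoneMask | kLruIndexMask)) == 0, "flag overlaps zone/index");

// Compose a flags word from its three parts.
constexpr uint32_t pack_flags(uint32_t flag_bits, uint32_t zone, uint32_t lru_index) {
  return flag_bits | zone | (lru_index & kLruIndexMask);
}
constexpr uint32_t zone_of(uint32_t flags) { return flags & kZoneMask; }
constexpr uint32_t lru_index_of(uint32_t flags) { return flags & kLruIndexMask; }
```

For instance, pack_flags(kDirtyFlag, kLru2Zone, 37) yields a word whose zone decodes to kLru2Zone and whose index decodes to 37, with the dirty bit untouched by either mask.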

The second packed word, count_fix_and_avoid_dealloc, splits via PGBUF_BCB_COUNT_FIX_SHIFT_BITS (16) and PGBUF_BCB_AVOID_DEALLOC_MASK (0x0000FFFF): the high 16 bits hold a saturating fix counter (hot-page detection), and the low 16 bits hold an atomically-mutated avoid-deallocation count. The two are fused into one int because 2-byte atomics are not portable, as the field comment in the source states.
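A small sketch of how such a fused word can be read and updated follows. It is illustrative only: the shift and mask values come from the text, but the function names, the threshold parameter, the use of an unsigned std::atomic<uint32_t> (the real field is a volatile int), and the choice to update the high half with a CAS loop are all assumptions of this example.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

constexpr uint32_t kCountFixShiftBits = 16;          // value of PGBUF_BCB_COUNT_FIX_SHIFT_BITS
constexpr uint32_t kAvoidDeallocMask  = 0x0000FFFFu; // value of PGBUF_BCB_AVOID_DEALLOC_MASK

inline uint32_t fix_count_of(uint32_t word) { return word >> kCountFixShiftBits; }
inline uint32_t avoid_dealloc_of(uint32_t word) { return word & kAvoidDeallocMask; }

// Bump the fix counter in the high half, but never beyond `threshold`
// (hypothetical parameter): this is what "saturating" means here.
inline void register_fix(std::atomic<uint32_t>& word, uint32_t threshold) {
  uint32_t old_word = word.load();
  while (fix_count_of(old_word) < threshold &&
         !word.compare_exchange_weak(old_word, old_word + (1u << kCountFixShiftBits))) {
    // A failed CAS refreshed old_word; the loop re-checks the threshold.
  }
}

// Atomically bump the avoid-dealloc count in the low half.
// The caller must keep it below 0xFFFF, or the carry would leak into the fix counter.
inline void add_avoid_dealloc(std::atomic<uint32_t>& word) { word.fetch_add(1u); }
```

Because both halves live in one 32-bit word, a single 4-byte atomic operation suffices for either update, which is the portability argument the field comment makes.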

1.2 `pgbuf_atomic_latch_impl` — the real BCB latch

This is the single biggest correction to the high-level doc. The latch is not a mutex plus a latch_mode int; it is a 64-bit atomic word reinterpreted through a union:

// PGBUF_ATOMIC_LATCH + union pgbuf_atomic_latch_impl -- page_buffer.h / page_buffer.c
typedef std::atomic<uint64_t> PGBUF_ATOMIC_LATCH;
union pgbuf_atomic_latch_impl
{
  uint64_t raw;
  struct {
    PGBUF_LATCH_MODE latch_mode;   /* uint16_t enum: NO/READ/WRITE/FLUSH/INVALID */
    uint16_t waiter_exists;        /* a thread is parked on next_wait_thrd */
    int32_t  fcnt;                 /* current fix count under the latch */
  } impl;
};

The BCB’s atomic_latch field is the std::atomic<uint64_t> itself. Code loads it (memory_order_acquire) into a stack PGBUF_ATOMIC_LATCH_IMPL, edits the sub-fields, CAS-es the whole raw back (set_latch_and_fcnt, set_latch_and_add_fcnt; get_latch does the read half).

Field	Role & rationale
`raw`	The whole 64-bit word; CAS updates mode+waiter+fcnt in one instruction, no hot-path mutex
`impl.latch_mode`	Current mode (`PGBUF_LATCH_MODE`, `uint16_t`); separates shared-read from exclusive-write
`impl.waiter_exists`	1 if a thread is parked; tells the unlatcher to wake `next_wait_thrd`
`impl.fcnt`	Fix count under the latch; read mode allows `fcnt > 1`, releases at 0

PGBUF_LATCH_MODE is uint16_t-backed to fit the union’s first two bytes: PGBUF_NO_LATCH=0, _READ=1, _WRITE=2, _FLUSH=3 (block mode only — a page is never fixed in flush mode), _INVALID=4.

Invariant — the union is only ever touched as a whole raw word. Writing latch_mode directly through the live atomic would tear the 64-bit word and lose fcnt/waiter_exists updates racing on the other halves. The BCB’s mutex guards list/flag transitions, not the latch. Chapter 5 traces the CAS loop branch-by-branch.
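The whole-word discipline can be illustrated with a short CAS loop. This is a sketch under stated assumptions, not the code Chapter 5 traces: LatchFields, pack, unpack, and try_read_fix are hypothetical names, the sketch copies bytes with memcpy instead of reading through a union, and the compatibility rule (read fixes are granted only when the mode is NO or READ and no waiter is parked) is a simplification chosen for the example.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <cstring>

enum LatchMode : uint16_t { kNoLatch = 0, kRead = 1, kWrite = 2, kFlush = 3, kInvalid = 4 };

// Same three sub-fields as the union's struct member, in the same order.
struct LatchFields {
  uint16_t latch_mode;
  uint16_t waiter_exists;
  int32_t  fcnt;
};
static_assert(sizeof(LatchFields) == sizeof(uint64_t), "fields must fill the 64-bit word exactly");

inline LatchFields unpack(uint64_t raw) {
  LatchFields f;
  std::memcpy(&f, &raw, sizeof f);
  return f;
}
inline uint64_t pack(const LatchFields& f) {
  uint64_t raw;
  std::memcpy(&raw, &f, sizeof raw);
  return raw;
}

// Hypothetical non-blocking read fix: edit a private copy, then publish the
// whole word with one CAS so mode, waiter flag, and fcnt change together.
inline bool try_read_fix(std::atomic<uint64_t>& latch) {
  uint64_t old_raw = latch.load(std::memory_order_acquire);
  for (;;) {
    LatchFields f = unpack(old_raw);
    const bool compatible =
        (f.latch_mode == kNoLatch || f.latch_mode == kRead) && f.waiter_exists == 0;
    if (!compatible) return false;            // a real caller would block here
    f.latch_mode = kRead;
    f.fcnt += 1;
    if (latch.compare_exchange_weak(old_raw, pack(f),
                                    std::memory_order_acq_rel,
                                    std::memory_order_acquire)) {
      return true;
    }
    // CAS failed: old_raw now holds the current word; retry from it.
  }
}
```

Two successive read fixes on an unlatched word leave it in READ mode with fcnt equal to 2; a word in WRITE mode rejects the attempt without being modified.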

1.3 `pgbuf_bcb` — the buffer control block, field by field

// struct pgbuf_bcb -- src/storage/page_buffer.c (condensed; SERVER_MODE fields elided)
struct pgbuf_bcb
{
  VPID vpid;
  PGBUF_ATOMIC_LATCH atomic_latch;      /* the 64-bit union latch from 1.2 */
  volatile int flags;                   /* flag byte | zone | lru index (1.1) */
  PGBUF_BCB *hash_next, *prev_BCB, *next_BCB;
  int tick_lru_list, tick_lru3;
  volatile int count_fix_and_avoid_dealloc;   /* two-purpose; see 1.1 */
  int hit_age;
  LOG_LSA oldest_unflush_lsa;
  PGBUF_IOPAGE_BUFFER *iopage_buffer;
  // ... condensed: mutex, owner_mutex, next_wait_thrd, latch_last_thread under #if SERVER_MODE ...
};

Field	Role & rationale
`mutex` (SM)	Per-BCB `pthread_mutex_t`; serializes list/flag transitions the latch doesn’t cover
`owner_mutex` (SM)	Index of thread holding `mutex`; assert aid for wrong-owner/double unlock
`vpid`	Volume+page id of resident page; the hash key
`atomic_latch`	R/W page latch (union, 1.2); user-level latch off the kernel-mutex hot path
`flags`	Packed flag-byte + zone + LRU index; one atomically-readable replacement/dirty word
`next_wait_thrd` (SM)	FIFO head of threads blocked on the latch; `waiter_exists` (1.2) signals that this queue is non-empty
`latch_last_thread` (SM)	Last thread that latched; diagnostic trail
`hash_next`	Next BCB in hash bucket chain; collision chaining (1.5)
`prev_BCB`	Previous LRU node; doubly-linked LRU gives O(1) unlink
`next_BCB`	Next LRU node or free-list next; reused per the §1.3 invariant
`tick_lru_list`	List tick when BCB entered its LRU; vs list `tick_list` to decide age boost
`tick_lru3`	Position stamp inside zone 3; tells victim-hint which zone-3 BCB is lowest
`count_fix_and_avoid_dealloc`	Hi 16 fix count, lo 16 avoid-dealloc; hot-page detection fused with dealloc protection (1.1)
`hit_age`	Age stamp of last hit; feeds activity/quota (Ch 10)
`oldest_unflush_lsa`	Oldest LSA of unflushed change; WAL anchor — page not written until log durable (Ch 8)
`iopage_buffer`	Pointer to this BCB’s payload slot; separates control metadata from aligned payload

Invariant — next_BCB belongs to exactly one list at a time. A BCB is in an LRU list, the invalid free list, or transiently PGBUF_VOID_ZONE (neither). The flags zone field is the source of truth; zone change and next_BCB relink must be one critical section or a BCB appears in two lists. Chapter 7 traces the relink.

1.4 `pgbuf_iopage_buffer` — the page payload slot

// struct pgbuf_iopage_buffer -- src/storage/page_buffer.c
struct pgbuf_iopage_buffer
{
  PGBUF_BCB *bcb;             /* back-pointer to owning BCB */
#if (__WORDSIZE == 32)
  int dummy;                 /* pad so iopage starts 8-byte aligned */
#endif
  FILEIO_PAGE iopage;        /* the actual buffered IO page */
};

Field	Role & rationale
`bcb`	Back-pointer to owning `pgbuf_bcb`; a `PAGE_PTR` into `iopage.page` recovers the BCB via `CAST_PGPTR_TO_BFPTR`
`dummy` (32-bit)	4-byte filler; on 32-bit `bcb` is 4 bytes, so the pad pushes `iopage` to offset 8 to align `LOG_LSA`
`iopage`	Embedded `FILEIO_PAGE`; one allocation holds control + on-disk image inline

The first bytes of iopage are the header (prv), whose lsa and ptype the WAL/recovery paths read directly:

// struct fileio_page -- src/storage/file_io.h
struct fileio_page_reserved { LOG_LSA lsa; INT32 pageid; INT16 volid;
  unsigned char ptype; unsigned char pflag; /* ... condensed ... */ };
struct fileio_page_watermark { LOG_LSA lsa; /* duplicates prv.lsa */ };
struct fileio_page
{
  FILEIO_PAGE_RESERVED prv;     /* system area at start */
  char page[1];                 /* user area */
  FILEIO_PAGE_WATERMARK prv2;   /* end-of-page watermark, duplicates prv.lsa */
};

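fileio_page is not header-only: a trailing prv2 (FILEIO_PAGE_WATERMARK) sits at end-of-page holding a copy of prv.lsa. Because page[1] is a one-element placeholder for a user area whose real length is the configured page size (it is not a true flexible array member, which would have to be the last member), the declaration is a logical layout, not a literal one: fileio_get_page_watermark_pos computes prv2’s real address from the page size rather than dereferencing the member.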

Invariant — iopage_buffer->bcb round-trips and prv.lsa never moves backward. The back-pointer (bcb->iopage_buffer->bcb == bcb) is set once at init (Ch 2). prv.lsa is the page’s durable-recovery watermark — it advances as the page changes, is mirrored into prv2 at flush, and is what oldest_unflush_lsa and the WAL rule (Ch 8) compare against.
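Both pieces of pointer arithmetic in this section can be sketched with stand-in types. Everything below is hypothetical: the struct names, the simplified header (an int64_t in place of LOG_LSA), the 64-bit-only layout without the dummy pad, and the assumption that the watermark sits immediately after a user area of the given size. The real macros and fileio_get_page_watermark_pos may compute these addresses differently.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>

struct Bcb { int id; };                                  // stand-in for pgbuf_bcb
struct PageHeader {                                      // stand-in for the prv header
  int64_t lsa; int32_t pageid; int16_t volid;
  unsigned char ptype, pflag;
};
struct IoPage { PageHeader prv; char page[1]; };         // user area is longer than 1 byte in practice
struct IoPageBuffer { Bcb* bcb; IoPage iopage; };        // back-pointer first, then the page

// Recover the owning buffer slot from a pointer to the start of the user area
// (the role the text assigns to CAST_PGPTR_TO_BFPTR).
inline IoPageBuffer* buffer_of_page(char* pgptr) {
  const std::size_t off = offsetof(IoPageBuffer, iopage) + offsetof(IoPage, page);
  return reinterpret_cast<IoPageBuffer*>(pgptr - off);
}

// Locate the trailing watermark by arithmetic instead of through a struct member.
// Assumption: it starts right after `user_area_size` bytes of user area.
inline char* watermark_pos(IoPage* io, std::size_t user_area_size) {
  return reinterpret_cast<char*>(io) + offsetof(IoPage, page) + user_area_size;
}
```

Given a slot allocated with room for the user area, buffer_of_page applied to the slot's page pointer returns the slot itself, so its bcb member leads back to the owning control block.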

1.5 The VPID hash: `pgbuf_buffer_hash` and `pgbuf_buffer_lock`

pgbuf_buffer_hash is { pthread_mutex_t hash_mutex; PGBUF_BCB *hash_next; PGBUF_BUFFER_LOCK *lock_next; }; pgbuf_buffer_lock is { VPID vpid; PGBUF_BUFFER_LOCK *lock_next; THREAD_ENTRY *next_wait_thrd; } (mutex/thread fields under #if SERVER_MODE).

Struct.Field	Role & rationale
`buffer_hash.hash_mutex`	Bucket lock; protects both chains in the bucket
`buffer_hash.hash_next`	Resident-BCB chain head; lookup walks it matching `vpid` (Ch 3)
`buffer_hash.lock_next`	Pending PGBUF-lock chain head; VPIDs being read in, not yet a BCB
`buffer_lock.vpid`	VPID reserved for read-in; a second fixer finds the in-flight read and waits
`buffer_lock.lock_next`	Next lock record in bucket; chains concurrent in-flight reads
`buffer_lock.next_wait_thrd` (SM)	Queue waiting on this read-in; woken when the page lands

The buffer-lock table is fixed-size, with one record per thread, because a thread has at most one outstanding read at a time. Chapter 4 traces how a miss claims a lock before allocating a BCB.

flowchart LR
  H["pgbuf_buffer_hash[bucket]"] -->|hash_next| B1["BCB"] -->|hash_next| B2["BCB"]
  H -->|lock_next| K1["pgbuf_buffer_lock vpid=A"] -->|lock_next| K2["pgbuf_buffer_lock vpid=B"]
  K1 -->|next_wait_thrd| T["waiting threads"]

Figure 1-1: one bucket anchors a resident-BCB chain and a pending-lock chain under one hash_mutex.

1.6 `pgbuf_lru_list` — one multi-zone LRU

struct pgbuf_lru_list holds, after #if SERVER_MODE pthread_mutex_t mutex: PGBUF_BCB *top, *bottom, *bottom_1, *bottom_2; PGBUF_BCB *volatile victim_hint; int count_lru1/2/3, count_vict_cand, threshold_lru1/2, quota, tick_list, tick_lru3; volatile int flags; int index;.

Field	Role & rationale
`mutex` (SM)	List lock; protects link integrity
`top` / `bottom`	Head (MRU) / tail (LRU); new/boosted link at top, eviction ends at bottom
`bottom_1`	Last BCB of zone 1 (NULL if empty); O(1) zone-1→2 boundary move
`bottom_2`	Last BCB of zone 2 (NULL if empty); zone 2/3 boundary marker
`victim_hint`	Volatile victim-scan start; avoids re-walking pinned BCBs each search
`count_lru1`/`2`/`3`	Per-zone BCB counts; drive rebalancing and victim availability
`count_vict_cand`	Victimizable BCB count; lets victim search skip empty lists (Ch 9)
`threshold_lru1`/`2`	Target sizes of zones 1/2; a BCB falls a zone when its zone exceeds threshold, zone 3 is the rest
`quota`	Target size of a private list; adaptive per-session (Ch 10), unused for shared
`tick_list`	Bumped on add/boost; BCB stores its entry tick, the difference gauges staleness
`tick_lru3`	Bumped when a BCB falls to zone 3; stamps `bcb->tick_lru3` for victim-hint order
`flags`	Per-list flag word; marks bulk/quota list state
`index`	This list’s index in `buf_LRU_list[]`; stored into BCB `flags` low 16 so a BCB knows its home list

Invariant — victim_hint may drift below the true first victim, and the code tolerates it. Everything below the hint should be dirty, but TPCC core dumps showed it sometimes sitting before the first victimizable BCB — a known unfixed bug flagged TODO. Consumers treat the hint as a start point and re-validate every candidate; trusting it as exact would skip valid victims or victimize a dirty page. Chapter 9 walks the scan.
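The "start point, then re-validate" rule can be sketched as follows. This is not the scan Chapter 9 walks: Node, is_victimizable, find_victim_from_hint, and the step bound are hypothetical, the predicate (clean and unfixed) is a simplification, and the sketch assumes the search proceeds from the hint toward the top of the list through prev links.

```cpp
#include <cassert>
#include <cstddef>

// Minimal stand-in for a BCB linked into an LRU list.
struct Node {
  Node* prev;   // neighbour toward the top (MRU end); nullptr at the top
  bool  dirty;
  int   fcnt;   // current fix count
};

// Hypothetical eligibility test: a candidate must be clean and unfixed.
inline bool is_victimizable(const Node* n) { return !n->dirty && n->fcnt == 0; }

// Use the hint only as a starting position and check every candidate,
// giving up after `max_steps` nodes so a stale hint cannot cause a long walk.
inline Node* find_victim_from_hint(Node* hint, int max_steps) {
  int steps = 0;
  for (Node* n = hint; n != nullptr && steps < max_steps; n = n->prev, ++steps) {
    if (is_victimizable(n)) return n;
  }
  return nullptr;   // caller falls back to another list or another strategy
}
```

With a dirty node at the hint, a fixed node above it, and a clean node above that, the search returns the third node when allowed three or more steps and nothing when limited to two.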

1.7 `pgbuf_invalid_list` — the free pool

struct pgbuf_invalid_list is { pthread_mutex_t invalid_mutex (SERVER_MODE); PGBUF_BCB *invalid_top; int invalid_cnt; }.

Field	Role & rationale
`invalid_mutex` (SM)	Free-list lock; serializes push/pop of free BCBs
`invalid_top`	Free-chain head (links via BCB `next_BCB`); a miss pops here before victimizing (Ch 4)
`invalid_cnt`	Count of free BCBs; “pool empty → victimize” without walking the chain

A free BCB is PGBUF_INVALID_ZONE and uses only next_BCB; prev_BCB is unused (per §1.3 invariant).
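A free list of this shape is an intrusive singly-linked stack with a side counter. The sketch below uses hypothetical stand-in types and function names, and it leaves locking to the caller (in the real structure, invalid_mutex would be held around each operation).

```cpp
#include <cassert>

struct Bcb { Bcb* next_BCB = nullptr; };   // only the link the free list uses

struct InvalidList {
  Bcb* invalid_top = nullptr;              // head of the free chain
  int  invalid_cnt = 0;                    // number of BCBs on the chain
};

// Push a free BCB on top of the chain. Caller is assumed to hold the list lock.
inline void push_invalid(InvalidList& list, Bcb* bcb) {
  bcb->next_BCB = list.invalid_top;
  list.invalid_top = bcb;
  ++list.invalid_cnt;
}

// Pop the top free BCB, or return nullptr when the pool is empty,
// which is the signal to go and victimize instead.
inline Bcb* pop_invalid(InvalidList& list) {
  if (list.invalid_cnt == 0) return nullptr;   // O(1) emptiness test, no chain walk
  Bcb* bcb = list.invalid_top;
  list.invalid_top = bcb->next_BCB;
  bcb->next_BCB = nullptr;
  --list.invalid_cnt;
  return bcb;
}
```

Pushing two BCBs and popping three times returns them in reverse order and then nullptr, with the counter back at zero.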

1.8 The holder triad — per-thread fix bookkeeping

A thread records each fix in a pgbuf_holder it owns, not in the BCB. Ownership chain: pgbuf_holder_anchor (per-thread head) → pgbuf_holder (one per held page) → pgbuf_holder_set (the slab holders are carved from).

// pgbuf_holder / _anchor / _set -- src/storage/page_buffer.c (condensed)
struct pgbuf_holder {
  int fix_count; PGBUF_BCB *bufptr;
  PGBUF_HOLDER *thrd_link, *next_holder;     /* hold-list / free-list links */
  PGBUF_HOLDER_STAT perf_stat;               /* #if !NDEBUG: char fixed_at[64*1024]; int fixed_at_size; */
  int watch_count; PGBUF_WATCHER *first_watcher, *last_watcher; };
// pgbuf_holder_anchor: { int num_free_cnt, num_hold_cnt; PGBUF_HOLDER *thrd_free_list, *thrd_hold_list; }
// pgbuf_holder_set:    { PGBUF_HOLDER element[PGBUF_NUM_ALLOC_HOLDER /*==10*/]; PGBUF_HOLDER_SET *next_set; }

Struct.Field	Role & rationale
`holder.fix_count`	Re-fix depth on this BCB; only the last unfix releases the latch
`holder.bufptr`	The held BCB; links per-thread holder to shared BCB
`holder.thrd_link`	Next in hold list; lets `pgbuf_unfix_all` walk every held page
`holder.next_holder`	Next in free list; recycles slots without re-alloc
`holder.perf_stat`	`PGBUF_HOLDER_STAT` flags; perf accounting of page usage
`holder.fixed_at`/`fixed_at_size` (dbg)	Fix call-site capture; debug where-fixed tracing
`holder.watch_count`	Watchers attached here; ordered-fix watchers (Ch 10) hang off the holder
`holder.first_watcher`/`last_watcher`	Watcher list ends; O(1) append/detach during ordered fix
`anchor.num_free_cnt`/`num_hold_cnt`	Free/used counters; fast “need a new slab?” decision
`anchor.thrd_free_list`/`thrd_hold_list`	Free/hold list heads; thread’s private view of its fixes
`holder_set.element[10]`	Slab of 10 pre-allocated holders; handed out in batches, never returned
`holder_set.next_set`	Next slab; the global free-holder pool is a list of slabs

1.9 `pgbuf_watcher` — the ordered-fix bit-field

// struct pgbuf_watcher -- src/storage/page_buffer.h (debug magic/strings elided)
struct pgbuf_watcher {
  PAGE_PTR pgptr; PGBUF_WATCHER *next, *prev;
  PGBUF_ORDERED_GROUP group_id;     /* VPID of group's HEAP header */
  unsigned latch_mode:7;            /* requested latch mode */
  unsigned page_was_unfixed:1;      /* set if any refix occurred */
  unsigned initial_rank:4;          /* rank at init */
  unsigned curr_rank:4; };          /* rank after fix */

Field	Role & rationale
`pgptr`	The watched page handle; what the caller reads/writes
`next`/`prev`	Links in the holder’s watcher list; a page may carry several, O(1) detach
`group_id`	VPID of the grouping heap header; ordered fix orders pages within a group to avoid deadlock
`latch_mode:7`	Requested latch mode; 7 bits cover the small enum, packed tight
`page_was_unfixed:1`	Set on unfix+refix during reorder; cached `pgptr` may have moved, revalidate
`initial_rank:4`	Rank at watcher init; desired fix order before reorder
`curr_rank:4`	Rank after fixing; detects out-of-order fixes

Chapter 10 traces the ordered-fix reorder loop driving these bits.

1.10 `pgbuf_buffer_pool` — the global root

pgbuf_Pool ties everything together. Fields a modifier must know:

Field	Role & rationale
`num_buffers`	Total BCB frames, read from `PRM_ID_PB_NBUFFERS` and floored at 10 × `MAX_NTRANS` (§2.1); sizes `BCB_table`, `iopage_table`, and `victim_cand_list`
`BCB_table`	`PGBUF_BCB[]`; the control blocks
`buf_hash_table`	`PGBUF_BUFFER_HASH[]`; the VPID hash (1.5)
`buf_lock_table`	`PGBUF_BUFFER_LOCK[]`; one pending-read record per thread (1.5)
`iopage_table`	`PGBUF_IOPAGE_BUFFER[]`; page payloads, parallel to `BCB_table`
`num_LRU_list`	Number of shared LRU lists; first slice of `buf_LRU_list`
`ratio_lru1`/`ratio_lru2`	Zone-1/2 size ratios; seed each list’s `threshold_lru1/2`
`buf_LRU_list`	`PGBUF_LRU_LIST[]`, shared lists first, then private (§2.5); one backing array, index decides class
`buf_AOUT_list`	`PGBUF_AOUT_LIST` victim history; the “Aout” half of 2Q (Ch 7)
`buf_invalid_list`	`PGBUF_INVALID_LIST` free pool (1.7); source of fresh BCBs
`victim_cand_list`	Victim-candidate array; flush daemon working set (Ch 8)
`seq_chkpt_flusher`	`PGBUF_SEQ_FLUSHER`; rate-controlled checkpoint flush state
`monitor`	`PGBUF_PAGE_MONITOR`; dirty count, per-LRU hits, victim/fix counters
`quota`	`PGBUF_PAGE_QUOTA`; private-list quota tuning (Ch 10)
`thrd_holder_info`	`PGBUF_HOLDER_ANCHOR[]` per thread; per-thread holder anchors (1.8)
`thrd_reserved_holder`	Backing memory for all holders; pre-reserved holder space
`free_holder_set_mutex` (SM)	Shared free-holder pool lock; serializes slab hand-out
`free_holder_set`/`free_index`	First slab with free entries, first free slot; global holder allocator cursor
`check_for_interrupts`	Set when interrupts must be checked; log mgr toggles under `TR_TABLE_CS`
`is_flushing_victims`/`is_checkpoint` (SM)	Daemon-state flags; coordinate flush vs. checkpoint
`direct_victims` (SM)	Victim array + two priority waiter LFCQs; direct victim hand-off (Ch 9)
`flushed_bcbs` (SM)	LFCQ of post-flush BCBs; post-flush processing queue
`private/big_private/shared_lrus_with_victims`	Three LFCQs of LRU indices with victims; victim search consults these vs scanning all (Ch 9)
`show_status`/`_old`/`_snapshot`/`_mutex`	`SHOW STATUS` reporting state; statistics surfaced to `SHOW` queries

1.11 The zone/index accessors

The only sanctioned readers of the packed flags word; the rest of the code never masks flags by hand.

// pgbuf_bcb_get_zone / _get_lru_index / PGBUF_IS_BCB_IN_LRU -- page_buffer.c
STATIC_INLINE PGBUF_ZONE pgbuf_bcb_get_zone (const PGBUF_BCB * bcb)
{ return PGBUF_GET_ZONE (bcb->flags); }              /* (flags & PGBUF_ZONE_MASK) */
STATIC_INLINE int pgbuf_bcb_get_lru_index (const PGBUF_BCB * bcb)
{ assert (PGBUF_IS_BCB_IN_LRU (bcb));                /* <- precondition */
  return PGBUF_GET_LRU_INDEX (bcb->flags); }         /* (flags & 0x0000FFFF) */
#define PGBUF_IS_BCB_IN_LRU(bcb) ((pgbuf_bcb_get_zone (bcb) & PGBUF_LRU_ZONE_MASK) != 0)

Branch analysis:

pgbuf_bcb_get_zone: single unconditional return, no error path. Every legal flags value yields exactly one of the five PGBUF_ZONE values; a corrupted flags value falls through to the caller’s switch default.
pgbuf_bcb_get_lru_index: two branches via the assert. Debug build, BCB in an LRU zone: the assert passes and the function returns the low 16 bits (the home list’s index, §1.6). Debug build, BCB not in an LRU zone (INVALID/VOID): the assert fires, because those bits are meaningless. In a release build the assert is compiled out and the function returns whatever the low 16 bits hold, so callers own the precondition; hence call sites guard with PGBUF_IS_BCB_IN_LRU.
PGBUF_IS_BCB_IN_LRU: one boolean, two outcomes. It ANDs the zone against PGBUF_LRU_ZONE_MASK (zones 1/2/3). INVALID/VOID set bits outside that mask, so the result is false; any LRU zone yields true. This is the gate that makes the release-build path of pgbuf_bcb_get_lru_index safe.
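The guard-then-read pattern at call sites can be folded into one helper for illustration. This is a hypothetical variant, not a function in page_buffer.c: the real accessor asserts the precondition, whereas this sketch returns -1 when the word is not in an LRU zone. The mask values are the ones from §1.1.

```cpp
#include <cassert>
#include <cstdint>

constexpr uint32_t kLruIndexMask = 0x0000FFFFu;   // bits 0-15
constexpr uint32_t kLruZoneMask  = 0x00030000u;   // LRU zones 1/2/3, bits 16-17
constexpr uint32_t kLru2Zone     = 0x00020000u;
constexpr uint32_t kInvalidZone  = 0x00040000u;   // bit 18, outside the LRU zone mask

// Same test as the macro in the listing: any LRU zone bit set?
inline bool is_in_lru(uint32_t flags) { return (flags & kLruZoneMask) != 0; }

// Hypothetical checked accessor: the index is meaningful only inside an LRU zone.
inline int lru_index_or_minus1(uint32_t flags) {
  return is_in_lru(flags) ? static_cast<int>(flags & kLruIndexMask) : -1;
}
```

A word in zone 2 with index 5 yields 5; the same low bits under the INVALID zone yield -1, because the low 16 bits carry no meaning there.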

1.12 Pointer-relationship panorama

flowchart TB
  POOL["pgbuf_buffer_pool (pgbuf_Pool)"]
  POOL -->|BCB_table| BCB["pgbuf_bcb[]"]
  POOL -->|iopage_table| IOP["pgbuf_iopage_buffer[]"]
  POOL -->|buf_hash_table| HASH["pgbuf_buffer_hash[]"]
  POOL -->|buf_lock_table| LOCK["pgbuf_buffer_lock[]"]
  POOL -->|buf_LRU_list| LRU["pgbuf_lru_list[] shared+private"]
  POOL -->|buf_invalid_list| INV["pgbuf_invalid_list"]
  POOL -->|thrd_holder_info| ANC["pgbuf_holder_anchor[] per thread"]
  POOL -->|free_holder_set| SET["pgbuf_holder_set slabs"]
  HASH -->|hash_next| BCB
  HASH -->|lock_next| LOCK
  BCB -->|iopage_buffer| IOP
  IOP -->|bcb| BCB
  BCB -->|atomic_latch| LATCH["pgbuf_atomic_latch_impl union"]
  LRU -->|top/bottom/bottom_1/bottom_2/victim_hint| BCB
  INV -->|invalid_top via next_BCB| BCB
  ANC -->|thrd_hold_list| HLD["pgbuf_holder"]
  SET -->|element[10]| HLD
  HLD -->|bufptr| BCB
  HLD -->|first_watcher/last_watcher| WAT["pgbuf_watcher"]

Figure 1-2: the full pointer panorama. Every later chapter operates on a sub-graph of this picture.

1.13 Chapter summary — key takeaways

The BCB latch is a 64-bit atomic union, not a mutex. pgbuf_atomic_latch_impl packs latch_mode+waiter_exists+fcnt into one std::atomic<uint64_t>, touched only as the whole raw word. The BCB’s mutex guards list/flag transitions, not the latch.
flags is three things in one word: flag byte (24-31), zone (16-19, with a skip so INVALID/VOID get their own bits and never alias an LRU zone value), LRU index (0-15). Read it only through the §1.11 accessors.
count_fix_and_avoid_dealloc is two fused counters — hi 16 a saturating fix count, lo 16 an atomic avoid-dealloc count, fused because 2-byte atomics aren’t portable.
next_BCB is shared between the LRU and free lists; the zone is the single source of truth. Relink and zone change must be one critical section.
victim_hint is advisory and can drift below the true first victim — a known unfixed bug; treat it as a start point and re-validate.
The iopage is embedded with a back-pointer and an alignment pad, so PAGE_PTR → BCB round-trips and FILEIO_PAGE stays 8-byte aligned; its prv.lsa (mirrored at end-of-page in prv2) is the WAL/recovery watermark.
Fixes are bookkept per-thread in the holder triad (anchor → holder → slab of 10).
pgbuf_bcb_get_lru_index is valid only when PGBUF_IS_BCB_IN_LRU holds; the assert encodes that precondition in debug builds, and release builds rely on call-site guards.

Chapter 2: Initialization and Memory Layout

This chapter answers: where does every page-buffer structure come from at server start, and how is each table sized, allocated, and cross-wired before the first pgbuf_fix runs? The high-level companion (cubrid-page-buffer-manager.md) names the players in its CUBRID’s Approach section (BCB, page table, invalid list, three-zone LRU, private LRUs, Aout, LFCQs) and Ch. 1 gives the struct map; neither is re-derived below.

Everything lives in one file-scope object, pgbuf_Pool (type PGBUF_BUFFER_POOL). pgbuf_initialize is the orchestrator: it zeroes pgbuf_Pool field-by-field, derives sizes, then calls ten sub-initializers in a fixed order — quota parameters first (they fix the LRU count), then the dependent tables, then the quota/monitor arrays sized by PGBUF_TOTAL_LRU_COUNT.

2.1 `pgbuf_initialize` — the orchestrator and its size derivation

The function opens with per-field zeroing plus memset on embedded sub-structs. The std::atomic_int members of monitor cannot be memset, so they use .store(0). This manual reset is what makes pgbuf_finalize safe on a half-built pool: every pointer is NULL before any allocation (scalar sentinels like free_index are set to 0 here and re-set to -1 later in §2.9). num_buffers is read then floored; the two LRU-zone ratios are each clamped:

// pgbuf_initialize -- src/storage/page_buffer.c
pgbuf_Pool.num_buffers = prm_get_integer_value (PRM_ID_PB_NBUFFERS);
if (pgbuf_Pool.num_buffers < PGBUF_MINIMUM_BUFFERS)   /* MAX_NTRANS * 10 */
  pgbuf_Pool.num_buffers = PGBUF_MINIMUM_BUFFERS;     /* <- silent floor, never error */
pgbuf_Pool.ratio_lru1 = prm_get_float_value (PRM_ID_PB_LRU_HOT_RATIO);
pgbuf_Pool.ratio_lru2 = prm_get_float_value (PRM_ID_PB_LRU_BUFFER_RATIO);
pgbuf_Pool.ratio_lru1 = MAX (pgbuf_Pool.ratio_lru1, PGBUF_LRU_ZONE_MIN_RATIO); /* clamp lru1 into  */
pgbuf_Pool.ratio_lru1 = MIN (pgbuf_Pool.ratio_lru1, PGBUF_LRU_ZONE_MAX_RATIO); /* [0.05f, 0.90f]   */
pgbuf_Pool.ratio_lru2 = MAX (pgbuf_Pool.ratio_lru2, PGBUF_LRU_ZONE_MIN_RATIO); /* lru2 floor       */
pgbuf_Pool.ratio_lru2 = MIN (pgbuf_Pool.ratio_lru2, 1.0f - PGBUF_LRU_ZONE_MIN_RATIO - pgbuf_Pool.ratio_lru1);

The ratios are only stored here; the per-list LRU1/LRU2 thresholds they seed are set later (Ch. 7/10). Two asserts follow: ratio_lru2 stays in [0.05, 0.90] and the two-zone sum stays in [0.099, 0.951]. A failure in any sub-initializer jumps to the error label, which calls pgbuf_finalize.
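The clamping sequence is easy to check in isolation. The sketch below repeats the four MAX/MIN steps from the listing as a pure function; the constants carry the values named in the listing's comments (0.05f and 0.90f), while ZoneRatios and clamp_zone_ratios are hypothetical names for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

constexpr float kZoneMinRatio = 0.05f;   // value of PGBUF_LRU_ZONE_MIN_RATIO
constexpr float kZoneMaxRatio = 0.90f;   // value of PGBUF_LRU_ZONE_MAX_RATIO

struct ZoneRatios { float lru1; float lru2; };

// Same order as the listing: clamp lru1 into [min, max], floor lru2 at min,
// then cap lru2 so that zone 3 keeps at least `min` of the list.
inline ZoneRatios clamp_zone_ratios(float lru1, float lru2) {
  lru1 = std::max(lru1, kZoneMinRatio);
  lru1 = std::min(lru1, kZoneMaxRatio);
  lru2 = std::max(lru2, kZoneMinRatio);
  lru2 = std::min(lru2, 1.0f - kZoneMinRatio - lru1);
  return {lru1, lru2};
}
```

An oversized request such as (0.95, 0.30) ends up near (0.90, 0.05), so the sum is about 0.95, which is why the second assert allows a small tolerance around [0.10, 0.95].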

flowchart TD
  A["pgbuf_initialize"] --> B["pgbuf_initialize_page_quota_parameters\nfixes num_private_LRU_list"]
  B --> C["pgbuf_initialize_bcb_table\nBCB_table + iopage_table"]
  C --> D["pgbuf_initialize_hash_table\n2^20 buckets"]
  D --> E["pgbuf_initialize_lock_table\none record per thread"]
  E --> F["pgbuf_initialize_lru_list\nfixes num_LRU_list, builds shared+private"]
  F --> G["pgbuf_initialize_invalid_list\nall BCBs seeded here"]
  G --> H["pgbuf_initialize_aout_list"]
  H --> I["pgbuf_initialize_thrd_holder\npre-allocate holder sets"]
  I --> J["pgbuf_initialize_page_quota\narrays sized by TOTAL_LRU_COUNT"]
  J --> K["pgbuf_initialize_page_monitor\nlru_hits/lru_activity arrays"]
  K --> L["victim_cand_list + seq_chkpt_flusher\n+ SERVER_MODE LFCQs + show_status"]
  B -.error.-> Z["goto error -> pgbuf_finalize"]
  C -.error.-> Z
  F -.error.-> Z
  L --> M["NO_ERROR"]

Figure 2-1. The ten sub-initializers in call order. Quota parameters are first because they set num_private_LRU_list, which feeds PGBUF_TOTAL_LRU_COUNT, which sizes the LRU, quota, and monitor arrays. Note the invalid list (G) is seeded before the Aout list (H).

After the ten, the orchestrator allocates victim_cand_list (one per buffer), sizes the checkpoint flusher at MIN(0.25 * num_buffers, 65536), and under SERVER_MODE allocates the direct-victim array (bcb_victims, one per thread) and three lockfree::circular_queue objects (waiter_threads_high_priority, waiter_threads_low_priority, flushed_bcbs). The private/big_private_lrus_with_victims queues are created only if PGBUF_PAGE_QUOTA_IS_ENABLED; shared_lrus_with_victims always. Finally show_status (MAX_NTRANS + 1 records) is allocated and zeroed. These LFCQs and the daemons are Ch. 8–9.

Invariant — “every pointer NULL before first allocation.” Enforcement: the opening per-field reset NULLs every pointer member, so error: can call pgbuf_finalize at any point and finalize frees exactly what was allocated (every free is != NULL-guarded). What breaks: a pointer field added but not NULL-initialized here would feed garbage to free_and_init on a mid-init failure — a free of uninitialized memory.

2.2 `pgbuf_initialize_bcb_table` — BCB/iopage allocation, cross-linking, alignment

Two parallel arrays are allocated and each validated with MEM_SIZE_IS_VALID: BCB_table (metadata, num_buffers * PGBUF_BCB_SIZEOF) and iopage_table (page frames, num_buffers * PGBUF_IOPAGE_BUFFER_SIZE). Both iopage failure branches (bad-size and OOM) roll back BCB_table themselves (free_and_init, guarded != NULL) rather than relying on finalize, then return ER_PRM_BAD_VALUE / ER_OUT_OF_VIRTUAL_MEMORY. The per-BCB loop then initializes each BCB and cross-links it to its iopage frame symmetrically; next_BCB chains every BCB into one forward list (last NULL) that the invalid list inherits:

// pgbuf_initialize_bcb_table -- src/storage/page_buffer.c
for (i = 0; i < pgbuf_Pool.num_buffers; i++)
  {
    bufptr = PGBUF_FIND_BCB_PTR (i);                 /* base + i * sizeof(PGBUF_BCB) */
    pthread_mutex_init (&bufptr->mutex, NULL);
    VPID_SET_NULL (&bufptr->vpid);
    placement_new (&bufptr->atomic_latch, 0);        /* C++ atomic needs placement-new, not memset */
    bufptr->atomic_latch.store (impl.raw);           /* impl = {mode INVALID, no waiter, fcnt 0}; Ch.5 */
    bufptr->next_BCB = (i == pgbuf_Pool.num_buffers - 1) ? NULL : PGBUF_FIND_BCB_PTR (i + 1); /* chain */
    bufptr->flags = PGBUF_BCB_INIT_FLAGS;            /* == PGBUF_INVALID_ZONE, no other flag */
    /* ... clear hash_next/prev_BCB/count_fix_and_avoid_dealloc/hit_age/oldest_unflush_lsa/ticks ... */
    ioptr = PGBUF_FIND_IOPAGE_PTR (i);               /* base + i * PGBUF_IOPAGE_BUFFER_SIZE */
    /* ... fileio_init_lsa_of_page; set iopage.prv pageid/volid = -1, ptype UNKNOWN ... */
    bufptr->iopage_buffer = ioptr;  ioptr->bcb = bufptr;   /* <- symmetric cross-link */
  }

graph LR
  subgraph BCB_table
    b0["BCB[0]"]
    b1["BCB[1]"]
  end
  subgraph iopage_table
    p0["iopage[0]"]
    p1["iopage[1]"]
  end
  b0 -->|iopage_buffer| p0
  p0 -->|bcb| b0
  b1 -->|iopage_buffer| p1
  p1 -->|bcb| b1
  b0 -->|next_BCB| b1

Figure 2-2. Parallel arrays, symmetric per-slot cross-link, and the next_BCB chain the invalid list inherits.
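The cross-link and the chain can be reproduced with two small vectors. This is a toy model under obvious assumptions: Bcb and Frame are stand-ins with only the relevant pointers, init_tables is a hypothetical name, and std::vector replaces the raw allocations and the PGBUF_FIND_* pointer arithmetic of the real loop.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Frame;
struct Bcb   { Bcb* next_BCB = nullptr; Frame* iopage_buffer = nullptr; };
struct Frame { Bcb* bcb = nullptr; };

// Wire slot i of each table to slot i of the other, and chain the BCBs
// forward so that the last one terminates the list with nullptr.
inline void init_tables(std::vector<Bcb>& bcbs, std::vector<Frame>& frames) {
  assert(bcbs.size() == frames.size());
  const std::size_t n = bcbs.size();
  for (std::size_t i = 0; i < n; ++i) {
    bcbs[i].next_BCB      = (i + 1 == n) ? nullptr : &bcbs[i + 1];
    bcbs[i].iopage_buffer = &frames[i];   // BCB -> frame
    frames[i].bcb         = &bcbs[i];     // frame -> BCB (symmetric)
  }
}
```

Walking next_BCB from slot 0 visits every BCB exactly once, and at each stop the round trip through iopage_buffer and back through bcb returns the same BCB.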

Invariant — “iopage is 8-byte aligned.” Enforcement: struct pgbuf_iopage_buffer places PGBUF_BCB *bcb first, then on 32-bit builds (__WORDSIZE == 32) inserts an explicit int dummy so the following FILEIO_PAGE iopage starts on an 8-byte boundary (an unsupported platform that is neither LINUX/WINDOWS/AIX trips a #error). PGBUF_IOPAGE_BUFFER_SIZE (offsetof(.., iopage) + SIZEOF_IOPAGE_PAGESIZE_AND_GUARD()) is the stride PGBUF_FIND_IOPAGE_PTR multiplies by i, so every frame stays aligned. What breaks: dropping the dummy yields misaligned buffers and undefined direct-I/O behavior.

2.3 `pgbuf_initialize_hash_table` — the fixed 2^20 bucket page table

The page table size is a compile-time constant, independent of num_buffers:

// pgbuf_initialize_hash_table -- src/storage/page_buffer.c
hashsize = PGBUF_HASH_SIZE;                          /* (1 << HASH_SIZE_BITS) == 1 << 20 == 1048576 */
pgbuf_Pool.buf_hash_table = (PGBUF_BUFFER_HASH *) malloc (hashsize * PGBUF_BUFFER_HASH_SIZEOF);
/* ... OOM check; loop: pthread_mutex_init each hash_mutex; hash_next = lock_next = NULL ... */

A power-of-two bucket count keeps the final masking step a single AND — pgbuf_hash_func_mirror finishes with hash_val & ((1 << HASH_SIZE_BITS) - 1), so no modulo/division is needed (the function still bit-reverses the 8 LSBs of volid into the high bits in a small loop before XOR-ing with pageid). Each bucket has its own hash_mutex (SERVER_MODE only) — no global page-table lock. The hash_next walk is Ch. 3.
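The hashing steps described above can be sketched as a standalone function. It follows the prose (mirror the 8 low bits of volid into the top of a 20-bit value, XOR with pageid, mask), but it is a reconstruction for illustration: hash_mirror and the k-prefixed constants are names chosen here, and the unsigned arithmetic is this sketch's choice.

```cpp
#include <cassert>
#include <cstdint>

constexpr int kHashSizeBits = 20;   // 2^20 buckets
constexpr int kVolidLsbBits = 8;    // number of low volid bits that get mirrored

inline uint32_t hash_mirror(int16_t volid, int32_t pageid) {
  uint32_t reversed     = 0;
  uint32_t lsb_mask     = 1u;                          // walks volid bits 0..7
  uint32_t reverse_mask = 1u << (kHashSizeBits - 1);   // walks hash bits 19..12
  for (int i = 0; i < kVolidLsbBits; ++i) {
    if (static_cast<uint32_t>(volid) & lsb_mask) {
      reversed |= reverse_mask;                        // volid bit i -> hash bit 19 - i
    }
    lsb_mask <<= 1;
    reverse_mask >>= 1;
  }
  const uint32_t h = static_cast<uint32_t>(pageid) ^ reversed;
  return h & ((1u << kHashSizeBits) - 1u);             // single AND, no modulo
}
```

Mirroring places small volume ids in the high bits of the bucket index, so consecutive pages of different volumes land in different regions of the table while page ids keep their low-bit locality.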

2.4 `pgbuf_initialize_lock_table` — one buffer-lock record per thread

The buffer-lock table has one slot per server thread, indexed by thread index; it is the rendezvous used while a miss is being resolved (Ch. 4):

// pgbuf_initialize_lock_table -- src/storage/page_buffer.c
thrd_num_total = thread_num_total_threads ();        /* SA mode asserts thrd_num_total == 1 */
pgbuf_Pool.buf_lock_table = (PGBUF_BUFFER_LOCK *) malloc (thrd_num_total * PGBUF_BUFFER_LOCK_SIZEOF);
/* ... OOM check; loop: VPID_SET_NULL(vpid); lock_next = NULL; (SERVER_MODE) next_wait_thrd = NULL ... */

Sizing by thread count works because a thread resolves at most one miss at a time; its record is reused for whichever VPID it is bringing in.

2.5 `pgbuf_initialize_lru_list` — shared + private list count and per-list reset

This initializer first fixes num_LRU_list (the shared count): a non-zero parameter is taken verbatim; zero is auto-derived:

// pgbuf_initialize_lru_list -- src/storage/page_buffer.c
pgbuf_Pool.num_LRU_list = prm_get_integer_value (PRM_ID_PB_NUM_LRU_CHAINS);
if (pgbuf_Pool.num_LRU_list == 0)
  {
    pgbuf_Pool.num_LRU_list = (int) MAX_NTRANS;       /* default: one shared list per transaction slot */
    if (pgbuf_Pool.num_buffers / pgbuf_Pool.num_LRU_list < PGBUF_MIN_PAGES_IN_SHARED_LIST)  /* 1000 */
      pgbuf_Pool.num_LRU_list = pgbuf_Pool.num_buffers / PGBUF_MIN_PAGES_IN_SHARED_LIST;     /* coarsen */
    pgbuf_Pool.num_LRU_list = MAX (pgbuf_Pool.num_LRU_list, 4);   /* floor: at least 4 shared LRUs */
  }

Branch logic (auto-derived path only): one list per transaction; if that gives fewer than 1000 pages per list, coarsen; never below 4. An explicit non-zero parameter is taken verbatim and is not floored, because the MAX sits inside the == 0 branch. The allocation covers shared + private (PGBUF_TOTAL_LRU_COUNT = PGBUF_SHARED_LRU_COUNT + PGBUF_PRIVATE_LRU_COUNT, the latter = num_private_LRU_list from §2.7). Private lists occupy the high index range; PGBUF_IS_PRIVATE_LRU_INDEX(i) is true for i >= PGBUF_SHARED_LRU_COUNT.
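The count derivation can be isolated as a pure function. The following is a minimal sketch, not the CUBRID code: derive_shared_lru_count and its parameters are hypothetical names, and the two constants mirror the values quoted in the excerpt (1000 and 4).

```c
#include <assert.h>

/* Stand-in for PGBUF_MIN_PAGES_IN_SHARED_LIST (1000 in the excerpt). */
#define MIN_PAGES_IN_SHARED_LIST 1000
/* Floor applied on the auto-derived path only. */
#define MIN_SHARED_LRU_COUNT 4

/* Hypothetical helper: returns the shared LRU list count.
 *   param       - configured chain count, 0 meaning "auto-derive"
 *   num_buffers - total BCB count
 *   max_ntrans  - transaction slot count (MAX_NTRANS in the excerpt) */
static int
derive_shared_lru_count (int param, int num_buffers, int max_ntrans)
{
  int n = param;

  assert (num_buffers > 0 && max_ntrans > 0);
  if (n != 0)
    return n;                   /* explicit setting: taken verbatim, no floor */

  n = max_ntrans;               /* default: one list per transaction slot */
  if (num_buffers / n < MIN_PAGES_IN_SHARED_LIST)
    n = num_buffers / MIN_PAGES_IN_SHARED_LIST;   /* coarsen */
  if (n < MIN_SHARED_LRU_COUNT)
    n = MIN_SHARED_LRU_COUNT;   /* never below 4 when auto-derived */
  return n;
}
```

For example, 50000 buffers with 100 transaction slots would give 500 pages per list, so the count coarsens to 50; 2000 buffers with 10 slots coarsens to 2 and is then floored to 4.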

// pgbuf_initialize_lru_list -- src/storage/page_buffer.c
pgbuf_Pool.buf_LRU_list = (PGBUF_LRU_LIST *) malloc (PGBUF_TOTAL_LRU_COUNT * PGBUF_LRU_LIST_SIZEOF);
/* ... OOM check; loop over PGBUF_TOTAL_LRU_COUNT lists: ... */
    pgbuf_Pool.buf_LRU_list[i].index = i;            /* self-index, used to recover list from a BCB */
    /* ... pthread_mutex_init; top/bottom/bottom_1/bottom_2 = NULL; counts/victim_hint/ticks cleared ... */
    pgbuf_Pool.buf_LRU_list[i].threshold_lru1 = 0;   /* <- initial threshold ZERO, set later */
    pgbuf_Pool.buf_LRU_list[i].threshold_lru2 = 0;  pgbuf_Pool.buf_LRU_list[i].quota = 0;
    pgbuf_Pool.buf_LRU_list[i].flags = 0;

Both kinds of list use the same loop — they differ only by index range, not by struct. The thresholds and quota start at 0, not from the §2.1 ratios; they get real values from the quota machinery (Ch. 7/10) once num_buffers is distributed. At init every list is empty, so zero is correct.

2.6 `pgbuf_initialize_invalid_list` and the Aout list

The invalid (free) list is the cheapest initializer — it points its head at BCB[0] and trusts the next_BCB chain from §2.2:

// pgbuf_initialize_invalid_list -- src/storage/page_buffer.c
pthread_mutex_init (&pgbuf_Pool.buf_invalid_list.invalid_mutex, NULL);
pgbuf_Pool.buf_invalid_list.invalid_top = PGBUF_FIND_BCB_PTR (0);   /* head of the next_BCB chain */
pgbuf_Pool.buf_invalid_list.invalid_cnt = pgbuf_Pool.num_buffers;   /* every BCB starts invalid */

Invariant — “all BCBs begin in the invalid list.” Enforcement: every BCB’s flag is PGBUF_INVALID_ZONE (§2.2) and invalid_cnt == num_buffers — the same truth stored twice. What breaks: the first num_buffers misses pop here before any eviction; if the count and the flags disagree, a popped BCB could be double-counted or skipped, so Ch. 7 keeps the two in sync on every move.

The Aout list (pgbuf_initialize_aout_list, struct pgbuf_aout_list) records eviction history so that a re-faulted page can be recognized as recently evicted. Capacity is num_buffers * aout_ratio (where aout_ratio = prm_get_float_value(PRM_ID_PB_AOUT_RATIO)), capped at PGBUF_LIMIT_AOUT_BUFFERS (32768). A non-positive ratio disables the list: max_count = 0 and the function returns NO_ERROR early, after Aout_mutex has already been initialized. Otherwise it pre-allocates a bufarray of max_count PGBUF_AOUT_BUF nodes chained into a free list (Aout_free at bufarray[0]), then builds num_hashes = MAX(max_count / AOUT_HASH_DIVIDE_RATIO, 1) MHT tables. The error_return path nulls Aout_free, frees bufarray, destroys the MHTs, frees aout_buf_ht, destroys Aout_mutex, and returns ER_FAILED. Its MHT loop stops at the first NULL slot (for (i = 0; list->aout_buf_ht[i] != NULL; i++)), so only the tables actually created are destroyed; the pgbuf_finalize loop, by contrast, iterates the full num_hashes.
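The sizing arithmetic for the Aout list can be sketched as two small functions. This is an illustration under assumptions: aout_capacity and aout_num_hashes are hypothetical names, and divide_ratio is passed as a parameter because the value of AOUT_HASH_DIVIDE_RATIO is not given in the text.

```c
#include <assert.h>

#define LIMIT_AOUT_BUFFERS 32768   /* PGBUF_LIMIT_AOUT_BUFFERS in the excerpt */

/* Hypothetical helper: capacity of the Aout history list.
 * A return value of 0 means the list is disabled. */
static int
aout_capacity (int num_buffers, float aout_ratio)
{
  int max_count;

  if (aout_ratio <= 0.0f)
    return 0;                              /* non-positive ratio: disabled */
  max_count = (int) (num_buffers * aout_ratio);
  if (max_count > LIMIT_AOUT_BUFFERS)
    max_count = LIMIT_AOUT_BUFFERS;        /* cap */
  return max_count;
}

/* Hypothetical helper: number of hash tables, at least one.
 * divide_ratio stands in for AOUT_HASH_DIVIDE_RATIO (value assumed by caller). */
static int
aout_num_hashes (int max_count, int divide_ratio)
{
  int n;

  assert (divide_ratio > 0);
  n = max_count / divide_ratio;
  return n > 1 ? n : 1;
}
```

The floor of one hash table means that even a very small history list still gets a lookup structure.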

PGBUF_AOUT_LIST (the Aout container):

Field	Role	Why it exists
`Aout_mutex` (SERVER_MODE)	guards the whole Aout list	history mutated on every eviction/refault
`Aout_top`	most-recently-evicted end	newest history entry
`Aout_bottom`	oldest end	the entry discarded when the list overflows
`Aout_free`	head of the free node list	nodes preallocated, never `malloc`’d per insert
`bufarray`	the single allocation of all nodes	one block beats per-node alloc
`num_hashes`	count of MHT lookup tables	shards the lookup to cut contention
`aout_buf_ht`	array of MHT tables, VPID to node	O(1) “was this page recently evicted?”
`max_count`	capacity; 0 means disabled	bounds memory and acts as the on/off switch

PGBUF_PAGE_QUOTA (adaptive private-LRU sizing — populated in §2.7):

Field	Role	Why it exists
`num_private_LRU_list`	number of private LRUs; 0 disables quota	master switch for the private-LRU feature
`lru_victim_flush_priority_per_lru`	per-LRU flush priority (`TOTAL_LRU_COUNT` floats)	tells flush daemons where dirty pressure is
`private_lru_session_cnt`	active sessions per private LRU	a list with 0 sessions can be reclaimed
`private_pages_ratio`	fraction of all BCBs that are private	target the quota adjuster steers toward
`add_shared_lru_idx`	round-robin cursor for relocating to shared	spreads BCBs evenly across shared lists
`avoid_shared_lru_idx`	shared LRU to skip when relocating	avoids piling onto an oversized list
`last_adjust_time`	timestamp of last quota adjustment	rate-limits the adjuster
`adjust_age`	monotonic adjustment counter	versions the quota state
`is_adjusting`	re-entrancy guard for the adjuster	only one thread adjusts quotas at a time

PGBUF_PAGE_MONITOR (per-LRU statistics — populated in §2.8):

Field	Role	Why it exists
`dirties_cnt`	count of dirty BCBs (INT64)	drives flush urgency
`lru_hits`	LRU1 hits per LRU (`TOTAL_LRU_COUNT` ints)	recency-quality signal for quota tuning
`lru_activity`	activity level per LRU	detects idle private lists for reclamation
`lru_shared_pgs_cnt`	BCBs across all shared LRUs (volatile)	complements `private_pages_ratio`
`pg_unfix_cnt`	unfix counter (`std::atomic_int`)	triggers periodic quota refresh
`lru_victim_req_cnt`	victim requests across all LRUs	victim-pressure gauge
`fix_req_cnt`	fix requests (`std::atomic_int`)	overall load gauge
`bcb_locks` (SERVER_MODE)	per-thread BCB-mutex usage tracking	lock-contention diagnostics
`victim_rich`	true when victims are plentiful	fast-path hint for the fix code

2.7 Quota bootstrap — `pgbuf_initialize_page_quota_parameters` then `_page_quota`

The split is deliberate. Parameters runs before the BCB/LRU tables because it fixes num_private_LRU_list (a dependency of PGBUF_TOTAL_LRU_COUNT); the data initializer runs after them because it allocates arrays sized by that total.

// pgbuf_initialize_page_quota_parameters -- src/storage/page_buffer.c
quota = &(pgbuf_Pool.quota);  memset (quota, 0, sizeof (PGBUF_PAGE_QUOTA));
tsc_getticks (&quota->last_adjust_time);  quota->adjust_age = 0;  quota->is_adjusting = 0;
#if defined (SERVER_MODE)
  quota->num_private_LRU_list = prm_get_integer_value (PRM_ID_PB_NUM_PRIVATE_CHAINS);
  if (quota->num_private_LRU_list == -1)
    quota->num_private_LRU_list = MAX_NTRANS + VACUUM_MAX_WORKER_COUNT;   /* auto: one per worker */
  else if (quota->num_private_LRU_list == 0)
    { /* disabled */ }                                                    /* <- explicit no-op branch */
  else if (quota->num_private_LRU_list < PGBUF_PRIVATE_LRU_MIN_COUNT)     /* 4 */
    quota->num_private_LRU_list = PGBUF_PRIVATE_LRU_MIN_COUNT;            /* floor when user-set */
#else
  quota->num_private_LRU_list = 0;                                       /* SA_MODE: no private LRUs */
#endif

Outcomes: -1 (auto) becomes MAX_NTRANS + VACUUM_MAX_WORKER_COUNT; 0 stays disabled; positive below 4 is raised to 4; SA-mode is always 0. This integer drives PGBUF_PAGE_QUOTA_IS_ENABLED (> 0) everywhere. The data initializer then allocates the two arrays and seeds the session counts:

// pgbuf_initialize_page_quota -- src/storage/page_buffer.c
quota->lru_victim_flush_priority_per_lru = (float *) malloc (PGBUF_TOTAL_LRU_COUNT  * sizeof (float)); /* ALL lists */
quota->private_lru_session_cnt           = (int *)   malloc (PGBUF_PRIVATE_LRU_COUNT * sizeof (int));  /* PRIVATE only */
/* ... each OOM -> error_status, goto exit; loop zeros priority for all, session_cnt only where ... */
/* ...   PGBUF_IS_PRIVATE_LRU_INDEX(i) holds, indexed via PGBUF_PRIVATE_LIST_FROM_LRU_INDEX(i) ...   */
quota->private_pages_ratio = PGBUF_PAGE_QUOTA_IS_ENABLED ? 1.0f : 0;   /* start fully private if enabled */
quota->add_shared_lru_idx = 0;  quota->avoid_shared_lru_idx = -1;

Both allocation failures land on the single exit: label, which returns error_status; the orchestrator's goto error then runs finalize, which frees whatever was allocated.
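The four outcomes of the parameter step in this section reduce to a small normalization function. A minimal sketch, with hypothetical names (normalize_private_lru_count, quota_is_enabled) and the build mode passed as a runtime flag rather than a preprocessor switch:

```c
#include <assert.h>
#include <stdbool.h>

#define PRIVATE_LRU_MIN_COUNT 4   /* PGBUF_PRIVATE_LRU_MIN_COUNT in the excerpt */

/* Hypothetical helper mirroring the branch ladder described above.
 *   param          - configured private chain count (-1 = auto, 0 = disabled)
 *   server_mode    - false models the SA-mode build, which has no private LRUs
 *   max_ntrans     - transaction slot count
 *   vacuum_workers - vacuum worker count */
static int
normalize_private_lru_count (int param, bool server_mode, int max_ntrans, int vacuum_workers)
{
  if (!server_mode)
    return 0;                              /* SA mode: always disabled */
  if (param == -1)
    return max_ntrans + vacuum_workers;    /* auto */
  if (param == 0)
    return 0;                              /* explicitly disabled */
  if (param < PRIVATE_LRU_MIN_COUNT)
    return PRIVATE_LRU_MIN_COUNT;          /* user value raised to the floor */
  return param;
}

/* The single integer doubles as the feature switch. */
static bool
quota_is_enabled (int num_private_lru_list)
{
  return num_private_lru_list > 0;
}
```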

2.8 `pgbuf_initialize_page_monitor`

Mirroring quota-data, the monitor first re-NULLs its pointer members, then allocates two per-LRU integer arrays sized by PGBUF_TOTAL_LRU_COUNT:

// pgbuf_initialize_page_monitor -- src/storage/page_buffer.c
monitor->lru_hits     = (int *) malloc (PGBUF_TOTAL_LRU_COUNT * sizeof (int));
monitor->lru_activity = (int *) malloc (PGBUF_TOTAL_LRU_COUNT * sizeof (int));
/* ... each OOM -> goto exit; loop zeros both; lru_victim_req_cnt/lru_shared_pgs_cnt = 0 ... */
monitor->fix_req_cnt.store (0);  monitor->pg_unfix_cnt.store (0);   /* atomics: .store, not memset */
#if defined (SERVER_MODE)
  if (pgbuf_Monitor_locks)                          /* forced true in !NDEBUG; param-driven in NDEBUG */
    monitor->bcb_locks = (PGBUF_MONITOR_BCB_MUTEX *) calloc (count_threads, sizeof (PGBUF_MONITOR_BCB_MUTEX));
#endif
monitor->victim_rich = false;                       /* no BCBs in lists yet, so no victims */

bcb_locks is per-thread (sized by thread_num_total_threads()), allocated only when lock monitoring is on (pgbuf_Monitor_locks is set in §2.1: forced true in debug builds, read from PRM_ID_PB_MONITOR_LOCKS in NDEBUG). All error paths funnel through exit:.

2.9 `pgbuf_initialize_thrd_holder` — pre-allocated per-thread holder pools

A holder records that a thread has a BCB fixed. Each thread gets a private free list of PGBUF_DEFAULT_FIX_COUNT (7) holders so the common fix path never allocates:

// pgbuf_initialize_thrd_holder -- src/storage/page_buffer.c
thrd_num_total = thread_num_total_threads ();
pgbuf_Pool.thrd_holder_info     = (PGBUF_HOLDER_ANCHOR *) malloc (thrd_num_total * PGBUF_HOLDER_ANCHOR_SIZEOF);
pgbuf_Pool.thrd_reserved_holder = (PGBUF_HOLDER *) malloc (thrd_num_total * PGBUF_DEFAULT_FIX_COUNT * PGBUF_HOLDER_SIZEOF);
/* ... each OOM check; per-thread anchor i: num_hold_cnt=0, num_free_cnt=7, thrd_hold_list=NULL ... */
    pgbuf_Pool.thrd_holder_info[i].thrd_free_list = &(pgbuf_Pool.thrd_reserved_holder[i * PGBUF_DEFAULT_FIX_COUNT]);
    /* ... inner loop chains the 7 reserved holders via next_holder, last == NULL ... */
pthread_mutex_init (&pgbuf_Pool.free_holder_set_mutex, NULL);
pgbuf_Pool.free_holder_set = NULL;  pgbuf_Pool.free_index = -1;   /* -1 == no shared free holder; grow on demand */

The reserved holders are one flat array sliced per thread by i * PGBUF_DEFAULT_FIX_COUNT. When a thread exceeds 7 concurrent fixes, pgbuf_allocate_thrd_holder_entry falls back to the shared free_holder_set, malloc’d in PGBUF_HOLDER_SET blocks (PGBUF_NUM_ALLOC_HOLDER = 10 elements each) and never freed until finalize; free_index == -1 is the “pool empty, grow it” sentinel set here (it was a transient 0 from the §2.1 reset).
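The slicing scheme can be shown in isolation. The sketch below uses simplified stand-in types (HOLDER, HOLDER_ANCHOR) that keep only the fields named in the excerpt; init_holder_pools and free_list_length are hypothetical helpers, and the shared fallback pool is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define DEFAULT_FIX_COUNT 7   /* PGBUF_DEFAULT_FIX_COUNT in the excerpt */

typedef struct holder
{
  struct holder *next_holder;
} HOLDER;

typedef struct
{
  int num_hold_cnt;
  int num_free_cnt;
  HOLDER *thrd_hold_list;
  HOLDER *thrd_free_list;
} HOLDER_ANCHOR;

/* Allocates one anchor per thread and one flat holder array, then gives
 * thread i the slice starting at i * DEFAULT_FIX_COUNT.
 * Returns 0 on success, -1 on allocation failure. */
static int
init_holder_pools (int thread_count, HOLDER_ANCHOR **anchors_out, HOLDER **reserved_out)
{
  HOLDER_ANCHOR *anchors;
  HOLDER *reserved;
  int i, j;

  assert (thread_count > 0);
  anchors = malloc ((size_t) thread_count * sizeof *anchors);
  reserved = malloc ((size_t) thread_count * DEFAULT_FIX_COUNT * sizeof *reserved);
  if (anchors == NULL || reserved == NULL)
    {
      free (anchors);
      free (reserved);
      return -1;
    }
  for (i = 0; i < thread_count; i++)
    {
      HOLDER *slice = &reserved[i * DEFAULT_FIX_COUNT];

      anchors[i].num_hold_cnt = 0;
      anchors[i].num_free_cnt = DEFAULT_FIX_COUNT;
      anchors[i].thrd_hold_list = NULL;
      anchors[i].thrd_free_list = slice;
      /* chain the slice; the last holder terminates the list */
      for (j = 0; j < DEFAULT_FIX_COUNT; j++)
        slice[j].next_holder = (j + 1 < DEFAULT_FIX_COUNT) ? &slice[j + 1] : NULL;
    }
  *anchors_out = anchors;
  *reserved_out = reserved;
  return 0;
}

/* Counts the holders reachable from an anchor's free list. */
static int
free_list_length (const HOLDER_ANCHOR *anchor)
{
  int n = 0;
  const HOLDER *h;

  for (h = anchor->thrd_free_list; h != NULL; h = h->next_holder)
    n++;
  return n;
}
```

Because each slice is terminated with NULL, one thread's free list can never run into its neighbour's slice even though both live in the same allocation.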

2.10 `pgbuf_thread_variables_init` — a worker claims its private LRU index

Called when a worker’s THREAD_ENTRY comes online, this hook wires the worker to its private LRU and holder anchor:

// pgbuf_thread_variables_init -- src/storage/page_buffer.c
if (!thread_p) return;
if (pgbuf_Pool.quota.num_private_LRU_list > 0 && thread_p->private_lru_index != -1)
  thread_p->m_is_private_lru_enabled = true;       /* quota on AND this worker has a private slot */
else
  thread_p->m_is_private_lru_enabled = false;
if (!thread_p->m_holder_anchor)
  thread_p->m_holder_anchor = &pgbuf_Pool.thrd_holder_info[thread_p->index];   /* bind to its slice */

private_lru_index lives on THREAD_ENTRY (default -1), assigned elsewhere when a transaction acquires a private list. This function only interprets it: a worker uses a private LRU iff quota is enabled and its index != -1. The anchor bind is idempotent (guarded by !m_holder_anchor) and gives O(1) access to the §2.9 slice. Vacuum workers and SA-mode fall to false, using shared LRUs only.

2.11 `pgbuf_finalize` — teardown order

Teardown is not the strict reverse of init; it is a flat sequence of NULL-guarded frees, each safe because of the §2.1 invariant:
(1) hash table: destroy all hash_mutexes, free buf_hash_table;
(2) lock table: free buf_lock_table;
(3) BCB table: destroy every BCB mutex, free BCB_table, set num_buffers = 0;
(4) free iopage_table;
(5) LRU lists: destroy every list mutex, free buf_LRU_list;
(6) destroy invalid_mutex;
(7) thrd holder: free thrd_holder_info/thrd_reserved_holder, destroy free_holder_set_mutex, walk and free every lazily-grown free_holder_set block;
(8) victim_cand_list, then Aout (free bufarray, mht_destroy each of num_hashes slots, free aout_buf_ht, destroy Aout_mutex, zero fields), then seq_chkpt_flusher.flush_list;
(9) quota arrays;
(10) monitor arrays + (SERVER_MODE) bcb_locks;
(11) SERVER_MODE: free direct_victims.bcb_victims, delete the two waiter queues and flushed_bcbs;
(12) delete the three _lrus_with_victims queues;
(13) free show_status, destroy its mutex;
(14) thread_clear_all_holder_anchor (), the symmetric undo of §2.10.

C++ objects (lockfree::circular_queue) use delete, not free_and_init, because they were new’d; mixing would corrupt the heap. num_buffers is zeroed early (step 3) so a racing reader sees an empty pool. With every free != NULL-guarded and pointers NULL-initialized in §2.1, finalize is correct whether the pool is fully built or failed mid-init.
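The property that finalize is safe at any partial-build point follows from two habits: NULL every pointer before allocating, and guard every free. A minimal sketch with a hypothetical two-array pool (MINI_POOL and its functions are illustrative and not taken from page_buffer.c; FREE_AND_NULL stands in for the free_and_init idiom):

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical miniature pool with two independently allocated arrays. */
typedef struct
{
  int *hash_table;
  int *lock_table;
  int num_buffers;
} MINI_POOL;

/* Guarded free that also resets the pointer, so a second call is a no-op. */
#define FREE_AND_NULL(p) do { if ((p) != NULL) { free (p); (p) = NULL; } } while (0)

static void
mini_pool_finalize (MINI_POOL *pool)
{
  FREE_AND_NULL (pool->hash_table);
  FREE_AND_NULL (pool->lock_table);
  pool->num_buffers = 0;
}

/* fail_second simulates an out-of-memory condition on the second allocation. */
static int
mini_pool_init (MINI_POOL *pool, int n, int fail_second)
{
  /* reset first, so finalize never sees an uninitialized pointer */
  pool->hash_table = NULL;
  pool->lock_table = NULL;
  pool->num_buffers = 0;

  pool->hash_table = malloc ((size_t) n * sizeof (int));
  if (pool->hash_table == NULL)
    goto error;
  pool->lock_table = fail_second ? NULL : malloc ((size_t) n * sizeof (int));
  if (pool->lock_table == NULL)
    goto error;
  pool->num_buffers = n;
  return 0;

error:
  mini_pool_finalize (pool);   /* frees whatever was built so far */
  return -1;
}
```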

2.12 Chapter summary — key takeaways

Ten sub-initializers, fixed order. pgbuf_initialize zeroes pgbuf_Pool field-by-field (atomics via .store), then calls ten sub-initializers; quota parameters must run first (they fix num_private_LRU_list) and quota/monitor data must run last (sized by PGBUF_TOTAL_LRU_COUNT).
num_buffers is floored, not validated (below MAX_NTRANS * 10 it is silently raised); each LRU zone ratio is independently clamped (lru1 into [0.05, 0.90], lru2 floored at 0.05 then capped so the sum leaves room) and only stored.
BCB and iopage are parallel arrays, symmetrically cross-linked; the next_BCB chain is what the invalid list inherits; the int dummy padding enforces 8-byte iopage alignment on 32-bit builds.
The page table is a fixed 2^20 buckets, each with its own hash_mutex (no global lock); a power-of-two size makes the final hash step a single AND. Lock and holder pools are sized by thread count, not buffer count.
All BCBs start in the invalid list (invalid_cnt == num_buffers, every flag PGBUF_INVALID_ZONE) — one truth stored twice. LRU thresholds start at 0 because every list is empty.
Quota is one integer switch — num_private_LRU_list (-1 auto, 0 disabled, positive floored to 4, 0 in SA mode) drives PGBUF_PAGE_QUOTA_IS_ENABLED; a worker uses a private LRU iff quota is on and its THREAD_ENTRY.private_lru_index != -1.
Finalize is a flat NULL-guarded sequence, safe at any partial-build point; it deletes C++ queues but frees C arrays, and ends by clearing per-thread holder-anchor back-pointers — see Ch. 8 for the daemon set this chapter only constructs.

Chapter 3: The Fix Entry Path and Page-Table Lookup

Every page access enters through pgbuf_fix (compiled as pgbuf_fix_release in release builds, pgbuf_fix_debug under !NDEBUG). This chapter dissects that function as a master state machine, from argument validation to the moment a hit hands off to latching (Chapter 5) or a miss hands off to BCB claim (Chapter 4). For the big-picture flow and the meaning of the zones, flags, and BCB struct, see ### How a page fix flows, ### Page table — VPID hash, and ### Buffer Control Block — PGBUF_BCB in cubrid-page-buffer-manager.md. The fetch mode (PAGE_FETCH_MODE) is the biggest source of branching: its seven values reappear at the lock-free fast path, the miss fork, the page-VPID check, and the PAGE_UNKNOWN switch near the exit.

3.1 The seven PAGE_FETCH_MODE values

// PAGE_FETCH_MODE -- src/storage/page_buffer.h
typedef enum
{
  OLD_PAGE = 0,                 /* must already exist on disk or in buffer */
  NEW_PAGE,                     /* newly allocated; may be created in buffer */
  OLD_PAGE_IF_IN_BUFFER,        /* return only if resident; never fix from disk */
  OLD_PAGE_PREVENT_DEALLOC,     /* fetch + mark to block dealloc */
  OLD_PAGE_DEALLOCATED,         /* deliberately fetch a deallocated page */
  OLD_PAGE_MAYBE_DEALLOCATED,   /* fetch, tolerate deallocated (warn) */
  RECOVERY_PAGE                 /* recovery: new/old/deallocated all valid */
} PAGE_FETCH_MODE;

Mode	Validation skipped?	Miss → claim from disk?	Behaviour on `PAGE_UNKNOWN` page at exit
`OLD_PAGE`	no	yes	`assert(false)` + `ER_ERROR_SEVERITY` `ER_PB_BAD_PAGEID`, unfix, return NULL
`NEW_PAGE`	no	yes (created in buffer)	accepted, returned
`OLD_PAGE_IF_IN_BUFFER`	suppresses errors in `pgbuf_is_valid_page`	no — returns NULL on miss	accepted, returned
`OLD_PAGE_PREVENT_DEALLOC`	no	yes	treated like `OLD_PAGE`: `assert(false)`, unfix, NULL
`OLD_PAGE_DEALLOCATED`	no	yes	accepted, returned
`OLD_PAGE_MAYBE_DEALLOCATED`	no	yes	warning `ER_PB_BAD_PAGEID`, unfix, return NULL
`RECOVERY_PAGE`	bypasses the page-validation block entirely	yes	accepted, returned
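The last column of the table can be read as a three-way classification. The sketch below encodes it as a switch; UNKNOWN_PAGE_ACTION and action_for_unknown_page are hypothetical names introduced for illustration, and the real exit code also performs the unfix and error reporting that are only summarized here.

```c
#include <assert.h>

typedef enum
{
  OLD_PAGE = 0,
  NEW_PAGE,
  OLD_PAGE_IF_IN_BUFFER,
  OLD_PAGE_PREVENT_DEALLOC,
  OLD_PAGE_DEALLOCATED,
  OLD_PAGE_MAYBE_DEALLOCATED,
  RECOVERY_PAGE
} PAGE_FETCH_MODE;

/* Hypothetical result type: what happens to a PAGE_UNKNOWN page at exit. */
typedef enum
{
  UNKNOWN_ACCEPT,          /* page is returned to the caller */
  UNKNOWN_REJECT_WARNING,  /* warning, unfix, return NULL */
  UNKNOWN_REJECT_ERROR     /* assert + error, unfix, return NULL */
} UNKNOWN_PAGE_ACTION;

static UNKNOWN_PAGE_ACTION
action_for_unknown_page (PAGE_FETCH_MODE mode)
{
  switch (mode)
    {
    case NEW_PAGE:
    case OLD_PAGE_IF_IN_BUFFER:
    case OLD_PAGE_DEALLOCATED:
    case RECOVERY_PAGE:
      return UNKNOWN_ACCEPT;
    case OLD_PAGE_MAYBE_DEALLOCATED:
      return UNKNOWN_REJECT_WARNING;
    case OLD_PAGE:
    case OLD_PAGE_PREVENT_DEALLOC:
      return UNKNOWN_REJECT_ERROR;
    }
  return UNKNOWN_REJECT_ERROR;   /* defensive default for out-of-range values */
}
```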

3.2 Argument validation and the unconditional→conditional downgrade

pgbuf_fix_release validates before touching shared state. Four guards fire in order, each an early return NULL. The first two are assert_release (false) checks rejecting an illegal request_mode (non-R/W) or condition; then pgbuf_Pool.monitor.fix_req_cnt is bumped, then the page-validation and pageid guards:

// pgbuf_fix_release -- src/storage/page_buffer.c
if (pgbuf_get_check_page_validation_level (PGBUF_DEBUG_PAGE_VALIDATION_FETCH)
    && fetch_mode != RECOVERY_PAGE)                  /* <- recovery skips validation */
  {
    if (pgbuf_is_valid_page (thread_p, vpid, fetch_mode == OLD_PAGE_IF_IN_BUFFER) != DISK_VALID)
      return NULL;                                   /* IF_IN_BUFFER suppresses errors */
  }
if (vpid->pageid < 0)                                /* <- always-on cheap check */
  {
    er_set (ER_FATAL_ERROR_SEVERITY, ARG_FILE_LINE, ER_PB_BAD_PAGEID, 2, ...);
    return NULL;                                     /* fatal: ER_FATAL_ERROR_SEVERITY */
  }

The page-validation block runs only when debug validation is armed and the mode is not RECOVERY_PAGE (recovery may legitimately fix pages disk metadata says do not exist). For OLD_PAGE_IF_IN_BUFFER the second argument is true, suppressing the error log since “not valid” is normal. The pivotal transformation comes next: if condition == PGBUF_UNCONDITIONAL_LATCH and pgbuf_find_current_wait_msecs (thread_p) is LK_ZERO_WAIT or LK_FORCE_ZERO_WAIT, condition is silently set to PGBUF_CONDITIONAL_LATCH.

Invariant — a zero-wait transaction never blocks on a page latch. The downgrade happens here, before any hashing or latching, and everything downstream keys off condition. Skipping it would let a zero-wait transaction sleep indefinitely in pgbuf_latch_bcb_upon_fix.
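The downgrade is a pure function of the requested condition and the transaction's wait setting. A minimal sketch: LATCH_CONDITION, effective_condition, and the two sentinel values are stand-ins chosen for illustration; the numeric values of LK_ZERO_WAIT and LK_FORCE_ZERO_WAIT are assumed here, not taken from the source.

```c
#include <assert.h>

/* Stand-ins for PGBUF_UNCONDITIONAL_LATCH / PGBUF_CONDITIONAL_LATCH. */
typedef enum
{
  UNCONDITIONAL_LATCH,
  CONDITIONAL_LATCH
} LATCH_CONDITION;

/* Assumed sentinel values, for illustration only. */
#define ZERO_WAIT        0
#define FORCE_ZERO_WAIT  (-2)

/* Returns the condition the rest of the fix path should use. */
static LATCH_CONDITION
effective_condition (LATCH_CONDITION requested, int wait_msecs)
{
  if (requested == UNCONDITIONAL_LATCH
      && (wait_msecs == ZERO_WAIT || wait_msecs == FORCE_ZERO_WAIT))
    return CONDITIONAL_LATCH;   /* zero-wait transactions must never block */
  return requested;             /* everything else passes through unchanged */
}
```

Applying the rewrite once, before hashing, means no later stage has to know about wait settings at all.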

3.3 The `try_again` loop and the interrupt check

Perf tracking is sampled just before the try_again: label, the loop re-entry point:

// pgbuf_fix_release -- src/storage/page_buffer.c
try_again:
  if (logtb_get_check_interrupt (thread_p) == true)
    if (logtb_is_interrupted (thread_p, true, &pgbuf_Pool.check_for_interrupts) == true)
      {
        er_set (ER_ERROR_SEVERITY, ARG_FILE_LINE, ER_INTERRUPTED, 0);
        PGBUF_BCB_CHECK_MUTEX_LEAKS ();   /* <- assert no mutex held on exit */
        return NULL;
      }

The interrupt check sits inside the loop, so every retry re-checks for interruption. Exactly one statement jumps back to try_again: the miss path’s pgbuf_claim_bcb_for_fix returning NULL with its retry out-parameter set (a BCB-claim race; Chapter 4). The §3.2 guards sit above the label and run once.

// pgbuf_fix_release -- src/storage/page_buffer.c (miss fork, retry edge)
bufptr = pgbuf_claim_bcb_for_fix (thread_p, vpid, fetch_mode, hash_anchor, &perf, &retry, false);
if (bufptr == NULL)
  {
    if (retry) { retry = false; goto try_again; }   /* <- the only re-entry */
    ASSERT_ERROR ();
    return NULL;
  }

pgbuf_fix_with_retry is a thin wrapper around pgbuf_fix, not part of the loop. It re-calls pgbuf_fix while it returns NULL, switching on er_errid (): NO_ERROR/ER_INTERRUPTED retry without bumping i; the three timeout errors (ER_LK_UNILATERALLY_ABORTED, ER_LK_PAGE_TIMEOUT, ER_PAGE_LATCH_TIMEDOUT) do i++; anything else sets noretry. The loop breaks (with ER_PAGE_LATCH_ABORTED) once noretry || i > retry — so interrupts never consume retry budget and any other error exits at once.
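The retry accounting described above can be sketched against a stub. Everything here is illustrative: the E_* codes are hypothetical stand-ins for the CUBRID error ids, fix_fn abstracts the pgbuf_fix call together with er_errid, and scripted_fix is a test stub that replays a list of failures before succeeding.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical error codes standing in for the ER_* ids named in the text. */
enum
{
  E_NONE, E_INTERRUPTED, E_UNILATERALLY_ABORTED, E_PAGE_TIMEOUT,
  E_LATCH_TIMEDOUT, E_LATCH_ABORTED, E_OTHER
};

/* One fix attempt: returns a page or NULL, reporting the error via err_out. */
typedef void *(*fix_fn) (void *ctx, int *err_out);

static void *
fix_with_retry (fix_fn fix, void *ctx, int retry, int *final_err)
{
  int i = 0;
  bool noretry = false;

  for (;;)
    {
      int err = E_NONE;
      void *page = fix (ctx, &err);

      if (page != NULL)
        {
          *final_err = E_NONE;
          return page;
        }
      switch (err)
        {
        case E_NONE:
        case E_INTERRUPTED:
          break;                /* retry without spending budget */
        case E_UNILATERALLY_ABORTED:
        case E_PAGE_TIMEOUT:
        case E_LATCH_TIMEDOUT:
          i++;                  /* spends one unit of budget */
          break;
        default:
          noretry = true;       /* any other error exits at once */
          break;
        }
      if (noretry || i > retry)
        {
          *final_err = E_LATCH_ABORTED;
          return NULL;
        }
    }
}

/* Test stub: fails with each scripted error in turn, then succeeds. */
typedef struct
{
  const int *errors;
  int count;
  int calls;
} SCRIPT;

static int dummy_page;

static void *
scripted_fix (void *ctx, int *err_out)
{
  SCRIPT *s = ctx;

  if (s->calls < s->count)
    {
      *err_out = s->errors[s->calls++];
      return NULL;
    }
  s->calls++;
  return &dummy_page;
}
```

With retry = 1, two consecutive timeouts exhaust the budget on the second attempt, while any number of interrupts is absorbed as long as a later attempt succeeds.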

3.4 Hashing the VPID

The page table is indexed by PGBUF_HASH_VALUE, calling pgbuf_hash_func_mirror:

// pgbuf_hash_func_mirror -- src/storage/page_buffer.c
#define HASH_SIZE_BITS 20                            /* 2^20 ~ 1M anchors, fixed */
#define VOLID_LSB_BITS 8
  reverse_mask = 1 << (HASH_SIZE_BITS - 1);          /* top bit of the 20-bit space */
  for (i = VOLID_LSB_BITS; i > 0; i--)               /* bit-reverse low 8 volid bits */
    { if (volid_lsb & lsb_mask) reversed_volid_lsb |= reverse_mask;
      reverse_mask >>= 1; lsb_mask <<= 1; }
  hash_val = vpid->pageid ^ reversed_volid_lsb;      /* XOR pageid with mirrored volid */
  hash_val = hash_val & ((1 << HASH_SIZE_BITS) - 1); /* clamp to 2^20 buckets */

The “mirror” trick bit-reverses the low 8 volid bits into the top of the 20-bit space, then XORs with the pageid (which dominates the low bits), so different volumes get disjoint high-bit signatures and adjacent ids across volumes do not share chains.
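A self-contained version of the mirror hash makes the bit placement easy to check by hand. This is a sketch reconstructed from the excerpt above: hash_mirror is a hypothetical name, and the parameter types (32-bit pageid, 16-bit volid) are assumptions.

```c
#include <assert.h>
#include <stdint.h>

#define HASH_SIZE_BITS 20
#define VOLID_LSB_BITS 8

/* Bit-reverses the low 8 bits of volid into the top of the 20-bit space,
 * XORs with pageid, and masks to 2^20 buckets. */
static unsigned int
hash_mirror (int32_t pageid, int16_t volid)
{
  unsigned int volid_lsb = (unsigned int) volid & ((1u << VOLID_LSB_BITS) - 1);
  unsigned int lsb_mask = 1u;
  unsigned int reverse_mask = 1u << (HASH_SIZE_BITS - 1);
  unsigned int reversed_volid_lsb = 0;
  int i;

  for (i = VOLID_LSB_BITS; i > 0; i--)
    {
      if (volid_lsb & lsb_mask)
        reversed_volid_lsb |= reverse_mask;   /* bit k lands at bit 19 - k */
      reverse_mask >>= 1;
      lsb_mask <<= 1;
    }
  return ((unsigned int) pageid ^ reversed_volid_lsb) & ((1u << HASH_SIZE_BITS) - 1);
}
```

Volume id 1 sets bit 19, volume id 2 sets bit 18, and volume id 128 sets bit 12, so low-numbered volumes differ in the highest bucket bits while small page ids differ in the lowest.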

Two sibling helpers serve the Aout victim-history mht table (Chapter 7), not the main page table: pgbuf_hash_vpid is a generic modulo hash, ((vpid->pageid | ((unsigned int) vpid->volid) << 24) % htsize), and pgbuf_compare_vpid is its ordering callback (same volume ⇒ pageid difference, else volid difference). The main buf_hash_table uses pgbuf_hash_func_mirror only, comparing via VPID_EQ.

3.5 The lock-free read-only fast path

Before grabbing any anchor mutex, a read fix of a present page tries to fix without locking. The guard requires all of: request_mode == PGBUF_LATCH_READ, fetch_mode in the three eligible modes, and condition == PGBUF_UNCONDITIONAL_LATCH. Because of the last condition, a zero-wait transaction is ineligible after the §3.2 downgrade. On a non-NULL pgbuf_lockfree_fix_ro it bumps num_hit and goto fast_path, bypassing the hash walk and the latch pass. The function does a lock-free chain walk, then a CAS on the BCB latch word:

// pgbuf_lockfree_fix_ro -- src/storage/page_buffer.c
bufptr = pgbuf_search_hash_chain_no_bcb_lock (thread_p,
           &pgbuf_Pool.buf_hash_table[PGBUF_HASH_VALUE (vpid)], vpid);
if (bufptr == NULL) return NULL;                    /* not resident -> slow path */
do {
    impl = get_impl (&bufptr->atomic_latch); new_impl = impl;
    if (impl.impl.latch_mode != PGBUF_LATCH_READ     /* must already be read-latched */
        || impl.impl.waiter_exists || impl.impl.fcnt == 0  /* no writer queued, still held */
        || bufptr->vpid.pageid != vpid->pageid       /* re-validate identity ... */
        || bufptr->vpid.volid != vpid->volid)        /* ... against ABA reuse */
      return NULL;                                    /* any failure -> slow path */
    new_impl.impl.fcnt++;                             /* bump fix count */
} while (!bufptr->atomic_latch.compare_exchange_weak (impl.raw, new_impl.raw,
           std::memory_order_acq_rel, std::memory_order_acquire));

Invariant — the fast path only adds a reader to an already-read-held BCB. The CAS refuses unless the latch is PGBUF_LATCH_READ, fcnt != 0, and has no waiter — so it never upgrades from free/write, never starves a queued writer, and the in-loop VPID re-check defeats ABA. Any failure returns NULL to the slow path; never an error.
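The same add-a-reader CAS can be written with C11 atomics. This is a sketch, not the CUBRID layout: the bit packing (mode in bits 0-1, waiter flag in bit 2, fix count in bits 16-31), MINI_BCB, and try_add_reader are all hypothetical, and fix-count overflow is ignored.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical packing of the latch word, for illustration only. */
enum { LATCH_NONE = 0, LATCH_READ = 1, LATCH_WRITE = 2 };
#define MODE_MASK   0x3u
#define WAITER_BIT  0x4u
#define FCNT_SHIFT  16
#define FCNT_ONE    (1u << FCNT_SHIFT)

typedef struct
{
  _Atomic uint32_t latch;
  int32_t pageid;
  int16_t volid;
} MINI_BCB;

static uint32_t
pack_latch (unsigned int mode, bool waiter, unsigned int fcnt)
{
  return (mode & MODE_MASK) | (waiter ? WAITER_BIT : 0u) | ((uint32_t) fcnt << FCNT_SHIFT);
}

static unsigned int
latch_fcnt (uint32_t word)
{
  return word >> FCNT_SHIFT;
}

/* Adds one reader iff the BCB is already read-latched, still held, has no
 * waiter, and still carries the requested identity. Returns false otherwise,
 * which the caller treats as "take the slow path", never as an error. */
static bool
try_add_reader (MINI_BCB *bcb, int32_t pageid, int16_t volid)
{
  uint32_t old = atomic_load_explicit (&bcb->latch, memory_order_acquire);

  for (;;)
    {
      if ((old & MODE_MASK) != LATCH_READ || (old & WAITER_BIT) != 0
          || latch_fcnt (old) == 0
          || bcb->pageid != pageid || bcb->volid != volid)
        return false;
      /* on failure the CAS reloads old, and the checks run again */
      if (atomic_compare_exchange_weak_explicit (&bcb->latch, &old, old + FCNT_ONE,
                                                 memory_order_acq_rel,
                                                 memory_order_acquire))
        return true;
    }
}
```

Re-running every check after a failed CAS is what keeps the fast path from adding a reader to a latch that changed mode or gained a waiter in the meantime.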

The chain walk it uses, pgbuf_search_hash_chain_no_bcb_lock, is bare: it pointer-chases hash_anchor->hash_next returning the first VPID_EQ match, with no mutex or trylock — the CAS above does all synchronization.

On a successful CAS the function still has holder bookkeeping to do before returning the page: pgbuf_find_thrd_holder either finds the caller is already a holder (bump holder->fix_count, set hold_has_read_latch) or, in SERVER_MODE, allocates a fresh holder via pgbuf_allocate_thrd_holder_entry (a NULL return there is assert(false) followed by return NULL). Only then does CAST_BFPTR_TO_PGPTR produce the PAGE_PTR and the caller reaches fast_path:.

3.6 The locked hash-chain walk and the hit/miss fork

If the fast path is skipped or returns NULL, the slow path sets hash_anchor, clears buf_lock_acquired, and calls pgbuf_search_hash_chain. If that returns a direct-victim BCB (pgbuf_bcb_is_direct_victim), pgbuf_bcb_update_flags (..., PGBUF_BCB_INVALIDATE_DIRECT_VICTIM_FLAG, ...) tells the victim-waiter it cannot use this BCB.

The anchor it walks is one slot of the buf_hash_table[], a PGBUF_BUFFER_HASH:

Field	Role	Why
`hash_mutex`	per-bucket `pthread_mutex_t` (`SERVER_MODE` only)	Serializes chain insert/remove and the buffer-lock chain; the only mutex phase two holds while walking. Per-bucket, not global, so different buckets hash concurrently.
`hash_next`	head of the BCB hash chain (`PGBUF_BCB *`)	The chain `pgbuf_search_hash_chain` pointer-chases via each BCB’s own `hash_next`; resident pages for this bucket live here.
`lock_next`	head of the buffer-lock chain (`PGBUF_BUFFER_LOCK *`)	Records VPIDs a thread has claimed but not yet inserted (the miss path, Chapter 4), so a second fixer for the same VPID waits instead of double-claiming. Also protected by `hash_mutex`.

pgbuf_search_hash_chain is the workhorse: a two-phase search with an exact return contract — non-NULL ⇒ caller holds bufptr->mutex (not the hash mutex); NULL ⇒ caller holds hash_anchor->hash_mutex.

Phase one (one_phase:) walks the chain without the hash mutex, trying a non-blocking PGBUF_BCB_TRYLOCK on the matched BCB. The load-bearing core, per matched bufptr:

// pgbuf_search_hash_chain -- src/storage/page_buffer.c (one_phase core)
      rv = PGBUF_BCB_TRYLOCK (bufptr);
      if (rv != 0)
        { if (rv != EBUSY) goto two_phase;     /* trylock error -> escalate */
          PGBUF_BCB_LOCK (bufptr); }           /* EBUSY -> block on the bcb mutex */
      if (!VPID_EQ (&(bufptr->vpid), vpid))    /* bcb reused under us? */
        { PGBUF_BCB_UNLOCK (bufptr); goto one_phase; }   /* <- restart phase 1 */
      break;                                   /* matched + locked -> return bufptr */

For a matched BCB, three branches leave phase one (Figure 3-1): clean trylock + VPID recheck (return bufptr); EBUSY → blocking PGBUF_BCB_LOCK then recheck; and a non-EBUSY error escalating via goto two_phase. A walk that reaches the end of the chain without a match also proceeds to two_phase. The post-lock recheck catches a slot repurposed between match and lock.

Phase two (two_phase:/try_again:) re-runs the same walk under the hash mutex, differing in three points: on a clean trylock it unlocks the hash mutex before returning; on EBUSY it unlocks the hash mutex before the blocking PGBUF_BCB_LOCK and re-validates via goto try_again; and a non-EBUSY failure is fatal — er_set_with_oserror (ER_CSS_PTHREAD_MUTEX_TRYLOCK) then return NULL.

Invariant — lock ordering is hash mutex then BCB mutex, never the reverse. Phase two always drops the hash mutex before a blocking PGBUF_BCB_LOCK; inverting it would deadlock insert/remove paths. The ER_CSS_PTHREAD_MUTEX_TRYLOCK branch is the one place the function returns NULL while not holding the hash mutex — a fatal OS failure.

flowchart TD
  A["pgbuf_search_hash_chain"] --> B["one_phase: walk chain, no hash mutex"]
  B --> C{"VPID match?"}
  C -- "no, end of chain" --> TP["two_phase"]
  C -- "yes" --> D["PGBUF_BCB_TRYLOCK"]
  D -- "rv==0" --> E{"VPID still equal?"}
  D -- "EBUSY" --> F["PGBUF_BCB_LOCK block"]
  D -- "other err" --> TP
  F --> E
  E -- "no, reused" --> B
  E -- "yes" --> R1["return bufptr, holds bcb mutex"]
  TP --> G["lock hash_mutex; walk chain"]
  G --> H{"VPID match?"}
  H -- "no, end" --> R2["return NULL, holds hash mutex"]
  H -- "yes" --> I["PGBUF_BCB_TRYLOCK"]
  I -- "rv==0 or EBUSY" --> JK["unlock hash_mutex; if EBUSY PGBUF_BCB_LOCK"]
  I -- "other err" --> ERR["fatal: return NULL"]
  JK --> L{"VPID still equal?"}
  L -- "no" --> G
  L -- "yes" --> R3["return bufptr, holds bcb mutex"]

Back in pgbuf_fix_release, the returned bufptr drives the hit/miss fork into three outcomes:

Hit (bufptr != NULL): increment num_hit; if NEW_PAGE, assert the page is clean-LSA or dirty (a NEW_PAGE re-using a buffered, invalidated page). Control falls through to pgbuf_bcb_register_fix and the latch pass (Chapter 5).
OLD_PAGE_IF_IN_BUFFER miss: this mode never reads from disk, so unlock the hash mutex and return NULL — the only mode that short-circuits a miss.
General miss: call pgbuf_claim_bcb_for_fix (Chapter 4). On NULL with retry, goto try_again; on NULL without retry, ASSERT_ERROR and return NULL; on success, set buf_lock_acquired = true and continue to the page-VPID check.

3.7 Post-claim VPID re-check and the `maybe_deallocated` branch

After a hit or a successful claim the caller holds bufptr->mutex; pgbuf_bcb_register_fix and pgbuf_set_bcb_page_vpid run, then page identity is re-validated:

// pgbuf_fix_release -- src/storage/page_buffer.c
maybe_deallocated = (fetch_mode == OLD_PAGE_MAYBE_DEALLOCATED);
if (pgbuf_check_bcb_page_vpid (bufptr, maybe_deallocated) != true)
  {
    if (buf_lock_acquired)
      { pgbuf_put_bcb_into_invalid_list (thread_p, bufptr);     /* releases bcb mutex */
        (void) pgbuf_unlock_page (thread_p, hash_anchor, vpid, true); }
    else
      { PGBUF_BCB_UNLOCK (bufptr); }                            /* hit case: just unlock */
    PGBUF_BCB_CHECK_MUTEX_LEAKS ();
    return NULL;
  }
if (fetch_mode == OLD_PAGE_PREVENT_DEALLOC)
  pgbuf_bcb_register_avoid_deallocation (bufptr);              /* pin against dealloc */

The maybe_deallocated flag relaxes pgbuf_check_bcb_page_vpid so a deallocated VPID is not a failure for OLD_PAGE_MAYBE_DEALLOCATED. The cleanup branch differs by ownership: a fresh claim (buf_lock_acquired) is recycled to the invalid list and the page lock dropped; a hit only unlocks the BCB. Past here the function enters the latch pass (Chapter 5) and, on success, jumps to fast_path: where the §3.1 PAGE_UNKNOWN switch runs — the last place fetch mode steers the result.

3.8 Chapter summary — key takeaways

pgbuf_fix_release is a state machine: four early-return validations, then a try_again loop whose only re-entry edge is a BCB-claim race via pgbuf_claim_bcb_for_fix’s retry out-parameter.
A zero-wait transaction (LK_ZERO_WAIT/LK_FORCE_ZERO_WAIT) has its unconditional fix rewritten to conditional before hashing.
The lock-free fast path (pgbuf_lockfree_fix_ro) covers read latches in the three eligible modes under an unconditional request; its CAS only adds a reader to an already-read-held, waiter-free BCB, re-validating the VPID against ABA.
pgbuf_hash_func_mirror bit-reverses the low 8 volid bits into the top of a 20-bit space and XORs with pageid; pgbuf_hash_vpid/ pgbuf_compare_vpid belong to the separate Aout mht table.
pgbuf_search_hash_chain is two-phase; its return contract (non-NULL ⇒ holds BCB mutex; NULL ⇒ holds hash mutex) and strict hash-then-BCB lock ordering are load-bearing invariants.
The hit/miss fork: hit → latch pass (Ch. 5); OLD_PAGE_IF_IN_BUFFER miss → immediate NULL; general miss → BCB claim (Ch. 4) with a retry.
Fetch mode steers four points — fast-path eligibility, the miss short-circuit, the maybe_deallocated re-check, and the final PAGE_UNKNOWN switch — so any fix bug starts with the caller’s mode.

Chapter 4: Miss Handling BCB Claim and the PGBUF Allocation Lock

Chapter 3 left us at the point where a pgbuf_fix lookup finds that the page is not in the page table. This chapter answers: how does a thread reserve the VPID against racing allocators, obtain a fresh BCB from the invalid list or a victim, read the page bytes, and insert the BCB into the hash chain? The companion (cubrid-page-buffer-manager.md §“How a page fix flows”, §“PGBUF lock”) sketches Step 1 / Step 2; here we trace every branch, assuming the reader knows the PGBUF_BCB layout, the five zones, and the victim sources from Chapters 1-3 and the companion’s §“LFCQ”.

The miss path is a four-layer onion: pgbuf_claim_bcb_for_fix (outer coordinator) takes the per-bucket VPID lock via pgbuf_lock_page, calls pgbuf_allocate_bcb (source selector: invalid list, then victim, then sleep on the direct-victim queue), whose cheapest source is pgbuf_get_bcb_from_invalid_list. Victim search (pgbuf_get_victim) and the direct-victim hand-off are Chapter 9 black boxes.

4.1 `pgbuf_invalid_list` — the free pool, every field

The invalid list is the pool of BCBs bound to no page (all BCBs at server start; error-rolled-back or invalidated BCBs at runtime). It is a LIFO stack guarded by one mutex.

// struct pgbuf_invalid_list -- src/storage/page_buffer.c
struct pgbuf_invalid_list
{
#if defined(SERVER_MODE)
  pthread_mutex_t invalid_mutex; /* integrity of the singly-linked list */
#endif
  PGBUF_BCB *invalid_top;        /* head of the list (LIFO) */
  int invalid_cnt;               /* # of entries */
};

Field	Role	Why it exists
`invalid_mutex`	Serializes push/pop of the stack	Without it two poppers could grab the same head. SERVER_MODE only — SA mode is single-threaded.
`invalid_top`	Head pointer; chain runs through `bufptr->next_BCB`	An invalid BCB is on no LRU list, so it reuses its `next_BCB` LRU pointer as the invalid-chain link — no separate field.
`invalid_cnt`	Live count of free BCBs	Read by quota math (`pgbuf_adjust_quotas`, Ch 10). It is not the emptiness probe: the unlocked check in `pgbuf_get_bcb_from_invalid_list` tests `invalid_top`.

Invariant — invalid_top chains exclusively through next_BCB, and a BCB is on the invalid list iff its zone is PGBUF_INVALID_ZONE. pgbuf_get_bcb_from_invalid_list flips the popped BCB to PGBUF_VOID_ZONE; pgbuf_put_bcb_into_invalid_list flips it back and asserts (bufptr->flags & PGBUF_BCB_FLAGS_MASK) == 0 — a BCB returning to the pool must carry no dirty/flushing/victim flag. If violated, the BCB re-enters the free pool still advertising a pending flush, and a later claimer treats stale page bytes as a clean fresh page.

flowchart LR
  IT["invalid_top"] --> B1["BCB a"]
  B1 -->|next_BCB| B2["BCB b"]
  B2 -->|next_BCB| B3["BCB c"]
  B3 -->|next_BCB| NUL["NULL"]

Figure 4-1 — The invalid list is a LIFO stack threaded through each BCB’s next_BCB pointer; invalid_cnt tracks its length.

`pgbuf_get_bcb_from_invalid_list` — double-checked-locking pop

This is pgbuf_allocate_bcb’s cheapest source. It pops one BCB behind an unlocked emptiness check, so the common “pool empty” case never touches the mutex.

// pgbuf_get_bcb_from_invalid_list -- src/storage/page_buffer.c
  if (pgbuf_Pool.buf_invalid_list.invalid_top == NULL)   /* (1) fast path: empty */
    return NULL;                                         /*     no mutex taken */
  rv = pthread_mutex_lock (&pgbuf_Pool.buf_invalid_list.invalid_mutex);
  if (pgbuf_Pool.buf_invalid_list.invalid_top == NULL)   /* (2) re-check under mutex */
    { pthread_mutex_unlock (...); return NULL; }          /*     someone emptied it */
  else                                                   /* (3) pop the LIFO top */
    { bufptr = pgbuf_Pool.buf_invalid_list.invalid_top;
      pgbuf_Pool.buf_invalid_list.invalid_top = bufptr->next_BCB; /* advance head */
      pgbuf_Pool.buf_invalid_list.invalid_cnt -= 1;
      pthread_mutex_unlock (...);
      PGBUF_BCB_LOCK (bufptr);                            /* now hold bufptr->mutex */
      bufptr->next_BCB = NULL;                            /* sever invalid-chain link */
      pgbuf_bcb_change_zone (thread_p, bufptr, 0, PGBUF_VOID_ZONE);  /* INVALID -> VOID */
      return bufptr; }

Three branches: (1) unlocked empty check returns NULL (no mutex). (2) post-mutex re-check returns NULL if a racing popper drained the list between the two reads. (3) the pop advances invalid_top, decrements invalid_cnt, drops the list mutex, then locks the BCB, nulls its chain link, and flips it to PGBUF_VOID_ZONE, returned under bufptr->mutex.
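The three branches above can be sketched as a small free list. This is a minimal illustration, not CUBRID's code: `Node` and `FreeList` are hypothetical names, and the head pointer is a `std::atomic` here (an assumption of the sketch) so that the unlocked probe is a well-defined read in standard C++.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>

// Hypothetical stand-in for a BCB: only the link the free list threads through.
struct Node {
  int id = 0;
  Node *next = nullptr;
};

// Illustrative LIFO free list with a double-checked pop.
class FreeList {
public:
  void push(Node *n) {
    std::lock_guard<std::mutex> guard(mutex_);
    n->next = top_.load(std::memory_order_relaxed);
    top_.store(n, std::memory_order_release);
    ++count_;
  }

  Node *pop() {
    // (1) Unlocked probe: an empty list costs one atomic load and no mutex.
    if (top_.load(std::memory_order_acquire) == nullptr)
      return nullptr;
    std::lock_guard<std::mutex> guard(mutex_);
    // (2) Re-check under the mutex: a racing popper may have drained the list.
    Node *n = top_.load(std::memory_order_relaxed);
    if (n == nullptr)
      return nullptr;
    // (3) Pop the LIFO top, update the count, sever the chain link.
    top_.store(n->next, std::memory_order_release);
    --count_;
    n->next = nullptr;
    return n;
  }

  int count() {
    std::lock_guard<std::mutex> guard(mutex_);
    return count_;
  }

private:
  std::mutex mutex_;
  std::atomic<Node *> top_{nullptr};
  int count_ = 0;
};
```

The sketch stops where the real function continues: it does not lock the popped element or change any zone.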

4.2 `pgbuf_claim_bcb_for_fix` — the outer coordinator

The fix path calls this on a page-table miss. Its contract is unusual: it is entered holding hash_anchor->hash_mutex and may exit having released it, having set *try_again, or having returned a fully-loaded BCB under bufptr->mutex. The four exit branches:

// pgbuf_claim_bcb_for_fix -- src/storage/page_buffer.c
  /* Branch A: a prior trylock on the bucket failed -> bail, no retry. */
  if (er_errid () == ER_CSS_PTHREAD_MUTEX_TRYLOCK)
    { pthread_mutex_unlock (&hash_anchor->hash_mutex); return NULL; }
  /* Branch B: take the VPID lock; hash_mutex is released inside. */
  if (!already_locked && pgbuf_lock_page (...) != PGBUF_LOCK_HOLDER)
    { *try_again = true; return NULL; }            /* <- LOSER of a same-VPID race */
  bufptr = pgbuf_allocate_bcb (thread_p, vpid);
  if (bufptr == NULL)                              /* Branch C: pool dirty / interrupted */
    { ASSERT_ERROR (); (void) pgbuf_unlock_page (..., true); return NULL; }
  /* Branch D: success. Scrub the fresh BCB. */
  bufptr->vpid = *vpid;
  /* atomic_latch <- {PGBUF_NO_LATCH, waiter=false, fcnt=0}; clears stale victim latch */
  pgbuf_bcb_update_flags (..., 0, PGBUF_BCB_ASYNC_FLUSH_REQ);   /* clear stray flag */
  LSA_SET_NULL (&bufptr->oldest_unflush_lsa);                   /* nothing unflushed yet */

Branch A: a failed trylock on the bucket leaves ER_CSS_PTHREAD_MUTEX_TRYLOCK in the error slot; drop the mutex and return NULL without touching *try_again. The caller pre-initialized it to false, so its goto try_again does not fire and the NULL propagates out of the fix; nothing re-drives the lookup, because the try_again loop’s only re-entry edge is Branch B.
Branch B is the race protocol. already_locked is true only for the dealloc-aware caller. HOLDER: we own the VPID and fall through. WAITER: another thread is allocating it; we already slept in pgbuf_lock_page, so set *try_again = true, and the caller’s goto try_again re-runs the lookup and hits the BCB the winner inserted.
Branch C (allocation failure, §4.4) must undo the VPID lock. It holds no mutex at that point, so pgbuf_unlock_page(..., true) re-acquires hash_anchor->hash_mutex to unlink the record.
Branch D scrubs the BCB; the atomic_latch reset clears a victim’s stale PGBUF_LATCH_INVALID (the real latch is acquired in Chapter 5). Bytes are then loaded (§4.5).

flowchart TD
  S["enter holding hash_mutex"] --> A{"errid == TRYLOCK?"}
  A -- yes --> AR["unlock hash_mutex; return NULL\ntry_again untouched, stays false"]
  A -- no --> B{"pgbuf_lock_page\n== HOLDER?"}
  B -- "WAITER" --> BR["try_again=true; return NULL\n-> caller re-looks-up, hits"]
  B -- "HOLDER" --> C["pgbuf_allocate_bcb"]
  C --> D{"bufptr == NULL?"}
  D -- yes --> DR["unlock_page need_hash=true\nreturn NULL; propagate error"]
  D -- no --> E["init BCB; load bytes -> 4.5"]
  E --> G["return BCB under bufptr->mutex"]

Figure 4-2 — pgbuf_claim_bcb_for_fix branch map. Only B-WAITER and the success branch leave the caller a different state to act on; both error branches return NULL with try_again false, and only the allocation-failure branch has a VPID lock to unwind.
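As a compact restatement of the branch map, the four exits can be modeled as a pure function of three inputs. This is a hedged sketch only: `classify_claim`, `ClaimOutcome`, and the boolean inputs are hypothetical names, and the `already_locked` shortcut is deliberately left out.

```cpp
#include <cassert>

enum class LockResult { Holder, Waiter };
enum class ClaimExit { TrylockBail, LoserRetry, AllocFailed, Success };

struct ClaimOutcome {
  ClaimExit exit;
  bool try_again;         // value of the caller's retry flag on return
  bool returns_bcb;       // true only when a loaded BCB comes back
  bool vpid_lock_undone;  // true when this exit must release a VPID lock it took
};

// Models which exit the claim path takes; no locking or I/O is performed.
ClaimOutcome classify_claim(bool trylock_error_pending, LockResult lock,
                            bool alloc_ok) {
  if (trylock_error_pending)          // Branch A: bail before any VPID lock
    return {ClaimExit::TrylockBail, false, false, false};
  if (lock == LockResult::Waiter)     // Branch B: loser of a same-VPID race
    return {ClaimExit::LoserRetry, true, false, false};
  if (!alloc_ok)                      // Branch C: no buffer, unwind the VPID lock
    return {ClaimExit::AllocFailed, false, false, true};
  // Branch D: success; the VPID lock is released later, after hash insertion.
  return {ClaimExit::Success, false, true, false};
}
```

Only the loser exit sets the retry flag, which is why it is the only edge back into the lookup.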

4.3 `pgbuf_lock_page` / `pgbuf_unlock_page` — the VPID race lock

The PGBUF lock is not the BCB latch and not the bucket mutex. It is a logical lock keyed on the VPID, on a chain hanging off the same hash bucket as the BCB chain: when no BCB exists yet for a VPID, it ensures exactly one thread allocates it. The lock record pgbuf_buffer_lock is statically pre-allocated one per thread (no malloc on the hot path):

// struct pgbuf_buffer_lock -- src/storage/page_buffer.c
struct pgbuf_buffer_lock
{
  VPID vpid;                    /* the VPID being allocated */
  PGBUF_BUFFER_LOCK *lock_next; /* next record on this bucket's lock chain */
#if defined(SERVER_MODE)
  THREAD_ENTRY *next_wait_thrd; /* head of the list of threads blocked on this VPID */
#endif
};

pgbuf_lock_page is entered holding hash_anchor->hash_mutex and always releases it before returning. Two branches:

// pgbuf_lock_page -- src/storage/page_buffer.c
  for (cur = hash_anchor->lock_next; cur != NULL; cur = cur->lock_next)
    if (VPID_EQ (&cur->vpid, vpid))               /* LOSER: VPID already being allocated */
      {
        cur_thrd_entry->next_wait_thrd = cur->next_wait_thrd;
        cur->next_wait_thrd = cur_thrd_entry;     /* push at head of waiter list */
        pgbuf_sleep (cur_thrd_entry, &hash_anchor->hash_mutex);  /* releases mutex, sleeps */
        if (cur_thrd_entry->resume_status != THREAD_PGBUF_RESUMED)
          { /* woke for interrupt: re-take mutex, splice self out of waiter list */ }
        return PGBUF_LOCK_WAITER;
      }
  /* WINNER: VPID absent. Claim this thread's static record. */
  cur = &pgbuf_Pool.buf_lock_table[cur_thrd_entry->index];
  cur->vpid = *vpid; cur->next_wait_thrd = NULL;
  cur->lock_next = hash_anchor->lock_next; hash_anchor->lock_next = cur;  /* link at head */
  pthread_mutex_unlock (&hash_anchor->hash_mutex);
  return PGBUF_LOCK_HOLDER;

Loser branch (VPID on the chain): push onto the head of that record’s next_wait_thrd list (head insertion, so the list is LIFO rather than FIFO; the order is immaterial because unlock wakes every waiter); pgbuf_sleep releases hash_mutex and suspends. The resume_status != THREAD_PGBUF_RESUMED sub-branch handles an interrupt that woke the thread without the winner unlocking: it re-takes hash_mutex and splices itself out of the waiter list so a later pgbuf_unlock_page does not wake a departed thread. Result is WAITER. Winner branch (VPID absent): claim this thread’s static record, link at the chain head, return HOLDER.

Invariant — at most one BCB-less allocation per VPID is in flight. Enforced because the winner installs its record under hash_mutex before releasing it, so any later scanner under the same mutex sees the record and becomes a waiter. Two winners would create two BCBs for one VPID — the lookup would be nondeterministic and one copy’s writes lost.

pgbuf_unlock_page is the mirror. need_hash_mutex says whether to acquire hash_anchor->hash_mutex itself (error paths, true) or whether the caller already holds it (success path after pgbuf_insert_into_hash_chain, false).

// pgbuf_unlock_page -- src/storage/page_buffer.c
  if (need_hash_mutex) pthread_mutex_lock (&hash_anchor->hash_mutex);
  /* find this VPID's record; if found, unlink it ... */
  if (cur != NULL)
    {
      /* splice out of lock_next chain */
      pthread_mutex_unlock (&hash_anchor->hash_mutex);
      while ((t = cur->next_wait_thrd) != NULL)   /* wake EVERY waiter */
        { cur->next_wait_thrd = t->next_wait_thrd; t->next_wait_thrd = NULL;
          pgbuf_wakeup_uncond (t); }
    }
  else pthread_mutex_unlock (&hash_anchor->hash_mutex);  /* record gone (error case) */

It unlinks the record, drops the mutex, then wakes all waiters. Each woken loser re-runs the fix, finds the BCB in the table (inserted before unlocking on the success path), and proceeds via the Chapter 3 hash-hit path. Waking after dropping the mutex avoids immediate re-contention.
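The holder/waiter protocol can be illustrated with a single-threaded model of one bucket's lock chain. It is a sketch under stated assumptions: `VpidLockChain`, integer VPIDs, and integer thread ids are hypothetical, and real code performs these steps under hash_mutex and actually suspends and wakes threads.

```cpp
#include <algorithm>
#include <cassert>
#include <map>
#include <vector>

enum class LockResult { Holder, Waiter };

// Hypothetical model: one record per in-flight VPID, each with its waiter list.
class VpidLockChain {
public:
  // First caller for a vpid becomes the holder; later callers are queued.
  LockResult lock(int vpid, int thread_id) {
    auto it = records_.find(vpid);
    if (it != records_.end()) {
      it->second.push_back(thread_id);
      return LockResult::Waiter;
    }
    records_.emplace(vpid, std::vector<int>{});
    return LockResult::Holder;
  }

  // An interrupted waiter removes itself so a later unlock does not wake it.
  void abandon(int vpid, int thread_id) {
    auto it = records_.find(vpid);
    if (it == records_.end())
      return;
    std::vector<int> &w = it->second;
    w.erase(std::remove(w.begin(), w.end(), thread_id), w.end());
  }

  // Removes the record and returns every waiter that must be woken.
  std::vector<int> unlock(int vpid) {
    std::vector<int> woken;
    auto it = records_.find(vpid);
    if (it != records_.end()) {
      woken = it->second;
      records_.erase(it);
    }
    return woken;
  }

  bool is_locked(int vpid) const { return records_.count(vpid) != 0; }

private:
  std::map<int, std::vector<int>> records_;
};
```

Because a record exists from the moment the first caller returns until unlock removes it, a second caller for the same VPID can never become a holder in between.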

4.4 `pgbuf_allocate_bcb` — the source selector

The VPID-lock winner needs an actual BCB. The selector tries three sources in cost order.

// pgbuf_allocate_bcb -- src/storage/page_buffer.c
  bufptr = pgbuf_get_bcb_from_invalid_list (thread_p);   /* Source 1: free list, cheapest */
  if (bufptr != NULL) return bufptr;        /* short-circuit: SKIPS the 'end:' victimize */
  bufptr = pgbuf_get_victim (thread_p);                  /* Source 2: scan LFCQs */
  if (bufptr != NULL) goto end;             /* victim still needs pgbuf_victimize_bcb */

Source 1 short-circuits: a BCB off the invalid list (§4.1) is bound to no page and carries no flags, so it returns immediately without reaching end: and is never victimized. Source 2 differs. By the time pgbuf_get_victim returns, the victim is already unlinked from its LRU list and in PGBUF_VOID_ZONE, because pgbuf_get_victim_from_lru_list calls pgbuf_remove_from_lru_list (unlink plus zone flip) before handing the BCB back (§4.6). What remains is the BCB’s link in the hash chain and its old latch, and pgbuf_victimize_bcb at end: is what detaches the one and invalidates the other. The goto end therefore exists for hash-chain and latch cleanup, not for LRU unlinking.

If both fail, behavior forks on build mode and daemon availability. In SERVER_MODE with the flush daemon up, the thread enqueues on a direct-victim waiter queue and suspends with a timeout:

// pgbuf_allocate_bcb -- src/storage/page_buffer.c (SERVER_MODE, flush daemon up)
 retry:
  high_priority = high_priority || VACUUM_IS_THREAD_VACUUM (thread_p)
                                || pgbuf_is_thread_high_priority (thread_p);
  thread_lock_entry (thread_p);
  if (high_priority) waiter_threads_high_priority->produce (thread_p);
  else if (!waiter_threads_low_priority->produce (thread_p))   /* low queue jammed */
    { if (!waiter_threads_high_priority->produce (thread_p)) { assert(false); goto end; } }
  pgbuf_wakeup_page_flush_daemon (thread_p);            /* ensure SOMEONE will feed us */
  r = thread_suspend_timeout_wakeup_and_unlock_entry (..., THREAD_ALLOC_BCB_SUSPENDED);

high_priority is true for vacuum threads, threads already holding a hot-page latch, or on a retry. The low-priority produce-failure sub-branch guards a preempted-consumer wedge: if the low queue cannot accept, the thread is pushed to the high queue. After enqueueing it wakes the flush daemon and suspends. On wake, four sub-branches:

Normal handoff (THREAD_ALLOC_BCB_RESUMED): a producer put a BCB in this thread’s slot; pgbuf_get_direct_victim reads it and we goto end.
Stolen-back: pgbuf_get_direct_victim returns NULL (the BCB was re-fixed between assign and get, companion §“Direct victim hand-off”); set high_priority and goto retry.
Interrupt/shutdown (other resume_status): undo any half-assigned victim, then raise ER_INTERRUPTED so the claim path takes Branch C.
r != NO_ERROR (the timeout else): the suspend is not expected to time out, so this branch asserts, re-stamps resume_status, and exists only as a can’t-happen guard.

The no-daemon else (SA mode or crash recovery) cannot sleep for a producer, so it calls pgbuf_wakeup_page_flush_daemon, which with no daemon to wake performs the flush inline in the calling thread, then re-scans with pgbuf_get_victim and asserts a victim now exists. The shared tail victimizes any acquired victim:

// pgbuf_allocate_bcb -- src/storage/page_buffer.c
 end:
  if (bufptr != NULL)
    { if (pgbuf_victimize_bcb (thread_p, bufptr) != NO_ERROR) { assert (false); bufptr = NULL; } }
  else if (er_errid () == NO_ERROR)
    er_set (..., ER_PB_ALL_BUFFERS_DIRTY, ...);
  return bufptr;

pgbuf_victimize_bcb re-checks victimizability under the BCB mutex, unlinks the BCB from its hash chain (pgbuf_delete_from_hash_chain), and stamps PGBUF_LATCH_INVALID into the atomic latch. It does not change the zone — the victim was already moved to VOID by pgbuf_remove_from_lru_list during pgbuf_get_victim (§4.6). Invalid-list BCBs skip this (never in a chain). If bufptr is still NULL, ER_PB_ALL_BUFFERS_DIRTY is set so Branch C has an error to propagate.

Invariant — every BCB leaving pgbuf_allocate_bcb non-NULL is detached from any hash chain and LRU list, held under bufptr->mutex. Invalid-list BCBs never attached; victims detach via pgbuf_remove_from_lru_list + pgbuf_victimize_bcb. A still-linked BCB leaking out would let the claimer bind a second VPID onto a slot still reachable by its old VPID.
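The source ordering and the detachment invariant can be shown with a toy pool. This is a minimal sketch, assuming hypothetical `Bcb`, `Pool`, `get_victim`, `victimize`, and `allocate` helpers; the third source (sleeping for a direct victim) is omitted, so an exhausted pool simply yields nullptr.

```cpp
#include <cassert>
#include <vector>

struct Bcb {
  int id = 0;
  bool in_hash_chain = false;
  bool on_lru_list = false;
};

// Hypothetical pool: free_list models the invalid list, lru the victim candidates.
struct Pool {
  std::vector<Bcb *> free_list;
  std::vector<Bcb *> lru;
};

// Victim search: the LRU unlink happens here, before the caller sees the BCB.
Bcb *get_victim(Pool &pool) {
  if (pool.lru.empty())
    return nullptr;
  Bcb *b = pool.lru.back();
  pool.lru.pop_back();
  b->on_lru_list = false;
  return b;
}

// Victimization: only the hash-chain detach is left to do.
void victimize(Bcb *b) { b->in_hash_chain = false; }

Bcb *allocate(Pool &pool) {
  if (!pool.free_list.empty()) {  // Source 1: returns early, never victimized
    Bcb *b = pool.free_list.back();
    pool.free_list.pop_back();
    return b;
  }
  Bcb *b = get_victim(pool);      // Source 2: falls through to the shared tail
  if (b != nullptr)
    victimize(b);
  return b;
}
```

Whichever source supplies the buffer, it leaves `allocate` on no list and in no chain, which is the property the invariant states.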

4.5 Loading the page bytes — NEW_PAGE vs read vs DWB

Back in pgbuf_claim_bcb_for_fix (Branch D), the initialized BCB needs its page bytes. The fork is on fetch_mode.

// pgbuf_claim_bcb_for_fix -- src/storage/page_buffer.c (read branch)
  if (fetch_mode != NEW_PAGE)
    {
      /* DWB first: a torn-write copy may be fresher than the volume. */
      if (dwb_read_page (thread_p, vpid, &...iopage, &success) != NO_ERROR)
        { assert (false); return NULL; }                 /* (1) DWB error: can't-happen */
      else if (success == true) { /* copied from DWB, no disk read */ }
      else if (fileio_read (...) == NULL)                /* (2) volume read failed */
        { ASSERT_ERROR ();
          pgbuf_put_bcb_into_invalid_list (thread_p, bufptr);     /* releases bufptr->mutex */
          (void) pgbuf_unlock_page (..., true); return NULL; }
      /* (3) decrypt if TDE-protected; on failure roll back like (2) */
      if (tde_algo != TDE_ALGORITHM_NONE && tde_decrypt_data_page (...) != NO_ERROR)
        { ASSERT_ERROR (); pgbuf_put_bcb_into_invalid_list (...); pgbuf_unlock_page (..., true); return NULL; }
      if (pgbuf_is_temporary_volume (vpid->volid) && !pgbuf_is_temp_lsa (...))  /* temp first-touch */
        { pgbuf_init_temp_page_lsa (...); pgbuf_set_dirty_buffer_ptr (thread_p, bufptr); }
    }

The read branch honors DWB-first ordering (companion §“Double Write Buffer”): dwb_read_page sets success when the double-write buffer holds a copy of this VPID, short-circuiting the disk read. Only on a DWB miss does fileio_read hit the volume. Of the three error sub-branches, (1) the DWB-error guard is a can’t-happen path that asserts and returns NULL with no rollback, while (2) the volume-read failure and (3) the TDE-decrypt failure each roll the BCB back via pgbuf_put_bcb_into_invalid_list (nulls the VPID, sets PGBUF_LATCH_INVALID, flips the zone to INVALID, releases bufptr->mutex) and then call pgbuf_unlock_page(..., true), so half-read bytes never linger in the pool. The temp first-touch sub-branch stamps a sentinel temp LSA and marks the page dirty.
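The DWB-first ordering reduces to "consult the write buffer, and touch the volume only on a miss". The sketch below assumes hypothetical `Stores`, `LoadStatus`, and `load_page` names and uses strings in place of I/O pages; it omits decryption and the temp-volume handling.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical stores keyed by page id; real code reads fixed-size I/O pages.
struct Stores {
  std::map<int, std::string> dwb;     // pages still held by the double write buffer
  std::map<int, std::string> volume;  // pages on disk
  int volume_reads = 0;               // how often the disk path ran
};

enum class LoadStatus { FromDwb, FromVolume, ReadFailed };

LoadStatus load_page(Stores &s, int page_id, std::string &out) {
  auto d = s.dwb.find(page_id);
  if (d != s.dwb.end()) {             // DWB hit: no disk read at all
    out = d->second;
    return LoadStatus::FromDwb;
  }
  ++s.volume_reads;                   // DWB miss: fall back to the volume
  auto v = s.volume.find(page_id);
  if (v == s.volume.end())
    return LoadStatus::ReadFailed;    // caller must roll the buffer back
  out = v->second;
  return LoadStatus::FromVolume;
}
```

On `ReadFailed` the output is left untouched, mirroring the rule that a failed read must not leave half-loaded bytes behind.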

// pgbuf_claim_bcb_for_fix -- src/storage/page_buffer.c (NEW_PAGE branch)
  else
    {
      if (pgbuf_is_temporary_volume (vpid->volid))
        pgbuf_init_temp_page_lsa (&...iopage, IO_PAGESIZE);
      else fileio_init_lsa_of_page (&...iopage, IO_PAGESIZE);
      if (bufptr->vpid.volid > NULL_VOLID)               /* perm: mark page immature */
        { ...iopage.prv.pageid = -1; ...iopage.prv.volid = -1; }
    }
  return bufptr;

The NEW_PAGE branch has nothing on disk to read, so it initializes the in-page LSA and (for permanent volumes) stamps prv.pageid/volid = -1 to mark the page immature — the real identity is written later by pgbuf_set_bcb_page_vpid (§4.6). It cannot fail, so there is no rollback sub-branch. Both branches return the loaded BCB under bufptr->mutex.

4.6 `pgbuf_set_bcb_page_vpid` and the hash-chain insertion

After the claim returns, the fix path stamps the page identity and inserts the BCB into the hash chain (Chapter 5 territory, shown here only for the bind step). pgbuf_set_bcb_page_vpid has three branches:

// pgbuf_set_bcb_page_vpid -- src/storage/page_buffer.c
  if (bufptr == NULL || VPID_ISNULL (&bufptr->vpid))     /* (A) guard: nothing to stamp */
    { assert (bufptr != NULL); assert (!VPID_ISNULL (&bufptr->vpid)); return; }
  if (bufptr->vpid.volid > NULL_VOLID)                   /* permanent volume only */
    {
      if (prv.pageid == NULL_PAGEID && prv.volid == NULL_VOLID)  /* (B) first time */
        { prv.pageid = bufptr->vpid.pageid;              /* write identity into header */
          prv.volid  = bufptr->vpid.volid;
          prv.ptype  = PAGE_UNKNOWN; /* + p_reserve_1/2, tde_nonce zeroed */ }
      else                                               /* (C) already stamped */
        { assert (prv.volid == bufptr->vpid.volid);      /* values not reset on dealloc */
          assert (prv.pageid == bufptr->vpid.pageid); }  /* identity must match -- no rewrite */
    }

(A) the top guard: a NULL BCB or null VPID is a caller bug — it asserts and returns without touching bytes. (B) first-time path (immature NEW_PAGE sentinel from §4.5): writes the VPID into the in-page header, making the bytes self-identifying on disk. (C) the else — a re-allocated/already-stamped perm page (the in-page identity survives deallocation): the function leaves the bytes untouched and only asserts that the stored prv.volid/pageid still equal the BCB’s VPID. Temp pages (volid <= NULL_VOLID) fall through all three with no action.
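The stamp-once behavior of branches (B) and (C) can be sketched with a two-field header. The names `PageHeader`, `Vpid`, `StampResult`, and `stamp_identity` are hypothetical, `-1` stands in for the null ids as in §4.5, and the sketch reports a mismatch as a return value where the real code asserts. Guard (A) and the extra header fields are left out.

```cpp
#include <cassert>

constexpr int kNullId = -1;  // stands in for NULL_PAGEID / NULL_VOLID

struct PageHeader {
  int pageid = kNullId;
  int volid = kNullId;
};

struct Vpid {
  int pageid;
  int volid;
};

enum class StampResult { Skipped, Stamped, AlreadyStamped, Mismatch };

StampResult stamp_identity(PageHeader &hdr, const Vpid &vpid) {
  if (vpid.volid <= kNullId)          // mirrors the volid > NULL_VOLID test
    return StampResult::Skipped;
  if (hdr.pageid == kNullId && hdr.volid == kNullId) {
    hdr.pageid = vpid.pageid;         // first time: write identity into header
    hdr.volid = vpid.volid;
    return StampResult::Stamped;
  }
  if (hdr.pageid == vpid.pageid && hdr.volid == vpid.volid)
    return StampResult::AlreadyStamped;  // identity survives, bytes untouched
  return StampResult::Mismatch;       // the real code asserts instead
}
```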

// pgbuf_insert_into_hash_chain -- src/storage/page_buffer.c
  pthread_mutex_lock (&hash_anchor->hash_mutex);
  bufptr->hash_next = hash_anchor->hash_next; hash_anchor->hash_next = bufptr;  /* link at head */
  /* hash_mutex stays held; released by the following pgbuf_unlock_page (need_hash_mutex=false) */

pgbuf_insert_into_hash_chain links the BCB at the head of the bucket’s BCB chain and deliberately keeps hash_mutex held; the immediately-following pgbuf_unlock_page(..., false) unlinks the VPID lock record and releases the same mutex. Holding it across both makes the two updates a single step for any thread scanning the bucket: a racer sees either the lock record (and waits) or the linked BCB (and hits), never a bucket with neither, which would let it wrongly become a second winner.

Where does the BCB land in the LRU? At this point the BCB is in PGBUF_VOID_ZONE. An invalid-list BCB was put there by pgbuf_get_bcb_from_invalid_list; a victim was moved there by pgbuf_remove_from_lru_list (call chain pgbuf_get_victim -> pgbuf_get_victim_from_lru_list -> pgbuf_remove_from_lru_list, whose tail does the pgbuf_bcb_change_zone (..., PGBUF_VOID_ZONE)), and pgbuf_victimize_bcb afterward only detaches it from the hash chain and invalidates the latch. VOID means “on no list yet.” The BCB does not enter an LRU list during claim/insert; it lands in a zone only at the matching unfix, where pgbuf_unfix routes it to LRU 2 normally, or LRU 1 when the fixer is a vacuum worker or the VPID was an Aout-history hit (boost) — Chapter 7’s subject.

stateDiagram-v2
  [*] --> INVALID: server start
  INVALID --> VOID: pop from invalid list
  Victim --> VOID: pgbuf_remove_from_lru_list during get_victim
  VOID --> LRU2: unfix normal -> Ch 7
  VOID --> LRU1: unfix vacuum or Aout-hit -> Ch 7
  VOID --> INVALID: read error rollback

Figure 4-3 — Zone trajectory of a claimed BCB. The claim path lands it in VOID; LRU placement happens at unfix (Chapter 7). Any read error returns it straight to INVALID.

4.7 Chapter summary — key takeaways

pgbuf_claim_bcb_for_fix has four exits: trylock-bail (Branch A), loser-retry with *try_again=true (B), allocation-failure unwind (C), and success returning a loaded BCB under bufptr->mutex (D).
The PGBUF lock makes same-VPID allocation single-winner. pgbuf_lock_page returns HOLDER to one thread and parks the rest as WAITERs on a per-record waiter list; the winner inserts the BCB and pgbuf_unlock_page wakes everyone, who re-drive the fix and hit it.
pgbuf_allocate_bcb short-circuits the invalid list (returns without victimizing); a victim falls through to end: for pgbuf_victimize_bcb. When both fail, server mode sleeps on a direct-victim queue with four wake sub-branches (handoff, stolen-back, interrupt, r != NO_ERROR timeout-assert); SA/recovery flushes inline.
pgbuf_get_bcb_from_invalid_list uses double-checked locking: an unlocked empty-list fast return, a post-mutex re-check return, and the pop branch that advances invalid_top, decrements invalid_cnt, locks the BCB and flips it to VOID. Every read-error path funnels through pgbuf_put_bcb_into_invalid_list + pgbuf_unlock_page(...,true), so no half-bound slot leaks.
DWB is consulted before disk on every read miss (dwb_read_page sets success, eliminating fileio_read); NEW_PAGE never reads or fails, marking permanent pages immature (prv.pageid/volid = -1). pgbuf_set_bcb_page_vpid then stamps identity only on the first-time branch; on a re-allocated perm page it leaves bytes untouched and only asserts the stored identity still matches the VPID.
The claim path leaves the BCB in VOID. A victim reaches VOID via pgbuf_remove_from_lru_list during pgbuf_get_victim — not via pgbuf_victimize_bcb, which only detaches the hash chain and invalidates the latch. Hash-chain insertion holds hash_mutex across the VPID-lock release to close the winner/loser race window; LRU-zone placement is deferred to unfix in Chapter 7.

Chapter 5: The BCB Atomic Latch Acquire Block and Wake

Chapter 3 left us with a BCB in hand and its mutex held. The question: given a fixer that owns the BCB mutex, how does the per-page read/write latch decide compatibility, block an incompatible request on a per-page waiter list, time it out, and wake waiters in order on release? For latch semantics at design level see the high-level companion’s “Latch modes and the fix protocol” section; here we trace every branch.

The subsystem rests on one 64-bit word — the atomic_latch — plus a singly linked next_wait_thrd queue on the BCB. The BCB mutex serializes the queue; the latch word is mutated by lock-free CAS so a waiter being woken can publish its grant without re-taking the mutex.

5.1 The packed latch word: `pgbuf_atomic_latch_impl`

The latch is std::atomic<uint64_t> atomic_latch, accessed through a union:

// union pgbuf_atomic_latch_impl -- src/storage/page_buffer.c
union pgbuf_atomic_latch_impl {
  uint64_t raw;                 /* the word actually CAS'd */
  struct {
    PGBUF_LATCH_MODE latch_mode;  /* enum:uint16_t NO_LATCH=0, READ=1, WRITE=2, FLUSH=3 */
    uint16_t waiter_exists;       /* 1 if next_wait_thrd has a R/W waiter */
    int32_t fcnt;                 /* number of granted fixes */
  } impl;
};

Field	Role	Why it exists
`raw`	`uint64_t` payload `compare_exchange_*` operates on	Moves the three fields atomically in one CAS; a torn read is impossible only because they share this word. The layout (16 + 16 + 32 bits) exactly fills 64.
`impl.latch_mode`	Current grant mode. `PGBUF_LATCH_FLUSH` is only a block mode, never a grant — the header at `PGBUF_LATCH_MODE` says so.	The compatibility decision (5.3) reads this first; `PGBUF_NO_LATCH` doubles as the “idle” sentinel.
`impl.waiter_exists`	A hint, not a count: true once a R/W request is queued.	The writer-starvation guard (5.3, Case 1).
`impl.fcnt`	Total fix count across holders sharing the current mode	At 0 the page can switch mode or be victimized; compared against one holder’s `fix_count` for “am I the only reader”.

get_impl snapshots with an acquire load. The field mutators (set_latch, add_fcnt, set_waiter_exists, set_latch_and_fcnt, set_latch_and_add_fcnt; the last two update two fields in one step) each run a load/compare_exchange_weak loop. pgbuf_latch_bcb_upon_fix does not call those helpers: it computes the whole new_impl from a fresh old_impl and retries a single compare_exchange_strong on bufptr->atomic_latch, so its entire decision tree is one atomic transition, recomputed on contention.

Invariant — the latch word transitions atomically; no partial publish. Every mutation is do { old = get_impl(); ...build new... } while (!CAS(old, new)). With two separate stores a concurrent fixer could observe latch_mode == READ with a stale fcnt, grant a compatible read, and corrupt holder accounting. The single-word CAS forbids that torn intermediate. The decision-tree CAS uses the strong form (no spurious failure); the per-field helpers and the error/wakeup repair loops use the weak form inside their retry loops.
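The single-word transition can be sketched as follows. This is an illustration under stated assumptions: it packs the three fields with shifts instead of the union shown above (to stay within standard C++ rules on type punning), the bit positions are chosen for the sketch, and `LatchFields`, `pack`, `unpack`, and this `add_fcnt` are hypothetical stand-ins rather than the real helpers.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

struct LatchFields {
  uint16_t mode;
  uint16_t waiter_exists;
  int32_t fcnt;
};

// Sketch layout: mode in bits 0-15, waiter flag in 16-31, fcnt in 32-63.
inline uint64_t pack(LatchFields f) {
  return static_cast<uint64_t>(f.mode) |
         (static_cast<uint64_t>(f.waiter_exists) << 16) |
         (static_cast<uint64_t>(static_cast<uint32_t>(f.fcnt)) << 32);
}

inline LatchFields unpack(uint64_t raw) {
  LatchFields f;
  f.mode = static_cast<uint16_t>(raw & 0xFFFFu);
  f.waiter_exists = static_cast<uint16_t>((raw >> 16) & 0xFFFFu);
  f.fcnt = static_cast<int32_t>(static_cast<uint32_t>(raw >> 32));
  return f;
}

// Read-modify-write of one field, published as a single 64-bit transition.
inline LatchFields add_fcnt(std::atomic<uint64_t> &word, int32_t delta) {
  uint64_t old_raw = word.load(std::memory_order_acquire);
  for (;;) {
    LatchFields next = unpack(old_raw);
    next.fcnt += delta;
    // On failure compare_exchange_weak refreshes old_raw, so the next pass
    // recomputes from the current word rather than a stale snapshot.
    if (word.compare_exchange_weak(old_raw, pack(next),
                                   std::memory_order_acq_rel,
                                   std::memory_order_acquire))
      return next;
  }
}
```

Because mode, waiter flag, and count travel in one word, no observer can see the new count paired with an old mode.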

flowchart LR
  BCB["pgbuf_bcb"] --> AL["atomic_latch\n(uint64_t word)"]
  BCB --> NWT["next_wait_thrd\n(THREAD_ENTRY*)"]
  AL --> IMPL["impl: latch_mode | waiter_exists | fcnt"]
  NWT --> T1["THREAD_ENTRY\nrequest_latch_mode\nrequest_fix_count\nwait_for_latch_promote"]
  T1 --> T2["THREAD_ENTRY ..."]
  TH["THREAD_ENTRY\nm_holder_anchor"] --> HL["thrd_hold_list"]
  HL --> H1["pgbuf_holder\nfix_count, bufptr"]
  H1 --> H2["pgbuf_holder (thrd_link)"]

Figure 5-1 — The latch word and the two lists around it: the per-BCB waiter queue (next_wait_thrd) and the per-thread holder list (thrd_hold_list).

5.2 Holder bookkeeping: `pgbuf_holder`

A pgbuf_holder records this thread’s fixes on one BCB. The latch word counts fixes globally; the holder counts my slice so promotion and unfix can be reasoned about locally.

// struct pgbuf_holder -- src/storage/page_buffer.c
struct pgbuf_holder {
  int fix_count;            /* the count of fix by the holder */
  PGBUF_BCB *bufptr;        /* pointer to BCB */
  PGBUF_HOLDER *thrd_link;  /* next holder in this thread's hold list */
  PGBUF_HOLDER *next_holder;/* next in this thread's *free* list */
  PGBUF_HOLDER_STAT perf_stat;
#if !defined(NDEBUG)
  char fixed_at[64 * 1024]; /* call-site trail for leak debugging */
  int fixed_at_size;
#endif
  int watch_count;          /* number of PGBUF_WATCHERs on this holder */
  PGBUF_WATCHER *first_watcher;
  PGBUF_WATCHER *last_watcher;
};

Field	Role	Why it exists
`fix_count`	How many times this thread fixed this BCB	`old_impl.impl.fcnt == holder->fix_count` (5.3) means “global == mine”, i.e. only fixer. At 0 the holder is recycled.
`bufptr`	Back-pointer to the recorded BCB	`pgbuf_find_thrd_holder` matches on this.
`thrd_link`	Next holder in this thread’s in-use list (`thrd_hold_list`)	Links the many pages a thread holds; `next_holder` must be NULL while on this list (asserted in `pgbuf_find_thrd_holder`).
`next_holder`	Next holder in this thread’s free list (`thrd_free_list`)	`next_holder` meaningful only while free, `thrd_link` only while in use. Never both.
`perf_stat`	`PGBUF_HOLDER_STAT` bitfield (`dirty_before_hold`, `dirtied_by_holder`, `hold_has_write_latch`, `hold_has_read_latch`)	Feeds `perfmon`; the `hold_has_*` bits are set when the latch is granted in any branch of `pgbuf_latch_bcb_upon_fix`.
`fixed_at` / `fixed_at_size`	Fixed-size buffer holding the concatenated call-site trail (file:line of each fix) and its length	Leak / double-fix debugging in non-`NDEBUG` builds only; absent from release builds.
`watch_count` / `first_watcher` / `last_watcher`	Ordered-fix watcher chain (Chapter 10)	Must be 0 / NULL before recycle — `pgbuf_remove_thrd_holder` asserts `watch_count == 0`.

Invariant — a holder lives on exactly one of the two per-thread lists. While in use it is on thrd_hold_list with next_holder == NULL (pgbuf_find_thrd_holder asserts this on every node it walks); while free it is on thrd_free_list reached via next_holder, and thrd_link is dead. The free-list and hold-list links never both point somewhere at once.

Three helpers maintain the lists. pgbuf_allocate_thrd_holder_entry pops thrd_free_list if non-empty (no global mutex); else takes free_holder_set_mutex, carves the next element from the shared free_holder_set, and grows it with a fresh malloced PGBUF_HOLDER_SET when free_index == -1. Either way the holder is pushed onto thrd_hold_list:

// pgbuf_allocate_thrd_holder_entry -- src/storage/page_buffer.c
holder->next_holder = NULL;                            /* disconnect from free list */
holder->thrd_link = thrd_holder_info->thrd_hold_list;  /* push onto hold list */
thrd_holder_info->thrd_hold_list = holder;
thrd_holder_info->num_hold_cnt += 1;
holder->first_watcher = NULL; holder->last_watcher = NULL; holder->watch_count = 0;

pgbuf_find_thrd_holder walks thrd_hold_list for the holder whose bufptr matches, else NULL; its assert (holder->next_holder == NULL) enforces the “never on both lists” invariant. pgbuf_remove_thrd_holder asserts fix_count == 0 and watch_count == 0, prepends the holder to thrd_free_list first, then unlinks it from thrd_hold_list (head special-case, else walk to the predecessor); a missing entry trips assert (false) and returns ER_FAILED.
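The movement of a holder between the two per-thread lists can be sketched like this. The types and the `acquire`, `find`, and `release` functions are hypothetical simplifications: there is no shared holder set to refill from, `acquire` sets `fix_count = 1` purely for illustration, and `release` unlinks before prepending to the free list (the reverse of the order described above) so that a missing entry leaves both lists unchanged.

```cpp
#include <cassert>

struct Holder {
  int fix_count = 0;
  const void *bufptr = nullptr;
  Holder *thrd_link = nullptr;    // next on the hold list (meaningful in use)
  Holder *next_holder = nullptr;  // next on the free list (meaningful when free)
};

struct ThreadHolders {
  Holder *hold_list = nullptr;
  Holder *free_list = nullptr;
  int num_hold = 0;
};

// Pops the per-thread free list and pushes the holder onto the hold list.
Holder *acquire(ThreadHolders &t, const void *bufptr) {
  Holder *h = t.free_list;
  if (h == nullptr)
    return nullptr;               // real code would refill from a shared set
  t.free_list = h->next_holder;
  h->next_holder = nullptr;       // disconnect from the free list
  h->bufptr = bufptr;
  h->fix_count = 1;
  h->thrd_link = t.hold_list;
  t.hold_list = h;
  ++t.num_hold;
  return h;
}

Holder *find(const ThreadHolders &t, const void *bufptr) {
  for (Holder *h = t.hold_list; h != nullptr; h = h->thrd_link) {
    assert(h->next_holder == nullptr);  // never on both lists
    if (h->bufptr == bufptr)
      return h;
  }
  return nullptr;
}

bool release(ThreadHolders &t, Holder *h) {
  assert(h->fix_count == 0);
  Holder **link = &t.hold_list;
  while (*link != nullptr && *link != h)
    link = &(*link)->thrd_link;
  if (*link == nullptr)
    return false;                 // not on the hold list
  *link = h->thrd_link;
  h->thrd_link = nullptr;
  h->next_holder = t.free_list;
  t.free_list = h;
  --t.num_hold;
  return true;
}
```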

5.3 `pgbuf_latch_bcb_upon_fix` — the compatibility decision tree

The caller holds the BCB mutex; a scope_exit unlock_BCB guard releases it on every exit unless a branch .release()s it. It looks up the caller’s holder once, then runs a do { ...recompute new_impl... } while (!compare_exchange_strong) loop. request_fcnt starts at 1 and is reset at the top of every retry.

flowchart TB
  S["snapshot old_impl; new_impl = old_impl\nrequest_fcnt = 1"] --> IDLE{"buf_lock_acquired\nor latch_mode == NO_LATCH ?"}
  IDLE -- yes --> SETIDLE["is_page_idle=true\nnormalize old to clean idle\nnew: mode=request, fcnt=1"]
  IDLE -- no --> C1{"READ req on\nREAD-latched page ?"}
  C1 -- yes --> W{"waiter_exists ?"}
  W -- no --> GR1["can_latch=true; fcnt++"]
  W -- yes --> OWN{"holder != NULL ?"}
  OWN -- yes --> GR2["can_latch=true; fcnt++"]
  OWN -- no --> BLK1["can_latch=false\nwriter-starvation guard"]
  C1 -- no --> H{"holder != NULL ?"}
  H -- no --> BLK2["Case 3: can_latch=false\nwaiter_exists=true"]
  H -- yes --> WR{"latch_mode == WRITE ?"}
  WR -- yes --> GR3["Sub 2-1: can_latch=true; fcnt++"]
  WR -- no --> SOLE{"old fcnt == holder fix_count ?"}
  SOLE -- yes --> GR4["Sub 2-2: in-place promote\nmode=WRITE, fcnt=1"]
  SOLE -- no --> COND{"CONDITIONAL ?"}
  COND -- yes --> CFAIL["can_latch=false\nwaiter_exists=true, then reject"]
  COND -- no --> PROM["promote_needed=true\nfcnt -= holder fix_count\nwaiter_exists=true"]

Figure 5-2 — Every branch of the new_impl computation. The loop CASes old_impl.raw to new_impl.raw with compare_exchange_strong; on failure it re-snapshots and re-walks the whole tree.

Idle short-circuit. If buf_lock_acquired (fresh BCB, Chapter 4) or the page is PGBUF_NO_LATCH, the code normalizes old_impl to a clean idle state before building new_impl — this matters for the CAS expectation:

// pgbuf_latch_bcb_upon_fix -- src/storage/page_buffer.c
if (is_page_idle == true) {
    old_impl.impl.waiter_exists = false;          /* <- expect a clean word */
    old_impl.impl.latch_mode = PGBUF_NO_LATCH; old_impl.impl.fcnt = 0;
    new_impl = old_impl;
    new_impl.impl.latch_mode = request_mode; new_impl.impl.fcnt = 1; /* grant */
}

(In SA_MODE only, a non-idle page with no holder is a leaked latch: the code fires assert (0) and then treats the page as idle.)

Case 1 — R on R. No waiter: grant (can_latch = true; fcnt++). Waiter present: the reader may join only if already a holder (re-entrant); a brand-new reader (holder == NULL) blocks.

Invariant — readers yield to a queued writer. Once waiter_exists is set, only re-entrant readers may join an R-latch (the holder == NULL test). If violated, a stream of fresh readers would indefinitely defer the queued writer.

Case 2 — caller already a holder (not R-on-R, so the page is WRITE-latched or the caller is an R-holder asking for WRITE):

Sub-2-1, page WRITE-latched: re-fix (R or W) is a pure passthrough, can_latch = true; fcnt++ — the W-holder shortcut.
Sub-2-2, page READ-latched requesting WRITE (in-place promotion): if old_impl.impl.fcnt == holder->fix_count the caller is the sole reader, so the latch flips to WRITE in place (mode = WRITE; fcnt = 1). If other readers exist, a PGBUF_CONDITIONAL_LATCH sets waiter_exists and falls to rejection; an unconditional one sets promote_needed, deducts its own fixes (new_impl.impl.fcnt -= holder->fix_count), sets waiter_exists, and blocks as a tail waiter (see 5.4).

The “unreachable in-place upgrade” the high-level companion mentions refers to a historically-removed contended upgrade; the sole-reader in-place flip (sub-2-2) is reachable today, alongside the dedicated promotion entry point pgbuf_promote_read_latch_debug and the one-promoter assert in pgbuf_block_bcb (see summary item 3).

Case 3 — caller not a holder, request incompatible (W on R, R/W on W by a stranger): can_latch = false; waiter_exists = true; the thread blocks.
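The idle short-circuit and Cases 1 to 3 can be restated as one pure function that maps the old latch word to the word the CAS would try to install. This is a single-threaded sketch under stated assumptions: the type and function names (`latch_word_t`, `decide`, `outcome_t`) are hypothetical, the real code retries the computation under compare_exchange_strong, and a negative `holder_fix_count` models "no holder".

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names; a single-threaded model of one pass through the tree. */
typedef enum { NO_LATCH, LATCH_READ, LATCH_WRITE } latch_mode_t;
typedef enum { GRANT, BLOCK, PROMOTE_BLOCK, REJECT } outcome_t;

typedef struct {
    latch_mode_t mode;
    bool waiter_exists;
    int fcnt;
} latch_word_t;

typedef struct {
    latch_word_t next;   /* the word the CAS would try to install */
    outcome_t outcome;
} decision_t;

static latch_word_t mk_word(latch_mode_t mode, bool waiter_exists, int fcnt) {
    latch_word_t w = { mode, waiter_exists, fcnt };
    return w;
}

/* holder_fix_count < 0 models "the caller has no holder on this BCB". */
static decision_t decide(latch_word_t old, latch_mode_t request, int holder_fix_count,
                         bool conditional, bool buf_lock_acquired) {
    decision_t d = { old, GRANT };
    bool is_holder = holder_fix_count >= 0;
    bool blocked = false;

    assert(request == LATCH_READ || request == LATCH_WRITE);
    if (buf_lock_acquired || old.mode == NO_LATCH) {              /* idle short-circuit */
        d.next = mk_word(request, false, 1);
    } else if (request == LATCH_READ && old.mode == LATCH_READ) { /* Case 1 */
        if (!old.waiter_exists || is_holder) {
            d.next.fcnt += 1;
        } else {
            blocked = true;                                       /* writer-starvation guard */
        }
    } else if (!is_holder) {                                      /* Case 3 */
        d.next.waiter_exists = true;
        blocked = true;
    } else if (old.mode == LATCH_WRITE) {                         /* Sub-2-1 */
        d.next.fcnt += 1;
    } else if (old.fcnt == holder_fix_count) {                    /* Sub-2-2, sole reader */
        d.next.mode = LATCH_WRITE;
        d.next.fcnt = 1;
    } else {                                                      /* Sub-2-2, contended */
        d.next.waiter_exists = true;
        if (!conditional) {
            d.next.fcnt -= holder_fix_count;                      /* deduct own read fixes */
            d.outcome = PROMOTE_BLOCK;
            return d;
        }
        blocked = true;
    }
    if (blocked) {
        d.outcome = conditional ? REJECT : BLOCK;
    }
    return d;
}
```

Collapsing the can_latch / promote_needed flags into one `outcome_t` is a simplification of this sketch, chosen so each branch has a single observable result.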

After the CAS succeeds the function dispatches on the outcome flags:

is_page_idle or can_latch — granted: allocate a holder (idle / stranger paths) or bump the existing holder’s fix_count, set perf bits, update latch_last_thread, return NO_ERROR.
promote_needed — roll the holder’s read fixes into request_fcnt (request_fcnt += holder->fix_count), zero the holder, pgbuf_remove_thrd_holder, fall into the block path.
block/promote + PGBUF_CONDITIONAL_LATCH — reject ER_FAILED (raise ER_LK_PAGE_TIMEOUT first if the txn’s wait_msec == LK_ZERO_WAIT).
block/promote, unconditional — unlock_BCB.release(), call pgbuf_block_bcb(..., as_promote = false); on return the latch is held, so allocate a holder with fix_count = request_fcnt, set *is_latch_wait = true, return NO_ERROR.

5.4 `pgbuf_block_bcb` — enqueue and sleep

The caller holds the BCB mutex with waiter_exists true (asserted). It stamps request_latch_mode and request_fix_count (the count to credit fcnt with on wake), then enqueues by the as_promote flag:

// pgbuf_block_bcb -- src/storage/page_buffer.c
cur_thrd_entry->request_latch_mode = request_mode;
cur_thrd_entry->request_fix_count = request_fcnt;  /* SPECIAL_NOTE */
if (as_promote) {
    /* Safe guard: there can be only one promoter. */
    assert (bufptr->next_wait_thrd == NULL
            || !bufptr->next_wait_thrd->wait_for_latch_promote);
    cur_thrd_entry->next_wait_thrd = bufptr->next_wait_thrd;     /* head insert */
    bufptr->next_wait_thrd = cur_thrd_entry;
} else { cur_thrd_entry->next_wait_thrd = NULL; /* ... walk to tail, link ... */ }

The as_promote flag is the caller’s choice, and it splits two distinct callers:

Head insert (as_promote = true) is used only by pgbuf_promote_read_latch_debug (the explicit pgbuf_promote_read_latch path): a promoter that already released its read latch must win the race against fresh waiters, so it jumps the queue. The assert enforces at most one promoter in the queue.
Tail insert (as_promote = false, FIFO) is used by every other caller — including the promote_needed branch of pgbuf_latch_bcb_upon_fix (5.3), which calls pgbuf_block_bcb(..., false). So a promotion discovered during a fix enqueues at the tail, not the head; only the dedicated promote API head-inserts. The async-flush path (Chapter 8) also tail-inserts with PGBUF_LATCH_FLUSH.
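The excerpt above elides the tail walk. A minimal sketch of both insert paths, assuming hypothetical stand-in types (`waiter_t`, `wait_queue_t`) and assuming the promoting caller sets `wait_for_latch_promote` before enqueueing:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct waiter {
    int id;
    bool wait_for_latch_promote;     /* assumed to be set by the promoting caller */
    struct waiter *next_wait_thrd;
} waiter_t;

typedef struct {
    waiter_t *next_wait_thrd;        /* head of the BCB wait queue */
} wait_queue_t;

static void enqueue_waiter(wait_queue_t *q, waiter_t *w, bool as_promote) {
    if (as_promote) {
        /* At most one promoter: the current head must not be a promoter. */
        assert(q->next_wait_thrd == NULL || !q->next_wait_thrd->wait_for_latch_promote);
        w->next_wait_thrd = q->next_wait_thrd;   /* head insert */
        q->next_wait_thrd = w;
        return;
    }
    w->next_wait_thrd = NULL;                    /* tail insert (FIFO) */
    if (q->next_wait_thrd == NULL) {
        q->next_wait_thrd = w;
        return;
    }
    waiter_t *tail = q->next_wait_thrd;
    while (tail->next_wait_thrd != NULL) {
        tail = tail->next_wait_thrd;
    }
    tail->next_wait_thrd = w;
}
```

With two ordinary waiters queued and a promoter arriving last, the promoter ends up first in the queue while the ordinary waiters keep their FIFO order.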

Then it sleeps by mode:

PGBUF_LATCH_FLUSH (flush-waiter, Chapter 8) sleeps infinitely via thread_suspend_wakeup_and_unlock_entry; on a non-RESUMED wake (interrupt) it re-locks the BCB, unlinks itself from next_wait_thrd, returns ER_FAILED.
R/W waiters go through pgbuf_timed_sleep. CUBRID builds no wait-for graph across page latches; it relies on timeout — “When the request is waken up by timeout, the request is treated as a victim.” On a successful return the function sets bufptr->latch_last_thread = thread_p.

5.5 `pgbuf_timed_sleep` and `pgbuf_timed_sleep_error_handling`

pgbuf_timed_sleep locks the thread entry, then drops the BCB mutex (ordering: thread entry inside BCB), computes the timeout, and suspends:

// pgbuf_timed_sleep -- src/storage/page_buffer.c
thread_lock_entry (thread_p); PGBUF_BCB_UNLOCK (bufptr);
old_wait_msecs = wait_secs = pgbuf_find_current_wait_msecs (thread_p);
/* LK_ZERO_WAIT/LK_FORCE_ZERO_WAIT -> 0, else wait_secs = pgbuf_latch_timeout */
try_again:
  to.tv_sec = (int) time (NULL) + wait_secs;
  thread_p->resume_status = THREAD_PGBUF_SUSPENDED;
  r = thread_suspend_timeout_wakeup_and_unlock_entry (thread_p, &to, THREAD_PGBUF_SUSPENDED);

pgbuf_latch_timeout defaults to 300 * 1000, reset from PRM_ID_PAGE_LATCH_TIMEOUT at boot. Three return branches:

NO_ERROR — signalled. Re-lock the entry. If resume_status == THREAD_PGBUF_RESUMED a waker granted the latch (5.6) — return NO_ERROR (the latch is already ours, fcnt bumped by the waker). Else an interrupt: set request_latch_mode = PGBUF_NO_LATCH, call the error handler, raise ER_INTERRUPTED, return ER_FAILED.
ER_CSS_PTHREAD_COND_TIMEDOUT — timed out. If RESUMED in the race, return NO_ERROR. If the txn is no longer active (logtb_is_current_active false) loop to try_again (don’t time out a committing/aborting txn). Else a page-latch deadlock victim: save the mode, set request_latch_mode = PGBUF_NO_LATCH (the marker the waker uses to skip us), run the error handler, goto er_set_return.
else — pthread_cond failure: er_set_with_oserror (ER_CSS_PTHREAD_COND_TIMEDWAIT), return ER_FAILED.
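The three return branches reduce to a small classification over the suspend return code, the resume status, and whether the transaction is still active. The sketch below is a restatement of the list above, not the source; the enum and function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names that condense the three return branches. */
typedef enum { RC_SIGNALLED, RC_TIMEDOUT, RC_OTHER } suspend_rc_t;
typedef enum {
    WAIT_GRANTED,          /* a waker already gave us the latch */
    WAIT_INTERRUPTED,      /* signalled without RESUMED */
    WAIT_RETRY,            /* timed out, txn not active: go back to try_again */
    WAIT_DEADLOCK_VICTIM,  /* timed out, txn active */
    WAIT_OS_ERROR          /* condition-variable failure */
} wait_outcome_t;

static wait_outcome_t classify_wakeup(suspend_rc_t rc, bool resumed, bool txn_active) {
    switch (rc) {
    case RC_SIGNALLED:
        return resumed ? WAIT_GRANTED : WAIT_INTERRUPTED;
    case RC_TIMEDOUT:
        if (resumed) {
            return WAIT_GRANTED;   /* the grant raced with the timeout and won */
        }
        return txn_active ? WAIT_DEADLOCK_VICTIM : WAIT_RETRY;
    default:
        return WAIT_OS_ERROR;
    }
}
```

Only the two victim-like outcomes (interrupted, deadlock victim) need the error handler, because only they leave the thread on the wait queue with nobody to unlink it.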

er_set_return formats by the original wait spec, then releases the BCB mutex and returns ER_FAILED:

LK_INFINITE_WAIT — ER_PAGE_LATCH_TIMEDOUT then ER_LK_UNILATERALLY_ABORTED (guarded by an assert (0) marked FIXME).
positive old_wait_msecs — ER_PAGE_LATCH_TIMEDOUT + ER_LK_PAGE_TIMEOUT (the latter reports save_request_latch_mode).
otherwise — just unlock.

pgbuf_timed_sleep_error_handling runs when a waiter abandons the queue. It re-locks the BCB and unlinks the thread, distinguishing three cases:

flowchart TB
  L["PGBUF_BCB_LOCK"] --> E{"next_wait_thrd == NULL ?"}
  E -- yes --> R0["case 1: already removed by waker, return"]
  E -- no --> F{"head == thrd_entry ?"}
  F -- no --> M["case 2: walk list, unlink thrd_entry, return"]
  F -- yes --> H["case 3: pop head\nthen wake consecutive READ waiters\nuntil non-grantable or WRITE"]

Figure 5-3 — Three removal cases. Only case 3 (the abandoning thread at the head) must repair the queue by waking the readers it was shadowing. In case 3 it pops the head, then loops: for each follower, if the page latch_mode == READ and the waiter wants READ it CASes (compare_exchange_weak) fcnt += request_fix_count, locks the entry, unlinks it, and wakes it (pgbuf_wakeup); a WRITE waiter or non-grantable state breaks the loop.
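The three removal cases can be modelled on a plain linked list. This is a single-threaded sketch with hypothetical types and names (`bcb_t`, `waiter_t`, `abandon_wait`); it replaces the CAS on the latch word and the thread-entry locking with direct field updates and records a wake as a `woken` flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum { NO_LATCH, LATCH_READ, LATCH_WRITE } latch_mode_t;

typedef struct waiter {
    latch_mode_t request_latch_mode;
    int request_fix_count;
    bool woken;
    struct waiter *next_wait_thrd;
} waiter_t;

typedef struct {
    latch_mode_t latch_mode;
    int fcnt;
    waiter_t *next_wait_thrd;
} bcb_t;

/* Returns the case number (1, 2 or 3) that handled the removal. */
static int abandon_wait(bcb_t *b, waiter_t *me) {
    if (b->next_wait_thrd == NULL) {
        return 1;                                   /* a waker already unlinked us */
    }
    if (b->next_wait_thrd != me) {                  /* somewhere behind the head */
        waiter_t *prev = b->next_wait_thrd;
        while (prev->next_wait_thrd != NULL && prev->next_wait_thrd != me) {
            prev = prev->next_wait_thrd;
        }
        if (prev->next_wait_thrd == me) {
            prev->next_wait_thrd = me->next_wait_thrd;
            me->next_wait_thrd = NULL;
        }
        return 2;
    }
    b->next_wait_thrd = me->next_wait_thrd;         /* pop the head */
    me->next_wait_thrd = NULL;
    while (b->next_wait_thrd != NULL) {             /* wake the readers we were shadowing */
        waiter_t *w = b->next_wait_thrd;
        if (b->latch_mode != LATCH_READ || w->request_latch_mode != LATCH_READ) {
            break;                                  /* WRITE waiter or non-grantable state */
        }
        b->fcnt += w->request_fix_count;
        b->next_wait_thrd = w->next_wait_thrd;
        w->next_wait_thrd = NULL;
        w->woken = true;
    }
    return 3;
}
```

With a READ-latched page and a queue of writer, reader, reader, writer, abandoning the head writer wakes both readers and leaves the second writer at the head.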

5.6 `pgbuf_wakeup_reader_writer` — ordered wake on unlatch

When unfix drops fcnt to 0 and resets the mode to PGBUF_NO_LATCH (both asserted on entry), this function walks next_wait_thrd once and grants what it can. The caller holds the BCB mutex.

// pgbuf_wakeup_reader_writer -- src/storage/page_buffer.c
for (thrd_entry = bufptr->next_wait_thrd; thrd_entry != NULL; thrd_entry = next_thrd_entry) {
    next_thrd_entry = thrd_entry->next_wait_thrd;
    if (thrd_entry->request_latch_mode == PGBUF_NO_LATCH) { /* unlink, continue -- corpse */ }
    if (thrd_entry->request_latch_mode == PGBUF_LATCH_FLUSH) {
        assert (pgbuf_bcb_is_async_flush_request (bufptr) || pgbuf_bcb_is_flushing (bufptr));
        prev_thrd_entry = thrd_entry; continue;   /* skip -- leave in list, do NOT wake */
    }
    /* ... R/W grant via compare_exchange_strong loop ... */
}

Branch by branch:

PGBUF_NO_LATCH waiter — a thread that gave up (timed out / interrupted, not yet self-removed). Unlink and continue — “clean a timed-out waiter”.
PGBUF_LATCH_FLUSH waiter — not a latch holder; flush wakes it separately. Advance prev_thrd_entry and continue, leaving it in place — the “skip the FLUSH waiter” rule. Advancing prev rather than unlinking keeps the FLUSH entry queued and followers reachable behind it.
R/W waiter — enter the inner CAS loop (compare_exchange_strong) on a fresh impl:
- latch_mode == NO_LATCH, or (latch_mode == READ and waiter wants READ): grantable. Lock the thread entry; re-check request_latch_mode == PGBUF_NO_LATCH (it may have timed out between the outer test and the lock) — if so unlink, can_grant = false, break. Else can_grant = true, add request_fix_count to fcnt, set latch_mode to the waiter’s mode.
- Else if latch_mode == READ (page already R-granted this pass, this waiter wants WRITE): set prev_thrd_entry = thrd_entry and break only the inner CAS loop. should_stop stays false. The writer is left in place and skipped; the outer walk continues to look for more READ waiters behind it (“Look for other readers.”). The walk truly stops only at the WRITE arm.
- Else (latch_mode == WRITE): should_stop = true, break.
After the inner loop, should_stop breaks the outer loop; otherwise, if can_grant is set, the waiter is unlinked and woken with pgbuf_wakeup.

Net effect (matching the header comment): each READ grant leaves latch_mode == READ, so all consecutive READ waiters at the head are woken; a WRITE waiter met while the page is R-granted is skipped and scanning continues behind it; only a held WRITE latch (the WRITE arm via should_stop) ends the pass, so at most one writer is granted.
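The net effect is easiest to check on a small model. The sketch below is single-threaded and uses hypothetical types and names (`bcb_t`, `waiter_t`, `wake_waiters`); it drops the inner CAS loop and the thread-entry re-check and keeps only the branch structure of the walk.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum { NO_LATCH, LATCH_READ, LATCH_WRITE, LATCH_FLUSH } latch_mode_t;

typedef struct waiter {
    latch_mode_t request_latch_mode;
    int request_fix_count;
    bool woken;
    struct waiter *next_wait_thrd;
} waiter_t;

typedef struct {
    latch_mode_t latch_mode;
    int fcnt;
    waiter_t *next_wait_thrd;
} bcb_t;

static void unlink_waiter(bcb_t *b, waiter_t *prev, waiter_t *w) {
    if (prev == NULL) {
        b->next_wait_thrd = w->next_wait_thrd;
    } else {
        prev->next_wait_thrd = w->next_wait_thrd;
    }
    w->next_wait_thrd = NULL;
}

/* Returns the number of waiters woken in this pass. */
static int wake_waiters(bcb_t *b) {
    waiter_t *prev = NULL, *next = NULL;
    int woken = 0;

    assert(b->latch_mode == NO_LATCH && b->fcnt == 0);
    for (waiter_t *w = b->next_wait_thrd; w != NULL; w = next) {
        next = w->next_wait_thrd;
        if (w->request_latch_mode == NO_LATCH) {    /* gave up: clean it out */
            unlink_waiter(b, prev, w);
            continue;
        }
        if (w->request_latch_mode == LATCH_FLUSH) { /* leave queued, do not wake */
            prev = w;
            continue;
        }
        if (b->latch_mode == NO_LATCH
            || (b->latch_mode == LATCH_READ && w->request_latch_mode == LATCH_READ)) {
            b->latch_mode = w->request_latch_mode;  /* grant */
            b->fcnt += w->request_fix_count;
            unlink_waiter(b, prev, w);
            w->woken = true;
            woken++;
        } else if (b->latch_mode == LATCH_READ) {
            prev = w;                               /* writer skipped; look for more readers */
        } else {
            break;                                  /* WRITE granted: stop the pass */
        }
    }
    return woken;
}
```

For a queue of reader, writer, reader, timed-out entry, FLUSH waiter, reader on an idle page, the pass wakes the three readers and leaves the writer and the FLUSH waiter queued in that order.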

Finally the function recomputes the hint:

// pgbuf_wakeup_reader_writer -- src/storage/page_buffer.c
if (!pgbuf_is_exist_blocked_reader_writer (bufptr))
    set_waiter_exists (&bufptr->atomic_latch, false);  /* clear the guard */

Invariant — waiter_exists is true iff a R/W waiter is queued. pgbuf_is_exist_blocked_reader_writer walks next_wait_thrd and counts only PGBUF_LATCH_READ/PGBUF_LATCH_WRITE entries (FLUSH and NO_LATCH don’t count). A stale bit after the last R/W waiter was woken would make Case 1 of pgbuf_latch_bcb_upon_fix block fresh readers forever (phantom starvation).
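The recomputation amounts to a filtered scan of the queue. A minimal sketch, with hypothetical names and a boolean result instead of the source's counting:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum { NO_LATCH, LATCH_READ, LATCH_WRITE, LATCH_FLUSH } latch_mode_t;

typedef struct waiter {
    latch_mode_t request_latch_mode;
    struct waiter *next_wait_thrd;
} waiter_t;

/* True only if a READ or WRITE waiter is queued; FLUSH and NO_LATCH entries are ignored. */
static bool has_blocked_reader_writer(const waiter_t *head) {
    for (const waiter_t *w = head; w != NULL; w = w->next_wait_thrd) {
        if (w->request_latch_mode == LATCH_READ || w->request_latch_mode == LATCH_WRITE) {
            return true;
        }
    }
    return false;
}
```

A queue that holds only a FLUSH waiter and a timed-out entry therefore clears the bit, which is what lets fresh readers back in through Case 1.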

5.7 Chapter summary — key takeaways

The latch is one uint64_t (pgbuf_atomic_latch_impl): latch_mode | waiter_exists | fcnt (16 + 16 + 32 bits). Updating all three fields with a single-word CAS is what keeps the mode/waiter/count triple tear-free under concurrency.
pgbuf_latch_bcb_upon_fix is a branch-complete tree retried under one compare_exchange_strong: idle short-circuit; R-on-R (grant unless a waiter exists and the caller is not a holder — the writer-starvation guard); W-holder passthrough (sub-2-1); sole-reader in-place promotion (sub-2-2); and block paths for strangers (Case 3) and contended promotions.
The sole-reader in-place R-to-W flip (sub-2-2) is live and reachable; the source-anchored promotion API is pgbuf_promote_read_latch_debug plus the one-promoter assert in pgbuf_block_bcb. Any deprecation of a contended in-place upgrade is external project history, not visible in this source.
Holders (pgbuf_holder) record this thread’s per-BCB fix_count; comparing it against the latch word’s fcnt decides “am I the only fixer?”. allocate/find/remove keep a holder on exactly one of the free or hold lists, never both (the next_holder == NULL assert).
Blocking is timed, not graph-based: pgbuf_block_bcb enqueues on next_wait_thrd — tail for ordinary waiters and the promote_needed branch of the fix path, head only for the dedicated pgbuf_promote_read_latch_debug API (as_promote). pgbuf_timed_sleep waits pgbuf_latch_timeout; a timeout makes the waiter a deadlock victim and pgbuf_timed_sleep_error_handling removes it (repairing the head case by waking shadowed readers).
pgbuf_wakeup_reader_writer walks the queue once: clean NO_LATCH corpses, skip FLUSH waiters, wake all consecutive READ waiters, grant at most one WRITE. A WRITE waiter met mid-pass while readers are granted does not stop further reader grants — only a held WRITE latch (should_stop) ends the walk.
waiter_exists is a hint recomputed exactly by pgbuf_is_exist_blocked_reader_writer after every wake, so a stale bit cannot phantom-starve fresh readers.

Chapter 6: Dirtying a Page and the Packed Flags and Zone Word

A modification to a page is three small acts under the BCB’s write latch: stamp the page image with the redo LSA, mark the BCB dirty, and — the first time only — record its oldest unflushed LSA. None takes a lock; they mutate one 32-bit word, bcb->flags, with a lock-free compare-and-swap retry loop, plus one separately-packed counter word. This chapter dissects that accessor layer, which every later chapter — unfix (Ch 7), flush under WAL (Ch 8), victim selection (Ch 9) — reads or mutates, so its invariants are load-bearing. For the why — the WAL contract, the checkpoint horizon — see the high-level companion; this chapter does not re-derive that theory.

6.1 One word, three fields: flags, zone, lru index

PGBUF_BCB::flags is volatile int. The source’s own comment fixes the layout exactly (Figure 6-1): “(bcb flags + zone = 2 bytes) + (lru index = 2 bytes)”. So the 32-bit word splits at the half: the low 16 bits are the LRU list index (PGBUF_LRU_INDEX_MASK = 0x0000FFFF, since PGBUF_LRU_NBITS = 16), and the high 16 bits carry the flag bits and the zone selector together.

// pgbuf_bcb (struct) -- src/storage/page_buffer.c
  volatile int flags;            /* <- packed: flag bits + zone + lru index */
  // ... condensed ...
  volatile int count_fix_and_avoid_dealloc;  /* <- a SECOND packed word, see 6.8 */
  LOG_LSA oldest_unflush_lsa;    /* <- WAL watermark, established once per dirty cycle */

Within that high half the two namespaces are bit-disjoint: the seven flag bits occupy the very top (0x80000000..0x02000000, bits 25-31), while the zone enum sits just above the index, at bits 16-19. Zone values are shifts of PGBUF_LRU_NBITS: the LRU sub-zones are 1<<16, 2<<16, 3<<16; the non-LRU zones skip two further bits (PGBUF_LRU_NBITS + 2 = 18) so they cannot collide with the LRU mask — PGBUF_INVALID_ZONE = 1<<18, PGBUF_VOID_ZONE = 2<<18. Because PGBUF_BCB_FLAGS_MASK and PGBUF_ZONE_MASK | PGBUF_LRU_INDEX_MASK never share a bit, pgbuf_bcb_update_flags can touch only flag bits and pgbuf_bcb_change_zone only zone+index — each preserving the other — both via CAS on the same word.

A BCB is born with PGBUF_BCB_INIT_FLAGS = PGBUF_INVALID_ZONE: no flag bits set, zone = INVALID, index 0. That is the start state of Figure 6-3.

The complete zone catalogue:

Zone value	Bits	Numeric	Meaning	Set by (zone moves go through `pgbuf_bcb_change_zone`)
`PGBUF_INVALID_ZONE`	`1<<18`	`0x00040000`	free/uninitialized BCB (on invalid list)	`PGBUF_BCB_INIT_FLAGS`; reset on free
`PGBUF_VOID_ZONE`	`2<<18`	`0x00080000`	transient: read from disk before list insert, or removed from list before victimizing	read-from-disk path; victim extraction
`PGBUF_LRU_1_ZONE`	`1<<16`	`0x00010000`	hottest LRU sub-zone; never victimized	unfix/boost into zone 1
`PGBUF_LRU_2_ZONE`	`2<<16`	`0x00020000`	buffer sub-zone between hot and victim; never victimized	LRU zone adjustment
`PGBUF_LRU_3_ZONE`	`3<<16`	`0x00030000`	victimization sub-zone; only zone with eligible candidates	LRU zone adjustment / fall from zone 2

Three masks decode it: PGBUF_LRU_ZONE_MASK (= (1|2|3) << 16 = 0x00030000) ORs the three LRU sub-zone values; PGBUF_ZONE_MASK (= PGBUF_LRU_ZONE_MASK | PGBUF_INVALID_ZONE | PGBUF_VOID_ZONE = 0x000F0000) covers every zone; PGBUF_LRU_INDEX_MASK carries the low-16 list index. PGBUF_GET_ZONE(flags) is (PGBUF_ZONE)(flags & PGBUF_ZONE_MASK).
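The layout arithmetic can be verified mechanically. The sketch below rebuilds the masks from the values quoted in this section; the unprefixed names are stand-ins, `FLAGS_MASK` is computed here as the OR of the seven listed flag bits, and `make_zone` is an assumed shape for the zone-plus-index packing.

```c
#include <assert.h>
#include <stdint.h>

#define LRU_NBITS        16
#define LRU_INDEX_MASK   0x0000FFFFu
#define LRU_1_ZONE       (1u << LRU_NBITS)
#define LRU_2_ZONE       (2u << LRU_NBITS)
#define LRU_3_ZONE       (3u << LRU_NBITS)
#define INVALID_ZONE     (1u << (LRU_NBITS + 2))
#define VOID_ZONE        (2u << (LRU_NBITS + 2))
#define LRU_ZONE_MASK    (LRU_1_ZONE | LRU_2_ZONE | LRU_3_ZONE)
#define ZONE_MASK        (LRU_ZONE_MASK | INVALID_ZONE | VOID_ZONE)
#define FLAGS_MASK       0xFE000000u   /* OR of the seven flag bits, 0x80000000..0x02000000 */

/* Pack a zone value and a list index into the low 20 bits of the word. */
static uint32_t make_zone(uint32_t lru_idx, uint32_t zone) {
    assert((lru_idx & ~LRU_INDEX_MASK) == 0);
    return zone | lru_idx;
}

static uint32_t get_zone(uint32_t flags) { return flags & ZONE_MASK; }

static uint32_t get_lru_index(uint32_t flags) { return flags & LRU_INDEX_MASK; }

/* Zones 1/2/3 share bits with LRU_ZONE_MASK; INVALID and VOID do not. */
static int is_in_lru(uint32_t flags) { return (get_zone(flags) & LRU_ZONE_MASK) != 0; }
```

Because the flag mask and the zone-plus-index masks share no bit, a word such as `0x80000000u | make_zone(5, LRU_3_ZONE)` decodes to zone 3, index 5 regardless of which flags are set.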

6.2 The flag catalogue: every bit, producer, and consumer

Seven flags live in the high bits of flags. The composite PGBUF_BCB_INVALID_VICTIM_CANDIDATE_MASK is the OR of the first four — the states that disqualify a BCB from being victimized; “Blocks victim?” marks membership in that mask. The table carries producer, clearer, and reader per flag.

Flag	Bit	Producer	Cleared by	Reader	Blocks victim?
`PGBUF_BCB_DIRTY_FLAG`	`0x80000000`	`pgbuf_bcb_set_dirty`, `_update_flags`	`_clear_dirty`, `_mark_is_flushing`	`pgbuf_bcb_is_dirty`	yes
`PGBUF_BCB_FLUSHING_TO_DISK_FLAG`	`0x40000000`	`pgbuf_bcb_mark_is_flushing`	`_mark_was_flushed` / `_was_not_flushed`	`pgbuf_bcb_is_flushing`	yes
`PGBUF_BCB_VICTIM_DIRECT_FLAG`	`0x20000000`	direct-victim hand-off (Ch 9)	replaced by INVALIDATE	`pgbuf_bcb_is_direct_victim`	yes
`PGBUF_BCB_INVALIDATE_DIRECT_VICTIM_FLAG`	`0x10000000`	fixer grabbing a direct victim (Ch 4/5)	when the waiter re-requests	`pgbuf_bcb_is_invalid_direct_victim`	yes
`PGBUF_BCB_MOVE_TO_LRU_BOTTOM_FLAG`	`0x08000000`	dealloc path	unfix that moves it (Ch 7)	`pgbuf_bcb_should_be_moved_to_bottom_lru`	no
`PGBUF_BCB_TO_VACUUM_FLAG`	`0x04000000`	`pgbuf_notify_vacuum_follows`	vacuum routing	`pgbuf_bcb_is_to_vacuum`	no
`PGBUF_BCB_ASYNC_FLUSH_REQ`	`0x02000000`	async flush requesters	`pgbuf_bcb_mark_is_flushing`	`pgbuf_bcb_is_async_flush_request`	no

mark_is_flushing (when the page is dirty) clears both DIRTY — the flush captured the image — and ASYNC_FLUSH_REQ — the request is now in flight — while setting FLUSHING: one transition swaps three bits.

// PGBUF_BCB_INVALID_VICTIM_CANDIDATE_MASK -- src/storage/page_buffer.c
#define PGBUF_BCB_INVALID_VICTIM_CANDIDATE_MASK \
  (PGBUF_BCB_DIRTY_FLAG \
   | PGBUF_BCB_FLUSHING_TO_DISK_FLAG \
   | PGBUF_BCB_VICTIM_DIRECT_FLAG \
   | PGBUF_BCB_INVALIDATE_DIRECT_VICTIM_FLAG)   /* <- the 4 disqualifiers; the other 3 flags are victim-neutral */

Invariant — the victim-candidate count tracks exactly the BCBs in LRU zone 3 with none of the four disqualifier bits. Any transition adding/removing a disqualifier bit while the BCB sits in zone 3 must symmetrically add/remove it from candidacy, enforced in pgbuf_bcb_update_flags, pgbuf_bcb_change_zone, and the pgbuf_bcb_set_dirty fast path. Omit it in any one and the LRU victim counter drifts, so victimizers skip valid candidates or chase phantoms (Ch 9). pgbuf_bcb_avoid_victim is the read-side query of the same mask.
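The invariant can be phrased as a delta function: for any transition of the flags word, the counter must move by the change in "zone 3 and no disqualifier bit". This is an illustrative sketch with stand-in names; `candidate_delta` is hypothetical and does not exist in the source, which spreads the same logic over three functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define DIRTY_FLAG                     0x80000000u
#define FLUSHING_TO_DISK_FLAG          0x40000000u
#define VICTIM_DIRECT_FLAG             0x20000000u
#define INVALIDATE_DIRECT_VICTIM_FLAG  0x10000000u
#define TO_VACUUM_FLAG                 0x04000000u
#define INVALID_VICTIM_CANDIDATE_MASK \
    (DIRTY_FLAG | FLUSHING_TO_DISK_FLAG | VICTIM_DIRECT_FLAG | INVALIDATE_DIRECT_VICTIM_FLAG)
#define ZONE_MASK                      0x000F0000u
#define LRU_1_ZONE                     0x00010000u
#define LRU_2_ZONE                     0x00020000u
#define LRU_3_ZONE                     0x00030000u

/* Read-side query: any of the four disqualifier bits set? */
static bool avoid_victim(uint32_t flags) {
    return (flags & INVALID_VICTIM_CANDIDATE_MASK) != 0;
}

/* A BCB is counted only while it sits in zone 3 with no disqualifier bit. */
static bool is_counted_candidate(uint32_t flags) {
    return (flags & ZONE_MASK) == LRU_3_ZONE && !avoid_victim(flags);
}

/* -1, 0 or +1: how the victim-candidate count must move for old -> new. */
static int candidate_delta(uint32_t old_flags, uint32_t new_flags) {
    return (int) is_counted_candidate(new_flags) - (int) is_counted_candidate(old_flags);
}
```

Victim-neutral flags and transitions outside zone 3 produce a delta of zero, which is why only the three mutators named above need the bookkeeping.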

6.3 The shared CAS-loop shape: `pgbuf_bcb_update_flags` and `pgbuf_bcb_change_zone`

Both share one lock-free skeleton — read bcb->flags, compute the new word, CAS, retry — differing only in which half they recompute and what they reconcile afterward.

pgbuf_bcb_update_flags is the general flag mutator: set some bits, clear others, preserving zone and unnamed flags. Every flag transition except the dirty fast path goes through it.

// pgbuf_bcb_update_flags -- src/storage/page_buffer.c
  assert ((set_flags & (~PGBUF_BCB_FLAGS_MASK)) == 0);    /* <- callers may only touch flag bits ... */
  assert ((clear_flags & (~PGBUF_BCB_FLAGS_MASK)) == 0);  /* <- ... never zone/index bits */
  do
    {
      old_flags = bcb->flags;
      new_flags = old_flags | set_flags;
      new_flags = new_flags & (~clear_flags);
      if (old_flags == new_flags)
        return;                       /* <- no-op: bits already as desired, skip CAS + bookkeeping (contention saver) */
    }
  while (!ATOMIC_CAS_32 (&bcb->flags, old_flags, new_flags));

pgbuf_bcb_change_zone does the opposite: same loop, but the assignment recomputes zone+index — new_flags = (old_flags & PGBUF_BCB_FLAGS_MASK) | new_zone_idx; where new_zone_idx = PGBUF_MAKE_ZONE (new_lru_idx, new_zone) — preserving all flag bits, and it has no early no-op return.

After the CAS the two diverge. update_flags runs two fix-ups (Figure 6-2): a zone-3 victim-candidacy adjustment (only when PGBUF_GET_ZONE (old_flags) == PGBUF_LRU_3_ZONE — read from old_flags since the zone never changes here), and a dirties_cnt adjustment keyed on whether DIRTY toggled, closed by an assert pinning 0 <= dirties_cnt <= num_buffers.

flowchart TD
  A["enter: set_flags, clear_flags — Figure 6-2"] --> B["old = bcb->flags<br/>new = (old | set) & ~clear"]
  B --> C{"old == new?"}
  C -->|yes| R["return (no-op)"]
  C -->|no| D{"CAS(flags, old, new)?"}
  D -->|fail| B
  D -->|ok| E{"zone(old) == LRU_3?"}
  E -->|yes| F{"victim candidacy changed?"}
  F -->|"became valid"| G["lru_add_victim_candidate"]
  F -->|"became invalid"| H["lru_remove_victim_candidate"]
  F -->|"no change"| I
  E -->|no| I["dirty bit toggled?"]
  G --> I
  H --> I
  I -->|"set->clear"| J["dirties_cnt -= 1"]
  I -->|"clear->set"| K["dirties_cnt += 1"]
  I -->|"unchanged"| L["assert range; done"]
  J --> L
  K --> L
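The flow in Figure 6-2 can be sketched with C11 atomics. This is a minimal model under stated assumptions: `bcb_t` holds only the flags word, the two global counters stand in for the pool monitor's dirties_cnt and the per-list victim-candidate count, and `atomic_compare_exchange_weak` replaces ATOMIC_CAS_32.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define DIRTY_FLAG            0x80000000u
#define INVALID_VICTIM_MASK   0xF0000000u
#define FLAGS_MASK            0xFE000000u
#define ZONE_MASK             0x000F0000u
#define LRU_3_ZONE            0x00030000u

typedef struct { _Atomic uint32_t flags; } bcb_t;   /* stand-in: flags word only */

static atomic_long dirties_cnt;          /* stand-in for the pool monitor counter */
static atomic_long victim_candidates;    /* stand-in for the per-list candidate count */

static void update_flags(bcb_t *bcb, uint32_t set_flags, uint32_t clear_flags) {
    uint32_t old_flags, new_flags;

    assert((set_flags & ~FLAGS_MASK) == 0);      /* flag bits only */
    assert((clear_flags & ~FLAGS_MASK) == 0);
    old_flags = atomic_load(&bcb->flags);
    do {
        new_flags = (old_flags | set_flags) & ~clear_flags;
        if (old_flags == new_flags) {
            return;                              /* no-op: skip CAS and bookkeeping */
        }
        /* On failure the CAS reloads old_flags, so the loop recomputes new_flags. */
    } while (!atomic_compare_exchange_weak(&bcb->flags, &old_flags, new_flags));

    if ((old_flags & ZONE_MASK) == LRU_3_ZONE) { /* zone cannot change in this function */
        int was_valid = (old_flags & INVALID_VICTIM_MASK) == 0;
        int is_valid = (new_flags & INVALID_VICTIM_MASK) == 0;
        if (!was_valid && is_valid) {
            atomic_fetch_add(&victim_candidates, 1);
        } else if (was_valid && !is_valid) {
            atomic_fetch_sub(&victim_candidates, 1);
        }
    }
    if ((old_flags ^ new_flags) & DIRTY_FLAG) {  /* DIRTY toggled */
        atomic_fetch_add(&dirties_cnt, (new_flags & DIRTY_FLAG) ? 1 : -1);
    }
    assert(atomic_load(&dirties_cnt) >= 0);
}
```

Setting DIRTY on a clean zone-3 BCB moves both counters once; repeating the call takes the early return and changes nothing.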

change_zone reconciles per-list zone counters and victim candidacy. Zone moves run under the LRU list mutex, so count_lru1/2/3 are plain increments, not atomics — only the CAS on flags is lock-free, since a concurrent pgbuf_set_dirty may flip a flag bit on the same word with no mutex. Branch map:

is_valid_victim_candidate = (old_flags & INVALID_VICTIM_CANDIDATE_MASK) == 0 — a flags property, unchanged by the move, so it holds on both sides.
Leaving (old_flags & PGBUF_LRU_ZONE_MASK): decrement lru_shared_pgs_cnt if the old list was shared; switch on old zone decrementing the right count_lruN; in zone 3 and valid-candidate, pgbuf_lru_remove_victim_candidate; default: assert(false).
Entering (new_zone & PGBUF_LRU_ZONE_MASK): symmetric increments, with pgbuf_lru_add_victim_candidate in zone 3 for a valid candidate; default: assert(false).

The default: assert(false) arms encode a totality invariant: an LRU-zone BCB is in exactly one of zones 1/2/3 (the zone field is a single value, not a mask of memberships). A second assert guards hint coherence: lru_list->victim_hint != bcb || zone(old) != LRU_3 — the hint must already have been retargeted before a zone-3 BCB leaves, unless checkpoint (via update_flags) is concurrently retargeting it.
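The leaving/entering reconciliation can be sketched the same way. The model below is an assumption-laden simplification: `lru_list_t` keeps only the three zone counts and a candidate count, the shared-page counter and the victim hint are omitted, and the LRU list mutex the text requires is not modelled.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define FLAGS_MASK          0xFE000000u
#define INVALID_VICTIM_MASK 0xF0000000u
#define LRU_INDEX_MASK      0x0000FFFFu
#define LRU_1_ZONE          0x00010000u
#define LRU_2_ZONE          0x00020000u
#define LRU_3_ZONE          0x00030000u
#define LRU_ZONE_MASK       0x00030000u
#define VOID_ZONE           0x00080000u

typedef struct { _Atomic uint32_t flags; } bcb_t;
typedef struct { int count_lru1, count_lru2, count_lru3, victim_candidates; } lru_list_t;

/* Adjust one list's counters by delta (+1 entering, -1 leaving). */
static void account(lru_list_t *l, uint32_t zone, int valid_candidate, int delta) {
    switch (zone) {
    case LRU_1_ZONE: l->count_lru1 += delta; break;
    case LRU_2_ZONE: l->count_lru2 += delta; break;
    case LRU_3_ZONE:
        l->count_lru3 += delta;
        if (valid_candidate) {
            l->victim_candidates += delta;
        }
        break;
    default: assert(0);   /* an LRU BCB is in exactly one of zones 1/2/3 */
    }
}

static void change_zone(bcb_t *bcb, lru_list_t *lists, uint32_t new_lru_idx, uint32_t new_zone) {
    uint32_t old_flags = atomic_load(&bcb->flags), new_flags;
    do {
        new_flags = (old_flags & FLAGS_MASK) | new_zone | new_lru_idx;  /* keep flag bits */
    } while (!atomic_compare_exchange_weak(&bcb->flags, &old_flags, new_flags));

    int valid = (old_flags & INVALID_VICTIM_MASK) == 0;   /* unchanged by the move */
    if (old_flags & LRU_ZONE_MASK) {                      /* leaving an LRU zone */
        account(&lists[old_flags & LRU_INDEX_MASK], old_flags & LRU_ZONE_MASK, valid, -1);
    }
    if (new_zone & LRU_ZONE_MASK) {                       /* entering an LRU zone */
        account(&lists[new_lru_idx], new_zone, valid, +1);
    }
}
```

Moving a clean BCB from VOID into zone 1, then zone 3, then back to VOID leaves every counter at zero and the flag bits untouched.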

stateDiagram-v2
  [*] --> INVALID: init flags = PGBUF_INVALID_ZONE
  INVALID --> VOID: read from disk
  VOID --> LRU: unfix inserts into list
  LRU --> LRU: adjust zones 1 to 2 to 3
  LRU --> VOID: selected as victim
  VOID --> [*]: reused for new page
  note right of LRU
    only LRU_3 sub-zone
    is victim-eligible
  end note

Figure 6-3: zone transitions driven by pgbuf_bcb_change_zone. The flag namespace rides along untouched on every edge.

6.4 `pgbuf_bcb_get_zone` and the decode macros

pgbuf_bcb_get_zone is a pure decode — it masks the word and returns the zone enum:

// pgbuf_bcb_get_zone -- src/storage/page_buffer.c
STATIC_INLINE PGBUF_ZONE
pgbuf_bcb_get_zone (const PGBUF_BCB * bcb)
{
  return PGBUF_GET_ZONE (bcb->flags);   /* <- (flags & PGBUF_ZONE_MASK) */
}

Two macros build on it to answer the two questions later chapters ask most:

// PGBUF_IS_BCB_IN_LRU* -- src/storage/page_buffer.c
#define PGBUF_IS_BCB_IN_LRU_VICTIM_ZONE(bcb) (pgbuf_bcb_get_zone (bcb) == PGBUF_LRU_3_ZONE)
#define PGBUF_IS_BCB_IN_LRU(bcb) ((pgbuf_bcb_get_zone (bcb) & PGBUF_LRU_ZONE_MASK) != 0)

PGBUF_IS_BCB_IN_LRU_VICTIM_ZONE is exact-equality (only zone 3 is victim-eligible); PGBUF_IS_BCB_IN_LRU is a mask test — PGBUF_LRU_ZONE_MASK ORs all three LRU sub-zone bits, so zones 1/2/3 match but VOID (2<<18) and INVALID (1<<18) do not, since their bits fall outside the mask. pgbuf_bcb_get_lru_index asserts PGBUF_IS_BCB_IN_LRU before returning the low-16 index.

6.5 Setting dirty: three entry points, one fast path

A modifier reaches dirty through a tiny call chain. The public pgbuf_set_dirty recovers the BCB via CAST_PGPTR_TO_BFPTR, validates vpid (debug only), delegates to pgbuf_set_dirty_buffer_ptr, then unfixes only if the caller passed free_page == FREE. pgbuf_set_dirty_buffer_ptr is the latch/perf layer over the real mutator:

// pgbuf_set_dirty_buffer_ptr -- src/storage/page_buffer.c
  pgbuf_bcb_set_dirty (thread_p, bufptr);
  holder = pgbuf_find_thrd_holder (thread_p, bufptr);
  assert (get_latch (&bufptr->atomic_latch) == PGBUF_LATCH_WRITE);  /* <- dirtier MUST hold the write latch */
  assert (holder != NULL);
  // ... condensed: mark holder->perf_stat.dirtied_by_holder, perfmon PSTAT_PB_NUM_DIRTIES ...

Invariant — a page is only dirtied while its setter holds the BCB write latch. The assert (get_latch (...) == PGBUF_LATCH_WRITE) enforces it, serializing concurrent writers (Ch 5) so the DIRTY/LSA pair stays consistent though each is mutated lock-free. The CAS in pgbuf_bcb_set_dirty defends only against other threads racing on unrelated bits of the same word (a no-latch change_zone), not against two writers.

pgbuf_bcb_set_dirty is a hand-coded fast path that bypasses update_flags because dirtying is the hottest case (the source comment says so explicitly):

// pgbuf_bcb_set_dirty -- src/storage/page_buffer.c
  do
    {
      old_flags = bcb->flags;
      if (old_flags & PGBUF_BCB_DIRTY_FLAG)
        return;                       /* <- already dirty: skip CAS + counter (common case) */
    }
  while (!ATOMIC_CAS_32 (&bcb->flags, old_flags, old_flags | PGBUF_BCB_DIRTY_FLAG));
  ATOMIC_INC_64 (&pgbuf_Pool.monitor.dirties_cnt, 1);     /* <- dirties_cnt += 1; assert range follows */
  if (PGBUF_GET_ZONE (old_flags) == PGBUF_LRU_3_ZONE
      && (old_flags & PGBUF_BCB_INVALID_VICTIM_CANDIDATE_MASK) == 0)
    pgbuf_lru_remove_victim_candidate (thread_p, pgbuf_lru_list_from_bcb (bcb), bcb);  /* <- newly dirty -> drop candidacy */

Branch map: (1) already-dirty → early return; (2) CAS sets the bit; (3) dirties_cnt += 1 then assert range; (4) if the BCB was a valid zone-3 candidate before DIRTY (the test reads old_flags), the new bit disqualifies it, so remove it — the §6.2 invariant inlined for speed. Note this fast path only ever sets DIRTY, so unlike update_flags it never needs the dirty-cleared branch or the add-candidate branch.

6.6 Recording the oldest unflushed LSA — `pgbuf_set_lsa`

pgbuf_set_lsa (log/recovery manager only) stamps the redo LSA and establishes oldest_unflush_lsa once per dirty cycle, with special branches for temporary and auxiliary volumes:

// pgbuf_set_lsa -- src/storage/page_buffer.c
  // ... condensed: debug-gated page-pointer validation may return NULL; assert (lsa_ptr != NULL) ...
  if (pgbuf_is_temp_lsa (bufptr->iopage_buffer->iopage.prv.lsa)
      || PGBUF_IS_AUXILIARY_VOLUME (bufptr->vpid.volid) == true)
    return NULL;                       /* <- branch 2: temp/aux pages are never WAL-tracked: bail */
  if (pgbuf_is_temporary_volume (bufptr->vpid.volid) == true)
    {
      pgbuf_init_temp_page_lsa (&bufptr->iopage_buffer->iopage, IO_PAGESIZE);  /* <- branch 3: force sentinel temp LSA */
      if (logtb_is_current_active (thread_p))
        return NULL;                   /* <- active txn on temp page carries no real LSA */
    }
  fileio_set_page_lsa (&bufptr->iopage_buffer->iopage, lsa_ptr, IO_PAGESIZE);  /* <- branch 4: write redo LSA into image */
  if (LSA_ISNULL (&bufptr->oldest_unflush_lsa))    /* <- branch 5: FIRST dirty since last flush? */
    {
      if (LSA_LT (lsa_ptr, &log_Gl.chkpt_redo_lsa))
        { /* ... condensed: re-read chkpt_redo_lsa under chkpt_lsa_lock; if still older,
             raise ER_LOG_CHECKPOINT_SKIP_INVALID_PAGE + assert(false) ... */ }
      LSA_COPY (&bufptr->oldest_unflush_lsa, lsa_ptr);   /* <- watermark established */
    }
  // ... condensed: branch 6, #if defined(NDEBUG) also calls pgbuf_set_dirty_buffer_ptr (safety net) ...

Two facts the comments compress: pgbuf_is_temp_lsa compares the stored LSA against sentinel PGBUF_TEMP_LSA = { NULL_LOG_PAGEID - 1, NULL_LOG_OFFSET - 1 } (i.e. (-2,-2)), and the watermark lives here, not in set-dirty, because pages can be dirtied before any LSA exists — so it anchors at the first LSA set. The release-build #if defined(NDEBUG) tail forcing pgbuf_set_dirty_buffer_ptr is a safety net for any missed set-dirty call, since an LSA was just written and must be flushed.

Invariant — oldest_unflush_lsa is the LSA of the earliest modification not yet on disk, set once on the first dirty after a clean state and cleared only on flush. The LSA_ISNULL guard makes later set_lsa calls in the same cycle leave it untouched; it never advances forward. Ch 8’s WAL rule reads it for log-flush ordering and resets it to NULL on a successful flush; checkpoint reads it to find the oldest dirty page.
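The set-once behaviour is small enough to model directly. In this sketch `lsa_t` and `page_t` are simplified stand-ins for LOG_LSA and the BCB, a null LSA is assumed to have page id -1 (consistent with the (-2,-2) sentinel being "null minus one"), and the temp-volume and checkpoint branches are left out.

```c
#include <assert.h>
#include <stdbool.h>

typedef struct { long long pageid; int offset; } lsa_t;   /* stand-in for LOG_LSA */

typedef struct {
    lsa_t page_lsa;            /* LSA stamped into the page image */
    lsa_t oldest_unflush_lsa;  /* watermark: first LSA since the last flush */
} page_t;

static lsa_t null_lsa(void) {
    lsa_t l = { -1, -1 };      /* assumed encoding of a null LSA */
    return l;
}

static bool lsa_is_null(lsa_t l) { return l.pageid == -1; }

/* Stamp the page LSA every time; establish the watermark only when it is null. */
static void record_lsa(page_t *p, lsa_t lsa) {
    p->page_lsa = lsa;
    if (lsa_is_null(p->oldest_unflush_lsa)) {
        p->oldest_unflush_lsa = lsa;
    }
}

/* A successful flush clears the watermark and starts a new dirty cycle. */
static void mark_flushed(page_t *p) {
    p->oldest_unflush_lsa = null_lsa();
}
```

Two modifications at LSAs (10,0) and (12,8) leave the watermark at (10,0) while the page LSA advances to (12,8); only after a flush can a later modification set a new watermark.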

6.7 The readers: `pgbuf_bcb_is_dirty` and `pgbuf_bcb_avoid_victim`

Both are single-mask predicates over the same word — no lock, just a volatile read:

// pgbuf_bcb_is_dirty -- src/storage/page_buffer.c
STATIC_INLINE bool
pgbuf_bcb_is_dirty (const PGBUF_BCB * bcb)
{
  return (bcb->flags & PGBUF_BCB_DIRTY_FLAG) != 0;
}

// pgbuf_bcb_avoid_victim -- src/storage/page_buffer.c
STATIC_INLINE bool
pgbuf_bcb_avoid_victim (const PGBUF_BCB * bcb)
{
  return (bcb->flags & PGBUF_BCB_INVALID_VICTIM_CANDIDATE_MASK) != 0;  /* <- ANY of the 4 disqualifiers */
}

The relation is hierarchical: a dirty BCB always makes avoid_victim true (DIRTY ∈ the mask), but avoid_victim can also be true for a clean BCB that is mid-flush or is a direct victim (live or invalidated). Hence Ch 9’s victim scan calls avoid_victim, not is_dirty: dirtiness is only one of four ways to be ineligible. The sibling per-flag readers (pgbuf_bcb_is_flushing, _is_direct_victim, _is_invalid_direct_victim, _is_async_flush_request, _is_to_vacuum, _should_be_moved_to_bottom_lru) are each the same one-bit (flags & FLAG) != 0 test.

6.8 The dual-purpose counter — `count_fix_and_avoid_dealloc`

A separate volatile word, never overlapping flags, packs two 16-bit sub-counters into one 32-bit int so each can be mutated by a single atomic:

Sub-field	Bits	Mask / shift	Mutators	Reader
avoid-dealloc count	low 16	`PGBUF_BCB_AVOID_DEALLOC_MASK = 0x0000FFFF`	`pgbuf_bcb_register_avoid_deallocation` (`+1`), `_unregister_` (`-1`, CAS)	`pgbuf_bcb_should_avoid_deallocation`
fix count	high 16	`<< PGBUF_BCB_COUNT_FIX_SHIFT_BITS (16)`	`pgbuf_bcb_register_fix` (`+ 1<<16`, capped)	`pgbuf_bcb_is_hot`

They are merged (per the struct comment) because avoid-dealloc must change atomically yet 2-byte atomics are uncommon, so both ride in one CPU-native 4-byte word:

// pgbuf_bcb_register_avoid_deallocation -- src/storage/page_buffer.c
  assert ((bcb->count_fix_and_avoid_dealloc & 0x00008000) == 0);  /* <- low-half top bit clear: overflow guard */
  (void) ATOMIC_INC_32 (&bcb->count_fix_and_avoid_dealloc, 1);    /* <- +1 touches only the low half */

register_fix adds 1 << 16 but only while below the cap PGBUF_FIX_COUNT_THRESHOLD << 16 — once hot, it stops counting (hotness is a one-way latch, not a live count). pgbuf_bcb_is_hot compares against that same PGBUF_FIX_COUNT_THRESHOLD << 16 (fix count drives LRU hotness, Ch 7). The unregister path uses a CAS loop and tolerates an avoid-dealloc count already at zero (a pgbuf_ordered_fix corner case where the page was victimized and reloaded), logging via er_log_debug and breaking rather than underflowing. This counter is the second, orthogonal victim gate — fixed or dealloc-protected pages held out of reach — independent of the flag gate dirtiness rides; Ch 9 consumes both.
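The two sub-counters and their mutators can be sketched with C11 atomics. Assumptions are labelled in the code: `FIX_COUNT_THRESHOLD` is given an illustrative value because this section does not state it, `bcb_t` holds only the packed word, and the debug logging on the zero case is omitted.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define AVOID_DEALLOC_MASK    0x0000FFFF
#define COUNT_FIX_SHIFT_BITS  16
#define FIX_COUNT_THRESHOLD   64    /* assumed value, for illustration only */

typedef struct { atomic_int count_fix_and_avoid_dealloc; } bcb_t;   /* stand-in */

static void register_avoid_deallocation(bcb_t *b) {
    /* Top bit of the low half must be clear, so +1 cannot carry into the fix count. */
    assert((atomic_load(&b->count_fix_and_avoid_dealloc) & 0x00008000) == 0);
    atomic_fetch_add(&b->count_fix_and_avoid_dealloc, 1);
}

static void unregister_avoid_deallocation(bcb_t *b) {
    int old = atomic_load(&b->count_fix_and_avoid_dealloc);
    do {
        if ((old & AVOID_DEALLOC_MASK) == 0) {
            return;                 /* already zero: tolerate instead of underflowing */
        }
    } while (!atomic_compare_exchange_weak(&b->count_fix_and_avoid_dealloc, &old, old - 1));
}

static void register_fix(bcb_t *b) {
    /* Count only while below the cap; once hot, stop counting. */
    if (atomic_load(&b->count_fix_and_avoid_dealloc)
        < (FIX_COUNT_THRESHOLD << COUNT_FIX_SHIFT_BITS)) {
        atomic_fetch_add(&b->count_fix_and_avoid_dealloc, 1 << COUNT_FIX_SHIFT_BITS);
    }
}

static bool is_hot(bcb_t *b) {
    return atomic_load(&b->count_fix_and_avoid_dealloc)
           >= (FIX_COUNT_THRESHOLD << COUNT_FIX_SHIFT_BITS);
}
```

In a single-threaded run, a hundred calls to `register_fix` leave the high half at the threshold rather than at one hundred, and an extra `unregister_avoid_deallocation` on a zero count is a no-op.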

6.9 Chapter summary — key takeaways

bcb->flags is one 32-bit word split at the half: low 16 bits = LRU index (PGBUF_LRU_INDEX_MASK), high 16 bits = flag bits (0x80000000..0x02000000) plus the zone selector (bits 16-19; LRU 1..3<<16, INVALID 1<<18, VOID 2<<18). Flag and zone bits are disjoint, which lets the two mutators share the word without clobbering each other.
pgbuf_bcb_update_flags and pgbuf_bcb_change_zone follow the same CAS retry-loop pattern: the former mutates flag bits (no-op early return) and reconciles dirties_cnt plus zone-3 candidacy; the latter mutates zone+index and reconciles per-list zone counters under the LRU mutex. Every reconciliation branch must run or the victim counter drifts; the default: assert(false) arms encode that an LRU BCB is in exactly one of zones 1/2/3.
The four-bit PGBUF_BCB_INVALID_VICTIM_CANDIDATE_MASK (DIRTY, FLUSHING, VICTIM_DIRECT, INVALIDATE_DV) defines victim ineligibility; the other three flags (MOVE_TO_LRU_BOTTOM, TO_VACUUM, ASYNC_FLUSH_REQ) are victim-neutral. pgbuf_bcb_avoid_victim reads the whole mask; pgbuf_bcb_is_dirty reads one bit of it.
Dirtying takes the hand-coded fast path pgbuf_bcb_set_dirty (set-DIRTY-only CAS) rather than the general mutator, maintaining the candidacy/dirty-counter invariants inline. The BCB write latch — asserted in pgbuf_set_dirty_buffer_ptr, not the CAS — serializes concurrent writers and keeps the DIRTY/LSA pair consistent.
oldest_unflush_lsa is established exactly once per dirty cycle in pgbuf_set_lsa, guarded by LSA_ISNULL, never advanced forward, validated against the checkpoint redo horizon. Temp and auxiliary volumes are excluded (sentinel PGBUF_TEMP_LSA = (-2,-2)).
count_fix_and_avoid_dealloc is a second packed word carrying fix-count (high 16 bits, capped, drives hotness via pgbuf_bcb_is_hot) and avoid-dealloc count (low 16 bits) so both fit one native atomic — the orthogonal fix/dealloc victim gate, independent of the flag gate that dirtiness rides.

Chapter 7: Unfix LRU Movement Aout History and Private to Shared Migration

This chapter answers: on unfix, how does a BCB move through the three LRU zones, when does it boost to the top, when does it migrate from a private to a shared list, and what role does the Aout 2Q ghost list play? Zone model, private/shared split, and 2Q intent are in the high-level companion (cubrid-page-buffer-manager.md); the BCB struct, packed flags/zone word, and dirty bit are in Chapters 1 and 6 — reused here.

7.1 The unfix funnel — `pgbuf_unfix` to `pgbuf_unlatch_bcb_upon_unfix`

// pgbuf_unfix -- src/storage/page_buffer.c
CAST_PGPTR_TO_BFPTR (bufptr, pgptr);
holder_status = pgbuf_unlatch_thrd_holder (thread_p, bufptr, &holder_perf_stat);
// ... perf tracking (perfmon_pbx_unfix) elided ...
if (pgbuf_lockfree_unfix_ro (thread_p, bufptr))      /* <- pure read latch: CAS-drop fcnt, no mutex */
  return;                                            /* <- never touches LRU */
PGBUF_BCB_LOCK (bufptr);
(void) pgbuf_unlatch_bcb_upon_unfix (thread_p, bufptr, holder_status);  /* releases mutex inside */

Invariant — the read-only fast path never reorders LRU. A shared latch being dropped does not change zones; pgbuf_lockfree_unfix_ro returns true after a CAS, so reordering is reserved for the last unfixer or a writer — otherwise every reader would contend on the list mutex.

pgbuf_unlatch_bcb_upon_unfix is the decision engine; its prologue CASes the fix count down:

// pgbuf_unlatch_bcb_upon_unfix -- src/storage/page_buffer.c
do {
  blocked_reader_writer = false; is_zero_fcnt = false;
  impl_orig = get_impl (&bufptr->atomic_latch); impl_new = impl_orig;
  impl_new.impl.fcnt--;                              /* <- drop one fix */
  blocked_reader_writer = impl_orig.impl.waiter_exists;
  if (impl_new.impl.fcnt == 0) {
    is_zero_fcnt = true; impl_new.impl.latch_mode = PGBUF_NO_LATCH;  /* <- last unfixer drops latch */
  }
  if (impl_new.impl.fcnt < 0) {                      /* <- "freed too much": defensive reset */
    assert (false); er_set (...); impl_new.impl.fcnt = 0;
    impl_new.impl.waiter_exists = false;
    impl_new.impl.latch_mode = PGBUF_NO_LATCH; is_zero_fcnt = true; break;
  }
} while (!bufptr->atomic_latch.compare_exchange_weak (impl_orig.raw, impl_new.raw, ...));

The CAS (Chapter 5) yields is_zero_fcnt (last holder) and blocked_reader_writer (a latch waiter queued). Reordering runs only when is_zero_fcnt && !blocked_reader_writer — a queued waiter re-latches the BCB immediately, so moving it would be wasted work.

flowchart TD
  A["pgbuf_unlatch_bcb_upon_unfix\nCAS: fcnt--"] --> B{"is_zero_fcnt?"}
  B -->|no| W["wakeup reader/writer\nrelease mutex"]
  B -->|yes| C{"MOVE_TO_LRU_BOTTOM?"}
  C -->|yes| D["pgbuf_move_bcb_to_bottom_lru\ndealloc shortcut"]
  C -->|no| E{"blocked_reader_writer?"}
  E -->|yes| W
  E -->|no| F["switch on zone"]
  F --> Z0["VOID -> pgbuf_unlatch_void_zone_bcb"]
  F --> Z1["LRU_1 -> keep or prv->shr"]
  F --> Z2["LRU_2 -> keep, boost if old, or prv->shr"]
  F --> Z3["LRU_3 -> boost or prv->shr or direct-victim"]
  Z0 --> W
  Z1 --> W
  Z2 --> W
  Z3 --> W
  D --> W

Figure 7-1 — Branch structure of pgbuf_unlatch_bcb_upon_unfix. Only the zero-fcnt, no-waiter path reaches the zone switch.
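
The branch structure in Figure 7-1 can be condensed into a pure decision function. This is a sketch under assumptions: the enum and function names are hypothetical, and the real code interleaves these tests with the CAS and the mutex rather than computing an action up front.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum
{
  SK_WAKE_ONLY,        /* no LRU movement: wake waiters, release the mutex */
  SK_MOVE_TO_BOTTOM,   /* dealloc shortcut */
  SK_ZONE_SWITCH       /* per-zone decision of 7.3 */
} sk_unfix_action;

/* fcnt_before is the fix count observed before the CAS decrements it. */
sk_unfix_action sk_unfix_action_for (int fcnt_before, bool waiter_exists, bool move_to_bottom_flag)
{
  assert (fcnt_before > 0);
  bool is_zero_fcnt = (fcnt_before - 1 == 0);

  if (!is_zero_fcnt)
    return SK_WAKE_ONLY;         /* other holders remain: never reorder */
  if (move_to_bottom_flag)
    return SK_MOVE_TO_BOTTOM;    /* checked before the waiter test */
  if (waiter_exists)
    return SK_WAKE_ONLY;         /* a queued waiter re-latches at once */
  return SK_ZONE_SWITCH;
}
```

Note the ordering this makes explicit: the MOVE_TO_LRU_BOTTOM test comes before the waiter test, so a deallocated page is pushed down even when a waiter is queued.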

7.2 The dealloc shortcut and the zone switch

// pgbuf_unlatch_bcb_upon_unfix -- src/storage/page_buffer.c
if (is_zero_fcnt) {
  assert (LSA_ISNULL (&bufptr->oldest_unflush_lsa) || pgbuf_bcb_is_dirty (bufptr));
  if (pgbuf_bcb_should_be_moved_to_bottom_lru (bufptr))      /* <- MOVE_TO_LRU_BOTTOM flag */
    pgbuf_move_bcb_to_bottom_lru (thread_p, bufptr);         /* dealloc shortcut */
  else if (blocked_reader_writer == false) {
    th_lru_idx = PGBUF_THREAD_HAS_PRIVATE_LRU (thread_p)
      ? PGBUF_LRU_INDEX_FROM_PRIVATE (PGBUF_PRIVATE_LRU_FROM_THREAD (thread_p)) : -1;  /* own list or none */
    switch (pgbuf_bcb_get_zone (bufptr)) { /* ... see 7.3 ... */ }
  }
}

pgbuf_bcb_should_be_moved_to_bottom_lru tests the PGBUF_BCB_MOVE_TO_LRU_BOTTOM_FLAG bit, set on the dealloc path: a deallocated page is worthless hot, so it is shoved to the bottom for first reclamation. th_lru_idx (own private list or -1) is the reference point for every private/shared decision below.

Invariant — oldest_unflush_lsa implies the dirty bit. Chapter 6’s WAL invariant at unfix: a page with a pending flush LSA must stay dirty, or the flush daemons (Chapter 8) skip it and break WAL.

7.3 Zone-by-zone branch trace

Two guards recur in every LRU case, quoted once and referenced for all three zones:

// pgbuf_unlatch_bcb_upon_unfix (per-case prologue) -- src/storage/page_buffer.c
if (PGBUF_SHOULD_IGNORE_UNFIX (thread_p, bufptr)) { ...KEEP_VAC stat...; break; }   /* <- don't warm cache */
if (pgbuf_should_move_private_to_shared (thread_p, bufptr, th_lru_idx)) {           /* <- see 7.5 */
  pgbuf_lru_move_from_private_to_shared (thread_p, bufptr); ...PRV_TO_SHR_MID stat...; break;
}

PGBUF_SHOULD_IGNORE_UNFIX is not vacuum-only: its real definition is VACUUM_IS_THREAD_VACUUM_WORKER (th) || pgbuf_is_temporary_volume (buf->vpid.volid) (SERVER_MODE; false otherwise). It fires for vacuum workers and for pages on temporary volumes; neither should warm the cache or promote a BCB to hot (the source comment also names the checkpoint thread as a logical member of this set). pgbuf_should_move_private_to_shared (7.5) escalates contended pages. For zones 1 and 2, only the default action after these guards differs. The LRU_3 case keeps the same guard order (ignore-unfix first, then private-to-shared), but its ignore-unfix branch first tries to hand the BCB out as a direct victim instead of simply breaking (see below).

VOID (Chapter 4): delegated to pgbuf_unlatch_void_zone_bcb (7.4).

LRU_1 (hottest): after the guards, do nothing but register a hit — zone 1 is never reordered:

/* after the per-case prologue, plus a PRV_KEEP/SHR_KEEP stat: */
pgbuf_bcb_register_hit_for_lru (bufptr); break;  /* <- never boost zone 1 */

LRU_2 (boost-eligible): boost only if aged:

if (PGBUF_IS_BCB_OLD_ENOUGH (bufptr, pgbuf_lru_list_from_bcb (bufptr)))
  pgbuf_lru_boost_bcb (thread_p, bufptr);            /* <- aged enough -> promote to top */
else { ...PRV_KEEP / SHR_KEEP stat... }              /* <- too new: leave in place */
pgbuf_bcb_register_hit_for_lru (bufptr); break;

LRU_3 (victim zone): a real unfix always boosts, but its PGBUF_SHOULD_IGNORE_UNFIX branch may instead hand the BCB out as a direct victim:

case PGBUF_LRU_3_ZONE:
  if (PGBUF_SHOULD_IGNORE_UNFIX (...)) {
    if (!pgbuf_bcb_avoid_victim (bufptr) && pgbuf_assign_direct_victim (thread_p, bufptr))
      { ...DIRECT_VACUUM_LRU stat... }               /* <- give it straight to a waiter */
    else { ...THREE_KEEP_VAC stat... }
    break;
  }
  if (pgbuf_should_move_private_to_shared (...)) { ...move; THREE_PRV_TO_SHR_MID...; break; }
  pgbuf_lru_boost_bcb (thread_p, bufptr);            /* <- rule 3: always boost from zone 3 */
  pgbuf_bcb_register_hit_for_lru (bufptr); break;

After the switch the function wakes any latch waiter (pgbuf_wakeup_reader_writer); on a requested async flush (pgbuf_bcb_is_async_flush_request) it uses pgbuf_bcb_safe_flush_force_unlock (which unlocks), else unlocks directly. assert (... != PGBUF_LATCH_FLUSH) guards that unfix never sees a flush latch — flushing is the daemons’ job (Chapter 8).
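
The three LRU cases of 7.3 reduce to a small table, sketched here as a pure function. All names are hypothetical; the booleans stand for the results of PGBUF_SHOULD_IGNORE_UNFIX, pgbuf_should_move_private_to_shared, and PGBUF_IS_BCB_OLD_ENOUGH, which the real code evaluates inline, and statistics counters are omitted.

```c
#include <stdbool.h>

typedef enum { SK_ZONE_1, SK_ZONE_2, SK_ZONE_3 } sk_zone;

typedef enum
{
  SK_KEEP,                  /* ignore-unfix: leave in place, no hit registered */
  SK_TRY_DIRECT_VICTIM,     /* zone 3 + ignore-unfix: offer to a waiter, else keep */
  SK_PRIVATE_TO_SHARED,     /* migrate to a shared list */
  SK_KEEP_AND_HIT,          /* stay in place, register a hit */
  SK_BOOST_AND_HIT          /* move to top of zone 1, register a hit */
} sk_zone_action;

sk_zone_action sk_zone_action_for (sk_zone zone, bool ignore_unfix, bool move_to_shared, bool old_enough)
{
  if (ignore_unfix)                       /* guard 1, same position in every case */
    return zone == SK_ZONE_3 ? SK_TRY_DIRECT_VICTIM : SK_KEEP;
  if (move_to_shared)                     /* guard 2 */
    return SK_PRIVATE_TO_SHARED;

  switch (zone)
    {
    case SK_ZONE_1:
      return SK_KEEP_AND_HIT;             /* zone 1 is never reordered */
    case SK_ZONE_2:
      return old_enough ? SK_BOOST_AND_HIT : SK_KEEP_AND_HIT;
    case SK_ZONE_3:
      return SK_BOOST_AND_HIT;            /* a real unfix always boosts */
    }
  return SK_KEEP;                         /* unreachable for valid zones */
}
```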

7.4 VOID-zone landing — `pgbuf_unlatch_void_zone_bcb` and the Aout hit

A VOID BCB was just claimed for a non-resident page. pgbuf_unlatch_void_zone_bcb first removes the page’s VPID from Aout (a ghost found there means this is a re-fix, counted as an Aout hit), then branches on private-list ownership and Aout membership:

// pgbuf_unlatch_void_zone_bcb -- src/storage/page_buffer.c
if (pgbuf_Pool.buf_AOUT_list.max_count > 0) { aout_enabled = true;
  aout_list_id = pgbuf_remove_vpid_from_aout_list (thread_p, &bcb->vpid); }  /* <- 2Q lookup+remove */
if (PGBUF_VACUUM_SHOULD_IGNORE_UNFIX (thread_p)) {                          /* vacuum worker only here */
  if (!pgbuf_bcb_avoid_victim (bcb) && pgbuf_assign_direct_victim (thread_p, bcb)) {
    // ... if Aout on: pgbuf_add_vpid_to_aout_list (..., aout_list_id) ... <- re-ghost
    return; }
  aout_list_id = PGBUF_AOUT_NOT_FOUND;               /* <- vacuum never gets Aout-boost */
}
if (thread_private_lru_index != -1) {
  if (PGBUF_VACUUM_SHOULD_IGNORE_UNFIX (thread_p)) {                      /* <- vacuum: top, no hit */
    pgbuf_lru_add_new_bcb_to_top (thread_p, bcb, thread_private_lru_index); return; }
  if (!aout_enabled || thread_private_lru_index == aout_list_id) {        /* <- Aout HIT -> top of LRU 1 */
    pgbuf_lru_add_new_bcb_to_top (thread_p, bcb, thread_private_lru_index);
    pgbuf_bcb_register_hit_for_lru (bcb); return; }
  if (aout_list_id == PGBUF_AOUT_NOT_FOUND) {                             /* <- cold miss -> middle */
    pgbuf_lru_add_new_bcb_to_middle (thread_p, bcb, thread_private_lru_index);
    pgbuf_bcb_register_hit_for_lru (bcb); return; }
  /* fall through: ghosted in a *different* private list -> shared */
}
pgbuf_lru_add_new_bcb_to_middle (thread_p, bcb, pgbuf_get_shared_lru_index_for_add ());  /* <- shared middle */
if (!PGBUF_VACUUM_SHOULD_IGNORE_UNFIX (thread_p)) pgbuf_bcb_register_hit_for_lru (bcb);

Note this branch gates on PGBUF_VACUUM_SHOULD_IGNORE_UNFIX (vacuum worker only) — the temp-volume arm of PGBUF_SHOULD_IGNORE_UNFIX from 7.3 is not applied in the VOID path. Placement (private-LRU thread, Aout enabled, non-vacuum):

Aout result	Placement	Meaning
`aout_list_id == thread_private_lru_index`	top of own private list (LRU 1)	evicted from my list: 2Q second touch, promote hot
`!aout_enabled`	top of own private list	no history; first unfix treated as warm
`PGBUF_AOUT_NOT_FOUND`	middle of own private list	never seen — cold, lands at the LRU-1/2 boundary
different list	middle of shared list	shared across workers; quotas (Ch 10) apply
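
The placement table can be restated as a pure function for the non-vacuum case. This is a sketch: the `sk_` names are hypothetical, -1 means "no private list" as in 7.2, and -2 mirrors PGBUF_AOUT_NOT_FOUND.

```c
#include <stdbool.h>

#define SK_AOUT_NOT_FOUND (-2)

typedef enum { SK_PRIVATE_TOP, SK_PRIVATE_MIDDLE, SK_SHARED_MIDDLE } sk_placement;

/* Non-vacuum path only: where a freshly claimed VOID BCB lands on unfix. */
sk_placement sk_void_placement (bool aout_enabled, int thread_private_lru_index, int aout_list_id)
{
  if (thread_private_lru_index != -1)
    {
      if (!aout_enabled || thread_private_lru_index == aout_list_id)
        return SK_PRIVATE_TOP;            /* no history kept, or ghost from my own list */
      if (aout_list_id == SK_AOUT_NOT_FOUND)
        return SK_PRIVATE_MIDDLE;         /* cold miss */
      /* ghost belongs to a different list: fall through to shared */
    }
  return SK_SHARED_MIDDLE;
}
```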

Invariant — Aout removal precedes placement. aout_list_id is captured once before any insertion and drives the whole branch. If remove came after placement, two threads re-fixing the same page could both “hit” and double-promote; Aout_mutex (7.6) serializes lookup-and-remove so exactly one thread consumes the ghost.

7.5 `pgbuf_should_move_private_to_shared` — the migration test

// pgbuf_should_move_private_to_shared -- src/storage/page_buffer.c
int bcb_lru_idx = pgbuf_bcb_get_lru_index (bcb);
if (PGBUF_IS_SHARED_LRU_INDEX (bcb_lru_idx)) return false;     /* <- already shared */
if (thread_private_lru_index != bcb_lru_idx) return true;      /* cond 1: foreign-thread unfix */
if (!pgbuf_bcb_is_hot (bcb)) return false;                     /* cond 2a: must be hot */
if (!PGBUF_IS_BCB_OLD_ENOUGH (bcb, PGBUF_GET_LRU_LIST (bcb_lru_idx))) return false;  /* cond 2b: and old */
return true;                                                   /* hot + aged -> escalate to shared */

Two triggers: (1) foreign unfix — the BCB lives in private list X but the unfixer’s own list is Y (or -1); a page touched by >1 worker goes shared. (2) hot and old — same list, but both hot (pgbuf_bcb_is_hot) and old enough.

// pgbuf_bcb_is_hot / pgbuf_bcb_register_fix -- src/storage/page_buffer.c
// hot: count_fix_and_avoid_dealloc >= (PGBUF_FIX_COUNT_THRESHOLD << PGBUF_BCB_COUNT_FIX_SHIFT_BITS)
//      == 64 << 16  (fix count lives in the high 16 bits)
// register_fix saturates: stops incrementing once the threshold bit is set.

count_fix_and_avoid_dealloc packs the fix count (high 16 bits, bumped by pgbuf_bcb_register_fix and saturating at the 64-fix threshold) and the avoid-dealloc count (low 16 bits, PGBUF_BCB_AVOID_DEALLOC_MASK); see Chapter 6.

// PGBUF_IS_BCB_OLD_ENOUGH -- src/storage/page_buffer.c
#define PGBUF_IS_BCB_OLD_ENOUGH(bcb, lru_list) \
  (PGBUF_AGE_DIFF ((bcb)->tick_lru_list, (lru_list)->tick_list) >= ((lru_list)->count_lru2 / 2))

A BCB stamps tick_lru_list from tick_list on insert; tick_list bumps on every add-to-top/middle. “Old enough” = passed by at least half of zone 2’s worth (count_lru2 / 2) of newer inserts — so a page fixed twice in quick succession is not boosted on the second unfix. PGBUF_AGE_DIFF handles the wraparound of the 31-bit tick.
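
The body of PGBUF_AGE_DIFF is not quoted in this section, so the following is only one plausible wraparound-aware difference, assuming ticks count upward in [0, INT32_MAX) and reset to 0. The `sk_` names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Age of a BCB relative to its list clock, tolerating one wraparound.
 * Assumption: ticks live in [0, INT32_MAX) and wrap back to 0. */
int32_t sk_age_diff (int32_t bcb_tick, int32_t list_tick)
{
  return list_tick >= bcb_tick
    ? list_tick - bcb_tick                       /* normal case */
    : INT32_MAX - (bcb_tick - list_tick);        /* list clock wrapped past the stamp */
}

/* "Old enough": at least half of zone 2's population was inserted after this BCB. */
bool sk_is_old_enough (int32_t bcb_tick, int32_t list_tick, int count_lru2)
{
  return sk_age_diff (bcb_tick, list_tick) >= count_lru2 / 2;
}
```

With count_lru2 = 10, a BCB stamped at tick 10 passes the test once the list clock reaches 15, and a BCB stamped just before the wrap still ages correctly across it.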

7.6 The Aout 2Q ghost list — `pgbuf_aout_buf` and `pgbuf_aout_list`

Aout holds VPIDs only (not BCBs) for recently victimized pages — a FIFO fronted by per-shard hash tables for O(1) lookup.

// struct pgbuf_aout_buf -- src/storage/page_buffer.c
struct pgbuf_aout_buf {
  VPID vpid;            /* page VPID */
  int lru_idx;          /* which LRU list it was evicted from */
  PGBUF_AOUT_BUF *next; /* next element in list */
  PGBUF_AOUT_BUF *prev; /* prev element in list */
};

Field	Role
`vpid`	ghosted identity / hash key; `VPID_SET_NULL` marks a free node
`lru_idx`	LRU list it was evicted from — re-fix re-enters the same list (7.4)
`next` / `prev`	FIFO links, doubling as the free-list link when recycled; `prev` gives O(1) middle unlink on a hit

// struct pgbuf_aout_list -- src/storage/page_buffer.c
struct pgbuf_aout_list {
  pthread_mutex_t Aout_mutex;   /* integrity of the whole list (SERVER_MODE) */
  PGBUF_AOUT_BUF *Aout_top;     /* top of the queue (most recent) */
  PGBUF_AOUT_BUF *Aout_bottom;  /* bottom of the queue (oldest) */
  PGBUF_AOUT_BUF *Aout_free;    /* free list of recycled nodes */
  PGBUF_AOUT_BUF *bufarray;     /* preallocated node array */
  int num_hashes;               /* number of hash shards */
  MHT_TABLE **aout_buf_ht;      /* per-shard VPID -> node hash */
  int max_count;                /* capacity; <= 0 disables Aout */
};

Field	Role
`Aout_mutex`	global list+hash lock; serializes the 7.4 lookup-remove
`Aout_top` / `Aout_bottom`	newest / oldest ghost — insertion vs eviction points
`Aout_free`	recycled nodes; avoids malloc on the victim path
`bufarray`	preallocated backing storage; no runtime alloc
`num_hashes` / `aout_buf_ht`	shard count and per-shard `MHT_TABLE*` for O(1) VPID lookup over the FIFO (`AOUT_HASH_IDX`)
`max_count`	capacity / enable switch; `<= 0` disables Aout entirely

The LRU list struct these BCBs move within (pgbuf_lru_list) carries the boundary pointers and tick clocks 7.5/7.7 lean on:

Field	Role	Used by
`top` / `bottom`	list endpoints	add-to-top, add-to-bottom
`bottom_1`	last BCB of zone 1	add-to-middle inserts after it
`bottom_2`	last BCB of zone 2	repaired on removal (zone-2 care, 7.7)
`victim_hint`	where victim search starts	advanced on every remove
`count_lru1/2/3`	per-zone populations	`count_lru2/2` = old-enough threshold
`threshold_lru1/2`	zone-size targets	drive `pgbuf_lru_adjust_zone*`
`tick_list`	bumped on add-to-top/middle	boost-age clock (`PGBUF_IS_BCB_OLD_ENOUGH`)
`tick_lru3`	bumped on fall-to-zone-3	victim-hint ordering
`index`	list id	private vs shared classification

flowchart LR
  subgraph AOUT["pgbuf_aout_list (FIFO + hash)"]
    T["Aout_top\n(newest)"] --> N1["node"] --> N2["node"] --> B["Aout_bottom\n(oldest)"]
    HT["aout_buf_ht[shard]\nVPID -> node"] -.-> N1
    FR["Aout_free\n(recycled)"]
  end
  VICT["victimization\npgbuf_add_vpid_to_aout_list"] --> T
  B -->|"full -> drop oldest"| FR
  REFIX["re-fix\npgbuf_remove_vpid_from_aout_list"] -.->|hit| FR

Figure 7-2 — Aout as a fixed-capacity FIFO with a hash index.

pgbuf_add_vpid_to_aout_list (called from the direct-victim branches of 7.4 and of pgbuf_lru_fall_bcb_to_zone_3) runs under Aout_mutex: if Aout_free is empty it recycles Aout_bottom, removing the oldest ghost from its hash (mht_rem); otherwise it pops a free node. It then stamps lru_idx and vpid, inserts the node into the hash (mht_put), and links it at Aout_top.

pgbuf_remove_vpid_from_aout_list starts with mht_get. If the VPID is absent it returns PGBUF_AOUT_NOT_FOUND (-2): the page has no ghost, so this is a true cold fault. If present it captures aout_list_id = aout_buf->lru_idx, unlinks the node, mht_rems it, nulls the VPID, resets lru_idx, pushes the node onto Aout_free, and returns the captured aout_list_id, the value 7.4 compares against thread_private_lru_index.
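
The add/remove pair can be sketched as a fixed-capacity ghost FIFO. This is a simplified model, not the CUBRID code: all names are hypothetical, a VPID is reduced to an int, the hash index is replaced by a linear scan, and Aout_mutex is omitted (single-threaded).

```c
#include <assert.h>
#include <stddef.h>

#define SK_AOUT_NOT_FOUND (-2)
#define SK_AOUT_CAP       3          /* hypothetical tiny capacity */

typedef struct sk_node { int vpid; int lru_idx; struct sk_node *next, *prev; } sk_node;
typedef struct { sk_node *top, *bottom, *free_list; sk_node nodes[SK_AOUT_CAP]; } sk_aout;

void sk_aout_init (sk_aout *a)
{
  a->top = a->bottom = a->free_list = NULL;
  for (int i = 0; i < SK_AOUT_CAP; i++)
    {                                /* thread every preallocated node onto the free list */
      a->nodes[i].vpid = -1;
      a->nodes[i].lru_idx = SK_AOUT_NOT_FOUND;
      a->nodes[i].prev = NULL;
      a->nodes[i].next = a->free_list;
      a->free_list = &a->nodes[i];
    }
}

static void sk_unlink (sk_aout *a, sk_node *n)
{
  if (n->prev) n->prev->next = n->next; else a->top = n->next;
  if (n->next) n->next->prev = n->prev; else a->bottom = n->prev;
  n->next = n->prev = NULL;
}

/* Ghost a victimized page: recycle the oldest node when no free node is left. */
void sk_aout_add (sk_aout *a, int vpid, int lru_idx)
{
  sk_node *n;
  if (a->free_list == NULL)
    { n = a->bottom; assert (n != NULL); sk_unlink (a, n); }
  else
    { n = a->free_list; a->free_list = n->next; }
  n->vpid = vpid; n->lru_idx = lru_idx;
  n->prev = NULL; n->next = a->top;
  if (a->top) a->top->prev = n; else a->bottom = n;
  a->top = n;
}

/* Lookup-and-remove: returns the ghost's list id, or SK_AOUT_NOT_FOUND. */
int sk_aout_remove (sk_aout *a, int vpid)
{
  for (sk_node *n = a->top; n != NULL; n = n->next)
    if (n->vpid == vpid)
      {
        int found = n->lru_idx;      /* capture before the node is reset */
        sk_unlink (a, n);
        n->vpid = -1; n->lru_idx = SK_AOUT_NOT_FOUND;
        n->next = a->free_list; a->free_list = n;
        return found;
      }
  return SK_AOUT_NOT_FOUND;
}

/* Scenario helper: ghost pages 1..4 (lists 11..14) into 3 slots, then look one up. */
int sk_aout_demo (int lookup_vpid)
{
  sk_aout a;
  sk_aout_init (&a);
  for (int v = 1; v <= 4; v++)
    sk_aout_add (&a, v, 10 + v);
  return sk_aout_remove (&a, lookup_vpid);
}
```

In the scenario helper, page 1 is pushed out by page 4, so looking it up reports not-found, while pages 2 to 4 still return the list they were evicted from.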

7.7 LRU insertion primitives and the boost

Zone-2 and zone-3 boosts route through pgbuf_lru_boost_bcb:

// pgbuf_lru_boost_bcb -- src/storage/page_buffer.c
assert (zone != PGBUF_LRU_1_ZONE);                   /* <- never called on zone 1 */
pthread_mutex_lock (&lru_list->mutex);
pgbuf_remove_from_lru_list (thread_p, bcb, lru_list);/* unlink */
pgbuf_lru_add_bcb_to_top (thread_p, bcb, lru_list);  /* relink at top of zone 1 */
if (zone == PGBUF_LRU_2_ZONE) pgbuf_lru_adjust_zone1 (thread_p, lru_list, true);   /* only zone 1 grew */
else                          pgbuf_lru_adjust_zones (thread_p, lru_list, true);   /* zone 3: rebalance all */
pthread_mutex_unlock (&lru_list->mutex);

pgbuf_lru_add_bcb_to_top patches the links, sets top (and bottom/bottom_1 if empty), increments tick_list (the clock that ages every other BCB for PGBUF_IS_BCB_OLD_ENOUGH), then change_zone marks it PGBUF_LRU_1_ZONE. pgbuf_lru_add_bcb_to_middle inserts after bottom_1 (the zone-1 bottom), also bumps tick_list, and marks zone 2; pgbuf_lru_add_bcb_to_bottom appends at bottom, stamps tick_lru3, and marks zone 3.

pgbuf_remove_from_lru_list is the inverse and repairs every boundary pointer before moving the BCB to VOID:

// pgbuf_remove_from_lru_list -- src/storage/page_buffer.c
if (lru_list->top == bufptr)      lru_list->top = bufptr->next_BCB;
if (lru_list->bottom == bufptr)   lru_list->bottom = bufptr->prev_BCB;
if (lru_list->bottom_1 == bufptr) lru_list->bottom_1 = bufptr->prev_BCB;
if (lru_list->bottom_2 == bufptr) {                  /* <- zone-2 boundary needs care */
  if (bufptr->prev_BCB != NULL && pgbuf_bcb_get_zone (bufptr->prev_BCB) == PGBUF_LRU_2_ZONE)
    lru_list->bottom_2 = bufptr->prev_BCB;
  else { assert (lru_list->count_lru2 == 1); lru_list->bottom_2 = NULL; }
}
/* splice neighbors, null this bcb's links */
pgbuf_lru_advance_victim_hint (thread_p, lru_list, bufptr, bcb_prev, false);
pgbuf_bcb_change_zone (thread_p, bufptr, 0, PGBUF_VOID_ZONE);  /* <- now belongs to no zone */

Invariant: a removed BCB’s zone matches its links, and the victim hint never rests on an unlinked node. The function ends with change_zone(..., PGBUF_VOID_ZONE) so a BCB unlinked from a list never keeps an LRU zone tag. Just before that, once the splice has nulled the BCB’s own links, it calls pgbuf_lru_advance_victim_hint with the predecessor saved in bcb_prev; if the hint were left on the removed BCB, a victimizer could start from a node whose prev_BCB no longer leads back into the list. Boost = remove + add-to-top leaves the BCB momentarily in VOID, but the whole sequence runs under lru_list->mutex, so no thread observes the gap.

pgbuf_lru_fall_bcb_to_zone_3 is the demotion counterpart, run by the zone-adjust functions when zones 1/2 exceed their thresholds:

// pgbuf_lru_fall_bcb_to_zone_3 -- src/storage/page_buffer.c
assert (pgbuf_bcb_get_zone (bcb) == PGBUF_LRU_1_ZONE || pgbuf_bcb_get_zone (bcb) == PGBUF_LRU_2_ZONE);
#if defined (SERVER_MODE)
if (pgbuf_is_bcb_victimizable (bcb, false) && pgbuf_is_any_thread_waiting_for_direct_victim ()) {
  if (pgbuf_bcb_is_to_vacuum (bcb)) { /* ...stat; fall through... */ }
  else if (PGBUF_BCB_TRYLOCK (bcb) == 0) {           /* <- conditional: avoid list/bcb lock-order deadlock */
    VPID vpid_copy = bcb->vpid;
    if (pgbuf_is_bcb_victimizable (bcb, true) && pgbuf_assign_direct_victim (thread_p, bcb)) {
      pgbuf_remove_from_lru_list (thread_p, bcb, lru_list); PGBUF_BCB_UNLOCK (bcb);
      pgbuf_add_vpid_to_aout_list (thread_p, &vpid_copy, lru_list->index);  /* <- ghost on the way out */
      return; }
    PGBUF_BCB_UNLOCK (bcb);                           /* not assigned; fall through */
  } }
#endif
bcb->tick_lru3 = lru_list->tick_lru3;                /* stamp zone-3 position */
if (++lru_list->tick_lru3 >= DB_INT32_MAX) lru_list->tick_lru3 = 0;
pgbuf_bcb_change_zone (thread_p, bcb, lru_list->index, PGBUF_LRU_3_ZONE);

PGBUF_BCB_TRYLOCK is conditional because lock order is normally bcb-then-list but we already hold the list mutex; rather than deadlock it gives up and lets the BCB be victimized later (Chapter 9). The direct-victim branch ghosts the VPID into Aout on the way out, closing the 2Q loop with 7.4’s lookup. tick_lru3 (small at the bottom) feeds the victim hint, distinct from tick_lru_list which feeds the boost age.

7.8 Chapter summary — key takeaways

LRU movement is gated on is_zero_fcnt && !blocked_reader_writer. Shared-read unfixes take pgbuf_lockfree_unfix_ro and never touch the list; only the last unfixer with no waiter reorders the BCB.
MOVE_TO_LRU_BOTTOM is a dealloc shortcut that bypasses the zone switch and shoves a deallocated page to the bottom for fast reclamation.
Zone sets the default action: zone 1 never boosts; zone 2 boosts only when PGBUF_IS_BCB_OLD_ENOUGH (past half of count_lru2); zone 3 always boosts on a real unfix, or hands out a direct victim under the ignore-unfix branch.
PGBUF_SHOULD_IGNORE_UNFIX is vacuum-OR-temp-volume, not vacuum-only: pages on temporary volumes are also kept from warming the cache, and the VOID path narrows this to vacuum workers (PGBUF_VACUUM_SHOULD_IGNORE_UNFIX).
Boost = pgbuf_remove_from_lru_list + pgbuf_lru_add_bcb_to_top under the list mutex; add-to-top bumps tick_list, the clock that ages every other BCB for the old-enough test. A removed BCB always lands in VOID, keeping the zone field consistent with the list it is linked in.
pgbuf_should_move_private_to_shared fires on two triggers: a foreign-thread unfix (immediate), or a same-list page that is both hot (>= 64 packed, saturating fixes) and old enough — escalating contended pages to the shared pool.
Aout is a fixed-capacity VPID-only ghost FIFO with a hash index, and its lookup-and-remove is serialized under Aout_mutex and precedes placement, so one thread consumes a ghost and no page is double-promoted: a re-fix found for the same private list lands at the top of LRU 1, a different list goes shared, not-found lands cold in the middle; victimization re-ghosts the outgoing VPID to keep the loop closed.

Chapter 8: Flushing Under the WAL Rule and the Flush Daemons

A dirty page may not reach disk until the log record for its most recent change is durable. This chapter traces how the page buffer enforces that log-before-page ordering inside pgbuf_bcb_flush_with_wal, how the three flush daemons pace and batch their writes, and where the double-write buffer (DWB) intercepts the write. For the why of WAL and the high-level picture, see the companion cubrid-page-buffer-manager.md (“Write-Ahead Logging”, “Flushing and the daemons”). The DWB’s block geometry and crash-recovery rationale live in cubrid-double-write-buffer.md; the durability semantics of logpb_flush_log_for_wal and the flushed-LSA bookkeeping live in cubrid-log-manager-detail.md — both are referenced, not re-derived, here. The flushing/dirty flags and the atomic-latch FLUSH mode come from Chapters 6 and 5; this chapter shows how the flush path consumes them.

8.1 Single-page entry points

Three public entry points (pgbuf_flush, pgbuf_flush_with_wal, pgbuf_flush_if_requested) push one page toward disk; all funnel into pgbuf_bcb_safe_flush_internal, which decides whether the flush happens now, is delegated, or is awaited.

flowchart TB
  F["pgbuf_flush\n(optionally unfix after)"] --> FW["pgbuf_flush_with_wal"]
  FIR["pgbuf_flush_if_requested\n(permanently write-latched page)"] -->|ASYNC_FLUSH_REQ set| SFFU
  FW --> SFFU["pgbuf_bcb_safe_flush_force_unlock"]
  SFFU --> SFI["pgbuf_bcb_safe_flush_internal"]
  SFI -->|immediate_flush| BFW["pgbuf_bcb_flush_with_wal"]
  SFI -->|page write-latched by other| REQ["set ASYNC_FLUSH_REQ\nlet holder flush on unfix"]
  SFI -->|already flushing / latched, synchronous| BLK["pgbuf_block_bcb\nPGBUF_LATCH_FLUSH wait"]

Figure 8-1 — Single-page flush entry points converging on pgbuf_bcb_safe_flush_internal.

pgbuf_flush_with_wal is the canonical caller — it asserts a READ+ latch held by the calling thread, locks the BCB mutex, and delegates synchronously:

// pgbuf_flush_with_wal -- src/storage/page_buffer.c
CAST_PGPTR_TO_BFPTR (bufptr, pgptr);
/* In CUBRID, the caller is holding WRITE page latch */
assert (get_latch (&bufptr->atomic_latch) >= PGBUF_LATCH_READ
        && pgbuf_find_thrd_holder (thread_p, bufptr) != NULL);
PGBUF_BCB_LOCK (bufptr);
if (pgbuf_bcb_safe_flush_force_unlock (thread_p, bufptr, true) != NO_ERROR) /* <- synchronous=true */
  { ASSERT_ERROR (); return NULL; }

pgbuf_flush wraps this and unfixes afterward when free_page == FREE; its header warns it does not guarantee the page reached disk, so callers needing durability use pgbuf_flush_with_wal and check the return. pgbuf_flush_if_requested serves a thread holding a page permanently write-latched (it can never unfix to trigger a normal flush): it asserts a WRITE latch held by the caller, checks pgbuf_bcb_is_async_flush_request (bcb), and only when set locks and flushes with synchronous=false — the consumer side of the PGBUF_BCB_ASYNC_FLUSH_REQ flag the daemon/checkpoint sets on a write-latched victim.

8.2 The decision core: `pgbuf_bcb_safe_flush_internal`

The caller holds the BCB mutex. The function short-circuits clean pages, then runs a CAS loop choosing among the outcomes below. A flush cannot happen immediately for exactly two reasons, both spelled out in the source: the page is write-latched by another thread (its contents could change mid-write), or another thread is already flushing it (two writers could reorder an old version over a new one).

// pgbuf_bcb_safe_flush_internal -- src/storage/page_buffer.c
if (!pgbuf_bcb_is_dirty (bufptr))
  return NO_ERROR;                       /* <- clean: nothing to do, stays locked */
do {
  immediate_flush = false; block = false; is_flushing = false;
  impl = get_impl (&bufptr->atomic_latch); impl_new = impl;
  is_flushing = pgbuf_bcb_is_flushing (bufptr);
  if (!is_flushing
      && (impl.impl.latch_mode == PGBUF_NO_LATCH || impl.impl.latch_mode == PGBUF_LATCH_READ
          || (impl.impl.latch_mode == PGBUF_LATCH_WRITE
              && pgbuf_find_thrd_holder (thread_p, bufptr) != NULL)))  /* <- I am the writer */
    immediate_flush = true;
  else {
    assert (is_flushing || impl.impl.latch_mode == PGBUF_LATCH_WRITE);  /* <- only these reach else */
    if (synchronous)
      { block = true; impl_new.impl.waiter_exists = true; }  /* <- publish waiter into latch word */
  }
} while (!bufptr->atomic_latch.compare_exchange_strong (impl.raw, impl_new.raw, ...));

Outcome	Condition	Action
`immediate_flush`	not flushing; unlatched, read-latched, or write-latched by me	`pgbuf_bcb_flush_with_wal (..., false, locked)` — flush now
async request	not flushing, write-latched by another	`pgbuf_bcb_update_flags (..., PGBUF_BCB_ASYNC_FLUSH_REQ, 0)` — holder flushes on unfix
block	flushing or foreign write-latch, `synchronous==true`	`*locked=false; pgbuf_block_bcb (..., PGBUF_LATCH_FLUSH, ...)` — sleep
no-wait return	foreign latch/flush, `synchronous==false`	return `NO_ERROR` without flushing

Note the async-request flag is set whenever the immediate path was not taken and the BCB is not already flushing (i.e. a foreign write-latch), regardless of synchronous — the synchronous caller then also blocks.
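
The outcome table and the note above can be folded into one pure function. This is a sketch with hypothetical names; the real function derives these values inside a CAS loop on the latch word and acts on them immediately.

```c
#include <stdbool.h>

typedef enum { SK_NO_LATCH, SK_LATCH_READ, SK_LATCH_WRITE } sk_latch;

typedef struct
{
  bool immediate_flush;   /* flush now via the WAL path */
  bool request_async;     /* set ASYNC_FLUSH_REQ for the foreign writer */
  bool block;             /* sleep as a PGBUF_LATCH_FLUSH waiter */
} sk_flush_decision;

sk_flush_decision sk_flush_decide (bool is_flushing, sk_latch mode, bool caller_holds, bool synchronous)
{
  sk_flush_decision d = { false, false, false };

  if (!is_flushing
      && (mode == SK_NO_LATCH || mode == SK_LATCH_READ
          || (mode == SK_LATCH_WRITE && caller_holds)))
    {
      d.immediate_flush = true;
      return d;
    }
  /* Only two situations reach here: someone else is flushing, or a foreign write latch. */
  d.request_async = !is_flushing;   /* independent of synchronous */
  d.block = synchronous;            /* asynchronous callers return without waiting */
  return d;
}
```

A synchronous caller facing a foreign write latch therefore gets both `request_async` and `block`, while a caller that finds the page already flushing only blocks or returns.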

Invariant: at most one flusher per BCB. A thread may start a flush only if pgbuf_bcb_is_flushing is false at the moment it sets the FLUSHING flag (inside pgbuf_bcb_flush_with_wal). A second thread sees is_flushing == true, cannot take immediate_flush, and either blocks (PGBUF_LATCH_FLUSH) or returns. If the invariant were violated, two fileio_write calls could race and land an older image after a newer one, corrupting the page. The force_unlock/force_lock wrappers only normalize the locked out-parameter, since the internal function may drop the mutex when it blocks.

8.3 `pgbuf_bcb_flush_with_wal` — the durable write

The heart of the chapter. The caller holds the mutex; the function copies the page, enforces WAL, writes through the DWB or directly, and on success clears FLUSHING and wakes waiters; on failure it reverts DIRTY and oldest_unflush_lsa.

flowchart TB
  A["mark_is_flushing\nset FLUSHING, clear DIRTY"] --> C["copy_unflushed_lsa\nsave lsa+oldest_unflush\nNULL oldest_unflush, UNLOCK"]
  C --> D{oldest_unflush_lsa\nnon-null?}
  D -->|yes| E["logpb_flush_log_for_wal"]
  D -->|no| F["debug: changed not logged"]
  E --> G{uses_dwb?}
  F --> G
  G -->|yes| H["dwb_add_page"]
  G -->|no| I["fileio_write"]
  H --> J{error?}
  I --> J
  J -->|fail| K["mark_was_not_flushed\nrestore DIRTY+lsa, wake, ER_FAILED"]
  J -->|ok, flush thread + waiter| L["queue to flushed_bcbs\nwake post-flush daemon"]
  J -->|ok| M["mark_was_flushed\nclear FLUSHING, wake"]

Figure 8-2 — pgbuf_bcb_flush_with_wal branch map.

Step 1 — claim the flush, clear DIRTY. was_dirty = pgbuf_bcb_mark_is_flushing (...) sets FLUSHING_TO_DISK and atomically clears DIRTY and ASYNC_FLUSH_REQ:

// pgbuf_bcb_mark_is_flushing -- src/storage/page_buffer.c
if (pgbuf_bcb_is_dirty (bcb)) {
  pgbuf_bcb_update_flags (thread_p, bcb, PGBUF_BCB_FLUSHING_TO_DISK_FLAG,
                          PGBUF_BCB_DIRTY_FLAG | PGBUF_BCB_ASYNC_FLUSH_REQ);  /* <- set | clear */
  return true;
}

Invariant — DIRTY clears at flush start, not end. A concurrent re-dirty during the long copy-to-write window re-sets DIRTY and is not lost; the page just flushes again. On write failure, pgbuf_bcb_mark_was_not_flushed re-sets DIRTY (when was_dirty).
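
Why clearing DIRTY at flush start is safe can be shown with a toy flags word. This is a sketch under assumptions: the `sk_` names are hypothetical, the bit positions are illustrative only, and `sk_update_flags` is a stripped-down set/clear CAS without the counter reconciliation that pgbuf_bcb_update_flags performs.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative bit positions only. */
#define SK_DIRTY      0x80000000u
#define SK_FLUSHING   0x40000000u
#define SK_ASYNC_REQ  0x02000000u

typedef struct { _Atomic uint32_t flags; } sk_bcb;

/* Atomically set some bits and clear others; returns the previous word. */
uint32_t sk_update_flags (sk_bcb *bcb, uint32_t set, uint32_t clear)
{
  uint32_t old = atomic_load (&bcb->flags);
  uint32_t updated;
  do
    updated = (old | set) & ~clear;
  while (!atomic_compare_exchange_weak (&bcb->flags, &old, updated));
  return old;
}

/* Scenario helper: one flush cycle, optionally re-dirtied while the write is in flight.
 * Returns the flags word after the flush completes. */
uint32_t sk_flush_cycle (bool redirty_during_write)
{
  sk_bcb b;
  atomic_init (&b.flags, SK_DIRTY | SK_ASYNC_REQ);

  sk_update_flags (&b, SK_FLUSHING, SK_DIRTY | SK_ASYNC_REQ);   /* flush start */
  if (redirty_during_write)
    sk_update_flags (&b, SK_DIRTY, 0);                          /* concurrent writer */
  sk_update_flags (&b, 0, SK_FLUSHING);                         /* flush end */

  return atomic_load (&b.flags);
}
```

If DIRTY were cleared at flush end instead, the last step would wipe out the concurrent writer's bit; with the clear at the start, the re-dirty survives and the page is simply flushed again later.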

Step 2 — copy the image. At start_copy_page the page is copied into a stack iopage (via tde_encrypt_data_page if TDE-encrypted, else a memcpy of IO_PAGESIZE). If uses_dwb, the copy is staged into a DWB slot by dwb_set_data_on_next_slot; on a granted slot the local iopage is nulled and control jumps to copy_unflushed_lsa.

Step 3 — WAL enforcement. At copy_unflushed_lsa it saves the page LSA and oldest_unflush_lsa, NULLs bufptr->oldest_unflush_lsa, drops the mutex, and — if the saved oldest_unflush_lsa is non-null — forces the log:

// pgbuf_bcb_flush_with_wal -- src/storage/page_buffer.c
LSA_COPY (&lsa, &(bufptr->iopage_buffer->iopage.prv.lsa));
LSA_COPY (&oldest_unflush_lsa, &bufptr->oldest_unflush_lsa);
LSA_SET_NULL (&bufptr->oldest_unflush_lsa);
PGBUF_BCB_UNLOCK (bufptr); *is_bcb_locked = false;
if (!LSA_ISNULL (&oldest_unflush_lsa))
  logpb_flush_log_for_wal (thread_p, &lsa);   /* <- log-before-page: force log up to page LSA */

WAL INVARIANT — log up to the page LSA is durable before the write. The page is never handed to fileio_write/dwb_add_page until the log tail is forced through lsa (the page’s own prv.lsa). Enforcement is structural: logpb_flush_log_for_wal (thread_p, &lsa) sits between the mutex drop and the write, and lsa is read before the mutex drops so it cannot be re-stamped underneath. The trigger gate is oldest_unflush_lsa != NULL, but the force targets lsa (the newest change), not oldest_unflush_lsa. What breaks if skipped: a crash after the page write but before the log write leaves a persisted change whose redo/undo never reached disk — recovery cannot reconstruct or roll it back, so the page is silently corrupt. See cubrid-log-manager-detail.md for how logpb_flush_log_for_wal guarantees durability to a given LSA. The else branch (null oldest_unflush_lsa) is the rare “changed but not logged” case (temporary volumes) and only emits a debug note.

Null-ing bufptr->oldest_unflush_lsa lets a re-dirty during the write window re-stamp a fresh value that a later flush re-forces; on write failure the saved value is restored (Step 5a).
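The ordering in Step 3 can be modeled in a few lines. This is a minimal sketch, not the CUBRID code: `toy_lsa`, `toy_log`, `toy_page`, `toy_log_force`, and `toy_flush_page` are hypothetical names, a single integer stands in for the real LSA pair, and the mutex is omitted.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model: an LSA is one increasing integer; -1 means "null". */
typedef long toy_lsa;
#define TOY_LSA_NULL (-1L)

typedef struct
{
  toy_lsa flushed_upto;        /* durable log tail */
} toy_log;

typedef struct
{
  toy_lsa lsa;                 /* newest change on the page (like prv.lsa) */
  toy_lsa oldest_unflush_lsa;  /* first change since the last flush */
  bool written;                /* set once the toy "disk write" happened */
} toy_page;

/* Make the log durable up to `upto` (no-op if it already is). */
static void
toy_log_force (toy_log *log, toy_lsa upto)
{
  if (log->flushed_upto < upto)
    log->flushed_upto = upto;
}

/* Flush one page under the log-before-page rule.  Returns the saved
 * oldest_unflush_lsa so a caller could restore it after a failed write. */
static toy_lsa
toy_flush_page (toy_log *log, toy_page *page)
{
  toy_lsa lsa = page->lsa;                       /* read before "unlocking" */
  toy_lsa saved_oldest = page->oldest_unflush_lsa;

  page->oldest_unflush_lsa = TOY_LSA_NULL;       /* a re-dirty may re-stamp it */
  if (saved_oldest != TOY_LSA_NULL)
    toy_log_force (log, lsa);                    /* gate on oldest, force to newest */

  assert (saved_oldest == TOY_LSA_NULL || log->flushed_upto >= lsa);
  page->written = true;                          /* the write comes last */
  return saved_oldest;
}
```

The assertion sits exactly where the real code places the write: if the force were skipped for a logged page, it would fire.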

Step 4 — the write. DWB use is gated by uses_dwb = dwb_is_created () && !is_temp (temp volumes always bypass it). If uses_dwb, dwb_add_page registers the page’s VPID into the staged slot; the DWB batches pages and flushes a full block to the double-write area and then the data files (block geometry and the torn-page recovery argument are in cubrid-double-write-buffer.md). Subtle branch: if DWB was disabled between staging and adding, dwb_add_page returns dwb_slot == NULL, so the code clears uses_dwb, re-locks, and goto start_copy_page to retry direct. The direct path does a plain fileio_write (bumping num_pages_written, PSTAT_PB_NUM_IOWRITES) with mode FILEIO_WRITE_NO_COMPENSATE_WRITE when a DWB exists globally (double-write makes torn-page compensation redundant), else FILEIO_WRITE_DEFAULT_WRITE.

Step 5a: write failure. The function re-locks the BCB, calls pgbuf_bcb_mark_was_not_flushed (.., was_dirty) to clear FLUSHING and restore DIRTY, restores the saved oldest_unflush_lsa, wakes any PGBUF_LATCH_FLUSH waiters (only if next_wait_thrd != NULL), and returns ER_FAILED.

Step 5b: success, daemon hand-off. This path is taken only when all four conditions hold: the caller is the page flush daemon (is_page_flush_thread), the post-flush daemon exists, some thread is waiting for a direct victim, and pgbuf_Pool.flushed_bcbs accepts the BCB (via produce). The BCB is then left unlocked but un-cleared (mark_was_flushed is deliberately not called on this path), and the post-flush daemon is woken to assign it as a victim (Chapter 9). Step 5c (otherwise) re-locks, calls pgbuf_bcb_mark_was_flushed (clears FLUSHING), and wakes flush waiters when any are queued.

8.4 Waking the FLUSH waiters: `pgbuf_wake_flush_waiters`

Threads that took the block branch in 8.2 park on the BCB’s next_wait_thrd list with request_latch_mode == PGBUF_LATCH_FLUSH. The waker unlinks only FLUSH waiters, leaving READ/WRITE latch waiters in place:

// pgbuf_wake_flush_waiters -- src/storage/page_buffer.c
for (crt_waiter = bcb->next_wait_thrd; crt_waiter != NULL; crt_waiter = save_next_waiter) {
  save_next_waiter = crt_waiter->next_wait_thrd;
  if (crt_waiter->request_latch_mode == PGBUF_LATCH_FLUSH) {
    if (prev_waiter != NULL) prev_waiter->next_wait_thrd = save_next_waiter;
    else bcb->next_wait_thrd = save_next_waiter;     /* <- unlink only FLUSH waiters */
    crt_waiter->next_wait_thrd = NULL;
    pgbuf_wakeup_uncond (crt_waiter);
  } else {
    prev_waiter = crt_waiter;                        /* <- keep latch waiters threaded */
  }
}

The caller must hold the BCB mutex. Both the failure and success paths of 8.3 call it, but only when next_wait_thrd != NULL. Mixing FLUSH and latch waiters on one list is why the loop tracks prev_waiter instead of truncating.
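The selective unlink can be tried in isolation. The sketch below reproduces the prev-pointer technique on a toy singly linked list; `toy_waiter`, `toy_latch_mode`, and `toy_wake_flush_waiters` are hypothetical stand-ins, and a `woken` flag replaces the real thread wakeup.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { TOY_LATCH_READ, TOY_LATCH_WRITE, TOY_LATCH_FLUSH } toy_latch_mode;

typedef struct toy_waiter
{
  toy_latch_mode mode;        /* what this waiter is blocked on */
  struct toy_waiter *next;    /* next waiter on the same BCB */
  int woken;                  /* stands in for the real wakeup call */
} toy_waiter;

/* Unlink and "wake" every FLUSH waiter; keep all other waiters linked
 * in their original order.  Returns the number of waiters woken. */
static int
toy_wake_flush_waiters (toy_waiter **head)
{
  toy_waiter *prev = NULL, *crt, *next;
  int woken = 0;

  for (crt = *head; crt != NULL; crt = next)
    {
      next = crt->next;               /* save before unlinking crt */
      if (crt->mode == TOY_LATCH_FLUSH)
        {
          if (prev != NULL)
            prev->next = next;        /* splice around crt */
          else
            *head = next;             /* crt was the list head */
          crt->next = NULL;
          crt->woken = 1;
          woken++;
        }
      else
        {
          prev = crt;                 /* last waiter that stays on the list */
        }
    }
  return woken;
}
```

Because `prev` only advances over waiters that stay, consecutive FLUSH waiters are spliced out correctly without a second pass.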

8.5 The Page Flush Daemon: candidate collection and flushing

pgbuf_flush_victim_candidates is the daemon body: size the scan, collect dirty candidates from the LRUs, force the log, flush each survivor.

Adaptive scan width. It reads/resets lru_victim_req_cnt and fix_req_cnt for lru_miss_rate, then boosts flush_ratio * num_buffers by up to PGBUF_FLUSH_VICTIM_BOOST_MULT (=10) when misses are high — but only when not in checkpoint (checkpoint already flushes, so boosting would double-flush). The result caps at ~200 MB of pages.

// pgbuf_flush_victim_candidates -- src/storage/page_buffer.c
if (pgbuf_Pool.is_checkpoint == false) {
  lru_dynamic_flush_adj = MAX (1.0f, 1 + (PGBUF_FLUSH_VICTIM_BOOST_MULT - 1) * lru_miss_rate);
  lru_dynamic_flush_adj = MIN (PGBUF_FLUSH_VICTIM_BOOST_MULT, lru_dynamic_flush_adj);
} else lru_dynamic_flush_adj = 1.0f;
check_count_lru = (int) (cfg_check_cnt * lru_dynamic_flush_adj);
check_count_lru = MIN (check_count_lru, (200 * 1024 * 1024) / DB_PAGESIZE);
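To see the numbers this produces, the same arithmetic can be run standalone. The following is a sketch under stated assumptions: `toy_scan_width` and `TOY_BOOST_MULT` are hypothetical names, the page size is passed in rather than read from DB_PAGESIZE, and the miss rate is supplied directly instead of being derived from the two counters.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_BOOST_MULT 10   /* mirrors PGBUF_FLUSH_VICTIM_BOOST_MULT (=10) */

/* Number of LRU entries to scan for one flush pass. */
static int
toy_scan_width (int cfg_check_cnt, float lru_miss_rate, bool is_checkpoint, int page_size)
{
  float adj = 1.0f;           /* no boost during checkpoint */
  int count, cap;

  assert (page_size > 0);
  if (!is_checkpoint)
    {
      adj = 1.0f + (TOY_BOOST_MULT - 1) * lru_miss_rate;
      if (adj < 1.0f)
        adj = 1.0f;
      if (adj > (float) TOY_BOOST_MULT)
        adj = (float) TOY_BOOST_MULT;
    }
  count = (int) (cfg_check_cnt * adj);
  cap = (200 * 1024 * 1024) / page_size;   /* about 200 MB worth of pages */
  return count < cap ? count : cap;
}
```

With a 16 KB page and a configured count of 1000, a 50% miss rate yields 5500 entries and a 100% miss rate yields 10000; a configured count of 5000 at a 100% miss rate hits the 12800-page cap.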

Branches after collection. If victim_count == 0 there is nothing to flush: the function sets *stop = true (so the daemon loop in 8.8 breaks) only when a scan was actually attempted (check_count_lru > 0 && lru_sum_flush_priority > 0), then jumps to end. Otherwise it wakes the log flush daemon because WAL needs the log current (log_wakeup_log_flush_daemon, or logpb_force_flush_pages if no daemon), optionally qsorts the list by VPID under PRM_ID_PB_SEQUENTIAL_VICTIM_FLUSH, and sets is_flushing_victims = true.

Per-candidate loop. For each candidate it locks the BCB and applies four guards:

1. VPID changed / not dirty / already flushing -> num_skipped_already_flushed, unlock, continue.
2. left the LRU victim zone or got fixed/hot -> num_skipped_fixed_or_hot, unlock, continue.
3. logpb_need_wal (page LSA beyond flushed log) -> record max lsa_need_wal, bump count_need_wal, wake log flush daemon, num_skipped_need_wal, unlock, continue.
4. else flush: pgbuf_flush_page_and_neighbors_fb when PGBUF_NEIGHBOR_PAGES > 1, else pgbuf_bcb_flush_with_wal (..., true, &is_bcb_locked) (is_page_flush_thread=true; the loop unlocks the BCB if it stayed locked). On error -> goto end.
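The order of the checks matters: each later one assumes the earlier ones passed. A minimal sketch of that decision ladder follows; `toy_bcb`, `toy_action`, and `toy_classify` are hypothetical, booleans replace the real flag and zone accessors, and integers stand in for VPIDs and LSAs.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum
{
  TOY_SKIP_ALREADY_FLUSHED,   /* stale snapshot, clean, or mid-flush */
  TOY_SKIP_FIXED_OR_HOT,      /* left the victim zone or is in use */
  TOY_SKIP_NEED_WAL,          /* log not yet durable up to the page LSA */
  TOY_FLUSH                   /* all checks passed */
} toy_action;

typedef struct
{
  long vpid;             /* current page identity of the BCB */
  bool dirty;
  bool flushing;
  bool in_victim_zone;
  bool fixed;
  long lsa;              /* page LSA as one integer */
} toy_bcb;

/* Apply the checks in the same order as the per-candidate loop. */
static toy_action
toy_classify (const toy_bcb *bcb, long snapshot_vpid, long log_flushed_upto)
{
  if (bcb->vpid != snapshot_vpid || !bcb->dirty || bcb->flushing)
    return TOY_SKIP_ALREADY_FLUSHED;
  if (!bcb->in_victim_zone || bcb->fixed)
    return TOY_SKIP_FIXED_OR_HOT;
  if (bcb->lsa > log_flushed_upto)
    return TOY_SKIP_NEED_WAL;
  return TOY_FLUSH;
}
```

A candidate that fails two checks is counted under the first one only, which is why the skip counters never double-count a page.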

The repeat retry. At end, if every candidate was skipped purely for WAL (count_need_wal == victim_count) and a thread still waits for a direct victim, the daemon forces the log itself (logpb_flush_log_for_wal) and jumps to repeat exactly once (the second pass asserts LSAs advanced), then clears is_flushing_victims.

Neighbor batching: pgbuf_flush_page_and_neighbors_fb. When PGBUF_NEIGHBOR_PAGES > 1, branch 4 calls this function, which grows a contiguous-VPID window around the anchor so a run of physically adjacent pages is written in one sequential sweep. The window state lives in a static file-scope global, pgbuf_Flush_helper (type pgbuf_batch_flush_helper) — not a per-call stack object; a single shared instance is safe because the dedicated page-flush daemon is the only caller, and each invocation zeroes the counters at entry.

// pgbuf_batch_flush_helper -- src/storage/page_buffer.c
struct pgbuf_batch_flush_helper
{
  int npages;          /* <- pages currently staged in the window */
  int fwd_offset;      /* <- pages added forward (higher pageid) of anchor */
  int back_offset;     /* <- pages added backward (lower pageid) of anchor */
  PGBUF_BCB *pages_bufptr[2 * PGBUF_MAX_NEIGHBOR_PAGES - 1];  /* <- window BCBs */
  VPID vpids[2 * PGBUF_MAX_NEIGHBOR_PAGES - 1];               /* <- their VPIDs */
};
// static PGBUF_BATCH_FLUSH_HELPER pgbuf_Flush_helper;  <- the single shared instance

Field	Role	Why it exists
`npages`	pages staged in the window	end bound of the per-window flush loop; trimmed when a tail/head neighbor is clean
`fwd_offset`	forward reach (higher `pageid`) from anchor	next forward candidate is `anchor + fwd_offset + 1`
`back_offset`	backward reach (lower `pageid`) from anchor	next backward candidate is `anchor - back_offset - 1`
`pages_bufptr[2*MAX-1]`	BCB handles for every window member	the BCBs flushed; sized to reach `PGBUF_MAX_NEIGHBOR_PAGES-1` (=31) each way around the anchor
`vpids[2*MAX-1]`	VPID snapshot per member, parallel to `pages_bufptr[]`	validate key: `pgbuf_flush_neighbor_safe` re-checks it so a member whose VPID changed before its write is skipped

PGBUF_NEIGHBOR_POS (off) indexes the arrays relative to the anchor (PGBUF_NEIGHBOR_PAGES - 1 + off). The window is not strictly dirty-only: when PGBUF_NEIGHBOR_FLUSH_NONDIRTY is enabled the probe deliberately admits interior clean pages to keep the on-disk run contiguous, abandoning the batch only on two consecutive non-dirties (NEIGHBOR_ABORT_TWO_CONSECTIVE_NONDIRTIES) or when non-dirties exceed half the window past a small threshold (NEIGHBOR_ABORT_TOO_MANY_NONDIRTIES). A clean page at the very tail or head is then trimmed (decrement the offset and npages) so the run does not end on a wasted write. Before the sweep the neighbor path enforces WAL once for the whole window:

// pgbuf_flush_page_and_neighbors_fb -- src/storage/page_buffer.c
/* WAL protocol: force log record to disk */
logpb_flush_log_for_wal (thread_p, &log_newest_oldest_unflush_lsa);
for (pos = PGBUF_NEIGHBOR_POS (-helper->back_offset); pos <= PGBUF_NEIGHBOR_POS (helper->fwd_offset); pos++)
  error = pgbuf_flush_neighbor_safe (thread_p, helper->pages_bufptr[pos], &helper->vpids[pos], &was_page_flushed);

pgbuf_flush_neighbor_safe re-routes each member through the single-page path (re-validating its VPID), so per-page WAL still holds; the batch force just guarantees the log is current before the contiguous write begins. A single-page window (npages <= 1) skips the batch force and flushes the lone page directly.
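The anchor-relative indexing and the end trim are easy to get off by one, so here is a small model of both. It is a sketch only: `toy_window`, `toy_neighbor_pos`, and `toy_trim_clean_ends` are hypothetical, the window is shrunk to 4 pages each way, and a `dirty[]` array replaces the BCB handles.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_NEIGHBOR_PAGES 4                          /* toy value, not the real setting */
#define TOY_WINDOW_SIZE (2 * TOY_NEIGHBOR_PAGES - 1)  /* anchor plus 3 each way */

typedef struct
{
  int npages;                    /* pages currently staged */
  int fwd_offset;                /* reach above the anchor */
  int back_offset;               /* reach below the anchor */
  bool dirty[TOY_WINDOW_SIZE];   /* dirty state per window slot */
} toy_window;

/* Map an anchor-relative offset to an array index; the anchor is offset 0. */
static int
toy_neighbor_pos (int off)
{
  assert (off > -TOY_NEIGHBOR_PAGES && off < TOY_NEIGHBOR_PAGES);
  return TOY_NEIGHBOR_PAGES - 1 + off;
}

/* Drop one clean page from each end, so the run does not begin or end
 * on a write that was not needed. */
static void
toy_trim_clean_ends (toy_window *w)
{
  if (w->fwd_offset > 0 && !w->dirty[toy_neighbor_pos (w->fwd_offset)])
    {
      w->fwd_offset--;
      w->npages--;
    }
  if (w->back_offset > 0 && !w->dirty[toy_neighbor_pos (-w->back_offset)])
    {
      w->back_offset--;
      w->npages--;
    }
}
```

Interior clean pages are left alone on purpose; only the two ends are candidates for trimming.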

8.6 `pgbuf_get_victim_candidates_from_lru`

Called from 8.5, it walks each LRU from the bottom through the victim zone, budgeting by each list’s lru_victim_flush_priority_per_lru:

// pgbuf_get_victim_candidates_from_lru -- src/storage/page_buffer.c
for (bufptr = pgbuf_Pool.buf_LRU_list[lru_idx].bottom;
     bufptr != NULL && PGBUF_IS_BCB_IN_LRU_VICTIM_ZONE (bufptr) && i > 0;
     bufptr = bufptr->prev_BCB, i--) {
  if (pgbuf_bcb_is_dirty (bufptr)) {
    pgbuf_Pool.victim_cand_list[victim_cand_count].bufptr = bufptr;
    pgbuf_Pool.victim_cand_list[victim_cand_count].vpid = bufptr->vpid;
    victim_cand_count++;                              /* <- dirty -> flush before victimization */
  }
#if defined (SERVER_MODE)
  else if (try_direct_assign && pgbuf_is_any_thread_waiting_for_direct_victim ()
           && pgbuf_is_bcb_victimizable (bufptr, false) && PGBUF_BCB_TRYLOCK (bufptr) == 0) {
    if (pgbuf_is_bcb_victimizable (bufptr, true) && pgbuf_assign_direct_victim (thread_p, bufptr)) {
      try_direct_assign = false; *assigned_directly = true;  /* <- clean bcb handed to a waiter */
    }
    PGBUF_BCB_UNLOCK (bufptr);
  }
#endif
}

Two outputs: dirty BCBs go to the candidate list (they need a flush before victimization), while a single clean victimizable BCB may be handed straight to a starving waiter (assigned_directly) under trylock so the scan never blocks. Each candidate's VPID is snapshotted alongside its pointer so the flush loop in 8.5 can detect a reassigned BCB (guard 1 there). The whole walk runs under the per-LRU mutex.

8.7 The seq-flusher and `pgbuf_flush_seq_list` pacing

Checkpoint flushing is rate-controlled by a PGBUF_SEQ_FLUSHER: unlike the victim daemon, it spreads writes across one-second “super-intervals” so checkpoint I/O does not starve the foreground.

struct pgbuf_seq_flusher — every field:

Field	Role	Why it exists
`flush_list`	array of `PGBUF_VICTIM_CANDIDATE_LIST` (bufptr+vpid)	working set for this pass
`flush_upto_lsa`	newest oldest-LSA over all listed pages	WAL gate; pages beyond it are skipped
`control_intervals_cnt`	intervals elapsed this 1 s super-interval	feeds the `flush_per_interval` math
`control_flushed`	pages flushed so far this super-interval	lets a slow interval be compensated next
`interval_msec`	duration of one pacing interval	computed in `pgbuf_flush_chkpt_seq_list` as `1000 * PGBUF_CHKPT_BURST_PAGES / chkpt_flush_rate`, where `chkpt_flush_rate = 1000 / PRM_ID_LOG_CHECKPOINT_SLEEP_MSECS` — not from the struct `flush_rate` field
`flush_max_size`	capacity of `flush_list`, set at init	batch-size bound; checkpoint refills when full
`flush_cnt`	live element count	end bound of the flush loop
`flush_idx`	index of next element to flush	resumes across interval boundaries
`flushed_pages`	pages flushed this call (return param)	accumulated by the caller
`flush_rate`	max pages/sec (negative = unlimited)	target the pacing math converges to; set to `chkpt_flush_rate` each interval
`burst_mode`	flush a chunk ASAP vs one page then sleep	burst keeps data I/O sequential

pgbuf_initialize_seq_flusher zeroes the struct, sets flush_max_size, allocates flush_list, and defaults burst_mode = true. pgbuf_flush_seq_list derives flush_per_interval from the control counters: with control_intervals_cnt > 0 it targets flush_rate * (control_intervals_cnt+1) / control_total_cnt_intervals minus what was already flushed (compensation), floored at PGBUF_CHKPT_MIN_FLUSH_RATE (=50) scaled by the interval. The loop runs while flush_idx < flush_cnt && flushed_pages < flush_per_interval:

// pgbuf_flush_seq_list -- src/storage/page_buffer.c
PGBUF_BCB_LOCK (bufptr); locked_bcb = true;
if (!VPID_EQ (&bufptr->vpid, &f_list[seq_flusher->flush_idx].vpid) || !pgbuf_bcb_is_dirty (bufptr)
    || (flush_if_already_flushed == false && !LSA_ISNULL (&bufptr->oldest_unflush_lsa)
        && LSA_GT (&bufptr->oldest_unflush_lsa, &seq_flusher->flush_upto_lsa)))
  { PGBUF_BCB_UNLOCK (bufptr); dropped_pages++; continue; }   /* <- stale / beyond chkpt horizon */
if (pgbuf_bcb_safe_flush_force_lock (thread_p, bufptr, true) == NO_ERROR) { /* ... done_flush = true */ }

The flush_if_already_flushed heuristic re-flushes an already-flushed page only if its VPID is contiguous with the next list entry — preferring write sequentiality over avoiding a redundant write. After each page, non-burst mode sleeps time_remaining / pages_remaining ms (skipped when below 1000 / PGBUF_CHKPT_MAX_FLUSH_RATE) to spread the writes; burst mode only checks the absolute limit_time and breaks (*time_rem = -1) when exceeded. flush_upto_lsa is the WAL gate: only pages whose oldest-unflush LSA is at or below it flush in this checkpoint.
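The two pacing formulas can be checked with plain integer arithmetic. This sketch is an assumption-laden simplification: `toy_interval_msec` and `toy_spread_sleep_msec` are hypothetical names, the burst size and maximum rate are parameters because their real values are not given here, and the real code may use floating point.

```c
#include <assert.h>

/* Length of one pacing interval: time needed to flush one burst at the
 * target rate (pages per second). */
static int
toy_interval_msec (int burst_pages, int flush_rate)
{
  assert (flush_rate > 0);
  return 1000 * burst_pages / flush_rate;
}

/* Non-burst mode: sleep between pages so the remaining pages are spread
 * evenly over the remaining time.  Returns 0 when the sleep would be
 * shorter than one page slot at the maximum rate. */
static int
toy_spread_sleep_msec (int time_rem_msec, int pages_rem, int max_flush_rate)
{
  int sleep_msec;

  assert (max_flush_rate > 0);
  if (pages_rem <= 0 || time_rem_msec <= 0)
    return 0;
  sleep_msec = time_rem_msec / pages_rem;
  if (sleep_msec < 1000 / max_flush_rate)
    return 0;                       /* too short to be worth sleeping */
  return sleep_msec;
}
```

For example, 10 pages left in a 100 ms interval gives a 10 ms sleep per page, unless the maximum rate is low enough that 10 ms falls under the threshold.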

8.8 The three daemons and checkpoint flush

Three SERVER_MODE daemons register via REGISTER_DAEMON:

Daemon	Task type	Looper	Body
`pgbuf_Page_flush_daemon`	dedicated `pgbuf_page_flush_daemon_task` (subclass of `cubthread::entry_task`)	`pgbuf_get_page_flush_interval` (timed if `PRM_ID_PAGE_BG_FLUSH_INTERVAL_MSECS > 0`, else infinite wait)	loop `pgbuf_flush_victim_candidates`
`pgbuf_Page_post_flush_daemon`	`entry_callable_task (pgbuf_page_post_flush_execute)`	3-tier looper {1,10,100} ms	`pgbuf_assign_flushed_pages` — drain `flushed_bcbs`, assign direct victims
`pgbuf_Page_maintenance_daemon`	`entry_callable_task (pgbuf_page_maintenance_execute)`	fixed 100 ms	`pgbuf_adjust_quotas` + `pgbuf_direct_victims_maintenance`

The flush daemon runs at least once if explicitly woken (was_woken_up), then loops while pgbuf_keep_victim_flush_thread_running or until pgbuf_flush_victim_candidates sets stop_iteration:

// pgbuf_page_flush_daemon_task::execute -- src/storage/page_buffer.c
bool force_one_run = pgbuf_Page_flush_daemon->was_woken_up ();
while (force_one_run || pgbuf_keep_victim_flush_thread_running ()) {
  pgbuf_flush_victim_candidates (&thread_ref, prm_get_float_value (PRM_ID_PB_BUFFER_FLUSH_RATIO),
                                 &m_perf_track, &stop_iteration);
  force_one_run = false;
  if (stop_iteration) break;
}

It is the only daemon implemented as a dedicated task class; post-flush and maintenance are plain functions wrapped in entry_callable_task. Foreground threads nudge it via pgbuf_wakeup_page_flush_daemon when no victim is found. pgbuf_flush_control_from_dirty_ratio adds a separate adaptive signal to flush harder before the pool saturates: a rate bump that grows quadratically once dirties_cnt exceeds num_buffers/2, plus the dirty growth rate.

Checkpoint flush. pgbuf_flush_checkpoint forces the log to flush_upto_lsa, sets is_checkpoint=true, then scans all BCBs. Each dirty non-temporary page with oldest_unflush_lsa <= flush_upto_lsa is appended to the shared seq_chkpt_flusher.flush_list; when the list is full (>= flush_max_size) it is qsorted by VPID, drained through pgbuf_flush_chkpt_seq_list (which calls the paced pgbuf_flush_seq_list), and then refilled. A page older than prev_chkpt_redo_lsa asserts (ER_LOG_CHECKPOINT_SKIP_INVALID_PAGE), since it should have been flushed in the previous checkpoint. The smallest unflushed LSA among the skipped pages is returned in smallest_lsa to advance the redo horizon. The flush_all family (pgbuf_flush_all, _all_unfixed, _all_unfixed_and_set_lsa_as_null) is an unpaced sweep over all BCBs via pgbuf_flush_all_helper, used only by the log/recovery manager.

8.9 Chapter summary — key takeaways

All single-page flushes funnel through pgbuf_bcb_safe_flush_internal, whose CAS loop on the atomic latch picks immediate flush, async-request-on-unfix, or block-on-PGBUF_LATCH_FLUSH.
pgbuf_bcb_flush_with_wal enforces the WAL invariant by saving the page LSA, NULLing oldest_unflush_lsa under the mutex, dropping it, and calling logpb_flush_log_for_wal (.., &lsa) before any fileio_write / dwb_add_page; skipping the force would lose redo for a persisted page.
DIRTY clears at flush start (pgbuf_bcb_mark_is_flushing) so a concurrent re-dirty is never lost; a failed write restores DIRTY and the saved oldest_unflush_lsa via pgbuf_bcb_mark_was_not_flushed.
At most one flusher per BCB: FLUSHING_TO_DISK plus the blocking FLUSH-latch path serialize writers, preventing an old image from overwriting a newer one.
The Page Flush Daemon (pgbuf_flush_victim_candidates + pgbuf_get_victim_candidates_from_lru) collects dirty bottom-of-LRU candidates, skips fixed/hot/need-WAL pages, may batch neighbors through the shared-global pgbuf_Flush_helper window (which forces WAL once for the whole run and can include interior clean pages for sequentiality), and retries once when all candidates were WAL-blocked.
Checkpoint uses a rate-controlled PGBUF_SEQ_FLUSHER (pgbuf_flush_seq_list) with burst/spread pacing and a flush_upto_lsa WAL gate; flush_all* is an unpaced sweep.
Three daemons exist — one dedicated class (page flush) plus two callable tasks (post-flush, maintenance) — and the DWB, when created, stages every non-temp write into a block before the data files (see cubrid-double-write-buffer.md).

Chapter 9: Victim Selection the LFCQs and Direct Victim Hand-off

This chapter answers: when the invalid (free) list is empty, how does a thread find an evictable BCB, and when none is found, how is a freed BCB handed straight to a sleeping waiter? The high-level companion sketched “LFCQ — victim selection” and “Direct victim hand-off” at altitude; here we trace every branch.

The two paths are duals. The pull path (pgbuf_get_victim) scans the layered lock-free queues (LFCQs) for a clean BCB to claim. The push path (pgbuf_assign_direct_victim / pgbuf_get_direct_victim) is the inverse: a producer that already cleaned a BCB wakes a waiter, writes the BCB into the waiter’s mailbox slot, and skips the LRU. A thread that fails the pull path becomes a waiter (suspend/wake is Ch. 5).

9.1 The two structs: mailbox and candidate slot

pgbuf_direct_victim is the global mailbox-and-queues record (pgbuf_Pool.direct_victims), SERVER_MODE-only. pgbuf_victim_candidate_list is the scratch array the flush daemon (Ch. 8) fills; it is listed here for completeness, because its VPID snapshot is what lets the flusher detect a reassigned BCB.

// pgbuf_direct_victim -- src/storage/page_buffer.c
struct pgbuf_direct_victim {
  PGBUF_BCB **bcb_victims;        /* per-thread mailbox: bcb_victims[tid] = BCB handed to thread tid */
  lockfree::circular_queue<THREAD_ENTRY *> *waiter_threads_high_priority;
  lockfree::circular_queue<THREAD_ENTRY *> *waiter_threads_low_priority;
};
// pgbuf_victim_candidate_list -- src/storage/page_buffer.c
struct pgbuf_victim_candidate_list {
  PGBUF_BCB *bufptr;             /* selected BCB as victim candidate */
  VPID vpid;                     /* page id of the page managed by the BCB */
};

Struct.Field	Role	Why it exists
`direct_victim.bcb_victims`	Array of `num_total_threads` BCB ptrs by `thread_p->index`; slot `[tid]` = tid’s BCB or `NULL`.	The mailbox. Producer writes a slot under the waiter’s entry lock, waiter reads+`NULL`s its own on wake — one slot/thread, no contention.
`direct_victim.waiter_threads_high_priority`	LFCQ of threads blocking the system on a victim.	Drained first — latency-critical fixers jump the queue.
`direct_victim.waiter_threads_low_priority`	LFCQ of threads that tolerate waiting.	Drained 1-in-4 ahead of high — the 75/25 weighting (§9.5).
`victim_candidate_list.bufptr`	BCB the flush pass selected to clean.	Lets the flusher re-lock+flush in a second pass without re-scanning.
`victim_candidate_list.vpid`	Snapshot of `bufptr->vpid` at selection.	Detects reassignment before flush; a stale `bufptr` whose `vpid` no longer matches is skipped.

9.2 `pgbuf_get_victim` — the staged LFCQ scan

The queues hold list indices, never BCBs; an index sits in a queue iff its list has count_vict_cand > 0 and PGBUF_LRU_VICTIM_LFCQ_FLAG set. The function walks four stages, returning the first locked BCB it claims.

Stage 1 — own private, only with a private LRU that is over quota:

// pgbuf_get_victim -- src/storage/page_buffer.c
if (PGBUF_THREAD_HAS_PRIVATE_LRU (thread_p)) {
    private_lru_idx = PGBUF_LRU_INDEX_FROM_PRIVATE (PGBUF_PRIVATE_LRU_FROM_THREAD (thread_p));
    lru_list = PGBUF_GET_LRU_LIST (private_lru_idx);
    if (PGBUF_LRU_LIST_IS_ONE_TWO_OVER_QUOTA (lru_list)                      /* zone1+2 exceeds quota */
        || (PGBUF_LRU_LIST_IS_OVER_QUOTA (lru_list) && lru_list->count_vict_cand > 0)) {
        victim = pgbuf_get_victim_from_lru_list (thread_p, private_lru_idx);
        if (victim != NULL) { return victim; }                              /* <- happy path */
        if (!PGBUF_VACUUM_SHOULD_IGNORE_UNFIX (thread_p))
          restrict_other = PGBUF_LRU_LIST_IS_OVER_QUOTA_WITH_BUFFER (lru_list); /* gate stage 2 */
        searched_own = true;
      } }

restrict_other is set only for a non-vacuum thread comfortably over quota (quota + MAX(10, quota*0.01) buffer); it confines stage 2 to big-private lists. searched_own stops stage 4 repeating stage 1.
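The "comfortably over quota" threshold is simple arithmetic. The sketch below is an assumption, not the macro itself: `toy_quota_buffer` and `toy_over_quota_with_buffer` are hypothetical, integer division by 100 stands in for `quota*0.01`, and a strict `>` comparison is assumed.

```c
#include <assert.h>
#include <stdbool.h>

/* Slack above quota: 1% of the quota, but never less than 10 pages. */
static int
toy_quota_buffer (int quota)
{
  int one_percent;

  assert (quota >= 0);
  one_percent = quota / 100;
  return one_percent > 10 ? one_percent : 10;
}

/* True when the list exceeds its quota by more than the slack. */
static bool
toy_over_quota_with_buffer (int count, int quota)
{
  return count > quota + toy_quota_buffer (quota);
}
```

For a quota of 500 the slack is the 10-page floor, so the threshold is 510; for a quota of 5000 the 1% term takes over and the threshold is 5050.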

Stage 2 — other private, entered only when PGBUF_PAGE_QUOTA_IS_ENABLED && has_flush_thread; it calls pgbuf_lfcq_get_victim_from_private_lru (thread_p, restrict_other) (§9.4) and returns on the first claim.

Stage 3 — shared, in a guarded loop — the only looping stage, and only without a flush daemon to refill candidates:

do {
    victim = pgbuf_lfcq_get_victim_from_shared_lru (thread_p, has_flush_thread);
    if (victim != NULL) { return victim; }                                  /* <- happy path */
    current_consume_cursor = pgbuf_Pool.shared_lrus_with_victims->get_consumer_cursor ();
  }
while (!has_flush_thread && !pgbuf_Pool.shared_lrus_with_victims->is_empty ()
       && ((int) (current_consume_cursor - initial_consume_cursor) <= pgbuf_Pool.num_LRU_list)
       && (++nloops <= pgbuf_Pool.num_LRU_list));

The four while conditions each stop the spin: a flush daemon present, the queue empty, more indices consumed than there are shared lists, or nloops exceeding num_LRU_list (a paranoia guard). With a flush daemon the body runs exactly once.

Stage 4 — last-resort own private, ignoring quota. Only if stages 1-3 failed and stage 1 never ran (PGBUF_THREAD_HAS_PRIVATE_LRU && !searched_own), it re-calls pgbuf_get_victim_from_lru_list on the own list and returns the result; otherwise it falls through to return victim (NULL). This guards the source-documented deadlock: all private lists just below quota, shared lists with no zone-3, nothing victimizable or flushable. A NULL return tells the caller to enqueue as a waiter and sleep (Ch. 5). Figure 9-1 traces all four stages.

Figure 9-1 — pgbuf_get_victim staged scan.

flowchart TD
  B{"own private over quota?"} -- yes --> C["victim_from_lru own"]
  C -->|found| R["return victim"]
  C -->|fail| D["restrict_other, searched_own"]
  B -- no --> E
  D --> E{"quota+flush thread?"}
  E -- yes --> F["lfcq private: big then ordinary"]
  F -->|found| R
  E -- no --> G
  F -->|fail| G["loop: lfcq shared"]
  G -->|found| R
  G -->|exhausted| H{"not searched_own?"}
  H -- yes --> I["victim_from_lru own, no quota"]
  I -->|found| R
  H -- no --> J["NULL: wait"]
  I -->|fail| J

9.3 `pgbuf_get_victim_from_lru_list` — bottom-up scan, four exclusions

Where a BCB is actually claimed: scan upward from victim_hint (or from the list bottom when the hint is NULL) through zone 3, apply the exclusion mask, and on success remove the BCB and return it locked. Three early NULL returns precede the scan, then the hint is resynced:

// pgbuf_get_victim_from_lru_list -- src/storage/page_buffer.c
if (lru_list->count_vict_cand == 0) { return NULL; }                        /* <- 1: no candidates, no mutex */
pthread_mutex_lock (&lru_list->mutex);
if (lru_list->bottom == NULL || !PGBUF_IS_BCB_IN_LRU_VICTIM_ZONE (lru_list->bottom))
  { pthread_mutex_unlock (&lru_list->mutex); return NULL; }                 /* <- 2: no zone-3 */
if (PGBUF_IS_PRIVATE_LRU_ONE_TWO_OVER_QUOTA (lru_idx))
  pgbuf_lru_adjust_zones (thread_p, lru_list, false);                       /* shrink zone1 so zone3 grows */
lru_victim_cnt = lru_list->count_vict_cand;
if (lru_victim_cnt <= 0) { pthread_mutex_unlock (&lru_list->mutex); return NULL; } /* <- 3: race emptied it */
if (!pgbuf_bcb_is_dirty (lru_list->bottom) && lru_list->victim_hint != lru_list->bottom)
  (void) ATOMIC_TAS_ADDR (&lru_list->victim_hint,                          /* resync drifted hint */
                          PGBUF_IS_BCB_IN_LRU_VICTIM_ZONE (lru_list->bottom) ? lru_list->bottom : (PGBUF_BCB *) NULL);
bufptr_start = (lru_list->victim_hint == NULL) ? lru_list->bottom : lru_list->victim_hint;

The scan loop. Walk prev_BCB upward from bufptr_start, in zone 3, capped at MAX_DEPTH (1000). Per BCB:

Excl. 1 — avoid-victim flag. pgbuf_bcb_avoid_victim tests PGBUF_BCB_INVALID_VICTIM_CANDIDATE_MASK = DIRTY | FLUSHING_TO_DISK | VICTIM_DIRECT | INVALIDATE_DIRECT_VICTIM (the four exclusions: dirty, mid-flush, already-assigned, invalidation-pending). Any bit → continue.
Excl. 2 — fixed/has waiters. pgbuf_is_bcb_fixed_by_any (bufptr, false): if fcnt > 0, next_wait_thrd != NULL, or latch held, it is valid-but-busy. Record as bufptr_victimizable (first becomes the hint via CAS), count it, continue; break when found_victim_cnt reaches lru_victim_cnt.
Claim — PGBUF_BCB_TRYLOCK (conditional: we hold the list mutex and must not block on the BCB mutex — lock-order rule, Ch. 5):
- Trylock ok + pgbuf_is_bcb_victimizable(bufptr, true): the win — advance hint to bufptr->prev_BCB, pgbuf_remove_from_lru_list, then panic-assign via pgbuf_panic_assign_direct_victims_from_lru iff waiter_threads_low_priority->size() >= 5 + num_total_threads/20 (the low-priority backlog drain), wake the flush daemon if the new bottom is dirty, unlock, push VPID to Aout (Ch. 7), return locked BCB.
- Trylock ok but not victimizable (flag flipped under us): PGBUF_BCB_UNLOCK, next iteration.
- Trylock fails: the BCB mutex is held elsewhere, only possible with a flush daemon — asserts pgbuf_is_page_flush_daemon_available(). Record + hint, count it, honor the early-out.

TO_VACUUM note: PGBUF_BCB_TO_VACUUM_FLAG is not in the mask, so a to-vacuum BCB is still victimizable here; its forcing to the LRU bottom happens at unfix/direct-assign time, not in this scan.
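The mask test and the panic-assign threshold can be expressed compactly. In this sketch the bit positions are invented for illustration (the real flags live in the high byte of the packed flags word), and `toy_avoid_victim` and `toy_panic_threshold` are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical bit values; only the mask membership mirrors the text. */
#define TOY_FLAG_DIRTY                     0x01u
#define TOY_FLAG_FLUSHING_TO_DISK          0x02u
#define TOY_FLAG_VICTIM_DIRECT             0x04u
#define TOY_FLAG_INVALIDATE_DIRECT_VICTIM  0x08u
#define TOY_FLAG_TO_VACUUM                 0x10u   /* deliberately not in the mask */

#define TOY_INVALID_VICTIM_CANDIDATE_MASK \
  (TOY_FLAG_DIRTY | TOY_FLAG_FLUSHING_TO_DISK \
   | TOY_FLAG_VICTIM_DIRECT | TOY_FLAG_INVALIDATE_DIRECT_VICTIM)

/* True when any of the four exclusion bits is set. */
static bool
toy_avoid_victim (unsigned flags)
{
  return (flags & TOY_INVALID_VICTIM_CANDIDATE_MASK) != 0u;
}

/* Low-priority backlog size at which a successful claim also triggers
 * panic-assignment. */
static int
toy_panic_threshold (int num_total_threads)
{
  return 5 + num_total_threads / 20;
}
```

A BCB carrying only the to-vacuum bit passes the mask, which matches the note above; with 100 threads the backlog threshold is 10 waiters.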

Failure tail. No claim, and a stale hint with no candidate found (bufptr_victimizable == NULL && victim_hint != NULL) → reset hint to bottom (if candidates remain) or NULL via CAS, unlock, wake flush daemon, return NULL.

Invariant — victim_hint marks the lowest point worth scanning; count_vict_cand counts clean zone-3 BCBs. The scan walks only upward from the hint and trusts count_vict_cand (kept by the LRU bookkeeping helpers as BCBs enter/leave zone 3 clean) as the early-out ceiling. The hint stays honest via the CAS-advance on every claim/record and the resync above. The documented TPCC drift (hint sits before the first victimizable BCB) only wastes scan steps — each candidate is re-validated under its own BCB lock before being claimed, so the hint is a performance hint, never a safety property; drift is tolerated, not fixed.

9.4 Quota gating in the private LFCQ helper

pgbuf_lfcq_get_victim_from_private_lru picks which private list to scan and whether to re-enqueue it:

// pgbuf_lfcq_get_victim_from_private_lru -- src/storage/page_buffer.c
if (pgbuf_Pool.big_private_lrus_with_victims->consume (lru_idx)) { /* big first */ }
else {
    if (restricted) { return NULL; }                                        /* <- restricted: big only */
    if (!pgbuf_Pool.private_lrus_with_victims->consume (lru_idx)) { return NULL; } /* <- both empty */ }
lru_list = PGBUF_GET_LRU_LIST (lru_idx);
if (PGBUF_LRU_LIST_COUNT (lru_list) > PBGUF_BIG_PRIVATE_MIN_SIZE              /* big: >100 ... */
    && PGBUF_LRU_LIST_COUNT (lru_list) > 2 * lru_list->quota && lru_list->count_vict_cand > 1) {
    if (pgbuf_Pool.big_private_lrus_with_victims->produce (lru_idx)) added_back = true; } /* re-queue BIG before scan */
victim = pgbuf_get_victim_from_lru_list (thread_p, lru_idx);
if (added_back) return victim;
if (lru_list->count_vict_cand > 0 && PGBUF_LRU_LIST_IS_OVER_QUOTA (lru_list))
  { if (pgbuf_Pool.private_lrus_with_victims->produce (lru_idx)) return victim; }
lru_list->flags &= ~PGBUF_LRU_VICTIM_LFCQ_FLAG;   /* not re-queued: clear so next candidate re-adds it */
return victim;

Invariant — a private list is victimizable only while over quota. “Big-private” = count > 100 and count > 2*quota and >1 candidate; re-queued before scanning so peers drain it in parallel. A non-big list is re-queued only while still over quota with candidates; otherwise its LFCQ flag is cleared and it leaves rotation until pgbuf_adjust_quotas (§9.8) or a new candidate re-adds it. A list at/below quota is never poached. The shared sibling pgbuf_lfcq_get_victim_from_shared_lru has no quota, so it simply re-enqueues while count_vict_cand > 0 and (single-threaded) retries the same list once.
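The re-queue rules reduce to two predicates evaluated at different moments: "big" before the scan, "still over quota with candidates" after it. A minimal sketch, assuming `produce` always succeeds and that over-quota means `count > quota`; `toy_lru`, `toy_is_big_private`, and `toy_requeue_decision` are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_BIG_PRIVATE_MIN_SIZE 100

typedef struct
{
  int count;             /* BCBs in the list */
  int quota;             /* pages this private list is entitled to */
  int count_vict_cand;   /* clean zone-3 candidates */
} toy_lru;

typedef enum { TOY_REQUEUE_BIG, TOY_REQUEUE_PRIVATE, TOY_LEAVE_ROTATION } toy_requeue;

/* Big-private: large in absolute terms, more than twice its quota,
 * and holding more than one candidate. */
static bool
toy_is_big_private (const toy_lru *lru)
{
  return lru->count > TOY_BIG_PRIVATE_MIN_SIZE
    && lru->count > 2 * lru->quota && lru->count_vict_cand > 1;
}

/* Decide where the list index goes, given its state before and after
 * the victim scan. */
static toy_requeue
toy_requeue_decision (const toy_lru *before_scan, const toy_lru *after_scan)
{
  if (toy_is_big_private (before_scan))
    return TOY_REQUEUE_BIG;             /* re-queued before scanning */
  if (after_scan->count_vict_cand > 0 && after_scan->count > after_scan->quota)
    return TOY_REQUEUE_PRIVATE;         /* still worth poaching */
  return TOY_LEAVE_ROTATION;            /* LFCQ flag would be cleared */
}
```

The third outcome corresponds to clearing PGBUF_LRU_VICTIM_LFCQ_FLAG so that a later candidate can re-add the list.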

9.5 `pgbuf_assign_direct_victim` — producer side

When a BCB becomes clean+free (end of flush, panic-assign in §9.3, or last-unfix), its owner may hand it to a waiter. The producer holds the BCB mutex; the only invalidating flag tolerated is FLUSHING_TO_DISK (flush itself calls this):

// pgbuf_assign_direct_victim -- src/storage/page_buffer.c
while (pgbuf_get_thread_waiting_for_direct_victim (waiter_thread)) {         /* 75/25: low 1-in-4, else high */
    thread_lock_entry (waiter_thread);
    if (waiter_thread->resume_status != THREAD_ALLOC_BCB_SUSPENDED)
      { thread_unlock_entry (waiter_thread); continue; }                    /* <- waiter gone, try next */
    thread_wakeup_already_had_mutex (waiter_thread, THREAD_ALLOC_BCB_RESUMED);
    pgbuf_bcb_update_flags (thread_p, bcb, PGBUF_BCB_VICTIM_DIRECT_FLAG, PGBUF_BCB_FLUSHING_TO_DISK_FLAG);
    pgbuf_Pool.direct_victims.bcb_victims[waiter_thread->index] = bcb;      /* <- write mailbox before unlock */
    thread_unlock_entry (waiter_thread);
    return true; }                                                          /* <- assigned */
return false;                                                               /* <- no waiters */

pgbuf_get_thread_waiting_for_direct_victim holds the 75/25 weighting (low queue 1-in-4, else high), skipping dead queue entries. The while skips any waiter no longer THREAD_ALLOC_BCB_SUSPENDED; the BCB pointer is written before the entry lock releases, so the waiter never wakes to an empty slot. Empty queues → false, and the caller disposes of the BCB normally.
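One way to obtain a 1-in-4 preference is a running counter that tries the low queue first on every fourth call and falls back to the other queue when the preferred one is empty. The sketch below illustrates that idea only; how the real function chooses is not shown in this section, and `toy_queue`, `toy_pop`, and `toy_next_waiter` are hypothetical (plain arrays replace the lock-free queues).

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_QUEUE_CAP 8

typedef struct
{
  int items[TOY_QUEUE_CAP];   /* waiting thread indices */
  int head;                   /* next slot to consume */
  int tail;                   /* one past the last produced slot */
} toy_queue;

/* Pop the oldest entry; false when the queue is empty. */
static bool
toy_pop (toy_queue *q, int *out)
{
  if (q->head == q->tail)
    return false;
  *out = q->items[q->head++];
  return true;
}

/* Pick the next waiter: low queue first on every 4th call, high queue
 * first otherwise, falling back to the other queue if the first is empty. */
static bool
toy_next_waiter (toy_queue *high, toy_queue *low, unsigned *counter,
                 int *out, bool *from_low)
{
  bool low_first = ((*counter)++ % 4u) == 0u;
  toy_queue *first = low_first ? low : high;
  toy_queue *second = low_first ? high : low;

  if (toy_pop (first, out))
    {
      *from_low = (first == low);
      return true;
    }
  if (toy_pop (second, out))
    {
      *from_low = (second == low);
      return true;
    }
  return false;                /* both queues empty */
}
```

The fallback keeps low-priority waiters from starving when the high queue is busy, and keeps victims from being wasted when one queue is empty.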

Invariant — a handed-off victim is exclusively owned, so no other thread can claim it. The producer enters with the BCB mutex held (PGBUF_BCB_CHECK_OWN) and stamps PGBUF_BCB_VICTIM_DIRECT_FLAG while still holding it — and that flag is one of the four bits in PGBUF_BCB_INVALID_VICTIM_CANDIDATE_MASK, so from that instant pgbuf_bcb_avoid_victim returns true and the §9.3 scan, the §9.4 helpers, and pgbuf_assign_flushed_pages all skip the BCB. The only writer of bcb_victims[tid] is the producer; the only reader is thread tid via the ATOMIC_TAS_ADDR in §9.6 — one slot per thread, single producer, single consumer. Even pgbuf_invalidate_bcb defers (§9.7). Thus between assignment and collection the BCB is logically owned by exactly one waiter; a concurrent re-fixer cannot steal it, only mark INVALIDATE_DIRECT_VICTIM to release it back (§9.6).

9.6 `pgbuf_get_direct_victim` — consumer side and the invalidation retry

The slot read is a TAS that clears the slot atomically:

// pgbuf_get_direct_victim -- src/storage/page_buffer.c
PGBUF_BCB *bcb = (PGBUF_BCB *) ATOMIC_TAS_ADDR (&pgbuf_Pool.direct_victims.bcb_victims[thread_p->index], NULL);
PGBUF_BCB_LOCK (bcb);
if (pgbuf_bcb_is_invalid_direct_victim (bcb)) {                             /* <- re-fix race */
    pgbuf_bcb_update_flags (thread_p, bcb, 0, PGBUF_BCB_INVALIDATE_DIRECT_VICTIM_FLAG); /* clear it */
    PGBUF_BCB_UNLOCK (bcb);
    return NULL; }                                                          /* <- caller re-sleeps */
pgbuf_bcb_update_flags (thread_p, bcb, 0, PGBUF_BCB_VICTIM_DIRECT_FLAG);    /* clear VICTIM_DIRECT */
if (!pgbuf_is_bcb_victimizable (bcb, true)) { assert (false); PGBUF_BCB_UNLOCK (bcb); return NULL; }
switch (pgbuf_bcb_get_zone (bcb)) {
  case PGBUF_VOID_ZONE: break;                                             /* already detached (from flush) */
  case PGBUF_INVALID_ZONE: assert (false); break;                          /* impossible */
  default:                                                                  /* still in an LRU: detach + Aout */
    lru_idx = pgbuf_bcb_get_lru_index (bcb);
    pgbuf_lru_remove_bcb (thread_p, bcb);
    pgbuf_add_vpid_to_aout_list (thread_p, &bcb->vpid, lru_idx); break; }
return bcb;                                                                 /* locked, in VOID zone */

The invalidation retry. Between assignment and collection a re-fixer may find the BCB on a hash hit (Ch. 3). It cannot steal a VICTIM_DIRECT BCB, so it sets INVALIDATE_DIRECT_VICTIM; the waiter observes it, clears the flag (releasing ownership), unlocks, and returns NULL — which the caller treats like a failed pgbuf_get_victim (re-enqueue and sleep). The BCB is left in place — the “puts it back and re-sleeps” path. Otherwise the zone switch detaches an in-LRU BCB (Aout-recorded) or no-ops a VOID one; post-condition: locked, VOID_ZONE, ready for reuse.
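
The mailbox and the invalidation retry can be modelled in a few lines of C11. This is a sketch under stated assumptions, not the real code: `model_bcb`, `hand_off`, `refix_hit` and `collect` are hypothetical names, the flag bit values are invented, BCB and thread-entry locking is omitted, and the model assumes the re-fixer swaps VICTIM_DIRECT for INVALIDATE_DIRECT_VICTIM in one step.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define VICTIM_DIRECT             0x1u   /* hypothetical bit values */
#define INVALIDATE_DIRECT_VICTIM  0x2u

typedef struct { unsigned flags; int page_id; } model_bcb;

#define MAX_THREADS 4
static _Atomic (model_bcb *) mailbox[MAX_THREADS];   /* one slot per waiter thread */

/* Producer: stamp the flag first, then publish the pointer into the waiter's slot. */
static void hand_off (model_bcb *bcb, int waiter_tid)
{
  assert (waiter_tid >= 0 && waiter_tid < MAX_THREADS);
  bcb->flags |= VICTIM_DIRECT;
  atomic_store (&mailbox[waiter_tid], bcb);
}

/* Re-fixer: cannot take an assigned BCB, can only ask the waiter to give it up.
   Assumption: it replaces VICTIM_DIRECT with INVALIDATE_DIRECT_VICTIM. */
static void refix_hit (model_bcb *bcb)
{
  if (bcb->flags & VICTIM_DIRECT)
    bcb->flags = (bcb->flags & ~VICTIM_DIRECT) | INVALIDATE_DIRECT_VICTIM;
}

/* Consumer: read-and-clear the slot, then honor a pending invalidation. */
static model_bcb *collect (int my_tid)
{
  model_bcb *bcb = atomic_exchange (&mailbox[my_tid], (model_bcb *) NULL);
  if (bcb == NULL)
    return NULL;                              /* nothing assigned */
  if (bcb->flags & INVALIDATE_DIRECT_VICTIM)
    {
      bcb->flags &= ~INVALIDATE_DIRECT_VICTIM;  /* clear it, leave the BCB in place */
      return NULL;                              /* caller re-enqueues and sleeps */
    }
  bcb->flags &= ~VICTIM_DIRECT;               /* collected: ownership transferred */
  return bcb;
}
```

Because the exchange clears the slot in the same step that reads it, a second `collect` by the same thread sees NULL, which matches the single-producer, single-consumer slot described in §9.5.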

Figure 9-2 — direct hand-off.

stateDiagram-v2
  [*] --> Clean: bcb flushed or unfixed clean
  Clean --> Assigning: assign direct victim, hold bcb mutex
  Assigning --> NoWaiter: queues drained, return false
  Assigning --> Assigned: live waiter, set VICTIM_DIRECT, write mailbox, wake
  Assigned --> Collected: waiter TAS reads slot, VICTIM_DIRECT seen
  Collected --> Detached: clear VICTIM_DIRECT, remove from LRU, push Aout
  Assigned --> Invalidated: re-fixer sets INVALIDATE_DIRECT_VICTIM
  Invalidated --> ReSleep: waiter clears flag, returns NULL, re-enqueues
  NoWaiter --> [*]
  Detached --> [*]
  ReSleep --> [*]

9.7 `pgbuf_invalidate_bcb` and the already-assigned victim

pgbuf_invalidate_bcb tears a BCB out of the page table when its page is gone (dealloc, volume removal). It is in scope here for exactly one branch: an already-assigned direct victim is left alone — if pgbuf_bcb_is_direct_victim (bufptr) is true the function unlocks and returns NO_ERROR, since the waiting thread will victimize it momentarily and racing to invalidate it would corrupt the hand-off. (The remaining branches — the LATCH_INVALID early no-op, the clear-dirty plus zone removal, and the NO_LATCH hash-chain delete onto the invalid list versus the unexpected assert(false) tail — are the ordinary tear-down path and belong to the BCB-lifecycle chapters.)

9.8 `pgbuf_adjust_quotas` as the supplier

pgbuf_adjust_quotas (full logic in Ch. 10) keeps everything above viable: it recomputes each private list’s quota and zone thresholds, and re-adds to the LFCQs any over-quota list with candidates that fell out of rotation. The quota/threshold values read by §9.2’s stage gates and §9.4’s re-enqueue test all originate here.

9.9 Chapter summary — key takeaways

pgbuf_get_victim is a four-stage priority scan: own private (over quota) → other private (big first, then ordinary unless restricted) → shared (loops only without a flush daemon) → own private ignoring quota as a deadlock-avoidance last resort. NULL means “sleep and wait for a direct victim.”
The LFCQs hold list indices, not BCBs. A list is enqueued iff it has candidates and PGBUF_LRU_VICTIM_LFCQ_FLAG is set; consumers re-enqueue while over quota with candidates, else clear the flag so it re-enters lazily.
Quota gating protects working sets. A private list is victimizable only while over quota; big-private (>100, >2*quota, >1 cand) lists are re-queued before scanning so they drain in parallel.
pgbuf_get_victim_from_lru_list re-validates every candidate under its own BCB lock, applying the four exclusions in PGBUF_BCB_INVALID_VICTIM_CANDIDATE_MASK. TO_VACUUM is deliberately not an exclusion here.
victim_hint is a performance hint, not a safety property. Its documented drift only wastes scan steps; correctness comes entirely from per-BCB re-validation under lock.
Direct hand-off is a mailbox protocol. The producer picks a waiter (High/Low at 75/25), stamps VICTIM_DIRECT, writes bcb_victims[tid] under the waiter’s entry lock, and wakes it.
The consumer handles the re-fix race: observing INVALIDATE_DIRECT_VICTIM it clears the flag, leaves the BCB in place, returns NULL so the caller re-sleeps; pgbuf_invalidate_bcb likewise leaves an already-assigned direct victim untouched.

Chapter 10: Adaptive Quotas Ordered Fix and Special Paths

Three families sit outside the single-page lifecycle of Chapters 3-9: the adaptive quota daemon (100 ms) re-sizing private LRU lists, ordered fix (multi-page deadlock avoidance over pgbuf_fix), and special fix paths that each bypass part of the normal path. For private/shared lists, victim zones, and LFCQ queues see the companion cubrid-page-buffer-manager.md and Chapter 9 — not re-derived here.

10.1 The two structs: `pgbuf_page_quota` and `pgbuf_watcher`

pgbuf_page_quota (one instance, pgbuf_Pool.quota) holds the global state pgbuf_adjust_quotas reads/writes. Per-list outputs (quota, threshold_lru1/2) live on each PGBUF_LRU_LIST (Chapter 1), not here.

Field	Role	Why it exists
`num_private_LRU_list`	private-list count; `PGBUF_PAGE_QUOTA_IS_ENABLED` is `> 0`	master enable switch; 0 makes the subsystem inert
`lru_victim_flush_priority_per_lru`	per-list float, flush priority	flush daemon (Ch 8) biases flushing toward over-quota lists
`private_lru_session_cnt`	active sessions per private list	`pgbuf_assign_private_lru` picks the zero-session list first
`private_pages_ratio`	fraction of BCBs that should be private	smoothed target driving `all_private_quota`
`add_shared_lru_idx`	circular index for shared-list relocation	round-robins shared-LRU assignment on private-to-shared migration
`avoid_shared_lru_idx`	shared list to avoid when relocating	steers traffic off the fattest list so it drains via victimization
`last_adjust_time`	`TSC_TICKS` of last adjust	gates the 1 ms / 500 ms cadence checks
`adjust_age`	monotonic counter, bumped each adjust	generation stamp other code compares against
`is_adjusting`	re-entrancy guard	only one thread runs the adjust body at a time

pgbuf_watcher is the caller-owned ordered-fix handle: stack-allocated, init’d with PGBUF_INIT_WATCHER(w, rank, hfid), passed to pgbuf_ordered_fix, then threaded onto the holder’s watcher list so the machinery can re-fix it.

Field	Role	Why it exists
`pgptr`	fixed page, or NULL	fix output; also the “is watcher live” test (`PGBUF_IS_CLEAN_WATCHER`)
`next` / `prev`	links in the holder’s watcher list	one holder (one fixed BCB) may carry several watchers
`group_id`	VPID of the group’s heap-header page	deadlock key: pages of one heap share a group
`latch_mode` (7 bits)	latch held by this watcher	re-fix restores the same mode; WRITE on any watcher promotes the page
`page_was_unfixed` (1 bit)	true if ordered fix had to unfix-and-refix this page	tells caller “in-page pointers moved; re-read”
`initial_rank` (4 bits)	rank set at init time	caller’s declared rank before any fix
`curr_rank` (4 bits)	effective rank after fix	promoted to `PGBUF_ORDERED_HEAP_HDR` if this page is its own group header
`magic` (debug)	`0x12345678`	catches an uninitialized/garbage watcher
`watched_at` / `init_at` (debug)	source location strings	leak / double-fix diagnostics

Invariant (watcher consistency per page): every watcher on the same physical page must carry the same curr_rank and group_id. pgbuf_ordered_fix_release enforces this while scanning a holder’s watcher list; a mismatch raises ER_PB_ORDERED_INCONSISTENCY (fatal). If violated, the rank-then-VPID sort below is ill-defined and the deadlock guarantee collapses.

The rank ordering is the enum PGBUF_ORDERED_RANK (in page_buffer.h): PGBUF_ORDERED_HEAP_HDR = 0 (fixed first) < PGBUF_ORDERED_HEAP_NORMAL < PGBUF_ORDERED_HEAP_OVERFLOW (fixed last) < PGBUF_ORDERED_RANK_UNDEFINED (sentinel). A pgbuf_watcher hangs off a PGBUF_HOLDER’s first_watcher..last_watcher chain (whose bufptr is the fixed PGBUF_BCB) and tags the page via group_id with the heap-header VPID defining its group.

10.2 `pgbuf_adjust_quotas` — recomputing private quotas every 100 ms

The Page Maintenance Daemon (100 ms cubthread::looper) calls pgbuf_page_maintenance_execute, which after a boot guard (BO_IS_FLUSH_DAEMON_AVAILABLE) calls pgbuf_adjust_quotas then pgbuf_direct_victims_maintenance (10.3). Cadence gates and exits:

Disabled / already running. if (!PGBUF_PAGE_QUOTA_IS_ENABLED || quota->is_adjusting) return; else is_adjusting = 1.
Too soon (< 1 ms). if diff_usec < 1000, clear the guard, return.
Low activity and < 500 ms. if pg_unfix_cnt < PGBUF_TRAN_THRESHOLD_ACTIVITY (num_buffers/4) and < 500 ms elapsed, bail. Busy pool adjusts ~every 1 ms; idle pool waits 500 ms.
Very low activity flag. pg_unfix_cnt.exchange(0) reads-and-resets; if the prior value < THRESHOLD/100, set low_overall_activity = true.

Then it stamps last_adjust_time and bumps adjust_age.

Phase A — per-list hits. One pass over PGBUF_TOTAL_LRU_COUNT lists. lru_hits = ATOMIC_TAS_32(&monitor->lru_hits[i], 0) is read-and-reset, scaled to hits/sec, accumulated into lru_private_hits/lru_shared_hits, and total_victims += PGBUF_GET_LRU_LIST(i)->count_vict_cand. Each private list’s activity sample is history-smoothed: if diff_usec >= tensec_usec (>10 s) monitor->lru_activity[i] = lru_hits (old sample stale, replace); else it is the time-weighted blend ((tensec_usec - diff_usec) * old + diff_usec * lru_hits) / tensec_usec.
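
As a worked instance of the history smoothing, the blend can be written as a standalone function. This is a sketch: `smooth_activity` is a hypothetical name and the model uses double where the source may use a narrower type; the formula itself is the one quoted above.

```c
#include <assert.h>

#define TENSEC_USEC 10000000L   /* 10 s in microseconds */

/* old_activity: previous smoothed sample; lru_hits: new hits/sec sample;
   diff_usec: time elapsed since the last adjust. */
static double smooth_activity (double old_activity, double lru_hits, long diff_usec)
{
  assert (diff_usec >= 0);
  if (diff_usec >= TENSEC_USEC)
    return lru_hits;            /* old sample is stale: replace it outright */
  /* time-weighted blend: the newer the old sample, the more it counts */
  return ((double) (TENSEC_USEC - diff_usec) * old_activity
          + (double) diff_usec * lru_hits) / (double) TENSEC_USEC;
}
```

For example, with an old sample of 100, a new sample of 200, and 5 s elapsed, both samples get weight one half and the result is 150. With 0 s elapsed the old sample is returned unchanged.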

Phase B — private ratio. If low_overall_activity, force private_ratio = MIN_PRIVATE_RATIO (starve privates); else lru_private_hits / (private + shared) clamped to [0.01, 0.998] (shared floored to 1), then 10 s-smoothed into quota->private_pages_ratio.
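
The Phase B sample (before the 10 s smoothing) reduces to a clamp. A minimal sketch, assuming the bounds 0.01 and 0.998 given above; `private_ratio_sample` is a hypothetical name and the smoothing step is not repeated here.

```c
#include <assert.h>
#include <stdbool.h>

#define MIN_PRIVATE_RATIO 0.01
#define MAX_PRIVATE_RATIO 0.998

static double private_ratio_sample (double private_hits, double shared_hits,
                                    bool low_overall_activity)
{
  double ratio;
  assert (private_hits >= 0.0 && shared_hits >= 0.0);
  if (low_overall_activity)
    return MIN_PRIVATE_RATIO;           /* idle pool: starve the private lists */
  if (shared_hits < 1.0)
    shared_hits = 1.0;                  /* floor shared hits so the divisor is never 0 */
  ratio = private_hits / (private_hits + shared_hits);
  if (ratio < MIN_PRIVATE_RATIO)
    ratio = MIN_PRIVATE_RATIO;
  if (ratio > MAX_PRIVATE_RATIO)
    ratio = MAX_PRIVATE_RATIO;
  return ratio;
}
```

The upper clamp keeps a small shared budget alive even when every hit lands on a private list.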

Phase C — redistribute (two mutually exclusive branches):

No private activity (sum_private_lru_activity_total == 0): all_private_quota = 0; every private list gets quota = threshold_lru1 = threshold_lru2 = 0, pgbuf_lru_adjust_zones under the list mutex if it still holds pages, and a push onto the victim LFCQ (pgbuf_lfcq_add_lru_with_victims) if it has over-quota candidates.
Some private activity (else): the budget is all_private_quota = (int)((num_buffers - invalid_cnt) * quota->private_pages_ratio), split per list proportional to activity:

// pgbuf_adjust_quotas (phase C, active) -- src/storage/page_buffer.c
new_quota = (int) ((float) lru_activity[i] / sum_private_lru_activity_total * all_private_quota);
new_quota = MIN (new_quota, PGBUF_PRIVATE_LRU_MAX_HARD_QUOTA);  /* absolute cap */
new_quota = MIN (new_quota, num_buffers / 2);                  /* half-pool cap */
lru_list->threshold_lru1 = lru_list->threshold_lru2 = (int) (new_quota * PGBUF_LRU_ZONE_MIN_RATIO);

The two caps stop a single list monopolizing the pool; threshold_lru1/2 are the zone sizes the Chapter 7 unfix path reads.
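
A worked example makes the proportional split and the two caps concrete. This is a sketch, not the source: `list_quota` and `split_quota` are hypothetical, and the hard cap and zone ratio are passed as parameters because their real values (PGBUF_PRIVATE_LRU_MAX_HARD_QUOTA, PGBUF_LRU_ZONE_MIN_RATIO) are not quoted in this chapter.

```c
#include <assert.h>

typedef struct { int quota; int threshold_lru1; int threshold_lru2; } list_quota;  /* model of the per-list outputs */

static int min_int (int a, int b) { return a < b ? a : b; }

static list_quota split_quota (double list_activity, double total_activity,
                               int all_private_quota, int hard_cap,
                               int num_buffers, double zone_min_ratio)
{
  list_quota out;
  int new_quota;
  assert (total_activity > 0.0);        /* the zero-activity branch is handled separately */

  new_quota = (int) (list_activity / total_activity * all_private_quota);
  new_quota = min_int (new_quota, hard_cap);          /* absolute cap */
  new_quota = min_int (new_quota, num_buffers / 2);   /* half-pool cap */

  out.quota = new_quota;
  out.threshold_lru1 = out.threshold_lru2 = (int) (new_quota * zone_min_ratio);
  return out;
}
```

With illustrative numbers: a list holding 3/4 of the activity and a private budget of 8000 would get 6000 pages uncapped. An assumed hard cap of 5000 trims that to 5000, and in a 6000-buffer pool the half-pool cap trims it further to 3000.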

Phase D — shared lists. Leftover budget spreads evenly: avg_shared_lru_size = (num_buffers - all_private_quota) / num_LRU_list, floored at PGBUF_MIN_SHARED_LIST_ADJUST_SIZE; threshold_lru1/2 from the configured ratio_lru1/2. Each over-threshold shared list is re-zoned and, if it has candidates, queued for victims.

Phase E: victim_rich and release. The tail sets monitor.victim_rich = total_victims >= (int)(0.1 * num_buffers), then clears the guard with quota->is_adjusting = 0. victim_rich is Chapter 9’s cheap “push hard on victimization?” hint: true once victim candidates reach 10% of the pool.

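Invariant (single-writer adjust): is_adjusting is set as soon as gate 1 passes and must be cleared on every exit taken after that point: the two cadence early returns (gates 2 and 3) and the tail. Gate 1 itself returns before the flag is set, so it has nothing to clear, and gate 4 only sets a flag. This is not a mutex; it is a best-effort flag for a single-threaded daemon. An early return that forgets to clear it freezes the subsystem forever, because every later call exits at gate 1.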

flowchart TD
  B{"enabled and\nnot adjusting?"} -->|no| Z["return"]
  B -->|yes| C["is_adjusting=1"]
  C --> D{"diff<1ms?"}
  D -->|yes| Y["is_adjusting=0;\nreturn"]
  D -->|no| E{"low activity\nand diff<500ms?"}
  E -->|yes| Y
  E -->|no| K{"sum activity==0?"}
  K -->|yes| J["all private quota=0"]
  K -->|no| S["split by activity,\ncap abs and pool/2"]
  J --> L["shared thresholds;\nvictim_rich; is_adjusting=0"]
  S --> L

Figure 10-1. pgbuf_adjust_quotas: cadence gates 1-3 (gate 4 only sets a flag and is not drawn) and the two Phase C redistribution branches.

10.3 `pgbuf_direct_victims_maintenance` — the backup victim hand-off

The fast path assigns victims as a side effect of unfix/flush; on an idle system that never fires, so a blocked thread could starve. This backup walks lists round-robin and hands victims out directly, once over private lists and once over shared:

// pgbuf_direct_victims_maintenance -- src/storage/page_buffer.c
static int prv_index = 0;    /* round-robin cursors, single-threaded use only */
static int shr_index = 0;
for (index = prv_index, restarted = false;
     pgbuf_is_any_thread_waiting_for_direct_victim () && nassigns > 0
       && (index != prv_index || !restarted);   /* <- stop only after a full wrap back to the start */
     (index == PGBUF_PRIVATE_LRU_COUNT - 1) ? index = 0, restarted = true : index++)
  pgbuf_lfcq_assign_direct_victims (thread_p, PGBUF_LRU_INDEX_FROM_PRIVATE (index), &nassigns);
prv_index = index;           /* persist cursor for next tick */
// ... a second, structurally identical loop over shared lists, then shr_index = index ...

The cursor starts at prv_index, so index != prv_index cannot be a stop condition on its own: it is false on the very first check. It is paired with restarted as (index != prv_index || !restarted), which ends the sweep only once the iterator has wrapped past the last list and come back to its starting index. Each loop stops when (a) no thread waits, (b) the per-iteration budget DEFAULT_ASSIGNS_PER_ITERATION (5) is spent, or (c) it has wrapped and returned to where it started. The prv_index = index / shr_index = index write-backs are what make the static cursors persist across ticks, so each tick resumes where the previous one stopped; that is why the cursors are safe for single-threaded use only. pgbuf_lfcq_assign_direct_victims retries from lru_list->bottom if the cached victim_hint is stale (a CAS resets it), self-healing the hint.
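
The round-robin sweep with a persistent cursor can be simulated in isolation. A minimal sketch under stated assumptions: `sweep` is a hypothetical function, each list is reduced to a count of victims it can give, and the waiter check is a plain counter rather than a queue probe.

```c
#include <assert.h>
#include <stdbool.h>

/* avail[i]: victims list i can give; *waiters: threads still waiting.
   Returns how many victims were handed out; *cursor persists across calls. */
static int sweep (int *avail, int n_lists, int *cursor, int *waiters, int budget)
{
  int start = *cursor, index = start, assigned = 0;
  bool restarted = false;
  assert (n_lists > 0 && start >= 0 && start < n_lists);

  while (*waiters > 0 && budget > 0 && (index != start || !restarted))
    {
      int give = avail[index];
      if (give > budget)
        give = budget;
      if (give > *waiters)
        give = *waiters;
      avail[index] -= give;
      budget -= give;
      *waiters -= give;
      assigned += give;

      if (index == n_lists - 1)
        {
          index = 0;              /* wrap around once */
          restarted = true;
        }
      else
        index++;
    }
  *cursor = index;                /* next call resumes here */
  return assigned;
}
```

With four lists offering {0, 3, 0, 4} victims, a cursor at 1 and a budget of 5, the sweep visits lists 1, 2, 3, hands out 3 + 0 + 2 victims, and leaves the cursor at 0 for the next tick. With nothing to give, it visits every list exactly once and stops back at its starting index.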

10.4 `pgbuf_ordered_fix_release` — multi-page deadlock avoidance

Heap ops hold several pages at once; fixing them in different orders deadlocks. The heap header must stay fixed first, so plain VPID ordering is insufficient. Ordered fix keeps a rank (header < normal < overflow), sorts by VPID within rank, and — if a new request violates that order against held pages — unfixes the offenders, sorts, re-fixes in order. Entry contract: req_watcher->pgptr must be NULL else ER_FAILED_ASSERTION; curr_rank becomes PGBUF_ORDERED_HEAP_HDR if the requested VPID is the group header, else initial_rank.

Branch 1 — conditional first attempt. If the thread holds no other page (holder == NULL, or holder->thrd_link == NULL && VPID_EQ(req_vpid, &holder->bufptr->vpid) — only this one), use PGBUF_UNCONDITIONAL_LATCH; else PGBUF_CONDITIONAL_LATCH so a would-be deadlock fails fast. Then ret_pgptr = pgbuf_fix_release(...).

Got it: find the holder, resolve group id (existing watcher, or fix the heap header via pgbuf_get_groupid_and_unfix if PAGE_HEAP), attach via pgbuf_add_watch_instance_internal, goto exit. Common no-reorder case.
Did not get it, branch on error: ER_PB_BAD_PAGEID/ER_INTERRUPTED → exit; OLD_PAGE_MAYBE_DEALLOCATED+ER_PB_BAD_PAGEID → treat deallocated, exit; LK_ZERO_WAIT/LK_FORCE_ZERO_WAIT → ER_LK_PAGE_TIMEOUT (no error set for force, scans continue), exit; UNCONDITIONAL → already blocked and failed, exit with the error; else (conditional failed) → fall through to reorder, clearing er_status.

Branch 2 — classify held pages. Walk holders, skipping watch_count <= 0 (no watcher; assumed deadlock-safe). Gather each watched holder’s watchers into ordered_holders_info[], verifying the 10.1 invariant. diff = pgbuf_compare_hold_vpid_for_sort(req, held): diff < 0 (held sorts after req) → save for unfix; diff == 0 (same page) → ER_FAILED_ASSERTION; diff > 0 (held sorts before) → leave fixed. If the request has no group yet (req_page_has_group == false), diff is forced -1 so all watched pages are unfixed and re-fixed.

Branch 3 — unfix the out-of-order pages. For each saved entry, pgbuf_bcb_register_avoid_deallocation(holder->bufptr) pins it across the gap, pgbuf_unfix runs fix_count times, then each watcher is PGBUF_CLEAR_WATCHER’d and gets pg_watcher->page_was_unfixed = true.

Branch 4 — resolve missing group, then sort. If req had no group, re-fix it unconditionally, derive its group id (clear dealloc-prevent if OLD_PAGE_PREVENT_DEALLOC downgraded to OLD_PAGE), append the requested page, qsort by pgbuf_compare_hold_vpid_for_sort (rank, volid, pageid).

Branch 5 — re-fix in sorted order. Requested page uses caller’s request_mode/fetch_mode; restored pages use saved latch_mode+OLD_PAGE. All PGBUF_UNCONDITIONAL_LATCH now — order guaranteed, blocking deadlock-free. Failures: ER_INTERRUPTED exits; ER_PB_BAD_PAGEID on the requested page under OLD_PAGE_MAYBE_DEALLOCATED is tolerated; failure restoring a held page → ER_PB_ORDERED_REFIX_FAILED (serious — watchers partially live).

Invariant (caller must honor page_was_unfixed): any unfixed-and-refixed watcher has page_was_unfixed == true. Pointers cached into that page may now be invalid — another thread could have modified it during the gap. Reusing a stale pointer reads corrupt data — the single most important contract of ordered fix.
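
The ordering at the heart of Branches 2 and 4 can be sketched with a comparator and qsort. This is an illustration only: `held_page`, `compare_held`, `must_unfix` and `sort_for_refix` are hypothetical names, and the key is the (rank, volid, pageid) triple named above; any additional keys the real pgbuf_compare_hold_vpid_for_sort may use are not modelled.

```c
#include <assert.h>
#include <stdlib.h>

typedef enum { RANK_HEAP_HDR = 0, RANK_HEAP_NORMAL, RANK_HEAP_OVERFLOW } model_rank;
typedef struct { model_rank rank; short volid; int pageid; } held_page;

/* Order by rank first, then by VPID (volid, pageid). */
static int compare_held (const void *p1, const void *p2)
{
  const held_page *a = (const held_page *) p1;
  const held_page *b = (const held_page *) p2;
  if (a->rank != b->rank)
    return a->rank < b->rank ? -1 : 1;
  if (a->volid != b->volid)
    return a->volid < b->volid ? -1 : 1;
  if (a->pageid != b->pageid)
    return a->pageid < b->pageid ? -1 : 1;
  return 0;
}

/* A held page that sorts after the request must be unfixed before the request is fixed. */
static int must_unfix (const held_page *req, const held_page *held)
{
  int diff = compare_held (req, held);
  assert (diff != 0);             /* requesting a page already held is the error case */
  return diff < 0;
}

/* Put the unfixed pages plus the request into the order they will be re-fixed in. */
static void sort_for_refix (held_page *pages, size_t n)
{
  qsort (pages, n, sizeof pages[0], compare_held);
}
```

Because rank dominates the key, a heap header always sorts first regardless of its VPID, which is what lets it stay fixed while lower-ranked pages are released and re-acquired.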

10.5 `pgbuf_ordered_unfix` — watcher-aware unfix

pgbuf_ordered_unfix is the counterpart. If watcher_object->pgptr == NULL it fires assert_release(false) and returns. Otherwise pgbuf_get_holder finds the holder, a for (watcher = holder->last_watcher; ...; watcher = watcher->prev) scan finds the exact watcher, and after the invariant check assert(holder->fix_count >= holder->watch_count) it calls pgbuf_remove_watcher followed by a single pgbuf_unfix.

Invariant (fix_count >= watch_count): a holder may be fixed more than watched (a plain fix adds no watcher) but never the reverse. Asserted here and in 10.4’s classification loop. Violation means a watcher outlived its fix — a use-after-unfix.

10.6 `pgbuf_promote_read_latch_release` — R-to-W promotion

Converts a held READ latch to WRITE without fully unfixing, per a PGBUF_PROMOTE_CONDITION; a CAS loop on the packed atomic_latch (Chapter 5):

Sole reader (holder->fix_count == impl.fcnt): unless the next waiter is a promoter (then fail ER_PAGE_LATCH_PROMOTE_FAIL), flip impl_new.latch_mode = PGBUF_LATCH_WRITE in place. No blocking.
Other readers present (fix_count != fcnt): PGBUF_PROMOTE_ONLY_READER or next waiter is a promoter → fail ER_PAGE_LATCH_PROMOTE_FAIL (CASE #1/#2). PGBUF_PROMOTE_SHARED_READER → subtract our fixes from fcnt, mark waiter_exists, set need_block, leave the loop.

If need_block, it effectively unfixes (holder->fix_count = 0, remove holder), sets thread_p->wait_for_latch_promote = true, blocks via pgbuf_block_bcb for WRITE, and on wake re-allocates a holder with the saved fix_count.

Invariant (promoter mutual exclusion): at most one waiter on a BCB carries wait_for_latch_promote; both branches that detect a promoter waiter abort rather than queue behind it. Two blocked promoters would each wait for the other to drop its read latch — a deadlock. The abort returns ER_PAGE_LATCH_PROMOTE_FAIL; callers retry by unfix + fix WRITE.
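
The promotion outcomes reduce to a small decision function. A sketch under assumptions: `decide_promotion` and both enums are hypothetical, the CAS loop and the actual blocking are omitted, and only the branch structure described above is modelled.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum { ONLY_READER, SHARED_READER } promote_condition;   /* models PGBUF_PROMOTE_* */
typedef enum { FLIP_IN_PLACE, BLOCK_FOR_WRITE, PROMOTE_FAIL } promote_outcome;

/* holder_fix_count: this thread's fixes on the page;
   latch_fcnt: total fix count stored in the packed latch word. */
static promote_outcome decide_promotion (int holder_fix_count, int latch_fcnt,
                                         promote_condition cond,
                                         bool next_waiter_is_promoter)
{
  assert (holder_fix_count > 0 && holder_fix_count <= latch_fcnt);
  if (next_waiter_is_promoter)
    return PROMOTE_FAIL;          /* never queue behind another promoter */
  if (holder_fix_count == latch_fcnt)
    return FLIP_IN_PLACE;         /* sole reader: switch READ to WRITE, no blocking */
  /* other readers present */
  return cond == SHARED_READER ? BLOCK_FOR_WRITE : PROMOTE_FAIL;
}
```

Checking for a promoter waiter before anything else is what enforces the mutual-exclusion invariant: both the sole-reader and the shared-reader paths abort in that case.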

10.7 Remaining special fix paths

pgbuf_simple_fix / pgbuf_simple_unfix — temp files only. Latchless, LRU-mutexless; only add_fcnt(&bufptr->atomic_latch, 1), never latches (“Cannot be mixed with general FIX(LATCH)”). Resident → if a direct-victim claim is pending, invalidate it when need_fix, else NULL. Absent → if need_fix: lock hash, pgbuf_claim_bcb_for_fix, insert, add to private/shared LRU; else NULL. Unfix = lock, add_fcnt(..., -1), unlock.
pgbuf_fix_if_not_deallocated — vacuum’s dealloc-aware fix. disk_is_page_sector_reserved first: DISK_INVALID → NO_ERROR, *page = NULL (deallocated, not error); DISK_ERROR → propagate; DISK_VALID → real fix with OLD_PAGE_MAYBE_DEALLOCATED, then a NULL + ER_PB_BAD_PAGEID is swallowed (raced) unless mid recovery-redo.
pgbuf_invalidate — drop a page (caller holds WRITE). fcnt > 1 → just unfix (pgbuf_unlatch_thrd_holder + pgbuf_unlatch_bcb_upon_unfix), no invalidation. fcnt == 1 → flush if dirty (pgbuf_bcb_safe_flush_force_lock), record VPID, unfix, re-lock, re-check; if the BCB was reused, re-fixed, or avoid-victim → skip, else pgbuf_invalidate_bcb detaches it. Persistent pages run this as a post-commit postpone; temp pages unconditionally. pgbuf_invalidate_all iterates a volume.
pgbuf_notify_vacuum_follows / pgbuf_bcb_is_to_vacuum — vacuum hint. Sets PGBUF_BCB_TO_VACUUM_FLAG via pgbuf_bcb_update_flags(thread_p, bcb, PGBUF_BCB_TO_VACUUM_FLAG, 0) (“vacuum will revisit, prefer not to victimize”). pgbuf_bcb_is_to_vacuum tests it; the victim-flush path clears it on commit-to-flush, so the hint is one-shot.
TDE hook (out-of-scope). pgbuf_set_tde_algorithm early-returns if the algorithm is unchanged, else clears the existing bits (pflag &= ~FILEIO_PAGE_FLAG_ENCRYPTED_MASK) and ORs in the new encryption bit in iopage->prv.pflag (FILEIO_PAGE_FLAG_ENCRYPTED_AES/_ARIA), logs undoredo unless skip_logging, marks dirty. The page buffer only carries these bits through dirty/flush (Ch 6); en/decryption is the TDE module’s. Noted only so a maintainer knows pflag has a TDE tenant.

10.8 Chapter summary — key takeaways

The 100 ms Page Maintenance Daemon runs pgbuf_adjust_quotas, then pgbuf_direct_victims_maintenance (the idle-system victim backup). pgbuf_adjust_quotas is gated by the 1 ms / 500 ms cadence and the num_buffers/4 activity threshold, and guarded by the single-writer is_adjusting flag, which must be cleared on every exit taken after it is set. It sets per-list quota and threshold_lru1/2 (quota capped at PGBUF_PRIVATE_LRU_MAX_HARD_QUOTA and at half the pool) plus victim_rich (victim candidates at 10% of the pool or more).
Ordered fix ranks pages (heap-hdr < normal < overflow) then sorts by VPID; held pages sorting after the request are unfixed, the set qsorted, all re-fixed unconditionally. Its load-bearing output page_was_unfixed means the caller must re-read that page (cached pointers may be stale).
Watcher invariants are checked at run time: a rank/group mismatch among watchers of one page raises ER_PB_ORDERED_INCONSISTENCY, and fix_count >= watch_count per holder is asserted. pgbuf_promote_read_latch_release flips R-to-W in place as sole reader, blocks under PGBUF_PROMOTE_SHARED_READER, and aborts with ER_PAGE_LATCH_PROMOTE_FAIL if another promoter waits.
Each special path bypasses one step: pgbuf_simple_fix (latchless temp), pgbuf_fix_if_not_deallocated (deallocated = NULL non-error), pgbuf_invalidate (detach a singly-fixed page), pgbuf_notify_vacuum_follows (one-shot anti-victim hint); the TDE pflag tenant is the out-of-scope boundary.

Position hints as of this revision

The following are line numbers as observed on 2026-06-17; symbols are the canonical anchor and line numbers are hints that decay.

Symbol	File	Line
`dwb_set_data_on_next_slot`	`src/storage/double_write_buffer.cpp`	2686
`dwb_add_page`	`src/storage/double_write_buffer.cpp`	2726
`dwb_is_created`	`src/storage/double_write_buffer.cpp`	2909
`fileio_page_reserved`	`src/storage/file_io.h`	166
`fileio_page_watermark`	`src/storage/file_io.h`	179
`fileio_page`	`src/storage/file_io.h`	186
`PGBUF_DEFAULT_FIX_COUNT`	`src/storage/page_buffer.c`	90
`PGBUF_NUM_ALLOC_HOLDER`	`src/storage/page_buffer.c`	94
`PGBUF_FIX_COUNT_THRESHOLD`	`src/storage/page_buffer.c`	106
`pgbuf_latch_timeout`	`src/storage/page_buffer.c`	107
`PGBUF_IOPAGE_BUFFER_SIZE`	`src/storage/page_buffer.c`	118
`PGBUF_FIND_BCB_PTR`	`src/storage/page_buffer.c`	135
`PGBUF_LRU_NBITS`	`src/storage/page_buffer.c`	148
`PGBUF_LRU_INDEX_MASK`	`src/storage/page_buffer.c`	150
`PGBUF_LRU_INDEX_MASK`	`src/storage/page_buffer.c`	182
`PGBUF_LRU_1_ZONE`	`src/storage/page_buffer.c`	197
`PGBUF_LRU_ZONE_MASK`	`src/storage/page_buffer.c`	201
`PGBUF_INVALID_ZONE`	`src/storage/page_buffer.c`	205
`PGBUF_VOID_ZONE`	`src/storage/page_buffer.c`	206
`PGBUF_ZONE_MASK`	`src/storage/page_buffer.c`	211
`PGBUF_GET_ZONE`	`src/storage/page_buffer.c`	215
`PGBUF_GET_LRU_INDEX`	`src/storage/page_buffer.c`	216
`PGBUF_BCB_DIRTY_FLAG`	`src/storage/page_buffer.c`	224
`PGBUF_BCB_FLUSHING_TO_DISK_FLAG`	`src/storage/page_buffer.c`	227
`PGBUF_BCB_VICTIM_DIRECT_FLAG`	`src/storage/page_buffer.c`	234
`PGBUF_BCB_INVALIDATE_DIRECT_VICTIM_FLAG`	`src/storage/page_buffer.c`	235
`PGBUF_BCB_MOVE_TO_LRU_BOTTOM_FLAG`	`src/storage/page_buffer.c`	237
`PGBUF_BCB_TO_VACUUM_FLAG`	`src/storage/page_buffer.c`	239
`PGBUF_BCB_ASYNC_FLUSH_REQ`	`src/storage/page_buffer.c`	241
`PGBUF_BCB_FLAGS_MASK`	`src/storage/page_buffer.c`	244
`PGBUF_BCB_INVALID_VICTIM_CANDIDATE_MASK`	`src/storage/page_buffer.c`	258
`PGBUF_BCB_INIT_FLAGS`	`src/storage/page_buffer.c`	265
`PGBUF_BCB_COUNT_FIX_SHIFT_BITS`	`src/storage/page_buffer.c`	268
`PGBUF_BCB_AVOID_DEALLOC_MASK`	`src/storage/page_buffer.c`	269
`PGBUF_TRAN_THRESHOLD_ACTIVITY`	`src/storage/page_buffer.c`	276
`PGBUF_AOUT_NOT_FOUND`	`src/storage/page_buffer.c`	279
`PGBUF_SHOULD_IGNORE_UNFIX`	`src/storage/page_buffer.c`	290
`HASH_SIZE_BITS`	`src/storage/page_buffer.c`	295
`PGBUF_HASH_SIZE`	`src/storage/page_buffer.c`	296
`PGBUF_HASH_VALUE`	`src/storage/page_buffer.c`	300
`PGBUF_FLUSH_VICTIM_BOOST_MULT`	`src/storage/page_buffer.c`	305
`PGBUF_NEIGHBOR_FLUSH_NONDIRTY`	`src/storage/page_buffer.c`	307
`PGBUF_MAX_NEIGHBOR_PAGES`	`src/storage/page_buffer.c`	310
`PGBUF_NEIGHBOR_POS`	`src/storage/page_buffer.c`	314
`PGBUF_CHKPT_MAX_FLUSH_RATE`	`src/storage/page_buffer.c`	322
`PGBUF_CHKPT_MIN_FLUSH_RATE`	`src/storage/page_buffer.c`	323
`PGBUF_CHKPT_BURST_PAGES`	`src/storage/page_buffer.c`	326
`PGBUF_LRU_ZONE_MIN_RATIO`	`src/storage/page_buffer.c`	342
`PGBUF_LOCK_HOLDER`	`src/storage/page_buffer.c`	348
`pgbuf_holder_stat`	`src/storage/page_buffer.c`	441
`pgbuf_batch_flush_helper`	`src/storage/page_buffer.c`	451
`pgbuf_holder`	`src/storage/page_buffer.c`	461
`pgbuf_holder_anchor`	`src/storage/page_buffer.c`	479
`pgbuf_holder_set`	`src/storage/page_buffer.c`	488
`pgbuf_atomic_latch_impl`	`src/storage/page_buffer.c`	494
`pgbuf_bcb`	`src/storage/page_buffer.c`	506
`atomic_latch`	`src/storage/page_buffer.c`	513
`flags`	`src/storage/page_buffer.c`	514
`next_wait_thrd`	`src/storage/page_buffer.c`	516
`count_fix_and_avoid_dealloc`	`src/storage/page_buffer.c`	528
`oldest_unflush_lsa`	`src/storage/page_buffer.c`	536
`pgbuf_iopage_buffer`	`src/storage/page_buffer.c`	541
`struct pgbuf_iopage_buffer`	`src/storage/page_buffer.c`	541
`pgbuf_buffer_lock`	`src/storage/page_buffer.c`	557
`struct pgbuf_buffer_lock`	`src/storage/page_buffer.c`	557
`pgbuf_buffer_hash`	`src/storage/page_buffer.c`	570
`pgbuf_lru_list`	`src/storage/page_buffer.c`	580
`victim_hint`	`src/storage/page_buffer.c`	589
`count_vict_cand`	`src/storage/page_buffer.c`	602
`pgbuf_invalid_list`	`src/storage/page_buffer.c`	621
`struct pgbuf_invalid_list`	`src/storage/page_buffer.c`	621
`pgbuf_aout_buf`	`src/storage/page_buffer.c`	636
`struct pgbuf_aout_list`	`src/storage/page_buffer.c`	645
`pgbuf_aout_list`	`src/storage/page_buffer.c`	645
`pgbuf_seq_flusher`	`src/storage/page_buffer.c`	669
`struct pgbuf_page_monitor`	`src/storage/page_buffer.c`	688
`struct pgbuf_page_quota`	`src/storage/page_buffer.c`	710
`pgbuf_page_quota`	`src/storage/page_buffer.c`	710
`struct pgbuf_direct_victim`	`src/storage/page_buffer.c`	737
`pgbuf_buffer_pool`	`src/storage/page_buffer.c`	749
`struct pgbuf_victim_candidate_list`	`src/storage/page_buffer.c`	833
`pgbuf_Flush_helper`	`src/storage/page_buffer.c`	840
`AOUT_HASH_IDX`	`src/storage/page_buffer.c`	854
`PGBUF_BCB_LOCK`	`src/storage/page_buffer.c`	869
`PGBUF_BCB_TRYLOCK`	`src/storage/page_buffer.c`	871
`PGBUF_IS_BCB_IN_LRU_VICTIM_ZONE`	`src/storage/page_buffer.c`	919
`PGBUF_IS_BCB_IN_LRU`	`src/storage/page_buffer.c`	920
`PGBUF_IS_BCB_OLD_ENOUGH`	`src/storage/page_buffer.c`	927
`PGBUF_PRIVATE_LRU_MAX_HARD_QUOTA`	`src/storage/page_buffer.c`	943
`PGBUF_MIN_PAGES_IN_SHARED_LIST`	`src/storage/page_buffer.c`	946
`PGBUF_TOTAL_LRU_COUNT`	`src/storage/page_buffer.c`	969
`PGBUF_IS_PRIVATE_LRU_INDEX`	`src/storage/page_buffer.c`	975
`PGBUF_LRU_LIST_IS_OVER_QUOTA`	`src/storage/page_buffer.c`	977
`PGBUF_LRU_LIST_IS_OVER_QUOTA_WITH_BUFFER`	`src/storage/page_buffer.c`	987
`set_latch`	`src/storage/page_buffer.c`	1310
`add_fcnt`	`src/storage/page_buffer.c`	1324
`set_waiter_exists`	`src/storage/page_buffer.c`	1368
`get_latch`	`src/storage/page_buffer.c`	1398
`get_impl`	`src/storage/page_buffer.c`	1406
`pgbuf_thread_variables_init`	`src/storage/page_buffer.c`	1415
`pgbuf_hash_func_mirror`	`src/storage/page_buffer.c`	1441
`pgbuf_hash_vpid`	`src/storage/page_buffer.c`	1480
`pgbuf_compare_vpid`	`src/storage/page_buffer.c`	1494
`pgbuf_initialize`	`src/storage/page_buffer.c`	1518
`pgbuf_finalize`	`src/storage/page_buffer.c`	1796
`pgbuf_fix_with_retry`	`src/storage/page_buffer.c`	1993
`pgbuf_fix_release`	`src/storage/page_buffer.c`	2041
`pgbuf_simple_fix`	`src/storage/page_buffer.c`	2475
`pgbuf_simple_unfix`	`src/storage/page_buffer.c`	2569
`pgbuf_promote_read_latch_debug`	`src/storage/page_buffer.c`	2624
`pgbuf_promote_read_latch_release`	`src/storage/page_buffer.c`	2628
`pgbuf_unfix`	`src/storage/page_buffer.c`	2850
`pgbuf_invalidate`	`src/storage/page_buffer.c`	3158
`pgbuf_flush`	`src/storage/page_buffer.c`	3341
`pgbuf_flush_with_wal`	`src/storage/page_buffer.c`	3364
`pgbuf_flush_if_requested`	`src/storage/page_buffer.c`	3404
`pgbuf_flush_all_helper`	`src/storage/page_buffer.c`	3438
`pgbuf_get_victim_candidates_from_lru`	`src/storage/page_buffer.c`	3564
`pgbuf_flush_victim_candidates`	`src/storage/page_buffer.c`	3645
`pgbuf_flush_checkpoint`	`src/storage/page_buffer.c`	3960
`pgbuf_flush_chkpt_seq_list`	`src/storage/page_buffer.c`	4102
`pgbuf_flush_seq_list`	`src/storage/page_buffer.c`	4210
`pgbuf_set_dirty`	`src/storage/page_buffer.c`	4700
`pgbuf_set_lsa`	`src/storage/page_buffer.c`	4771
`pgbuf_set_tde_algorithm`	`src/storage/page_buffer.c`	4881
`pgbuf_set_bcb_page_vpid`	`src/storage/page_buffer.c`	5214
`pgbuf_initialize_bcb_table`	`src/storage/page_buffer.c`	5334
`pgbuf_initialize_hash_table`	`src/storage/page_buffer.c`	5452
`pgbuf_initialize_lock_table`	`src/storage/page_buffer.c`	5481
`pgbuf_initialize_lru_list`	`src/storage/page_buffer.c`	5519
`pgbuf_initialize_aout_list`	`src/storage/page_buffer.c`	5582
`pgbuf_initialize_invalid_list`	`src/storage/page_buffer.c`	5686
`pgbuf_initialize_thrd_holder`	`src/storage/page_buffer.c`	5701
`pgbuf_allocate_thrd_holder_entry`	`src/storage/page_buffer.c`	5783
`pgbuf_find_thrd_holder`	`src/storage/page_buffer.c`	5870
`pgbuf_remove_thrd_holder`	`src/storage/page_buffer.c`	5971
`pgbuf_latch_bcb_upon_fix`	`src/storage/page_buffer.c`	6073
`pgbuf_unlatch_bcb_upon_unfix`	`src/storage/page_buffer.c`	6417
`pgbuf_unlatch_void_zone_bcb`	`src/storage/page_buffer.c`	6652
`pgbuf_should_move_private_to_shared`	`src/storage/page_buffer.c`	6758
`pgbuf_block_bcb`	`src/storage/page_buffer.c`	6803
`pgbuf_timed_sleep_error_handling`	`src/storage/page_buffer.c`	6925
`pgbuf_timed_sleep`	`src/storage/page_buffer.c`	7014
`pgbuf_wakeup_reader_writer`	`src/storage/page_buffer.c`	7186
`pgbuf_search_hash_chain`	`src/storage/page_buffer.c`	7327
`pgbuf_lockfree_fix_ro`	`src/storage/page_buffer.c`	7452
`pgbuf_search_hash_chain_no_bcb_lock`	`src/storage/page_buffer.c`	7517
`pgbuf_insert_into_hash_chain`	`src/storage/page_buffer.c`	7569
`pgbuf_lock_page`	`src/storage/page_buffer.c`	7718
`pgbuf_unlock_page`	`src/storage/page_buffer.c`	7831
`pgbuf_allocate_bcb`	`src/storage/page_buffer.c`	7916
`pgbuf_claim_bcb_for_fix`	`src/storage/page_buffer.c`	8133
`pgbuf_victimize_bcb`	`src/storage/page_buffer.c`	8372
`pgbuf_invalidate_bcb`	`src/storage/page_buffer.c`	8424
`pgbuf_bcb_safe_flush_force_unlock`	`src/storage/page_buffer.c`	8494
`pgbuf_bcb_safe_flush_force_lock`	`src/storage/page_buffer.c`	8517
`pgbuf_bcb_safe_flush_internal`	`src/storage/page_buffer.c`	8550
`pgbuf_get_bcb_from_invalid_list`	`src/storage/page_buffer.c`	8644
`pgbuf_put_bcb_into_invalid_list`	`src/storage/page_buffer.c`	8693
`pgbuf_get_victim`	`src/storage/page_buffer.c`	8805
`pgbuf_is_bcb_fixed_by_any`	`src/storage/page_buffer.c`	8995
`pgbuf_is_bcb_victimizable`	`src/storage/page_buffer.c`	9023
`pgbuf_get_victim_from_lru_list`	`src/storage/page_buffer.c`	9053
`pgbuf_panic_assign_direct_victims_from_lru`	`src/storage/page_buffer.c`	9279
`pgbuf_direct_victims_maintenance`	`src/storage/page_buffer.c`	9346
`pgbuf_lfcq_assign_direct_victims`	`src/storage/page_buffer.c`	9388
`pgbuf_lru_add_bcb_to_top`	`src/storage/page_buffer.c`	9432
`pgbuf_lru_add_bcb_to_middle`	`src/storage/page_buffer.c`	9482
`pgbuf_lru_add_bcb_to_bottom`	`src/storage/page_buffer.c`	9570
`pgbuf_lru_fall_bcb_to_zone_3`	`src/storage/page_buffer.c`	9788
`pgbuf_lru_boost_bcb`	`src/storage/page_buffer.c`	9858
`pgbuf_lru_move_from_private_to_shared`	`src/storage/page_buffer.c`	10064
`pgbuf_remove_from_lru_list`	`src/storage/page_buffer.c`	10089
`pgbuf_move_bcb_to_bottom_lru`	`src/storage/page_buffer.c`	10157
`pgbuf_add_vpid_to_aout_list`	`src/storage/page_buffer.c`	10201
`pgbuf_remove_vpid_from_aout_list`	`src/storage/page_buffer.c`	10282
`pgbuf_bcb_flush_with_wal`	`src/storage/page_buffer.c`	10456
`pgbuf_wake_flush_waiters`	`src/storage/page_buffer.c`	10694
`pgbuf_is_exist_blocked_reader_writer`	`src/storage/page_buffer.c`	10741
`pgbuf_wakeup`	`src/storage/page_buffer.c`	11319
`pgbuf_set_dirty_buffer_ptr`	`src/storage/page_buffer.c`	11369
`pgbuf_flush_page_and_neighbors_fb`	`src/storage/page_buffer.c`	11527
`pgbuf_flush_neighbor_safe`	`src/storage/page_buffer.c`	11762
`pgbuf_add_bufptr_to_batch`	`src/storage/page_buffer.c`	11820
`pgbuf_ordered_fix_release`	`src/storage/page_buffer.c`	11985
`pgbuf_ordered_unfix`	`src/storage/page_buffer.c`	12860
`pgbuf_add_watch_instance_internal`	`src/storage/page_buffer.c`	12927
`pgbuf_initialize_page_quota_parameters`	`src/storage/page_buffer.c`	13326
`pgbuf_initialize_page_quota`	`src/storage/page_buffer.c`	13370
`pgbuf_initialize_page_monitor`	`src/storage/page_buffer.c`	13430
`pgbuf_adjust_quotas`	`src/storage/page_buffer.c`	13639
`pgbuf_initialize_seq_flusher`	`src/storage/page_buffer.c`	14016
`pgbuf_flush_control_from_dirty_ratio`	`src/storage/page_buffer.c`	14233
`pgbuf_fix_if_not_deallocated_with_caller`	`src/storage/page_buffer.c`	14735
`pgbuf_assign_direct_victim`	`src/storage/page_buffer.c`	14809
`pgbuf_assign_flushed_pages`	`src/storage/page_buffer.c`	14876
`pgbuf_get_thread_waiting_for_direct_victim`	`src/storage/page_buffer.c`	14946
`pgbuf_get_direct_victim`	`src/storage/page_buffer.c`	14978
`pgbuf_lru_advance_victim_hint`	`src/storage/page_buffer.c`	15131
`pgbuf_bcb_update_flags`	`src/storage/page_buffer.c`	15171
`pgbuf_bcb_change_zone`	`src/storage/page_buffer.c`	15269
`pgbuf_bcb_get_zone`	`src/storage/page_buffer.c`	15374
`pgbuf_bcb_get_lru_index`	`src/storage/page_buffer.c`	15386
`pgbuf_bcb_is_dirty`	`src/storage/page_buffer.c`	15400
`pgbuf_bcb_set_dirty`	`src/storage/page_buffer.c`	15412
`pgbuf_bcb_mark_is_flushing`	`src/storage/page_buffer.c`	15463
`pgbuf_bcb_mark_was_flushed`	`src/storage/page_buffer.c`	15486
`pgbuf_bcb_mark_was_not_flushed`	`src/storage/page_buffer.c`	15500
`pgbuf_bcb_is_flushing`	`src/storage/page_buffer.c`	15513
`pgbuf_bcb_should_be_moved_to_bottom_lru`	`src/storage/page_buffer.c`	15561
`pgbuf_notify_vacuum_follows`	`src/storage/page_buffer.c`	15574
`pgbuf_bcb_is_to_vacuum`	`src/storage/page_buffer.c`	15589
`pgbuf_bcb_avoid_victim`	`src/storage/page_buffer.c`	15603
`pgbuf_bcb_register_avoid_deallocation`	`src/storage/page_buffer.c`	15627
`pgbuf_bcb_unregister_avoid_deallocation`	`src/storage/page_buffer.c`	15640
`pgbuf_bcb_should_avoid_deallocation`	`src/storage/page_buffer.c`	15684
`pgbuf_bcb_register_fix`	`src/storage/page_buffer.c`	15720
`pgbuf_bcb_is_hot`	`src/storage/page_buffer.c`	15741
`pgbuf_lfcq_get_victim_from_private_lru`	`src/storage/page_buffer.c`	15802
`pgbuf_lfcq_get_victim_from_shared_lru`	`src/storage/page_buffer.c`	15894
`pgbuf_bcb_register_hit_for_lru`	`src/storage/page_buffer.c`	15979
`pgbuf_get_page_flush_interval`	`src/storage/page_buffer.c`	16353
`pgbuf_page_maintenance_execute`	`src/storage/page_buffer.c`	16375
`pgbuf_page_flush_daemon_task`	`src/storage/page_buffer.c`	16396
`pgbuf_page_maintenance_daemon_init`	`src/storage/page_buffer.c`	16531
`pgbuf_page_flush_daemon_init`	`src/storage/page_buffer.c`	16549
`pgbuf_page_post_flush_daemon_init`	`src/storage/page_buffer.c`	16567
`pgbuf_is_page_flush_daemon_available`	`src/storage/page_buffer.c`	16673
`pgbuf_is_temp_lsa`	`src/storage/page_buffer.c`	16683
`pgbuf_init_temp_page_lsa`	`src/storage/page_buffer.c`	16689
`PAGE_FETCH_MODE`	`src/storage/page_buffer.h`	172
`PGBUF_LATCH_MODE`	`src/storage/page_buffer.h`	190
`PGBUF_ORDERED_RANK`	`src/storage/page_buffer.h`	222
`pgbuf_watcher`	`src/storage/page_buffer.h`	234
`PGBUF_TEMP_LSA`	`src/storage/page_buffer.h`	258
`PGBUF_ATOMIC_LATCH`	`src/storage/page_buffer.h`	365

Sources

cubrid-page-buffer-manager.md — the high-level companion. See also cubrid-double-write-buffer.md (the flush path below) and cubrid-log-manager-detail.md (the WAL rule the flush obeys).
Raw analyses under raw/code-analysis/cubrid/storage/buffer_manager/.
Code: src/storage/page_buffer.{c,h}.
Methodology: knowledge/methodology/code-analysis-detail-doc.md.
