PostgreSQL Buffer Manager — Shared Pool, Clock-Sweep Eviction, and the WAL-Before-Flush Rule

Contents:

Theoretical Background
Common DBMS Design
PostgreSQL’s Approach
Source Walkthrough
Source verification (as of 2026-06-05)
Beyond PostgreSQL — Comparative Designs & Research Frontiers
Sources

Theoretical Background

A database is too large to live in memory, so the engine keeps the authoritative copy on disk (in PostgreSQL, in the per-fork segment files managed by the storage manager; see postgres-smgr-md.md) and caches a working subset of fixed-size pages in RAM. The component that owns that cache is the buffer manager (a/k/a buffer pool or page cache). Database Internals (Petrov, ch. 5 §“Buffer Management”) frames its job precisely: it “is responsible for caching pages read from disk in memory” and serves “as an intermediary between persistent storage (disk) and the rest of the [engine],” so that higher layers address data by page ID and never touch the file directly. Every read of a tuple, every index probe, and every WAL-logged modification first goes through a buffer.

The buffer manager sits at the intersection of two hard constraints, and the design space is defined by how it answers each:

The cache is finite, so it must choose what to evict. When a backend needs a page that is not resident and no empty frame is free, some resident page must be ejected. The replacement policy (which page to evict) is the first axis. True LRU requires per-access bookkeeping that does not scale across many CPUs sharing one pool, so real engines use approximations such as CLOCK, CLOCK-Pro, LRU-K, 2Q, and ARC. The textbook singles out “evicting pages that are still in use, called page faults,” as the cost the policy is trying to minimize.
Cached pages are modified in place, so eviction interacts with recovery. A page dirtied in the cache and not yet written back holds state, committed or uncommitted, that recovery must be able to reconstruct or roll back after a crash. The second axis is the force/steal policy from the recovery literature ([HABERMANN]/[MOHAN]; see dbms-papers/aries.md):
- Force vs no-force: must a transaction’s dirty pages be flushed to disk before it is allowed to commit (force), or may they stay in the cache and be written lazily (no-force)?
- Steal vs no-steal: may the buffer manager evict (and write to disk) a page dirtied by a transaction that has not yet committed (steal), or must it pin such pages until commit (no-steal)?
ARIES — the recovery method PostgreSQL follows — chooses the most performant and most demanding combination: steal + no-force. Steal means an uncommitted transaction’s changes can reach disk, so recovery must be able to undo them; no-force means a committed transaction’s changes may not have reached disk, so recovery must be able to redo them. Both directions are made sound by one invariant — the Write-Ahead Logging (WAL) rule: the log record describing a change must be on stable storage before the changed data page is. The buffer manager is the component that physically enforces the no-force/steal half of this contract, by refusing to write any data page to disk until the WAL is flushed up to that page’s LSN. WAL generation and flushing themselves live in postgres-xlog-wal.md; this document covers the enforcement point.

Two further textbook points shape the implementation. First, the buffer manager is a shared structure in a multi-process engine, so access to its metadata must be cheap and concurrent — a single global lock (PostgreSQL’s pre-8.1 BufMgrLock) is a notorious scalability bottleneck. Second, the unit of caching is the page (PostgreSQL BLCKSZ, default 8 KB; see postgres-page-layout.md for what is inside a page), identified by a page ID that the cache uses as the hash key. Everything below is PostgreSQL turning these two axes — replacement and force/steal — and the two constraints — concurrency and page-addressing — into concrete data structures.

Common DBMS Design

The textbook gives the model; this section names the engineering conventions that nearly every disk-resident buffer manager — PostgreSQL, Oracle, InnoDB, SQL Server, CUBRID — adopts in some form. PostgreSQL’s specific choices in §“PostgreSQL’s Approach” read as one set of dials within this shared space.

A descriptor array beside a page array

The pool is two parallel arrays: a contiguous block of page-sized frames holding the actual bytes, and a parallel array of buffer descriptors — small metadata records (one per frame) holding the page’s identity, its dirty/valid flags, a pin count, and the replacement-policy bookkeeping. A buffer is named by its index into these arrays. Keeping the hot metadata (the descriptor) physically separate from the cold 8 KB payload lets the engine scan, lock, and CAS descriptors without dragging page data through the CPU cache.
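The two-array layout can be sketched in a few lines of C. This is a toy model under stated assumptions, not PostgreSQL's code: `ToyDesc`, `ToyPool`, and the `toy_*` helpers are hypothetical names, the pool lives on the heap rather than in shared memory, and the descriptor holds only a handful of illustrative fields.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define TOY_PAGE_SIZE 8192          /* assumed page size for the sketch */

/* Hot metadata: small, scanned and updated often. */
typedef struct ToyDesc {
    uint32_t page_id;               /* identity of the cached page */
    uint32_t pin_count;             /* how many users hold the frame */
    uint8_t  dirty;                 /* modified since last write-back */
    uint8_t  valid;                 /* contents have been read in */
} ToyDesc;

typedef struct ToyPool {
    int      nframes;
    ToyDesc *descs;                 /* nframes descriptors */
    char    *blocks;                /* nframes * TOY_PAGE_SIZE payload bytes */
} ToyPool;

/* Allocate both arrays; returns 0 on success, -1 on allocation failure. */
static int toy_pool_init(ToyPool *pool, int nframes)
{
    pool->nframes = nframes;
    pool->descs = calloc((size_t) nframes, sizeof(ToyDesc));
    pool->blocks = calloc((size_t) nframes, TOY_PAGE_SIZE);
    if (pool->descs == NULL || pool->blocks == NULL) {
        free(pool->descs);
        free(pool->blocks);
        return -1;
    }
    return 0;
}

/* The same index names both the descriptor and the payload. */
static ToyDesc *toy_desc(ToyPool *pool, int buf_id)
{
    assert(buf_id >= 0 && buf_id < pool->nframes);
    return &pool->descs[buf_id];
}

static char *toy_block(ToyPool *pool, int buf_id)
{
    assert(buf_id >= 0 && buf_id < pool->nframes);
    return pool->blocks + (size_t) buf_id * TOY_PAGE_SIZE;
}

static void toy_pool_free(ToyPool *pool)
{
    free(pool->descs);
    free(pool->blocks);
}
```

Scanning `descs` touches a few bytes per frame, while the payload is reached only through `toy_block` once a frame has been chosen.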

A hash table from page-ID to frame

Given a page ID, the engine must answer “is this page resident, and if so in which frame?” in roughly O(1). Every engine keeps a mapping hash table keyed on the page identity (tablespace/database/relation/fork/block, or some packing thereof) and valued by the frame index. To stop this table from becoming the new global bottleneck, the lock that protects it is striped into partitions keyed on the hash of the tag, so two backends touching unrelated pages rarely contend.

Pin counts vs content locks — two independent guards

There are two orthogonal things a backend can want from a buffer, and mature engines separate them:

- A pin (reference count) says “do not evict this frame out from under me.” It guards the frame’s identity. Pins are cheap and can be held for a long time (a sequential scan holds one for as long as it works on a page).
- A content lock (shared/exclusive) says “do not let anyone write the bytes while I am reading them, or touch them at all while I am writing them.” It guards the page contents. Content locks are short-term.

Conflating the two forces every reader to take a heavyweight lock; separating them lets a reader pin-and-share-lock, determine visibility, drop the content lock, and keep reading tuple data under the pin alone.

An approximate-LRU replacement hand

Exact LRU needs a global list reshuffled on every access, which is unacceptable under concurrency. The dominant approximation is CLOCK (or second chance): frames are arranged in a circle, each carrying a small usage counter; a “hand” sweeps the circle, skipping pinned frames, and at each unpinned frame it either evicts (usage 0) or decrements the usage counter and moves on. An access raises the counter and each pass of the hand lowers it, so recently or frequently used pages survive a sweep. The counter’s maximum is a tunable trade-off between LRU fidelity and sweep cost.
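A single-threaded sketch of the sweep makes the rule concrete. All names here (`ToyFrame`, `ToyClock`, `toy_touch`, `toy_pick_victim`) are hypothetical, the cap of 5 is assumed for illustration, and the give-up condition (a full revolution of pinned frames) is a simplification chosen for the sketch.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_MAX_USAGE 5             /* assumed cap on the usage counter */

typedef struct ToyFrame {
    uint32_t pins;                  /* > 0 means the frame may not be evicted */
    uint32_t usage;                 /* raised on access, lowered by the hand */
} ToyFrame;

typedef struct ToyClock {
    ToyFrame *frames;
    int       nframes;
    int       hand;                 /* next frame the sweep will inspect */
} ToyClock;

/* Record an access: bump the usage counter up to the cap. */
static void toy_touch(ToyClock *c, int i)
{
    if (c->frames[i].usage < TOY_MAX_USAGE)
        c->frames[i].usage++;
}

/*
 * Sweep until an unpinned frame with usage 0 is found.
 * Returns the victim index, or -1 if a full revolution saw only pinned frames.
 */
static int toy_pick_victim(ToyClock *c)
{
    int pinned_in_a_row = 0;

    while (pinned_in_a_row < c->nframes) {
        int i = c->hand;
        ToyFrame *f = &c->frames[i];

        c->hand = (c->hand + 1) % c->nframes;   /* advance the hand */

        if (f->pins > 0) {
            pinned_in_a_row++;                   /* cannot evict, skip */
            continue;
        }
        pinned_in_a_row = 0;
        if (f->usage > 0)
            f->usage--;                          /* second chance */
        else
            return i;                            /* unpinned and cold */
    }
    return -1;
}
```

A frame touched often keeps a high counter and survives several revolutions; a frame touched once is gone after the hand has passed it twice.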

A small free list for genuinely empty frames

At startup every frame is empty; as relations are dropped, frames become empty again. A free list of known-empty frames lets allocation skip the sweep entirely in those cases. The free list is a fast path layered over the clock; it is not the replacement policy.

Bulk-operation rings to avoid cache pollution

A one-shot large scan (a seqscan of a table bigger than the pool, a VACUUM, a COPY) would, under naive CLOCK, evict the entire working set to cache pages it will never look at again. The convention is a ring buffer (a/k/a buffer ring / strategy): a small fixed set of frames the bulk operation reuses among itself, leaving the rest of the pool untouched.

The WAL-before-flush enforcement point

Every steal/no-force engine puts one chokepoint on the path from “dirty page in cache” to “page bytes on disk”: before issuing the write, force the log to disk up to the page’s LSN. This single rule is what makes both undo (steal) and redo (no-force) recovery sound. The buffer manager is where it is enforced because the buffer manager is the only component that writes data pages.
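The chokepoint can be modeled with two LSN counters and one function. This is an illustrative sketch only: `ToyLog`, `ToyPage`, `toy_log_flush`, and `toy_flush_page` are hypothetical names, and the "flush" and "write" are simulated by updating counters rather than doing I/O.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t ToyLsn;

typedef struct ToyLog {
    ToyLsn inserted;                /* highest LSN generated so far */
    ToyLsn flushed;                 /* highest LSN known to be on stable storage */
} ToyLog;

typedef struct ToyPage {
    ToyLsn lsn;                     /* LSN of the last log record that touched the page */
    int    dirty;
    int    writes;                  /* counts simulated data-file writes */
} ToyPage;

/* Make the log durable up to at least upto; a no-op if already flushed that far. */
static void toy_log_flush(ToyLog *log, ToyLsn upto)
{
    assert(upto <= log->inserted);
    if (log->flushed < upto)
        log->flushed = upto;        /* stands in for write + fsync of the log */
}

/* Write a dirty page back, forcing the log first. */
static void toy_flush_page(ToyLog *log, ToyPage *page)
{
    if (!page->dirty)
        return;
    toy_log_flush(log, page->lsn);              /* the WAL rule */
    assert(log->flushed >= page->lsn);          /* invariant before any data write */
    page->writes++;                             /* stands in for the page write */
    page->dirty = 0;
}
```

Because every page write goes through `toy_flush_page`, no caller can put page bytes on disk ahead of the log records that describe them.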

Theory ↔ PostgreSQL mapping

Theory / convention	PostgreSQL name
Page-sized cache frame	`BufferBlocks` array, `BLCKSZ` bytes each
Per-frame metadata record	`BufferDesc` (in `BufferDescriptors[]`)
Page identity / cache key	`BufferTag` = (spcOid, dbOid, relNumber, forkNum, blockNum)
Dirty / valid / pinned flags + counters	`BufferDesc.state` — one atomic `uint32` (18b refcount, 4b usagecount, 10b flags)
Page-ID → frame hash table	`buf_table.c` shared hash, `BufTableLookup`/`Insert`/`Delete`
Striped lock on the hash	`BufMappingPartitionLock` — `NUM_BUFFER_PARTITIONS` = 128
Pin / reference count	`PinBuffer` / `UnpinBuffer`, backed by `PrivateRefCountArray`
Content (read/write) lock	`LWLock content_lock` per descriptor, via `LockBuffer`
CLOCK usage counter	`BUF_USAGECOUNT`, max `BM_MAX_USAGE_COUNT` = 5
CLOCK hand	`StrategyControl->nextVictimBuffer`, advanced by `ClockSweepTick`
Free list of empty frames	`StrategyControl->firstFreeBuffer` + `BufferDesc.freeNext`
Victim selection	`StrategyGetBuffer` → wrapped by `GetVictimBuffer`
Bulk-operation ring	`BufferAccessStrategy` (`BAS_BULKREAD`/`BAS_BULKWRITE`/`BAS_VACUUM`)
WAL-before-flush rule	`FlushBuffer` → `XLogFlush(BufferGetLSN(buf))`

PostgreSQL’s Approach

PostgreSQL’s buffer pool is a fixed-size region carved out of the one shared memory segment at postmaster startup (see postgres-architecture-overview.md §“Axis 2” — the pool does not grow at runtime). NBuffers (the shared_buffers GUC, in BLCKSZ units) frames are allocated once, and three parallel shared arrays are built beside them. The defining choices are: (1) all of a frame’s mutable header state — pin count, usage count, and ten flag bits — is packed into one 32-bit atomic word so the common pin/unpin operations need no spinlock; (2) replacement is a clock-sweep over that usage counter with a free-list fast path; (3) bulk operations run inside small ring buffers so they cannot blow out the cache; and (4) the write path enforces the WAL rule unconditionally for permanent relations.

The three shared arrays and their initialization

BufferManagerShmemInit (in buf_init.c) creates the descriptors, the page blocks, the per-buffer I/O condition variables, and a checkpoint-sort scratch array, then links every descriptor into one initial free list:

// BufferManagerShmemInit — storage/buffer/buf_init.c (condensed)
BufferDescriptors = (BufferDescPadded *)
    ShmemInitStruct("Buffer Descriptors",
                    NBuffers * sizeof(BufferDescPadded), &foundDescs);
BufferBlocks = (char *)
    TYPEALIGN(PG_IO_ALIGN_SIZE,
              ShmemInitStruct("Buffer Blocks",
                              NBuffers * (Size) BLCKSZ + PG_IO_ALIGN_SIZE, &foundBufs));
// ... condensed: BufferIOCVArray, CkptBufferIds ...
for (i = 0; i < NBuffers; i++)
{
    BufferDesc *buf = GetBufferDescriptor(i);
    ClearBufferTag(&buf->tag);
    pg_atomic_init_u32(&buf->state, 0);
    buf->buf_id = i;
    buf->freeNext = i + 1;                 /* link all buffers as unused */
    LWLockInitialize(BufferDescriptorGetContentLock(buf), LWTRANCHE_BUFFER_CONTENT);
    ConditionVariableInit(BufferDescriptorGetIOCV(buf));
}
GetBufferDescriptor(NBuffers - 1)->freeNext = FREENEXT_END_OF_LIST;
StrategyInitialize(!foundDescs);           /* builds the mapping hash + control */

The descriptor array is BufferDescPadded — each BufferDesc is padded to a 64-byte cache line (BUFFERDESC_PAD_TO_SIZE) so that two CPUs spinning on two adjacent descriptors do not false-share a cache line. The payload array (BufferBlocks) is aligned to PG_IO_ALIGN_SIZE so direct/async I/O can DMA into it without straddling.
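The padding trick itself is small enough to show in isolation. The sketch below uses hypothetical types (`ToyHeader`, `ToyHeaderPadded`) and assumes a 64-byte cache line; it overlays the header on a byte array inside a union so that every array element occupies exactly one line.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_CACHE_LINE 64           /* assumed cache-line size */

typedef struct ToyHeader {
    uint32_t tag[5];                /* stand-in for the page identity */
    int32_t  buf_id;
    uint32_t state;
} ToyHeader;

/* Pad each header to a full cache line by overlaying it on a byte array. */
typedef union ToyHeaderPadded {
    ToyHeader hdr;
    char      pad[TOY_CACHE_LINE];
} ToyHeaderPadded;

/* Compile-time checks: the header fits, and the padded size is one line. */
_Static_assert(sizeof(ToyHeader) <= TOY_CACHE_LINE, "header must fit in one line");
_Static_assert(sizeof(ToyHeaderPadded) == TOY_CACHE_LINE, "padded size is one line");

/* Byte distance between two adjacent elements of a padded array. */
static size_t toy_stride(const ToyHeaderPadded *array)
{
    return (size_t) ((const char *) &array[1] - (const char *) &array[0]);
}
```

Padding fixes the stride only. For adjacent elements to land on separate lines, the base of the array must also start on a cache-line boundary, which the sketch does not enforce.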

flowchart LR
  subgraph SHM["one shared-memory segment (sized once at boot)"]
    direction TB
    DESC["BufferDescriptors[]<br/>NBuffers x BufferDescPadded<br/>(64-byte cache-line padded)"]
    BLK["BufferBlocks[]<br/>NBuffers x BLCKSZ (8KB)"]
    CV["BufferIOCVArray[]<br/>per-frame I/O condition vars"]
    HASH["buf_table hash<br/>BufferTag -> buf_id"]
    CTL["BufferStrategyControl<br/>nextVictimBuffer, freelist head/tail"]
  end
  DESC -- "buf_id indexes" --> BLK
  HASH -- "resolves tag to" --> DESC
  CTL -- "clock hand sweeps" --> DESC

Figure 1 — The buffer pool’s shared arrays. BufferDescriptors[] (metadata) runs parallel to BufferBlocks[] (the 8 KB payloads); a buf_id indexes both. The buf_table hash resolves a BufferTag to a buf_id, and BufferStrategyControl holds the clock hand and free-list pointers. All of it is fixed-size, created once by BufferManagerShmemInit.

The buffer tag — what identifies a page

A frame’s identity is a BufferTag: the five fields that locate a block without consulting any catalog (important, because the backend flushing a page may not even believe the relation is visible yet):

// struct buftag — storage/buf_internals.h
typedef struct buftag
{
    Oid           spcOid;       /* tablespace oid */
    Oid           dbOid;        /* database oid */
    RelFileNumber relNumber;    /* relation file number */
    ForkNumber    forkNum;      /* fork number (main / fsm / vm / init) */
    BlockNumber   blockNum;     /* block number within the fork */
} BufferTag;

The tag is the hash key. BufTableHashCode hashes it; the low bits of that hash pick one of NUM_BUFFER_PARTITIONS (128) partition locks, so lookups on unrelated pages take different BufMappingPartitionLocks and do not serialize:

// BufMappingPartitionLock / BufTableHashPartition — storage/buf_internals.h
static inline uint32
BufTableHashPartition(uint32 hashcode)
{
    return hashcode % NUM_BUFFER_PARTITIONS;
}
static inline LWLock *
BufMappingPartitionLock(uint32 hashcode)
{
    return &MainLWLockArray[BUFFER_MAPPING_LWLOCK_OFFSET +
                            BufTableHashPartition(hashcode)].lock;
}

When more than one partition must be held at once, they are always taken in partition-number order to avoid deadlock (the in-tree README is explicit about this). The mapping hash itself is sized for NBuffers + NUM_BUFFER_PARTITIONS entries — slightly more than the pool, because BufferAlloc inserts the new tag before deleting the old one and that can happen concurrently in each partition.
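The ordering rule can be expressed as a pure function that decides which partition to lock first. This is a sketch with hypothetical helper names (`toy_partition`, `toy_lock_order`); it computes the order only and takes no real locks.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_NUM_PARTITIONS 128      /* matches the partition count cited above */

/* Map a hash code to its partition, as a plain modulo. */
static uint32_t toy_partition(uint32_t hashcode)
{
    return hashcode % TOY_NUM_PARTITIONS;
}

/*
 * Decide the acquisition order for the partitions covering two hash codes.
 * Writes the lower partition to *first and the higher to *second, and
 * returns how many distinct locks are needed (1 or 2).
 */
static int toy_lock_order(uint32_t hash_a, uint32_t hash_b,
                          uint32_t *first, uint32_t *second)
{
    uint32_t pa = toy_partition(hash_a);
    uint32_t pb = toy_partition(hash_b);

    *first = (pa < pb) ? pa : pb;
    *second = (pa < pb) ? pb : pa;
    return (pa == pb) ? 1 : 2;
}
```

Two backends that need the same pair of partitions compute the same order regardless of which tag each considers "old" or "new", so neither can hold one lock while waiting for the other in reverse.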

The packed state word

The single most important micro-design decision is that the descriptor’s mutable header is one atomic word:

// BufferDesc + state layout — storage/buf_internals.h (condensed)
// state = 18 bits refcount | 4 bits usagecount | 10 bits flags
#define BUF_REFCOUNT_BITS   18
#define BUF_USAGECOUNT_BITS 4
#define BUF_FLAG_BITS       10

typedef struct BufferDesc
{
    BufferTag        tag;           /* ID of page contained in buffer */
    int              buf_id;        /* buffer's index number (from 0) */
    pg_atomic_uint32 state;         /* flags + refcount + usagecount */
    int              wait_backend_pgprocno; /* backend waiting for sole pin */
    int              freeNext;      /* link in freelist chain */
    PgAioWaitRef     io_wref;       /* set iff async I/O is in progress */
    LWLock           content_lock;  /* lock for the buffer *contents* */
} BufferDesc;

The ten flag bits encode the page’s lifecycle:

// buffer flags — storage/buf_internals.h
#define BM_LOCKED            (1U << 22)  /* buffer header is spinlocked */
#define BM_DIRTY             (1U << 23)  /* data needs writing */
#define BM_VALID             (1U << 24)  /* data is valid */
#define BM_TAG_VALID         (1U << 25)  /* tag is assigned (has a hash entry) */
#define BM_IO_IN_PROGRESS    (1U << 26)  /* read or write in progress */
#define BM_IO_ERROR          (1U << 27)  /* previous I/O failed */
#define BM_JUST_DIRTIED      (1U << 28)  /* dirtied since write started */
#define BM_PIN_COUNT_WAITER  (1U << 29)  /* have waiter for sole pin */
#define BM_CHECKPOINT_NEEDED (1U << 30)  /* must write for checkpoint */
#define BM_PERMANENT         (1U << 31)  /* permanent (WAL-logged) buffer */

Because refcount, usagecount, and flags share one word, a backend can pin a buffer (increment refcount, bump usagecount, test BM_VALID) in a single compare-and-swap loop — no spinlock taken in the common case. The buffer header spinlock is itself just the BM_LOCKED flag bit: LockBufHdr spins setting it, and UnlockBufHdr clears it with a plain write. The header spinlock guards complex multi-field updates (changing the tag); the CAS path handles the simple ones. Critically, the header spinlock does not guard the page bytes — that is the content_lock’s job.
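The CAS-only pin can be sketched with C11 atomics. The layout below (refcount in the low 18 bits, usage count in the next 4, flags above) is inferred from the bit widths and flag positions shown earlier; the `TOY_*` macros and `toy_state_*` functions are hypothetical, and the sketch omits the wait for a held header spinlock that a real implementation needs.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed layout: refcount in bits 0-17, usage in bits 18-21, flags above. */
#define TOY_REFCOUNT_ONE   1u
#define TOY_REFCOUNT_MASK  ((1u << 18) - 1)
#define TOY_USAGE_SHIFT    18
#define TOY_USAGE_ONE      (1u << TOY_USAGE_SHIFT)
#define TOY_USAGE_MASK     (0xFu << TOY_USAGE_SHIFT)
#define TOY_MAX_USAGE      5u
#define TOY_VALID          (1u << 24)

static uint32_t toy_state_refcount(uint32_t s)
{
    return s & TOY_REFCOUNT_MASK;
}

static uint32_t toy_state_usage(uint32_t s)
{
    return (s & TOY_USAGE_MASK) >> TOY_USAGE_SHIFT;
}

/*
 * Pin: add one reference and bump usage (capped) in a single CAS loop.
 * Returns true if the page contents were valid at the moment of the pin.
 */
static bool toy_state_pin(_Atomic uint32_t *state)
{
    uint32_t old = atomic_load(state);

    for (;;) {
        uint32_t new_state = old + TOY_REFCOUNT_ONE;

        if (toy_state_usage(new_state) < TOY_MAX_USAGE)
            new_state += TOY_USAGE_ONE;
        /* On failure, old is reloaded with the current value and we retry. */
        if (atomic_compare_exchange_weak(state, &old, new_state))
            return (new_state & TOY_VALID) != 0;
    }
}

/* Unpin: drop one reference; the usage count is left for the sweep to decay. */
static void toy_state_unpin(_Atomic uint32_t *state)
{
    uint32_t old = atomic_fetch_sub(state, TOY_REFCOUNT_ONE);

    assert(toy_state_refcount(old) > 0);
    (void) old;                     /* silence unused warning when asserts are off */
}
```

All three fields change in one atomic step, so a concurrent observer never sees a raised refcount paired with a stale usage count or flag set.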

Pins vs content locks

These are the two independent guards from §“Common DBMS Design”, and the in-tree README states the rules precisely. A backend must pin a buffer before touching it at all; a pin keeps the frame from being recycled. Pin state is tracked per-backend in a small array plus an overflow hash, so a backend that pins the same buffer twice does not touch shared memory the second time:

// PrivateRefCountEntry / array — storage/buffer/bufmgr.c
typedef struct PrivateRefCountEntry
{
    Buffer  buffer;
    int32   refcount;
} PrivateRefCountEntry;
#define REFCOUNT_ARRAY_ENTRIES 8   /* ~one cache line; overflow spills to a hash */

PinBuffer increments the shared refcount (in state) only on the first private pin, then tracks further pins privately and registers the pin with the current ResourceOwner so it is released at end of transaction even on error:

// PinBuffer — storage/buffer/bufmgr.c (condensed)
ref = GetPrivateRefCountEntry(b, true);
if (ref == NULL)                               /* first pin by this backend */
{
    ref = NewPrivateRefCountEntry(b);
    old_buf_state = pg_atomic_read_u32(&buf->state);
    for (;;)
    {
        if (old_buf_state & BM_LOCKED)
            old_buf_state = WaitBufHdrUnlocked(buf);
        buf_state = old_buf_state + BUF_REFCOUNT_ONE;   /* take a shared pin */
        if (strategy == NULL)                            /* bump usagecount, capped */
        {
            if (BUF_STATE_GET_USAGECOUNT(buf_state) < BM_MAX_USAGE_COUNT)
                buf_state += BUF_USAGECOUNT_ONE;
        }
        else if (BUF_STATE_GET_USAGECOUNT(buf_state) == 0)
            buf_state += BUF_USAGECOUNT_ONE;             /* ring buffers cap at 1 */
        if (pg_atomic_compare_exchange_u32(&buf->state, &old_buf_state, buf_state))
        {
            result = (buf_state & BM_VALID) != 0;
            break;
        }
    }
}
else
    result = (pg_atomic_read_u32(&buf->state) & BM_VALID) != 0;
ref->refcount++;
ResourceOwnerRememberBuffer(CurrentResourceOwner, b);

Note the strategy branch: a normal pin bumps the usage counter (up to BM_MAX_USAGE_COUNT); a pin made through a ring buffer caps usage at 1, so ring pages never accumulate the protection that would keep them resident past the scan. UnpinBuffer is the mirror image — it decrements the private count, and only when that hits zero does it CAS the shared refcount down and, if a cleanup waiter is parked (BM_PIN_COUNT_WAITER), wake it.
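The private-count idea can be isolated from the rest of the pin path. In this sketch the names (`ToyBackend`, `toy_private_pin`, `toy_private_unpin`) are hypothetical, the shared refcount is a plain integer rather than part of an atomic state word, and there is no overflow hash: a full array simply fails.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_PRIVATE_SLOTS 8         /* small fixed array, as in the text */

typedef struct ToyPrivateRef {
    int     buffer;                 /* 1-based buffer number; 0 means free slot */
    int32_t refcount;
} ToyPrivateRef;

typedef struct ToyBackend {
    ToyPrivateRef slots[TOY_PRIVATE_SLOTS];
} ToyBackend;

/* Linear search; with buffer == 0 this finds a free slot. */
static ToyPrivateRef *toy_private_find(ToyBackend *be, int buffer)
{
    for (int i = 0; i < TOY_PRIVATE_SLOTS; i++)
        if (be->slots[i].buffer == buffer)
            return &be->slots[i];
    return NULL;
}

/* Pin: returns 0 on success, -1 if the private array is full. */
static int toy_private_pin(ToyBackend *be, int buffer, uint32_t *shared_refcount)
{
    ToyPrivateRef *ref = toy_private_find(be, buffer);

    assert(buffer > 0);
    if (ref == NULL) {
        ref = toy_private_find(be, 0);          /* claim a free slot */
        if (ref == NULL)
            return -1;                          /* the sketch has no overflow hash */
        ref->buffer = buffer;
        ref->refcount = 0;
        (*shared_refcount)++;                   /* first pin: touch shared state once */
    }
    ref->refcount++;
    return 0;
}

/* Unpin: the shared count drops only when the last private pin goes away. */
static void toy_private_unpin(ToyBackend *be, int buffer, uint32_t *shared_refcount)
{
    ToyPrivateRef *ref = toy_private_find(be, buffer);

    assert(ref != NULL && ref->refcount > 0);
    if (--ref->refcount == 0) {
        ref->buffer = 0;
        (*shared_refcount)--;
    }
}
```

However many times one backend pins a buffer, the shared count sees it as a single reference, which keeps repeated pins off the contended cache line.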

The content lock is a separate per-descriptor LWLock, taken in shared or exclusive mode through LockBuffer:

// LockBuffer — storage/buffer/bufmgr.c (condensed)
if (mode == BUFFER_LOCK_UNLOCK)
    LWLockRelease(BufferDescriptorGetContentLock(buf));
else if (mode == BUFFER_LOCK_SHARE)
    LWLockAcquire(BufferDescriptorGetContentLock(buf), LW_SHARED);
else if (mode == BUFFER_LOCK_EXCLUSIVE)
    LWLockAcquire(BufferDescriptorGetContentLock(buf), LW_EXCLUSIVE);

The README’s access rules tie the two together: to scan a page you hold a pin and a (shared or exclusive) content lock; once you have decided a tuple is visible you may drop the content lock and keep reading its bytes under the pin alone (rule #2) — which is exactly why pins and locks are separate primitives.

Reading a page: BufferAlloc

ReadBuffer → ReadBuffer_common → BufferAlloc is the lookup-or-fault path. BufferAlloc first probes the mapping hash under a shared partition lock; on a hit it pins and returns. On a miss it must fault the page in: acquire a victim frame, re-check the hash under an exclusive partition lock (another backend may have raced in), and on a clean miss install the new tag:

// BufferAlloc — storage/buffer/bufmgr.c (condensed)
LWLockAcquire(newPartitionLock, LW_SHARED);
existing_buf_id = BufTableLookup(&newTag, newHash);
if (existing_buf_id >= 0)                       /* HIT */
{
    buf = GetBufferDescriptor(existing_buf_id);
    valid = PinBuffer(buf, strategy);
    LWLockRelease(newPartitionLock);
    *foundPtr = true;
    if (!valid) *foundPtr = false;              /* read still in flight */
    return buf;
}
LWLockRelease(newPartitionLock);

/* MISS: get a recyclable frame (this may flush a dirty victim, see below) */
victim_buffer  = GetVictimBuffer(strategy, io_context);
victim_buf_hdr = GetBufferDescriptor(victim_buffer - 1);

LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
existing_buf_id = BufTableInsert(&newTag, newHash, victim_buf_hdr->buf_id);
if (existing_buf_id >= 0)                       /* lost the race; use theirs */
{
    UnpinBuffer(victim_buf_hdr);
    StrategyFreeBuffer(victim_buf_hdr);
    /* ... pin existing_buf_hdr, release lock, return ... */
}
/* won the race: stamp the victim with the new tag */
victim_buf_state = LockBufHdr(victim_buf_hdr);
victim_buf_hdr->tag = newTag;
victim_buf_state |= BM_TAG_VALID | BUF_USAGECOUNT_ONE;
if (relpersistence == RELPERSISTENCE_PERMANENT || forkNum == INIT_FORKNUM)
    victim_buf_state |= BM_PERMANENT;
UnlockBufHdr(victim_buf_hdr, victim_buf_state);
LWLockRelease(newPartitionLock);
*foundPtr = false;                               /* caller must do the read I/O */
return victim_buf_hdr;

BufferAlloc returns a pinned frame stamped with the tag but, on a miss, with invalid contents (*foundPtr = false); the caller then performs the actual read and marks the page BM_VALID via the StartBufferIO / TerminateBufferIO protocol. BM_PERMANENT is set here, once, based on relation persistence — this is the bit FlushBuffer later consults to decide whether the WAL rule applies.

flowchart TD
  A["ReadBuffer(rel, blk)"] --> B["compute BufferTag + hash<br/>pick partition lock"]
  B --> C{"BufTableLookup<br/>(shared partition lock)"}
  C -- "hit" --> D["PinBuffer; release lock"]
  D --> E{"BM_VALID?"}
  E -- "yes" --> Z["return pinned buffer"]
  E -- "no" --> Y["wait for in-flight read"]
  C -- "miss" --> F["GetVictimBuffer<br/>(clock sweep / ring)"]
  F --> G["BufTableInsert<br/>(exclusive partition lock)"]
  G -- "race lost" --> H["UnpinBuffer + StrategyFreeBuffer<br/>use the winner's buffer"]
  G -- "race won" --> I["stamp tag, set BM_TAG_VALID<br/>BM_PERMANENT if logged"]
  I --> J["read block via smgr<br/>set BM_VALID"]
  J --> Z
  H --> Z

Figure 2 — ReadBuffer/BufferAlloc lookup-or-fault flow. A hit pins under a shared partition lock and returns. A miss acquires a victim frame (clock sweep, possibly flushing it), then re-checks under an exclusive partition lock to handle the race where another backend faulted the same page concurrently; the loser frees its victim and uses the winner’s buffer.

Clock-sweep eviction: GetVictimBuffer and StrategyGetBuffer

When a miss needs a frame, GetVictimBuffer (in bufmgr.c) drives the policy but delegates selection to StrategyGetBuffer (in freelist.c). StrategyGetBuffer first tries the strategy ring (if any), then the free list, then runs the clock sweep:

// StrategyGetBuffer — storage/buffer/freelist.c (condensed, clock-sweep arm)
trycounter = NBuffers;
for (;;)
{
    buf = GetBufferDescriptor(ClockSweepTick());   /* advance hand, return frame */
    local_buf_state = LockBufHdr(buf);
    if (BUF_STATE_GET_REFCOUNT(local_buf_state) == 0)
    {
        if (BUF_STATE_GET_USAGECOUNT(local_buf_state) != 0)
        {
            local_buf_state -= BUF_USAGECOUNT_ONE;  /* second chance: decay */
            trycounter = NBuffers;
        }
        else
        {
            /* Found a usable buffer (unpinned, usage 0) */
            if (strategy != NULL)
                AddBufferToRing(strategy, buf);
            *buf_state = local_buf_state;
            return buf;                              /* returned with hdr spinlock held! */
        }
    }
    else if (--trycounter == 0)
    {
        UnlockBufHdr(buf, local_buf_state);
        elog(ERROR, "no unpinned buffers available");
    }
    UnlockBufHdr(buf, local_buf_state);
}

The hand itself is a single atomic counter advanced by ClockSweepTick:

// ClockSweepTick — storage/buffer/freelist.c (condensed)
victim = pg_atomic_fetch_add_u32(&StrategyControl->nextVictimBuffer, 1);
if (victim >= NBuffers)
{
    victim = victim % NBuffers;          /* wrap the index */
    if (victim == 0)                     /* we caused a wraparound */
    {
        /* take buffer_strategy_lock just long enough to bump completePasses */
        // ... CAS nextVictimBuffer back into range, completePasses++ ...
    }
}
return victim;

The free-list fast path is checked before the sweep: the list head is first read without a lock, and only if the list looks non-empty is the buffer_strategy_lock spinlock taken to pop the head entry:

// StrategyGetBuffer — freelist.c (free-list arm, condensed)
if (StrategyControl->firstFreeBuffer >= 0)
{
    while (true)
    {
        SpinLockAcquire(&StrategyControl->buffer_strategy_lock);
        if (StrategyControl->firstFreeBuffer < 0) { SpinLockRelease(...); break; }
        buf = GetBufferDescriptor(StrategyControl->firstFreeBuffer);
        StrategyControl->firstFreeBuffer = buf->freeNext;     /* pop head */
        buf->freeNext = FREENEXT_NOT_IN_LIST;
        SpinLockRelease(&StrategyControl->buffer_strategy_lock);
        local_buf_state = LockBufHdr(buf);
        if (BUF_STATE_GET_REFCOUNT(local_buf_state) == 0
            && BUF_STATE_GET_USAGECOUNT(local_buf_state) == 0)
            return buf;                  /* unpinned, usage 0: take it */
        UnlockBufHdr(buf, local_buf_state);   /* else discard, retry */
    }
}

Two scalability properties fall out. The clock hand is one atomic fetch-add, so multiple backends sweeping concurrently never hold a global lock (they may return buffers slightly out of order, which is harmless). And the per-frame test is done under only that frame’s header spinlock. The buffer_strategy_lock spinlock is touched only to pop the free list or to record a clock wraparound, never during the sweep itself. BM_MAX_USAGE_COUNT = 5 caps how many sweeps a hot page can survive: however often a page has been pinned, its usage count is at most 5, so 5 hand passes decay it to 0 and the next pass can evict it. The worst-case sweep to find a victim among unpinned frames is therefore bounded.
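The bound can be checked with a small simulation. The function below is a hypothetical helper (`toy_ticks_to_victim`) that assumes every frame is unpinned and that nothing bumps a counter while the sweep runs; under those assumptions, N frames at the maximum usage count of 5 cost 5 × N decrementing ticks plus one final tick that finds the victim.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Count hand ticks until a victim is found, given per-frame usage counts
 * (all frames unpinned, nframes > 0). The hand starts at frame 0 and the
 * usage[] array is modified in place.
 */
static long toy_ticks_to_victim(uint32_t *usage, int nframes)
{
    long ticks = 0;
    int hand = 0;

    assert(nframes > 0);
    for (;;) {
        ticks++;
        if (usage[hand] == 0)
            return ticks;                   /* cold frame: this is the victim */
        usage[hand]--;                      /* second chance */
        hand = (hand + 1) % nframes;
    }
}
```

In a live system concurrent accesses can raise counters during the sweep, so this figure is the bound for a quiescent pool rather than a guarantee under load.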

flowchart LR
  H["nextVictimBuffer<br/>(clock hand)"] --> F0
  subgraph RING["BufferDescriptors viewed as a circle"]
    direction LR
    F0["frame i<br/>pinned? -> skip<br/>usage>0 -> usage--<br/>usage=0 -> EVICT"]
    F1["frame i+1"]
    F2["frame i+2"]
    F3["..."]
  end
  F0 --> F1 --> F2 --> F3 -. "wraps to 0,<br/>completePasses++" .-> F0

Figure 3 — Clock-sweep replacement. The hand (nextVictimBuffer) advances one frame per ClockSweepTick. A pinned frame is skipped; an unpinned frame with usage > 0 has its usage decremented (a “second chance”); an unpinned frame at usage 0 is the victim. The hand never stops at a global lock — it is a single atomic counter.

GetVictimBuffer wraps selection with the eviction work: it pins the chosen frame with the spinlock held (PinBuffer_Locked), and if the frame is dirty, writes it out before recycling — this is where the write path and the WAL rule come in.

Ring buffers for bulk operations (BAS_*)

A BufferAccessStrategy is a backend-private ring of frame numbers (palloc’d, not in shared memory) that a bulk operation reuses among itself:

// BufferAccessStrategyData — storage/buffer/freelist.c
typedef struct BufferAccessStrategyData
{
    BufferAccessStrategyType btype;
    int    nbuffers;
    int    current;                       /* most recently returned ring slot */
    Buffer buffers[FLEXIBLE_ARRAY_MEMBER];/* the ring; 0 = slot not yet filled */
} BufferAccessStrategyData;

GetAccessStrategy picks the ring size by type (the README explains each):

Strategy	Type	Ring size	Used by
Bulk read	`BAS_BULKREAD`	256 KB (+IO concurrency)	large seq scans
Bulk write	`BAS_BULKWRITE`	16 MB (capped 1/8 `shared_buffers`)	`COPY IN`, `CREATE TABLE AS`
Vacuum	`BAS_VACUUM`	`vacuum_buffer_usage_limit` (default 2 MB)	`VACUUM`
Normal	`BAS_NORMAL`	— (returns NULL)	everything else
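The ring sizes in the table translate into frame counts by simple division. The sketch below is a hypothetical helper (`toy_ring_frames`), not the body of GetAccessStrategy: it assumes the default 8 KB block size, ignores the I/O-concurrency adjustment noted for bulk reads, and applies the one-eighth cap to every ring type for simplicity.

```c
#include <assert.h>

#define TOY_BLCKSZ 8192             /* default block size cited earlier */

/*
 * Convert a ring size in kilobytes to a number of frames, capped at one
 * eighth of the pool (nbuffers is the pool size in frames).
 */
static int toy_ring_frames(int ring_size_kb, int nbuffers)
{
    int frames = ring_size_kb / (TOY_BLCKSZ / 1024);
    int cap = nbuffers / 8;

    assert(ring_size_kb >= 0 && nbuffers >= 0);
    if (frames > cap)
        frames = cap;
    return frames;
}
```

With a 128 MB pool (16384 frames) this gives 32 frames for a 256 KB ring, 256 frames for a 2 MB ring, and 2048 frames for a 16 MB ring, which is exactly the one-eighth cap. With an 8 MB pool (1024 frames) the 16 MB ring is cut down to 128 frames.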

When a strategy is active, StrategyGetBuffer first calls GetBufferFromRing, which advances current and re-offers the frame already in that slot — but only if it is unpinned and its usage count is ≤ 1 (i.e., nobody else has since touched it). If the slot is empty or the buffer was stolen, it returns NULL and the caller falls through to the normal clock sweep, then calls AddBufferToRing to record the freshly-acquired frame in the slot:

// GetBufferFromRing — storage/buffer/freelist.c (condensed)
if (++strategy->current >= strategy->nbuffers)
    strategy->current = 0;
bufnum = strategy->buffers[strategy->current];
if (bufnum == InvalidBuffer)
    return NULL;                          /* slot empty: get a normal buffer */
buf = GetBufferDescriptor(bufnum - 1);
local_buf_state = LockBufHdr(buf);
if (BUF_STATE_GET_REFCOUNT(local_buf_state) == 0
    && BUF_STATE_GET_USAGECOUNT(local_buf_state) <= 1)
    return buf;                           /* reuse this ring buffer */
UnlockBufHdr(buf, local_buf_state);
return NULL;                              /* stolen: get a normal buffer */

The interaction with the WAL rule is the subtle part. If a ring buffer is dirty and reusing it would force a WAL flush, BAS_BULKREAD would rather pick a different victim than stall on WAL — StrategyRejectBuffer drops the dirty frame from the ring so the caller takes a fresh one. VACUUM and bulk write instead keep the page in the ring and pay the WAL flush, because they are write-heavy by nature. This is decided in GetVictimBuffer:

// GetVictimBuffer — storage/buffer/bufmgr.c (the strategy/WAL interaction, condensed)
if (buf_state & BM_DIRTY)
{
    content_lock = BufferDescriptorGetContentLock(buf_hdr);
    if (!LWLockConditionalAcquire(content_lock, LW_SHARED))  /* avoid deadlock */
    {
        UnpinBuffer(buf_hdr);
        goto again;
    }
    if (strategy != NULL)
    {
        buf_state = LockBufHdr(buf_hdr);
        lsn = BufferGetLSN(buf_hdr);
        UnlockBufHdr(buf_hdr, buf_state);
        if (XLogNeedsFlush(lsn) && StrategyRejectBuffer(strategy, buf_hdr, from_ring))
        {
            LWLockRelease(content_lock);
            UnpinBuffer(buf_hdr);
            goto again;                  /* pick a different victim */
        }
    }
    FlushBuffer(buf_hdr, NULL, IOOBJECT_RELATION, io_context);
    LWLockRelease(content_lock);
    ScheduleBufferTagForWriteback(&BackendWritebackContext, io_context, &buf_hdr->tag);
}

The WAL-before-flush rule: FlushBuffer

FlushBuffer is the chokepoint where a dirty data page becomes durable, and it is where the WAL rule is physically enforced. The sequence is: claim the I/O (set BM_IO_IN_PROGRESS so no one else writes the same page), read the page LSN under the header spinlock, flush WAL up to that LSN, then write the page and clear the dirty bit:

// FlushBuffer — storage/buffer/bufmgr.c (condensed)
if (!StartBufferIO(buf, false, false))      /* false = for output; someone else won */
    return;
// ... condensed: error-context setup, smgropen if reln == NULL ...
buf_state = LockBufHdr(buf);
recptr = BufferGetLSN(buf);                 /* page's LSN, read under hdr lock */
buf_state &= ~BM_JUST_DIRTIED;              /* detect concurrent re-dirtying */
UnlockBufHdr(buf, buf_state);

/*
 * Force XLOG flush up to buffer's LSN.  This implements the basic WAL
 * rule that log updates must hit disk before any of the data-file changes
 * they describe do.  ... skip the flush if the buffer isn't permanent.
 */
if (buf_state & BM_PERMANENT)
    XLogFlush(recptr);

bufBlock   = BufHdrGetBlock(buf);
bufToWrite = PageSetChecksumCopy((Page) bufBlock, buf->tag.blockNum);
smgrwrite(reln, BufTagGetForkNum(&buf->tag), buf->tag.blockNum, bufToWrite, false);
// ... condensed: I/O stats ...
TerminateBufferIO(buf, true, 0, true, false); /* clear BM_DIRTY (unless re-dirtied) */

Three details carry the design:

if (buf_state & BM_PERMANENT) XLogFlush(recptr) is the literal WAL rule. For a permanent (WAL-logged) relation, the log is forced to disk up to the page’s LSN before smgrwrite. For unlogged relations the flush is skipped — they are lost on crash anyway, and (per the in-tree comment) flushing a fake unlogged-GiST LSN could even try to flush past the WAL insertion point. Note smgrwrite writes to the OS page cache, not necessarily to the platter — durability of the data page is later forced by the checkpointer’s smgrimmedsync/fsync; the WAL rule only requires that the log precede the data write, which it does.
StartBufferIO first waits out any I/O already in progress on the buffer. It returns false if the page is no longer dirty by then, which means another backend already wrote it, and FlushBuffer simply returns. BM_IO_IN_PROGRESS is the per-page I/O latch; waiters sleep on the frame’s condition variable in WaitIO.
BM_JUST_DIRTIED is cleared before the write and re-checked in TerminateBufferIO: if the page was re-dirtied during the write (legal, since only a shared content lock is held — hint-bit updates happen under shared lock), the dirty bit is not cleared, so the page will be written again later.

sequenceDiagram
    participant V as GetVictimBuffer / checkpointer
    participant FB as FlushBuffer
    participant HDR as buffer header
    participant WAL as WAL (XLogFlush)
    participant SMGR as smgr (data file)

    V->>FB: flush dirty victim
    FB->>HDR: StartBufferIO -> set BM_IO_IN_PROGRESS
    alt someone else already flushing
        HDR-->>FB: false -> return (no work)
    end
    FB->>HDR: LockBufHdr; recptr = BufferGetLSN; clear BM_JUST_DIRTIED
    opt buffer is BM_PERMANENT
        FB->>WAL: XLogFlush(recptr)
        Note right of WAL: log on disk up to page LSN<br/>BEFORE the page write
    end
    FB->>SMGR: smgrwrite(page) (+ checksum copy)
    FB->>HDR: TerminateBufferIO -> clear BM_DIRTY unless BM_JUST_DIRTIED
    Note right of HDR: if re-dirtied mid-write,<br/>page stays dirty -> rewritten later

Figure 4 — FlushBuffer and the WAL-before-flush rule. The flusher claims the page via BM_IO_IN_PROGRESS, reads its LSN, and for a permanent relation calls XLogFlush(recptr) to force the log to disk up to that LSN before smgrwrite. BM_JUST_DIRTIED catches concurrent hint-bit dirtying so a re-dirtied page is not falsely marked clean.
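
The ordering in Figure 4 can be modelled in a few lines of standalone C. This is a minimal sketch, not PostgreSQL code: ToyBuffer, ToyWal, the FB_* flag values, and the redirtied_during_write parameter are all hypothetical names introduced for illustration, and the "write" is reduced to an assertion that the log is already durable up to the page LSN.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative flag bits; names echo the BM_* flags, values are made up. */
#define FB_DIRTY         (1u << 0)
#define FB_JUST_DIRTIED  (1u << 1)
#define FB_PERMANENT     (1u << 2)

typedef struct ToyBuffer
{
    uint32_t flags;
    uint64_t page_lsn;      /* LSN of the last WAL record that touched the page */
} ToyBuffer;

typedef struct ToyWal
{
    uint64_t flushed_upto;  /* the log is on stable storage up to this LSN */
} ToyWal;

/* Stand-in for XLogFlush: make the log durable up to at least lsn. */
static void
toy_wal_flush(ToyWal *wal, uint64_t lsn)
{
    if (wal->flushed_upto < lsn)
        wal->flushed_upto = lsn;
}

/*
 * Stand-in for FlushBuffer.  redirtied_during_write simulates a concurrent
 * hint-bit update landing while the write is in flight.
 * Returns true if the buffer ended up clean.
 */
static bool
toy_flush_buffer(ToyBuffer *buf, ToyWal *wal, bool redirtied_during_write)
{
    uint64_t recptr = buf->page_lsn;

    buf->flags &= ~FB_JUST_DIRTIED;     /* arm the re-dirty detector */

    if (buf->flags & FB_PERMANENT)
        toy_wal_flush(wal, recptr);     /* log first ... */

    /* ... then the data page.  The invariant the WAL rule demands: */
    assert(!(buf->flags & FB_PERMANENT) || wal->flushed_upto >= recptr);

    if (redirtied_during_write)
        buf->flags |= FB_DIRTY | FB_JUST_DIRTIED;

    /* Stand-in for TerminateBufferIO: clear dirty only if nobody re-dirtied. */
    if (!(buf->flags & FB_JUST_DIRTIED))
        buf->flags &= ~FB_DIRTY;

    return (buf->flags & FB_DIRTY) == 0;
}
```

A permanent buffer with page LSN 250 pushes the toy log's flushed position to at least 250 before the write, while a buffer without FB_PERMANENT leaves the log untouched. A buffer re-dirtied during the write stays dirty and is picked up by a later flush.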

Marking a page dirty

The write path begins when a modifier calls MarkBufferDirty (under an exclusive content lock and a pin). It just sets two flag bits via CAS:

// MarkBufferDirty — storage/buffer/bufmgr.c (condensed)
Assert(BufferIsPinned(buffer));
Assert(LWLockHeldByMeInMode(BufferDescriptorGetContentLock(bufHdr), LW_EXCLUSIVE));
old_buf_state = pg_atomic_read_u32(&bufHdr->state);   /* seed the CAS loop */
for (;;)
{
    if (old_buf_state & BM_LOCKED)
        old_buf_state = WaitBufHdrUnlocked(bufHdr);
    buf_state = old_buf_state | BM_DIRTY | BM_JUST_DIRTIED;
    if (pg_atomic_compare_exchange_u32(&bufHdr->state, &old_buf_state, buf_state))
        break;
}

The page’s LSN (which FlushBuffer later reads) is not set here. It is set by the caller as part of WAL logging: XLogInsert returns the record’s end LSN, and the modifying access method stamps that value into the page header with PageSetLSN (see postgres-xlog-wal.md). MarkBufferDirty only records that the bytes diverge from disk.
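
The same read, modify, compare-and-swap pattern can be tried outside the server with C11 atomics. The sketch below is an analogy under stated assumptions: it uses <stdatomic.h> rather than the pg_atomic_* wrappers, the MD_* bit positions are illustrative, and where the real code waits in WaitBufHdrUnlocked it simply re-reads the word.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Illustrative bit positions; only the "flags live in the state word" idea matters. */
#define MD_LOCKED        (1u << 22)
#define MD_DIRTY         (1u << 23)
#define MD_JUST_DIRTIED  (1u << 24)

/*
 * Set both dirty flags without taking the header lock.
 * Returns the state word that was in place just before our update.
 */
static uint32_t
toy_mark_dirty(_Atomic uint32_t *state)
{
    uint32_t old_state = atomic_load(state);

    for (;;)
    {
        uint32_t new_state;

        /* Someone holds the header lock: wait until the bit clears. */
        while (old_state & MD_LOCKED)
            old_state = atomic_load(state);

        new_state = old_state | MD_DIRTY | MD_JUST_DIRTIED;

        /*
         * On failure the current value is copied into old_state, so the
         * next iteration recomputes new_state from fresh data.
         */
        if (atomic_compare_exchange_weak(state, &old_state, new_state))
            return old_state;
    }
}
```

Because the update is a pure OR, calling it on an already-dirty word is idempotent, and the low bits (which would hold the pin and usage counts) pass through unchanged.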

Cleanup locks: removing tuples needs sole-pin

Some operations (VACUUM compaction, page defragmentation) must guarantee no other backend holds a pointer into the page. The README’s rule #5 requires an exclusive content lock and observing refcount == 1. LockBufferForCleanup loops: take the exclusive lock, check the pin count under the header spinlock; if it is 1, done; otherwise register as the sole BM_PIN_COUNT_WAITER, drop the lock, and sleep until another backend’s UnpinBuffer (or TerminateBufferIO) wakes it via WakePinCountWaiter. Only one waiter per buffer is supported, which is sufficient because PostgreSQL does not run two VACUUMs on one relation concurrently.
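
The handshake between the cleanup-lock taker and the unpinning backends can be written as a small single-threaded state machine. This is a sketch for reasoning about the protocol only: ToyFrame, toy_try_cleanup, and toy_unpin are hypothetical names, locking and real sleeping are omitted, and a "wakeup" is just a counter increment.

```c
#include <assert.h>
#include <stdbool.h>

typedef struct ToyFrame
{
    int  refcount;          /* shared pin count */
    bool pin_count_waiter;  /* models BM_PIN_COUNT_WAITER */
    int  waiter_id;         /* backend to wake, or -1 if none */
    int  wakeups;           /* wakeups delivered so far (for inspection) */
} ToyFrame;

/*
 * One attempt at a cleanup lock by backend me, which already holds a pin.
 * Returns true if the caller is the sole pinner.  Otherwise registers the
 * caller as the single allowed waiter and returns false; the caller would
 * then drop the content lock and sleep.
 */
static bool
toy_try_cleanup(ToyFrame *f, int me)
{
    assert(f->refcount >= 1);                             /* caller holds a pin */
    if (f->refcount == 1)
        return true;
    assert(!f->pin_count_waiter || f->waiter_id == me);   /* one waiter only */
    f->pin_count_waiter = true;
    f->waiter_id = me;
    return false;
}

/* Drop one pin; wake the waiter once its own pin is the only one left. */
static void
toy_unpin(ToyFrame *f)
{
    assert(f->refcount > 0);
    f->refcount--;
    if (f->pin_count_waiter && f->refcount == 1)
    {
        f->pin_count_waiter = false;
        f->waiter_id = -1;
        f->wakeups++;
    }
}
```

With three pins outstanding, the first attempt registers the waiter; the wakeup fires on the second unpin, when the count reaches 1, and the retry then succeeds.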

Local buffers for temporary tables

Temporary-table pages never go in the shared pool — they are session-private and not WAL-logged, so they live in a per-backend array managed by localbuf.c (LocalBufferDescriptors, LocalRefCount). LocalBufferAlloc and GetLocalVictimBuffer mirror the shared logic but with no locking (no spinlock, no LWLock — there is only one accessor) and a simple clock sweep over the local array. The same BufferDesc struct is reused, but its locks and most flag bits are inert. This keeps the temp-table fast path off the shared pool’s contention entirely.
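
Without concurrency, the clock sweep reduces to a plain loop over two small arrays. The following is a hedged sketch of that shape, not the localbuf.c code: ToyLocalPool, TOY_NLOCBUFFER, and the -1 return value (standing in for an error) are assumptions made for the example.

```c
#include <assert.h>

#define TOY_NLOCBUFFER 4

typedef struct ToyLocalPool
{
    int usage[TOY_NLOCBUFFER];    /* per-frame usage count */
    int pinned[TOY_NLOCBUFFER];   /* per-frame local pin count */
    int next_victim;              /* clock hand */
} ToyLocalPool;

/*
 * Return the index of the first unpinned frame whose usage count is zero,
 * decrementing non-zero usage counts as the hand passes over them.
 * Returns -1 if a full lap finds only pinned frames.
 */
static int
toy_local_victim(ToyLocalPool *pool)
{
    int trycounter = TOY_NLOCBUFFER;

    for (;;)
    {
        int idx = pool->next_victim;

        pool->next_victim = (pool->next_victim + 1) % TOY_NLOCBUFFER;

        if (pool->pinned[idx] == 0)
        {
            if (pool->usage[idx] == 0)
                return idx;
            pool->usage[idx]--;
            trycounter = TOY_NLOCBUFFER;   /* progress was made; keep sweeping */
        }
        else if (--trycounter == 0)
            return -1;                     /* every frame is pinned */
    }
}
```

No atomics or locks appear anywhere, which is the point: with a single accessor, the hand and the counters are ordinary integers.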

Source Walkthrough

Anchor on symbol names, not line numbers. Use git grep -n '<symbol>' src/backend/storage/buffer/ src/include/storage/ to relocate a symbol (the header-file symbols live under src/include/storage/); the line numbers in the position-hint table are only hints, valid for the commit recorded under updated:.

Shared-pool data structures (`storage/buf_internals.h`)

struct buftag (BufferTag) — the (spc, db, rel, fork, block) cache key; InitBufferTag, BufferTagsEqual, ClearBufferTag operate on it.
struct BufferDesc — per-frame metadata: tag, buf_id, atomic state, freeNext, content_lock, io_wref.
BufferDescPadded — cache-line-padded union wrapping BufferDesc.
BUF_REFCOUNT_BITS / BUF_USAGECOUNT_BITS / BUF_FLAG_BITS and the BUF_STATE_GET_* accessors — the packed-word layout.
BM_* flag macros — BM_DIRTY, BM_VALID, BM_TAG_VALID, BM_IO_IN_PROGRESS, BM_JUST_DIRTIED, BM_PIN_COUNT_WAITER, BM_PERMANENT, BM_LOCKED.
BM_MAX_USAGE_COUNT — clock-sweep usage cap (= 5).
BufMappingPartitionLock / BufTableHashPartition — partition-lock selection.
LockBufHdr / UnlockBufHdr — header spinlock via BM_LOCKED.

Pool initialization (`buf_init.c`)

BufferManagerShmemInit — allocate + link the three shared arrays.
BufferManagerShmemSize — size computation for the shmem segment.

Mapping hash (`buf_table.c`)

BufTableHashCode, BufTableLookup, BufTableInsert, BufTableDelete — the tag→buf_id shared hash table API.
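
The path from a tag to a partition number can be illustrated as follows. This is a sketch under stated assumptions: ToyTag and the helper names are hypothetical, and the FNV-1a hash is a stand-in chosen for brevity, not the hash function PostgreSQL uses. Only the final step, reducing the hash code modulo a power-of-two partition count, reflects what the section describes.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_NUM_PARTITIONS 128u   /* mirrors NUM_BUFFER_PARTITIONS */

_Static_assert((TOY_NUM_PARTITIONS & (TOY_NUM_PARTITIONS - 1)) == 0,
               "partition count must be a power of two");

typedef struct ToyTag
{
    uint32_t spc, db, rel, fork, block;   /* the (spc, db, rel, fork, block) key */
} ToyTag;

/* FNV-1a over the five fields, byte by byte.  A stand-in hash only. */
static uint32_t
toy_tag_hash(const ToyTag *tag)
{
    const uint32_t fields[5] = { tag->spc, tag->db, tag->rel, tag->fork, tag->block };
    uint32_t h = 2166136261u;

    for (int i = 0; i < 5; i++)
        for (int shift = 0; shift < 32; shift += 8)
        {
            h ^= (fields[i] >> shift) & 0xffu;
            h *= 16777619u;
        }
    return h;
}

/* With a power-of-two count, the modulo is the same as masking the low bits. */
static uint32_t
toy_partition(uint32_t hashcode)
{
    return hashcode % TOY_NUM_PARTITIONS;
}
```

Computing the hash once and reusing it for both the partition choice and the table probe keeps the two in agreement, so a lookup holding one partition lock only ever touches entries that hash to that partition.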

Replacement strategy (`freelist.c`)

StrategyGetBuffer — ring → free list → clock sweep victim selection.
ClockSweepTick — advance the atomic clock hand, handle wraparound.
StrategyFreeBuffer — return a clean frame to the free list.
BufferStrategyControl — shared nextVictimBuffer, free-list head/tail, buffer_strategy_lock, bgwriter notification.
GetAccessStrategy / GetAccessStrategyWithSize — build a ring.
GetBufferFromRing / AddBufferToRing / StrategyRejectBuffer — ring reuse and dirty-buffer rejection.
StrategySyncStart / StrategyNotifyBgWriter — bgwriter coordination.
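
The "advance the atomic clock hand, handle wraparound" step can be sketched with C11 atomics. This is an illustration rather than the freelist.c implementation: ToyStrategy and toy_clock_tick are hypothetical names, and the sketch omits the spinlock that the real code takes around the wraparound so that the hand and the pass counter change together.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define TOY_NBUFFERS 8u

typedef struct ToyStrategy
{
    _Atomic uint32_t next_victim;      /* clock hand; may run past TOY_NBUFFERS */
    _Atomic uint32_t complete_passes;  /* full laps completed */
} ToyStrategy;

/* Claim the next slot under the hand and return its index in [0, TOY_NBUFFERS). */
static uint32_t
toy_clock_tick(ToyStrategy *s)
{
    uint32_t victim = atomic_fetch_add(&s->next_victim, 1);

    if (victim >= TOY_NBUFFERS)
    {
        uint32_t original = victim;

        victim %= TOY_NBUFFERS;

        /*
         * Only the ticker that lands exactly on slot 0 folds the counter
         * back into range; everyone else just uses the modulo result.
         */
        if (victim == 0)
        {
            uint32_t expected = original + 1;

            for (;;)
            {
                uint32_t wrapped = expected % TOY_NBUFFERS;

                /* On failure, expected is refreshed and wrapped is recomputed. */
                if (atomic_compare_exchange_strong(&s->next_victim, &expected, wrapped))
                {
                    atomic_fetch_add(&s->complete_passes, 1);
                    break;
                }
            }
        }
    }
    return victim;
}
```

The common case is a single fetch-and-add with no lock; only one ticker per lap pays for the fold-back, which matches the claim that the strategy lock is off the hot path.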

Read / pin / lock / write path (`bufmgr.c`)

ReadBuffer / ReadBufferExtended / ReadBuffer_common — public entry.
BufferAlloc — lookup-or-fault; tag install under partition lock.
GetVictimBuffer — wrap selection with eviction + the strategy/WAL check.
PinBuffer / PinBuffer_Locked / UnpinBuffer / UnpinBufferNoOwner — reference counting; PrivateRefCountArray / PrivateRefCountEntry.
LockBuffer / ConditionalLockBuffer — content lock.
LockBufferForCleanup / ConditionalLockBufferForCleanup / WakePinCountWaiter — sole-pin cleanup protocol.
MarkBufferDirty — set BM_DIRTY | BM_JUST_DIRTIED.
FlushBuffer — the WAL rule: XLogFlush(BufferGetLSN(buf)) then smgrwrite.
StartBufferIO / TerminateBufferIO / WaitIO / AbortBufferIO — the BM_IO_IN_PROGRESS I/O latch protocol.
InvalidateBuffer / InvalidateVictimBuffer — drop a buffer’s mapping.
BufferSync — checkpoint-time flush of all dirty buffers.

Local (temp-table) buffers (`localbuf.c`)

LocalBufferAlloc, GetLocalVictimBuffer, MarkLocalBufferDirty, FlushLocalBuffer — the session-private, lockless mirror.

Position hints (as of 2026-06-05, REL_18 273fe94)

Symbol	File	Line
`struct buftag`	`storage/buf_internals.h`	106
`struct BufferDesc`	`storage/buf_internals.h`	258
`BM_MAX_USAGE_COUNT`	`storage/buf_internals.h`	87
`BufMappingPartitionLock`	`storage/buf_internals.h`	198
`BufferManagerShmemInit`	`storage/buffer/buf_init.c`	67
`BufferManagerShmemSize`	`storage/buffer/buf_init.c`	161
`BufTableHashCode`	`storage/buffer/buf_table.c`	78
`BufTableLookup`	`storage/buffer/buf_table.c`	90
`BufTableInsert`	`storage/buffer/buf_table.c`	118
`BufTableDelete`	`storage/buffer/buf_table.c`	148
`BufferStrategyControl`	`storage/buffer/freelist.c`	30
`ClockSweepTick`	`storage/buffer/freelist.c`	107
`StrategyGetBuffer`	`storage/buffer/freelist.c`	195
`StrategyFreeBuffer`	`storage/buffer/freelist.c`	362
`GetAccessStrategy`	`storage/buffer/freelist.c`	540
`GetBufferFromRing`	`storage/buffer/freelist.c`	736
`StrategyRejectBuffer`	`storage/buffer/freelist.c`	839
`ReadBuffer`	`storage/buffer/bufmgr.c`	758
`BufferAlloc`	`storage/buffer/bufmgr.c`	2009
`InvalidateBuffer`	`storage/buffer/bufmgr.c`	2187
`GetVictimBuffer`	`storage/buffer/bufmgr.c`	2354
`MarkBufferDirty`	`storage/buffer/bufmgr.c`	2956
`PinBuffer`	`storage/buffer/bufmgr.c`	3076
`PinBuffer_Locked`	`storage/buffer/bufmgr.c`	3186
`UnpinBuffer`	`storage/buffer/bufmgr.c`	3268
`FlushBuffer`	`storage/buffer/bufmgr.c`	4293
`LockBuffer`	`storage/buffer/bufmgr.c`	5609
`LockBufferForCleanup`	`storage/buffer/bufmgr.c`	5689
`WaitIO`	`storage/buffer/bufmgr.c`	5968
`StartBufferIO`	`storage/buffer/bufmgr.c`	6047
`TerminateBufferIO`	`storage/buffer/bufmgr.c`	6104
`LocalBufferAlloc`	`storage/buffer/localbuf.c`	118
`GetLocalVictimBuffer`	`storage/buffer/localbuf.c`	223

Source verification (as of 2026-06-05)

Each entry is a fact about the current source at commit 273fe94, readable without external materials. Open questions follow as the curator’s recorded gaps.

Verified facts

The buffer state word is 18-bit refcount + 4-bit usagecount + 10-bit flags, with a compile-time assert that they sum to 32. Verified in buf_internals.h (BUF_REFCOUNT_BITS/BUF_USAGECOUNT_BITS/BUF_FLAG_BITS and StaticAssertDecl(... == 32)). The flag bits start at bit 22 (BM_LOCKED = 1U << 22).
BM_MAX_USAGE_COUNT is 5, hard-coded, not a GUC. In buf_internals.h, with a comment explaining the accuracy/speed trade-off and a static assert that it fits in BUF_USAGECOUNT_BITS. Tuning requires a recompile.
NUM_BUFFER_PARTITIONS is 128. Defined in storage/lwlock.h (not in the buffer files themselves), consumed via BufTableHashPartition. Must be a power of two.
The clock hand is a single atomic nextVictimBuffer advanced by pg_atomic_fetch_add_u32; the buffer_strategy_lock spinlock is only taken on a clock wraparound or to pop the free list. Verified in ClockSweepTick and StrategyGetBuffer (freelist.c). Concurrent sweeps can return buffers slightly out of order — the in-code comment acknowledges this as harmless.
The WAL rule is enforced in FlushBuffer as if (buf_state & BM_PERMANENT) XLogFlush(recptr), where recptr = BufferGetLSN(buf) is read under the buffer header spinlock, before smgrwrite. Verified in FlushBuffer (bufmgr.c). Unlogged relations skip the flush by design; the in-code comment cites the fake-LSN GiST hazard as the reason the BM_PERMANENT guard is mandatory, not merely an optimization.
Ring sizes: BAS_BULKREAD = 256 KB base (grown by IO concurrency), BAS_BULKWRITE = 16 MB, BAS_VACUUM = 2048 KB; all capped at 1/8 of shared_buffers. Verified in GetAccessStrategy and GetAccessStrategyWithSize (freelist.c). BAS_NORMAL returns NULL (no ring). The 2048 KB in GetAccessStrategy is only the default for BAS_VACUUM: VACUUM’s effective ring size comes from the vacuum_buffer_usage_limit GUC, which the call site passes to GetAccessStrategyWithSize.
Only BAS_BULKREAD rejects a dirty ring buffer rather than flush it. Verified in StrategyRejectBuffer (freelist.c): it returns false for any type other than BAS_BULKREAD. So vacuum and bulk-write rings do pay WAL flushes when reusing dirty pages, matching the README.
A backend tracks its first pin of a buffer in shared memory and further pins privately, in an 8-entry array (REFCOUNT_ARRAY_ENTRIES) plus an overflow hash. Verified in bufmgr.c (PrivateRefCountEntry, PinBuffer). ResourceOwnerRememberBuffer ties each pin to a resource owner for cleanup on abort.
LockBufferForCleanup supports exactly one pin-count waiter per buffer. Verified in bufmgr.c and stated in the README; sufficient because concurrent VACUUMs on one relation are disallowed.
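
The 18/4/10 layout from the first verified fact can be checked with a few masks. The sketch below is standalone C with hypothetical SW_* names that mirror the BUF_* macros; the toy_pin helper, which bumps the refcount and the usage count (up to the cap) in one word, is an assumption added to show why packing both counters together is convenient.

```c
#include <assert.h>
#include <stdint.h>

#define SW_REFCOUNT_BITS    18
#define SW_USAGECOUNT_BITS  4
#define SW_FLAG_BITS        10
#define SW_MAX_USAGE_COUNT  5

_Static_assert(SW_REFCOUNT_BITS + SW_USAGECOUNT_BITS + SW_FLAG_BITS == 32,
               "state word must be exactly 32 bits");
_Static_assert(SW_MAX_USAGE_COUNT < (1 << SW_USAGECOUNT_BITS),
               "usage cap must fit in the usage-count field");

#define SW_REFCOUNT_ONE     1u
#define SW_REFCOUNT_MASK    ((1u << SW_REFCOUNT_BITS) - 1)
#define SW_USAGECOUNT_SHIFT SW_REFCOUNT_BITS
#define SW_USAGECOUNT_ONE   (1u << SW_USAGECOUNT_SHIFT)
#define SW_USAGECOUNT_MASK  (((1u << SW_USAGECOUNT_BITS) - 1) << SW_USAGECOUNT_SHIFT)
#define SW_FLAG_SHIFT       (SW_REFCOUNT_BITS + SW_USAGECOUNT_BITS)
#define SW_FLAG_MASK        (((1u << SW_FLAG_BITS) - 1) << SW_FLAG_SHIFT)
#define SW_LOCKED           (1u << 22)    /* first flag bit */

static uint32_t
toy_refcount(uint32_t state)
{
    return state & SW_REFCOUNT_MASK;
}

static uint32_t
toy_usagecount(uint32_t state)
{
    return (state & SW_USAGECOUNT_MASK) >> SW_USAGECOUNT_SHIFT;
}

/* Pin: bump the refcount and, below the cap, the usage count, in one word. */
static uint32_t
toy_pin(uint32_t state)
{
    state += SW_REFCOUNT_ONE;
    if (toy_usagecount(state) < SW_MAX_USAGE_COUNT)
        state += SW_USAGECOUNT_ONE;
    return state;
}
```

Seven pins on an empty word leave a refcount of 7 and a usage count saturated at 5, with every flag bit still clear; the three masks together cover all 32 bits with no overlap.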

Open questions

AIO interaction with eviction. REL_18 adds io_wref to BufferDesc and a separate async-I/O subsystem (storage/aio). The read path can be asynchronous, and TerminateBufferIO(..., release_aio) releases an AIO-held pin. How GetVictimBuffer and LockBufferForCleanup interact with a buffer whose read is still in flight is only partially traced here. Investigation path: read storage/aio/README.md and the WaitReadBuffers/StartReadBuffers batch path; cross-ref a future postgres-aio.md.
The free list’s long-term role. The README notes that cold but still-valid pages could be placed on the free list, but “the current algorithm never does that”; the list is only ever populated by genuinely empty frames (for example after relation drops). Under steady state the list is empty and every allocation runs the clock sweep. Is the free list effectively dead weight in a warm cache, or does it matter for drop-heavy workloads? Investigation path: instrument StrategyFreeBuffer call sites.
Bgwriter / checkpointer division of labor. BufferSync (checkpoint) and BgBufferSync (background writer) both write dirty buffers but on different schedules and triggers. This doc covers the per-buffer FlushBuffer mechanism, not the policy that decides when each runs. Investigation path: a dedicated postgres-checkpointer-bgwriter.md.

Beyond PostgreSQL — Comparative Designs & Research Frontiers

Pointers, not analysis. Each bullet is a starting handle for a follow-up doc; depth here is intentionally shallow.

CUBRID’s page buffer. CUBRID caches pages in pgbuf with a similar pin/latch separation and its own LRU/clock hybrid, but its old-version storage is out-of-place (undo log), so the buffer manager’s interaction with MVCC differs from PostgreSQL’s in-place heap (see cubrid-mvcc.md). A side-by-side of the two replacement policies and WAL-flush chokepoints would isolate what is intrinsic to steal/no-force from what is PostgreSQL-specific.
CLOCK variants beyond second-chance. PostgreSQL’s fixed BM_MAX_USAGE_COUNT=5 CLOCK approximates LRU crudely. CLOCK-Pro (Jiang, Chen & Zhang, USENIX 2005), LRU-K (O’Neil et al., SIGMOD 1993), 2Q (Johnson & Shasha, VLDB 1994), and ARC (Megiddo & Modha, FAST 2003) all improve scan resistance and frequency awareness. PG’s ring-buffer strategies are a hand-built scan-resistance patch on top of plain CLOCK; how much a smarter base policy would subsume the rings is an open empirical question.
The double-buffering problem. PostgreSQL relies on the OS page cache under shared_buffers, so a hot page can be cached twice. Direct I/O and the REL_18 async-I/O subsystem (storage/aio) move toward bypassing the OS cache; Database Internals ch. 4 discusses the trade-off. A measured study of double-buffering cost on PG would anchor the shared_buffers sizing folklore.
Buffer management at high core counts. The packed-state CAS path and partitioned mapping lock are PG’s answers to the pre-8.1 BufMgrLock bottleneck. Whether they hold at hundreds of cores connects to the scalable-lock-manager line of work (dbms-papers/scalable-lock-manager.md) and to the architecture-overview’s shared-memory thesis.
WAL-before-flush vs. shadow paging. ARIES steal/no-force (dbms-papers/aries.md) is one point in the recovery design space; shadow-paging engines (and copy-on-write stores like LMDB) avoid the WAL-before-flush coupling entirely at the cost of write amplification and fragmentation. The comparison clarifies why PG’s buffer manager carries the XLogFlush-before-smgrwrite chokepoint at all.

Sources

In-tree README (first-class design doc)

src/backend/storage/buffer/README — buffer access rules (pins vs content locks), internal locking (BufMappingLock partitions, buffer_strategy_lock, per-header spinlock, BM_IO_IN_PROGRESS), the clock-sweep algorithm, the ring strategies, and the background writer.

PostgreSQL source (under `/data/hgryoo/references/postgres`, REL_18 273fe94)

src/backend/storage/buffer/bufmgr.c
src/backend/storage/buffer/freelist.c
src/backend/storage/buffer/buf_init.c
src/backend/storage/buffer/buf_table.c
src/backend/storage/buffer/localbuf.c
src/include/storage/buf_internals.h
src/include/storage/bufmgr.h
src/include/storage/lwlock.h (for NUM_BUFFER_PARTITIONS)

Textbook chapters (under `knowledge/research/dbms-general/`)

Database Internals (Petrov), Ch. 4 §“Buffer Management” (≈ line 3419), §“Buffer manager” (≈ line 791) — page cache role, eviction, force/steal, the double-buffering trade-off.

Papers (under `knowledge/research/dbms-papers/`)

ARIES (Mohan et al., 1992) — aries.md. The steal/no-force recovery method and the WAL rule that FlushBuffer enforces.

Cross-references (sibling module docs)

postgres-smgr-md.md — the storage manager FlushBuffer writes through.
postgres-xlog-wal.md — WAL generation, the page LSN, and XLogFlush.
postgres-page-layout.md — what is inside the 8 KB page a buffer holds.
postgres-architecture-overview.md — the fixed-size shared-memory segment the pool is carved from (Axis 2) and the WAL spine (Axis 3).