PostgreSQL Buffer Manager — Shared Pool, Clock-Sweep Eviction, and the WAL-Before-Flush Rule
Contents:
- Theoretical Background
- Common DBMS Design
- PostgreSQL’s Approach
- Source Walkthrough
- Source verification (as of 2026-06-05)
- Beyond PostgreSQL — Comparative Designs & Research Frontiers
- Sources
Theoretical Background
A database is too large to live in memory, so the engine keeps the
authoritative copy on disk (in PostgreSQL, in the per-fork segment files
managed by the storage manager — see postgres-smgr-md.md) and caches a
working subset of fixed-size pages in RAM. The component that owns that
cache is the buffer manager (a/k/a buffer pool or page cache).
Database Internals (Petrov, ch. 4 §“Buffer Management”) frames its job
precisely: it “is responsible for caching pages read from disk in memory”
and serves “as an intermediary between persistent storage (disk) and the
rest of the [engine],” so that higher layers address data by page ID and
never touch the file directly. Every read of a tuple, every index probe,
every WAL-logged modification first goes through a buffer.
The buffer manager sits at the intersection of two hard constraints, and the design space is defined by how it answers each:
- The cache is finite, so it must choose what to evict. When a backend needs a page that is not resident and no frame is free, some resident page must be ejected. The replacement policy (which page to evict) is the first axis. True LRU requires per-access bookkeeping that does not scale across many CPUs sharing one pool, so real engines use approximations: CLOCK, CLOCK-Pro, LRU-K, 2Q, ARC. The textbook calls out “evicting pages that are still in use, called page faults,” and that cost is what the policy is trying to minimize.
- Cached pages are modified in place, so eviction interacts with recovery. A page made dirty in the cache and not yet written back embodies committed (or uncommitted) state that must survive a crash. The second axis is the force/steal policy from the recovery literature ([HABERMANN]/[MOHAN]; see dbms-papers/aries.md):
  - Force vs no-force: must a transaction’s dirty pages be flushed to disk before it is allowed to commit (force), or may they stay in the cache and be written lazily (no-force)?
  - Steal vs no-steal: may the buffer manager evict (and write to disk) a page dirtied by a transaction that has not yet committed (steal), or must it pin such pages until commit (no-steal)?
ARIES — the recovery method PostgreSQL follows — chooses the most performant and most demanding combination: steal + no-force. Steal means an uncommitted transaction’s changes can reach disk, so recovery must be able to undo them; no-force means a committed transaction’s changes may not have reached disk, so recovery must be able to redo them. Both directions are made sound by one invariant — the Write-Ahead Logging (WAL) rule: the log record describing a change must be on stable storage before the changed data page is. The buffer manager is the component that physically enforces the no-force/steal half of this contract, by refusing to write any data page to disk until the WAL is flushed up to that page’s LSN. WAL generation and flushing themselves live in
postgres-xlog-wal.md; this document covers the enforcement point.
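
The WAL rule reduces to one ordering check on log positions. The following is a minimal sketch, not PostgreSQL code: the `Lsn` type, the `ToyWal` and `ToyPage` structs, and the helpers `wal_flush_to` and `write_data_page` are hypothetical names introduced only for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t Lsn;            /* hypothetical: byte position in the log */

typedef struct ToyWal {
    Lsn inserted;                /* end of the log records generated so far */
    Lsn flushed;                 /* end of the log known to be on stable storage */
} ToyWal;

typedef struct ToyPage {
    Lsn  lsn;                    /* LSN of the last log record that changed the page */
    bool on_disk;                /* has this version of the page been written out? */
} ToyPage;

/* Make the log durable at least up to 'upto' (never past what was inserted). */
static void wal_flush_to(ToyWal *wal, Lsn upto)
{
    assert(upto <= wal->inserted);
    if (wal->flushed < upto)
        wal->flushed = upto;
}

/* The WAL rule: the log must be durable up to the page's LSN before the page is written. */
static void write_data_page(ToyWal *wal, ToyPage *page)
{
    wal_flush_to(wal, page->lsn);        /* log first ... */
    assert(wal->flushed >= page->lsn);   /* ... so this invariant holds ... */
    page->on_disk = true;                /* ... and only then the data page */
}
```

Writing a page whose LSN is already covered by an earlier flush costs no extra log I/O in this model, which is why the check is cheap in the common case.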
Two further textbook points shape the implementation. First, the buffer
manager is a shared structure in a multi-process engine, so access to its
metadata must be cheap and concurrent — a single global lock (PostgreSQL’s
pre-8.1 BufMgrLock) is a notorious scalability bottleneck. Second, the
unit of caching is the page (PostgreSQL BLCKSZ, default 8 KB; see
postgres-page-layout.md for what is inside a page), identified by a
page ID that the cache uses as the hash key. Everything below is
PostgreSQL turning these two axes — replacement and force/steal — and the
two constraints — concurrency and page-addressing — into concrete data
structures.
Common DBMS Design
The textbook gives the model; this section names the engineering conventions that nearly every disk-resident buffer manager (PostgreSQL, Oracle, InnoDB, SQL Server, CUBRID) adopts in some form. PostgreSQL’s specific choices in §“PostgreSQL’s Approach” read as one set of dials within this shared space.
A descriptor array beside a page array
The pool is two parallel arrays: a contiguous block of page-sized frames holding the actual bytes, and a parallel array of buffer descriptors, small metadata records (one per frame) holding the page’s identity, its dirty/valid flags, a pin count, and the replacement-policy bookkeeping. A buffer is named by its index into these arrays. Keeping the hot metadata (the descriptor) physically separate from the cold 8 KB payload lets the engine scan, lock, and CAS descriptors without dragging page data through the CPU cache.
A hash table from page-ID to frame
Given a page ID, the engine must answer “is this page resident, and if so in which frame?” in roughly O(1). Every engine keeps a mapping hash table keyed on the page identity (tablespace/database/relation/fork/block, or some packing thereof) and valued by the frame index. To stop this table from becoming the new global bottleneck, the lock that protects it is striped into partitions keyed on the hash of the tag, so two backends touching unrelated pages rarely contend.
Pin counts vs content locks — two independent guards
Section titled “Pin counts vs content locks — two independent guards”There are two orthogonal things a backend can want from a buffer, and mature engines separate them:
- A pin (reference count) says “do not evict this frame out from under me.” It guards the frame’s identity. Pins can be held for a long time (a sequential scan holds one for the whole page) and are cheap.
- A content lock (shared/exclusive) says “do not let anyone read/write the bytes while I am writing/reading them.” It guards the page contents. Content locks are short-term.
Conflating the two forces every reader to take a heavyweight lock; separating them lets a reader pin-and-share-lock, determine visibility, drop the content lock, and keep reading tuple data under the pin alone.
An approximate-LRU replacement hand
Exact LRU needs a global list reshuffled on every access, which is unacceptable under concurrency. The dominant approximation is CLOCK (or second chance): frames are arranged in a circle, each carrying a small usage counter; a “hand” sweeps the circle, and at each frame it either evicts (usage 0, unpinned) or decrements the usage counter and moves on. Access raises the counter and sweeping lowers it, so recently/frequently used pages survive a sweep. The counter’s maximum is a tunable trade-off between LRU fidelity and sweep cost.
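
To make the sweep concrete, here is a small single-threaded sketch of CLOCK with a saturating usage counter. It is a toy under stated assumptions: `ToyFrame`, `ToyClock`, `toy_touch`, and `toy_pick_victim` are hypothetical names, the pool has four frames, and no locking is modeled.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_NFRAMES   4
#define TOY_MAX_USAGE 5

typedef struct ToyFrame {
    int pins;    /* > 0 means some backend is using the frame: never evict */
    int usage;   /* raised on access, lowered by the sweeping hand */
} ToyFrame;

typedef struct ToyClock {
    ToyFrame frames[TOY_NFRAMES];
    size_t   hand;   /* next frame the sweep will inspect */
} ToyClock;

/* Record an access: saturating increment of the usage counter. */
static void toy_touch(ToyClock *c, size_t i)
{
    if (c->frames[i].usage < TOY_MAX_USAGE)
        c->frames[i].usage++;
}

/*
 * Sweep until an unpinned frame with usage 0 is found.
 * Returns its index, or -1 if no victim turns up within the step budget
 * (which happens when every frame stays pinned).
 */
static int toy_pick_victim(ToyClock *c)
{
    /* An unpinned frame needs at most TOY_MAX_USAGE decrements plus one visit. */
    size_t budget = TOY_NFRAMES * (TOY_MAX_USAGE + 1);

    while (budget-- > 0) {
        size_t i = c->hand;
        ToyFrame *f = &c->frames[i];

        c->hand = (c->hand + 1) % TOY_NFRAMES;   /* advance the hand */
        if (f->pins > 0)
            continue;                            /* in use: skip */
        if (f->usage > 0) {
            f->usage--;                          /* second chance */
            continue;
        }
        return (int) i;                          /* unpinned, usage 0: victim */
    }
    return -1;
}
```

The step budget also shows why a small counter maximum matters: it bounds how long the hand can travel before a victim must appear among unpinned frames.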
A small free list for genuinely empty frames
At startup every frame is empty; as relations are dropped, frames become empty again. A free list of known-empty frames lets allocation skip the sweep entirely in those cases. The free list is a fast path layered over the clock; it is not the replacement policy.
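
A free list of this kind is usually threaded through the frames themselves, with each frame storing the index of the next free one. The sketch below assumes that layout; `ToyPool`, `toy_pop_free`, `toy_push_free`, and the two sentinel values are hypothetical and no locking is shown.

```c
#include <assert.h>

#define TOY_NFRAMES      4
#define TOY_END_OF_LIST  (-1)   /* list terminator, also "no free frame" */
#define TOY_NOT_IN_LIST  (-2)   /* frame is currently not on the free list */

typedef struct ToyPool {
    int free_next[TOY_NFRAMES];  /* per-frame link to the next free frame */
    int first_free;              /* head of the list, or TOY_END_OF_LIST */
} ToyPool;

/* At startup every frame is empty, so chain them all together. */
static void toy_pool_init(ToyPool *p)
{
    for (int i = 0; i < TOY_NFRAMES; i++)
        p->free_next[i] = (i + 1 < TOY_NFRAMES) ? i + 1 : TOY_END_OF_LIST;
    p->first_free = 0;
}

/* Pop the head; TOY_END_OF_LIST tells the caller to fall back to the sweep. */
static int toy_pop_free(ToyPool *p)
{
    int frame = p->first_free;

    if (frame == TOY_END_OF_LIST)
        return TOY_END_OF_LIST;
    p->first_free = p->free_next[frame];
    p->free_next[frame] = TOY_NOT_IN_LIST;
    return frame;
}

/* Return an emptied frame to the list; a frame already listed is left alone. */
static void toy_push_free(ToyPool *p, int frame)
{
    if (p->free_next[frame] != TOY_NOT_IN_LIST)
        return;
    p->free_next[frame] = p->first_free;
    p->first_free = frame;
}
```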
Bulk-operation rings to avoid cache pollution
A one-shot large scan (a seqscan of a table bigger than the pool, a
VACUUM, a COPY) would, under naive CLOCK, evict the entire working set to
cache pages it will never look at again. The convention is a ring buffer
(a/k/a buffer ring / strategy): a small fixed set of frames the bulk
operation reuses among itself, leaving the rest of the pool untouched.
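
The ring idea can be sketched in a few lines. This toy assumes a four-slot ring and a counter standing in for "take a frame from the main pool"; `ToyRing`, `toy_ring_init`, and `toy_ring_next` are hypothetical names, and the case where another backend steals a ring frame is ignored.

```c
#include <assert.h>

#define TOY_RING_SIZE 4
#define TOY_EMPTY     (-1)

typedef struct ToyRing {
    int slots[TOY_RING_SIZE];  /* frame numbers owned by the bulk operation, or TOY_EMPTY */
    int current;               /* slot handed out most recently */
} ToyRing;

static void toy_ring_init(ToyRing *r)
{
    for (int i = 0; i < TOY_RING_SIZE; i++)
        r->slots[i] = TOY_EMPTY;
    r->current = TOY_RING_SIZE - 1;   /* so the first request uses slot 0 */
}

/*
 * Return the frame for the next page of the bulk operation.
 * A frame is taken from the main pool only while the ring still has an
 * empty slot; after that the operation recycles its own frames.
 */
static int toy_ring_next(ToyRing *r, int *next_pool_frame)
{
    r->current = (r->current + 1) % TOY_RING_SIZE;
    if (r->slots[r->current] == TOY_EMPTY)
        r->slots[r->current] = (*next_pool_frame)++;
    return r->slots[r->current];
}
```

However many pages the operation touches, it consumes at most `TOY_RING_SIZE` frames from the shared pool in this model.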
The WAL-before-flush enforcement point
Every steal/no-force engine puts one chokepoint on the path from “dirty page in cache” to “page bytes on disk”: before issuing the write, force the log to disk up to the page’s LSN. This single rule is what makes both undo (steal) and redo (no-force) recovery sound. The buffer manager is where it is enforced because the buffer manager is the only component that writes data pages.
Theory ↔ PostgreSQL mapping
| Theory / convention | PostgreSQL name |
|---|---|
| Page-sized cache frame | BufferBlocks array, BLCKSZ bytes each |
| Per-frame metadata record | BufferDesc (in BufferDescriptors[]) |
| Page identity / cache key | BufferTag = (spcOid, dbOid, relNumber, forkNum, blockNum) |
| Dirty / valid / pinned flags + counters | BufferDesc.state — one atomic uint32 (18b refcount, 4b usagecount, 10b flags) |
| Page-ID → frame hash table | buf_table.c shared hash, BufTableLookup/Insert/Delete |
| Striped lock on the hash | BufMappingPartitionLock — NUM_BUFFER_PARTITIONS = 128 |
| Pin / reference count | PinBuffer / UnpinBuffer, backed by PrivateRefCountArray |
| Content (read/write) lock | LWLock content_lock per descriptor, via LockBuffer |
| CLOCK usage counter | BUF_USAGECOUNT, max BM_MAX_USAGE_COUNT = 5 |
| CLOCK hand | StrategyControl->nextVictimBuffer, advanced by ClockSweepTick |
| Free list of empty frames | StrategyControl->firstFreeBuffer + BufferDesc.freeNext |
| Victim selection | StrategyGetBuffer → wrapped by GetVictimBuffer |
| Bulk-operation ring | BufferAccessStrategy (BAS_BULKREAD/BAS_BULKWRITE/BAS_VACUUM) |
| WAL-before-flush rule | FlushBuffer → XLogFlush(BufferGetLSN(buf)) |
PostgreSQL’s Approach
PostgreSQL’s buffer pool is a fixed-size region carved out of the one shared
memory segment at postmaster startup (see postgres-architecture-overview.md
§“Axis 2” — the pool does not grow at runtime). NBuffers (the
shared_buffers GUC, in BLCKSZ units) frames are allocated once, and three
parallel shared arrays are built beside them. The defining choices are:
(1) all of a frame’s mutable header state — pin count, usage count, and ten
flag bits — is packed into one 32-bit atomic word so the common
pin/unpin operations need no spinlock; (2) replacement is a clock-sweep
over that usage counter with a free-list fast path; (3) bulk operations run
inside small ring buffers so they cannot blow out the cache; and (4) the
write path enforces the WAL rule unconditionally for permanent relations.
The three shared arrays and their initialization
BufferManagerShmemInit (in buf_init.c) creates the descriptors, the page
blocks, the per-buffer I/O condition variables, and a checkpoint-sort scratch
array, then links every descriptor into one initial free list:

```c
// BufferManagerShmemInit - storage/buffer/buf_init.c (condensed)
BufferDescriptors = (BufferDescPadded *)
    ShmemInitStruct("Buffer Descriptors",
                    NBuffers * sizeof(BufferDescPadded), &foundDescs);
BufferBlocks = (char *)
    TYPEALIGN(PG_IO_ALIGN_SIZE,
              ShmemInitStruct("Buffer Blocks",
                              NBuffers * (Size) BLCKSZ + PG_IO_ALIGN_SIZE,
                              &foundBufs));
// ... condensed: BufferIOCVArray, CkptBufferIds ...
for (i = 0; i < NBuffers; i++)
{
    BufferDesc *buf = GetBufferDescriptor(i);

    ClearBufferTag(&buf->tag);
    pg_atomic_init_u32(&buf->state, 0);
    buf->buf_id = i;
    buf->freeNext = i + 1;              /* link all buffers as unused */
    LWLockInitialize(BufferDescriptorGetContentLock(buf),
                     LWTRANCHE_BUFFER_CONTENT);
    ConditionVariableInit(BufferDescriptorGetIOCV(buf));
}
GetBufferDescriptor(NBuffers - 1)->freeNext = FREENEXT_END_OF_LIST;
StrategyInitialize(!foundDescs);        /* builds the mapping hash + control */
```

The descriptor array is BufferDescPadded: each BufferDesc is padded to a
64-byte cache line (BUFFERDESC_PAD_TO_SIZE) so that two CPUs spinning on two
adjacent descriptors do not false-share a cache line. The payload array
(BufferBlocks) is aligned to PG_IO_ALIGN_SIZE so direct/async I/O can DMA
into it without straddling.
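
The padding trick can be shown in isolation. This is a sketch with a made-up descriptor (`ToyDesc`) and an assumed 64-byte line size; only the union-with-a-char-array technique is the point. A real allocator must also align the start of the array to the line size, which this toy does not do.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_PAD_TO_SIZE 64   /* assumed cache-line size, as in the text */

typedef struct ToyDesc {
    uint32_t tag[5];     /* stand-in for the five-field page identity */
    int      buf_id;
    uint32_t state;
    int      free_next;
} ToyDesc;

/*
 * A union is as large as its largest member, so every array element
 * occupies exactly TOY_PAD_TO_SIZE bytes regardless of sizeof(ToyDesc).
 */
typedef union ToyDescPadded {
    ToyDesc desc;
    char    pad[TOY_PAD_TO_SIZE];
} ToyDescPadded;

_Static_assert(sizeof(ToyDesc) <= TOY_PAD_TO_SIZE,
               "descriptor must fit in one cache line");
_Static_assert(sizeof(ToyDescPadded) == TOY_PAD_TO_SIZE,
               "padded descriptor is exactly one line");

static ToyDescPadded toy_descriptors[16];

/* Index the padded array but hand back the inner descriptor. */
static ToyDesc *toy_get_descriptor(int id)
{
    return &toy_descriptors[id].desc;
}
```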
```mermaid
flowchart LR
  subgraph SHM["one shared-memory segment (sized once at boot)"]
    direction TB
    DESC["BufferDescriptors[]<br/>NBuffers x BufferDescPadded<br/>(64-byte cache-line padded)"]
    BLK["BufferBlocks[]<br/>NBuffers x BLCKSZ (8KB)"]
    CV["BufferIOCVArray[]<br/>per-frame I/O condition vars"]
    HASH["buf_table hash<br/>BufferTag -> buf_id"]
    CTL["BufferStrategyControl<br/>nextVictimBuffer, freelist head/tail"]
  end
  DESC -- "buf_id indexes" --> BLK
  HASH -- "resolves tag to" --> DESC
  CTL -- "clock hand sweeps" --> DESC
```
Figure 1 — The buffer pool’s shared arrays. BufferDescriptors[] (metadata) runs parallel to BufferBlocks[] (the 8 KB payloads); a buf_id indexes both. The buf_table hash resolves a BufferTag to a buf_id, and BufferStrategyControl holds the clock hand and free-list pointers. All of it is fixed-size, created once by BufferManagerShmemInit.
The buffer tag — what identifies a page
A frame’s identity is a BufferTag: the five fields that locate a block
without consulting any catalog (important, because the backend flushing a
page may not even believe the relation is visible yet):

```c
// struct buftag - storage/buf_internals.h
typedef struct buftag
{
    Oid           spcOid;     /* tablespace oid */
    Oid           dbOid;      /* database oid */
    RelFileNumber relNumber;  /* relation file number */
    ForkNumber    forkNum;    /* fork number (main / fsm / vm / init) */
    BlockNumber   blockNum;   /* block number within the fork */
} BufferTag;
```

The tag is the hash key. BufTableHashCode hashes it; the low bits of that
hash pick one of NUM_BUFFER_PARTITIONS (128) partition locks, so lookups on
unrelated pages take different BufMappingPartitionLocks and do not serialize:

```c
// BufMappingPartitionLock / BufTableHashPartition - storage/buf_internals.h
static inline uint32
BufTableHashPartition(uint32 hashcode)
{
    return hashcode % NUM_BUFFER_PARTITIONS;
}

static inline LWLock *
BufMappingPartitionLock(uint32 hashcode)
{
    return &MainLWLockArray[BUFFER_MAPPING_LWLOCK_OFFSET +
                            BufTableHashPartition(hashcode)].lock;
}
```

When more than one partition must be held at once, they are always taken in
partition-number order to avoid deadlock (the in-tree README is explicit about
this). The mapping hash itself is sized for NBuffers + NUM_BUFFER_PARTITIONS
entries — slightly more than the pool, because BufferAlloc inserts the new
tag before deleting the old one and that can happen concurrently in each
partition.
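
The ordering rule for holding two partition locks can be sketched as a pure function that decides which partitions to lock and in what order. The names `toy_partition` and `toy_lock_order` are hypothetical; the sketch only computes the order and takes no real locks.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_NUM_PARTITIONS 128   /* mirrors the partition count given in the text */

static uint32_t toy_partition(uint32_t hashcode)
{
    return hashcode % TOY_NUM_PARTITIONS;
}

/*
 * Decide which partition locks to take, and in what order, for two tags.
 * Writes 1 or 2 partition numbers into order[] and returns how many.
 * Always locking the lower-numbered partition first means two backends
 * can never each hold one lock while waiting for the other.
 */
static int toy_lock_order(uint32_t hash_a, uint32_t hash_b, uint32_t order[2])
{
    uint32_t pa = toy_partition(hash_a);
    uint32_t pb = toy_partition(hash_b);

    if (pa == pb) {
        order[0] = pa;           /* same partition: one lock is enough */
        return 1;
    }
    order[0] = pa < pb ? pa : pb;
    order[1] = pa < pb ? pb : pa;
    return 2;
}
```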
The packed state word
The single most important micro-design decision is that the descriptor’s mutable header is one atomic word:

```c
// BufferDesc + state layout - storage/buf_internals.h (condensed)
// state = 18 bits refcount | 4 bits usagecount | 10 bits flags
#define BUF_REFCOUNT_BITS 18
#define BUF_USAGECOUNT_BITS 4
#define BUF_FLAG_BITS 10

typedef struct BufferDesc
{
    BufferTag   tag;                    /* ID of page contained in buffer */
    int         buf_id;                 /* buffer's index number (from 0) */
    pg_atomic_uint32 state;             /* flags + refcount + usagecount */
    int         wait_backend_pgprocno;  /* backend waiting for sole pin */
    int         freeNext;               /* link in freelist chain */
    PgAioWaitRef io_wref;               /* set iff async I/O is in progress */
    LWLock      content_lock;           /* lock for the buffer *contents* */
} BufferDesc;
```

The ten flag bits encode the page’s lifecycle:

```c
// buffer flags - storage/buf_internals.h
#define BM_LOCKED               (1U << 22)  /* buffer header is spinlocked */
#define BM_DIRTY                (1U << 23)  /* data needs writing */
#define BM_VALID                (1U << 24)  /* data is valid */
#define BM_TAG_VALID            (1U << 25)  /* tag is assigned (has a hash entry) */
#define BM_IO_IN_PROGRESS       (1U << 26)  /* read or write in progress */
#define BM_IO_ERROR             (1U << 27)  /* previous I/O failed */
#define BM_JUST_DIRTIED         (1U << 28)  /* dirtied since write started */
#define BM_PIN_COUNT_WAITER     (1U << 29)  /* have waiter for sole pin */
#define BM_CHECKPOINT_NEEDED    (1U << 30)  /* must write for checkpoint */
#define BM_PERMANENT            (1U << 31)  /* permanent (WAL-logged) buffer */
```

Because refcount, usagecount, and flags share one word, a backend can pin a
buffer (increment refcount, bump usagecount, test BM_VALID) in a single
compare-and-swap loop — no spinlock taken in the common case. The buffer
header spinlock is itself just the BM_LOCKED flag bit: LockBufHdr
spins setting it, and UnlockBufHdr clears it with a plain write. The
header spinlock guards complex multi-field updates (changing the tag); the
CAS path handles the simple ones. Critically, the header spinlock does not
guard the page bytes — that is the content_lock’s job.
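
A packed word of this shape can be exercised with C11 atomics. The sketch below reuses the 18/4/10 bit split from the text, but every `TOY_`/`toy_` name is hypothetical, and it deliberately omits the wait-while-header-locked step that a real pin path needs.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Layout mirrors the text: 18 bits refcount | 4 bits usagecount | 10 bits flags. */
#define TOY_REFCOUNT_BITS   18
#define TOY_USAGECOUNT_BITS 4
#define TOY_REFCOUNT_ONE    1u
#define TOY_REFCOUNT_MASK   ((1u << TOY_REFCOUNT_BITS) - 1)
#define TOY_USAGECOUNT_ONE  (1u << TOY_REFCOUNT_BITS)
#define TOY_USAGECOUNT_MASK (((1u << TOY_USAGECOUNT_BITS) - 1) << TOY_REFCOUNT_BITS)
#define TOY_MAX_USAGE       5u
#define TOY_VALID           (1u << 24)   /* one of the flag bits */

static uint32_t toy_refcount(uint32_t s)
{
    return s & TOY_REFCOUNT_MASK;
}

static uint32_t toy_usagecount(uint32_t s)
{
    return (s & TOY_USAGECOUNT_MASK) >> TOY_REFCOUNT_BITS;
}

/*
 * Pin: refcount + 1 and usagecount + 1 (saturating) applied in one CAS.
 * Returns whether the page contents were valid at the moment of the pin.
 */
static bool toy_pin(_Atomic uint32_t *state)
{
    uint32_t old = atomic_load(state);

    for (;;) {
        uint32_t next = old + TOY_REFCOUNT_ONE;

        if (toy_usagecount(next) < TOY_MAX_USAGE)
            next += TOY_USAGECOUNT_ONE;
        if (atomic_compare_exchange_weak(state, &old, next))
            return (next & TOY_VALID) != 0;
        /* CAS failed: 'old' was reloaded with the current value, so retry. */
    }
}

/* Unpin: drop one reference; the usage count is left for the sweep to decay. */
static void toy_unpin(_Atomic uint32_t *state)
{
    atomic_fetch_sub(state, TOY_REFCOUNT_ONE);
}
```

Because all three fields live in one word, the validity test comes for free from the same value the CAS just installed.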
Pins vs content locks
These are the two independent guards from §“Common DBMS Design”, and the in-tree README states the rules precisely. A backend must pin a buffer before touching it at all; a pin keeps the frame from being recycled. Pin state is tracked per-backend in a small array plus an overflow hash, so a backend that pins the same buffer twice does not touch shared memory the second time:

```c
// PrivateRefCountEntry / array - storage/buffer/bufmgr.c
typedef struct PrivateRefCountEntry
{
    Buffer      buffer;
    int32       refcount;
} PrivateRefCountEntry;

#define REFCOUNT_ARRAY_ENTRIES 8    /* ~one cache line; overflow spills to a hash */
```

PinBuffer increments the shared refcount (in state) only on the
first private pin, then tracks further pins privately and registers the pin
with the current ResourceOwner so it is released at end of transaction even
on error:

```c
// PinBuffer - storage/buffer/bufmgr.c (condensed)
ref = GetPrivateRefCountEntry(b, true);
if (ref == NULL)                        /* first pin by this backend */
{
    ref = NewPrivateRefCountEntry(b);
    old_buf_state = pg_atomic_read_u32(&buf->state);
    for (;;)
    {
        if (old_buf_state & BM_LOCKED)
            old_buf_state = WaitBufHdrUnlocked(buf);

        buf_state = old_buf_state + BUF_REFCOUNT_ONE;   /* take a shared pin */

        if (strategy == NULL)           /* bump usagecount, capped */
        {
            if (BUF_STATE_GET_USAGECOUNT(buf_state) < BM_MAX_USAGE_COUNT)
                buf_state += BUF_USAGECOUNT_ONE;
        }
        else if (BUF_STATE_GET_USAGECOUNT(buf_state) == 0)
            buf_state += BUF_USAGECOUNT_ONE;    /* ring buffers cap at 1 */

        if (pg_atomic_compare_exchange_u32(&buf->state, &old_buf_state, buf_state))
        {
            result = (buf_state & BM_VALID) != 0;
            break;
        }
    }
}
else
    result = (pg_atomic_read_u32(&buf->state) & BM_VALID) != 0;

ref->refcount++;
ResourceOwnerRememberBuffer(CurrentResourceOwner, b);
```

Note the strategy branch: a normal pin bumps the usage counter (up to
BM_MAX_USAGE_COUNT); a pin made through a ring buffer caps usage at 1, so
ring pages never accumulate the protection that would keep them resident past
the scan. UnpinBuffer is the mirror image — it decrements the private
count, and only when that hits zero does it CAS the shared refcount down and,
if a cleanup waiter is parked (BM_PIN_COUNT_WAITER), wake it.
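
The split between private and shared pin counts can be modeled in a few lines. This is a sketch under simplifying assumptions: each backend keeps a flat per-buffer array (`ToyBackend`, hypothetical) instead of a small array plus overflow hash, the shared refcount is a plain `int`, and `toy_shared_touches` exists only to count shared-memory updates.

```c
#include <assert.h>

#define TOY_NBUFFERS 8

static int toy_shared_refcount[TOY_NBUFFERS];   /* stands in for the refcount bits in shared state */
static int toy_shared_touches;                  /* counts shared-memory updates, for illustration */

typedef struct ToyBackend {
    int private_refcount[TOY_NBUFFERS];  /* this backend's own pin counts */
} ToyBackend;

/* Pin: only the backend's first pin on a buffer touches shared state. */
static void toy_pin(ToyBackend *me, int buf)
{
    if (me->private_refcount[buf]++ == 0) {
        toy_shared_refcount[buf]++;
        toy_shared_touches++;
    }
}

/* Unpin: only releasing the backend's last pin touches shared state. */
static void toy_unpin(ToyBackend *me, int buf)
{
    assert(me->private_refcount[buf] > 0);
    if (--me->private_refcount[buf] == 0) {
        toy_shared_refcount[buf]--;
        toy_shared_touches++;
    }
}
```

In this model the shared count equals the number of backends holding the buffer, not the number of pin calls, which is what lets repeated pins stay local.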
The content lock is a separate per-descriptor LWLock, taken in shared
or exclusive mode through LockBuffer:

```c
// LockBuffer - storage/buffer/bufmgr.c (condensed)
if (mode == BUFFER_LOCK_UNLOCK)
    LWLockRelease(BufferDescriptorGetContentLock(buf));
else if (mode == BUFFER_LOCK_SHARE)
    LWLockAcquire(BufferDescriptorGetContentLock(buf), LW_SHARED);
else if (mode == BUFFER_LOCK_EXCLUSIVE)
    LWLockAcquire(BufferDescriptorGetContentLock(buf), LW_EXCLUSIVE);
```

The README’s access rules tie the two together: to scan a page you hold a pin and a (shared or exclusive) content lock; once you have decided a tuple is visible you may drop the content lock and keep reading its bytes under the pin alone (rule #2), which is exactly why pins and locks are separate primitives.
Reading a page: BufferAlloc
ReadBuffer → ReadBuffer_common → BufferAlloc is the lookup-or-fault
path. BufferAlloc first probes the mapping hash under a shared partition
lock; on a hit it pins and returns. On a miss it must fault the page in:
acquire a victim frame, re-check the hash under an exclusive partition lock
(another backend may have raced in), and on a clean miss install the new tag:

```c
// BufferAlloc - storage/buffer/bufmgr.c (condensed)
LWLockAcquire(newPartitionLock, LW_SHARED);
existing_buf_id = BufTableLookup(&newTag, newHash);
if (existing_buf_id >= 0)               /* HIT */
{
    buf = GetBufferDescriptor(existing_buf_id);
    valid = PinBuffer(buf, strategy);
    LWLockRelease(newPartitionLock);
    *foundPtr = true;
    if (!valid)
        *foundPtr = false;              /* read still in flight */
    return buf;
}
LWLockRelease(newPartitionLock);

/* MISS: get a recyclable frame (this may flush a dirty victim, see below) */
victim_buffer = GetVictimBuffer(strategy, io_context);
victim_buf_hdr = GetBufferDescriptor(victim_buffer - 1);

LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
existing_buf_id = BufTableInsert(&newTag, newHash, victim_buf_hdr->buf_id);
if (existing_buf_id >= 0)               /* lost the race; use theirs */
{
    UnpinBuffer(victim_buf_hdr);
    StrategyFreeBuffer(victim_buf_hdr);
    /* ... pin existing_buf_hdr, release lock, return ... */
}

/* won the race: stamp the victim with the new tag */
victim_buf_state = LockBufHdr(victim_buf_hdr);
victim_buf_hdr->tag = newTag;
victim_buf_state |= BM_TAG_VALID | BUF_USAGECOUNT_ONE;
if (relpersistence == RELPERSISTENCE_PERMANENT || forkNum == INIT_FORKNUM)
    victim_buf_state |= BM_PERMANENT;
UnlockBufHdr(victim_buf_hdr, victim_buf_state);
LWLockRelease(newPartitionLock);

*foundPtr = false;                      /* caller must do the read I/O */
return victim_buf_hdr;
```

BufferAlloc returns a pinned frame stamped with the tag but, on a miss, with
invalid contents (*foundPtr = false); the caller then performs the actual
read and marks the page BM_VALID via the StartBufferIO / TerminateBufferIO
protocol. BM_PERMANENT is set here, once, based on relation persistence —
this is the bit FlushBuffer later consults to decide whether the WAL rule
applies.
```mermaid
flowchart TD
  A["ReadBuffer(rel, blk)"] --> B["compute BufferTag + hash<br/>pick partition lock"]
  B --> C{"BufTableLookup<br/>(shared partition lock)"}
  C -- "hit" --> D["PinBuffer; release lock"]
  D --> E{"BM_VALID?"}
  E -- "yes" --> Z["return pinned buffer"]
  E -- "no" --> Y["wait for in-flight read"]
  C -- "miss" --> F["GetVictimBuffer<br/>(clock sweep / ring)"]
  F --> G["BufTableInsert<br/>(exclusive partition lock)"]
  G -- "race lost" --> H["UnpinBuffer + StrategyFreeBuffer<br/>use the winner's buffer"]
  G -- "race won" --> I["stamp tag, set BM_TAG_VALID<br/>BM_PERMANENT if logged"]
  I --> J["read block via smgr<br/>set BM_VALID"]
  J --> Z
  H --> Z
```
Figure 2 — ReadBuffer/BufferAlloc lookup-or-fault flow. A hit pins under a shared partition lock and returns. A miss acquires a victim frame (clock sweep, possibly flushing it), then re-checks under an exclusive partition lock to handle the race where another backend faulted the same page concurrently; the loser frees its victim and uses the winner’s buffer.
Clock-sweep eviction: GetVictimBuffer and StrategyGetBuffer
When a miss needs a frame, GetVictimBuffer (in bufmgr.c) drives the policy
but delegates selection to StrategyGetBuffer (in freelist.c).
StrategyGetBuffer first tries the strategy ring (if any), then the free
list, then runs the clock sweep:

```c
// StrategyGetBuffer - storage/buffer/freelist.c (condensed, clock-sweep arm)
trycounter = NBuffers;
for (;;)
{
    buf = GetBufferDescriptor(ClockSweepTick());    /* advance hand, return frame */
    local_buf_state = LockBufHdr(buf);
    if (BUF_STATE_GET_REFCOUNT(local_buf_state) == 0)
    {
        if (BUF_STATE_GET_USAGECOUNT(local_buf_state) != 0)
        {
            local_buf_state -= BUF_USAGECOUNT_ONE;  /* second chance: decay */
            trycounter = NBuffers;
        }
        else
        {
            /* Found a usable buffer (unpinned, usage 0) */
            if (strategy != NULL)
                AddBufferToRing(strategy, buf);
            *buf_state = local_buf_state;
            return buf;                 /* returned with hdr spinlock held! */
        }
    }
    else if (--trycounter == 0)
    {
        UnlockBufHdr(buf, local_buf_state);
        elog(ERROR, "no unpinned buffers available");
    }
    UnlockBufHdr(buf, local_buf_state);
}
```

The hand itself is a single atomic counter advanced by ClockSweepTick:

```c
// ClockSweepTick - storage/buffer/freelist.c (condensed)
victim = pg_atomic_fetch_add_u32(&StrategyControl->nextVictimBuffer, 1);
if (victim >= NBuffers)
{
    victim = victim % NBuffers;         /* wrap the index */
    if (victim == 0)                    /* we caused a wraparound */
    {
        /* take buffer_strategy_lock just long enough to bump completePasses */
        // ... CAS nextVictimBuffer back into range, completePasses++ ...
    }
}
return victim;
```

The free-list fast path is checked first, lock-free, and only on a non-empty
list is the buffer_strategy_lock spinlock taken to pop a head entry:

```c
// StrategyGetBuffer - freelist.c (free-list arm, condensed)
if (StrategyControl->firstFreeBuffer >= 0)
{
    while (true)
    {
        SpinLockAcquire(&StrategyControl->buffer_strategy_lock);
        if (StrategyControl->firstFreeBuffer < 0)
        {
            SpinLockRelease(...);
            break;
        }
        buf = GetBufferDescriptor(StrategyControl->firstFreeBuffer);
        StrategyControl->firstFreeBuffer = buf->freeNext;   /* pop head */
        buf->freeNext = FREENEXT_NOT_IN_LIST;
        SpinLockRelease(&StrategyControl->buffer_strategy_lock);

        local_buf_state = LockBufHdr(buf);
        if (BUF_STATE_GET_REFCOUNT(local_buf_state) == 0 &&
            BUF_STATE_GET_USAGECOUNT(local_buf_state) == 0)
            return buf;                 /* clean + unused: take it */
        UnlockBufHdr(buf, local_buf_state);     /* else discard, retry */
    }
}
```

Two scalability properties fall out. The clock hand is one atomic
fetch-add — multiple backends sweeping concurrently never hold a global lock
(they may return buffers slightly out of order, which is harmless). And the
per-frame test is done under only that frame’s header spinlock. The
buffer_strategy_lock spinlock is touched only to pop the free list or to
record a clock wraparound, never during the sweep itself. BM_MAX_USAGE_COUNT
= 5 caps how many sweeps a hot page can survive: a page whose usage count has
reached the cap needs 5 hand passes to decay to 0 and becomes evictable on the
next pass if it is still unpinned, so the worst-case sweep to find a victim is
bounded.
```mermaid
flowchart LR
  H["nextVictimBuffer<br/>(clock hand)"] --> F0
  subgraph RING["BufferDescriptors viewed as a circle"]
    direction LR
    F0["frame i<br/>pinned? -> skip<br/>usage>0 -> usage--<br/>usage=0 -> EVICT"]
    F1["frame i+1"]
    F2["frame i+2"]
    F3["..."]
  end
  F0 --> F1 --> F2 --> F3 -. "wraps to 0,<br/>completePasses++" .-> F0
```
Figure 3 — Clock-sweep replacement. The hand (nextVictimBuffer) advances one frame per ClockSweepTick. A pinned frame is skipped; an unpinned frame with usage > 0 has its usage decremented (a “second chance”); an unpinned frame at usage 0 is the victim. The hand never stops at a global lock — it is a single atomic counter.
GetVictimBuffer wraps selection with the eviction work: it pins the chosen
frame with the spinlock held (PinBuffer_Locked), and if the frame is dirty,
writes it out before recycling — this is where the write path and the WAL rule
come in.
Ring buffers for bulk operations (BAS_*)
A BufferAccessStrategy is a backend-private ring of frame numbers
(palloc’d, not in shared memory) that a bulk operation reuses among itself:

```c
// BufferAccessStrategyData - storage/buffer/freelist.c
typedef struct BufferAccessStrategyData
{
    BufferAccessStrategyType btype;
    int         nbuffers;
    int         current;    /* most recently returned ring slot */
    Buffer      buffers[FLEXIBLE_ARRAY_MEMBER]; /* the ring; 0 = slot not yet filled */
} BufferAccessStrategyData;
```

GetAccessStrategy picks the ring size by type (the README explains each):
| Strategy | Type | Ring size | Used by |
|---|---|---|---|
| Bulk read | BAS_BULKREAD | 256 KB (+IO concurrency) | large seq scans |
| Bulk write | BAS_BULKWRITE | 16 MB (capped 1/8 shared_buffers) | COPY IN, CREATE TABLE AS |
| Vacuum | BAS_VACUUM | vacuum_buffer_usage_limit (default 2 MB) | VACUUM |
| Normal | BAS_NORMAL | — (returns NULL) | everything else |
When a strategy is active, StrategyGetBuffer first calls GetBufferFromRing,
which advances current and re-offers the frame already in that slot — but
only if it is unpinned and its usage count is ≤ 1 (i.e., nobody else has
since touched it). If the slot is empty or the buffer was stolen, it returns
NULL and the caller falls through to the normal clock sweep, then calls
AddBufferToRing to record the freshly-acquired frame in the slot:

```c
// GetBufferFromRing - storage/buffer/freelist.c (condensed)
if (++strategy->current >= strategy->nbuffers)
    strategy->current = 0;
bufnum = strategy->buffers[strategy->current];
if (bufnum == InvalidBuffer)
    return NULL;                        /* slot empty: get a normal buffer */
buf = GetBufferDescriptor(bufnum - 1);
local_buf_state = LockBufHdr(buf);
if (BUF_STATE_GET_REFCOUNT(local_buf_state) == 0 &&
    BUF_STATE_GET_USAGECOUNT(local_buf_state) <= 1)
    return buf;                         /* reuse this ring buffer */
UnlockBufHdr(buf, local_buf_state);
return NULL;                            /* stolen: get a normal buffer */
```

The interaction with the WAL rule is the subtle part. If a ring buffer is
dirty and reusing it would force a WAL flush, BAS_BULKREAD would rather pick
a different victim than stall on WAL — StrategyRejectBuffer drops the dirty
frame from the ring so the caller takes a fresh one. VACUUM and bulk write
instead keep the page in the ring and pay the WAL flush, because they are
write-heavy by nature. This is decided in GetVictimBuffer:

```c
// GetVictimBuffer - storage/buffer/bufmgr.c (the strategy/WAL interaction, condensed)
if (buf_state & BM_DIRTY)
{
    content_lock = BufferDescriptorGetContentLock(buf_hdr);
    if (!LWLockConditionalAcquire(content_lock, LW_SHARED))     /* avoid deadlock */
    {
        UnpinBuffer(buf_hdr);
        goto again;
    }
    if (strategy != NULL)
    {
        buf_state = LockBufHdr(buf_hdr);
        lsn = BufferGetLSN(buf_hdr);
        UnlockBufHdr(buf_hdr, buf_state);
        if (XLogNeedsFlush(lsn) &&
            StrategyRejectBuffer(strategy, buf_hdr, from_ring))
        {
            LWLockRelease(content_lock);
            UnpinBuffer(buf_hdr);
            goto again;                 /* pick a different victim */
        }
    }
    FlushBuffer(buf_hdr, NULL, IOOBJECT_RELATION, io_context);
    LWLockRelease(content_lock);
    ScheduleBufferTagForWriteback(&BackendWritebackContext, io_context,
                                  &buf_hdr->tag);
}
```

The WAL-before-flush rule: FlushBuffer
FlushBuffer is the chokepoint where a dirty data page becomes durable, and
it is where the WAL rule is physically enforced. The sequence is: claim the
I/O (set BM_IO_IN_PROGRESS so no one else writes the same page), read the
page LSN under the header spinlock, flush WAL up to that LSN, then write
the page and clear the dirty bit:

```c
// FlushBuffer - storage/buffer/bufmgr.c (condensed)
if (!StartBufferIO(buf, false, false))  /* false = for output; someone else won */
    return;
// ... condensed: error-context setup, smgropen if reln == NULL ...
buf_state = LockBufHdr(buf);
recptr = BufferGetLSN(buf);             /* page's LSN, read under hdr lock */
buf_state &= ~BM_JUST_DIRTIED;          /* detect concurrent re-dirtying */
UnlockBufHdr(buf, buf_state);

/*
 * Force XLOG flush up to buffer's LSN. This implements the basic WAL
 * rule that log updates must hit disk before any of the data-file changes
 * they describe do. ... skip the flush if the buffer isn't permanent.
 */
if (buf_state & BM_PERMANENT)
    XLogFlush(recptr);

bufBlock = BufHdrGetBlock(buf);
bufToWrite = PageSetChecksumCopy((Page) bufBlock, buf->tag.blockNum);
smgrwrite(reln, BufTagGetForkNum(&buf->tag), buf->tag.blockNum, bufToWrite, false);
// ... condensed: I/O stats ...
TerminateBufferIO(buf, true, 0, true, false);   /* clear BM_DIRTY (unless re-dirtied) */
```

Three details carry the design:
- if (buf_state & BM_PERMANENT) XLogFlush(recptr) is the literal WAL rule. For a permanent (WAL-logged) relation, the log is forced to disk up to the page’s LSN before smgrwrite. For unlogged relations the flush is skipped: they are lost on crash anyway, and (per the in-tree comment) flushing a fake unlogged-GiST LSN could even try to flush past the WAL insertion point. Note smgrwrite writes to the OS page cache, not necessarily to the platter; durability of the data page is later forced by the checkpointer’s smgrimmedsync/fsync. The WAL rule only requires that the log precede the data write, which it does.
- StartBufferIO returning false means another backend already flushed this page, so FlushBuffer simply returns. BM_IO_IN_PROGRESS is the per-page I/O latch; waiters sleep on the frame’s condition variable in WaitIO.
- BM_JUST_DIRTIED is cleared before the write and re-checked in TerminateBufferIO: if the page was re-dirtied during the write (legal, since only a shared content lock is held and hint-bit updates happen under shared lock), the dirty bit is not cleared, so the page will be written again later.
```mermaid
sequenceDiagram
    participant V as GetVictimBuffer / checkpointer
    participant FB as FlushBuffer
    participant HDR as buffer header
    participant WAL as WAL (XLogFlush)
    participant SMGR as smgr (data file)
    V->>FB: flush dirty victim
    FB->>HDR: StartBufferIO -> set BM_IO_IN_PROGRESS
    alt someone else already flushing
        HDR-->>FB: false -> return (no work)
    end
    FB->>HDR: LockBufHdr; recptr = BufferGetLSN; clear BM_JUST_DIRTIED
    opt buffer is BM_PERMANENT
        FB->>WAL: XLogFlush(recptr)
        Note right of WAL: log on disk up to page LSN<br/>BEFORE the page write
    end
    FB->>SMGR: smgrwrite(page) (+ checksum copy)
    FB->>HDR: TerminateBufferIO -> clear BM_DIRTY unless BM_JUST_DIRTIED
    Note right of HDR: if re-dirtied mid-write,<br/>page stays dirty -> rewritten later
```
Figure 4 — FlushBuffer and the WAL-before-flush rule. The flusher claims the page via BM_IO_IN_PROGRESS, reads its LSN, and for a permanent relation calls XLogFlush(recptr) to force the log to disk up to that LSN before smgrwrite. BM_JUST_DIRTIED catches concurrent hint-bit dirtying so a re-dirtied page is not falsely marked clean.
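The ordering in Figure 4 can be sketched as a small single-threaded model. This is an illustration under stated assumptions, not PostgreSQL code: every `toy_*` name is hypothetical, LSNs are plain integers, the "write" only records what the log position was at write time, and a concurrent hint-bit setter is simulated by a boolean parameter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy model: all names are hypothetical, none are PostgreSQL APIs. */
typedef uint64_t toy_lsn;

typedef struct {
    toy_lsn flushed_upto;   /* log is durable up to this LSN */
    int     flush_calls;    /* number of times the log was actually forced */
} toy_wal;

typedef struct {
    toy_lsn lsn;            /* LSN of the last record that touched the page */
    bool    permanent;      /* belongs to a WAL-logged relation? */
    bool    dirty;
    bool    just_dirtied;
    toy_lsn wal_at_write;   /* log position observed when the page was written */
    int     writes;
} toy_page;

/* Stand-in for the log flush: force the log forward only if needed. */
static void toy_wal_flush(toy_wal *wal, toy_lsn upto)
{
    if (upto > wal->flushed_upto) {
        wal->flushed_upto = upto;
        wal->flush_calls++;
    }
}

/* Flush one page, keeping the log ahead of the data write. */
static void toy_flush_page(toy_wal *wal, toy_page *page, bool redirtied_mid_write)
{
    toy_lsn recptr = page->lsn;     /* read the page LSN first */
    page->just_dirtied = false;     /* arm the re-dirty detector */

    if (page->permanent)            /* log first, for WAL-logged pages only */
        toy_wal_flush(wal, recptr);

    page->wal_at_write = wal->flushed_upto;  /* the data write happens here */
    page->writes++;

    if (redirtied_mid_write) {      /* simulated concurrent hint-bit setter */
        page->dirty = true;
        page->just_dirtied = true;
    }

    if (!page->just_dirtied)        /* clean only if nobody re-dirtied it */
        page->dirty = false;
}
```

For a permanent page with LSN 250 and a log flushed to 100, the sketch forces the log to 250 before counting the write and then marks the page clean. For a non-permanent page the log position is left alone, and a page re-dirtied during the write stays dirty.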
Marking a page dirty
The write path begins when a modifier calls `MarkBufferDirty` (under an
exclusive content lock and a pin). It just sets two flag bits via CAS:

```c
// MarkBufferDirty: storage/buffer/bufmgr.c (condensed)
Assert(BufferIsPinned(buffer));
Assert(LWLockHeldByMeInMode(BufferDescriptorGetContentLock(bufHdr), LW_EXCLUSIVE));
old_buf_state = pg_atomic_read_u32(&bufHdr->state);  /* initial snapshot for the CAS loop */
for (;;)
{
    if (old_buf_state & BM_LOCKED)
        old_buf_state = WaitBufHdrUnlocked(bufHdr);
    buf_state = old_buf_state | BM_DIRTY | BM_JUST_DIRTIED;
    if (pg_atomic_compare_exchange_u32(&bufHdr->state, &old_buf_state, buf_state))
        break;
}
```

The page’s LSN (which `FlushBuffer` later reads) is not set here; it is set
by the WAL machinery (XLogInsert returns the record’s end LSN, which the
modifying access method stamps into the page header; see
postgres-xlog-wal.md). MarkBufferDirty only records that the bytes diverge
from disk.
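The CAS loop above can be reproduced with C11 atomics. The following is a minimal sketch, not the PostgreSQL implementation: the flag values and the `toy_mark_dirty` name are hypothetical, and the wait for the header spinlock bit is left out.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical flag values for illustration; not PostgreSQL's bit positions. */
#define TOY_DIRTY        (1u << 0)
#define TOY_JUST_DIRTIED (1u << 1)
#define TOY_VALID        (1u << 2)

/*
 * Set both dirty flags in one atomic step and return the state word that
 * was replaced. If another thread changes the word between the load and
 * the compare-exchange, the exchange fails, `old` is refreshed with the
 * current value, and the loop retries.
 */
static uint32_t toy_mark_dirty(_Atomic uint32_t *state)
{
    uint32_t old = atomic_load(state);

    for (;;) {
        uint32_t desired = old | TOY_DIRTY | TOY_JUST_DIRTIED;

        if (atomic_compare_exchange_weak(state, &old, desired))
            return old;
    }
}
```

Calling `toy_mark_dirty` twice is idempotent: the second call finds both flags already set and writes the same word back. Setting the flags with a compare-exchange rather than a lock is what lets other fields packed into the same word change concurrently without lost updates.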
Cleanup locks: removing tuples needs sole-pin
Some operations (VACUUM compaction, page defragmentation) must guarantee no
other backend holds a pointer into the page. The README’s rule #5 requires an
exclusive content lock and observing refcount == 1.
LockBufferForCleanup loops: take the exclusive lock, check the pin count
under the header spinlock; if it is 1, done; otherwise register as the sole
BM_PIN_COUNT_WAITER, drop the lock, and sleep until another backend’s
UnpinBuffer (or TerminateBufferIO) wakes it via WakePinCountWaiter. Only
one waiter per buffer is supported, which is sufficient because PostgreSQL does
not run two VACUUMs on one relation concurrently.
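The loop described above can be modelled without threads by turning "sleep" and "wake" into return values. This is a sketch under assumptions: the `toy_*` names are hypothetical, the content lock is a boolean, and the waiter identity is a plain integer rather than a process handle.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy buffer header; all names are hypothetical. */
typedef struct {
    int  refcount;      /* shared pin count, including the caller's own pin */
    bool excl_locked;   /* exclusive content lock held? */
    bool has_waiter;    /* analogue of the pin-count-waiter flag */
    int  waiter_id;     /* who to wake when the pin count drops to 1 */
} toy_buf;

/*
 * One iteration of the cleanup-lock loop. Returns true if the caller now
 * holds the cleanup lock (exclusive lock plus sole pin). Otherwise it
 * registers the caller as the single waiter, drops the lock, and returns
 * false, meaning "go to sleep and retry after being woken".
 */
static bool toy_try_cleanup(toy_buf *buf, int my_id)
{
    assert(buf->refcount >= 1);     /* caller must already hold a pin */
    buf->excl_locked = true;

    if (buf->refcount == 1)
        return true;                /* sole pin: keep the lock */

    assert(!buf->has_waiter);       /* only one waiter per buffer */
    buf->has_waiter = true;
    buf->waiter_id = my_id;
    buf->excl_locked = false;       /* release before sleeping */
    return false;
}

/* Drop one pin. Returns the id of the waiter to wake, or -1 for none. */
static int toy_unpin(toy_buf *buf)
{
    assert(buf->refcount > 0);
    buf->refcount--;

    if (buf->has_waiter && buf->refcount == 1) {
        buf->has_waiter = false;
        return buf->waiter_id;
    }
    return -1;
}
```

With three pins outstanding (the cleaner plus two readers), the first attempt registers a waiter and fails; only the unpin that brings the count down to 1 reports a wake-up, after which the retry succeeds.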
Local buffers for temporary tables
Temporary-table pages never go in the shared pool: they are session-private
and not WAL-logged, so they live in a per-backend array managed by
localbuf.c (LocalBufferDescriptors, LocalRefCount). LocalBufferAlloc
and GetLocalVictimBuffer mirror the shared logic but with no locking (no
spinlock, no LWLock — there is only one accessor) and a simple clock sweep
over the local array. The same BufferDesc struct is reused, but its locks
and most flag bits are inert. This keeps the temp-table fast path off the
shared pool’s contention entirely.
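Because only one backend touches the local array, a clock sweep over it needs no atomics or locks. The sketch below illustrates that idea with hypothetical names (`toy_local_pool`, `toy_local_victim`) and a four-frame array; returning -1 when every frame is pinned stands in for whatever error a real implementation would raise.

```c
#include <assert.h>

#define TOY_NLOCAL 4    /* hypothetical size of the private buffer array */

typedef struct {
    int refcount;       /* private pin count */
    int usage;          /* usage counter decremented by the sweep */
} toy_local_buf;

typedef struct {
    toy_local_buf bufs[TOY_NLOCAL];
    int next_victim;    /* clock hand; plain int, single accessor */
} toy_local_pool;

/* Return the index of an unpinned frame with usage 0, or -1 if all pinned. */
static int toy_local_victim(toy_local_pool *pool)
{
    int pinned_in_a_row = 0;

    assert(pool->next_victim >= 0 && pool->next_victim < TOY_NLOCAL);

    for (;;) {
        int id = pool->next_victim;
        toy_local_buf *b = &pool->bufs[id];

        pool->next_victim = (id + 1) % TOY_NLOCAL;

        if (b->refcount > 0) {
            /* A full lap of pinned frames means nothing can be evicted. */
            if (++pinned_in_a_row >= TOY_NLOCAL)
                return -1;
            continue;
        }
        if (b->usage > 0) {
            b->usage--;             /* give the frame another chance */
            pinned_in_a_row = 0;
            continue;
        }
        return id;                  /* unpinned and cold: evict this one */
    }
}
```

Starting with frame 0 pinned and usage counts 2, 1, 0 on frames 1 to 3, the first call decrements frames 1 and 2 and returns frame 3. The sweep terminates because each lap either decrements some usage counter or counts a full run of pinned frames.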
Source Walkthrough
Anchor on symbol names, not line numbers. Use `git grep -n '<symbol>' src/backend/storage/buffer/` to relocate a symbol; the line numbers in the position-hint table are hints scoped to the `updated:` commit.
Shared-pool data structures (storage/buf_internals.h)
- `struct buftag` (`BufferTag`): the (spc, db, rel, fork, block) cache key; `InitBufferTag`, `BufferTagsEqual`, `ClearBufferTag` operate on it.
- `struct BufferDesc`: per-frame metadata (tag, `buf_id`, atomic `state`, `freeNext`, `content_lock`, `io_wref`).
- `BufferDescPadded`: cache-line-padded union wrapping `BufferDesc`.
- `BUF_REFCOUNT_BITS` / `BUF_USAGECOUNT_BITS` / `BUF_FLAG_BITS` and the `BUF_STATE_GET_*` accessors: the packed-word layout.
- `BM_*` flag macros: `BM_DIRTY`, `BM_VALID`, `BM_TAG_VALID`, `BM_IO_IN_PROGRESS`, `BM_JUST_DIRTIED`, `BM_PIN_COUNT_WAITER`, `BM_PERMANENT`, `BM_LOCKED`.
- `BM_MAX_USAGE_COUNT`: clock-sweep usage cap (= 5).
- `BufMappingPartitionLock` / `BufTableHashPartition`: partition-lock selection.
- `LockBufHdr` / `UnlockBufHdr`: header spinlock via `BM_LOCKED`.
Pool initialization (buf_init.c)
- `BufferManagerShmemInit`: allocate + link the three shared arrays.
- `BufferManagerShmemSize`: size computation for the shmem segment.
Mapping hash (buf_table.c)
- `BufTableHashCode`, `BufTableLookup`, `BufTableInsert`, `BufTableDelete`: the tag→buf_id shared hash table API.
Replacement strategy (freelist.c)
- `StrategyGetBuffer`: ring → free list → clock sweep victim selection.
- `ClockSweepTick`: advance the atomic clock hand, handle wraparound.
- `StrategyFreeBuffer`: return a clean frame to the free list.
- `BufferStrategyControl`: shared `nextVictimBuffer`, free-list head/tail, `buffer_strategy_lock`, bgwriter notification.
- `GetAccessStrategy` / `GetAccessStrategyWithSize`: build a ring.
- `GetBufferFromRing` / `AddBufferToRing` / `StrategyRejectBuffer`: ring reuse and dirty-buffer rejection.
- `StrategySyncStart` / `StrategyNotifyBgWriter`: bgwriter coordination.
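The idea behind an atomic clock hand can be sketched with C11 atomics. This is a simplified illustration, not the `ClockSweepTick` code: the names are hypothetical, and wraparound is handled by a plain modulo over a power-of-two pool size instead of resetting the counter under a spinlock.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/*
 * Hypothetical pool size. A power of two keeps the modulo continuous when
 * the 32-bit counter itself overflows, which this sketch relies on.
 */
#define TOY_NBUFFERS 8u

typedef struct {
    _Atomic uint32_t next_victim;   /* shared clock hand, only ever incremented */
} toy_strategy;

/*
 * Claim the next clock position. fetch-add hands every concurrent caller a
 * distinct ticket, so advancing the hand needs no lock.
 */
static uint32_t toy_clock_tick(toy_strategy *s)
{
    uint32_t ticket = atomic_fetch_add(&s->next_victim, 1u);

    return ticket % TOY_NBUFFERS;
}

/* Complete passes over the pool implied by the hand's current value. */
static uint32_t toy_complete_passes(toy_strategy *s)
{
    return atomic_load(&s->next_victim) / TOY_NBUFFERS;
}
```

Two callers racing on `toy_clock_tick` always receive different positions, but they may go on to examine their buffers in either order, so victims can come back slightly out of sequence.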
Read / pin / lock / write path (bufmgr.c)
- `ReadBuffer` / `ReadBufferExtended` / `ReadBuffer_common`: public entry.
- `BufferAlloc`: lookup-or-fault; tag install under partition lock.
- `GetVictimBuffer`: wrap selection with eviction + the strategy/WAL check.
- `PinBuffer` / `PinBuffer_Locked` / `UnpinBuffer` / `UnpinBufferNoOwner`: reference counting; `PrivateRefCountArray` / `PrivateRefCountEntry`.
- `LockBuffer` / `ConditionalLockBuffer`: content lock.
- `LockBufferForCleanup` / `ConditionalLockBufferForCleanup` / `WakePinCountWaiter`: sole-pin cleanup protocol.
- `MarkBufferDirty`: set `BM_DIRTY|BM_JUST_DIRTIED`.
- `FlushBuffer`: the WAL rule, `XLogFlush(BufferGetLSN(buf))` then `smgrwrite`.
- `StartBufferIO` / `TerminateBufferIO` / `WaitIO` / `AbortBufferIO`: the `BM_IO_IN_PROGRESS` I/O latch protocol.
- `InvalidateBuffer` / `InvalidateVictimBuffer`: drop a buffer’s mapping.
- `BufferSync`: checkpoint-time flush of all dirty buffers.
Local (temp-table) buffers (localbuf.c)
- `LocalBufferAlloc`, `GetLocalVictimBuffer`, `MarkLocalBufferDirty`, `FlushLocalBuffer`: the session-private, lockless mirror.
Position hints (as of 2026-06-05, REL_18 273fe94)
| Symbol | File | Line |
|---|---|---|
| `struct buftag` | storage/buf_internals.h | 106 |
| `struct BufferDesc` | storage/buf_internals.h | 258 |
| `BM_MAX_USAGE_COUNT` | storage/buf_internals.h | 87 |
| `BufMappingPartitionLock` | storage/buf_internals.h | 198 |
| `BufferManagerShmemInit` | storage/buffer/buf_init.c | 67 |
| `BufferManagerShmemSize` | storage/buffer/buf_init.c | 161 |
| `BufTableHashCode` | storage/buffer/buf_table.c | 78 |
| `BufTableLookup` | storage/buffer/buf_table.c | 90 |
| `BufTableInsert` | storage/buffer/buf_table.c | 118 |
| `BufTableDelete` | storage/buffer/buf_table.c | 148 |
| `BufferStrategyControl` | storage/buffer/freelist.c | 30 |
| `ClockSweepTick` | storage/buffer/freelist.c | 107 |
| `StrategyGetBuffer` | storage/buffer/freelist.c | 195 |
| `StrategyFreeBuffer` | storage/buffer/freelist.c | 362 |
| `GetAccessStrategy` | storage/buffer/freelist.c | 540 |
| `GetBufferFromRing` | storage/buffer/freelist.c | 736 |
| `StrategyRejectBuffer` | storage/buffer/freelist.c | 839 |
| `ReadBuffer` | storage/buffer/bufmgr.c | 758 |
| `BufferAlloc` | storage/buffer/bufmgr.c | 2009 |
| `InvalidateBuffer` | storage/buffer/bufmgr.c | 2187 |
| `GetVictimBuffer` | storage/buffer/bufmgr.c | 2354 |
| `MarkBufferDirty` | storage/buffer/bufmgr.c | 2956 |
| `PinBuffer` | storage/buffer/bufmgr.c | 3076 |
| `PinBuffer_Locked` | storage/buffer/bufmgr.c | 3186 |
| `UnpinBuffer` | storage/buffer/bufmgr.c | 3268 |
| `FlushBuffer` | storage/buffer/bufmgr.c | 4293 |
| `LockBuffer` | storage/buffer/bufmgr.c | 5609 |
| `LockBufferForCleanup` | storage/buffer/bufmgr.c | 5689 |
| `WaitIO` | storage/buffer/bufmgr.c | 5968 |
| `StartBufferIO` | storage/buffer/bufmgr.c | 6047 |
| `TerminateBufferIO` | storage/buffer/bufmgr.c | 6104 |
| `LocalBufferAlloc` | storage/buffer/localbuf.c | 118 |
| `GetLocalVictimBuffer` | storage/buffer/localbuf.c | 223 |
Source verification (as of 2026-06-05)
Each entry is a fact about the current source at commit `273fe94`, readable without external materials. Open questions follow as the curator’s recorded gaps.
Verified facts
- The buffer state word is 18-bit refcount + 4-bit usagecount + 10-bit flags, with a compile-time assert that they sum to 32. Verified in `buf_internals.h` (`BUF_REFCOUNT_BITS` / `BUF_USAGECOUNT_BITS` / `BUF_FLAG_BITS` and `StaticAssertDecl(... == 32)`). The flag bits start at bit 22 (`BM_LOCKED` = `1U << 22`).
- `BM_MAX_USAGE_COUNT` is 5, hard-coded, not a GUC. In `buf_internals.h`, with a comment explaining the accuracy/speed trade-off and a static assert that it fits in `BUF_USAGECOUNT_BITS`. Tuning requires a recompile.
- `NUM_BUFFER_PARTITIONS` is 128. Defined in `storage/lwlock.h` (not in the buffer files themselves), consumed via `BufTableHashPartition`. Must be a power of two.
- The clock hand is a single atomic `nextVictimBuffer` advanced by `pg_atomic_fetch_add_u32`; the `buffer_strategy_lock` spinlock is only taken on a clock wraparound or to pop the free list. Verified in `ClockSweepTick` and `StrategyGetBuffer` (`freelist.c`). Concurrent sweeps can return buffers slightly out of order; the in-code comment acknowledges this as harmless.
- The WAL rule is enforced in `FlushBuffer` as `if (buf_state & BM_PERMANENT) XLogFlush(recptr)`, where `recptr = BufferGetLSN(buf)` is read under the buffer header spinlock, before `smgrwrite`. Verified in `FlushBuffer` (`bufmgr.c`). Unlogged relations skip the flush by design; the in-code comment cites the fake-LSN GiST hazard as the reason the `BM_PERMANENT` guard is mandatory, not merely an optimization.
- Ring sizes: `BAS_BULKREAD` = 256 KB base (grown by IO concurrency), `BAS_BULKWRITE` = 16 MB, `BAS_VACUUM` = 2048 KB; all capped at 1/8 of `shared_buffers`. Verified in `GetAccessStrategy` and `GetAccessStrategyWithSize` (`freelist.c`). `BAS_NORMAL` returns NULL (no ring). `BAS_VACUUM`’s effective size is driven by the `vacuum_buffer_usage_limit` GUC at the call site, not hard-coded in `GetAccessStrategy`.
- Only `BAS_BULKREAD` rejects a dirty ring buffer rather than flush it. Verified in `StrategyRejectBuffer` (`freelist.c`): it returns false for any type other than `BAS_BULKREAD`. So vacuum and bulk-write rings do pay WAL flushes when reusing dirty pages, matching the README.
- A backend tracks its first pin of a buffer in shared memory and further pins privately, in an 8-entry array (`REFCOUNT_ARRAY_ENTRIES`) plus an overflow hash. Verified in `bufmgr.c` (`PrivateRefCountEntry`, `PinBuffer`). `ResourceOwnerRememberBuffer` ties each pin to a resource owner for cleanup on abort.
- `LockBufferForCleanup` supports exactly one pin-count waiter per buffer. Verified in `bufmgr.c` and stated in the README; sufficient because concurrent VACUUMs on one relation are disallowed.
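The 18 + 4 + 10 layout in the first entry can be made concrete with a few macros. This is a sketch with hypothetical `TOY_*` names that mirrors only the widths and the bit-22 flag position stated above; placing the refcount in the lowest bits and the usage count directly above it is an assumption made for the illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Field widths taken from the text; names are hypothetical. */
#define TOY_REFCOUNT_BITS   18
#define TOY_USAGECOUNT_BITS 4
#define TOY_FLAG_BITS       10

_Static_assert(TOY_REFCOUNT_BITS + TOY_USAGECOUNT_BITS + TOY_FLAG_BITS == 32,
               "the three fields must fill one 32-bit word");

/* Assumed ordering: refcount lowest, usage count next, flags on top. */
#define TOY_REFCOUNT_MASK   ((1u << TOY_REFCOUNT_BITS) - 1u)
#define TOY_USAGE_SHIFT     TOY_REFCOUNT_BITS
#define TOY_USAGE_MASK      (((1u << TOY_USAGECOUNT_BITS) - 1u) << TOY_USAGE_SHIFT)
#define TOY_FLAG_SHIFT      (TOY_REFCOUNT_BITS + TOY_USAGECOUNT_BITS)
#define TOY_LOCKED          (1u << TOY_FLAG_SHIFT)    /* first flag bit: bit 22 */

/* Adding these constants bumps one field without touching the others. */
#define TOY_REFCOUNT_ONE    1u
#define TOY_USAGECOUNT_ONE  (1u << TOY_USAGE_SHIFT)

static inline uint32_t toy_refcount(uint32_t state)
{
    return state & TOY_REFCOUNT_MASK;
}

static inline uint32_t toy_usagecount(uint32_t state)
{
    return (state & TOY_USAGE_MASK) >> TOY_USAGE_SHIFT;
}

/* Build a state word from its three fields (out-of-range bits are masked). */
static inline uint32_t toy_pack(uint32_t refcount, uint32_t usage, uint32_t flags)
{
    return (refcount & TOY_REFCOUNT_MASK)
         | ((usage << TOY_USAGE_SHIFT) & TOY_USAGE_MASK)
         | (flags << TOY_FLAG_SHIFT);
}
```

Packing all three fields into one word is what makes a pin cheap: taking a pin and bumping the usage count is a single addition of `TOY_REFCOUNT_ONE + TOY_USAGECOUNT_ONE`, which can be applied with one compare-and-swap while the flag bits ride along unchanged.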
Open questions
- AIO interaction with eviction. REL_18 adds `io_wref` to `BufferDesc` and a separate async-I/O subsystem (`storage/aio`). The read path can be asynchronous, and `TerminateBufferIO(..., release_aio)` releases an AIO-held pin. How `GetVictimBuffer` and `LockBufferForCleanup` interact with a buffer whose read is still in flight is only partially traced here. Investigation path: read `storage/aio/README.md` and the `WaitReadBuffers` / `StartReadBuffers` batch path; cross-ref a future `postgres-aio.md`.
- The free list’s long-term role. The README notes the free list is only ever populated by genuinely empty frames (relation drops); for cold pages, “the current algorithm never does that.” Under steady state the list is empty and every allocation runs the clock sweep. Is the free list effectively dead weight in a warm cache, or does it matter for drop-heavy workloads? Investigation path: instrument `StrategyFreeBuffer` call sites.
- Bgwriter / checkpointer division of labor. `BufferSync` (checkpoint) and `BgBufferSync` (background writer) both write dirty buffers but on different schedules and triggers. This doc covers the per-buffer `FlushBuffer` mechanism, not the policy that decides when each runs. Investigation path: a dedicated `postgres-checkpointer-bgwriter.md`.
Beyond PostgreSQL — Comparative Designs & Research Frontiers
Pointers, not analysis. Each bullet is a starting handle for a follow-up doc; depth here is intentionally shallow.

- CUBRID’s page buffer. CUBRID caches pages in `pgbuf` with a similar pin/latch separation and its own LRU/clock hybrid, but its old-version storage is out-of-place (undo log), so the buffer manager’s interaction with MVCC differs from PostgreSQL’s in-place heap (see `cubrid-mvcc.md`). A side-by-side of the two replacement policies and WAL-flush chokepoints would isolate what is intrinsic to steal/no-force from what is PostgreSQL-specific.
- CLOCK variants beyond second-chance. PostgreSQL’s CLOCK with a fixed `BM_MAX_USAGE_COUNT` = 5 approximates LRU crudely. CLOCK-Pro (Jiang, Chen & Zhang, USENIX 2005), LRU-K (O’Neil et al., SIGMOD 1993), 2Q, and ARC (Megiddo & Modha, FAST 2003) all improve scan resistance and frequency awareness. PG’s ring-buffer strategies are a hand-built scan-resistance patch on top of plain CLOCK; how much a smarter base policy would subsume the rings is an open empirical question.
- The double-buffering problem. PostgreSQL relies on the OS page cache under `shared_buffers`, so a hot page can be cached twice. Direct I/O and the REL_18 async-I/O subsystem (`storage/aio`) move toward bypassing the OS cache; Database Internals ch. 4 discusses the trade-off. A measured study of double-buffering cost on PG would anchor the `shared_buffers` sizing folklore.
- Buffer management at high core counts. The packed-state CAS path and partitioned mapping lock are PG’s answers to the pre-8.1 `BufMgrLock` bottleneck. Whether they hold at hundreds of cores connects to the scalable-lock-manager line of work (`dbms-papers/scalable-lock-manager.md`) and to the architecture-overview’s shared-memory thesis.
- WAL-before-flush vs. shadow paging. ARIES steal/no-force (`dbms-papers/aries.md`) is one point in the recovery design space; shadow-paging engines (and copy-on-write stores like LMDB) avoid the WAL-before-flush coupling entirely at the cost of write amplification and fragmentation. The comparison clarifies why PG’s buffer manager carries the `XLogFlush`-before-`smgrwrite` chokepoint at all.
Sources
In-tree README (first-class design doc)
- `src/backend/storage/buffer/README`: buffer access rules (pins vs content locks), internal locking (BufMappingLock partitions, buffer_strategy_lock, per-header spinlock, BM_IO_IN_PROGRESS), the clock-sweep algorithm, the ring strategies, and the background writer.
PostgreSQL source (under /data/hgryoo/references/postgres, REL_18 273fe94)
- `src/backend/storage/buffer/bufmgr.c`
- `src/backend/storage/buffer/freelist.c`
- `src/backend/storage/buffer/buf_init.c`
- `src/backend/storage/buffer/buf_table.c`
- `src/backend/storage/buffer/localbuf.c`
- `src/include/storage/buf_internals.h`
- `src/include/storage/bufmgr.h`
- `src/include/storage/lwlock.h` (for `NUM_BUFFER_PARTITIONS`)
Textbook chapters (under knowledge/research/dbms-general/)
- Database Internals (Petrov), Ch. 4 §“Buffer Management” (≈ line 3419), §“Buffer manager” (≈ line 791): page cache role, eviction, force/steal, the double-buffering trade-off.
Papers (under knowledge/research/dbms-papers/)
- ARIES (Mohan et al., 1992): `aries.md`. The steal/no-force recovery method and the WAL rule that `FlushBuffer` enforces.
Cross-references (sibling module docs)
- `postgres-smgr-md.md`: the storage manager `FlushBuffer` writes through.
- `postgres-xlog-wal.md`: WAL generation, the page LSN, and `XLogFlush`.
- `postgres-page-layout.md`: what is inside the 8 KB page a buffer holds.
- `postgres-architecture-overview.md`: the fixed-size shared-memory segment the pool is carved from (Axis 2) and the WAL spine (Axis 3).