PostgreSQL SLRU — Simple LRU Buffering for Wrap-Around-Able Metadata
Contents:
- Theoretical Background
- Common DBMS Design
- PostgreSQL’s Approach
- Source Walkthrough
- Source verification (as of 2026-06-05)
- Beyond PostgreSQL — Comparative Designs & Research Frontiers
- Sources
Theoretical Background
PostgreSQL manages transaction status (whether a given XID committed, aborted, or is still in progress) through a set of small, append-mostly files indexed by TransactionId. The fundamental question each of these files answers is “what happened to XID N?” Access is random by XID, but the access pattern is strongly temporal: the latest page is read and written most heavily, and progressively older pages are read with decreasing frequency as vacuuming ages them out. This shape is classic for a simple paged buffer cache with LRU replacement.
The textbook treatment of buffer management (Database System Concepts,
Silberschatz et al., 7e, ch. 13 “Data Storage Structures”) describes the
general model: a fixed pool of in-memory frames, a page-table mapping
logical pages to frames, a replacement policy (LRU, clock, MRU) that
evicts the least-useful frame when the pool is full, and dirty-page
tracking to ensure writes reach stable storage before eviction. ARIES
(ARIES: A Transaction Recovery Method, Mohan et al., ACM TODS 1992;
dbms-papers/aries.md) introduces the WAL-before-data invariant: a
dirty page must not be written to stable storage until the WAL records
that describe its changes have been flushed first. For ordinary heap and
index pages, the main buffer manager in storage/buffer/bufmgr.c
enforces this by checking the page’s LSN against the WAL flush point at
write time. Transaction-status pages follow the same invariant, but they
are small, numerous, and have a monotonically advancing logical address
space: none of the richness of the heap buffer manager is needed, and the
cost of its infrastructure — the buffer descriptors, the shared buffer
hash, the per-buffer content locks, the complex clock-sweep algorithm —
would be pure overhead.
SLRU (Simple LRU) is PostgreSQL’s purpose-built answer: a stripped-down paged buffer cache whose design exploits the temporal locality and fixed logical address space of transaction metadata. It is not academically novel — it is careful engineering that eliminates every feature of the general buffer manager that does not pay for itself in this access pattern.
The wrap-around aspect is a second distinct concern. TransactionIds are
32-bit unsigned integers, so the counter wraps after 2³² (about 4.3
billion) transactions, and XID comparison is circular: at any moment
roughly 2 billion XIDs count as “older” and 2 billion as “newer”. The
status files must therefore treat page numbering and segment
numbering as modular arithmetic. SimpleLruTruncate deletes old segments
using a PagePrecedes callback that each client supplies; the callback
encapsulates the modular comparison so the SLRU substrate itself never
needs to understand XID semantics.
Common DBMS Design
Transaction-status storage is a near-universal subsystem in relational engines, though its shape varies with the engine’s concurrency protocol.
The two archetypal designs
Undo-log engines (Oracle, SQL Server, MySQL/InnoDB) record the before-image of every modified row in a separate undo segment. “What was the committed state of this row at snapshot time T?” is answered by chasing the undo chain from the current version backward to the appropriate revision. There is no separate “did XID N commit?” log because the commit record is embedded in the row version chain and in the transaction table within the rollback segment. The equivalent of CLOG is a small in-memory transaction table, evicted to undo tablespace, not a dedicated append-mostly file.
MVCC-with-commit-log engines (PostgreSQL, CUBRID in some modes)
record the row’s final committed version in-place (or as a heap
version) and keep a side file that maps XID → {committed, aborted,
in-progress}. Reading visibility requires consulting this side file.
PostgreSQL’s pg_xact/ (formerly pg_clog/) is the canonical example.
The design trades undo-chain traversal cost for a single random read of a
small, hot, cacheable file.
Temporal locality and the “recent-biased” access pattern
In both designs, transaction-status data exhibits strong temporal locality. Transactions recently committed are the ones most likely to be checked by concurrent readers. Transactions committed thousands of checkpoints ago appear only during vacuum or when a very long-lived snapshot is active. Any cache for transaction-status data should be tuned toward recency. The SLRU’s LRU replacement and its explicit protection of the “latest page” from eviction are the direct expression of this principle.
Fixed-size segment files
Rather than a single monolithic file growing without bound, engines
typically divide transaction-status storage into fixed-size
segments. Segment rotation is a checkpoint-friendly operation (the
checkpointer can sync one segment at a time), and truncation removes
whole segments, which is a directory-entry delete and therefore cheaper
than punching holes in a large file. With SLRU_PAGES_PER_SEGMENT = 32,
each PostgreSQL SLRU segment is 256 KB (32 pages × 8 KB/page); for CLOG,
which packs 2 status bits per XID, one segment holds status data for
1,048,576 (1 Mi) transactions.
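The segment arithmetic above can be checked with a short sketch. It assumes the values quoted in the text (8 KB pages, 32 pages per segment, 2 status bits per XID) and a straightforward packing of four XIDs per byte; the SKETCH_* names, XidLocation, and locate_xid are illustrative and are not PostgreSQL identifiers.

```c
#include <assert.h>
#include <stdint.h>

/* Values quoted in the text; the names are local to this sketch. */
#define SKETCH_BLCKSZ             8192
#define SKETCH_XACTS_PER_BYTE     4     /* 2 status bits per XID */
#define SKETCH_PAGES_PER_SEGMENT  32

#define SKETCH_XACTS_PER_PAGE     (SKETCH_BLCKSZ * SKETCH_XACTS_PER_BYTE)
#define SKETCH_XACTS_PER_SEGMENT  (SKETCH_XACTS_PER_PAGE * SKETCH_PAGES_PER_SEGMENT)

/* Where the status bits of one XID live (hypothetical helper type). */
typedef struct XidLocation
{
    int64_t pageno;             /* logical SLRU page */
    int64_t segno;              /* segment file holding that page */
    int     page_in_segment;    /* page index inside the segment file */
    int     byte_in_page;       /* byte offset inside the page */
    int     bit_shift;          /* shift of the 2-bit field inside the byte */
} XidLocation;

static XidLocation
locate_xid(uint32_t xid)
{
    XidLocation loc;

    loc.pageno = xid / SKETCH_XACTS_PER_PAGE;
    loc.segno = loc.pageno / SKETCH_PAGES_PER_SEGMENT;
    loc.page_in_segment = (int) (loc.pageno % SKETCH_PAGES_PER_SEGMENT);
    loc.byte_in_page = (int) ((xid % SKETCH_XACTS_PER_PAGE) / SKETCH_XACTS_PER_BYTE);
    loc.bit_shift = (int) ((xid % SKETCH_XACTS_PER_BYTE) * 2);
    return loc;
}
```

Under these assumptions, XID 1,048,576 is the first XID of page 32, which is the first page of segment 1, so that is the point at which a second segment file appears.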
LRU with “last page pinned”
Pure LRU evicts the frame not accessed for the longest time. For
append-mostly workloads the latest page will be written by every
committing transaction; a pure LRU eviction policy could paradoxically
evict it if no other reads have happened in the bank since it was
loaded. PostgreSQL’s SLRU introduces a single special case: the page
identified by latest_page_number (an atomic uint64) is never
selected as a victim, regardless of its LRU count. Every other page
follows pure LRU within its bank.
PostgreSQL’s Approach
SLRU is a generic substrate that any in-core subsystem may instantiate
by calling SimpleLruInit. As of REL_18_STABLE the following clients
exist:
| Client | SlruCtl | Directory | Notes |
|---|---|---|---|
| CLOG | XactCtl | pg_xact/ | 2 bits/XID, WAL-before-data enabled |
| Subtransaction | SubTransCtl | pg_subtrans/ | Parent XID per sub-XID |
| Commit timestamp | CommitTsCtl | pg_commit_ts/ | Populated only when track_commit_timestamp is on |
| MultiXact offset | MultiXactOffsetCtl | pg_multixact/offsets/ | Offset into members file |
| MultiXact member | MultiXactMemberCtl | pg_multixact/members/ | XID list + flags |
| NOTIFY | NotifyCtl | pg_notify/ | Async notifications; no fsync |
| Serializable | SerialSlruCtl | pg_serial/ | SSI SIREAD lock summary |
Extensions may also call SimpleLruInit to define their own SLRUs.
Data structures
The SLRU state is split between a shared part (allocated in the
fixed shared-memory segment at postmaster startup) and an unshared
control struct (SlruCtlData) that each backend keeps in local memory
and uses to reach the shared state.

    // SlruSharedData (src/include/access/slru.h)
    typedef struct SlruSharedData
    {
        int              num_slots;           /* total buffer slots */
        char           **page_buffer;         /* BLCKSZ byte arrays, one per slot */
        SlruPageStatus  *page_status;         /* EMPTY / READ_IN_PROGRESS / VALID / WRITE_IN_PROGRESS */
        bool            *page_dirty;          /* re-dirtied while write in progress? */
        int64           *page_number;         /* logical page each slot holds */
        int             *page_lru_count;      /* per-slot LRU counter */
        LWLockPadded    *buffer_locks;        /* per-buffer I/O LWLocks */
        LWLockPadded    *bank_locks;          /* per-bank control LWLocks */
        int             *bank_cur_lru_count;  /* per-bank LRU clock */
        XLogRecPtr      *group_lsn;           /* optional WAL flush LSNs (CLOG only) */
        int              lsn_groups_per_page;
        pg_atomic_uint64 latest_page_number;  /* newest page; never evict this one */
        int              slru_stats_idx;      /* index into pgstat SLRU counters */
    } SlruSharedData;

    // SlruCtlData (src/include/access/slru.h)
    typedef struct SlruCtlData
    {
        SlruShared          shared;
        uint16              nbanks;             /* total banks = num_slots / SLRU_BANK_SIZE */
        bool                long_segment_names; /* 15-hex-char names vs 4-6-hex-char names */
        SyncRequestHandler  sync_handler;       /* SYNC_HANDLER_NONE disables fsync (notify) */
        bool                (*PagePrecedes) (int64, int64); /* modular "older than" for truncation */
        char                Dir[64];            /* PGDATA-relative subdirectory */
    } SlruCtlData;

Bank architecture
Buffers are organized into banks of SLRU_BANK_SIZE = 16 slots
each. Every page is assigned to a bank by pageno % nbanks. Each bank
has its own bank_locks[bankno] LWLock and its own bank_cur_lru_count
counter. The key properties this delivers are:
- Scope-limited LRU search. Victim selection (SlruSelectLRUPage) only walks the 16 slots of the target bank. No global scan; no hashtable lookup.
- Partitioned locking. Concurrent accesses to pages in different banks contend on different LWLocks, reducing hot-lock contention at high commit rates.
- Per-bank LRU counts avoid global invalidation. Updating bank_cur_lru_count[bankno] on every access touches only the cache line for that bank, not a global counter.
The total buffer count is a multiple of SLRU_BANK_SIZE. GUC parameters
(e.g., transaction_buffers, multixact_offset_buffers) control the
total count; check_slru_buffers enforces the multiple-of-16 constraint.
SimpleLruAutotuneBuffers computes a default count as
shared_buffers / divisor, rounded down to a multiple of 16.
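The rounding rule can be sketched as follows. This is a simplified illustration, not the body of SimpleLruAutotuneBuffers or check_slru_buffers: the function names are hypothetical, and the one-bank minimum and the max_buffers cap are assumptions of the sketch rather than details stated above.

```c
#include <assert.h>
#include <stdbool.h>

#define SKETCH_BANK_SIZE 16     /* slots per bank, as in the text */

/* Round down to a whole number of banks (assumed minimum: one bank). */
static int
round_to_banks(int raw_buffers)
{
    int rounded = raw_buffers - (raw_buffers % SKETCH_BANK_SIZE);

    return rounded < SKETCH_BANK_SIZE ? SKETCH_BANK_SIZE : rounded;
}

/* Default count: shared_buffers / divisor, bank-aligned, capped (assumed). */
static int
autotune_buffers(int shared_buffers, int divisor, int max_buffers)
{
    int candidate;
    int cap;

    assert(shared_buffers >= 0 && divisor > 0 && max_buffers > 0);
    candidate = round_to_banks(shared_buffers / divisor);
    cap = round_to_banks(max_buffers);
    return candidate < cap ? candidate : cap;
}

/* The multiple-of-16 constraint a check hook would enforce. */
static bool
is_valid_buffer_count(int nbuffers)
{
    return nbuffers > 0 && nbuffers % SKETCH_BANK_SIZE == 0;
}
```

With shared_buffers = 16384 pages and a divisor of 512, the sketch yields 32 buffers, that is, two banks.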
The shared-memory layout is a single ShmemInitStruct allocation whose
size is precomputed by SimpleLruShmemSize. After the SlruSharedData
header come the parallel per-slot arrays (one entry per buffer slot,
indexed by slotno), then the per-bank arrays (one entry per bank,
indexed by bankno = pageno % nbanks), then the optional group_lsn
array, and finally the BLCKSZ * nslots block of page buffers
themselves. The slot index and bank index are tied together by
SlotGetBankNumber(slotno) = slotno >> SLRU_BANK_BITSHIFT — i.e., the
first 16 slots belong to bank 0, the next 16 to bank 1, and so on.
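A minimal sketch of how a page number leads to the window of slots that a lookup or victim search has to examine. The helper names are hypothetical; only the modulo and the 4-bit shift come from the description above.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_BANK_BITSHIFT 4
#define SKETCH_BANK_SIZE     (1 << SKETCH_BANK_BITSHIFT)    /* 16 */

/* Bank that a logical page maps to. */
static int
page_to_bank(int64_t pageno, int nbanks)
{
    assert(pageno >= 0 && nbanks > 0);
    return (int) (pageno % nbanks);
}

/* First slot of a bank; the bank covers [first, first + 16). */
static int
bank_first_slot(int bankno)
{
    return bankno * SKETCH_BANK_SIZE;
}

/* Inverse direction: which bank owns a slot. */
static int
slot_to_bank(int slotno)
{
    return slotno >> SKETCH_BANK_BITSHIFT;
}
```

With 64 slots (4 banks), page 37 maps to bank 1, so only slots 16 through 31 are ever scanned for it.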
flowchart TB
subgraph SHM["ShmemInitStruct(name, SimpleLruShmemSize(nslots, nlsns))"]
HDR["SlruSharedData header<br/>num_slots, latest_page_number (atomic),<br/>lsn_groups_per_page, slru_stats_idx"]
subgraph PERSLOT["Per-slot arrays — indexed by slotno (length nslots)"]
PS["page_status[]<br/>EMPTY / READ_IP / VALID / WRITE_IP"]
PD["page_dirty[]"]
PN["page_number[] (int64)"]
PL["page_lru_count[] (int)"]
BL["buffer_locks[] (LWLockPadded)<br/>one I/O lock per slot"]
end
subgraph PERBANK["Per-bank arrays — indexed by bankno (length nbanks)"]
BKL["bank_locks[] (LWLockPadded)"]
BC["bank_cur_lru_count[] (int)"]
end
GL["group_lsn[] (XLogRecPtr)<br/>nslots * lsn_groups_per_page<br/>NULL unless nlsns > 0 — CLOG only"]
subgraph BUFS["page_buffer[] -> BLCKSZ * nslots bytes"]
PB0["slot 0: 8192-byte page"]
PB1["slot 1: 8192-byte page"]
PBN["... slot nslots-1"]
end
end
SEG["pg_xact / pg_subtrans / ... segment files<br/>SLRU_PAGES_PER_SEGMENT = 32 pages each"]
PN -->|"pageno % nbanks"| BKL
BL -.->|"SlotGetBankNumber = slotno >> 4"| BKL
PB0 <-->|"SlruPhysicalReadPage / SlruPhysicalWritePage<br/>pread() / pwrite() at (pageno % 32) * BLCKSZ"| SEG
The contiguous-array sizing is explicit in SimpleLruShmemSize; each
array is MAXALIGN-padded so the next array starts on a safe boundary,
and the page-buffer block is appended after a BUFFERALIGN:

    // SimpleLruShmemSize (src/backend/access/transam/slru.c)
    int nbanks = nslots / SLRU_BANK_SIZE;

    sz = MAXALIGN(sizeof(SlruSharedData));
    sz += MAXALIGN(nslots * sizeof(char *));            /* page_buffer[] */
    sz += MAXALIGN(nslots * sizeof(SlruPageStatus));    /* page_status[] */
    sz += MAXALIGN(nslots * sizeof(bool));              /* page_dirty[] */
    sz += MAXALIGN(nslots * sizeof(int64));             /* page_number[] */
    sz += MAXALIGN(nslots * sizeof(int));               /* page_lru_count[] */
    sz += MAXALIGN(nslots * sizeof(LWLockPadded));      /* buffer_locks[] */
    sz += MAXALIGN(nbanks * sizeof(LWLockPadded));      /* bank_locks[] */
    sz += MAXALIGN(nbanks * sizeof(int));               /* bank_cur_lru_count[] */

    if (nlsns > 0)
        sz += MAXALIGN(nslots * nlsns * sizeof(XLogRecPtr)); /* group_lsn[] */

    return BUFFERALIGN(sz) + BLCKSZ * nslots;           /* page contents last */

SimpleLruInit then carves this one allocation into the array pointers
in exactly the same order (the carving code lives in the
if (!IsUnderPostmaster) branch, where it advances a running ptr
through the region). The fact that the size calculation and the pointer
assignment walk the arrays in lockstep is the layout’s only correctness
contract — there is no per-array allocation, so a mismatch would silently
overlap two arrays.
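The lockstep contract can be illustrated with a reduced layout of three arrays. This is a sketch under stated assumptions: SKETCH_MAXALIGN is an 8-byte stand-in for MAXALIGN, and SketchLayout, sketch_size, and sketch_carve are hypothetical names, not the SLRU code itself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Stand-in for MAXALIGN: round up to a multiple of 8 bytes. */
#define SKETCH_MAXALIGN(len) (((size_t) (len) + 7) & ~((size_t) 7))

typedef struct SketchLayout
{
    int32_t *status;        /* one entry per slot */
    int64_t *page_number;   /* one entry per slot */
    int32_t *bank_count;    /* one entry per bank */
    size_t   used;          /* bytes consumed by the carving pass */
} SketchLayout;

/* Sizing pass: one aligned term per array, in a fixed order. */
static size_t
sketch_size(int nslots, int nbanks)
{
    size_t sz = 0;

    sz += SKETCH_MAXALIGN(nslots * sizeof(int32_t));
    sz += SKETCH_MAXALIGN(nslots * sizeof(int64_t));
    sz += SKETCH_MAXALIGN(nbanks * sizeof(int32_t));
    return sz;
}

/* Carving pass: must advance through the arrays in the same order. */
static SketchLayout
sketch_carve(char *base, int nslots, int nbanks)
{
    SketchLayout layout;
    size_t       offset = 0;

    layout.status = (int32_t *) (base + offset);
    offset += SKETCH_MAXALIGN(nslots * sizeof(int32_t));
    layout.page_number = (int64_t *) (base + offset);
    offset += SKETCH_MAXALIGN(nslots * sizeof(int64_t));
    layout.bank_count = (int32_t *) (base + offset);
    offset += SKETCH_MAXALIGN(nbanks * sizeof(int32_t));
    layout.used = offset;
    return layout;
}
```

Comparing the carving pass's final offset with the sizing pass's total is a cheap way to catch the silent-overlap failure mode when an array is added to one pass but not the other.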
The bank geometry itself is a pure bit-shift, chosen so victim search touches a bounded, cache-friendly window:

    // bank geometry (src/backend/access/transam/slru.c)
    #define SLRU_BANK_BITSHIFT 4
    #define SLRU_BANK_SIZE (1 << SLRU_BANK_BITSHIFT)    /* 16 */
    #define SlotGetBankNumber(slotno) ((slotno) >> SLRU_BANK_BITSHIFT)

Page lifecycle and the four states

    SimpleLruZeroPage / SimpleLruReadPage
                      |
    EMPTY ───► READ_IN_PROGRESS ───► VALID
                                      │ ▲
                              (dirty) │ │ re-dirty during write
                                      ▼ │
                        WRITE_IN_PROGRESS ───► VALID (clean)

- SLRU_PAGE_EMPTY: the slot is unused.
- SLRU_PAGE_READ_IN_PROGRESS: a backend is reading the page from disk; it holds buffer_locks[slotno] exclusively and has released the bank lock.
- SLRU_PAGE_VALID: the page is in memory and usable; may be dirty.
- SLRU_PAGE_WRITE_IN_PROGRESS: a backend is writing the page to disk; it holds buffer_locks[slotno] exclusively. If another backend dirties the page during the write, page_dirty is set to true again so a second write will follow.

The page_dirty flag is separate from page_status. A page can be
WRITE_IN_PROGRESS and dirty simultaneously, signalling that the
current write will not be sufficient and a re-write is needed.
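The interaction between page_status and page_dirty during a write can be sketched as a tiny state machine. The types and function names are hypothetical, and the sketch assumes the dirty flag is cleared when the write starts, which is what makes a re-dirty during the write observable afterwards.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum SketchPageStatus
{
    PAGE_EMPTY,
    PAGE_READ_IN_PROGRESS,
    PAGE_VALID,
    PAGE_WRITE_IN_PROGRESS
} SketchPageStatus;

typedef struct SketchSlot
{
    SketchPageStatus status;
    bool             dirty;
} SketchSlot;

/* Start writing a dirty page: mark it write-busy and clear the dirty flag. */
static void
sketch_begin_write(SketchSlot *slot)
{
    assert(slot->status == PAGE_VALID && slot->dirty);
    slot->status = PAGE_WRITE_IN_PROGRESS;
    slot->dirty = false;
}

/* A status update may arrive while the write is still in flight. */
static void
sketch_modify(SketchSlot *slot)
{
    assert(slot->status == PAGE_VALID ||
           slot->status == PAGE_WRITE_IN_PROGRESS);
    slot->dirty = true;
}

/* Finish the write; returns true when another write is still needed. */
static bool
sketch_finish_write(SketchSlot *slot)
{
    assert(slot->status == PAGE_WRITE_IN_PROGRESS);
    slot->status = PAGE_VALID;
    return slot->dirty;
}
```

A modification that lands between sketch_begin_write and sketch_finish_write leaves the slot VALID and dirty, so a later flush picks it up again instead of losing the update.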
Locking protocol
The design uses two tiers of LWLocks per SLRU instance:
- Bank control locks (bank_locks[bankno]): protect all metadata fields of the bank’s slots: page_status, page_dirty, page_number, page_lru_count, bank_cur_lru_count. Must be held (exclusive in most paths, shared in SimpleLruReadPage_ReadOnly) to examine or modify any of these fields.
- Per-buffer I/O locks (buffer_locks[slotno]): synchronize in-flight I/O. Before starting a read or write the backend acquires the buffer lock exclusively, then releases the bank lock. Other backends waiting for the same I/O release the bank lock, block on the buffer lock in shared mode, and re-acquire the bank lock after the I/O finishes.
The key invariant: no backend holds a bank lock while performing I/O or
while waiting for a buffer lock. The backend that starts the I/O takes
the buffer lock exclusively before dropping the bank lock, and
re-acquires the bank lock before releasing the buffer lock, so the two
locks overlap only briefly around the I/O. A waiter always drops the
bank lock before it blocks on the buffer lock, so it can never stall the
I/O owner’s re-acquire, which rules out deadlock. SimpleLruWaitIO is the
canonical expression of this ordering; it is how a backend blocks on
someone else’s in-flight I/O:

    // SimpleLruWaitIO (src/backend/access/transam/slru.c)
    int bankno = SlotGetBankNumber(slotno);

    /* See notes at top of file */
    LWLockRelease(&shared->bank_locks[bankno].lock);
    LWLockAcquire(&shared->buffer_locks[slotno].lock, LW_SHARED);
    LWLockRelease(&shared->buffer_locks[slotno].lock);
    LWLockAcquire(&shared->bank_locks[bankno].lock, LW_EXCLUSIVE);

    /*
     * If the slot is still io-in-progress, the I/O must have failed;
     * recover the page_status so the slot can be reused.
     */
    if (shared->page_status[slotno] == SLRU_PAGE_READ_IN_PROGRESS ||
        shared->page_status[slotno] == SLRU_PAGE_WRITE_IN_PROGRESS)
    {
        if (LWLockConditionalAcquire(&shared->buffer_locks[slotno].lock, LW_SHARED))
        {
            if (shared->page_status[slotno] == SLRU_PAGE_READ_IN_PROGRESS)
                shared->page_status[slotno] = SLRU_PAGE_EMPTY;
            else                /* write_in_progress */
            {
                shared->page_status[slotno] = SLRU_PAGE_VALID;
                shared->page_dirty[slotno] = true;
            }
            LWLockRelease(&shared->buffer_locks[slotno].lock);
        }
    }

Acquiring the buffer lock LW_SHARED here is a pure rendezvous: the
backend doing the I/O holds it LW_EXCLUSIVE, so the waiter blocks until
the I/O finishes and releases it, then immediately drops it. The
conditional-acquire afterward is a self-healing check — if the buffer
lock is grantable but the slot is still flagged in-progress, the original
I/O must have errored out (a transaction abort released its lock without
resetting the status), so this backend repairs the slot state.
latest_page_number is the single exception: it is read and written with
atomic operations (pg_atomic_read_u64 / pg_atomic_write_u64) rather
than under a lock, because checking whether a slot holds the latest page
during LRU victim selection is an advisory hint, not a correctness-critical
test.

    // SimpleLruReadPage (src/backend/access/transam/slru.c)
    // Bank lock held exclusive on entry and exit.
    int
    SimpleLruReadPage(SlruCtl ctl, int64 pageno, bool write_ok, TransactionId xid)
    {
        for (;;)
        {
            slotno = SlruSelectLRUPage(ctl, pageno);    /* find or pick victim */

            if (page already in memory && not I/O busy)
            {
                SlruRecentlyUsed(shared, slotno);
                pgstat_count_slru_page_hit(...);
                return slotno;
            }

            /* Mark slot read-busy, acquire buffer lock, release bank lock */
            shared->page_status[slotno] = SLRU_PAGE_READ_IN_PROGRESS;
            LWLockAcquire(&shared->buffer_locks[slotno].lock, LW_EXCLUSIVE);
            LWLockRelease(banklock);

            ok = SlruPhysicalReadPage(ctl, pageno, slotno);    /* pread() */

            /* Re-acquire bank lock, update status, release buffer lock */
            LWLockAcquire(banklock, LW_EXCLUSIVE);
            shared->page_status[slotno] = ok ? SLRU_PAGE_VALID : SLRU_PAGE_EMPTY;
            LWLockRelease(&shared->buffer_locks[slotno].lock);

            if (!ok)
                SlruReportIOError(ctl, pageno, xid);
            return slotno;
        }
    }

SimpleLruReadPage_ReadOnly first attempts a shared bank lock scan:
it walks the bank looking for the page without acquiring the exclusive
lock. On a hit, SlruRecentlyUsed updates the LRU count without a
barrier (safe because int reads/writes are atomic on all supported
platforms and a slightly stale LRU count only degrades, not breaks,
victim selection). On a miss, it escalates to exclusive and falls through
to SimpleLruReadPage.
The full SimpleLruReadPage miss path — the case where the page is not
resident and must be faulted in — is the most lock-intricate flow in the
file. The outer for (;;) loop exists precisely so that a backend which
discovers the slot is mid-I/O (its own retry, or another backend’s read)
can wait and re-derive the slot from scratch, because after dropping the
bank lock the slot contents may have changed entirely:
flowchart TB
START["SimpleLruReadPage(ctl, pageno, write_ok, xid)<br/>bank lock held EXCLUSIVE on entry"] --> SEL["slotno = SlruSelectLRUPage(ctl, pageno)<br/>scan 16 slots of bank pageno % nbanks"]
SEL --> RES{"slot holds pageno<br/>and != EMPTY?"}
RES -->|"yes — hit"| BUSY{"READ_IN_PROGRESS,<br/>or WRITE_IN_PROGRESS<br/>and not write_ok?"}
BUSY -->|"yes"| WAIT["SimpleLruWaitIO(ctl, slotno)<br/>drop bank lock, take buffer lock SHARED,<br/>release it, re-take bank lock"]
WAIT --> SEL
BUSY -->|"no"| HIT["SlruRecentlyUsed<br/>pgstat_count_slru_page_hit<br/>return slotno"]
RES -->|"no — miss, victim chosen"| MARK["page_number[slotno] = pageno<br/>page_status[slotno] = READ_IN_PROGRESS<br/>page_dirty[slotno] = false"]
MARK --> ACQ["LWLockAcquire buffer_locks[slotno] EXCLUSIVE<br/>then LWLockRelease(banklock) — never both held"]
ACQ --> IO["SlruPhysicalReadPage(ctl, pageno, slotno)<br/>pread(); missing file -> zeroes in recovery"]
IO --> ZLSN["SimpleLruZeroLSNs(ctl, slotno)"]
ZLSN --> REACQ["LWLockAcquire(banklock) EXCLUSIVE<br/>page_status = ok ? VALID : EMPTY<br/>LWLockRelease(buffer_locks[slotno])"]
REACQ --> ERR{"read ok?"}
ERR -->|"no"| RPT["SlruReportIOError(ctl, pageno, xid)<br/>ereport(ERROR)"]
ERR -->|"yes"| DONE["SlruRecentlyUsed<br/>pgstat_count_slru_page_read<br/>return slotno"]
Note that SlruSelectLRUPage is called even on the retry after a wait:
the page someone else read in may now be resident, turning the next
iteration into a hit. If the chosen victim was dirty, SlruSelectLRUPage
itself flushes it via SlruInternalWritePage before returning, so by the
time SimpleLruReadPage overwrites the slot it is guaranteed clean
(asserted at the “we found no match” point).
LRU victim selection

    // SlruSelectLRUPage (src/backend/access/transam/slru.c)
    // Bank lock held on entry and exit. May write a dirty victim to disk.
    static int
    SlruSelectLRUPage(SlruCtl ctl, int64 pageno)
    {
        /* 1. Return any slot already holding the target page. */
        /* 2. Return an EMPTY slot if one exists. */
        /* 3. Among VALID slots: find the one with the largest
              (bank_cur_lru_count - page_lru_count), i.e. least recently used.
              Skip the latest_page_number slot. */
        /* 4. If no VALID slot, wait for the LRU I/O-busy slot and retry. */
        /* 5. If the chosen victim is dirty, call SlruInternalWritePage
              to flush it, then loop back. */
    }

The “largest delta” computation uses integer subtraction with implicit
wraparound, which handles the counter wrapping correctly as long as no
page’s age exceeds INT_MAX counts. Because concurrent SlruRecentlyUsed
calls can leave several pages carrying the same count, SlruSelectLRUPage
advances bank_cur_lru_count before the scan. The counter then sits ahead
of every stored page_lru_count, so the next SlruRecentlyUsed call is
guaranteed to stamp its page as freshly used.
The “mark recently used” side is SlruRecentlyUsed, deliberately small
enough to be safe under a shared bank lock (the read-only fast path
runs it concurrently). The if guard is the load-bearing detail: it
suppresses the increment when the page is already the most-recent in its
bank, which both saves a write and — more importantly — slows the
bank counter’s advance so old pages’ counts are less likely to “wrap
around” and masquerade as recently used:

    // SlruRecentlyUsed (src/backend/access/transam/slru.c)
    int bankno = SlotGetBankNumber(slotno);
    int new_lru_count = shared->bank_cur_lru_count[bankno];

    /*
     * Suppress useless increments; allows concurrent callers under
     * shared bank lock, since int reads/writes are atomic.
     */
    if (new_lru_count != shared->page_lru_count[slotno])
    {
        shared->bank_cur_lru_count[bankno] = ++new_lru_count;
        shared->page_lru_count[slotno] = new_lru_count;
    }

The concurrency tolerance here is what makes SimpleLruReadPage_ReadOnly
viable: multiple readers may race on these two int stores, and the
worst case is a counter “reset” to a lower value, which SlruSelectLRUPage
defensively corrects (the this_delta < 0 branch resets the page count
to cur_count). No correctness depends on the LRU counts being exact —
they only steer eviction toward a reasonable victim, and the
latest_page_number carve-out protects the one page that must never be
evicted regardless of counter noise.
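The counter scheme can be exercised in isolation with a single-bank model. This is a simplified sketch, not the SLRU implementation: every slot is assumed VALID and clean, ties go to the lowest slot index, integer wraparound of the counter is not modelled, and all Sketch* names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_BANK_SIZE 16

typedef struct SketchBank
{
    int64_t page_number[SKETCH_BANK_SIZE];  /* page held by each slot */
    int     lru_count[SKETCH_BANK_SIZE];    /* per-slot stamp */
    int     cur_lru_count;                  /* per-bank clock */
} SketchBank;

/* Stamp a slot as used, skipping the write if it is already the newest. */
static void
sketch_recently_used(SketchBank *bank, int slot)
{
    int new_count = bank->cur_lru_count;

    if (new_count != bank->lru_count[slot])
    {
        bank->cur_lru_count = ++new_count;
        bank->lru_count[slot] = new_count;
    }
}

/* Pick the slot with the largest age, never the one holding latest_page. */
static int
sketch_select_victim(SketchBank *bank, int64_t latest_page)
{
    int cur_count = bank->cur_lru_count++;  /* advance the clock first */
    int best_slot = -1;
    int best_delta = -1;

    for (int slot = 0; slot < SKETCH_BANK_SIZE; slot++)
    {
        int delta;

        if (bank->page_number[slot] == latest_page)
            continue;           /* the latest page is never a victim */

        delta = cur_count - bank->lru_count[slot];
        if (delta < 0)
        {
            /* Stamp is ahead of the clock: repair it defensively. */
            bank->lru_count[slot] = cur_count;
            delta = 0;
        }
        if (delta > best_delta)
        {
            best_delta = delta;
            best_slot = slot;
        }
    }
    return best_slot;
}
```

If the slots are touched in order 0 through 15 and slot 0 is then touched again, slot 1 becomes the oldest. Passing slot 1's page as latest_page moves the choice to slot 2, which is the carve-out described above.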
WAL-before-data for CLOG
CLOG is the only SLRU client that requires WAL flushing before writes.
Each committing transaction calls TransactionIdSetPageStatus (in
clog.c) which updates a 2-bit field and records the commit’s WAL LSN
into group_lsn[slotno * lsn_groups_per_page + group_offset]. When
SlruPhysicalWritePage is called for a CLOG slot, it scans the
group_lsn array for the slot, finds the maximum LSN, and calls
XLogFlush(max_lsn) before issuing the pwrite().

    // SlruPhysicalWritePage, WAL path (src/backend/access/transam/slru.c)
    if (shared->group_lsn != NULL)
    {
        XLogRecPtr  max_lsn = 0;

        lsnindex = slotno * shared->lsn_groups_per_page;
        for (int lsnoff = 0; lsnoff < shared->lsn_groups_per_page; lsnoff++)
        {
            XLogRecPtr  this_lsn = shared->group_lsn[lsnindex++];

            if (max_lsn < this_lsn)
                max_lsn = this_lsn;
        }

        if (!XLogRecPtrIsInvalid(max_lsn))
        {
            START_CRIT_SECTION();
            XLogFlush(max_lsn);
            END_CRIT_SECTION();
        }
    }

Other SLRU clients pass nlsns = 0 to SimpleLruInit, which leaves
group_lsn NULL, and the WAL flush is skipped entirely.
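The recording side and the flush decision can be modelled together. The following is a hedged sketch: SketchLsn stands in for XLogRecPtr, the two functions are hypothetical, and how a caller derives group_offset from an XID is left out because the text does not specify the group size.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t SketchLsn;     /* stand-in for XLogRecPtr; 0 means "none" */

/* Remember the highest LSN seen for one group of one slot. */
static void
sketch_record_lsn(SketchLsn *group_lsn, int lsn_groups_per_page,
                  int slotno, int group_offset, SketchLsn lsn)
{
    int idx;

    assert(group_offset >= 0 && group_offset < lsn_groups_per_page);
    idx = slotno * lsn_groups_per_page + group_offset;
    if (group_lsn[idx] < lsn)
        group_lsn[idx] = lsn;
}

/* WAL position that must be durable before the slot's page is written. */
static SketchLsn
sketch_flush_target(const SketchLsn *group_lsn, int lsn_groups_per_page,
                    int slotno)
{
    SketchLsn max_lsn = 0;
    int       idx = slotno * lsn_groups_per_page;

    for (int i = 0; i < lsn_groups_per_page; i++)
    {
        if (max_lsn < group_lsn[idx + i])
            max_lsn = group_lsn[idx + i];
    }
    return max_lsn;
}
```

Tracking one LSN per group rather than one per page is a space/precision trade: the flush target is never too small, only occasionally larger than strictly necessary.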
Segment naming and the long/short split
On-disk, each SLRU maps logical page numbers to segment files. A segment
holds SLRU_PAGES_PER_SEGMENT = 32 pages. The segment number is
pageno / 32. The file name is the segment number in hex:
- Short names (default): 4–6 hex digits (0000 to FFFFFF), used by CLOG, subtrans, commit_ts. Supports up to 2²⁴ segments.
- Long names (long_segment_names = true): 15 hex digits, supporting up to 2⁶⁰ segments, for clients whose 64-bit page numbers must never wrap. In REL_18_STABLE this is the NOTIFY queue (pg_notify); MultiXactId is still a 32-bit type on that branch, so the multixact SLRUs keep short names.
SlruFileName encodes this choice:

    // SlruFileName (src/backend/access/transam/slru.c)
    static inline int
    SlruFileName(SlruCtl ctl, char *path, int64 segno)
    {
        if (ctl->long_segment_names)
            return snprintf(path, MAXPGPATH, "%s/%015" PRIX64, ctl->Dir, segno);
        else
            return snprintf(path, MAXPGPATH, "%s/%04X", ctl->Dir,
                            (unsigned int) segno);
    }

Truncation and the PagePrecedes callback
SimpleLruTruncate(ctl, cutoffPage) removes all on-disk segments that
contain only pages strictly older than cutoffPage. “Older” is defined
by ctl->PagePrecedes(a, b), a client-supplied callback that must use
modular arithmetic to handle XID wraparound.
The safety invariant: latest_page_number must not be older than
cutoffPage. If it is, the function logs a warning and returns without
deleting anything — this is a backstop against bugs in the caller.
The in-memory side first evicts all buffered pages below the cutoff (or
waits for I/O to finish), then calls SlruScanDirectory with the
SlruScanDirCbDeleteCutoff callback, which deletes segments where both
the first and last page satisfy PagePrecedes(page, cutoffPage) —
SlruMayDeleteSegment’s four-case analysis handles the wraparound edge
case where a segment straddles the cutoff or the wrap point.
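The “both ends must precede the cutoff” rule can be sketched with a toy callback. The assumptions are stated loudly: sketch_page_precedes treats page numbers as a 32-bit circular space and uses the usual signed-difference trick, which is simpler than what real clients such as CLOG supply; both function names and the typedef are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_PAGES_PER_SEGMENT 32

typedef bool (*SketchPagePrecedes) (int64_t a, int64_t b);

/*
 * Toy modular comparison: a precedes b when b is less than 2^31 pages
 * ahead of a on a 32-bit circle. The unsigned subtraction wraps; the cast
 * to int32_t reinterprets the result as signed (two's complement assumed).
 */
static bool
sketch_page_precedes(int64_t a, int64_t b)
{
    int32_t diff = (int32_t) ((uint32_t) a - (uint32_t) b);

    return diff < 0;
}

/* A segment may go only if its first AND last page precede the cutoff. */
static bool
sketch_may_delete_segment(SketchPagePrecedes precedes,
                          int64_t segpage, int64_t cutoff_page)
{
    int64_t seg_last_page = segpage + SKETCH_PAGES_PER_SEGMENT - 1;

    assert(segpage % SKETCH_PAGES_PER_SEGMENT == 0);
    return precedes(segpage, cutoff_page) &&
           precedes(seg_last_page, cutoff_page);
}
```

With a cutoff at page 100, the segment covering pages 64 to 95 is deletable, while the one covering pages 96 to 127 is kept because it straddles the cutoff. A segment just below the 2³² wrap point is still recognised as older than a small cutoff such as page 10.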
Checkpoint integration
SimpleLruWriteAll flushes every dirty buffer to disk. It walks all
slots in bank order, acquiring and releasing bank locks as the bank
boundary changes. It batches file descriptors in SlruWriteAllData (up
to MAX_WRITEALL_BUFFERS = 16 open FDs) to amortize open()/close()
cost across pages of the same segment. After the slot scan it syncs the
directory (fsync_fname(ctl->Dir, true)) to ensure new segment-file
directory entries are durable.
Checkpoint accounting: each SLRU page written during checkpoint increments
CheckpointStats.ckpt_slru_written and PendingCheckpointerStats.slru_written.
Statistics
Each SLRU instance registers with pgstat_get_slru_index(name) at init
time. The per-SLRU counters tracked are:
- pgstat_count_slru_page_zeroed: new page created.
- pgstat_count_slru_page_hit: found in buffer.
- pgstat_count_slru_page_read: loaded from disk.
- pgstat_count_slru_page_written: written to disk.
- pgstat_count_slru_page_exists: SimpleLruDoesPhysicalPageExist called.
- pgstat_count_slru_flush: SimpleLruWriteAll called.
- pgstat_count_slru_truncate: SimpleLruTruncate called.

These counters are exposed in pg_stat_slru.
Source Walkthrough
Initialization subsystem
| Symbol | Role |
|---|---|
| SimpleLruShmemSize | Computes shared-memory bytes needed for nslots buffers + nbanks locks + optional group_lsn array |
| SimpleLruAutotuneBuffers | GUC default: shared_buffers / divisor, rounded to SLRU_BANK_SIZE multiple |
| SimpleLruInit | Allocates or attaches to named shmem region; initialises slot/bank arrays and LWLocks; sets ctl->Dir, nbanks, PagePrecedes |
| check_slru_buffers | GUC check hook; rejects values not divisible by SLRU_BANK_SIZE |
Read/write API
| Symbol | Role |
|---|---|
| SimpleLruZeroPage | Allocates a new zeroed page into a slot; sets it as latest_page_number |
| SimpleLruReadPage | Finds or loads a page; returns slot number; caller retains bank lock |
| SimpleLruReadPage_ReadOnly | Optimistic shared-lock scan before falling back to SimpleLruReadPage |
| SimpleLruWritePage | External wrapper for SlruInternalWritePage with fdata = NULL |
| SimpleLruWriteAll | Checkpoint flush: iterates all slots, writes dirty ones, batches FDs |
| SimpleLruDoesPhysicalPageExist | Checks on-disk existence without loading into a buffer |
Internal helpers
| Symbol | Role |
|---|---|
| SlruInternalWritePage | Core write: marks slot WRITE_IN_PROGRESS, drops bank lock, calls SlruPhysicalWritePage, re-acquires bank lock |
| SlruPhysicalReadPage | Issues pread(); treats missing file as zeroes during recovery |
| SlruPhysicalWritePage | Enforces WAL-before-data via group_lsn; issues pwrite(); queues sync request |
| SlruRecentlyUsed | Stamps page_lru_count[slotno] = ++bank_cur_lru_count[bankno]; safe under shared bank lock |
| SlruSelectLRUPage | Victim search within the bank; evicts dirty victims inline |
| SimpleLruWaitIO | Releases bank lock, acquires buffer lock shared, releases buffer lock, re-acquires bank lock |
| SlruReportIOError | Translates slru_errcause / slru_errno into an ereport(ERROR) after state is cleaned up |
| SimpleLruZeroLSNs | Zeroes the group_lsn array entries for a slot |
Segment management
| Symbol | Role |
|---|---|
| SlruFileName | Formats segment path; short (4–6 hex) or long (15 hex) based on long_segment_names |
| SimpleLruTruncate | Evicts old in-memory pages, then calls SlruScanDirectory(SlruScanDirCbDeleteCutoff) |
| SlruMayDeleteSegment | Four-case modular check: both first and last page of segment must precede cutoff |
| SlruInternalDeleteSegment | Forgets sync requests, calls unlink() |
| SlruDeleteSegment | Public entry: evicts the segment from buffers first, then calls SlruInternalDeleteSegment |
| SlruScanDirectory | Reads directory entries, filters by name length + hex pattern, invokes callback per segment |
| SlruScanDirCbReportPresence | Scan callback: returns true if any segment precedes cutoff |
| SlruScanDirCbDeleteCutoff | Scan callback: deletes segments that satisfy SlruMayDeleteSegment |
| SlruScanDirCbDeleteAll | Scan callback: deletes all segments unconditionally |
| SlruSyncFileTag | Sync handler: opens the segment file and calls pg_fsync() |
Data structures (header)
| Symbol | Role |
|---|---|
| SlruPageStatus | Enum: SLRU_PAGE_EMPTY, SLRU_PAGE_READ_IN_PROGRESS, SLRU_PAGE_VALID, SLRU_PAGE_WRITE_IN_PROGRESS |
| SlruSharedData / SlruShared | Shared-memory state: arrays for page_buffer, page_status, page_dirty, page_number, page_lru_count, buffer_locks, bank_locks, bank_cur_lru_count, group_lsn |
| SlruCtlData / SlruCtl | Per-backend unshared control: pointer to shared, nbanks, long_segment_names, sync_handler, PagePrecedes, Dir |
| SimpleLruGetBankLock | Inline: computes pageno % nbanks and returns &bank_locks[bankno].lock |
Position hints (commit 273fe94, REL_18_STABLE, 2026-06-05)
| Symbol | File | Line |
|---|---|---|
| SlruSharedData | src/include/access/slru.h | 61 |
| SlruCtlData | src/include/access/slru.h | 127 |
| SlruPageStatus | src/include/access/slru.h | 47 |
| SimpleLruGetBankLock | src/include/access/slru.h | 175 |
| SLRU_BANK_BITSHIFT | src/backend/access/transam/slru.c | 142 |
| SLRU_BANK_SIZE | src/backend/access/transam/slru.c | 143 |
| SlotGetBankNumber | src/backend/access/transam/slru.c | 148 |
| MAX_WRITEALL_BUFFERS | src/backend/access/transam/slru.c | 123 |
| SLRU_PAGES_PER_SEGMENT | src/include/access/slru.h | 39 |
| SimpleLruShmemSize | src/backend/access/transam/slru.c | 198 |
| SimpleLruAutotuneBuffers | src/backend/access/transam/slru.c | 231 |
| SimpleLruInit | src/backend/access/transam/slru.c | 252 |
| SimpleLruZeroPage | src/backend/access/transam/slru.c | 375 |
| SimpleLruWaitIO | src/backend/access/transam/slru.c | 445 |
| SimpleLruReadPage | src/backend/access/transam/slru.c | 502 |
| SimpleLruReadPage_ReadOnly | src/backend/access/transam/slru.c | 605 |
| SlruInternalWritePage | src/backend/access/transam/slru.c | 652 |
| SimpleLruWritePage | src/backend/access/transam/slru.c | 732 |
| SlruPhysicalReadPage | src/backend/access/transam/slru.c | 804 |
| SlruPhysicalWritePage | src/backend/access/transam/slru.c | 876 |
| SlruReportIOError | src/backend/access/transam/slru.c | 1048 |
| SlruRecentlyUsed | src/backend/access/transam/slru.c | 1123 |
| SlruSelectLRUPage | src/backend/access/transam/slru.c | 1169 |
| SimpleLruWriteAll | src/backend/access/transam/slru.c | 1322 |
| SimpleLruTruncate | src/backend/access/transam/slru.c | 1408 |
| SlruInternalDeleteSegment | src/backend/access/transam/slru.c | 1503 |
| SlruDeleteSegment | src/backend/access/transam/slru.c | 1526 |
| SlruMayDeleteSegment | src/backend/access/transam/slru.c | 1603 |
| SlruScanDirectory | src/backend/access/transam/slru.c | 1791 |
| SlruSyncFileTag | src/backend/access/transam/slru.c | 1831 |
Source verification (as of 2026-06-05)
Verified against commit 273fe94 on branch REL_18_STABLE.
Confirmed:
- SlruSharedData layout matches the SimpleLruShmemSize computation exactly: the field offsets calculated in SimpleLruInit (lines 283–309) advance through page_buffer, page_status, page_dirty, page_number, page_lru_count, buffer_locks, bank_locks, bank_cur_lru_count, and optionally group_lsn, in that order.
- SLRU_BANK_SIZE = 16, SLRU_BANK_BITSHIFT = 4, confirmed at lines 142–143. SlotGetBankNumber(slotno) is (slotno) >> 4.
- SLRU_PAGES_PER_SEGMENT = 32 (slru.h:39). One segment is 32 × 8192 = 256 KB. CLOG: 32 pages × (8192 × 4 XIDs/byte) = 1,048,576 XIDs/segment.
- latest_page_number is pg_atomic_uint64 (slru.h:115), written atomically in SimpleLruZeroPage and read without barrier in SlruSelectLRUPage; the comment at slru.c:1251 explains why this is correct.
- group_lsn is non-NULL only for CLOG (nlsns = CLOG_LSNS_PER_PAGE at clog.c:811). All other clients pass nlsns = 0.
- NOTIFY client passes sync_handler = SYNC_HANDLER_NONE, disabling fsync (confirmed at commands/async.c:537–541); the SlruPhysicalWritePage path skips RegisterSyncRequest when sync_handler == SYNC_HANDLER_NONE.
- MAX_WRITEALL_BUFFERS = 16 at slru.c:123. The batching fallback (set fdata = NULL) at slru.c:985 handles the unlikely case of > 16 open files.
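
The segment arithmetic in the list above can be checked mechanically. The sketch below assumes the default 8192-byte block size and the CLOG packing of 2 status bits per transaction (4 XIDs per byte); the ClogAddress struct, the clog_locate function, and the XACTS_PER_SEGMENT and SEGMENT_BYTES macros are hypothetical names introduced only for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define BLCKSZ                 8192   /* default block size, assumed here */
#define CLOG_BITS_PER_XACT     2
#define CLOG_XACTS_PER_BYTE    4
#define CLOG_XACTS_PER_PAGE    (BLCKSZ * CLOG_XACTS_PER_BYTE)   /* 32768 */
#define SLRU_PAGES_PER_SEGMENT 32

/* Hypothetical convenience macros for the figures quoted in the text. */
#define SEGMENT_BYTES     (SLRU_PAGES_PER_SEGMENT * BLCKSZ)
#define XACTS_PER_SEGMENT (SLRU_PAGES_PER_SEGMENT * CLOG_XACTS_PER_PAGE)

/* Compile-time checks of the arithmetic stated in the list above. */
static_assert(SEGMENT_BYTES == 262144, "one segment is 256 KB");
static_assert(XACTS_PER_SEGMENT == 1048576, "1,048,576 XIDs per segment");

/* Hypothetical result type: where one XID's status bits live on disk. */
typedef struct ClogAddress
{
    int64_t segment;          /* segment file number */
    int     page_in_segment;  /* 0 .. SLRU_PAGES_PER_SEGMENT - 1 */
    int     byte_in_page;     /* 0 .. BLCKSZ - 1 */
    int     bit_shift;        /* 0, 2, 4, or 6 within that byte */
} ClogAddress;

/* Decompose an XID into segment, page, byte, and bit position. */
static ClogAddress
clog_locate(uint32_t xid)
{
    int64_t     pageno = xid / CLOG_XACTS_PER_PAGE;
    uint32_t    slot_in_page = xid % CLOG_XACTS_PER_PAGE;
    ClogAddress a;

    a.segment = pageno / SLRU_PAGES_PER_SEGMENT;
    a.page_in_segment = (int) (pageno % SLRU_PAGES_PER_SEGMENT);
    a.byte_in_page = (int) (slot_in_page / CLOG_XACTS_PER_BYTE);
    a.bit_shift = (int) (slot_in_page % CLOG_XACTS_PER_BYTE) * CLOG_BITS_PER_XACT;
    return a;
}
```

For example, XID 1,081,349 (that is, 1,048,576 + 32,768 + 5) falls in segment 1, page 1 of that segment, byte 1 of the page, at a bit shift of 2.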
Unresolved / out of scope for this doc:
- The SlruPagePrecedesUnitTests function (slru.c:1697) and how each client’s PagePrecedes callback satisfies its contracts; covered in postgres-clog-commit-ts.md and postgres-multixact.md.
- The interaction between SimpleLruTruncate and vacuum’s XID horizon computation; covered in postgres-vacuum.md.
- The SSI client (SerialSlruCtl, predicate.c:814) stores SIREAD lock summaries per-page rather than per-XID; the address space semantics differ from CLOG. Covered in postgres-ssi-predicate-locking.md.
Beyond PostgreSQL — Comparative Designs & Research Frontiers
Oracle and SQL Server: no dedicated status log
Oracle records transaction status in the ITL (Interested Transaction List)
slots of each data block header, which carry a commit System Change Number
(SCN) once cleaned out, and in the transaction table of the undo (rollback)
segment header. “Is XID N committed?” is answered by probing that
transaction table, not a separate log file. Blocks whose ITL entries were
not updated at commit time are fixed up lazily by “delayed block cleanout”
the next time they are visited. There is no analog of pg_xact/.
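SQL Server likewise has no side file. When row versioning is enabled
(snapshot isolation or read-committed snapshot), older row versions are
kept in a version store in tempdb and chained from the current row, and
readers resolve visibility from the version chain rather than from a
separate status file.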
The PostgreSQL design’s advantage is simplicity and predictability: the
pg_xact/ SLRU is a compact, uniform log that vacuum can truncate
independently of heap file lifetimes. The disadvantage is that every
visibility check requires an additional SLRU lookup unless the commit
status is already cached in the hint bits of the heap tuple, which is
what SetHintBits in heapam_visibility.c (exported to other modules as
HeapTupleSetHintBits) does to amortize the cost.
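
The amortization that hint bits provide can be illustrated with a toy model. This is a sketch under simplifying assumptions, not PostgreSQL's visibility code: all Toy* types, the TOY_* flag values, and both functions are hypothetical, and the model ignores the additional conditions the real code applies before it sets a hint bit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flags playing the role of the xmin hint bits. */
#define TOY_XMIN_COMMITTED 0x0100
#define TOY_XMIN_INVALID   0x0200

typedef enum { TOY_IN_PROGRESS, TOY_COMMITTED, TOY_ABORTED } ToyXidStatus;

/* Toy tuple header: inserting XID plus a flags word for cached status. */
typedef struct ToyTuple
{
    uint32_t xmin;
    uint16_t infomask;
} ToyTuple;

/* Toy status log standing in for the SLRU; counts how often it is probed. */
typedef struct ToyStatusLog
{
    const ToyXidStatus *status;
    uint32_t            nxids;
    unsigned            lookups;
} ToyStatusLog;

static ToyXidStatus
toy_log_lookup(ToyStatusLog *log, uint32_t xid)
{
    assert(xid < log->nxids);
    log->lookups++;
    return log->status[xid];
}

/* Returns true if the tuple's inserter committed, caching final outcomes. */
static bool
toy_xmin_committed(ToyTuple *tup, ToyStatusLog *log)
{
    if (tup->infomask & TOY_XMIN_COMMITTED)
        return true;            /* cached: no log lookup */
    if (tup->infomask & TOY_XMIN_INVALID)
        return false;           /* cached: no log lookup */

    switch (toy_log_lookup(log, tup->xmin))
    {
        case TOY_COMMITTED:
            tup->infomask |= TOY_XMIN_COMMITTED;
            return true;
        case TOY_ABORTED:
            tup->infomask |= TOY_XMIN_INVALID;
            return false;
        default:
            return false;       /* still in progress: not final, not cached */
    }
}
```

Only final outcomes are cached: a committed or aborted inserter costs one log lookup per tuple, while an in-progress inserter is looked up again on every check until its status settles.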
CUBRID: commit log with a different eviction granularity
CUBRID’s transaction-status subsystem also uses a log-file approach with a shared in-memory cache, but its buffer pool is integrated with the general buffer manager rather than a separate structure. The advantage is a unified eviction policy; the disadvantage is contention on the central buffer pool lock during high-commit-rate workloads, which is exactly the problem SLRU’s banked architecture avoids.
Research: MVCC with status bitmaps in the heap header
Neumann et al. (“Fast Serializable Multi-Version Concurrency Control for
Main-Memory Database Systems”, SIGMOD 2015) and the HyPer/Umbra lineage
embed a transaction-status bitmap directly in each version’s header,
eliminating the side-file lookup entirely. This is viable for
main-memory databases where the bitmap fits in a cache line; for
disk-based databases the side-file approach reduces per-page metadata
overhead. PostgreSQL’s hint bits (t_infomask in HeapTupleHeaderData)
are a partial application of the same idea: once a transaction’s status
is confirmed, it is written into the tuple header so future reads bypass
the SLRU entirely.
The extensibility angle
PostgreSQL’s SLRU is itself an extensibility surface. An extension loaded
at server start can call SimpleLruInit with its own name, directory, buffer
count, and sync handler, and set the PagePrecedes callback on its SlruCtl
(the in-tree test module src/test/modules/test_slru does this). This mirrors the broader
PostgreSQL design philosophy of exposing internal machinery (table AM,
index AM, WAL rmgr, background workers) as first-class extension points.
A columnar or time-series extension that needs a compact, wrap-around-able
status log can reuse the entire SLRU stack — LRU eviction, WAL flushing,
checkpointing, truncation — without reimplementing it.
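
The part of that stack an extension author has to write is the PagePrecedes callback, which must order pages correctly even after the underlying counter wraps. The following is a minimal sketch of the modular-comparison idea for a hypothetical client with 32-bit IDs and 1024 IDs per page; toy_id_precedes, toy_page_precedes, and TOY_IDS_PER_PAGE are illustrative names, and the comparison is deliberately simpler than the contract the in-tree callbacks are tested against.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical client layout: 32-bit IDs, fixed number of IDs per page. */
#define TOY_IDS_PER_PAGE 1024u

/*
 * Modulo-2^32 comparison: a precedes b if the wrapped difference, viewed
 * as a signed 32-bit value, is negative.
 */
static bool
toy_id_precedes(uint32_t a, uint32_t b)
{
    return (int32_t) (a - b) < 0;
}

/*
 * Hypothetical PagePrecedes-style callback: compare two pages by the
 * first ID stored on each.  The multiplication wraps modulo 2^32.
 */
static bool
toy_page_precedes(int64_t page1, int64_t page2)
{
    uint32_t id1 = (uint32_t) page1 * TOY_IDS_PER_PAGE;
    uint32_t id2 = (uint32_t) page2 * TOY_IDS_PER_PAGE;

    return toy_id_precedes(id1, id2);
}
```

With this layout the last page before wraparound is 4194303, and the sketch reports it as preceding page 0, which is the behavior truncation relies on once IDs wrap. A production callback would also need to reason about the last ID on each page, which this sketch omits.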
Sources
In-tree source files (REL_18_STABLE, commit 273fe94):
- src/backend/access/transam/slru.c: main implementation (1852 lines)
- src/include/access/slru.h: public API and shared-memory layout
- src/backend/access/transam/clog.c: CLOG client; defines XactCtl
- src/backend/access/transam/subtrans.c: subtransaction client
- src/backend/access/transam/commit_ts.c: commit_ts client
- src/backend/access/transam/multixact.c: MultiXact offset/member clients
- src/backend/commands/async.c: NOTIFY client (no fsync)
- src/backend/storage/lmgr/predicate.c: SSI SIREAD lock client
Cross-references in this KB:
- postgres-xact.md: transaction lifecycle that drives CLOG writes
- postgres-clog-commit-ts.md: CLOG and commit_ts as SLRU clients
- postgres-multixact.md: MultiXact as SLRU client
- postgres-mvcc-snapshots.md: visibility checks that read CLOG via SLRU
- postgres-xlog-wal.md: WAL infrastructure that SLRU defers to before writes
- postgres-vacuum.md: XID horizon computation that drives SLRU truncation
- postgres-ssi-predicate-locking.md: SSI SerialSlruCtl client
- postgres-buffer-manager.md: contrast with the full heap/index buffer manager
Textbook foundations:
- Database System Concepts, Silberschatz et al., 7e, ch. 13 (buffer management)
- Database Internals, Petrov 2019 (buffer pool eviction policies)
- dbms-papers/aries.md: WAL-before-data invariant (Mohan et al. 1992)