
PostgreSQL SLRU — Simple LRU Buffering for Wrap-Around-Able Metadata

PostgreSQL manages transaction status — whether a given XID committed, aborted, or is still in progress — through a set of small, append-mostly files indexed by TransactionId. The fundamental question each of these files answers is “what happened to XID N?” Access is random by XID, but the access pattern is strongly temporal: the latest page is read and written most heavily, and progressively older pages are read with decreasing frequency as vacuuming ages them out. This shape is classic for a simple paged buffer cache with LRU replacement.

The textbook treatment of buffer management (Database System Concepts, Silberschatz et al., 7e, ch. 13 “Data Storage Structures”) describes the general model: a fixed pool of in-memory frames, a page-table mapping logical pages to frames, a replacement policy (LRU, clock, MRU) that evicts the least-useful frame when the pool is full, and dirty-page tracking to ensure writes reach stable storage before eviction. ARIES (ARIES: A Transaction Recovery Method, Mohan et al., ACM TODS 1992; dbms-papers/aries.md) introduces the WAL-before-data invariant: a dirty page must not be written to stable storage until the WAL records that describe its changes have been flushed first. For ordinary heap and index pages, the main buffer manager in storage/buffer/bufmgr.c enforces this by checking the page’s LSN against the WAL flush point at write time. Transaction-status pages follow the same invariant, but they are small, numerous, and have a monotonically advancing logical address space: none of the richness of the heap buffer manager is needed, and the cost of its infrastructure — the buffer descriptors, the shared buffer hash, the per-buffer content locks, the complex clock-sweep algorithm — would be pure overhead.

SLRU (Simple LRU) is PostgreSQL’s purpose-built answer: a stripped-down paged buffer cache whose design exploits the temporal locality and fixed logical address space of transaction metadata. It is not academically novel — it is careful engineering that eliminates every feature of the general buffer manager that does not pay for itself in this access pattern.

The wrap-around aspect is a second distinct concern. TransactionIds are 32-bit unsigned integers: the counter wraps after roughly 4 billion (2^32) transactions, and comparisons are modular, so any XID has about 2 billion XIDs logically older than it and about 2 billion newer. The status files must therefore treat page numbering and segment numbering as modular arithmetic. SimpleLruTruncate deletes old segments using a PagePrecedes callback that each client supplies; the callback encapsulates the modular comparison so the SLRU substrate itself never needs to understand XID semantics.
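
The modular comparison that a PagePrecedes callback builds on can be sketched as follows. This is a minimal illustration, assuming plain 32-bit XIDs with no reserved values; `xid_precedes` and `page_precedes` are hypothetical names, not PostgreSQL functions, and real client callbacks handle more edge cases.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: true if XID a is logically older than XID b.
 * The unsigned subtraction wraps modulo 2^32; reinterpreting the result
 * as signed puts "up to 2^31 behind" on the negative side.  The
 * unsigned-to-signed conversion is implementation-defined in C but
 * behaves as two's complement on common platforms.
 */
static bool
xid_precedes(uint32_t a, uint32_t b)
{
    int32_t diff = (int32_t) (a - b);

    return diff < 0;
}

/*
 * Hypothetical page-level wrapper: compare two pages by the first XID
 * each one covers.  xids_per_page is supplied by the caller.
 */
static bool
page_precedes(int64_t page_a, int64_t page_b, uint32_t xids_per_page)
{
    assert(xids_per_page > 0);
    return xid_precedes((uint32_t) (page_a * xids_per_page),
                        (uint32_t) (page_b * xids_per_page));
}
```

With 32768 XIDs per page, the last page of the XID space (page 131071) compares as older than page 0, which is the behavior truncation needs once the counter has wrapped.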

Transaction-status storage is a near-universal subsystem in relational engines, though its shape varies with the engine’s concurrency protocol.

Undo-log engines (Oracle, SQL Server, MySQL/InnoDB) record the before-image of every modified row in a separate undo segment. “What was the committed state of this row at snapshot time T?” is answered by chasing the undo chain from the current version backward to the appropriate revision. There is no separate “did XID N commit?” log because the commit record is embedded in the row version chain and in the transaction table within the rollback segment. The equivalent of CLOG is a small in-memory transaction table, evicted to undo tablespace, not a dedicated append-mostly file.

MVCC-with-commit-log engines (PostgreSQL, CUBRID in some modes) record the row’s final committed version in-place (or as a heap version) and keep a side file that maps XID → {committed, aborted, in-progress}. Reading visibility requires consulting this side file. PostgreSQL’s pg_xact/ (formerly pg_clog/) is the canonical example. The design trades undo-chain traversal cost for a single random read of a small, hot, cacheable file.

Temporal locality and the “recent-biased” access pattern

In both designs, transaction-status data exhibits strong temporal locality. Transactions recently committed are the ones most likely to be checked by concurrent readers. Transactions committed thousands of checkpoints ago appear only during vacuum or when a very long-lived snapshot is active. Any cache for transaction-status data should be tuned toward recency. The SLRU’s LRU replacement and its explicit protection of the “latest page” from eviction are the direct expression of this principle.

Rather than a single monolithic file growing without bound, every production engine divides transaction-status storage into fixed-size segments. Segment rotation is a checkpoint-friendly operation (the checkpointer can sync one segment at a time), and truncation removes whole segments, which is a directory-entry delete — cheaper than punching holes in a large file. PostgreSQL’s SLRU_PAGES_PER_SEGMENT = 32 segments are 256 KB each (32 pages × 8 KB/page), holding status data for 1 M transactions (for CLOG’s 2 bits per XID packing).
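
The arithmetic behind those figures can be sketched as below, assuming 8 KB pages and 2 status bits per XID. All names here (`locate_xid`, `set_status`, `get_status`, the constants) are illustrative stand-ins rather than the identifiers used in clog.c.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_BYTES         8192                           /* assumed BLCKSZ */
#define BITS_PER_XID       2
#define XIDS_PER_BYTE      4
#define XIDS_PER_PAGE      (PAGE_BYTES * XIDS_PER_BYTE)   /* 32768 */
#define PAGES_PER_SEGMENT  32
#define STATUS_MASK        0x03

struct xid_location
{
    int64_t  segno;         /* which segment file */
    int64_t  pageno;        /* logical page number */
    uint32_t byte_in_page;  /* byte offset inside the page */
    uint32_t bit_shift;     /* 0, 2, 4 or 6 */
};

/* Map an XID to its segment, page, byte and bit position. */
static struct xid_location
locate_xid(uint32_t xid)
{
    struct xid_location loc;

    loc.pageno = xid / XIDS_PER_PAGE;
    loc.segno = loc.pageno / PAGES_PER_SEGMENT;
    loc.byte_in_page = (xid % XIDS_PER_PAGE) / XIDS_PER_BYTE;
    loc.bit_shift = (xid % XIDS_PER_BYTE) * BITS_PER_XID;
    return loc;
}

/* Store a 2-bit status for xid in an in-memory copy of its page. */
static void
set_status(uint8_t *page, uint32_t xid, uint8_t status)
{
    struct xid_location loc = locate_xid(xid);

    assert(status <= STATUS_MASK);
    page[loc.byte_in_page] &= (uint8_t) ~(STATUS_MASK << loc.bit_shift);
    page[loc.byte_in_page] |= (uint8_t) (status << loc.bit_shift);
}

/* Read the 2-bit status back. */
static uint8_t
get_status(const uint8_t *page, uint32_t xid)
{
    struct xid_location loc = locate_xid(xid);

    return (page[loc.byte_in_page] >> loc.bit_shift) & STATUS_MASK;
}
```

One segment therefore covers `XIDS_PER_PAGE * PAGES_PER_SEGMENT` = 1,048,576 XIDs, matching the 1 M figure above.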

Pure LRU evicts the frame not accessed for the longest time. For append-mostly workloads the latest page will be written by every committing transaction; a pure LRU eviction policy could paradoxically evict it if no other reads have happened in the bank since it was loaded. PostgreSQL’s SLRU introduces a single special case: the page identified by latest_page_number (an atomic uint64) is never selected as a victim, regardless of its LRU count. Every other page follows pure LRU within its bank.

SLRU is a generic substrate that any in-core subsystem may instantiate by calling SimpleLruInit. As of REL_18_STABLE the following clients exist:

| Client | SlruCtl | Directory | Notes |
| --- | --- | --- | --- |
| CLOG | XactCtl | pg_xact/ | 2 bits/XID, WAL-before-data enabled |
| Subtransaction | SubTransCtl | pg_subtrans/ | Parent XID per sub-XID |
| Commit timestamp | CommitTsCtl | pg_commit_ts/ | commit_timestamp optional |
| MultiXact offset | MultiXactOffsetCtl | pg_multixact/offsets/ | Offset into members file |
| MultiXact member | MultiXactMemberCtl | pg_multixact/members/ | XID list + flags |
| NOTIFY | NotifyCtl | pg_notify/ | Async notifications; no fsync |
| Serializable | SerialSlruCtl | pg_serial/ | SSI SIREAD lock summary |

Extensions may also call SimpleLruInit to define their own SLRUs.

The SLRU state is split between a shared part (allocated in the fixed shared-memory segment at postmaster startup) and an unshared control struct (SlruCtlData) that each backend keeps in local memory and uses to reach the shared state.

// SlruSharedData — src/include/access/slru.h
typedef struct SlruSharedData
{
    int              num_slots;           /* total buffer slots */
    char           **page_buffer;         /* BLCKSZ byte arrays, one per slot */
    SlruPageStatus  *page_status;         /* EMPTY / READ_IN_PROGRESS / VALID /
                                           * WRITE_IN_PROGRESS */
    bool            *page_dirty;          /* re-dirtied while write in progress? */
    int64           *page_number;         /* logical page each slot holds */
    int             *page_lru_count;      /* per-slot LRU counter */
    LWLockPadded    *buffer_locks;        /* per-buffer I/O LWLocks */
    LWLockPadded    *bank_locks;          /* per-bank control LWLocks */
    int             *bank_cur_lru_count;  /* per-bank LRU clock */
    XLogRecPtr      *group_lsn;           /* optional WAL flush LSNs (CLOG only) */
    int              lsn_groups_per_page;
    pg_atomic_uint64 latest_page_number;  /* newest page; never evict this one */
    int              slru_stats_idx;      /* index into pgstat SLRU counters */
} SlruSharedData;

// SlruCtlData — src/include/access/slru.h
typedef struct SlruCtlData
{
    SlruShared          shared;
    uint16              nbanks;             /* total banks = num_slots / SLRU_BANK_SIZE */
    bool                long_segment_names; /* 15-hex-char names vs 4-6-hex-char names */
    SyncRequestHandler  sync_handler;       /* SYNC_HANDLER_NONE disables fsync (notify) */
    bool              (*PagePrecedes) (int64, int64); /* modular "older than" for truncation */
    char                Dir[64];            /* PGDATA-relative subdirectory */
} SlruCtlData;

Buffers are organized into banks of SLRU_BANK_SIZE = 16 slots each. Every page is assigned to a bank by pageno % nbanks. Each bank has its own bank_locks[bankno] LWLock and its own bank_cur_lru_count counter. The key properties this delivers are:

  1. Scope-limited LRU search. Victim selection (SlruSelectLRUPage) only walks the 16 slots of the target bank. No global scan; no hashtable lookup.
  2. Partitioned locking. Concurrent accesses to pages in different banks contend on different LWLocks, reducing hot-lock contention at high commit rates.
  3. Per-bank LRU counts avoid global invalidation. Updating bank_cur_lru_count[bankno] on every access touches only the cache line for that bank, not a global counter.

The total buffer count is a multiple of SLRU_BANK_SIZE. GUC parameters (e.g., transaction_buffers, multixact_offsets_buffers) control the total count; check_slru_buffers enforces the multiple-of-16 constraint. SimpleLruAutotuneBuffers computes a default count as shared_buffers / divisor, rounded down to a multiple of 16.
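
The multiple-of-16 rule and the rounded-down default can be sketched as follows. Both function names are hypothetical stand-ins for the check hook and the autotune helper, and clamping the result to at least one bank is an assumption of this sketch rather than something stated above.

```c
#include <assert.h>
#include <stdbool.h>

#define BANK_SIZE 16

/* Hypothetical stand-in for the check hook: accept only whole banks. */
static bool
buffers_value_ok(int nbuffers)
{
    return nbuffers > 0 && nbuffers % BANK_SIZE == 0;
}

/*
 * Hypothetical default: a fraction of shared_buffers, rounded down to a
 * whole number of banks.  The floor of one bank is an assumption made so
 * the result is always usable.
 */
static int
autotune_buffers(int shared_buffers, int divisor)
{
    int n;

    assert(divisor > 0);
    n = shared_buffers / divisor;
    n -= n % BANK_SIZE;             /* round down to a multiple of 16 */
    return n < BANK_SIZE ? BANK_SIZE : n;
}
```

For example, 16384 shared buffers with a divisor of 300 gives 54, which rounds down to 48, that is, three banks.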

The shared-memory layout is a single ShmemInitStruct allocation whose size is precomputed by SimpleLruShmemSize. After the SlruSharedData header come the parallel per-slot arrays (one entry per buffer slot, indexed by slotno), then the per-bank arrays (one entry per bank, indexed by bankno = pageno % nbanks), then the optional group_lsn array, and finally the BLCKSZ * nslots block of page buffers themselves. The slot index and bank index are tied together by SlotGetBankNumber(slotno) = slotno >> SLRU_BANK_BITSHIFT — i.e., the first 16 slots belong to bank 0, the next 16 to bank 1, and so on.

flowchart TB
    subgraph SHM["ShmemInitStruct(name, SimpleLruShmemSize(nslots, nlsns))"]
        HDR["SlruSharedData header<br/>num_slots, latest_page_number (atomic),<br/>lsn_groups_per_page, slru_stats_idx"]
        subgraph PERSLOT["Per-slot arrays — indexed by slotno (length nslots)"]
            PS["page_status[]<br/>EMPTY / READ_IP / VALID / WRITE_IP"]
            PD["page_dirty[]"]
            PN["page_number[] (int64)"]
            PL["page_lru_count[] (int)"]
            BL["buffer_locks[] (LWLockPadded)<br/>one I/O lock per slot"]
        end
        subgraph PERBANK["Per-bank arrays — indexed by bankno (length nbanks)"]
            BKL["bank_locks[] (LWLockPadded)"]
            BC["bank_cur_lru_count[] (int)"]
        end
        GL["group_lsn[] (XLogRecPtr)<br/>nslots * lsn_groups_per_page<br/>NULL unless nlsns > 0 — CLOG only"]
        subgraph BUFS["page_buffer[] -> BLCKSZ * nslots bytes"]
            PB0["slot 0: 8192-byte page"]
            PB1["slot 1: 8192-byte page"]
            PBN["... slot nslots-1"]
        end
    end

    SEG["pg_xact / pg_subtrans / ... segment files<br/>SLRU_PAGES_PER_SEGMENT = 32 pages each"]

    PN -->|"pageno % nbanks"| BKL
    BL -.->|"SlotGetBankNumber = slotno >> 4"| BKL
    PB0 <-->|"SlruPhysicalReadPage / SlruPhysicalWritePage<br/>pread() / pwrite() at (pageno % 32) * BLCKSZ"| SEG

The contiguous-array sizing is explicit in SimpleLruShmemSize; each array is MAXALIGN-padded so the next array starts on a safe boundary, and the page-buffer block is appended after a BUFFERALIGN:

// SimpleLruShmemSize — src/backend/access/transam/slru.c
int     nbanks = nslots / SLRU_BANK_SIZE;

sz = MAXALIGN(sizeof(SlruSharedData));
sz += MAXALIGN(nslots * sizeof(char *));            /* page_buffer[] */
sz += MAXALIGN(nslots * sizeof(SlruPageStatus));    /* page_status[] */
sz += MAXALIGN(nslots * sizeof(bool));              /* page_dirty[] */
sz += MAXALIGN(nslots * sizeof(int64));             /* page_number[] */
sz += MAXALIGN(nslots * sizeof(int));               /* page_lru_count[] */
sz += MAXALIGN(nslots * sizeof(LWLockPadded));      /* buffer_locks[] */
sz += MAXALIGN(nbanks * sizeof(LWLockPadded));      /* bank_locks[] */
sz += MAXALIGN(nbanks * sizeof(int));               /* bank_cur_lru_count[] */
if (nlsns > 0)
    sz += MAXALIGN(nslots * nlsns * sizeof(XLogRecPtr));    /* group_lsn[] */
return BUFFERALIGN(sz) + BLCKSZ * nslots;           /* page contents last */

SimpleLruInit then carves this one allocation into the array pointers in exactly the same order (the carving code lives in the if (!IsUnderPostmaster) branch, where it advances a running ptr through the region). The fact that the size calculation and the pointer assignment walk the arrays in lockstep is the layout’s only correctness contract — there is no per-array allocation, so a mismatch would silently overlap two arrays.
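
The lockstep contract can be illustrated with a much smaller structure. This is a sketch under stated assumptions: `struct mini_slru`, `mini_slru_size`, `mini_slru_carve`, and `ALIGN8` are hypothetical, only two per-slot arrays are modeled, and the buffer-alignment step before the page block is omitted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define ALIGN8(n)   (((size_t) (n) + 7) & ~(size_t) 7)  /* stand-in for MAXALIGN */
#define PAGE_BYTES  8192                                 /* assumed BLCKSZ */

struct mini_slru
{
    int      nslots;
    char    *page_status;   /* one byte per slot */
    int64_t *page_number;   /* one entry per slot */
    char    *pages;         /* nslots * PAGE_BYTES */
};

/* Size pass: header first, then each array in a fixed order. */
static size_t
mini_slru_size(int nslots)
{
    size_t sz = ALIGN8(sizeof(struct mini_slru));

    sz += ALIGN8((size_t) nslots * sizeof(char));
    sz += ALIGN8((size_t) nslots * sizeof(int64_t));
    return sz + (size_t) nslots * PAGE_BYTES;
}

/* Carve pass: advances a running pointer through the same order. */
static struct mini_slru *
mini_slru_carve(void *region, int nslots)
{
    char             *ptr = region;
    struct mini_slru *s = region;

    s->nslots = nslots;
    ptr += ALIGN8(sizeof(struct mini_slru));
    s->page_status = ptr;
    ptr += ALIGN8((size_t) nslots * sizeof(char));
    s->page_number = (int64_t *) ptr;
    ptr += ALIGN8((size_t) nslots * sizeof(int64_t));
    s->pages = ptr;
    ptr += (size_t) nslots * PAGE_BYTES;

    /* The contract: the carve ends exactly where the size pass said. */
    assert((size_t) (ptr - (char *) region) == mini_slru_size(nslots));
    return s;
}
```

A caller would allocate `mini_slru_size(nslots)` bytes (here with `malloc`, in PostgreSQL from shared memory) and pass the region to `mini_slru_carve`. If one pass gained an array that the other lacked, the final assertion would fail instead of two arrays silently overlapping.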

The bank geometry itself is a pure bit-shift, chosen so victim search touches a bounded, cache-friendly window:

// bank geometry — src/backend/access/transam/slru.c
#define SLRU_BANK_BITSHIFT 4
#define SLRU_BANK_SIZE (1 << SLRU_BANK_BITSHIFT) /* 16 */
#define SlotGetBankNumber(slotno) ((slotno) >> SLRU_BANK_BITSHIFT)
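
The two mappings, slot to bank and page to slot range, can be exercised in isolation. The sketch below mirrors the macros above with hypothetical names (`slot_bank`, `page_slot_range`) and assumes the slot count is a whole number of banks.

```c
#include <assert.h>
#include <stdint.h>

#define BANK_BITSHIFT 4
#define BANK_SIZE     (1 << BANK_BITSHIFT)   /* 16 */

/* Bank that owns a slot: slots 0..15 -> bank 0, 16..31 -> bank 1, ... */
static int
slot_bank(int slotno)
{
    return slotno >> BANK_BITSHIFT;
}

/*
 * Slot range a lookup for pageno has to scan: the page's bank is
 * pageno % nbanks, and that bank owns BANK_SIZE consecutive slots.
 * *end is exclusive.
 */
static void
page_slot_range(int64_t pageno, int nslots, int *first, int *end)
{
    int nbanks;
    int bankno;

    assert(pageno >= 0 && nslots >= BANK_SIZE && nslots % BANK_SIZE == 0);
    nbanks = nslots / BANK_SIZE;
    bankno = (int) (pageno % nbanks);
    *first = bankno * BANK_SIZE;
    *end = *first + BANK_SIZE;
}
```

With 64 slots (4 banks), page 10 maps to bank 2, so only slots 32 through 47 are scanned, and `slot_bank` maps each of those slots back to bank 2.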

        SimpleLruZeroPage / SimpleLruReadPage
                         |
EMPTY ───► READ_IN_PROGRESS ───► VALID
                                 │  ▲
                         (dirty) │  │ re-dirty during write
                                 ▼  │
                     WRITE_IN_PROGRESS ───► VALID (clean)

Each slot is in one of four states:

  • SLRU_PAGE_EMPTY: the slot is unused.
  • SLRU_PAGE_READ_IN_PROGRESS: a backend is reading the page from disk; it holds buffer_locks[slotno] exclusively and has released the bank lock.
  • SLRU_PAGE_VALID: the page is in memory and usable; it may be dirty.
  • SLRU_PAGE_WRITE_IN_PROGRESS: a backend is writing the page to disk; it holds buffer_locks[slotno] exclusively. If another backend dirties the page during the write, page_dirty is set to true again so a second write will follow.

The page_dirty flag is separate from page_status. A page can be WRITE_IN_PROGRESS and dirty simultaneously, signalling that the current write will not be sufficient and a re-write is needed.
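
The interplay between page_status and page_dirty during a write can be sketched as a tiny single-threaded state machine. The names are hypothetical and all locking is omitted; the assumption that the writer clears the dirty flag before starting I/O is what lets a concurrent modification show up as a re-dirty.

```c
#include <assert.h>
#include <stdbool.h>

enum page_status
{
    PAGE_EMPTY,
    PAGE_READ_IN_PROGRESS,
    PAGE_VALID,
    PAGE_WRITE_IN_PROGRESS
};

struct slot
{
    enum page_status status;
    bool             dirty;
};

/* Writer, step 1: claim the write and clear the flag before the I/O. */
static void
begin_write(struct slot *s)
{
    assert(s->status == PAGE_VALID && s->dirty);
    s->status = PAGE_WRITE_IN_PROGRESS;
    s->dirty = false;
}

/* Any backend changing the page, possibly while a write is in flight. */
static void
modify_page(struct slot *s)
{
    assert(s->status == PAGE_VALID || s->status == PAGE_WRITE_IN_PROGRESS);
    s->dirty = true;
}

/* Writer, step 2: a failed write re-dirties; a concurrent modify survives. */
static void
finish_write(struct slot *s, bool io_ok)
{
    assert(s->status == PAGE_WRITE_IN_PROGRESS);
    if (!io_ok)
        s->dirty = true;
    s->status = PAGE_VALID;
}
```

If `modify_page` runs between `begin_write` and `finish_write`, the slot ends up VALID and dirty, so a later flush writes it again.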

The design uses two tiers of LWLocks per SLRU instance:

  1. Bank control locks (bank_locks[bankno]) — protect all metadata fields of the bank’s slots: page_status, page_dirty, page_number, page_lru_count, bank_cur_lru_count. Must be held (exclusive in most paths, shared in SimpleLruReadPage_ReadOnly) to examine or modify any of these fields.

  2. Per-buffer I/O locks (buffer_locks[slotno]) — synchronize in-flight I/O. Before starting a read or write the backend acquires the buffer lock exclusively, then releases the bank lock. Other backends waiting for the same I/O acquire the buffer lock shared, block, and re-acquire the bank lock after the I/O finishes.

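The key invariant: no backend performs disk I/O, or blocks waiting on a buffer lock, while holding a bank lock. The backend that performs the I/O does hold both locks briefly at the two hand-off points: it takes the buffer lock before dropping the bank lock, which is safe because the slot was not marked I/O-busy and so nobody is holding that buffer lock across an I/O, and it re-takes the bank lock before dropping the buffer lock. A waiter, by contrast, always releases the bank lock before it blocks on the buffer lock, so the I/O performer can always get the bank lock back and finish. SimpleLruWaitIO is the canonical expression of this ordering; it is how a backend blocks on someone else’s in-flight I/O: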

// SimpleLruWaitIO — src/backend/access/transam/slru.c
int     bankno = SlotGetBankNumber(slotno);

/* See notes at top of file */
LWLockRelease(&shared->bank_locks[bankno].lock);
LWLockAcquire(&shared->buffer_locks[slotno].lock, LW_SHARED);
LWLockRelease(&shared->buffer_locks[slotno].lock);
LWLockAcquire(&shared->bank_locks[bankno].lock, LW_EXCLUSIVE);

/*
 * If the slot is still io-in-progress, the I/O must have failed;
 * recover the page_status so the slot can be reused.
 */
if (shared->page_status[slotno] == SLRU_PAGE_READ_IN_PROGRESS ||
    shared->page_status[slotno] == SLRU_PAGE_WRITE_IN_PROGRESS)
{
    if (LWLockConditionalAcquire(&shared->buffer_locks[slotno].lock, LW_SHARED))
    {
        if (shared->page_status[slotno] == SLRU_PAGE_READ_IN_PROGRESS)
            shared->page_status[slotno] = SLRU_PAGE_EMPTY;
        else                    /* write_in_progress */
        {
            shared->page_status[slotno] = SLRU_PAGE_VALID;
            shared->page_dirty[slotno] = true;
        }
        LWLockRelease(&shared->buffer_locks[slotno].lock);
    }
}

Acquiring the buffer lock LW_SHARED here is a pure rendezvous: the backend doing the I/O holds it LW_EXCLUSIVE, so the waiter blocks until the I/O finishes and releases it, then immediately drops it. The conditional-acquire afterward is a self-healing check — if the buffer lock is grantable but the slot is still flagged in-progress, the original I/O must have errored out (a transaction abort released its lock without resetting the status), so this backend repairs the slot state.

latest_page_number is the single exception: it is read and written with atomic operations (pg_atomic_read_u64 / pg_atomic_write_u64) rather than under a lock, because checking whether a slot holds the latest page during LRU victim selection is an advisory hint, not a correctness-critical test.

// SimpleLruReadPage — src/backend/access/transam/slru.c
// Bank lock held exclusive on entry and exit.
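int
SimpleLruReadPage(SlruCtl ctl, int64 pageno, bool write_ok,
                  TransactionId xid)
{
    SlruShared  shared = ctl->shared;
    LWLock     *banklock = SimpleLruGetBankLock(ctl, pageno);

    for (;;)
    {
        int     slotno = SlruSelectLRUPage(ctl, pageno);    /* find or pick victim */
        bool    ok;

        /* Hit: the slot already holds this page. */
        if (shared->page_status[slotno] != SLRU_PAGE_EMPTY &&
            shared->page_number[slotno] == pageno)
        {
            /* Busy with I/O we cannot tolerate: wait, then re-derive the slot. */
            if (shared->page_status[slotno] == SLRU_PAGE_READ_IN_PROGRESS ||
                (shared->page_status[slotno] == SLRU_PAGE_WRITE_IN_PROGRESS &&
                 !write_ok))
            {
                SimpleLruWaitIO(ctl, slotno);
                continue;
            }
            SlruRecentlyUsed(shared, slotno);
            pgstat_count_slru_page_hit(...);
            return slotno;
        }

        /* Miss: mark slot read-busy, acquire buffer lock, release bank lock. */
        shared->page_number[slotno] = pageno;
        shared->page_status[slotno] = SLRU_PAGE_READ_IN_PROGRESS;
        shared->page_dirty[slotno] = false;
        LWLockAcquire(&shared->buffer_locks[slotno].lock, LW_EXCLUSIVE);
        LWLockRelease(banklock);

        ok = SlruPhysicalReadPage(ctl, pageno, slotno);     /* pread() */
        SimpleLruZeroLSNs(ctl, slotno);

        /* Re-acquire bank lock, update status, release buffer lock. */
        LWLockAcquire(banklock, LW_EXCLUSIVE);
        shared->page_status[slotno] = ok ? SLRU_PAGE_VALID : SLRU_PAGE_EMPTY;
        LWLockRelease(&shared->buffer_locks[slotno].lock);

        if (!ok)
            SlruReportIOError(ctl, pageno, xid);            /* ereport(ERROR) */

        SlruRecentlyUsed(shared, slotno);
        pgstat_count_slru_page_read(...);
        return slotno;
    }
}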

SimpleLruReadPage_ReadOnly first scans the bank under a shared bank lock, looking for the page without taking the exclusive lock. On a hit, SlruRecentlyUsed updates the LRU count without a barrier (safe because int reads/writes are atomic on all supported platforms and a slightly stale LRU count only degrades, not breaks, victim selection). On a miss, it releases the shared lock, re-acquires the bank lock in exclusive mode, and falls through to SimpleLruReadPage.

The full SimpleLruReadPage miss path — the case where the page is not resident and must be faulted in — is the most lock-intricate flow in the file. The outer for (;;) loop exists precisely so that a backend which discovers the slot is mid-I/O (its own retry, or another backend’s read) can wait and re-derive the slot from scratch, because after dropping the bank lock the slot contents may have changed entirely:

flowchart TB
    START["SimpleLruReadPage(ctl, pageno, write_ok, xid)<br/>bank lock held EXCLUSIVE on entry"] --> SEL["slotno = SlruSelectLRUPage(ctl, pageno)<br/>scan 16 slots of bank pageno % nbanks"]
    SEL --> RES{"slot holds pageno<br/>and != EMPTY?"}
    RES -->|"yes — hit"| BUSY{"READ_IN_PROGRESS,<br/>or WRITE_IN_PROGRESS<br/>and not write_ok?"}
    BUSY -->|"yes"| WAIT["SimpleLruWaitIO(ctl, slotno)<br/>drop bank lock, take buffer lock SHARED,<br/>release it, re-take bank lock"]
    WAIT --> SEL
    BUSY -->|"no"| HIT["SlruRecentlyUsed<br/>pgstat_count_slru_page_hit<br/>return slotno"]

    RES -->|"no — miss, victim chosen"| MARK["page_number[slotno] = pageno<br/>page_status[slotno] = READ_IN_PROGRESS<br/>page_dirty[slotno] = false"]
    MARK --> ACQ["LWLockAcquire buffer_locks[slotno] EXCLUSIVE<br/>then LWLockRelease(banklock): bank lock not held during I/O"]
    ACQ --> IO["SlruPhysicalReadPage(ctl, pageno, slotno)<br/>pread(); missing file -> zeroes in recovery"]
    IO --> ZLSN["SimpleLruZeroLSNs(ctl, slotno)"]
    ZLSN --> REACQ["LWLockAcquire(banklock) EXCLUSIVE<br/>page_status = ok ? VALID : EMPTY<br/>LWLockRelease(buffer_locks[slotno])"]
    REACQ --> ERR{"read ok?"}
    ERR -->|"no"| RPT["SlruReportIOError(ctl, pageno, xid)<br/>ereport(ERROR)"]
    ERR -->|"yes"| DONE["SlruRecentlyUsed<br/>pgstat_count_slru_page_read<br/>return slotno"]

Note that SlruSelectLRUPage is called even on the retry after a wait: the page someone else read in may now be resident, turning the next iteration into a hit. If the chosen victim was dirty, SlruSelectLRUPage itself flushes it via SlruInternalWritePage before returning, so by the time SimpleLruReadPage overwrites the slot it is guaranteed clean (asserted at the “we found no match” point).

// SlruSelectLRUPage — src/backend/access/transam/slru.c
// Bank lock held on entry and exit. May write a dirty victim to disk.
static int
SlruSelectLRUPage(SlruCtl ctl, int64 pageno)
{
    /* 1. Return any slot already holding the target page. */
    /* 2. Return an EMPTY slot if one exists. */
    /* 3. Among VALID slots: find the one with the largest
     *    (bank_cur_lru_count - page_lru_count), i.e., least recently
     *    used. Skip the latest_page_number slot. */
    /* 4. If no VALID slot, wait for the LRU I/O-busy slot and retry. */
    /* 5. If the chosen victim is dirty, call SlruInternalWritePage
     *    to flush it, then loop back. */
}

The “largest delta” computation uses integer subtraction with implicit wraparound, which handles the counter wrapping correctly as long as no page’s age exceeds INT_MAX counts. Because concurrent SlruRecentlyUsed calls can leave several pages with the same count, SlruSelectLRUPage advances bank_cur_lru_count before the scan and uses the old value as its reference point. The counter is then ahead of every page_lru_count in the bank, so the next SlruRecentlyUsed call on any page, including the slot just chosen, always stamps it as freshly used.

The “mark recently used” side is SlruRecentlyUsed, deliberately small enough to be safe under a shared bank lock (the read-only fast path runs it concurrently). The if guard is the load-bearing detail: it suppresses the increment when the page is already the most-recent in its bank, which both saves a write and — more importantly — slows the bank counter’s advance so old pages’ counts are less likely to “wrap around” and masquerade as recently used:

// SlruRecentlyUsed — src/backend/access/transam/slru.c
int     bankno = SlotGetBankNumber(slotno);
int     new_lru_count = shared->bank_cur_lru_count[bankno];

/*
 * Suppress useless increments; allows concurrent callers under
 * shared bank lock, since int reads/writes are atomic.
 */
if (new_lru_count != shared->page_lru_count[slotno])
{
    shared->bank_cur_lru_count[bankno] = ++new_lru_count;
    shared->page_lru_count[slotno] = new_lru_count;
}

The concurrency tolerance here is what makes SimpleLruReadPage_ReadOnly viable: multiple readers may race on these two int stores, and the worst case is a counter “reset” to a lower value, which SlruSelectLRUPage defensively corrects (the this_delta < 0 branch resets the page count to cur_count). No correctness depends on the LRU counts being exact — they only steer eviction toward a reasonable victim, and the latest_page_number carve-out protects the one page that must never be evicted regardless of counter noise.

CLOG is the only SLRU client that requires WAL flushing before writes. Each committing transaction calls TransactionIdSetPageStatus (in clog.c) which updates a 2-bit field and records the commit’s WAL LSN into group_lsn[slotno * lsn_groups_per_page + group_offset]. When SlruPhysicalWritePage is called for a CLOG slot, it scans the group_lsn array for the slot, finds the maximum LSN, and calls XLogFlush(max_lsn) before issuing the pwrite().

// SlruPhysicalWritePage (WAL path) — src/backend/access/transam/slru.c
if (shared->group_lsn != NULL)
{
    XLogRecPtr  max_lsn = 0;

    lsnindex = slotno * shared->lsn_groups_per_page;
    for (int lsnoff = 0; lsnoff < shared->lsn_groups_per_page; lsnoff++)
    {
        XLogRecPtr  this_lsn = shared->group_lsn[lsnindex++];

        if (max_lsn < this_lsn)
            max_lsn = this_lsn;
    }
    if (!XLogRecPtrIsInvalid(max_lsn))
    {
        START_CRIT_SECTION();
        XLogFlush(max_lsn);
        END_CRIT_SECTION();
    }
}

Other SLRU clients pass nlsns = 0 to SimpleLruInit, which leaves group_lsn NULL, and the WAL flush is skipped entirely.
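
The WAL-before-data decision reduces to a maximum over one slot's LSN groups. The sketch below models it without any WAL machinery: `lsn_t`, `max_group_lsn`, and `wal_flush_target` are hypothetical names, and an LSN of 0 is assumed to mean "none recorded".

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t lsn_t;     /* stand-in for XLogRecPtr; 0 means "no LSN" */

/* Highest LSN recorded for any group on the given slot's page. */
static lsn_t
max_group_lsn(const lsn_t *group_lsn, int groups_per_page, int slotno)
{
    const lsn_t *p;
    lsn_t        max_lsn = 0;

    assert(group_lsn != NULL && groups_per_page > 0 && slotno >= 0);
    p = group_lsn + (size_t) slotno * (size_t) groups_per_page;
    for (int i = 0; i < groups_per_page; i++)
    {
        if (p[i] > max_lsn)
            max_lsn = p[i];
    }
    return max_lsn;
}

/*
 * WAL position that must be durable before the page may be written,
 * given how far WAL has already been flushed.
 */
static lsn_t
wal_flush_target(const lsn_t *group_lsn, int groups_per_page, int slotno,
                 lsn_t flushed_upto)
{
    lsn_t need;

    if (group_lsn == NULL)
        return flushed_upto;    /* client tracks no LSNs: nothing to wait for */
    need = max_group_lsn(group_lsn, groups_per_page, slotno);
    return need > flushed_upto ? need : flushed_upto;
}
```

A client that tracks no LSNs is modeled by passing `NULL`, in which case the target is simply the current flush point and the write proceeds immediately.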

On-disk, each SLRU maps logical page numbers to segment files. A segment holds SLRU_PAGES_PER_SEGMENT = 32 pages. The segment number is pageno / 32. The file name is the segment number in hex:

  • Short names (default): 4–6 hex digits (0000 to FFFFFF), used by CLOG, subtrans, commit_ts. Supports up to 2²⁴ segments.
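  • Long names (long_segment_names = true): 15 hex digits, supporting up to 2⁶⁰ segments, for clients that index pages with the full 64-bit page number so the file-name space never wraps. In this tree that is the NOTIFY queue (pg_notify/); the multixact SLRUs still use short names, and MultiXactIds themselves are 32-bit.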

SlruFileName encodes this choice:

// SlruFileName — src/backend/access/transam/slru.c
static inline int
SlruFileName(SlruCtl ctl, char *path, int64 segno)
{
    if (ctl->long_segment_names)
        return snprintf(path, MAXPGPATH, "%s/%015" PRIX64, ctl->Dir, segno);
    else
        return snprintf(path, MAXPGPATH, "%s/%04X", ctl->Dir,
                        (unsigned int) segno);
}
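
The page-to-file mapping can be tried outside the server with a small standalone helper. This is a sketch: `segment_path_and_offset` is a hypothetical function, the directory names passed to it are arbitrary strings, and 8 KB pages are assumed.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define PAGES_PER_SEGMENT 32
#define PAGE_BYTES        8192      /* assumed BLCKSZ */

/*
 * Hypothetical helper: writes "<dir>/<segment name>" into path and
 * returns the byte offset of pageno inside that segment file.
 */
static int64_t
segment_path_and_offset(const char *dir, int64_t pageno, bool long_names,
                        char *path, size_t pathlen)
{
    int64_t segno;
    int64_t page_in_segment;

    assert(pageno >= 0 && dir != NULL && path != NULL);
    segno = pageno / PAGES_PER_SEGMENT;
    page_in_segment = pageno % PAGES_PER_SEGMENT;

    if (long_names)
        snprintf(path, pathlen, "%s/%015" PRIX64, dir, (uint64_t) segno);
    else
        snprintf(path, pathlen, "%s/%04X", dir, (unsigned int) segno);

    return page_in_segment * PAGE_BYTES;
}
```

Page 33 lands in segment 1 at offset 8192, so the short form yields `<dir>/0001` and the long form yields `<dir>/000000000000001`. A segment number above 0xFFFF simply produces more digits in the short form, which is where the "4–6 hex digits" range comes from.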

SimpleLruTruncate(ctl, cutoffPage) removes all on-disk segments that contain only pages strictly older than cutoffPage. “Older” is defined by ctl->PagePrecedes(a, b), a client-supplied callback that must use modular arithmetic to handle XID wraparound.

The safety invariant: latest_page_number must not be older than cutoffPage. If it is, the function logs a warning and returns without deleting anything — this is a backstop against bugs in the caller.

The in-memory side first evicts all buffered pages older than the cutoff, writing out or waiting for any that are dirty or mid-I/O, then calls SlruScanDirectory with the SlruScanDirCbDeleteCutoff callback, which deletes segments where both the first and last page satisfy PagePrecedes(page, cutoffPage). SlruMayDeleteSegment’s four-case analysis handles the wraparound edge case where a segment straddles the cutoff or the wrap point.

SimpleLruWriteAll flushes every dirty buffer to disk. It walks all slots in bank order, acquiring and releasing bank locks as the bank boundary changes. It batches file descriptors in SlruWriteAllData (up to MAX_WRITEALL_BUFFERS = 16 open FDs) to amortize open()/close() cost across pages of the same segment. After the slot scan it syncs the directory (fsync_fname(ctl->Dir, true)) to ensure new segment-file directory entries are durable.

Checkpoint accounting: each SLRU page written during checkpoint increments CheckpointStats.ckpt_slru_written and PendingCheckpointerStats.slru_written.

Each SLRU instance registers with pgstat_get_slru_index(name) at init time. The per-SLRU counters tracked are:

  • pgstat_count_slru_page_zeroed: new page created.
  • pgstat_count_slru_page_hit: found in buffer.
  • pgstat_count_slru_page_read: loaded from disk.
  • pgstat_count_slru_page_written: written to disk.
  • pgstat_count_slru_page_exists: SimpleLruDoesPhysicalPageExist called.
  • pgstat_count_slru_flush: SimpleLruWriteAll called.
  • pgstat_count_slru_truncate: SimpleLruTruncate called.

These counters are exposed in pg_stat_slru.

| Symbol | Role |
| --- | --- |
| SimpleLruShmemSize | Computes shared-memory bytes needed for nslots buffers + nbanks locks + optional group_lsn array |
| SimpleLruAutotuneBuffers | GUC default: shared_buffers / divisor, rounded to SLRU_BANK_SIZE multiple |
| SimpleLruInit | Allocates or attaches to named shmem region; initialises slot/bank arrays and LWLocks; sets ctl->Dir, nbanks, PagePrecedes |
| check_slru_buffers | GUC check hook; rejects values not divisible by SLRU_BANK_SIZE |

| Symbol | Role |
| --- | --- |
| SimpleLruZeroPage | Allocates a new zeroed page into a slot; sets it as latest_page_number |
| SimpleLruReadPage | Finds or loads a page; returns slot number; caller retains bank lock |
| SimpleLruReadPage_ReadOnly | Optimistic shared-lock scan before falling back to SimpleLruReadPage |
| SimpleLruWritePage | External wrapper for SlruInternalWritePage with fdata = NULL |
| SimpleLruWriteAll | Checkpoint flush: iterates all slots, writes dirty ones, batches FDs |
| SimpleLruDoesPhysicalPageExist | Checks on-disk existence without loading into a buffer |

| Symbol | Role |
| --- | --- |
| SlruInternalWritePage | Core write: marks slot WRITE_IN_PROGRESS, drops bank lock, calls SlruPhysicalWritePage, re-acquires bank lock |
| SlruPhysicalReadPage | Issues pread(); treats missing file as zeroes during recovery |
| SlruPhysicalWritePage | Enforces WAL-before-data via group_lsn; issues pwrite(); queues sync request |
| SlruRecentlyUsed | Stamps page_lru_count[slotno] = ++bank_cur_lru_count[bankno]; safe under shared bank lock |
| SlruSelectLRUPage | Victim search within the bank; evicts dirty victims inline |
| SimpleLruWaitIO | Releases bank lock, acquires buffer lock shared, releases buffer lock, re-acquires bank lock |
| SlruReportIOError | Translates slru_errcause / slru_errno into an ereport(ERROR) after state is cleaned up |
| SimpleLruZeroLSNs | Zeroes the group_lsn array entries for a slot |

| Symbol | Role |
| --- | --- |
| SlruFileName | Formats segment path; short (4–6 hex) or long (15 hex) based on long_segment_names |
| SimpleLruTruncate | Evicts old in-memory pages, then calls SlruScanDirectory(SlruScanDirCbDeleteCutoff) |
| SlruMayDeleteSegment | Four-case modular check: both first and last page of segment must precede cutoff |
| SlruInternalDeleteSegment | Forgets sync requests, calls unlink() |
| SlruDeleteSegment | Public entry: evicts the segment from buffers first, then calls SlruInternalDeleteSegment |
| SlruScanDirectory | Reads directory entries, filters by name length + hex pattern, invokes callback per segment |
| SlruScanDirCbReportPresence | Scan callback: returns true if any segment precedes cutoff |
| SlruScanDirCbDeleteCutoff | Scan callback: deletes segments that satisfy SlruMayDeleteSegment |
| SlruScanDirCbDeleteAll | Scan callback: deletes all segments unconditionally |
| SlruSyncFileTag | Sync handler: opens the segment file and calls pg_fsync() |

| Symbol | Role |
| --- | --- |
| SlruPageStatus | Enum: SLRU_PAGE_EMPTY, SLRU_PAGE_READ_IN_PROGRESS, SLRU_PAGE_VALID, SLRU_PAGE_WRITE_IN_PROGRESS |
| SlruSharedData / SlruShared | Shared-memory state: arrays for page_buffer, page_status, page_dirty, page_number, page_lru_count, buffer_locks, bank_locks, bank_cur_lru_count, group_lsn |
| SlruCtlData / SlruCtl | Per-backend unshared control: pointer to shared, nbanks, long_segment_names, sync_handler, PagePrecedes, Dir |
| SimpleLruGetBankLock | Inline: computes pageno % nbanks and returns &bank_locks[bankno].lock |

Position hints (commit 273fe94, REL_18_STABLE, 2026-06-05)

| Symbol | File | Line |
| --- | --- | --- |
| SlruSharedData | src/include/access/slru.h | 61 |
| SlruCtlData | src/include/access/slru.h | 127 |
| SlruPageStatus | src/include/access/slru.h | 47 |
| SimpleLruGetBankLock | src/include/access/slru.h | 175 |
| SLRU_BANK_BITSHIFT | src/backend/access/transam/slru.c | 142 |
| SLRU_BANK_SIZE | src/backend/access/transam/slru.c | 143 |
| SlotGetBankNumber | src/backend/access/transam/slru.c | 148 |
| MAX_WRITEALL_BUFFERS | src/backend/access/transam/slru.c | 123 |
| SLRU_PAGES_PER_SEGMENT | src/include/access/slru.h | 39 |
| SimpleLruShmemSize | src/backend/access/transam/slru.c | 198 |
| SimpleLruAutotuneBuffers | src/backend/access/transam/slru.c | 231 |
| SimpleLruInit | src/backend/access/transam/slru.c | 252 |
| SimpleLruZeroPage | src/backend/access/transam/slru.c | 375 |
| SimpleLruWaitIO | src/backend/access/transam/slru.c | 445 |
| SimpleLruReadPage | src/backend/access/transam/slru.c | 502 |
| SimpleLruReadPage_ReadOnly | src/backend/access/transam/slru.c | 605 |
| SlruInternalWritePage | src/backend/access/transam/slru.c | 652 |
| SimpleLruWritePage | src/backend/access/transam/slru.c | 732 |
| SlruPhysicalReadPage | src/backend/access/transam/slru.c | 804 |
| SlruPhysicalWritePage | src/backend/access/transam/slru.c | 876 |
| SlruReportIOError | src/backend/access/transam/slru.c | 1048 |
| SlruRecentlyUsed | src/backend/access/transam/slru.c | 1123 |
| SlruSelectLRUPage | src/backend/access/transam/slru.c | 1169 |
| SimpleLruWriteAll | src/backend/access/transam/slru.c | 1322 |
| SimpleLruTruncate | src/backend/access/transam/slru.c | 1408 |
| SlruInternalDeleteSegment | src/backend/access/transam/slru.c | 1503 |
| SlruDeleteSegment | src/backend/access/transam/slru.c | 1526 |
| SlruMayDeleteSegment | src/backend/access/transam/slru.c | 1603 |
| SlruScanDirectory | src/backend/access/transam/slru.c | 1791 |
| SlruSyncFileTag | src/backend/access/transam/slru.c | 1831 |

Verified against commit 273fe94 on branch REL_18_STABLE.

Confirmed:

  • SlruSharedData layout matches the SimpleLruShmemSize computation exactly: the field offsets calculated in SimpleLruInit (lines 283–309) advance through page_buffer, page_status, page_dirty, page_number, page_lru_count, buffer_locks, bank_locks, bank_cur_lru_count, and optionally group_lsn in that order.
  • SLRU_BANK_SIZE = 16, SLRU_BANK_BITSHIFT = 4, confirmed at lines 142–143. SlotGetBankNumber(slotno) is (slotno) >> 4.
  • SLRU_PAGES_PER_SEGMENT = 32 (slru.h:39). One segment is 32 × 8192 = 256 KB. CLOG: 32 pages × (8192 × 4 XIDs/byte) = 1,048,576 XIDs/segment.
  • latest_page_number is pg_atomic_uint64 (slru.h:115), written atomically in SimpleLruZeroPage and read without barrier in SlruSelectLRUPage — the comment at slru.c:1251 explains why this is correct.
  • group_lsn is non-NULL only for CLOG (nlsns = CLOG_LSNS_PER_PAGE at clog.c:811). All other clients pass nlsns = 0.
  • NOTIFY client passes sync_handler = SYNC_HANDLER_NONE, disabling fsync (confirmed at commands/async.c:537–541); the SlruPhysicalWritePage path skips RegisterSyncRequest when sync_handler == SYNC_HANDLER_NONE.
  • MAX_WRITEALL_BUFFERS = 16 at slru.c:123. The batching fallback (set fdata = NULL) at slru.c:985 handles the unlikely case of > 16 open files.
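
The bank and segment arithmetic confirmed above can be reproduced in a few lines of standalone C. This is a minimal sketch, not code from slru.c: the helper names (slot_bank_number, clog_xid_to_page, slru_page_to_segment, clog_xids_per_segment) and the CLOG_XACTS_PER_BYTE / CLOG_XACTS_PER_PAGE macros are introduced here for illustration, and the block size of 8192 is the default assumed throughout this document.

```c
#include <assert.h>
#include <stdint.h>

/* Values quoted in the list above; BLCKSZ is the assumed default block size. */
#define BLCKSZ                  8192
#define SLRU_BANK_BITSHIFT      4
#define SLRU_BANK_SIZE          (1 << SLRU_BANK_BITSHIFT)   /* 16 slots per bank */
#define SLRU_PAGES_PER_SEGMENT  32

/* Illustrative names: CLOG packs 2 status bits per XID, i.e. 4 XIDs per byte. */
#define CLOG_XACTS_PER_BYTE     4
#define CLOG_XACTS_PER_PAGE     (BLCKSZ * CLOG_XACTS_PER_BYTE)  /* 32,768 */

/* Bank that owns a buffer slot: the same shift as SlotGetBankNumber. */
static inline int
slot_bank_number(int slotno)
{
    assert(slotno >= 0);
    return slotno >> SLRU_BANK_BITSHIFT;
}

/* Logical CLOG page holding the status bits of a given XID. */
static inline int64_t
clog_xid_to_page(uint32_t xid)
{
    return (int64_t) (xid / CLOG_XACTS_PER_PAGE);
}

/* Segment file (counted in whole segments) that contains a page. */
static inline int64_t
slru_page_to_segment(int64_t pageno)
{
    return pageno / SLRU_PAGES_PER_SEGMENT;
}

/* Number of XIDs whose status fits in one 256 KB CLOG segment. */
static inline uint32_t
clog_xids_per_segment(void)
{
    return (uint32_t) SLRU_PAGES_PER_SEGMENT * CLOG_XACTS_PER_PAGE;
}
```

Under these assumptions, slots 0 through 15 fall in bank 0 and slot 16 starts bank 1, while XID 1,048,576 is the first XID whose status lives in the second segment.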

Unresolved / out of scope for this doc:

  • The SlruPagePrecedesUnitTests function (slru.c:1697) and how each client’s PagePrecedes callback satisfies its contracts — covered in postgres-clog-commit-ts.md and postgres-multixact.md.
  • The interaction between SimpleLruTruncate and vacuum’s XID horizon computation — covered in postgres-vacuum.md.
  • The SSI client (SerialSlruCtl, predicate.c:814) stores SIREAD lock summaries per-page rather than per-XID; the address space semantics differ from CLOG — covered in postgres-ssi-predicate-locking.md.

Beyond PostgreSQL — Comparative Designs & Research Frontiers

Oracle and SQL Server: no dedicated status log

Oracle records transaction status in the ITL (Interested Transaction List) slots of each data block's header, which carry the commit System Change Number (SCN), and in the transaction table of the rollback (undo) segment. “Is XID N committed?” is answered by probing the rollback segment, not a separate log file. Because a commit does not revisit every block it modified, Oracle uses “delayed block cleanout”: a later visitor to the block resolves the transaction's status and lazily updates the block's ITL entry. There is no analog of pg_xact/.

SQL Server's row versioning takes a related approach: prior row versions are kept in a version store in tempdb and linked from the current row, so transaction status is resolved through version chains rather than a side file.

The PostgreSQL design’s advantage is simplicity and predictability: the pg_xact/ SLRU is a compact, uniform log that vacuum can truncate independently of heap file lifetimes. The disadvantage is that every visibility check requires an additional SLRU lookup unless the commit status is already cached in the heap tuple's hint bits, which is what SetHintBits in heapam_visibility.c (exported as HeapTupleSetHintBits) does to amortize the cost.
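
The amortization can be illustrated with a toy model. The sketch below is not PostgreSQL code: ToyTupleHeader, toy_status_lookup, toy_xmin_committed, and the slru_lookups counter are hypothetical names, and the lookup function is a stand-in for an SLRU read with made-up status rules. It also ignores details the real code has to handle, such as deferring the hint until the commit record is known to be flushed.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy hint-bit flags, modelled on the xmin committed/invalid bits. */
#define HINT_XMIN_COMMITTED  0x0100
#define HINT_XMIN_ABORTED    0x0200

typedef enum { XS_IN_PROGRESS, XS_COMMITTED, XS_ABORTED } ToyXidStatus;

/* Hypothetical, heavily reduced tuple header. */
typedef struct
{
    uint32_t xmin;      /* inserting transaction */
    uint16_t infomask;  /* cached status bits */
} ToyTupleHeader;

/* Counts how often the (expensive) status log is consulted. */
static int slru_lookups = 0;

/*
 * Stand-in for a status-log read. Invented rule for the demo:
 * XIDs >= 1000 are still running, even XIDs committed, odd XIDs aborted.
 */
static ToyXidStatus
toy_status_lookup(uint32_t xid)
{
    slru_lookups++;
    if (xid >= 1000)
        return XS_IN_PROGRESS;
    return (xid % 2 == 0) ? XS_COMMITTED : XS_ABORTED;
}

/* Returns true if the inserter committed; caches any final status in the header. */
static bool
toy_xmin_committed(ToyTupleHeader *tup)
{
    /* Fast path: a previous check already recorded the outcome. */
    if (tup->infomask & HINT_XMIN_COMMITTED)
        return true;
    if (tup->infomask & HINT_XMIN_ABORTED)
        return false;

    /* Slow path: consult the status log, then remember a final answer. */
    switch (toy_status_lookup(tup->xmin))
    {
        case XS_COMMITTED:
            tup->infomask |= HINT_XMIN_COMMITTED;
            return true;
        case XS_ABORTED:
            tup->infomask |= HINT_XMIN_ABORTED;
            return false;
        default:
            return false;   /* in progress: nothing final to cache yet */
    }
}
```

In this model the first check of a tuple costs one lookup and every later check of the same tuple costs none, while a tuple whose inserter is still running pays for a lookup on every check.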

CUBRID: commit log with a different eviction granularity

CUBRID’s transaction-status subsystem also uses a log-file approach with a shared in-memory cache, but its buffer pool is integrated with the general buffer manager rather than a separate structure. The advantage is a unified eviction policy; the disadvantage is contention on the central buffer pool lock during high-commit-rate workloads — exactly the problem SLRU’s banked architecture avoids.

Research: MVCC with status bitmaps in the heap header

Neumann et al. (“Fast Serializable Multi-Version Concurrency Control for Main-Memory Database Systems”, SIGMOD 2015) and the HyPer/Umbra lineage embed a transaction-status bitmap directly in each version’s header, eliminating the side-file lookup entirely. This is viable for main-memory databases where the bitmap fits in a cache line; for disk-based databases the side-file approach reduces per-page metadata overhead. PostgreSQL’s hint bits (t_infomask in HeapTupleHeaderData) are a partial application of the same idea: once a transaction’s status is confirmed, it is written into the tuple header so future reads bypass the SLRU entirely.

PostgreSQL’s SLRU is itself an extensibility surface. Any extension can call SimpleLruInit with a custom name, directory, buffer count, PagePrecedes callback, and sync handler. This mirrors the broader PostgreSQL design philosophy of exposing internal machinery (table AM, index AM, WAL rmgr, background workers) as first-class extension points. A columnar or time-series extension that needs a compact, wrap-around-able status log can reuse the entire SLRU stack — LRU eviction, WAL flushing, checkpointing, truncation — without reimplementing it.
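
What makes a status log "wrap-around-able" is the ordering its PagePrecedes callback defines. The sketch below shows the modular-comparison idea such a callback rests on, under stated assumptions: toy_xid_precedes and toy_page_precedes are hypothetical names, the 32,768 XIDs-per-page figure is borrowed from CLOG, and the boundary handling that the in-tree callbacks perform is deliberately left out. It is an illustration of the technique, not a callback that could be handed to SimpleLruInit unchanged.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed page capacity, borrowed from CLOG (8192 bytes x 4 XIDs/byte). */
#define TOY_XIDS_PER_PAGE  32768u

/*
 * Circular "a is older than b" on a 32-bit counter: the difference is
 * computed modulo 2^32 and then interpreted as signed, so the answer
 * stays correct across the wrap point as long as the two values are
 * less than 2^31 apart.
 */
static bool
toy_xid_precedes(uint32_t a, uint32_t b)
{
    return (int32_t) (a - b) < 0;
}

/*
 * Toy page ordering: compare the first XID stored on each page.
 * Hypothetical simplification; it skips the page-boundary checks a
 * production callback needs.
 */
static bool
toy_page_precedes(int64_t page1, int64_t page2)
{
    uint32_t xid1 = (uint32_t) page1 * TOY_XIDS_PER_PAGE;
    uint32_t xid2 = (uint32_t) page2 * TOY_XIDS_PER_PAGE;

    return toy_xid_precedes(xid1, xid2);
}
```

With these definitions the last page of the 32-bit XID space (page 131071) sorts before page 0, which is the property that lets truncation keep working after the counter wraps.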

In-tree source files (REL_18_STABLE, commit 273fe94):

  • src/backend/access/transam/slru.c — main implementation (1852 lines)
  • src/include/access/slru.h — public API and shared-memory layout
  • src/backend/access/transam/clog.c — CLOG client; defines XactCtl
  • src/backend/access/transam/subtrans.c — subtransaction client
  • src/backend/access/transam/commit_ts.c — commit_ts client
  • src/backend/access/transam/multixact.c — MultiXact offset/member clients
  • src/backend/commands/async.c — NOTIFY client (no fsync)
  • src/backend/storage/lmgr/predicate.c — SSI SIREAD lock client

Cross-references in this KB:

  • postgres-xact.md — transaction lifecycle that drives CLOG writes
  • postgres-clog-commit-ts.md — CLOG and commit_ts as SLRU clients
  • postgres-multixact.md — MultiXact as SLRU client
  • postgres-mvcc-snapshots.md — visibility checks that read CLOG via SLRU
  • postgres-xlog-wal.md — WAL infrastructure that SLRU defers to before writes
  • postgres-vacuum.md — XID horizon computation that drives SLRU truncation
  • postgres-ssi-predicate-locking.md — SSI SerialSlruCtl client
  • postgres-buffer-manager.md — contrast: the full heap/index buffer manager

Textbook foundations:

  • Database System Concepts, Silberschatz et al., 7e, ch. 13 (buffer management)
  • Database Internals, Petrov 2019 (buffer pool eviction policies)
  • dbms-papers/aries.md — WAL-before-data invariant (Mohan et al. 1992)