PostgreSQL WAL Records & Resource Managers — The rmgr Table and Record Anatomy
Contents:
- Theoretical Background
- Common DBMS Design
- PostgreSQL’s Approach
- Source Walkthrough
- Source verification (as of 2026-06-05)
- Beyond PostgreSQL — Comparative Designs & Research Frontiers
- Sources
Theoretical Background
A write-ahead log is the spine of crash recovery in every disk-based
relational engine, and its correctness rests on a single rule made famous by
the ARIES paper (Mohan et al. 1992, “ARIES: A Transaction Recovery
Method Supporting Fine-Granularity Locking and Partial Rollbacks Using
Write-Ahead Logging”, captured in knowledge/research/dbms-papers/aries.md):
the log record describing a page change must reach stable storage before
the changed data page does (the “write-ahead” invariant). Given that rule,
a database that crashed with dirty buffers still un-flushed can be brought
back to a transactionally consistent state by replaying the log forward
from the last checkpoint — the redo pass — and then undoing the effects
of transactions that never committed — the undo pass. ARIES ties the two
passes together with three ideas the reader should keep in mind throughout
this document: every log record has a monotonically increasing LSN (log
sequence number); every data page stamps the LSN of the last record that
modified it (pageLSN), so replay can tell at a glance whether a record’s
effect is already present on the page; and recovery is repeatable —
replaying the same record twice produces the same page, because the redo
logic is idempotent with respect to pageLSN.
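As a minimal sketch of the pageLSN rule described above, the following toy C fragment models a page and a log record and applies the record only when the page's LSN is older. ToyPage, ToyRecord, toy_redo and the 64-byte page size are hypothetical names invented for illustration; none of them come from PostgreSQL or the ARIES paper.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TOY_PAGE_BYTES 64       /* toy page size, not PostgreSQL's BLCKSZ */

/* Hypothetical toy page: a pageLSN stamp plus payload bytes. */
typedef struct ToyPage {
    uint64_t      page_lsn;
    unsigned char data[TOY_PAGE_BYTES];
} ToyPage;

/* Hypothetical toy log record: "write len bytes at offset". */
typedef struct ToyRecord {
    uint64_t      lsn;
    uint16_t      offset;
    uint16_t      len;
    unsigned char bytes[8];
} ToyRecord;

/* Returns 1 if the record was applied, 0 if the page already reflected it. */
static int toy_redo(ToyPage *page, const ToyRecord *rec)
{
    assert(rec->len <= sizeof(rec->bytes));
    assert((size_t) rec->offset + rec->len <= TOY_PAGE_BYTES);

    if (page->page_lsn >= rec->lsn)
        return 0;               /* change is already on the page: skip */

    memcpy(page->data + rec->offset, rec->bytes, rec->len);
    page->page_lsn = rec->lsn;  /* stamp, so a second replay is a no-op */
    return 1;
}
```

Calling toy_redo twice with the same record changes the page only the first time; the second call returns 0, which is the repeatability property in miniature.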
ARIES says nothing about how the log is physically formatted, and that is exactly the gap this document fills. A real engine logs changes from dozens of independent subsystems — the heap, every index access method, the free-space and visibility maps, the commit log, multixact, sequences, tablespace and database creation, replication bookkeeping — and every one of them needs to emit records and, at recovery time, redo them. Two structural questions follow. First, how is a single physical log shared by many subsystems? The log is one append-only byte stream, but a heap INSERT and a btree page split are utterly different operations with different payloads. Something must tag each record with “who wrote this” and route it to the right replay code. Second, what is the on-disk shape of one record such that recovery — running in a fresh process with no in-memory state — can parse it, find the pages it touched, and reconstruct enough to redo the change exactly?
The classical answer to the first question is the resource manager abstraction, which ARIES itself names: the recovery driver is generic and knows only about LSNs, log scanning, and the dirty-page/transaction tables; it delegates the semantics of each record to a per-subsystem module — a “resource manager” — that knows how to redo and undo its own record types. The recovery manager is thus a thin dispatcher over a table of resource managers, indexed by a small integer stored in every record. PostgreSQL adopts this design almost verbatim, with one large simplification discussed below (it has no per-record undo pass). The answer to the second question is a self-describing record format: a fixed header common to all record types, followed by a sequence of length-prefixed, tagged chunks — block references and an opaque main-data blob — that a generic parser can walk without knowing the record’s meaning, while the owning resource manager interprets the bytes.
The other concept the WAL format must encode is the full-page image
(FPI), sometimes called a full-page write. Incremental redo — “set 4 bytes
at offset 120 of block 17” — only works if the underlying page was not left
in a torn (partially written) state by the crash, but commodity storage
gives no atomicity guarantee larger than a sector. ARIES-family systems
defend against torn pages by logging, the first time a page is touched
after a checkpoint, a complete copy of the page; replay restores that copy
wholesale instead of applying the increment. The WAL record format must
therefore be able to carry, per modified block, an optional full image — and,
because pages have a large run of zero bytes in the middle, it pays to strip
that “hole” before storing the image. All three ideas — record tagging,
self-describing block references, and optional FPIs — are visible directly in
PostgreSQL’s XLogRecord and its block-reference sub-headers.
Common DBMS Design
Across ARIES-family engines a recognizable set of conventions has settled around WAL record formatting and dispatch, and PostgreSQL sits squarely inside them while making a few characteristic choices.
A small integer names the owning subsystem. Oracle calls its log records
redo records composed of change vectors, each tagged with an opcode
(layer.subcode) identifying the layer that produced it; SQL Server tags each
log record with an operation and a context; InnoDB tags each mini-
transaction log record with an mlog type byte. In every case a single
small field routes the record to the code that understands it. PostgreSQL’s
version is the 8-bit xl_rmid field plus a 4-bit private opcode carved out of
xl_info; the rmid indexes a static dispatch table.
A common header, then opaque payload. Universally the record begins with a fixed-layout header — total length, the transaction id, a back-pointer or length to the previous record, a checksum, and the resource-manager/operation tag — after which the bytes are the private business of the owning module. The header is what the generic log scanner needs; the payload is what the resource manager needs.
Per-page logging with physiological redo. ARIES coined physiological logging: physical to a page, logical within it. A record names a specific page (physical) but describes the change in terms of slots/offsets rather than raw byte positions (logical within the page), so the change survives the page being reorganized by an unrelated operation. PostgreSQL’s block references carry the physical page identity — relation, fork, block — while the rmgr’s main-data blob describes the change logically (e.g., “insert this tuple at this offset”).
FPIs to defeat torn pages. The first-touch-after-checkpoint full-page
image is near-universal: Oracle, SQL Server, MySQL/InnoDB (“doublewrite” is a
different but complementary mechanism), and PostgreSQL all log full pages to
make redo robust against partial writes. The engineering refinements differ —
PostgreSQL strips the inter-pd_lower/pd_upper hole and optionally
compresses the image (pglz/lz4/zstd).
A descriptor/dump facility. Production engines ship a tool to render the
log human-readably (Oracle LogMiner, SQL Server fn_dblog, InnoDB’s
innodb_log parsing); the per-record-type formatting logic naturally lives
beside the redo logic. PostgreSQL factors this into desc and identify
callbacks that pg_waldump calls.
Where PostgreSQL diverges most sharply is the absence of logical undo in
the log. Classic ARIES logs both redo and undo information and writes
compensation log records (CLRs) during rollback. PostgreSQL’s MVCC storage
makes an aborted transaction’s tuples simply invisible (the xact is marked
aborted in pg_xact), so there is no need to physically undo row changes;
the WAL is redo-only. Consequently RmgrData has an rm_redo callback but
no rm_undo, and recovery is a single forward redo pass with no analysis-of-
losers/undo phase in the ARIES sense. (Rollback of an in-flight transaction
during normal operation is handled by MVCC + subtransaction abort, not by the
log.) This is the single biggest structural reason PostgreSQL’s recovery code
is dramatically simpler than a textbook ARIES implementation.
The flip side of redo-only is that the resource-manager callback set is lean. The next diagram shows the conceptual layering every ARIES-family engine shares, with PostgreSQL’s concrete names.
flowchart TD
subgraph emit["Emit side (normal running backend)"]
A["subsystem code<br/>(heapam, nbtinsert, ...)"] --> B["XLogBeginInsert /<br/>XLogRegisterBuffer /<br/>XLogRegisterData"]
B --> C["XLogInsert(rmid, info)<br/>assembles XLogRecord"]
C --> D["WAL byte stream<br/>(one shared log)"]
end
subgraph replay["Replay side (startup process)"]
D --> E["XLogReader<br/>parses header + chunks"]
E --> F["RmgrTable[xl_rmid].rm_redo(record)"]
F --> G["XLogReadBufferForRedo<br/>re-pin each block ref"]
G --> H["apply change iff<br/>pageLSN < record LSN"]
end
subgraph tools["Tooling"]
E --> I["rm_identify(info) -> name"]
E --> J["rm_desc(buf, record) -> detail"]
I --> K["pg_waldump output"]
J --> K
end
PostgreSQL’s Approach
PostgreSQL builds the resource-manager dispatch table at compile time out of a
single list and an X-macro. The list lives in rmgrlist.h, a header
deliberately written without an include guard so it can be #include-d
multiple times, each time with a different definition of the PG_RMGR(...)
macro. The list itself is just a sequence of PG_RMGR invocations, one per
resource manager, in a fixed order — and that order is the wire format,
because each rmgr’s numeric id is its position in the list:

    // rmgrlist.h — src/include/access/rmgrlist.h
    /* symbol name, textual name, redo, desc, identify, startup, cleanup, mask, decode */
    PG_RMGR(RM_XLOG_ID, "XLOG", xlog_redo, xlog_desc, xlog_identify, NULL, NULL, NULL, xlog_decode)
    PG_RMGR(RM_XACT_ID, "Transaction", xact_redo, xact_desc, xact_identify, NULL, NULL, NULL, xact_decode)
    PG_RMGR(RM_SMGR_ID, "Storage", smgr_redo, smgr_desc, smgr_identify, NULL, NULL, NULL, NULL)
    /* ... 17 more ... */
    PG_RMGR(RM_GENERIC_ID, "Generic", generic_redo, generic_desc, generic_identify, NULL, NULL, generic_mask, NULL)
    PG_RMGR(RM_LOGICALMSG_ID, "LogicalMessage", logicalmsg_redo, logicalmsg_desc, logicalmsg_identify, NULL, NULL, NULL, logicalmsg_decode)

The header’s own comment states the contract bluntly: “order of entries
defines the numerical values of each rmgr’s ID, which is stored in WAL
records. New entries should be added at the end.” Reordering the list would
silently change what an existing WAL stream means, so additions are append-
only and any change “possibly need[s] an XLOG_PAGE_MAGIC bump.” On REL_18
the built-in set has 22 named entries occupying ids 0–21 (RM_XLOG_ID
through RM_LOGICALMSG_ID). Note there is no XLOG2 rmgr — that is a
later-version addition and does not exist here.
The same header is consumed three ways. In rmgr.h the macro expands to a
bare enumerator, producing the RmgrIds enum whose values are exactly the ids
referenced from the list:

    // rmgr.h — src/include/access/rmgr.h
    #define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup,mask,decode) \
        symname,

    typedef enum RmgrIds
    {
    #include "access/rmgrlist.h"
        RM_NEXT_ID
    } RmgrIds;

In rmgr.c the macro expands to a struct initializer, materializing the
actual dispatch array RmgrTable[]. This is the single most important data
structure in the document:

    // RmgrTable — src/backend/access/transam/rmgr.c
    #define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup,mask,decode) \
        { name, redo, desc, identify, startup, cleanup, mask, decode },

    RmgrData RmgrTable[RM_MAX_ID + 1] = {
    #include "access/rmgrlist.h"
    };

RM_MAX_ID is UINT8_MAX (255), so the table has room for 256 entries even
though only 22 are built in; the gap from 22 to 127 is reserved-but-unused,
and 128–255 are for custom (extension) resource managers (more below). An
entry whose rm_name is NULL is “not registered,” which is how the
RmgrIdExists predicate distinguishes live slots from holes.
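The X-macro pattern described above can be sketched in a single file. This variant is an assumption made so the example is self-contained: instead of re-including a guard-less header, it passes the expansion macro as a parameter to a list macro (TOY_RMGR_LIST). All TOY_ names, the three entries, and the eight-slot table are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical three-entry list; the real rmgrlist.h has more columns. */
#define TOY_RMGR_LIST(X) \
    X(TOY_RM_XLOG_ID, "XLOG",        toy_xlog_redo) \
    X(TOY_RM_XACT_ID, "Transaction", toy_xact_redo) \
    X(TOY_RM_HEAP_ID, "Heap",        toy_heap_redo)

/* Toy redo callbacks: each just returns a distinct marker value. */
static int toy_xlog_redo(void) { return 10; }
static int toy_xact_redo(void) { return 20; }
static int toy_heap_redo(void) { return 30; }

/* Expansion 1: the enum. An entry's id is its position in the list. */
#define TOY_AS_ENUM(sym, name, redo) sym,
typedef enum ToyRmgrId { TOY_RMGR_LIST(TOY_AS_ENUM) TOY_RM_NEXT_ID } ToyRmgrId;
#undef TOY_AS_ENUM

typedef struct ToyRmgrData {
    const char *rm_name;        /* NULL means "slot not registered" */
    int       (*rm_redo)(void);
} ToyRmgrData;

#define TOY_RM_MAX_ID 7         /* toy capacity; PostgreSQL uses UINT8_MAX */

/* Expansion 2: the dispatch table, in the same order as the enum. */
#define TOY_AS_ROW(sym, name, redo) { name, redo },
static ToyRmgrData toy_rmgr_table[TOY_RM_MAX_ID + 1] = {
    TOY_RMGR_LIST(TOY_AS_ROW)
};
#undef TOY_AS_ROW

/* Live slots have a name; unused slots were zero-initialized. */
static int toy_rmgr_exists(int rmid)
{
    return rmid >= 0 && rmid <= TOY_RM_MAX_ID &&
           toy_rmgr_table[rmid].rm_name != NULL;
}
```

Because both expansions walk the same list, the enum value of an entry and its row index cannot drift apart, which is the property that makes list order part of the wire format.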
Each row is an RmgrData, a struct of eight fields — one name and seven
function pointers — documented in xlog_internal.h:

    // RmgrData — src/include/access/xlog_internal.h
    typedef struct RmgrData
    {
        const char *rm_name;
        void        (*rm_redo) (XLogReaderState *record);
        void        (*rm_desc) (StringInfo buf, XLogReaderState *record);
        const char *(*rm_identify) (uint8 info);
        void        (*rm_startup) (void);
        void        (*rm_cleanup) (void);
        void        (*rm_mask) (char *pagedata, BlockNumber blkno);
        void        (*rm_decode) (struct LogicalDecodingContext *ctx,
                                  struct XLogRecordBuffer *buf);
    } RmgrData;

The callbacks divide cleanly by purpose. rm_redo is the heart of
recovery: given a parsed record, reapply its change. rm_desc and
rm_identify are the descriptor pair pg_waldump uses — rm_identify
turns the opcode bits into a short name like "INSERT", and rm_desc
appends record-specific detail to a string buffer. rm_startup and
rm_cleanup let an rmgr allocate and free recovery-time scratch state
(only btree, gin, gist, spgist use them — for their “incomplete action”
tracking). rm_mask supports wal_consistency_checking: it blanks out
non-deterministic page bits (hint bits, free space) before recovery compares a
redone page against the FPI in the record. rm_decode is the hook into
logical decoding (only the rmgrs whose changes are logically meaningful —
XLOG, Transaction, Standby, Heap, Heap2, LogicalMessage — provide one). A
NULL in any slot simply means “this rmgr has nothing to do here.”
Looking down the rmgrlist.h columns reveals the pattern at a glance: almost
every rmgr supplies redo/desc/identify; only the four index AMs with multi-
record split protocols supply startup/cleanup; only the storage AMs (Heap,
Heap2, the indexes, Sequence, Generic) supply mask; and only the six
logical-decoding-relevant rmgrs supply decode. The next diagram captures the
demultiplexing flow from a raw record to a callback.
flowchart TD
R["XLogRecord on disk<br/>xl_rmid = N, xl_info = 0xI0 | flags"] --> P["XLogReader: parse header,<br/>block refs, main data"]
P --> IDX["rmid = XLogRecGetRmid(record)<br/>info = XLogRecGetInfo & ~XLR_INFO_MASK"]
IDX --> T{"RmgrTable[rmid]"}
T --> RD["rm_redo(record)"]
RD --> SW["switch (info & OPMASK)"]
SW --> OP1["heap_xlog_insert"]
SW --> OP2["heap_xlog_update"]
SW --> OP3["heap_xlog_delete"]
OP1 --> BUF["XLogReadBufferForRedo(record, block_id)"]
OP2 --> BUF
OP3 --> BUF
BUF --> ACT{"action"}
ACT --> NR["BLK_NEEDS_REDO:<br/>apply & PageSetLSN"]
ACT --> RST["BLK_RESTORED:<br/>FPI already restored"]
ACT --> DN["BLK_DONE:<br/>pageLSN >= record, skip"]
ACT --> NF["BLK_NOTFOUND:<br/>page gone, skip"]
A redo routine’s first act is always to recover its private opcode from
xl_info. The 4 low bits of xl_info are reserved (XLR_INFO_MASK = 0x0F)
for the WAL machinery’s own flags; the 4 high bits (XLR_RMGR_INFO_MASK = 0xF0) belong to the rmgr. The canonical idiom, from heap_redo, masks off
the reserved bits and switches on the remainder:

    // heap_redo — src/backend/access/heap/heapam_xlog.c
    void
    heap_redo(XLogReaderState *record)
    {
        uint8       info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;

        switch (info & XLOG_HEAP_OPMASK)
        {
            case XLOG_HEAP_INSERT:
                heap_xlog_insert(record);
                break;
            case XLOG_HEAP_DELETE:
                heap_xlog_delete(record);
                break;
            case XLOG_HEAP_UPDATE:
                heap_xlog_update(record, false);
                break;
            /* ... HOT_UPDATE, CONFIRM, LOCK, INPLACE, TRUNCATE ... */
            default:
                elog(PANIC, "heap_redo: unknown op code %u", info);
        }
    }

This two-level dispatch — rmid picks the rmgr, opcode picks the operation —
is why a single byte (xl_rmid) plus four bits (xl_info high nibble) is
enough to route every record in a 22-rmgr system to exactly one redo
function. The default: PANIC is deliberate: an unrecognized opcode means the
WAL is from an incompatible build or is corrupt, and continuing would risk
silent data loss.
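The second level of that dispatch, masking off the reserved nibble and switching on the remainder, can be sketched without any PostgreSQL headers. The two nibble masks mirror the values quoted above; the opcode values, the TOY_ names, and the enum return type are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_XLR_INFO_MASK      0x0F  /* low nibble: WAL machinery flags */
#define TOY_XLR_RMGR_INFO_MASK 0xF0  /* high nibble: rmgr-private bits */

/* Hypothetical opcode values for a toy rmgr (illustrative only). */
#define TOY_OP_INSERT  0x00
#define TOY_OP_DELETE  0x10
#define TOY_OP_UPDATE  0x20
#define TOY_OPMASK     0x70
#define TOY_INIT_PAGE  0x80          /* modifier bit outside the opmask */

typedef enum ToyOp {
    TOY_DID_INSERT,
    TOY_DID_DELETE,
    TOY_DID_UPDATE,
    TOY_UNKNOWN
} ToyOp;

/* Strip the reserved low bits, then switch on the opcode bits. */
static ToyOp toy_redo_dispatch(uint8_t xl_info)
{
    uint8_t info = xl_info & (uint8_t) ~TOY_XLR_INFO_MASK;

    switch (info & TOY_OPMASK)
    {
        case TOY_OP_INSERT: return TOY_DID_INSERT;
        case TOY_OP_DELETE: return TOY_DID_DELETE;
        case TOY_OP_UPDATE: return TOY_DID_UPDATE;
        default:            return TOY_UNKNOWN;  /* real redo code PANICs */
    }
}
```

Flag bits in the low nibble and the modifier bit outside the opmask do not affect which branch is taken; only the opcode bits do.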
Source Walkthrough
This section walks the four code areas in the document’s scope: the
XLogRecord on-disk anatomy, the rmgr table machinery in rmgr.c, the
redo-time buffer access in xlogutils.c, and the generic_xlog.c delta
engine. WAL insertion (XLogInsert, the insert-lock machinery, segment
management) is the subject of postgres-xlog-wal.md; the recovery driver
loop (PerformWalRecovery, the redo main loop, checkpoint/restartpoint
sequencing) is the subject of postgres-recovery-redo.md. Here we focus on
the record format and the dispatch layer that sits between them.
Record anatomy: the fixed header
Every WAL record begins with an XLogRecord header of exactly
SizeOfXLogRecord bytes (24 on a 64-bit build), defined in xlogrecord.h:

    // XLogRecord — src/include/access/xlogrecord.h
    typedef struct XLogRecord
    {
        uint32        xl_tot_len;   /* total len of entire record */
        TransactionId xl_xid;       /* xact id */
        XLogRecPtr    xl_prev;      /* ptr to previous record in log */
        uint8         xl_info;      /* flag bits, see below */
        RmgrId        xl_rmid;      /* resource manager for this record */
        /* 2 bytes of padding here, initialize to zero */
        pg_crc32c     xl_crc;       /* CRC for this record */
        /* XLogRecordBlockHeaders and XLogRecordDataHeader follow, no padding */
    } XLogRecord;

    #define SizeOfXLogRecord (offsetof(XLogRecord, xl_crc) + sizeof(pg_crc32c))

xl_tot_len bounds the parse; xl_xid ties the record to a transaction (used
by recovery’s known-assigned-xids tracking and by logical decoding);
xl_prev is the LSN of the previous record, forming a back-chain that lets
the reader validate it is reading a real record boundary and not garbage;
xl_info carries the rmgr opcode (high nibble) and WAL flags (low nibble);
xl_rmid is the dispatch key; and xl_crc is a CRC-32C over the whole record
(all bytes after the header, then the header itself up to but excluding the
xl_crc field), the primary defense
against replaying a torn or corrupt record. The two reserved xl_info flags
the caller of XLogInsert may set are XLR_SPECIAL_REL_UPDATE (0x01, a
hint to external block-tracking tools) and XLR_CHECK_CONSISTENCY (0x02,
force an FPI for after-the-fact consistency checking).
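The 24-byte figure can be checked with a stand-in struct built from fixed-width types. This is only a sketch: it assumes TransactionId and pg_crc32c are 32 bits wide and XLogRecPtr is 64 bits wide, and ToyXLogRecord, TOY_SIZE_OF_XLOG_RECORD and toy_payload_len are hypothetical names, not PostgreSQL symbols.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in mirror of the header layout, using fixed-width types. */
typedef struct ToyXLogRecord {
    uint32_t xl_tot_len;    /* total length of the record */
    uint32_t xl_xid;        /* transaction id (assumed 32-bit) */
    uint64_t xl_prev;       /* LSN of the previous record */
    uint8_t  xl_info;       /* flag bits and rmgr opcode */
    uint8_t  xl_rmid;       /* resource manager id */
    /* 2 bytes of padding here */
    uint32_t xl_crc;        /* CRC of the record */
} ToyXLogRecord;

/* Same construction as the SizeOfXLogRecord macro. */
#define TOY_SIZE_OF_XLOG_RECORD \
    (offsetof(ToyXLogRecord, xl_crc) + sizeof(uint32_t))

/*
 * Number of bytes that follow the fixed header, or -1 when the length
 * field is too small to cover even the header.
 */
static int64_t toy_payload_len(const ToyXLogRecord *rec)
{
    if (rec->xl_tot_len < TOY_SIZE_OF_XLOG_RECORD)
        return -1;
    return (int64_t) (rec->xl_tot_len - TOY_SIZE_OF_XLOG_RECORD);
}
```

Defining the size through offsetof rather than sizeof keeps any trailing struct padding out of the on-disk size, which matters because block headers follow the fixed header with no padding.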
Record anatomy: block references and full-page images
After the header comes a sequence of self-describing chunks. Block references
are introduced by an XLogRecordBlockHeader, which the comment notes is not
aligned and must be copied to aligned storage before use:

    // XLogRecordBlockHeader — src/include/access/xlogrecord.h
    typedef struct XLogRecordBlockHeader
    {
        uint8       id;             /* block reference ID */
        uint8       fork_flags;     /* fork within the relation, and flags */
        uint16      data_length;    /* number of payload bytes (not including
                                     * page image) */
        /* If BKPBLOCK_HAS_IMAGE, an XLogRecordBlockImageHeader struct follows */
        /* If BKPBLOCK_SAME_REL is not set, a RelFileLocator follows */
        /* BlockNumber follows */
    } XLogRecordBlockHeader;

The id is the small integer the emitting code chose in XLogRegisterBuffer
(0..4 by default) — the redo routine asks for “block 0,” “block 1,” etc. by
the same id. fork_flags packs the ForkNumber (low 4 bits) with flag bits
(high 4): BKPBLOCK_HAS_IMAGE (an FPI follows), BKPBLOCK_HAS_DATA (per-
buffer data follows), BKPBLOCK_WILL_INIT (redo re-initializes the page from
scratch, so no FPI is needed and any read of the old contents is a bug), and
BKPBLOCK_SAME_REL (the RelFileLocator is omitted because it equals the
previous block ref’s — a WAL-volume optimization for multi-page records on the
same relation). The page’s physical identity — relation, fork, block — is thus
fully recoverable from the record alone, which is what lets a fresh startup
process re-find the page with no relcache.
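Unpacking fork_flags is a matter of two masks. The sketch below shows one way to do it; the individual bit values are assumptions chosen to fit the low-nibble/high-nibble split described above, so consult xlogrecord.h for the authoritative definitions. ToyBlockRef and toy_decode_fork_flags are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

/* Bit assignments assumed for illustration; verify against xlogrecord.h. */
#define TOY_BKPBLOCK_FORK_MASK  0x0F
#define TOY_BKPBLOCK_HAS_IMAGE  0x10
#define TOY_BKPBLOCK_HAS_DATA   0x20
#define TOY_BKPBLOCK_WILL_INIT  0x40
#define TOY_BKPBLOCK_SAME_REL   0x80

/* Decoded view of one fork_flags byte. */
typedef struct ToyBlockRef {
    int fork;       /* fork number, from the low nibble */
    int has_image;  /* a full-page image follows */
    int has_data;   /* per-buffer data follows */
    int will_init;  /* redo re-initializes the page */
    int same_rel;   /* locator omitted; reuse the previous block ref's */
} ToyBlockRef;

static ToyBlockRef toy_decode_fork_flags(uint8_t fork_flags)
{
    ToyBlockRef ref;

    ref.fork      = fork_flags & TOY_BKPBLOCK_FORK_MASK;
    ref.has_image = (fork_flags & TOY_BKPBLOCK_HAS_IMAGE) != 0;
    ref.has_data  = (fork_flags & TOY_BKPBLOCK_HAS_DATA) != 0;
    ref.will_init = (fork_flags & TOY_BKPBLOCK_WILL_INIT) != 0;
    ref.same_rel  = (fork_flags & TOY_BKPBLOCK_SAME_REL) != 0;
    return ref;
}
```

A parser would consult same_rel before reading further: when it is set, no relation locator follows the block header and the previous reference's locator is reused.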
When a full-page image is present, an XLogRecordBlockImageHeader follows,
and this is where the “hole” optimization lives:

    // XLogRecordBlockImageHeader — src/include/access/xlogrecord.h
    typedef struct XLogRecordBlockImageHeader
    {
        uint16      length;         /* number of page image bytes */
        uint16      hole_offset;    /* number of bytes before "hole" */
        uint8       bimg_info;      /* flag bits, see below */
        /* If BKPIMAGE_HAS_HOLE and BKPIMAGE_COMPRESSED(), an
         * XLogRecordBlockCompressHeader struct follows. */
    } XLogRecordBlockImageHeader;

    /* Information stored in bimg_info */
    #define BKPIMAGE_HAS_HOLE       0x01    /* page image has "hole" */
    #define BKPIMAGE_APPLY          0x02    /* page image should be restored
                                             * during replay */
    #define BKPIMAGE_COMPRESS_PGLZ  0x04
    #define BKPIMAGE_COMPRESS_LZ4   0x08
    #define BKPIMAGE_COMPRESS_ZSTD  0x10

A standard PostgreSQL page has a contiguous run of zero bytes between
pd_lower (end of the line-pointer array) and pd_upper (start of the tuple
data growing down from the end). Since those bytes are known-zero, the FPI
omits them: hole_offset records where the hole starts, and the hole's length
is either implied by the image length (BLCKSZ minus length, when the image is
uncompressed) or stored in the compress header, so replay can reconstruct the full
BLCKSZ page by zero-filling the gap. BKPIMAGE_APPLY distinguishes an image
that should be restored during replay (the normal first-touch-after-
checkpoint case) from one logged only for wal_consistency_checking
comparison. The three COMPRESS_* bits select the algorithm when
wal_compression is on. This is the on-disk realization of the torn-page
defense from the theory section.
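The hole optimization is easy to model on a toy page. The sketch below strips a zero-filled gap when building an image and zero-fills it again on restore. It is a simplification under stated assumptions: the hole length is passed explicitly, there is no compression, and TOY_BLCKSZ, toy_strip_hole and toy_restore_image are hypothetical names rather than PostgreSQL functions.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TOY_BLCKSZ 32   /* toy page size; far smaller than a real page */

/*
 * Copy the page into image, leaving out the hole that spans
 * [hole_offset, hole_offset + hole_len). Returns the image length.
 */
static size_t toy_strip_hole(const unsigned char *page,
                             uint16_t hole_offset, uint16_t hole_len,
                             unsigned char *image)
{
    size_t tail_start = (size_t) hole_offset + hole_len;

    assert(tail_start <= TOY_BLCKSZ);
    memcpy(image, page, hole_offset);
    memcpy(image + hole_offset, page + tail_start, TOY_BLCKSZ - tail_start);
    return TOY_BLCKSZ - hole_len;
}

/* Rebuild the full page from the image, zero-filling the hole. */
static void toy_restore_image(const unsigned char *image, size_t image_len,
                              uint16_t hole_offset, uint16_t hole_len,
                              unsigned char *page)
{
    assert(image_len + hole_len == TOY_BLCKSZ);
    assert(hole_offset <= image_len);
    memcpy(page, image, hole_offset);
    memset(page + hole_offset, 0, hole_len);
    memcpy(page + hole_offset + hole_len, image + hole_offset,
           image_len - hole_offset);
}
```

Stripping and then restoring reproduces the original page byte for byte as long as the hole really contained only zeros, which is the precondition the format relies on.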
The rmgr table machinery
rmgr.c is small and almost entirely table-driven. Beyond defining
RmgrTable[] (shown above), it provides the lifecycle hooks the recovery
driver calls. RmgrStartup and RmgrCleanup iterate the table once,
invoking the optional rm_startup/rm_cleanup callbacks so index AMs can set
up and tear down their incomplete-action bookkeeping:

    // RmgrStartup — src/backend/access/transam/rmgr.c
    void
    RmgrStartup(void)
    {
        for (int rmid = 0; rmid <= RM_MAX_ID; rmid++)
        {
            if (!RmgrIdExists(rmid))
                continue;

            if (RmgrTable[rmid].rm_startup != NULL)
                RmgrTable[rmid].rm_startup();
        }
    }

RmgrNotFound is the error path: if a record arrives bearing an rmid with no
registered entry — typically a custom rmgr whose extension was not listed in
shared_preload_libraries — recovery ereport(ERROR)s with a hint to load
the module. The GetRmgr inline accessor (in xlog_internal.h) checks
RmgrIdExists on every lookup and calls RmgrNotFound on a miss, so a missing
rmgr can never be silently skipped.
Custom resource managers occupy ids 128–255 and are installed at process
startup by RegisterCustomRmgr, which enforces several invariants: the id
must be in the custom range (RmgrIdIsCustom), registration must happen
while shared_preload_libraries are initializing
(process_shared_preload_libraries_in_progress), the id must not already be
taken, and the name must be non-empty and not collide with an existing rmgr’s
name:

    // RegisterCustomRmgr — src/backend/access/transam/rmgr.c
    void
    RegisterCustomRmgr(RmgrId rmid, const RmgrData *rmgr)
    {
        if (rmgr->rm_name == NULL || strlen(rmgr->rm_name) == 0)
            ereport(ERROR, (errmsg("custom resource manager name is invalid"), ...));

        if (!RmgrIdIsCustom(rmid))
            ereport(ERROR, (errmsg("custom resource manager ID %d is out of range", rmid), ...));

        if (!process_shared_preload_libraries_in_progress)
            ereport(ERROR, ( ... "must be registered while initializing modules in "
                                 "\"shared_preload_libraries\"." ));

        if (RmgrTable[rmid].rm_name != NULL)
            ereport(ERROR, ( ... "already registered with the same ID." ));

        /* ... name-collision scan ... */

        RmgrTable[rmid] = *rmgr;    /* register it */
        ereport(LOG,
                (errmsg("registered custom resource manager \"%s\" with ID %d",
                        rmgr->rm_name, rmid)));
    }

RM_EXPERIMENTAL_ID (128) is the conventional id to use during
development before reserving a unique number on the PostgreSQL wiki. Finally,
pg_get_wal_resource_managers() is the SQL-callable set-returning function that
exposes the table; it walks RmgrTable emitting (rmid, name, builtin) rows,
using RmgrIdIsBuiltin to flag whether each id is a
built-in or a custom registration.
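The order of the registration checks can be modelled with return codes in place of ereport. This is a sketch, not the PostgreSQL implementation: ToyRmgr, ToyRegResult, toy_register_custom_rmgr and the toy_preload_in_progress flag are hypothetical, and the name-collision scan uses a plain case-sensitive strcmp as a simplification.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_RM_MAX_ID        255
#define TOY_RM_MIN_CUSTOM_ID 128

typedef struct ToyRmgr { const char *rm_name; } ToyRmgr;

typedef enum ToyRegResult {
    TOY_REG_OK,
    TOY_REG_BAD_NAME,
    TOY_REG_OUT_OF_RANGE,
    TOY_REG_TOO_LATE,
    TOY_REG_ID_TAKEN,
    TOY_REG_NAME_TAKEN
} ToyRegResult;

static ToyRmgr toy_table[TOY_RM_MAX_ID + 1];    /* zeroed: all slots free */
static int     toy_preload_in_progress = 1;     /* stand-in for the real flag */

static ToyRegResult toy_register_custom_rmgr(int rmid, const ToyRmgr *rmgr)
{
    if (rmgr->rm_name == NULL || rmgr->rm_name[0] == '\0')
        return TOY_REG_BAD_NAME;
    if (rmid < TOY_RM_MIN_CUSTOM_ID || rmid > TOY_RM_MAX_ID)
        return TOY_REG_OUT_OF_RANGE;
    if (!toy_preload_in_progress)
        return TOY_REG_TOO_LATE;
    if (toy_table[rmid].rm_name != NULL)
        return TOY_REG_ID_TAKEN;

    /* Name-collision scan over every registered slot. */
    for (int i = 0; i <= TOY_RM_MAX_ID; i++)
    {
        if (toy_table[i].rm_name != NULL &&
            strcmp(toy_table[i].rm_name, rmgr->rm_name) == 0)
            return TOY_REG_NAME_TAKEN;
    }

    toy_table[rmid] = *rmgr;    /* register it */
    return TOY_REG_OK;
}
```

Checking the name and the id range before touching the table means an invalid request can never index outside the array or leave a half-registered slot behind.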
Redo-time buffer access in xlogutils.c
A redo routine never reads a page through the normal relcache/buffer paths —
during recovery there is no relcache and the catalogs may not even be
consistent yet. Instead it goes through xlogutils.c, whose central entry
point is XLogReadBufferForRedo. This is the single function that ties the
record’s block reference to a real shared buffer and decides whether the
change still needs applying:

    // XLogReadBufferForRedoExtended — src/backend/access/transam/xlogutils.c
    XLogRedoAction
    XLogReadBufferForRedoExtended(XLogReaderState *record,
                                  uint8 block_id,
                                  ReadBufferMode mode, bool get_cleanup_lock,
                                  Buffer *buf)
    {
        XLogRecPtr  lsn = record->EndRecPtr;
        RelFileLocator rlocator;
        ForkNumber  forknum;
        BlockNumber blkno;
        /* ... */

        if (!XLogRecGetBlockTagExtended(record, block_id, &rlocator, &forknum, &blkno,
                                        &prefetch_buffer))
            elog(PANIC, "failed to locate backup block with ID %d in WAL record",
                 block_id);

        /* If it has a full-page image and it should be restored, do it. */
        if (XLogRecBlockImageApply(record, block_id))
        {
            *buf = XLogReadBufferExtended(rlocator, forknum, blkno, ...);
            page = BufferGetPage(*buf);
            if (!RestoreBlockImage(record, block_id, page))
                ereport(ERROR, ...);
            if (!PageIsNew(page))
                PageSetLSN(page, lsn);
            MarkBufferDirty(*buf);
            return BLK_RESTORED;
        }
        else
        {
            *buf = XLogReadBufferExtended(rlocator, forknum, blkno, mode, prefetch_buffer);
            if (BufferIsValid(*buf))
            {
                /* ... acquire lock ... */
                if (lsn <= PageGetLSN(BufferGetPage(*buf)))
                    return BLK_DONE;
                else
                    return BLK_NEEDS_REDO;
            }
            else
                return BLK_NOTFOUND;
        }
    }

The four-way return value is the idempotent-redo logic from ARIES made
concrete. BLK_RESTORED — the record carried an FPI that applies, so the
whole page was overwritten and the rmgr’s incremental logic is skipped.
BLK_NEEDS_REDO — the page exists and its pageLSN is older than this
record, so the incremental change has not yet been applied; the rmgr applies
it and stamps the new LSN. BLK_DONE — the page’s LSN is already >= the
record LSN, meaning the change is already present (the page was flushed before
the crash, or we are replaying past it), so redo is skipped. BLK_NOTFOUND
— the page no longer exists (its relation was dropped or truncated later in
the WAL), so the change is harmlessly ignored. The lsn <= PageGetLSN(...)
comparison is the exact mechanism that makes replay safe to run twice. The
thin wrapper XLogReadBufferForRedo calls this with RBM_NORMAL;
XLogInitBufferForRedo calls it with RBM_ZERO_AND_LOCK for the
BKPBLOCK_WILL_INIT case where the redo routine rebuilds the page from
scratch.
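Stripped of buffer management, the four-way classification reduces to three tests in a fixed order. The sketch below captures only that decision logic; ToyBlockState gathers the facts that the real function obtains from the record and the buffer manager, and every TOY_ or toy_ name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

typedef enum ToyRedoAction {
    TOY_BLK_NEEDS_REDO,
    TOY_BLK_DONE,
    TOY_BLK_RESTORED,
    TOY_BLK_NOTFOUND
} ToyRedoAction;

/* Facts about one block reference, supplied by the caller in this toy. */
typedef struct ToyBlockState {
    int      has_applicable_fpi;  /* record carries an image to restore */
    int      page_exists;         /* page could be read from storage */
    uint64_t page_lsn;            /* LSN currently stamped on the page */
} ToyBlockState;

static ToyRedoAction toy_classify(uint64_t record_end_lsn,
                                  const ToyBlockState *blk)
{
    if (blk->has_applicable_fpi)
        return TOY_BLK_RESTORED;    /* whole page overwritten by the image */
    if (!blk->page_exists)
        return TOY_BLK_NOTFOUND;    /* nothing to apply the change to */
    if (record_end_lsn <= blk->page_lsn)
        return TOY_BLK_DONE;        /* change already present on the page */
    return TOY_BLK_NEEDS_REDO;      /* caller applies the incremental change */
}
```

The image test comes first because a restored image supersedes whatever was on disk, so the page's old LSN is irrelevant in that branch.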
XLogReadBufferExtended is the lower-level page fetch. During recovery it
opens the relation at the storage-manager (smgr) level — bypassing the
relcache entirely — and smgrcreates the file if missing, so replay can
write to a relation that a later record will drop. If the requested block is
beyond end-of-file it either extends with zero pages (the RBM_ZERO_* modes)
or, in RBM_NORMAL, logs an invalid page reference and returns
InvalidBuffer:

    // XLogReadBufferExtended — src/backend/access/transam/xlogutils.c (condensed)
    smgr = smgropen(rlocator, INVALID_PROC_NUMBER);
    smgrcreate(smgr, forknum, true);
    lastblock = smgrnblocks(smgr, forknum);
    if (blkno < lastblock)
        buffer = ReadBufferWithoutRelcache(rlocator, forknum, blkno, mode, NULL, true);
    else
    {
        if (mode == RBM_NORMAL)     /* hm, page doesn't exist in file */
        {
            log_invalid_page(rlocator, forknum, blkno, false);
            return InvalidBuffer;
        }
        /* ... RBM_ZERO_* modes extend the file with zero pages ... */
    }

The invalid-page mechanism deserves a note because it is a subtle correctness
device. With full_page_writes=off it is possible to see an incremental
record for a page that no longer exists (because its relation was later
dropped/truncated, but the drop record comes after this one in the stream).
log_invalid_page records the (relation, fork, block) in a hash table; if a
later record drops or truncates that relation, the entry is forgotten. At the
end of recovery, XLogCheckInvalidPages sweeps the table — any entry still
present means the WAL referenced a page that was never accounted for, which is
a PANIC-worthy corruption signal (downgradable to WARNING via the
ignore_invalid_pages GUC):

    // XLogCheckInvalidPages — src/backend/access/transam/xlogutils.c (condensed)
    while ((hentry = (xl_invalid_page *) hash_seq_search(&status)) != NULL)
    {
        report_invalid_page(WARNING, hentry->key.locator, hentry->key.forkno,
                            hentry->key.blkno, hentry->present);
        foundone = true;
    }
    if (foundone)
        elog(ignore_invalid_pages ? WARNING : PANIC,
             "WAL contains references to invalid pages");

Descriptor routines and pg_waldump
The desc/identify callbacks exist purely to render WAL human-readably; the
backend never calls them during normal operation. The per-rmgr descriptor
files live under src/backend/access/rmgrdesc/. identify maps the opcode
bits (the same xl_info & ~XLR_INFO_MASK the redo routine switches on) to a
short string, including the +INIT suffix when XLOG_HEAP_INIT_PAGE is set:

    // heap_identify — src/backend/access/rmgrdesc/heapdesc.c (condensed)
    const char *
    heap_identify(uint8 info)
    {
        const char *id = NULL;

        switch (info & ~XLR_INFO_MASK)
        {
            case XLOG_HEAP_INSERT:
                id = "INSERT";
                break;
            case XLOG_HEAP_INSERT | XLOG_HEAP_INIT_PAGE:
                id = "INSERT+INIT";
                break;
            case XLOG_HEAP_DELETE:
                id = "DELETE";
                break;
            case XLOG_HEAP_UPDATE:
                id = "UPDATE";
                break;
            /* ... HOT_UPDATE, TRUNCATE, CONFIRM, LOCK, INPLACE ... */
        }
        return id;
    }

desc decodes the record’s main-data blob into field-by-field detail by
casting XLogRecGetData(record) to the rmgr’s struct for that opcode:

    // heap_desc — src/backend/access/rmgrdesc/heapdesc.c (condensed)
    void
    heap_desc(StringInfo buf, XLogReaderState *record)
    {
        char       *rec = XLogRecGetData(record);
        uint8       info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;

        info &= XLOG_HEAP_OPMASK;
        if (info == XLOG_HEAP_INSERT)
        {
            xl_heap_insert *xlrec = (xl_heap_insert *) rec;

            appendStringInfo(buf, "off: %u, flags: 0x%02X",
                             xlrec->offnum, xlrec->flags);
        }
        else if (info == XLOG_HEAP_DELETE)
        {
            /* ... */
        }
        /* ... */
    }

pg_waldump is a frontend program and cannot link the backend’s redo code,
so it builds its own descriptor-only table from the same rmgrlist.h, this
time expanding PG_RMGR to keep only the name/desc/identify columns:

    // RmgrDescTable — src/bin/pg_waldump/rmgrdesc.c
    #define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup,mask,decode) \
        { name, desc, identify},

    static const RmgrDescData RmgrDescTable[RM_N_BUILTIN_IDS] = {
    #include "access/rmgrlist.h"
    };

For each record pg_waldump fetches GetRmgrDesc(XLogRecGetRmid(record)),
calls rm_identify(info) for the operation name and rm_desc(&s, record) for
the detail line — the same two callbacks the table threads through. This is
why a single header (rmgrlist.h) keeps the backend’s redo dispatch and the
standalone dump tool perfectly in sync: both are generated from one list.
The generic_xlog delta engine
An extension that stores data in standard PostgreSQL pages but does not want
to write a full custom resource manager can use generic_xlog.c, the Generic
rmgr (RM_GENERIC_ID). It WAL-logs an arbitrary byte-level delta of a page
without the extension having to define record types or a redo routine — the
shared generic_redo replays any such record. The construction API is a
three-call lifecycle. GenericXLogStart allocates an I/O-aligned state holding
up to MAX_GENERIC_XLOG_PAGES page slots:

    // GenericXLogStart — src/backend/access/transam/generic_xlog.c (condensed)
    GenericXLogState *
    GenericXLogStart(Relation relation)
    {
        GenericXLogState *state;

        state = (GenericXLogState *) palloc_aligned(sizeof(GenericXLogState),
                                                    PG_IO_ALIGN_SIZE, 0);
        state->isLogged = RelationNeedsWAL(relation);
        for (i = 0; i < MAX_GENERIC_XLOG_PAGES; i++)
        {
            state->pages[i].image = state->images[i].data;
            state->pages[i].buffer = InvalidBuffer;
        }
        return state;
    }

GenericXLogRegisterBuffer takes a copy of the page into the state’s image
buffer and hands that copy back to the caller to modify in place; the caller
mutates the image, never the live buffer:

    // GenericXLogRegisterBuffer — src/backend/access/transam/generic_xlog.c (condensed)
    Page
    GenericXLogRegisterBuffer(GenericXLogState *state, Buffer buffer, int flags)
    {
        for (block_id = 0; block_id < MAX_GENERIC_XLOG_PAGES; block_id++)
        {
            GenericXLogPageData *page = &state->pages[block_id];

            if (BufferIsInvalid(page->buffer))
            {
                page->buffer = buffer;
                page->flags = flags;
                memcpy(page->image, BufferGetPage(buffer), BLCKSZ);
                return (Page) page->image;
            }
            else if (page->buffer == buffer)
                return (Page) page->image;      /* already registered */
        }
        elog(ERROR, "maximum number %d of generic xlog buffers is exceeded",
             MAX_GENERIC_XLOG_PAGES);
    }

GenericXLogFinish is where the magic happens. It diffs the original page
against the modified image to produce a compact delta, copies the image onto
the real buffer (zeroing the hole to match what replay will produce), marks
the buffer dirty, and registers the delta as per-buffer data on a Generic
record:
```c
// GenericXLogFinish: src/backend/access/transam/generic_xlog.c (condensed)
if (!(pageData->flags & GENERIC_XLOG_FULL_IMAGE))
    computeDelta(pageData, page, (Page) pageData->image);

/* apply the image, zeroing the hole between pd_lower and pd_upper */
memcpy(page, pageData->image, pageHeader->pd_lower);
memset(page + pageHeader->pd_lower, 0,
       pageHeader->pd_upper - pageHeader->pd_lower);
memcpy(page + pageHeader->pd_upper,
       pageData->image + pageHeader->pd_upper,
       BLCKSZ - pageHeader->pd_upper);
MarkBufferDirty(pageData->buffer);

if (pageData->flags & GENERIC_XLOG_FULL_IMAGE)
    XLogRegisterBuffer(i, pageData->buffer,
                       REGBUF_FORCE_IMAGE | REGBUF_STANDARD);
else
{
    XLogRegisterBuffer(i, pageData->buffer, REGBUF_STANDARD);
    XLogRegisterBufData(i, pageData->delta, pageData->deltaLen);
}
/* ... */
lsn = XLogInsert(RM_GENERIC_ID, 0);
```

The delta itself is a list of fragments: (offset, length, bytes) triples
covering only the regions that changed. computeDelta runs the diff
separately over the page’s lower part (0..pd_lower) and upper part
(pd_upper..BLCKSZ), skipping the hole entirely; computeRegionDelta does
the tight byte-matching loop, merging fragments separated by fewer than
MATCH_THRESHOLD unchanged bytes (because a fragment header costs
2*sizeof(OffsetNumber), so tiny gaps are not worth splitting). The worst-case
delta is bounded by MAX_DELTA_SIZE = BLCKSZ + 2*FRAGMENT_HEADER_SIZE. On
replay, generic_redo re-pins each block with XLogReadBufferForRedo and
applies the fragments with applyPageRedo, the mirror of writeFragment:
```c
// applyPageRedo: src/backend/access/transam/generic_xlog.c
static void
applyPageRedo(Page page, const char *delta, Size deltaSize)
{
    const char *ptr = delta;
    const char *end = delta + deltaSize;

    while (ptr < end)
    {
        OffsetNumber offset,
                    length;

        memcpy(&offset, ptr, sizeof(offset));
        ptr += sizeof(offset);
        memcpy(&length, ptr, sizeof(length));
        ptr += sizeof(length);

        memcpy(page + offset, ptr, length);

        ptr += length;
    }
}
```

After applying, generic_redo zeroes the hole on the replayed page (the delta
carries no hole bytes) and stamps the LSN — exactly mirroring what
GenericXLogFinish did on the primary, so a consistency check passes.
generic_mask (the rmgr’s rm_mask) blanks the page LSN, checksum, and
unused space before that check. The whole point of generic_xlog is that an
extension gets crash-safe, replication-safe page edits with a delta-encoded
WAL footprint and zero custom redo code — at the cost of the diff being
byte-level rather than semantic.
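The fragment encoding described above can be illustrated with a small standalone sketch. It is not PostgreSQL code: every `toy_` name is hypothetical, `uint16_t` stands in for OffsetNumber, the match threshold is assumed to equal the fragment header size, and the diff runs over one flat region rather than the separate lower and upper parts of a page.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical constants: a fragment header is an offset plus a length. */
#define TOY_HEADER_SIZE      (2 * sizeof(uint16_t))
#define TOY_MATCH_THRESHOLD  TOY_HEADER_SIZE

/* Append one (offset, length, bytes) fragment; returns the new write position. */
static size_t
toy_write_fragment(uint8_t *out, size_t pos, uint16_t offset, uint16_t length,
                   const uint8_t *bytes)
{
    memcpy(out + pos, &offset, sizeof(offset));
    pos += sizeof(offset);
    memcpy(out + pos, &length, sizeof(length));
    pos += sizeof(length);
    memcpy(out + pos, bytes, length);
    return pos + length;
}

/*
 * Diff cur against target over [0, len) and emit fragments into out.
 * A run of equal bytes shorter than TOY_MATCH_THRESHOLD does not close the
 * current fragment, because a new header would cost more than the gap.
 * out must have room for len + TOY_HEADER_SIZE bytes.
 */
static size_t
toy_compute_delta(const uint8_t *cur, const uint8_t *target, size_t len,
                  uint8_t *out)
{
    size_t      pos = 0;
    size_t      i = 0;

    assert(len <= UINT16_MAX);
    while (i < len)
    {
        size_t      start;
        size_t      end;

        if (cur[i] == target[i])
        {
            i++;                /* skip bytes that already match */
            continue;
        }
        start = i;
        end = i + 1;            /* one past the last differing byte so far */
        i = end;
        while (i < len)
        {
            if (cur[i] != target[i])
                end = ++i;      /* extend the fragment over the gap */
            else if (i - end + 1 >= TOY_MATCH_THRESHOLD)
                break;          /* gap is long enough: close the fragment */
            else
                i++;
        }
        pos = toy_write_fragment(out, pos, (uint16_t) start,
                                 (uint16_t) (end - start), target + start);
    }
    return pos;
}

/* Replay side: copy each fragment's bytes to its offset. */
static void
toy_apply_delta(uint8_t *page, const uint8_t *delta, size_t delta_size)
{
    const uint8_t *ptr = delta;
    const uint8_t *end = delta + delta_size;

    while (ptr < end)
    {
        uint16_t    offset;
        uint16_t    length;

        memcpy(&offset, ptr, sizeof(offset));
        ptr += sizeof(offset);
        memcpy(&length, ptr, sizeof(length));
        ptr += sizeof(length);
        memcpy(page + offset, ptr, length);
        ptr += length;
    }
}
```

With a 32-byte region where bytes 3, 5, and 20 differ, this sketch emits two fragments: one covering offsets 3 to 5 (the one-byte gap is cheaper to carry than a second header) and one covering offset 20. Applying the delta to the old bytes reproduces the new ones.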
Position hints (as of 2026-06-05, REL_18 273fe94)
| Symbol | File | Line |
|---|---|---|
| PG_RMGR list (rmgr definitions) | src/include/access/rmgrlist.h | 28–49 |
| RmgrIds enum (PG_RMGR → enumerator) | src/include/access/rmgr.h | 22–30 |
| RM_MAX_ID / RM_MIN_CUSTOM_ID / RM_EXPERIMENTAL_ID | src/include/access/rmgr.h | 33–60 |
| RmgrIdIsBuiltin / RmgrIdIsCustom | src/include/access/rmgr.h | 41–53 |
| RmgrData struct | src/include/access/xlog_internal.h | 349–360 |
| GetRmgr / RmgrIdExists | src/include/access/xlog_internal.h | 369–382 |
| RmgrTable[] definition | src/backend/access/transam/rmgr.c | 47–52 |
| RmgrStartup | src/backend/access/transam/rmgr.c | 58 |
| RmgrCleanup | src/backend/access/transam/rmgr.c | 74 |
| RmgrNotFound | src/backend/access/transam/rmgr.c | 91 |
| RegisterCustomRmgr | src/backend/access/transam/rmgr.c | 107 |
| pg_get_wal_resource_managers | src/backend/access/transam/rmgr.c | 150 |
| XLogRecord struct | src/include/access/xlogrecord.h | 41–53 |
| XLR_INFO_MASK / XLR_RMGR_INFO_MASK | src/include/access/xlogrecord.h | 62–63 |
| XLR_SPECIAL_REL_UPDATE / XLR_CHECK_CONSISTENCY | src/include/access/xlogrecord.h | 82–91 |
| XLogRecordBlockHeader | src/include/access/xlogrecord.h | 103–113 |
| XLogRecordBlockImageHeader + BKPIMAGE_* | src/include/access/xlogrecord.h | 141–167 |
| BKPBLOCK_* flags | src/include/access/xlogrecord.h | 196–202 |
| XLogReadBufferForRedo | src/backend/access/transam/xlogutils.c | 303 |
| XLogInitBufferForRedo | src/backend/access/transam/xlogutils.c | 315 |
| XLogReadBufferForRedoExtended | src/backend/access/transam/xlogutils.c | 340 |
| XLogReadBufferExtended | src/backend/access/transam/xlogutils.c | 460 |
| log_invalid_page | src/backend/access/transam/xlogutils.c | 101 |
| XLogCheckInvalidPages | src/backend/access/transam/xlogutils.c | 234 |
| GenericXLogState / GenericXLogPageData | src/backend/access/transam/generic_xlog.c | 49–71 |
| GenericXLogStart | src/backend/access/transam/generic_xlog.c | 269 |
| GenericXLogRegisterBuffer | src/backend/access/transam/generic_xlog.c | 299 |
| GenericXLogFinish | src/backend/access/transam/generic_xlog.c | 337 |
| computeDelta / computeRegionDelta / writeFragment | src/backend/access/transam/generic_xlog.c | 90 / 121 / 228 |
| generic_redo / applyPageRedo / generic_mask | src/backend/access/transam/generic_xlog.c | 478 / 453 / 539 |
| MAX_GENERIC_XLOG_PAGES / GENERIC_XLOG_FULL_IMAGE | src/include/access/generic_xlog.h | 23 / 26 |
| heap_redo (opcode dispatch) | src/backend/access/heap/heapam_xlog.c | 1181 |
| heap_desc / heap_identify | src/backend/access/rmgrdesc/heapdesc.c | 184 / 389 |
| RmgrDescTable (frontend) | src/bin/pg_waldump/rmgrdesc.c | 35–40 |
Source verification (as of 2026-06-05)
Verified against the REL_18_STABLE checkout at commit 273fe94 under
/data/hgryoo/references/postgres.
- rmgr set and ids. rmgrlist.h defines exactly 22 PG_RMGR entries (ids 0–21): XLOG, Transaction, Storage, CLOG, Database, Tablespace, MultiXact, RelMap, Standby, Heap2, Heap, Btree, Hash, Gin, Gist, Sequence, SPGist, BRIN, CommitTs, ReplicationOrigin, Generic, LogicalMessage. Confirmed there is no XLOG2 rmgr in this revision (it is a later-version addition); the list ends at RM_LOGICALMSG_ID. RM_MAX_BUILTIN_ID is therefore RM_NEXT_ID - 1 = 21.
- X-macro triple use. Verified rmgrlist.h has no include guard and is #include-d under three different PG_RMGR definitions: enumerator (rmgr.h), full struct initializer (rmgr.c), and {name,desc,identify} triple (pg_waldump/rmgrdesc.c). The rmgr.c macro comment “must be kept in sync with RmgrData definition in xlog_internal.h” matches the 8-field RmgrData struct order verbatim.
- RmgrData callback order. The struct in xlog_internal.h lists rm_name, rm_redo, rm_desc, rm_identify, rm_startup, rm_cleanup, rm_mask, rm_decode, identical to the PG_RMGR argument order after the leading symname argument, confirmed by reading both. The rmgr.c initializer drops only symname (name is the first field, callbacks follow).
- XLogRecord header. Confirmed the six fields and the 2-byte padding comment in xlogrecord.h; SizeOfXLogRecord = offsetof(XLogRecord, xl_crc) + sizeof(pg_crc32c). xl_rmid is RmgrId = uint8.
- opcode masking. Verified XLR_INFO_MASK = 0x0F and XLR_RMGR_INFO_MASK = 0xF0, and that heap_redo masks with & ~XLR_INFO_MASK then & XLOG_HEAP_OPMASK before switching, matching the prose’s “low nibble reserved, high nibble rmgr” claim.
- block flags / FPI. Confirmed BKPBLOCK_HAS_IMAGE / HAS_DATA / WILL_INIT / SAME_REL (0x10 / 0x20 / 0x40 / 0x80) and BKPIMAGE_HAS_HOLE / APPLY / COMPRESS_* bit values in xlogrecord.h.
- redo buffer return values. XLogReadBufferForRedoExtended returns BLK_RESTORED (FPI applied), BLK_NEEDS_REDO (lsn > PageGetLSN), BLK_DONE (lsn <= PageGetLSN), BLK_NOTFOUND (no buffer); read directly from the function body.
- generic_xlog limits. MAX_GENERIC_XLOG_PAGES is XLR_NORMAL_MAX_BLOCK_ID and GENERIC_XLOG_FULL_IMAGE = 0x0001 in generic_xlog.h; MAX_DELTA_SIZE = BLCKSZ + 2 * FRAGMENT_HEADER_SIZE and FRAGMENT_HEADER_SIZE = 2 * sizeof(OffsetNumber) in generic_xlog.c.
- custom rmgr range. RM_MIN_CUSTOM_ID = 128, RM_MAX_CUSTOM_ID = UINT8_MAX, RM_EXPERIMENTAL_ID = 128; RegisterCustomRmgr enforces all four invariants quoted (range, preload-in-progress, no duplicate id, no duplicate name); read directly.
- Line numbers in the position-hint table were captured from this checkout on 2026-06-05; they are hints and will drift with reformatting. Symbol names are the durable anchors.
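The opcode-masking item can be made concrete with a short sketch of how one xl_info byte splits into its two nibbles. The two XLR_* masks are the values verified above. The heap constants (XLOG_HEAP_OPMASK = 0x70, XLOG_HEAP_INIT_PAGE = 0x80, XLOG_HEAP_UPDATE = 0x20) are stated from memory of heapam_xlog.h and should be treated as assumptions, and the `Toy` type and function are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define XLR_INFO_MASK        0x0F   /* low nibble: reserved for the WAL core */
#define XLR_RMGR_INFO_MASK   0xF0   /* high nibble: owned by the rmgr */

/* Assumed heap rmgr values; check heapam_xlog.h before relying on them. */
#define XLOG_HEAP_OPMASK     0x70
#define XLOG_HEAP_INIT_PAGE  0x80
#define XLOG_HEAP_UPDATE     0x20

/* Hypothetical holder for the decoded pieces of one info byte. */
typedef struct ToyHeapInfo
{
    uint8_t     opcode;         /* which heap operation to redo */
    int         init_page;      /* nonzero if the page is re-initialized */
    uint8_t     core_flags;     /* bits the WAL core keeps for itself */
} ToyHeapInfo;

static ToyHeapInfo
toy_decode_heap_info(uint8_t xl_info)
{
    ToyHeapInfo d;
    uint8_t     rmgr_bits;

    /* The two masks partition the byte with no overlap. */
    assert((XLR_INFO_MASK | XLR_RMGR_INFO_MASK) == 0xFF);

    /* Step 1: strip the core's nibble, as heap_redo does. */
    rmgr_bits = (uint8_t) (xl_info & ~XLR_INFO_MASK);

    /* Step 2: separate the opcode from the init-page modifier bit. */
    d.opcode = (uint8_t) (rmgr_bits & XLOG_HEAP_OPMASK);
    d.init_page = (rmgr_bits & XLOG_HEAP_INIT_PAGE) != 0;
    d.core_flags = (uint8_t) (xl_info & XLR_INFO_MASK);
    return d;
}
```

Under these assumed values, an info byte of 0xA2 decodes to the update opcode with the init-page bit set and one core flag (0x02) in the low nibble.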
Beyond PostgreSQL — Comparative Designs & Research Frontiers
PostgreSQL’s rmgr table is a particularly clean instance of a pattern that recurs across every ARIES-family engine, but the choices it makes (redo-only logging, a compile-time X-macro dispatch table, physiological per-page records with optional FPIs) sit at one corner of a larger design space. Placing it against its peers and the research literature sharpens what is essential versus incidental.
ARIES and the resource-manager abstraction. The original ARIES paper
(Mohan et al. 1992, captured in knowledge/research/dbms-papers/aries.md)
already factors recovery into a generic driver plus per-resource-manager redo
and undo logic; IBM’s DB2 and the System R lineage
(knowledge/research/dbms-papers/systemr.md) embody it directly. The deepest
divergence is that classic ARIES is a redo-undo log: every update logs both
how to reapply and how to roll back, and rollback writes compensation log
records (CLRs) whose UndoNxtLSN lets recovery skip already-undone work and
guarantees termination even if the system crashes mid-rollback. PostgreSQL
discards the entire undo half. Because MVCC leaves an aborted transaction’s
tuples physically in place and merely marks the xid aborted in pg_xact, there
is nothing to physically undo, so RmgrData carries rm_redo but no
rm_undo, and recovery is a single forward pass with no analysis/undo phases
in the ARIES sense. The cost is paid elsewhere — bloat from dead tuples,
deferred to VACUUM (postgres-vacuum.md) — but the recovery code shrinks
dramatically. This is the single most consequential architectural difference
between PostgreSQL’s WAL subsystem and a textbook ARIES implementation.
Undo-based engines and the road not taken. Oracle and MySQL/InnoDB take
the opposite stance: they maintain an explicit undo/rollback segment, so
old row versions are reconstructed on demand from undo and the main heap holds
only the latest version. Their redo logs (Oracle redo records of change
vectors tagged by layer.opcode; InnoDB mini-transaction records tagged by
mlog type) still multiplex many subsystems through a small type tag, exactly
as xl_rmid does, but recovery must run both a redo pass and an undo pass that
applies the rollback segment. PostgreSQL’s own community has repeatedly
proposed an undo-based storage engine (the zheap project) precisely to escape
MVCC bloat; notably, zheap would have reintroduced an undo log and therefore
something like the CLR machinery ARIES describes — a reminder that PostgreSQL’s
redo-only simplicity is a direct consequence of its heap’s append-only version
storage, not a free lunch.
Physiological logging and FPIs as a shared optimum. The decision to log
physical-to-a-page, logical-within-it records is close to universal: Oracle,
SQL Server, and InnoDB all do some form of it. The torn-page defense varies
more. PostgreSQL adds a full-page image the first time a page is modified
after a checkpoint, whereas InnoDB relies on its doublewrite buffer instead of
logging page images. PostgreSQL’s refinements (stripping the
pd_lower/pd_upper hole and optionally compressing the image with
pglz/lz4/zstd) are engineering rather than architecture, but they materially
shrink WAL volume, which dominates replication bandwidth and archive cost.
Database Internals (Petrov 2019,
knowledge/research/dbms-papers/ / raw/dbms/textbooks/Database Internals.pdf)
surveys this design space and frames FPIs as the standard answer to the absence
of atomic page writes on commodity storage.
Compile-time dispatch versus pluggable registries. The X-macro
construction of RmgrTable[] is a deliberately static design: the built-in
set is frozen at compile time and the wire format is literally the order of
lines in rmgrlist.h. The addition of custom rmgrs in PostgreSQL 15 (ids 128–255 via
RegisterCustomRmgr) bolts a small dynamic registry onto the static core,
which is why custom registration is hemmed in by so many invariants — it must
happen during shared_preload_libraries init, before any WAL is read, so the
table is effectively immutable again by the time recovery runs. This is a
narrower extensibility contract than, say, SQLite’s VFS layer or a fully
plugin-based log-record registry, and the narrowness is the point: a resource
manager participates in crash recovery, where a mismatched or missing entry is
silent data loss, so the system trades flexibility for the guarantee that a
WAL stream means exactly one thing. The generic_xlog machinery is the
pressure valve: an extension that only needs crash-safe edits of standard
pages gets them with zero custom redo code, at the price of byte-level rather
than semantic deltas.
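The static-table-plus-small-registry shape can be sketched in a single file. This is an illustration under stated assumptions, not PostgreSQL's code: all `toy_`/`TOY_` names are hypothetical, a list macro replaces the re-included rmgrlist.h header so the example fits in one file, only a redo callback is modeled, and the preload-in-progress check is omitted.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef int (*toy_redo_fn) (int payload);

static int toy_xlog_redo(int payload)  { return payload + 1; }
static int toy_heap_redo(int payload)  { return payload * 2; }
static int toy_btree_redo(int payload) { return payload - 1; }

/* One list, expanded twice below. Line order defines the numeric ids. */
#define TOY_RMGR_LIST(X) \
    X(TOY_RM_XLOG_ID,  "XLOG",  toy_xlog_redo) \
    X(TOY_RM_HEAP_ID,  "Heap",  toy_heap_redo) \
    X(TOY_RM_BTREE_ID, "Btree", toy_btree_redo)

/* Expansion 1: enumerators. */
#define TOY_AS_ENUM(sym, name, redo) sym,
typedef enum ToyRmgrIds
{
    TOY_RMGR_LIST(TOY_AS_ENUM)
    TOY_RM_NEXT_ID
} ToyRmgrIds;
#undef TOY_AS_ENUM

typedef struct ToyRmgrData
{
    const char *rm_name;
    toy_redo_fn rm_redo;
} ToyRmgrData;

#define TOY_MIN_CUSTOM_ID 128
#define TOY_N_IDS         256

/* Expansion 2: table initializer; the symbol column is dropped. */
#define TOY_AS_ENTRY(sym, name, redo) { name, redo },
static ToyRmgrData toy_rmgr_table[TOY_N_IDS] = {
    TOY_RMGR_LIST(TOY_AS_ENTRY)
};
#undef TOY_AS_ENTRY

/* Dynamic part: returns 0 on success, -1 if an invariant is violated. */
static int
toy_register_custom(int rmid, const char *name, toy_redo_fn redo)
{
    int         i;

    if (rmid < TOY_MIN_CUSTOM_ID || rmid >= TOY_N_IDS)
        return -1;              /* outside the custom range */
    if (toy_rmgr_table[rmid].rm_name != NULL)
        return -1;              /* duplicate id */
    for (i = 0; i < TOY_N_IDS; i++)
    {
        if (toy_rmgr_table[i].rm_name != NULL &&
            strcmp(toy_rmgr_table[i].rm_name, name) == 0)
            return -1;          /* duplicate name */
    }
    toy_rmgr_table[rmid].rm_name = name;
    toy_rmgr_table[rmid].rm_redo = redo;
    return 0;
}

/* Replay-side dispatch: a plain array index on the record's rmgr id. */
static int
toy_dispatch(int rmid, int payload)
{
    assert(rmid >= 0 && rmid < TOY_N_IDS);
    assert(toy_rmgr_table[rmid].rm_redo != NULL);
    return toy_rmgr_table[rmid].rm_redo(payload);
}
```

Because both expansions come from the same list, reordering or inserting a line changes the enumerator and the table slot together, which is the property the static design depends on. The registry rejects anything that would make an id ambiguous, so the table is stable again once registration ends.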
Research frontiers. Three threads are worth flagging. First, log
shipping and scale-out: because the rmgr layer makes the WAL a clean,
self-describing stream, it doubles as the substrate for both physical
replication and logical decoding (the rm_decode callbacks), and systems like
Amazon Aurora push this further by treating the redo log itself as the
durability boundary — “the log is the database” — shipping WAL to a
storage tier that materializes pages lazily. Second, non-volatile memory and
shorter logs: a line of research asks whether persistent memory lets engines
shrink or eliminate FPIs and even the redo log, since byte-addressable
durability changes the torn-page calculus that motivated FPIs in the first
place. Third, WAL as a contention bottleneck: the single ordered log is a
serialization point, and just as the lock manager became a multicore
bottleneck addressed by partitioning
(knowledge/research/dbms-papers/scalable-lock-manager.md), the WAL insert
path is partitioned in PostgreSQL via multiple insertion locks
(postgres-xlog-wal.md) — an ongoing tension between the strict total order
ARIES assumes and the parallelism modern hardware demands. The rmgr table
itself is not the bottleneck — it is a pure dispatch on a record already
parsed — but it sits directly downstream of these scalability questions.
Sources
- Code (REL_18_STABLE, commit 273fe94, under /data/hgryoo/references/postgres):
  - src/include/access/rmgrlist.h: the PG_RMGR X-macro list (the 22 built-in resource managers and their callback columns).
  - src/include/access/rmgr.h: RmgrIds enum, RM_MAX_ID / RM_MIN_CUSTOM_ID / RM_EXPERIMENTAL_ID, RmgrIdIsBuiltin / RmgrIdIsCustom.
  - src/backend/access/transam/rmgr.c: RmgrTable[], RmgrStartup / RmgrCleanup, RmgrNotFound, RegisterCustomRmgr, pg_get_wal_resource_managers.
  - src/include/access/xlog_internal.h: RmgrData struct, GetRmgr / RmgrIdExists accessors.
  - src/include/access/xlogrecord.h: XLogRecord header, XLogRecordBlockHeader, XLogRecordBlockImageHeader, the BKPBLOCK_* / BKPIMAGE_* / XLR_* flag families.
  - src/backend/access/transam/xlogutils.c: XLogReadBufferForRedo / …Extended, XLogReadBufferExtended, log_invalid_page / XLogCheckInvalidPages.
  - src/backend/access/transam/generic_xlog.c + src/include/access/generic_xlog.h: the Generic rmgr delta engine (GenericXLogStart / RegisterBuffer / Finish, computeDelta, applyPageRedo, generic_redo, generic_mask).
  - src/backend/access/heap/heapam_xlog.c: heap_redo opcode dispatch (illustrative of the rmgr rm_redo pattern).
  - src/backend/access/rmgrdesc/heapdesc.c: heap_desc / heap_identify descriptor pair.
  - src/bin/pg_waldump/rmgrdesc.c: the frontend RmgrDescTable built from the same rmgrlist.h.
  - src/backend/access/transam/README: the canonical narrative of WAL record construction, buffer registration, and full-page-image rules.
- Theory / textbooks:
  - knowledge/research/dbms-papers/aries.md: Mohan et al. 1992, ARIES; WAL invariant, LSN/pageLSN, redo/undo, CLRs, resource managers.
  - knowledge/research/dbms-papers/systemr.md: the System R recovery lineage that ARIES generalizes.
  - knowledge/research/dbms-papers/scalable-lock-manager.md: multicore partitioning of a single serialization point (analogy for WAL insert locks).
  - raw/dbms/textbooks/Database Internals.pdf (Petrov 2019): WAL, full-page writes, buffer management, recovery in the broader engine landscape.
- Cross-references within this knowledge base:
  - postgres-xlog-wal.md: WAL insertion, insert locks, segment/flush machinery (the emit side this doc defers to).
  - postgres-recovery-redo.md: the recovery driver loop, checkpoints, restartpoints (the replay driver this doc defers to).
  - postgres-xact.md: transaction commit/abort records (the Transaction rmgr).
  - postgres-heap-am.md, postgres-nbtree.md: the heap and btree rmgrs whose redo routines this doc uses as worked examples.
  - postgres-buffer-manager.md: the shared-buffer layer XLogReadBufferExtended drives during replay.
  - postgres-logical-decoding.md: the rm_decode callback consumers.