PostgreSQL WAL Records & Resource Managers — The rmgr Table and Record Anatomy

A write-ahead log is the spine of crash recovery in every disk-based relational engine, and its correctness rests on a single rule made famous by the ARIES paper (Mohan et al. 1992, “ARIES: A Transaction Recovery Method Supporting Fine-Granularity Locking and Partial Rollbacks Using Write-Ahead Logging”, captured in knowledge/research/dbms-papers/aries.md): the log record describing a page change must reach stable storage before the changed data page does (the “write-ahead” invariant). Given that rule, a database that crashed with dirty buffers still un-flushed can be brought back to a transactionally consistent state by replaying the log forward from the last checkpoint — the redo pass — and then undoing the effects of transactions that never committed — the undo pass. ARIES ties the two passes together with three ideas the reader should keep in mind throughout this document: every log record has a monotonically increasing LSN (log sequence number); every data page stamps the LSN of the last record that modified it (pageLSN), so replay can tell at a glance whether a record’s effect is already present on the page; and recovery is repeatable — replaying the same record twice produces the same page, because the redo logic is idempotent with respect to pageLSN.

ARIES says nothing about how the log is physically formatted, and that is exactly the gap this document fills. A real engine logs changes from dozens of independent subsystems — the heap, every index access method, the free-space and visibility maps, the commit log, multixact, sequences, tablespace and database creation, replication bookkeeping — and every one of them needs to emit records and, at recovery time, redo them. Two structural questions follow. First, how is a single physical log shared by many subsystems? The log is one append-only byte stream, but a heap INSERT and a btree page split are utterly different operations with different payloads. Something must tag each record with “who wrote this” and route it to the right replay code. Second, what is the on-disk shape of one record such that recovery — running in a fresh process with no in-memory state — can parse it, find the pages it touched, and reconstruct enough to redo the change exactly?

The classical answer to the first question is the resource manager abstraction, which ARIES itself names: the recovery driver is generic and knows only about LSNs, log scanning, and the dirty-page/transaction tables; it delegates the semantics of each record to a per-subsystem module — a “resource manager” — that knows how to redo and undo its own record types. The recovery manager is thus a thin dispatcher over a table of resource managers, indexed by a small integer stored in every record. PostgreSQL adopts this design almost verbatim, with one large simplification discussed below (it has no per-record undo pass). The answer to the second question is a self-describing record format: a fixed header common to all record types, followed by a sequence of length-prefixed, tagged chunks — block references and an opaque main-data blob — that a generic parser can walk without knowing the record’s meaning, while the owning resource manager interprets the bytes.

The other concept the WAL format must encode is the full-page image (FPI), sometimes called a full-page write. Incremental redo — “set 4 bytes at offset 120 of block 17” — only works if the underlying page was not left in a torn (partially written) state by the crash, but commodity storage gives no atomicity guarantee larger than a sector. ARIES-family systems defend against torn pages by logging, the first time a page is touched after a checkpoint, a complete copy of the page; replay restores that copy wholesale instead of applying the increment. The WAL record format must therefore be able to carry, per modified block, an optional full image — and, because pages have a large run of zero bytes in the middle, it pays to strip that “hole” before storing the image. All three ideas — record tagging, self-describing block references, and optional FPIs — are visible directly in PostgreSQL’s XLogRecord and its block-reference sub-headers.
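
The first-touch-after-checkpoint rule can be sketched as a single comparison. This is a minimal illustration, not PostgreSQL's code: the function name, the `ToyLsn` type, and the parameter names are hypothetical, and the `<=` test against the checkpoint's redo pointer is an assumption based on how the insertion path is usually described.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t ToyLsn;        /* hypothetical stand-in for an LSN */

/*
 * Decide whether a page modification must carry a full-page image.
 * A page whose pageLSN is not newer than the latest checkpoint's redo
 * pointer has not been logged since that checkpoint, so an incremental
 * record alone could not repair a torn copy of it.
 */
bool
toy_needs_full_page_image(ToyLsn page_lsn, ToyLsn checkpoint_redo_lsn,
                          bool full_page_writes)
{
    return full_page_writes && page_lsn <= checkpoint_redo_lsn;
}
```

Once a record with an image has been written, the page is stamped with that record's LSN, which is newer than the redo pointer, so later changes to the same page log only increments until the next checkpoint moves the pointer forward.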

Across ARIES-family engines a recognizable set of conventions has settled around WAL record formatting and dispatch, and PostgreSQL sits squarely inside them while making a few characteristic choices.

A small integer names the owning subsystem. Oracle calls its log records redo records composed of change vectors, each tagged with an opcode (layer.subcode) identifying the layer that produced it; SQL Server tags each log record with an operation and a context; InnoDB tags each mini-transaction log record with an mlog type byte. In every case a single small field routes the record to the code that understands it. PostgreSQL’s version is the 8-bit xl_rmid field plus a 4-bit private opcode carved out of xl_info; the rmid indexes a static dispatch table.

A common header, then opaque payload. Universally the record begins with a fixed-layout header — total length, the transaction id, a back-pointer or length to the previous record, a checksum, and the resource-manager/operation tag — after which the bytes are the private business of the owning module. The header is what the generic log scanner needs; the payload is what the resource manager needs.

Per-page logging with physiological redo. ARIES coined physiological logging: physical to a page, logical within it. A record names a specific page (physical) but describes the change in terms of slots/offsets rather than raw byte positions (logical within the page), so the change survives the page being reorganized by an unrelated operation. PostgreSQL’s block references carry the physical page identity — relation, fork, block — while the rmgr’s main-data blob describes the change logically (e.g., “insert this tuple at this offset”).

FPIs to defeat torn pages. The first-touch-after-checkpoint full-page image is near-universal: Oracle, SQL Server, MySQL/InnoDB (“doublewrite” is a different but complementary mechanism), and PostgreSQL all log full pages to make redo robust against partial writes. The engineering refinements differ — PostgreSQL strips the inter-pd_lower/pd_upper hole and optionally compresses the image (pglz/lz4/zstd).

A descriptor/dump facility. Production engines ship a tool to render the log human-readably (Oracle LogMiner, SQL Server fn_dblog, InnoDB’s innodb_log parsing); the per-record-type formatting logic naturally lives beside the redo logic. PostgreSQL factors this into desc and identify callbacks that pg_waldump calls.

Where PostgreSQL diverges most sharply is the absence of logical undo in the log. Classic ARIES logs both redo and undo information and writes compensation log records (CLRs) during rollback. PostgreSQL’s MVCC storage makes an aborted transaction’s tuples simply invisible (the xact is marked aborted in pg_xact), so there is no need to physically undo row changes; the WAL is redo-only. Consequently RmgrData has a rm_redo callback but no rm_undo, and recovery is a single forward redo pass with no analysis-of-losers/undo phase in the ARIES sense. (Rollback of an in-flight transaction during normal operation is handled by MVCC + subtransaction abort, not by the log.) This is the single biggest structural reason PostgreSQL’s recovery code is dramatically simpler than a textbook ARIES implementation.

The flip side of redo-only is that the resource-manager callback set is lean. The next diagram shows the conceptual layering every ARIES-family engine shares, with PostgreSQL’s concrete names.

flowchart TD
    subgraph emit["Emit side (normal running backend)"]
        A["subsystem code<br/>(heapam, nbtinsert, ...)"] --> B["XLogBeginInsert /<br/>XLogRegisterBuffer /<br/>XLogRegisterData"]
        B --> C["XLogInsert(rmid, info)<br/>assembles XLogRecord"]
        C --> D["WAL byte stream<br/>(one shared log)"]
    end
    subgraph replay["Replay side (startup process)"]
        D --> E["XLogReader<br/>parses header + chunks"]
        E --> F["RmgrTable[xl_rmid].rm_redo(record)"]
        F --> G["XLogReadBufferForRedo<br/>re-pin each block ref"]
        G --> H["apply change iff<br/>pageLSN &lt; record LSN"]
    end
    subgraph tools["Tooling"]
        E --> I["rm_identify(info) -> name"]
        E --> J["rm_desc(buf, record) -> detail"]
        I --> K["pg_waldump output"]
        J --> K
    end

PostgreSQL builds the resource-manager dispatch table at compile time out of a single list and an X-macro. The list lives in rmgrlist.h, a header deliberately written without an include guard so it can be #include-d multiple times, each time with a different definition of the PG_RMGR(...) macro. The list itself is just a sequence of PG_RMGR invocations, one per resource manager, in a fixed order — and that order is the wire format, because each rmgr’s numeric id is its position in the list:

// rmgrlist.h — src/include/access/rmgrlist.h
/* symbol name, textual name, redo, desc, identify, startup, cleanup, mask, decode */
PG_RMGR(RM_XLOG_ID, "XLOG", xlog_redo, xlog_desc, xlog_identify, NULL, NULL, NULL, xlog_decode)
PG_RMGR(RM_XACT_ID, "Transaction", xact_redo, xact_desc, xact_identify, NULL, NULL, NULL, xact_decode)
PG_RMGR(RM_SMGR_ID, "Storage", smgr_redo, smgr_desc, smgr_identify, NULL, NULL, NULL, NULL)
/* ... 17 more ... */
PG_RMGR(RM_GENERIC_ID, "Generic", generic_redo, generic_desc, generic_identify, NULL, NULL, generic_mask, NULL)
PG_RMGR(RM_LOGICALMSG_ID, "LogicalMessage", logicalmsg_redo, logicalmsg_desc, logicalmsg_identify, NULL, NULL, NULL, logicalmsg_decode)

The header’s own comment states the contract bluntly: “order of entries defines the numerical values of each rmgr’s ID, which is stored in WAL records. New entries should be added at the end.” Reordering the list would silently change what an existing WAL stream means, so additions are append-only and any change “possibly need[s] an XLOG_PAGE_MAGIC bump.” On REL_18 the built-in set has 22 named entries occupying ids 0–21 (RM_XLOG_ID through RM_LOGICALMSG_ID). Note there is no XLOG2 rmgr; that is a later-version addition and does not exist here.

The same header is consumed three ways. In rmgr.h the macro expands to a bare enumerator, producing the RmgrIds enum whose values are exactly the ids referenced from the list:

// rmgr.h — src/include/access/rmgr.h
#define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup,mask,decode) \
symname,
typedef enum RmgrIds
{
#include "access/rmgrlist.h"
RM_NEXT_ID
} RmgrIds;

In rmgr.c the macro expands to a struct initializer, materializing the actual dispatch array RmgrTable[]. This is the single most important data structure in the document:

// RmgrTable — src/backend/access/transam/rmgr.c
#define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup,mask,decode) \
{ name, redo, desc, identify, startup, cleanup, mask, decode },
RmgrData RmgrTable[RM_MAX_ID + 1] = {
#include "access/rmgrlist.h"
};

RM_MAX_ID is UINT8_MAX (255), so the table has room for 256 entries even though only 22 are built in; the gap from 22 to 127 is reserved-but-unused, and 128–255 are for custom (extension) resource managers (more below). An entry whose rm_name is NULL is “not registered,” which is how the RmgrIdExists predicate distinguishes live slots from holes.
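
The X-macro pattern described above can be tried in isolation. The sketch below is a toy: it keeps the list in a macro instead of a re-includable header, has three made-up entries with one callback each, and every `Toy`/`toy_` name is hypothetical. It only illustrates how one list yields both the id enum and the dispatch array, and how a NULL name marks an unregistered slot.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical list; in PostgreSQL this role is played by rmgrlist.h. */
#define TOY_RMGR_LIST(X) \
    X(TOY_RM_XLOG_ID,  "XLOG",  toy_xlog_redo) \
    X(TOY_RM_HEAP_ID,  "Heap",  toy_heap_redo) \
    X(TOY_RM_BTREE_ID, "Btree", toy_btree_redo)

typedef void (*toy_redo_fn) (int *counter);

/* Stand-in redo callbacks that just bump a counter by a distinct amount. */
void toy_xlog_redo(int *counter)  { *counter += 1; }
void toy_heap_redo(int *counter)  { *counter += 10; }
void toy_btree_redo(int *counter) { *counter += 100; }

/* Expansion 1: each entry becomes an enumerator, so position == id. */
#define TOY_AS_ENUM(sym, name, redo) sym,
typedef enum ToyRmgrId
{
    TOY_RMGR_LIST(TOY_AS_ENUM)
    TOY_RM_NEXT_ID
} ToyRmgrId;
#undef TOY_AS_ENUM

#define TOY_RM_MAX_ID 255

typedef struct ToyRmgrData
{
    const char *rm_name;
    toy_redo_fn rm_redo;
} ToyRmgrData;

/* Expansion 2: each entry becomes a row; unlisted slots stay zeroed. */
#define TOY_AS_ROW(sym, name, redo) { name, redo },
ToyRmgrData ToyRmgrTable[TOY_RM_MAX_ID + 1] = {
    TOY_RMGR_LIST(TOY_AS_ROW)
};
#undef TOY_AS_ROW

/* A slot is live only if it has a name, mirroring the NULL-name rule. */
int
toy_rmgr_exists(int rmid)
{
    return rmid >= 0 && rmid <= TOY_RM_MAX_ID &&
        ToyRmgrTable[rmid].rm_name != NULL;
}
```

Because both expansions walk the same list in the same order, the enumerator value and the array index cannot drift apart, which is the property the append-only rule protects.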

Each row is a RmgrData, a struct of eight fields — one name and seven function pointers — documented in xlog_internal.h:

// RmgrData — src/include/access/xlog_internal.h
typedef struct RmgrData
{
const char *rm_name;
void (*rm_redo) (XLogReaderState *record);
void (*rm_desc) (StringInfo buf, XLogReaderState *record);
const char *(*rm_identify) (uint8 info);
void (*rm_startup) (void);
void (*rm_cleanup) (void);
void (*rm_mask) (char *pagedata, BlockNumber blkno);
void (*rm_decode) (struct LogicalDecodingContext *ctx,
struct XLogRecordBuffer *buf);
} RmgrData;

The callbacks divide cleanly by purpose. rm_redo is the heart of recovery: given a parsed record, reapply its change. rm_desc and rm_identify are the descriptor pair pg_waldump uses — rm_identify turns the opcode bits into a short name like "INSERT", and rm_desc appends record-specific detail to a string buffer. rm_startup and rm_cleanup let an rmgr allocate and free recovery-time scratch state (only btree, gin, gist, spgist use them — for their “incomplete action” tracking). rm_mask supports wal_consistency_checking: it blanks out non-deterministic page bits (hint bits, free space) before recovery compares a redone page against the FPI in the record. rm_decode is the hook into logical decoding (only the rmgrs whose changes are logically meaningful — XLOG, Transaction, Standby, Heap, Heap2, LogicalMessage — provide one). A NULL in any slot simply means “this rmgr has nothing to do here.”

Looking down the rmgrlist.h columns reveals the pattern at a glance: almost every rmgr supplies redo/desc/identify; only the four index AMs with multi-record split protocols supply startup/cleanup; only the storage AMs (Heap, Heap2, the indexes, Sequence, Generic) supply mask; and only the six logical-decoding-relevant rmgrs supply decode. The next diagram captures the demultiplexing flow from a raw record to a callback.

flowchart TD
    R["XLogRecord on disk<br/>xl_rmid = N, xl_info = 0xI0 | flags"] --> P["XLogReader: parse header,<br/>block refs, main data"]
    P --> IDX["rmid = XLogRecGetRmid(record)<br/>info = XLogRecGetInfo &amp; ~XLR_INFO_MASK"]
    IDX --> T{"RmgrTable[rmid]"}
    T --> RD["rm_redo(record)"]
    RD --> SW["switch (info &amp; OPMASK)"]
    SW --> OP1["heap_xlog_insert"]
    SW --> OP2["heap_xlog_update"]
    SW --> OP3["heap_xlog_delete"]
    OP1 --> BUF["XLogReadBufferForRedo(record, block_id)"]
    OP2 --> BUF
    OP3 --> BUF
    BUF --> ACT{"action"}
    ACT --> NR["BLK_NEEDS_REDO:<br/>apply &amp; PageSetLSN"]
    ACT --> RST["BLK_RESTORED:<br/>FPI already restored"]
    ACT --> DN["BLK_DONE:<br/>pageLSN &gt;= record, skip"]
    ACT --> NF["BLK_NOTFOUND:<br/>page gone, skip"]

A redo routine’s first act is always to recover its private opcode from xl_info. The 4 low bits of xl_info are reserved (XLR_INFO_MASK = 0x0F) for the WAL machinery’s own flags; the 4 high bits (XLR_RMGR_INFO_MASK = 0xF0) belong to the rmgr. The canonical idiom, from heap_redo, masks off the reserved bits and switches on the remainder:

// heap_redo — src/backend/access/heap/heapam_xlog.c
void
heap_redo(XLogReaderState *record)
{
uint8 info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
switch (info & XLOG_HEAP_OPMASK)
{
case XLOG_HEAP_INSERT:
heap_xlog_insert(record);
break;
case XLOG_HEAP_DELETE:
heap_xlog_delete(record);
break;
case XLOG_HEAP_UPDATE:
heap_xlog_update(record, false);
break;
/* ... HOT_UPDATE, CONFIRM, LOCK, INPLACE, TRUNCATE ... */
default:
elog(PANIC, "heap_redo: unknown op code %u", info);
}
}

This two-level dispatch (rmid picks the rmgr, opcode picks the operation) is why a single byte (xl_rmid) plus four bits (xl_info high nibble) is enough to route every record in a 22-rmgr system to exactly one redo function. The default: PANIC is deliberate: an unrecognized opcode means the WAL is from an incompatible build or is corrupt, and continuing would risk silent data loss.
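
The two-level routing can be modelled in a few lines. This is a sketch under assumptions: the rmgr ids 0 and 1, the `ToyOp` result enum, and all `toy_` names are hypothetical, and the heap opcode values (0x00, 0x10, 0x20, op mask 0x70, init-page bit 0x80) follow my reading of heapam_xlog.h and should be checked against the tree you build.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_XLR_INFO_MASK       0x0F    /* low nibble: WAL machinery flags */
#define TOY_XLR_RMGR_INFO_MASK  0xF0    /* high nibble: rmgr-private bits */

/* Assumed heap opcode layout; verify against heapam_xlog.h. */
#define TOY_HEAP_INSERT     0x00
#define TOY_HEAP_DELETE     0x10
#define TOY_HEAP_UPDATE     0x20
#define TOY_HEAP_OPMASK     0x70
#define TOY_HEAP_INIT_PAGE  0x80

enum { TOY_RM_XLOG = 0, TOY_RM_HEAP = 1 };  /* hypothetical ids */

typedef enum ToyOp
{
    TOY_OP_UNKNOWN, TOY_OP_XLOG,
    TOY_OP_HEAP_INSERT, TOY_OP_HEAP_DELETE, TOY_OP_HEAP_UPDATE
} ToyOp;

typedef struct ToyRecord
{
    uint8_t xl_info;
    uint8_t xl_rmid;
} ToyRecord;

typedef ToyOp (*toy_redo_fn) (const ToyRecord *rec);

/* Level 2: the rmgr strips the reserved bits and switches on its opcode. */
ToyOp
toy_heap_redo(const ToyRecord *rec)
{
    uint8_t info = (uint8_t) (rec->xl_info & TOY_XLR_RMGR_INFO_MASK);

    switch (info & TOY_HEAP_OPMASK)
    {
        case TOY_HEAP_INSERT: return TOY_OP_HEAP_INSERT;
        case TOY_HEAP_DELETE: return TOY_OP_HEAP_DELETE;
        case TOY_HEAP_UPDATE: return TOY_OP_HEAP_UPDATE;
        default:              return TOY_OP_UNKNOWN; /* real code PANICs */
    }
}

ToyOp
toy_xlog_redo(const ToyRecord *rec)
{
    (void) rec;
    return TOY_OP_XLOG;
}

static const toy_redo_fn toy_rmgr_table[256] = {
    [TOY_RM_XLOG] = toy_xlog_redo,
    [TOY_RM_HEAP] = toy_heap_redo,
};

/* Level 1: the generic driver indexes the table by xl_rmid. */
ToyOp
toy_dispatch(const ToyRecord *rec)
{
    toy_redo_fn redo = toy_rmgr_table[rec->xl_rmid];

    return redo != NULL ? redo(rec) : TOY_OP_UNKNOWN;
}
```

Setting a low-nibble flag such as 0x02 or the init-page bit on a record does not change which operation is selected, because each level masks off the bits it does not own.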

This section walks the four code areas in the document’s scope: the XLogRecord on-disk anatomy, the rmgr table machinery in rmgr.c, the redo-time buffer access in xlogutils.c, and the generic_xlog.c delta engine. WAL insertion (XLogInsert, the insert-lock machinery, segment management) is the subject of postgres-xlog-wal.md; the recovery driver loop (PerformWalRecovery, the redo main loop, checkpoint/restartpoint sequencing) is the subject of postgres-recovery-redo.md. Here we focus on the record format and the dispatch layer that sits between them.

Every WAL record begins with an XLogRecord header of exactly SizeOfXLogRecord bytes (24 on a 64-bit build), defined in xlogrecord.h:

// XLogRecord — src/include/access/xlogrecord.h
typedef struct XLogRecord
{
uint32 xl_tot_len; /* total len of entire record */
TransactionId xl_xid; /* xact id */
XLogRecPtr xl_prev; /* ptr to previous record in log */
uint8 xl_info; /* flag bits, see below */
RmgrId xl_rmid; /* resource manager for this record */
/* 2 bytes of padding here, initialize to zero */
pg_crc32c xl_crc; /* CRC for this record */
/* XLogRecordBlockHeaders and XLogRecordDataHeader follow, no padding */
} XLogRecord;
#define SizeOfXLogRecord (offsetof(XLogRecord, xl_crc) + sizeof(pg_crc32c))

xl_tot_len bounds the parse; xl_xid ties the record to a transaction (used by recovery’s known-assigned-xids tracking and by logical decoding); xl_prev is the LSN of the previous record, forming a back-chain that lets the reader validate it is reading a real record boundary and not garbage; xl_info carries the rmgr opcode (high nibble) and WAL flags (low nibble); xl_rmid is the dispatch key; and xl_crc is a CRC-32C covering the whole record except the xl_crc field itself (computed over the bytes that follow the header first, then over the header up to xl_crc), which makes it the primary defense against replaying a torn or corrupt record. The two reserved xl_info flags the caller of XLogInsert may set are XLR_SPECIAL_REL_UPDATE (0x01, a hint to external block-tracking tools) and XLR_CHECK_CONSISTENCY (0x02, force an FPI for after-the-fact consistency checking).
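
The header size and the checksum coverage can be checked with a stand-alone mirror of the struct. This is a sketch, not the server's code: `ToyXLogRecord` substitutes fixed-width integers for the PostgreSQL typedefs, the CRC routine is a plain bitwise CRC-32C instead of the optimized implementations the server uses, and the payload-then-header ordering is my reading of the validation path, stated here as an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Mirror of the header layout using fixed-width stand-ins. */
typedef struct ToyXLogRecord
{
    uint32_t xl_tot_len;
    uint32_t xl_xid;
    uint64_t xl_prev;
    uint8_t  xl_info;
    uint8_t  xl_rmid;
    /* 2 bytes of padding here, must be zero */
    uint32_t xl_crc;
} ToyXLogRecord;

#define TOY_SIZE_OF_XLOG_RECORD \
    (offsetof(ToyXLogRecord, xl_crc) + sizeof(uint32_t))

/* Bitwise CRC-32C (Castagnoli), reflected polynomial 0x82F63B78. */
uint32_t
toy_crc32c_update(uint32_t crc, const void *data, size_t len)
{
    const unsigned char *p = data;

    while (len--)
    {
        crc ^= *p++;
        for (int k = 0; k < 8; k++)
            crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
    }
    return crc;
}

/*
 * Checksum a record: payload bytes first, then the header up to (but
 * not including) xl_crc, so the stored CRC never covers itself.
 */
uint32_t
toy_record_crc(const ToyXLogRecord *hdr, const void *payload,
               size_t payload_len)
{
    uint32_t crc = 0xFFFFFFFFu;

    crc = toy_crc32c_update(crc, payload, payload_len);
    crc = toy_crc32c_update(crc, hdr, offsetof(ToyXLogRecord, xl_crc));
    return crc ^ 0xFFFFFFFFu;
}
```

The header must be zero-filled (for example with memset) before its fields are set, because the two padding bytes sit inside the checksummed prefix.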

Record anatomy: block references and full-page images

After the header comes a sequence of self-describing chunks. Block references are introduced by an XLogRecordBlockHeader, which the comment notes is not aligned and must be copied to aligned storage before use:

// XLogRecordBlockHeader — src/include/access/xlogrecord.h
typedef struct XLogRecordBlockHeader
{
uint8 id; /* block reference ID */
uint8 fork_flags; /* fork within the relation, and flags */
uint16 data_length; /* number of payload bytes (not including
* page image) */
/* If BKPBLOCK_HAS_IMAGE, an XLogRecordBlockImageHeader struct follows */
/* If BKPBLOCK_SAME_REL is not set, a RelFileLocator follows */
/* BlockNumber follows */
} XLogRecordBlockHeader;

The id is the small integer the emitting code chose in XLogRegisterBuffer (0..4 by default); the redo routine asks for “block 0,” “block 1,” etc. by the same id. fork_flags packs the ForkNumber (low 4 bits) with flag bits (high 4): BKPBLOCK_HAS_IMAGE (an FPI follows), BKPBLOCK_HAS_DATA (per-buffer data follows), BKPBLOCK_WILL_INIT (redo re-initializes the page from scratch, so no FPI is needed and any read of the old contents is a bug), and BKPBLOCK_SAME_REL (the RelFileLocator is omitted because it equals the previous block ref’s, a WAL-volume optimization for multi-page records on the same relation). The page’s physical identity (relation, fork, block) is thus fully recoverable from the record alone, which is what lets a fresh startup process re-find the page with no relcache.
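
Unpacking fork_flags is a matter of two masks. The sketch below uses hypothetical `TOY_`/`toy_` names; the concrete bit values (fork mask 0x0F, flags 0x10 through 0x80) follow my reading of xlogrecord.h and are assumptions to verify against the header you compile with.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed bit layout of fork_flags; check against xlogrecord.h. */
#define TOY_BKPBLOCK_FORK_MASK  0x0F
#define TOY_BKPBLOCK_HAS_IMAGE  0x10
#define TOY_BKPBLOCK_HAS_DATA   0x20
#define TOY_BKPBLOCK_WILL_INIT  0x40
#define TOY_BKPBLOCK_SAME_REL   0x80

typedef struct ToyBlockRefInfo
{
    int  forknum;       /* low nibble: fork within the relation */
    bool has_image;     /* a full-page image follows */
    bool has_data;      /* per-buffer payload follows */
    bool will_init;     /* redo rebuilds the page from scratch */
    bool same_rel;      /* relation identity omitted, reuse previous */
} ToyBlockRefInfo;

/* Split the packed byte into the fork number and the four flags. */
ToyBlockRefInfo
toy_decode_fork_flags(uint8_t fork_flags)
{
    ToyBlockRefInfo info;

    info.forknum   = fork_flags & TOY_BKPBLOCK_FORK_MASK;
    info.has_image = (fork_flags & TOY_BKPBLOCK_HAS_IMAGE) != 0;
    info.has_data  = (fork_flags & TOY_BKPBLOCK_HAS_DATA) != 0;
    info.will_init = (fork_flags & TOY_BKPBLOCK_WILL_INIT) != 0;
    info.same_rel  = (fork_flags & TOY_BKPBLOCK_SAME_REL) != 0;
    return info;
}
```

A parser that sees same_rel set must remember the relation identity from the preceding block reference, which is why block references within one record have to be decoded in order.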

When a full-page image is present, an XLogRecordBlockImageHeader follows, and this is where the “hole” optimization lives:

// XLogRecordBlockImageHeader — src/include/access/xlogrecord.h
typedef struct XLogRecordBlockImageHeader
{
uint16 length; /* number of page image bytes */
uint16 hole_offset; /* number of bytes before "hole" */
uint8 bimg_info; /* flag bits, see below */
/* If BKPIMAGE_HAS_HOLE and BKPIMAGE_COMPRESSED(), an
* XLogRecordBlockCompressHeader struct follows. */
} XLogRecordBlockImageHeader;
/* Information stored in bimg_info */
#define BKPIMAGE_HAS_HOLE 0x01 /* page image has "hole" */
#define BKPIMAGE_APPLY 0x02 /* page image should be restored
* during replay */
#define BKPIMAGE_COMPRESS_PGLZ 0x04
#define BKPIMAGE_COMPRESS_LZ4 0x08
#define BKPIMAGE_COMPRESS_ZSTD 0x10

A standard PostgreSQL page has a contiguous run of zero bytes between pd_lower (end of the line-pointer array) and pd_upper (start of the tuple data growing down from the end). Since those bytes are known-zero, the FPI omits them: hole_offset records where the hole starts, and the hole’s length is either implied by the image length (BLCKSZ minus length, for an uncompressed image) or stored in the compress header (for a compressed one), so replay can reconstruct the full BLCKSZ page by zero-filling the gap. BKPIMAGE_APPLY distinguishes an image that should be restored during replay (the normal first-touch-after-checkpoint case) from one logged only for wal_consistency_checking comparison. The three COMPRESS_* bits select the algorithm when wal_compression is on. This is the on-disk realization of the torn-page defense from the theory section.
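
Stripping and restoring the hole can be sketched for the uncompressed case. The functions and the `TOY_BLCKSZ` constant are hypothetical; the sketch assumes an 8192-byte page and an uncompressed image, where the hole length is taken to be the page size minus the stored image length.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_BLCKSZ 8192     /* assumed page size */

/*
 * Copy a page into `image` without its hole. Returns the number of
 * image bytes written (page size minus hole length).
 */
size_t
toy_strip_hole(const unsigned char *page, uint16_t hole_offset,
               uint16_t hole_length, unsigned char *image)
{
    size_t tail = (size_t) TOY_BLCKSZ - hole_offset - hole_length;

    memcpy(image, page, hole_offset);
    memcpy(image + hole_offset, page + hole_offset + hole_length, tail);
    return (size_t) TOY_BLCKSZ - hole_length;
}

/*
 * Rebuild a full page from a stripped image: bytes before the hole,
 * then zeros for the hole, then the bytes that followed it.
 */
void
toy_restore_image(const unsigned char *image, size_t image_len,
                  uint16_t hole_offset, unsigned char *page)
{
    size_t hole_length = (size_t) TOY_BLCKSZ - image_len;

    memcpy(page, image, hole_offset);
    memset(page + hole_offset, 0, hole_length);
    memcpy(page + hole_offset + hole_length, image + hole_offset,
           image_len - hole_offset);
}
```

For a page with 100 bytes of line pointers and 200 bytes of tuple data, the stored image is 300 bytes instead of 8192, and the restore step reproduces the original page as long as the hole really was all zeros.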

rmgr.c is small and almost entirely table-driven. Beyond defining RmgrTable[] (shown above), it provides the lifecycle hooks the recovery driver calls. RmgrStartup and RmgrCleanup iterate the table once, invoking the optional rm_startup/rm_cleanup callbacks so index AMs can set up and tear down their incomplete-action bookkeeping:

// RmgrStartup — src/backend/access/transam/rmgr.c
void
RmgrStartup(void)
{
for (int rmid = 0; rmid <= RM_MAX_ID; rmid++)
{
if (!RmgrIdExists(rmid))
continue;
if (RmgrTable[rmid].rm_startup != NULL)
RmgrTable[rmid].rm_startup();
}
}

RmgrNotFound is the error path: if a record arrives bearing an rmid with no registered entry — typically a custom rmgr whose extension was not loaded into shared_preload_libraries — recovery ereport(ERROR)s with a hint to load the module. The GetRmgr inline accessor (in xlog_internal.h) funnels every lookup through RmgrNotFound so a missing rmgr can never be silently skipped.

Custom resource managers occupy ids 128–255 and are installed at process startup by RegisterCustomRmgr, which enforces several invariants: the id must be in the custom range (RmgrIdIsCustom), registration must happen while shared_preload_libraries are initializing (process_shared_preload_libraries_in_progress), the id must not already be taken, and the name must be non-empty and not collide with an existing rmgr’s name:

// RegisterCustomRmgr — src/backend/access/transam/rmgr.c
void
RegisterCustomRmgr(RmgrId rmid, const RmgrData *rmgr)
{
if (rmgr->rm_name == NULL || strlen(rmgr->rm_name) == 0)
ereport(ERROR, (errmsg("custom resource manager name is invalid"), ...));
if (!RmgrIdIsCustom(rmid))
ereport(ERROR, (errmsg("custom resource manager ID %d is out of range", rmid), ...));
if (!process_shared_preload_libraries_in_progress)
ereport(ERROR, ( ... "must be registered while initializing modules in "
"\"shared_preload_libraries\"." ));
if (RmgrTable[rmid].rm_name != NULL)
ereport(ERROR, ( ... "already registered with the same ID." ));
/* ... name-collision scan ... */
RmgrTable[rmid] = *rmgr; /* register it */
ereport(LOG, (errmsg("registered custom resource manager \"%s\" with ID %d",
rmgr->rm_name, rmid)));
}

The RM_EXPERIMENTAL_ID (128) is the conventional id to use during development before reserving a unique number on the PostgreSQL wiki. Finally, pg_get_wal_resource_managers() is the SQL set-returning function that exposes the table; it walks the table emitting (rmid, name, builtin) rows, using RmgrIdIsBuiltin to flag whether each id is a built-in or a custom registration.
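
The order of the registration checks can be modelled without a server. Everything here is hypothetical scaffolding: the result enum, the name-only table seeded with two built-in names, and the `toy_preload_in_progress` flag stand in for the real state, and the name comparison is an exact match for simplicity.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define TOY_RM_MIN_CUSTOM_ID 128
#define TOY_RM_MAX_ID        255

typedef enum ToyRegResult
{
    TOY_REG_OK, TOY_REG_BAD_NAME, TOY_REG_OUT_OF_RANGE,
    TOY_REG_TOO_LATE, TOY_REG_ID_TAKEN, TOY_REG_NAME_TAKEN
} ToyRegResult;

/* Name-only stand-in for the dispatch table, seeded with two built-ins. */
static const char *toy_names[TOY_RM_MAX_ID + 1] = { "XLOG", "Transaction" };

/* Stand-in for "shared_preload_libraries is being processed". */
bool toy_preload_in_progress = true;

/* Apply the same checks, in the same order, as described above. */
ToyRegResult
toy_register_custom_rmgr(int rmid, const char *name)
{
    if (name == NULL || name[0] == '\0')
        return TOY_REG_BAD_NAME;
    if (rmid < TOY_RM_MIN_CUSTOM_ID || rmid > TOY_RM_MAX_ID)
        return TOY_REG_OUT_OF_RANGE;
    if (!toy_preload_in_progress)
        return TOY_REG_TOO_LATE;
    if (toy_names[rmid] != NULL)
        return TOY_REG_ID_TAKEN;

    /* Name-collision scan over every registered slot. */
    for (int i = 0; i <= TOY_RM_MAX_ID; i++)
        if (toy_names[i] != NULL && strcmp(toy_names[i], name) == 0)
            return TOY_REG_NAME_TAKEN;

    toy_names[rmid] = name;     /* register it */
    return TOY_REG_OK;
}
```

The real function reports each failure with ereport(ERROR) instead of returning a code; the sketch returns codes only so the checks can be exercised in a plain C program.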

A redo routine never reads a page through the normal relcache/buffer paths — during recovery there is no relcache and the catalogs may not even be consistent yet. Instead it goes through xlogutils.c, whose central entry point is XLogReadBufferForRedo. This is the single function that ties the record’s block reference to a real shared buffer and decides whether the change still needs applying:

// XLogReadBufferForRedoExtended — src/backend/access/transam/xlogutils.c
XLogRedoAction
XLogReadBufferForRedoExtended(XLogReaderState *record, uint8 block_id,
ReadBufferMode mode, bool get_cleanup_lock,
Buffer *buf)
{
XLogRecPtr lsn = record->EndRecPtr;
RelFileLocator rlocator;
ForkNumber forknum;
BlockNumber blkno;
/* ... */
if (!XLogRecGetBlockTagExtended(record, block_id, &rlocator, &forknum,
&blkno, &prefetch_buffer))
elog(PANIC, "failed to locate backup block with ID %d in WAL record", block_id);
/* If it has a full-page image and it should be restored, do it. */
if (XLogRecBlockImageApply(record, block_id))
{
*buf = XLogReadBufferExtended(rlocator, forknum, blkno, ...);
page = BufferGetPage(*buf);
if (!RestoreBlockImage(record, block_id, page))
ereport(ERROR, ...);
if (!PageIsNew(page))
PageSetLSN(page, lsn);
MarkBufferDirty(*buf);
return BLK_RESTORED;
}
else
{
*buf = XLogReadBufferExtended(rlocator, forknum, blkno, mode, prefetch_buffer);
if (BufferIsValid(*buf))
{
/* ... acquire lock ... */
if (lsn <= PageGetLSN(BufferGetPage(*buf)))
return BLK_DONE;
else
return BLK_NEEDS_REDO;
}
else
return BLK_NOTFOUND;
}
}

The four-way return value is the idempotent-redo logic from ARIES made concrete. BLK_RESTORED — the record carried an FPI that applies, so the whole page was overwritten and the rmgr’s incremental logic is skipped. BLK_NEEDS_REDO — the page exists and its pageLSN is older than this record, so the incremental change has not yet been applied; the rmgr applies it and stamps the new LSN. BLK_DONE — the page’s LSN is already >= the record LSN, meaning the change is already present (the page was flushed before the crash, or we are replaying past it), so redo is skipped. BLK_NOTFOUND — the page no longer exists (its relation was dropped or truncated later in the WAL), so the change is harmlessly ignored. The lsn <= PageGetLSN(...) comparison is the exact mechanism that makes replay safe to run twice. The thin wrapper XLogReadBufferForRedo calls this with RBM_NORMAL; XLogInitBufferForRedo calls it with RBM_ZERO_AND_LOCK for the BKPBLOCK_WILL_INIT case where the redo routine rebuilds the page from scratch.
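
The four outcomes and the idempotence they give can be demonstrated on a one-integer "page". This is a toy model with hypothetical names: a NULL page pointer stands for a block that no longer exists, and the toy returns not-found for it in every case, whereas the real image path reads the block in a zero-and-lock mode instead.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef enum ToyRedoAction
{
    TOY_BLK_NEEDS_REDO, TOY_BLK_DONE, TOY_BLK_RESTORED, TOY_BLK_NOTFOUND
} ToyRedoAction;

typedef struct ToyPage
{
    uint64_t lsn;       /* pageLSN: LSN of the last applied record */
    int      value;     /* stand-in for the page contents */
} ToyPage;

typedef struct ToyRecord
{
    uint64_t end_lsn;       /* LSN stamped on the page after replay */
    int      delta;         /* incremental change: value += delta */
    bool     has_fpi_apply; /* record carries an image to restore */
    ToyPage  fpi;           /* the image, if any */
} ToyRecord;

/* Decide what replay must do for this record and page. */
ToyRedoAction
toy_read_buffer_for_redo(const ToyRecord *rec, ToyPage *page)
{
    if (page == NULL)
        return TOY_BLK_NOTFOUND;    /* toy simplification, see lead-in */
    if (rec->has_fpi_apply)
    {
        *page = rec->fpi;           /* overwrite the whole page */
        page->lsn = rec->end_lsn;
        return TOY_BLK_RESTORED;
    }
    if (rec->end_lsn <= page->lsn)
        return TOY_BLK_DONE;        /* change already on the page */
    return TOY_BLK_NEEDS_REDO;
}

/* A redo routine: apply the increment only when asked to. */
ToyRedoAction
toy_redo(const ToyRecord *rec, ToyPage *page)
{
    ToyRedoAction action = toy_read_buffer_for_redo(rec, page);

    if (action == TOY_BLK_NEEDS_REDO)
    {
        page->value += rec->delta;
        page->lsn = rec->end_lsn;
    }
    return action;
}
```

Replaying the same incremental record twice changes the page only the first time: the first replay stamps the page with the record's LSN, so the second replay takes the done branch.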

XLogReadBufferExtended is the lower-level page fetch. During recovery it opens the relation at the storage-manager (smgr) level — bypassing the relcache entirely — and smgrcreates the file if missing, so replay can write to a relation that a later record will drop. If the requested block is beyond end-of-file it either extends with zero pages (the RBM_ZERO_* modes) or, in RBM_NORMAL, logs an invalid page reference and returns InvalidBuffer:

// XLogReadBufferExtended — src/backend/access/transam/xlogutils.c (condensed)
smgr = smgropen(rlocator, INVALID_PROC_NUMBER);
smgrcreate(smgr, forknum, true);
lastblock = smgrnblocks(smgr, forknum);
if (blkno < lastblock)
buffer = ReadBufferWithoutRelcache(rlocator, forknum, blkno, mode, NULL, true);
else
{
if (mode == RBM_NORMAL) /* hm, page doesn't exist in file */
{
log_invalid_page(rlocator, forknum, blkno, false);
return InvalidBuffer;
}
/* ... RBM_ZERO_* modes extend the file with zero pages ... */
}

The invalid-page mechanism deserves a note because it is a subtle correctness device. With full_page_writes=off it is possible to see an incremental record for a page that no longer exists (because its relation was later dropped/truncated, but the drop record comes after this one in the stream). log_invalid_page records the (relation, fork, block) in a hash table; if a later record drops or truncates that relation, the entry is forgotten. At the end of recovery, XLogCheckInvalidPages sweeps the table — any entry still present means the WAL referenced a page that was never accounted for, which is a PANIC-worthy corruption signal (downgradable to WARNING via the ignore_invalid_pages GUC):

// XLogCheckInvalidPages — src/backend/access/transam/xlogutils.c (condensed)
while ((hentry = (xl_invalid_page *) hash_seq_search(&status)) != NULL)
{
report_invalid_page(WARNING, hentry->key.locator, hentry->key.forkno,
hentry->key.blkno, hentry->present);
foundone = true;
}
if (foundone)
elog(ignore_invalid_pages ? WARNING : PANIC,
"WAL contains references to invalid pages");

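The invalid-page bookkeeping described above amounts to a set with "remember", "forget", and "anything left?" operations. The sketch below replaces the hash table with a small fixed array and identifies a relation by a single integer; all names and the capacity of 64 are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_MAX_INVALID 64      /* hypothetical capacity */

typedef struct ToyInvalidPage
{
    uint32_t rel;       /* stand-in for (relation, fork) */
    uint32_t blkno;
    bool     used;
} ToyInvalidPage;

static ToyInvalidPage toy_invalid[TOY_MAX_INVALID];

/* Remember a reference to a missing page; false if the table is full. */
bool
toy_log_invalid_page(uint32_t rel, uint32_t blkno)
{
    int free_slot = -1;

    for (int i = 0; i < TOY_MAX_INVALID; i++)
    {
        if (toy_invalid[i].used && toy_invalid[i].rel == rel &&
            toy_invalid[i].blkno == blkno)
            return true;        /* already remembered */
        if (!toy_invalid[i].used && free_slot < 0)
            free_slot = i;
    }
    if (free_slot < 0)
        return false;
    toy_invalid[free_slot] = (ToyInvalidPage) { rel, blkno, true };
    return true;
}

/*
 * Forget entries explained by a later drop (min_blkno = 0) or a later
 * truncation to min_blkno blocks.
 */
void
toy_forget_invalid_pages(uint32_t rel, uint32_t min_blkno)
{
    for (int i = 0; i < TOY_MAX_INVALID; i++)
        if (toy_invalid[i].used && toy_invalid[i].rel == rel &&
            toy_invalid[i].blkno >= min_blkno)
            toy_invalid[i].used = false;
}

/* End-of-recovery sweep: how many references were never explained? */
int
toy_check_invalid_pages(void)
{
    int remaining = 0;

    for (int i = 0; i < TOY_MAX_INVALID; i++)
        if (toy_invalid[i].used)
            remaining++;
    return remaining;
}
```

A nonzero count at the end corresponds to the PANIC (or WARNING) case: some record touched a page that no later drop or truncation accounts for.
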
The desc/identify callbacks exist purely to render WAL human-readably; the backend never calls them during normal operation. The per-rmgr descriptor files live under src/backend/access/rmgrdesc/. identify maps the opcode bits (the same xl_info & ~XLR_INFO_MASK the redo routine switches on) to a short string, including the +INIT suffix when XLOG_HEAP_INIT_PAGE is set:

// heap_identify — src/backend/access/rmgrdesc/heapdesc.c (condensed)
const char *
heap_identify(uint8 info)
{
const char *id = NULL;
switch (info & ~XLR_INFO_MASK)
{
case XLOG_HEAP_INSERT: id = "INSERT"; break;
case XLOG_HEAP_INSERT | XLOG_HEAP_INIT_PAGE: id = "INSERT+INIT"; break;
case XLOG_HEAP_DELETE: id = "DELETE"; break;
case XLOG_HEAP_UPDATE: id = "UPDATE"; break;
/* ... HOT_UPDATE, TRUNCATE, CONFIRM, LOCK, INPLACE ... */
}
return id;
}

desc decodes the record’s main-data blob into field-by-field detail by casting XLogRecGetData(record) to the rmgr’s struct for that opcode:

// heap_desc — src/backend/access/rmgrdesc/heapdesc.c (condensed)
void
heap_desc(StringInfo buf, XLogReaderState *record)
{
char *rec = XLogRecGetData(record);
uint8 info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
info &= XLOG_HEAP_OPMASK;
if (info == XLOG_HEAP_INSERT)
{
xl_heap_insert *xlrec = (xl_heap_insert *) rec;
appendStringInfo(buf, "off: %u, flags: 0x%02X", xlrec->offnum, xlrec->flags);
}
else if (info == XLOG_HEAP_DELETE) { /* ... */ }
/* ... */
}

pg_waldump is a frontend program and cannot link the backend’s redo code, so it builds its own descriptor-only table from the same rmgrlist.h, this time expanding PG_RMGR to keep only the name/desc/identify columns:

// RmgrDescTable — src/bin/pg_waldump/rmgrdesc.c
#define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup,mask,decode) \
{ name, desc, identify},
static const RmgrDescData RmgrDescTable[RM_N_BUILTIN_IDS] = {
#include "access/rmgrlist.h"
};

For each record pg_waldump fetches GetRmgrDesc(XLogRecGetRmid(record)), calls rm_identify(info) for the operation name and rm_desc(&s, record) for the detail line — the same two callbacks the table threads through. This is why a single header (rmgrlist.h) keeps the backend’s redo dispatch and the standalone dump tool perfectly in sync: both are generated from one list.

An extension that stores data in standard PostgreSQL pages but does not want to write a full custom resource manager can use generic_xlog.c, the Generic rmgr (RM_GENERIC_ID). It WAL-logs an arbitrary byte-level delta of a page without the extension having to define record types or a redo routine — the shared generic_redo replays any such record. The construction API is a three-call lifecycle. GenericXLogStart allocates an I/O-aligned state holding up to MAX_GENERIC_XLOG_PAGES page slots:

// GenericXLogStart — src/backend/access/transam/generic_xlog.c (condensed)
GenericXLogState *
GenericXLogStart(Relation relation)
{
GenericXLogState *state;
int i;
state = (GenericXLogState *) palloc_aligned(sizeof(GenericXLogState),
PG_IO_ALIGN_SIZE, 0);
state->isLogged = RelationNeedsWAL(relation);
for (i = 0; i < MAX_GENERIC_XLOG_PAGES; i++)
{
state->pages[i].image = state->images[i].data;
state->pages[i].buffer = InvalidBuffer;
}
return state;
}

GenericXLogRegisterBuffer takes a copy of the page into the state’s image buffer and hands that copy back to the caller to modify in place; the caller mutates the image, never the live buffer:

// GenericXLogRegisterBuffer — src/backend/access/transam/generic_xlog.c (condensed)
Page
GenericXLogRegisterBuffer(GenericXLogState *state, Buffer buffer, int flags)
{
int block_id;
for (block_id = 0; block_id < MAX_GENERIC_XLOG_PAGES; block_id++)
{
GenericXLogPageData *page = &state->pages[block_id];
if (BufferIsInvalid(page->buffer))
{
page->buffer = buffer;
page->flags = flags;
memcpy(page->image, BufferGetPage(buffer), BLCKSZ);
return (Page) page->image;
}
else if (page->buffer == buffer)
return (Page) page->image; /* already registered */
}
elog(ERROR, "maximum number %d of generic xlog buffers is exceeded",
MAX_GENERIC_XLOG_PAGES);
}

GenericXLogFinish is where the magic happens. It diffs the original page against the modified image to produce a compact delta, copies the image onto the real buffer (zeroing the hole to match what replay will produce), marks the buffer dirty, and registers the delta as per-buffer data on a Generic record:

// GenericXLogFinish — src/backend/access/transam/generic_xlog.c (condensed)
if (!(pageData->flags & GENERIC_XLOG_FULL_IMAGE))
    computeDelta(pageData, page, (Page) pageData->image);
/* apply the image, zeroing the hole between pd_lower and pd_upper */
memcpy(page, pageData->image, pageHeader->pd_lower);
memset(page + pageHeader->pd_lower, 0,
       pageHeader->pd_upper - pageHeader->pd_lower);
memcpy(page + pageHeader->pd_upper, pageData->image + pageHeader->pd_upper,
       BLCKSZ - pageHeader->pd_upper);
MarkBufferDirty(pageData->buffer);
if (pageData->flags & GENERIC_XLOG_FULL_IMAGE)
    XLogRegisterBuffer(i, pageData->buffer, REGBUF_FORCE_IMAGE | REGBUF_STANDARD);
else
{
    XLogRegisterBuffer(i, pageData->buffer, REGBUF_STANDARD);
    XLogRegisterBufData(i, pageData->delta, pageData->deltaLen);
}
/* ... */
lsn = XLogInsert(RM_GENERIC_ID, 0);
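
The hole-zeroing copy in the excerpt above can be tried outside the server. What follows is a minimal standalone sketch, not PostgreSQL code: ToyPageHeader, TOY_BLCKSZ, and apply_image_zeroing_hole are hypothetical names, and the page header is reduced to the two fields the copy needs.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TOY_BLCKSZ 64            /* hypothetical tiny page size for the demo */

/* Hypothetical stand-in for the pd_lower/pd_upper fields of a page header. */
typedef struct ToyPageHeader
{
    uint16_t    pd_lower;        /* end of the used lower region */
    uint16_t    pd_upper;        /* start of the used upper region */
} ToyPageHeader;

/*
 * Copy image onto page, writing zeroes over [pd_lower, pd_upper) so the
 * result never depends on whatever bytes the image holds inside the hole.
 */
static void
apply_image_zeroing_hole(char *page, const char *image)
{
    ToyPageHeader hdr;

    memcpy(&hdr, image, sizeof(hdr));   /* the header sits at the page start */
    assert(hdr.pd_lower >= sizeof(hdr));
    assert(hdr.pd_lower <= hdr.pd_upper && hdr.pd_upper <= TOY_BLCKSZ);

    memcpy(page, image, hdr.pd_lower);
    memset(page + hdr.pd_lower, 0, hdr.pd_upper - hdr.pd_lower);
    memcpy(page + hdr.pd_upper, image + hdr.pd_upper,
           TOY_BLCKSZ - hdr.pd_upper);
}
```

Because the delta never describes hole bytes, writing zeroes there on the primary is what keeps the live page byte-identical to the page replay will later build.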

The delta itself is a list of fragments: (offset, length, bytes) triples covering only the regions that changed. computeDelta runs the diff separately over the page’s lower part (0..pd_lower) and upper part (pd_upper..BLCKSZ), skipping the hole entirely; computeRegionDelta does the tight byte-matching loop, merging fragments separated by fewer than MATCH_THRESHOLD unchanged bytes (because a fragment header costs 2*sizeof(OffsetNumber), so tiny gaps are not worth splitting). The worst-case delta is bounded by MAX_DELTA_SIZE = BLCKSZ + 2*FRAGMENT_HEADER_SIZE. On replay, generic_redo re-pins each block with XLogReadBufferForRedo and applies the fragments with applyPageRedo, the mirror of writeFragment:

// applyPageRedo — src/backend/access/transam/generic_xlog.c
static void
applyPageRedo(Page page, const char *delta, Size deltaSize)
{
    const char *ptr = delta;
    const char *end = delta + deltaSize;

    while (ptr < end)
    {
        OffsetNumber offset, length;

        memcpy(&offset, ptr, sizeof(offset)); ptr += sizeof(offset);
        memcpy(&length, ptr, sizeof(length)); ptr += sizeof(length);
        memcpy(page + offset, ptr, length);
        ptr += length;
    }
}

After applying, generic_redo zeroes the hole on the replayed page (the delta carries no hole bytes) and stamps the LSN — exactly mirroring what GenericXLogFinish did on the primary, so a consistency check passes. generic_mask (the rmgr’s rm_mask) blanks the page LSN, checksum, and unused space before that check. The whole point of generic_xlog is that an extension gets crash-safe, replication-safe page edits with a delta-encoded WAL footprint and zero custom redo code — at the cost of the diff being byte-level rather than semantic.
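
The encode/apply symmetry described above can be checked in isolation. The following is a simplified sketch under stated assumptions, not the code in generic_xlog.c: every toy_ name and ToyOffset are hypothetical, the merge threshold is assumed equal to the fragment header size, and the caller is assumed to supply an output buffer large enough for the delta.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef uint16_t ToyOffset;                      /* stand-in for OffsetNumber */
#define TOY_FRAGMENT_HEADER (2 * sizeof(ToyOffset))
#define TOY_MATCH_THRESHOLD TOY_FRAGMENT_HEADER  /* shorter gaps are merged */

/* Append one (offset, length, bytes) fragment; returns the bytes written. */
static size_t
toy_write_fragment(char *out, ToyOffset offset, ToyOffset length,
                   const char *data)
{
    memcpy(out, &offset, sizeof(offset));
    memcpy(out + sizeof(offset), &length, sizeof(length));
    memcpy(out + TOY_FRAGMENT_HEADER, data, length);
    return TOY_FRAGMENT_HEADER + length;
}

/* Diff cur against target over [start, end) and emit fragments into out. */
static size_t
toy_compute_delta(const char *cur, const char *target,
                  size_t start, size_t end, char *out)
{
    size_t      len = 0;
    size_t      i = start;

    while (i < end)
    {
        size_t      frag_start, frag_end, j;

        if (cur[i] == target[i])
        {
            i++;                /* skip bytes that already match */
            continue;
        }
        frag_start = i;
        frag_end = i + 1;       /* one past the last differing byte seen */
        /* keep extending while the next difference is within the threshold */
        for (j = frag_end; j < end && j - frag_end < TOY_MATCH_THRESHOLD; j++)
        {
            if (cur[j] != target[j])
                frag_end = j + 1;
        }
        len += toy_write_fragment(out + len, (ToyOffset) frag_start,
                                  (ToyOffset) (frag_end - frag_start),
                                  target + frag_start);
        i = frag_end;
    }
    return len;
}

/* Replay side: walk the fragments and copy each one into place. */
static void
toy_apply_delta(char *page, const char *delta, size_t delta_len)
{
    const char *ptr = delta;
    const char *end = delta + delta_len;

    while (ptr < end)
    {
        ToyOffset   offset, length;

        memcpy(&offset, ptr, sizeof(offset)); ptr += sizeof(offset);
        memcpy(&length, ptr, sizeof(length)); ptr += sizeof(length);
        memcpy(page + offset, ptr, length);
        ptr += length;
    }
}
```

Calling toy_compute_delta once for the lower region and once for the upper region models how the hole is skipped; applying the resulting bytes to a copy of the original page should reproduce the modified image exactly.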

Position hints (as of 2026-06-05, REL_18 273fe94)

| Symbol | File | Line |
| --- | --- | --- |
| PG_RMGR list (rmgr definitions) | src/include/access/rmgrlist.h | 28–49 |
| RmgrIds enum (PG_RMGR → enumerator) | src/include/access/rmgr.h | 22–30 |
| RM_MAX_ID / RM_MIN_CUSTOM_ID / RM_EXPERIMENTAL_ID | src/include/access/rmgr.h | 33–60 |
| RmgrIdIsBuiltin / RmgrIdIsCustom | src/include/access/rmgr.h | 41–53 |
| RmgrData struct | src/include/access/xlog_internal.h | 349–360 |
| GetRmgr / RmgrIdExists | src/include/access/xlog_internal.h | 369–382 |
| RmgrTable[] definition | src/backend/access/transam/rmgr.c | 47–52 |
| RmgrStartup | src/backend/access/transam/rmgr.c | 58 |
| RmgrCleanup | src/backend/access/transam/rmgr.c | 74 |
| RmgrNotFound | src/backend/access/transam/rmgr.c | 91 |
| RegisterCustomRmgr | src/backend/access/transam/rmgr.c | 107 |
| pg_get_wal_resource_managers | src/backend/access/transam/rmgr.c | 150 |
| XLogRecord struct | src/include/access/xlogrecord.h | 41–53 |
| XLR_INFO_MASK / XLR_RMGR_INFO_MASK | src/include/access/xlogrecord.h | 62–63 |
| XLR_SPECIAL_REL_UPDATE / XLR_CHECK_CONSISTENCY | src/include/access/xlogrecord.h | 82–91 |
| XLogRecordBlockHeader | src/include/access/xlogrecord.h | 103–113 |
| XLogRecordBlockImageHeader + BKPIMAGE_* | src/include/access/xlogrecord.h | 141–167 |
| BKPBLOCK_* flags | src/include/access/xlogrecord.h | 196–202 |
| XLogReadBufferForRedo | src/backend/access/transam/xlogutils.c | 303 |
| XLogInitBufferForRedo | src/backend/access/transam/xlogutils.c | 315 |
| XLogReadBufferForRedoExtended | src/backend/access/transam/xlogutils.c | 340 |
| XLogReadBufferExtended | src/backend/access/transam/xlogutils.c | 460 |
| log_invalid_page | src/backend/access/transam/xlogutils.c | 101 |
| XLogCheckInvalidPages | src/backend/access/transam/xlogutils.c | 234 |
| GenericXLogState / GenericXLogPageData | src/backend/access/transam/generic_xlog.c | 49–71 |
| GenericXLogStart | src/backend/access/transam/generic_xlog.c | 269 |
| GenericXLogRegisterBuffer | src/backend/access/transam/generic_xlog.c | 299 |
| GenericXLogFinish | src/backend/access/transam/generic_xlog.c | 337 |
| computeDelta / computeRegionDelta / writeFragment | src/backend/access/transam/generic_xlog.c | 90 / 121 / 228 |
| generic_redo / applyPageRedo / generic_mask | src/backend/access/transam/generic_xlog.c | 478 / 453 / 539 |
| MAX_GENERIC_XLOG_PAGES / GENERIC_XLOG_FULL_IMAGE | src/include/access/generic_xlog.h | 23 / 26 |
| heap_redo (opcode dispatch) | src/backend/access/heap/heapam_xlog.c | 1181 |
| heap_desc / heap_identify | src/backend/access/rmgrdesc/heapdesc.c | 184 / 389 |
| RmgrDescTable (frontend) | src/bin/pg_waldump/rmgrdesc.c | 35–40 |

Verified against the REL_18_STABLE checkout at commit 273fe94 under /data/hgryoo/references/postgres.

  • rmgr set and ids. rmgrlist.h defines exactly 22 PG_RMGR entries (ids 0–21): XLOG, Transaction, Storage, CLOG, Database, Tablespace, MultiXact, RelMap, Standby, Heap2, Heap, Btree, Hash, Gin, Gist, Sequence, SPGist, BRIN, CommitTs, ReplicationOrigin, Generic, LogicalMessage. Confirmed there is no XLOG2 rmgr in this revision (it is a later-version addition); the list ends at RM_LOGICALMSG_ID. RM_MAX_BUILTIN_ID is therefore RM_NEXT_ID - 1 = 21.
  • X-macro triple use. Verified rmgrlist.h has no include guard and is #include-d under three different PG_RMGR definitions: enumerator (rmgr.h), full struct initializer (rmgr.c), and {name,desc,identify} triple (pg_waldump/rmgrdesc.c). The rmgr.c macro comment “must be kept in sync with RmgrData definition in xlog_internal.h” matches the 8-field RmgrData struct order verbatim.
  • RmgrData callback order. The struct in xlog_internal.h lists rm_name, rm_redo, rm_desc, rm_identify, rm_startup, rm_cleanup, rm_mask, rm_decode, identical to the PG_RMGR argument order, confirmed by reading both. The rmgr.c initializer fills them in that same order: the name first, then the seven callbacks.
  • XLogRecord header. Confirmed the six fields and the 2-byte padding comment in xlogrecord.h; SizeOfXLogRecord = offsetof(XLogRecord, xl_crc) + sizeof(pg_crc32c). xl_rmid is RmgrId = uint8.
  • opcode masking. Verified XLR_INFO_MASK = 0x0F and XLR_RMGR_INFO_MASK = 0xF0, and that heap_redo masks with & ~XLR_INFO_MASK then & XLOG_HEAP_OPMASK before switching — matching the prose’s “low nibble reserved, high nibble rmgr” claim.
  • block flags / FPI. Confirmed BKPBLOCK_HAS_IMAGE/HAS_DATA/WILL_INIT/SAME_REL (0x10/0x20/0x40/0x80) and BKPIMAGE_HAS_HOLE/APPLY/COMPRESS_* bit values in xlogrecord.h.
  • redo buffer return values. XLogReadBufferForRedoExtended returns BLK_RESTORED (FPI applied), BLK_NEEDS_REDO (lsn > PageGetLSN), BLK_DONE (lsn <= PageGetLSN), BLK_NOTFOUND (no buffer) — read directly from the function body.
  • generic_xlog limits. MAX_GENERIC_XLOG_PAGES is XLR_NORMAL_MAX_BLOCK_ID and GENERIC_XLOG_FULL_IMAGE = 0x0001 in generic_xlog.h; MAX_DELTA_SIZE = BLCKSZ + 2 * FRAGMENT_HEADER_SIZE and FRAGMENT_HEADER_SIZE = 2 * sizeof(OffsetNumber) in generic_xlog.c.
  • custom rmgr range. RM_MIN_CUSTOM_ID = 128, RM_MAX_CUSTOM_ID = UINT8_MAX, RM_EXPERIMENTAL_ID = 128; RegisterCustomRmgr enforces all four invariants quoted (range, preload-in-progress, no duplicate id, no duplicate name) — read directly.
  • Line numbers in the position-hint table were captured from this checkout on 2026-06-05; they are hints and will drift with reformatting. Symbol names are the durable anchors.
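
The opcode-masking item above comes down to two bitwise steps, sketched here as standalone C. The two nibble masks carry the values quoted in that item; TOY_HEAP_OPMASK (0x70) and TOY_HEAP_INIT_PAGE (0x80) are assumed values meant to mirror heapam_xlog.h and should be checked against the checkout, and all toy_ names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_XLR_INFO_MASK       0x0F    /* low nibble: reserved for generic WAL code */
#define TOY_XLR_RMGR_INFO_MASK  0xF0    /* high nibble: owned by the resource manager */
#define TOY_HEAP_OPMASK         0x70    /* assumed: opcode bits inside the high nibble */
#define TOY_HEAP_INIT_PAGE      0x80    /* assumed: "initialize page" modifier bit */

/* Step 1: strip the generic low-nibble flags, keeping only rmgr-owned bits. */
static uint8_t
toy_rmgr_info(uint8_t xl_info)
{
    return (uint8_t) (xl_info & ~TOY_XLR_INFO_MASK);
}

/* Step 2: isolate the opcode a redo routine would switch on. */
static uint8_t
toy_heap_opcode(uint8_t xl_info)
{
    return (uint8_t) (toy_rmgr_info(xl_info) & TOY_HEAP_OPMASK);
}

/* The modifier bit shares the high nibble but is not part of the opcode. */
static int
toy_heap_init_page(uint8_t xl_info)
{
    return (toy_rmgr_info(xl_info) & TOY_HEAP_INIT_PAGE) != 0;
}
```

Masking in two steps keeps a generic flag set in the low nibble, or a modifier bit in the high nibble, from changing which case the switch selects.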

Beyond PostgreSQL — Comparative Designs & Research Frontiers

PostgreSQL’s rmgr table is a particularly clean instance of a pattern that recurs across every ARIES-family engine, but the choices it makes — redo-only logging, a compile-time X-macro dispatch table, physiological per-page records with optional FPIs — sit at one corner of a larger design space. Placing it against its peers and the research literature sharpens what is essential versus incidental.

ARIES and the resource-manager abstraction. The original ARIES paper (Mohan et al. 1992, captured in knowledge/research/dbms-papers/aries.md) already factors recovery into a generic driver plus per-resource-manager redo and undo logic; IBM’s DB2 and the System R lineage (knowledge/research/dbms-papers/systemr.md) embody it directly. The deepest divergence is that classic ARIES is a redo-undo log: every update logs both how to reapply and how to roll back, and rollback writes compensation log records (CLRs) whose UndoNxtLSN lets recovery skip already-undone work and guarantees termination even if the system crashes mid-rollback. PostgreSQL discards the entire undo half. Because MVCC leaves an aborted transaction’s tuples physically in place and merely marks the xid aborted in pg_xact, there is nothing to physically undo, so RmgrData carries rm_redo but no rm_undo, and recovery is a single forward pass with no analysis/undo phases in the ARIES sense. The cost is paid elsewhere — bloat from dead tuples, deferred to VACUUM (postgres-vacuum.md) — but the recovery code shrinks dramatically. This is the single most consequential architectural difference between PostgreSQL’s WAL subsystem and a textbook ARIES implementation.

Undo-based engines and the road not taken. Oracle and MySQL/InnoDB take the opposite stance: they maintain an explicit undo/rollback segment, so old row versions are reconstructed on demand from undo and the main heap holds only the latest version. Their redo logs (Oracle redo records of change vectors tagged by layer.opcode; InnoDB mini-transaction records tagged by mlog type) still multiplex many subsystems through a small type tag, exactly as xl_rmid does, but recovery must run both a redo pass and an undo pass that applies the rollback segment. PostgreSQL’s own community has repeatedly proposed an undo-based storage engine (the zheap project) precisely to escape MVCC bloat; notably, zheap would have reintroduced an undo log and therefore something like the CLR machinery ARIES describes — a reminder that PostgreSQL’s redo-only simplicity is a direct consequence of its heap’s append-only version storage, not a free lunch.

Physiological logging and FPIs as a shared optimum. The decision to log physical-to-a-page, logical-within-it records plus a first-touch full-page image is close to universal: Oracle, SQL Server, and InnoDB all do some form of it (InnoDB’s doublewrite buffer is a complementary torn-page defense rather than a replacement). PostgreSQL’s refinements — stripping the pd_lower/pd_upper hole and optionally compressing the image with pglz/lz4/zstd — are engineering rather than architecture, but they materially shrink WAL volume, which dominates replication bandwidth and archive cost. Database Internals (Petrov 2019, knowledge/research/dbms-papers/ / raw/dbms/textbooks/Database Internals.pdf) surveys this design space and frames FPIs as the standard answer to the absence of atomic page writes on commodity storage.

Compile-time dispatch versus pluggable registries. The X-macro construction of RmgrTable[] is a deliberately static design: the built-in set is frozen at compile time and the wire format is literally the order of lines in rmgrlist.h. The addition of custom rmgrs in PostgreSQL 15 (ids 128–255 via RegisterCustomRmgr) bolts a small dynamic registry onto the static core, which is why custom registration is hemmed in by so many invariants: it must happen during shared_preload_libraries init, before any WAL is read, so the table is effectively immutable again by the time recovery runs. This is a narrower extensibility contract than, say, SQLite’s VFS layer or a fully plugin-based log-record registry, and the narrowness is the point: a resource manager participates in crash recovery, where a mismatched or missing entry is silent data loss, so the system trades flexibility for the guarantee that a WAL stream means exactly one thing. The generic_xlog machinery is the pressure valve: an extension that only needs crash-safe edits of standard pages gets them with zero custom redo code, at the price of byte-level rather than semantic deltas.
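
The X-macro pattern behind that static table can be shown in a few lines. This is a single-file sketch with hypothetical names (TOY_RMGR_LIST, ToyRmgrData, the toy_*_redo functions); it passes the expanding macro as a parameter instead of re-including a header under different PG_RMGR definitions, which is a variation chosen only so the example fits in one file.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef int (*toy_redo_fn) (int payload);

static int toy_xlog_redo(int payload)  { return payload + 1; }
static int toy_heap_redo(int payload)  { return payload * 2; }
static int toy_btree_redo(int payload) { return payload - 1; }

/* The single source of truth: one line per resource manager. */
#define TOY_RMGR_LIST(X) \
    X(TOY_RM_XLOG_ID,  "XLOG",  toy_xlog_redo) \
    X(TOY_RM_HEAP_ID,  "Heap",  toy_heap_redo) \
    X(TOY_RM_BTREE_ID, "Btree", toy_btree_redo)

/* Expansion 1: enumerators, numbered by their position in the list. */
#define TOY_AS_ENUM(sym, name, redo) sym,
typedef enum ToyRmgrId
{
    TOY_RMGR_LIST(TOY_AS_ENUM)
    TOY_RM_NEXT_ID
} ToyRmgrId;
#undef TOY_AS_ENUM

typedef struct ToyRmgrData
{
    const char *rm_name;
    toy_redo_fn rm_redo;
} ToyRmgrData;

/* Expansion 2: the dispatch table, indexed by the same enumerators. */
#define TOY_AS_ENTRY(sym, name, redo) [sym] = { name, redo },
static const ToyRmgrData ToyRmgrTable[] = {
    TOY_RMGR_LIST(TOY_AS_ENTRY)
};
#undef TOY_AS_ENTRY

/* Route a record to its replay routine by resource manager id. */
static int
toy_dispatch(ToyRmgrId rmid, int payload)
{
    assert(rmid < TOY_RM_NEXT_ID);
    return ToyRmgrTable[rmid].rm_redo(payload);
}
```

Since both expansions read the same list, reordering or inserting a line renumbers the ids and the table together, which is the sense in which line order doubles as the wire format.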

Research frontiers. Three threads are worth flagging. First, log shipping and scale-out: because the rmgr layer makes the WAL a clean, self-describing stream, it doubles as the substrate for both physical replication and logical decoding (the rm_decode callbacks), and systems like Amazon Aurora push this further by treating the redo log itself as the durability boundary — “the log is the database” — shipping WAL to a storage tier that materializes pages lazily. Second, non-volatile memory and shorter logs: a line of research asks whether persistent memory lets engines shrink or eliminate FPIs and even the redo log, since byte-addressable durability changes the torn-page calculus that motivated FPIs in the first place. Third, WAL as a contention bottleneck: the single ordered log is a serialization point, and just as the lock manager became a multicore bottleneck addressed by partitioning (knowledge/research/dbms-papers/scalable-lock-manager.md), the WAL insert path is partitioned in PostgreSQL via multiple insertion locks (postgres-xlog-wal.md) — an ongoing tension between the strict total order ARIES assumes and the parallelism modern hardware demands. The rmgr table itself is not the bottleneck — it is a pure dispatch on a record already parsed — but it sits directly downstream of these scalability questions.

  • Code (REL_18_STABLE, commit 273fe94, under /data/hgryoo/references/postgres):
    • src/include/access/rmgrlist.h — the PG_RMGR X-macro list (the 22 built-in resource managers and their callback columns).
    • src/include/access/rmgr.h — RmgrIds enum, RM_MAX_ID / RM_MIN_CUSTOM_ID / RM_EXPERIMENTAL_ID, RmgrIdIsBuiltin / RmgrIdIsCustom.
    • src/backend/access/transam/rmgr.c — RmgrTable[], RmgrStartup / RmgrCleanup, RmgrNotFound, RegisterCustomRmgr, pg_get_wal_resource_managers.
    • src/include/access/xlog_internal.h — RmgrData struct, GetRmgr / RmgrIdExists accessors.
    • src/include/access/xlogrecord.h — XLogRecord header, XLogRecordBlockHeader, XLogRecordBlockImageHeader, the BKPBLOCK_* / BKPIMAGE_* / XLR_* flag families.
    • src/backend/access/transam/xlogutils.c — XLogReadBufferForRedo / …Extended, XLogReadBufferExtended, log_invalid_page / XLogCheckInvalidPages.
    • src/backend/access/transam/generic_xlog.c + src/include/access/generic_xlog.h — the Generic rmgr delta engine (GenericXLogStart / RegisterBuffer / Finish, computeDelta, applyPageRedo, generic_redo, generic_mask).
    • src/backend/access/heap/heapam_xlog.c — heap_redo opcode dispatch (illustrative of the rmgr rm_redo pattern).
    • src/backend/access/rmgrdesc/heapdesc.c — heap_desc / heap_identify descriptor pair.
    • src/bin/pg_waldump/rmgrdesc.c — the frontend RmgrDescTable built from the same rmgrlist.h.
    • src/backend/access/transam/README — the canonical narrative of WAL record construction, buffer registration, and full-page-image rules.
  • Theory / textbooks:
    • knowledge/research/dbms-papers/aries.md — Mohan et al. 1992, ARIES; WAL invariant, LSN/pageLSN, redo/undo, CLRs, resource managers.
    • knowledge/research/dbms-papers/systemr.md — the System R recovery lineage that ARIES generalizes.
    • knowledge/research/dbms-papers/scalable-lock-manager.md — multicore partitioning of a single serialization point (analogy for WAL insert locks).
    • raw/dbms/textbooks/Database Internals.pdf (Petrov 2019) — WAL, full-page writes, buffer management, recovery in the broader engine landscape.
  • Cross-references within this knowledge base:
    • postgres-xlog-wal.md — WAL insertion, insert locks, segment/flush machinery (the emit side this doc defers to).
    • postgres-recovery-redo.md — the recovery driver loop, checkpoints, restartpoints (the replay driver this doc defers to).
    • postgres-xact.md — transaction commit/abort records (the Transaction rmgr).
    • postgres-heap-am.md, postgres-nbtree.md — the heap and btree rmgrs whose redo routines this doc uses as worked examples.
    • postgres-buffer-manager.md — the shared-buffer layer XLogReadBufferExtended drives during replay.
    • postgres-logical-decoding.md — the rm_decode callback consumers.