PostgreSQL pg_waldump — WAL Decoding and Inspection Utility
Contents:
- Theoretical Background
- Common DBMS Design
- PostgreSQL’s Approach
- Source Walkthrough
- Source verification (as of 2026-06-06)
- Beyond PostgreSQL — Comparative Designs & Research Frontiers
- Sources
Theoretical Background
A write-ahead log is opaque by design: it is a binary byte stream optimized
for sequential append, not for human inspection. The correctness guarantee it
provides — “if these bytes survived, these changes survived” — is valuable
precisely because the engine does not need to interpret the bytes during normal
operation. But that opacity is a liability the moment something goes wrong. A
DBA who sees a standby falling behind, an engineer debugging an unexpected
VACUUM stall, or a developer writing a new extension that emits WAL records
all need the same thing: a way to read the log and understand what is in it.
The general problem is WAL introspection: given a physical log file on disk, produce a human-readable rendering of each record’s identity, position, size, and payload summary. Three properties must hold for this to be safe and useful.
Offline read-only access. The tool must be able to read a WAL segment without the database being online — the primary use case is post-mortem analysis of a crashed server or inspection of an archived WAL stream that is not attached to a live instance. This means the tool cannot call into the buffer manager, shared memory, lock manager, or any other infrastructure that assumes a running server. It must be a pure reader operating on files.
Faithful record decoding. A WAL record is not self-describing prose; its
payload is a binary blob whose layout is defined by the resource manager (rmgr)
that wrote it. A Heap INSERT record, a B-tree page-split record, and a
checkpoint record all have the same wire framing but different payloads. The
introspection tool must be able to invoke the same desc / identify logic
that the recovery driver would use, but without executing the record’s redo
function. This requires the rmgr table to be present in the tool, with its
description callbacks but without its recovery logic.
Position and size accounting. Operators debugging replication lag or checkpoint pressure need more than record text. They need aggregate counts and sizes: how many bytes does the Heap rmgr consume per WAL segment? What fraction is full-page images? Which record type generates the most volume? This calls for a statistics accumulation mode layered on top of the basic record scan.
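These three properties shape pg_waldump’s design: it is a frontend-only file
reader, its plain record display reuses the rmgr description callbacks, and its
statistics mode (--stats) does the size accounting. Follow mode (--follow) is a
convenience layered on the same reader.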
The ARIES paper (Mohan et al. 1992) defines the log as a sequence of records,
each carrying a prevLSN back-pointer to the preceding record of the same
transaction, which is what makes backward traversal during undo possible.
PostgreSQL’s xl_prev field differs: it points to the previous record in the
log regardless of transaction. pg_waldump walks the log forward (each record
carries its own length, so the next record starts at the current LSN plus the
padded record length) and uses the xl_prev back-pointer (XLogRecGetPrev) to
report the predecessor LSN in each display line. See postgres-xlog-wal.md for
the record framing and postgres-wal-records-rmgr.md for the rmgr dispatch table.
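The forward-link arithmetic can be sketched in a few lines of C. This is a simplified illustration, not PostgreSQL’s code: the names next_record_lsn, align_up, and SKETCH_ALIGNMENT are hypothetical, the 8-byte alignment is an assumption that matches typical 64-bit builds, and the sketch ignores the page headers that a real reader must skip when a record crosses a page boundary.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed record alignment (8 bytes on typical 64-bit builds). */
#define SKETCH_ALIGNMENT 8u

/* Round len up to the next multiple of SKETCH_ALIGNMENT. */
static uint64_t
align_up(uint64_t len)
{
    return (len + (SKETCH_ALIGNMENT - 1)) & ~(uint64_t) (SKETCH_ALIGNMENT - 1);
}

/*
 * Hypothetical helper: given the LSN where a record starts and its total
 * length, return the LSN where the following record would start.
 * Page headers are deliberately ignored in this sketch.
 */
static uint64_t
next_record_lsn(uint64_t lsn, uint32_t tot_len)
{
    assert(tot_len > 0);
    return lsn + align_up(tot_len);
}
```

Under these assumptions a 57-byte record starting at 0x1000 is followed by a record at 0x1040, because 57 is padded up to 64.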
Common DBMS Design
Most production-quality DBMSs ship some form of WAL inspection tool. They converge on a small set of architectural choices that are worth naming before looking at PostgreSQL’s specific implementation.
Frontend-mode binary with shared library support
The standard pattern is a standalone binary compiled from the same source tree
as the server, sharing the WAL-reader and rmgr-description code but excluding
the recovery executor and all server-side infrastructure. MySQL/InnoDB ships
mysqlbinlog (operates on binlog files, which are MySQL’s logical WAL analog).
Oracle ships LogMiner, which runs as a server-side package rather than an
external binary. SQL Server ships fn_dblog and fn_dump_dblog as T-SQL
table-valued functions callable from a live connection. PostgreSQL’s design
choice — an external binary that links against the same C library used by the
server but runs entirely as a frontend process — sits between Oracle’s fully
server-integrated approach and a hypothetical third-party parser.
Pluggable page-read callback
A WAL reader that works both inside the server (reading from WAL buffers or
streaming-replication receive buffers) and outside it (reading from segment
files on disk) cannot hard-code the I/O path. The universal solution is a
page-read callback supplied by the caller. Inside the server, the callback
reads from shared WAL buffers. Outside the server, the callback opens segment
files directly. The WAL reader’s core loop — record framing, CRC checking,
block-reference decoding — is the same in both cases. pg_waldump supplies its
own WALDumpReadPage callback for the file-based path.
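The callback pattern can be illustrated with a small, self-contained sketch. Everything here is hypothetical (the Reader and MemSource types, the page_read_fn signature, and the in-memory backing store that stands in for a segment file); the point is only that the fetch helper never knows where the bytes come from.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 8192

/* Caller-supplied callback: copy the page at page_off into buf and
 * return the number of bytes copied, or -1 if nothing is available. */
typedef int (*page_read_fn) (void *private_data, size_t page_off, char *buf);

/* Hypothetical reader: its core logic is independent of the I/O path. */
typedef struct Reader
{
    page_read_fn page_read;
    void       *private_data;
} Reader;

/* Example backing store: a byte array standing in for a segment file. */
typedef struct MemSource
{
    const char *data;
    size_t      len;
} MemSource;

static int
mem_page_read(void *private_data, size_t page_off, char *buf)
{
    MemSource  *src = private_data;
    size_t      n;

    if (page_off >= src->len)
        return -1;              /* past the end of the source */
    n = src->len - page_off;
    if (n > SKETCH_PAGE_SIZE)
        n = SKETCH_PAGE_SIZE;
    memcpy(buf, src->data + page_off, n);
    return (int) n;
}

/* Core helper: fetch one page through whichever callback was installed. */
static int
reader_fetch_page(Reader *r, size_t page_off, char *buf)
{
    assert(r->page_read != NULL);
    return r->page_read(r->private_data, page_off, buf);
}
```

Swapping mem_page_read for a file-backed function changes the I/O path without touching reader_fetch_page, which is the property the callback design is after.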
Filter-then-display pipeline
Raw WAL output is voluminous. A WAL segment defaults to 16 MB and can contain
thousands of records. Every inspection tool offers filters to narrow the output:
by time range (Oracle LogMiner START_TIME/END_TIME), by object
(OBJECT_NAME), by transaction. pg_waldump exposes filters by rmgr, XID,
relation (tablespace OID / database OID / relation filenode), block number,
fork, and full-page-write presence. The filter is applied in a tight main loop
after each successful XLogReadRecord; matching records are displayed, skipped
records advance the loop with no output.
Statistics mode as an independent output path
An aggregate view of WAL content — “how many bytes does each record type
contribute over this segment?” — is qualitatively different from per-record
display. It requires accumulating counts and sizes across the scan, then
printing a summary table. Separating the accumulation path from the display
path keeps both clean: the display path is stateless (one record in, one line
out), while the statistics path is stateful (N records in, one table out).
pg_waldump implements this split with XLogDumpDisplayRecord and
XLogDumpDisplayStats as two mutually exclusive output paths inside the same
main loop.
PostgreSQL’s Approach
pg_waldump is a single-file frontend binary (src/bin/pg_waldump/pg_waldump.c,
1323 lines). Its companion rmgrdesc.c provides the RmgrDescTable — a
stripped-down version of the server-side rmgr table that carries only the
rm_name, rm_desc, and rm_identify callbacks, not the redo or
startup/cleanup callbacks. The binary links against libpgcommon and
libpgport for utility functions but against none of the server backend
libraries.
```mermaid
flowchart TB
CLI["main()<br/>option parsing<br/>XLogDumpConfig + XLogDumpPrivate"]
DIR["identify_target_directory()<br/>locate WAL segment dir<br/>probe ., pg_wal, $PGDATA/pg_wal"]
ALLOC["XLogReaderAllocate()<br/>WalSegSz, waldir<br/>XL_ROUTINE callbacks"]
FIND["XLogFindNextRecord()<br/>scan forward to first valid record"]
LOOP["main loop<br/>XLogReadRecord()"]
FILTER{"filters pass?<br/>rmgr / xid / relation<br/>block / fork / fpw"}
DISPLAY["XLogDumpDisplayRecord()<br/>rm_name + rm_identify + rm_desc<br/>+ XLogRecGetBlockRefInfo"]
STATS["XLogRecStoreStats()<br/>accumulate XLogStats"]
FPW["XLogRecordSaveFPWs()<br/>RestoreBlockImage → file"]
SUMMARY["XLogDumpDisplayStats()<br/>per-rmgr/record table"]
END["XLogReaderFree()"]
CLI --> DIR --> ALLOC --> FIND --> LOOP
LOOP --> FILTER
FILTER -->|"no"| LOOP
FILTER -->|"yes, --stats"| STATS --> FPW --> LOOP
FILTER -->|"yes, display"| DISPLAY --> FPW --> LOOP
LOOP -->|"end or --limit"| SUMMARY --> END
```
Figure 1 — pg_waldump main control flow. The two output paths (display and
stats) are mutually exclusive inside the loop; --save-fullpage runs on every
matching record regardless of mode.
Startup: locating WAL and reading the segment header
main calls identify_target_directory to find the directory that contains
WAL segment files. The search order is: (1) the explicit --path argument;
(2) that path plus /pg_wal; (3) if no --path, the current directory, then
pg_wal, then $PGDATA/pg_wal. For each candidate, search_directory opens
a file matching the WAL filename pattern and reads the first 8 KB page to
extract WalSegSz from the XLogLongPageHeader:

```c
// search_directory, pg_waldump.c (condensed)
r = read(fd, buf.data, XLOG_BLCKSZ);
if (r == XLOG_BLCKSZ)
{
    XLogLongPageHeader longhdr = (XLogLongPageHeader) buf.data;

    WalSegSz = longhdr->xlp_seg_size;
    if (!IsValidWalSegSize(WalSegSz))
        /* error: not a power of two between 1 MB and 1 GB */
        exit(1);
}
```

IsValidWalSegSize checks that the size is a power of two in [1 MB, 1 GB].
The segment size is a compile-time default (16 MB) but is overridable at
initdb time, so pg_waldump must discover it from the file rather than
assuming the compiled default. This matters for clusters initialized with a
non-default --wal-segsize.
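The validity rule described above (a power of two between 1 MB and 1 GB) can be written as a small standalone check. This is a sketch of the rule as stated in the text, not the actual IsValidWalSegSize macro; the names is_valid_seg_size, SKETCH_SEG_MIN, and SKETCH_SEG_MAX are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_SEG_MIN (1024u * 1024u)          /* 1 MB */
#define SKETCH_SEG_MAX (1024u * 1024u * 1024u)  /* 1 GB */

/* True if size is a power of two within [1 MB, 1 GB]. */
static bool
is_valid_seg_size(uint32_t size)
{
    /* A power of two has exactly one bit set, so size & (size - 1) is 0. */
    bool        power_of_two = size != 0 && (size & (size - 1)) == 0;

    return power_of_two && size >= SKETCH_SEG_MIN && size <= SKETCH_SEG_MAX;
}
```

A header field that fails this check is a strong hint that the file is not a WAL segment at all, which is why the tool refuses to continue rather than guessing.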
XLogReaderState initialization and the three callbacks
After locating the WAL directory and segment size, main calls
XLogReaderAllocate to create an XLogReaderState. The three file-I/O
callbacks are supplied via the XL_ROUTINE macro, which initializes a
compound literal of type XLogReaderRoutine:

```c
// main, pg_waldump.c (condensed)
xlogreader_state =
    XLogReaderAllocate(WalSegSz, waldir,
                       XL_ROUTINE(.page_read = WALDumpReadPage,
                                  .segment_open = WALDumpOpenSegment,
                                  .segment_close = WALDumpCloseSegment),
                       &private);
```

WALDumpOpenSegment builds the segment filename from (tli, segno, segsize)
via XLogFileName and opens it read-only. In --follow mode it retries up to
10 times with 500 ms sleeps: a bounded polling wait (about 5 seconds in total)
that covers the gap between the server finishing the previous segment and
creating the next one.
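To make the (tli, segno, segsize) mapping concrete, here is a standalone sketch of the conventional 24-hex-digit segment naming scheme. The function wal_file_name is hypothetical and only mirrors the scheme as commonly documented (timeline, then the segment number split into a high and a low part, where each high part covers 4 GB of WAL); it is not the XLogFileName implementation itself.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Format a WAL segment file name into buf (needs at least 25 bytes). */
static void
wal_file_name(char *buf, size_t buflen, uint32_t tli, uint64_t segno,
              uint32_t seg_size)
{
    uint64_t    segs_per_id;

    assert(seg_size > 0);
    /* Number of segments that fit in one 4 GB range of the LSN space. */
    segs_per_id = UINT64_C(0x100000000) / seg_size;
    snprintf(buf, buflen, "%08X%08X%08X",
             (unsigned) tli,
             (unsigned) (segno / segs_per_id),
             (unsigned) (segno % segs_per_id));
}
```

With 16 MB segments there are 256 segments per 4 GB range, so segment number 256 on timeline 1 comes out as 000000010000000100000000.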
WALDumpReadPage is the core I/O callback — pg_waldump’s file-based equivalent
of the server’s read_local_xlog_page. It checks whether the request falls
within the configured end-pointer (private.endptr), computes how many bytes
to read, and calls WALRead. If WALRead fails it reports the segment filename
and offset from the WALReadError struct. A read that exceeds endptr sets
private.endptr_reached = true and returns -1 to stop the loop:

```c
// WALDumpReadPage, pg_waldump.c (condensed)
static int
WALDumpReadPage(XLogReaderState *state, XLogRecPtr targetPagePtr, int reqLen,
                XLogRecPtr targetPtr, char *readBuff)
{
    XLogDumpPrivate *private = state->private_data;
    int         count = XLOG_BLCKSZ;
    WALReadError errinfo;

    if (private->endptr != InvalidXLogRecPtr)
    {
        if (targetPagePtr + XLOG_BLCKSZ <= private->endptr)
            count = XLOG_BLCKSZ;
        else if (targetPagePtr + reqLen <= private->endptr)
            count = private->endptr - targetPagePtr;
        else
        {
            private->endptr_reached = true;
            return -1;          /* stops the main loop */
        }
    }

    if (!WALRead(state, readBuff, targetPagePtr, count, private->timeline,
                 &errinfo))
    {
        WALOpenSegment *seg = &errinfo.wre_seg;
        char        fname[MAXPGPATH];

        XLogFileName(fname, seg->ws_tli, seg->ws_segno,
                     state->segcxt.ws_segsize);
        /* pg_fatal with fname + errinfo.wre_off ... */
    }

    return count;
}
```

The three branches on endptr matter: a full page is read when the whole page
lies before the boundary; a partial page (count = endptr - targetPagePtr) is
read when only reqLen bytes are needed before the boundary; and the boundary
itself trips endptr_reached. WALRead (shared with the server) handles the
segment_open/segment_close calls into WALDumpOpenSegment /
WALDumpCloseSegment under the hood, so the callback never opens files itself.
WALDumpCloseSegment simply closes the file descriptor and resets it to -1.
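The three-way decision on endptr is easy to isolate as a pure function, which makes the boundary cases testable without any file I/O. The sketch below assumes an 8 KB page and uses 0 for "no end pointer"; bytes_to_read and SKETCH_BLCKSZ are hypothetical names introduced for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_BLCKSZ 8192

/*
 * Return how many bytes of the page at page_ptr should be read, or -1 if
 * the request would cross endptr (in which case *reached is set).
 * endptr == 0 means no end pointer was configured.
 */
static int
bytes_to_read(uint64_t page_ptr, int req_len, uint64_t endptr, bool *reached)
{
    assert(req_len >= 0 && req_len <= SKETCH_BLCKSZ);

    if (endptr == 0)
        return SKETCH_BLCKSZ;               /* unbounded scan */
    if (page_ptr + SKETCH_BLCKSZ <= endptr)
        return SKETCH_BLCKSZ;               /* whole page lies before the end */
    if (page_ptr + (uint64_t) req_len <= endptr)
        return (int) (endptr - page_ptr);   /* partial page up to the end */
    *reached = true;                        /* request crosses the end */
    return -1;
}
```

For a page at offset 8192 and an end pointer of 8692, a 100-byte request yields a 500-byte partial read, while a 600-byte request trips the boundary.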
The main read loop
After XLogFindNextRecord positions the reader at the first valid record at or
after the requested start LSN, the main loop calls XLogReadRecord repeatedly:

```c
// main, pg_waldump.c (condensed)
for (;;)
{
    if (time_to_stop)
        break;                  /* SIGINT handler set this */

    record = XLogReadRecord(xlogreader_state, &errormsg);
    if (!record)
    {
        if (!config.follow || private.endptr_reached)
            break;
        pg_usleep(1000000L);    /* follow mode: sleep 1 s and retry */
        continue;
    }

    /* apply filters */
    if (config.filter_by_rmgr_enabled &&
        !config.filter_by_rmgr[record->xl_rmid])
        continue;
    if (config.filter_by_xid_enabled &&
        config.filter_by_xid != record->xl_xid)
        continue;
    if (config.filter_by_extended &&
        !XLogRecordMatchesRelationBlock(...))
        continue;
    if (config.filter_by_fpw && !XLogRecordHasFPW(xlogreader_state))
        continue;

    /* output */
    if (!config.quiet)
    {
        if (config.stats)
        {
            XLogRecStoreStats(&stats, xlogreader_state);
            stats.endptr = ...;
        }
        else
            XLogDumpDisplayRecord(&config, xlogreader_state);
    }
    if (config.save_fullpage_path)
        XLogRecordSaveFPWs(xlogreader_state, config.save_fullpage_path);

    config.already_displayed_records++;
    if (config.stop_after_records > 0 &&
        config.already_displayed_records >= config.stop_after_records)
        break;
}
```

XLogReadRecord handles page boundaries, record spanning, CRC verification,
and back-link validation internally. pg_waldump’s loop sees only a decoded
XLogRecord * (or NULL on end-of-WAL or error) and the unpacked block
references in the XLogReaderState.
The real per-record dispatch — the heart of the loop, after filters pass — selects between the stats and display output paths and then optionally extracts full-page images. This is the verbatim shape of the body:

```c
// main (loop body), pg_waldump.c (verbatim)
/* perform any per-record work */
if (!config.quiet)
{
    if (config.stats == true)
    {
        XLogRecStoreStats(&stats, xlogreader_state);
        stats.endptr = xlogreader_state->EndRecPtr;
    }
    else
        XLogDumpDisplayRecord(&config, xlogreader_state);
}

/* save full pages if requested */
if (config.save_fullpage_path != NULL)
    XLogRecordSaveFPWs(xlogreader_state, config.save_fullpage_path);

/* check whether we printed enough */
config.already_displayed_records++;
if (config.stop_after_records > 0 &&
    config.already_displayed_records >= config.stop_after_records)
    break;
```

Two details are load-bearing. First, stats.endptr is refreshed on every stats
record (not once at the end) because the loop may stop early via --limit, and
the summary table’s byte-range header must reflect the last record actually
counted. Second, XLogRecordSaveFPWs is outside the if (!config.quiet) block,
so --save-fullpage --quiet still extracts images while suppressing text — a
deliberate decoupling of the FPI side-channel from the display path.
```mermaid
flowchart TB
TOP["loop top"]
SIG{"time_to_stop?<br/>SIGINT handler"}
READ["record = XLogReadRecord()"]
NULLQ{"record == NULL?"}
FOLLOW{"follow and not<br/>endptr_reached?"}
SLEEP["pg_usleep 1 s<br/>continue"]
EXIT["break loop"]
FRMGR{"rmgr filter pass?"}
FXID{"xid filter pass?"}
FEXT{"relation/block/fork<br/>filter pass?"}
FFPW{"fpw filter pass?"}
OUT{"stats mode?"}
STORE["XLogRecStoreStats()<br/>stats.endptr = EndRecPtr"]
DISP["XLogDumpDisplayRecord()"]
FPW["save_fullpage_path?<br/>XLogRecordSaveFPWs()"]
LIMIT{"stop_after_records<br/>reached?"}
TOP --> SIG
SIG -->|"yes"| EXIT
SIG -->|"no"| READ --> NULLQ
NULLQ -->|"yes"| FOLLOW
FOLLOW -->|"yes"| SLEEP --> TOP
FOLLOW -->|"no"| EXIT
NULLQ -->|"no"| FRMGR
FRMGR -->|"no"| TOP
FRMGR -->|"yes"| FXID
FXID -->|"no"| TOP
FXID -->|"yes"| FEXT
FEXT -->|"no"| TOP
FEXT -->|"yes"| FFPW
FFPW -->|"no"| TOP
FFPW -->|"yes"| OUT
OUT -->|"yes"| STORE --> FPW
OUT -->|"no, display"| DISP --> FPW
FPW --> LIMIT
LIMIT -->|"yes"| EXIT
LIMIT -->|"no"| TOP
```
Figure 2 — the decode loop. Each iteration reads one record, runs the four
filter gates in fixed order (rmgr, xid, extended relation/block/fork, fpw) —
a continue on any failed gate returns to the loop top with no output — then
dispatches to exactly one of the stats or display paths. --save-fullpage and
the --limit check run after either output path. The filter gates are
evaluated in source order and stop at the first failure, so the cheapest checks
(single array index, single integer compare) precede the per-block-iteration
checks.
Filters
XLogDumpConfig holds all filter state. Four filter groups can be combined:
- rmgr (--rmgr): a boolean array filter_by_rmgr[RM_MAX_ID + 1]. The --rmgr=list option calls print_rmgr_list, which iterates GetRmgrDesc(i)->rm_name for i in 0..RM_MAX_BUILTIN_ID, printing the builtin rmgr names.
- xid (--xid): a single TransactionId.
- relation + block + fork (--relation, --block, --fork): checked by XLogRecordMatchesRelationBlock, which iterates over all block IDs in the record via XLogRecMaxBlockId / XLogRecGetBlockTagExtended.
- full-page image presence (--fullpage): checked by XLogRecordHasFPW, which iterates block IDs and tests XLogRecHasBlockImage.

The filter_by_extended flag is the gate for the relation/block/fork check;
it is set whenever any of --relation, --block, or --fork is supplied.
--block requires --relation; the code enforces this after option parsing.
The --rmgr option (case 'r' in the getopt_long switch) resolves a name to
an rmgr ID at parse time and sets one slot of the filter_by_rmgr[] boolean
array. It accepts three forms — the literal "list" (print names and exit), a
"custom###" numeric ID for an unloaded custom rmgr, and a builtin name matched
case-insensitively against the rmgr table:

```c
// main (case 'r'), pg_waldump.c (condensed)
if (pg_strcasecmp(optarg, "list") == 0)
{
    print_rmgr_list();
    exit(EXIT_SUCCESS);
}

/* "custom###": the module isn't loaded, so match by numeric ID */
if (sscanf(optarg, "custom%03d", &rmid) == 1)
{
    if (!RmgrIdIsCustom(rmid))
        /* error: custom resource manager does not exist */ ;
    config.filter_by_rmgr[rmid] = true;
    config.filter_by_rmgr_enabled = true;
}
else
{
    /* then look for builtin rmgrs by name */
    for (rmid = 0; rmid <= RM_MAX_BUILTIN_ID; rmid++)
        if (pg_strcasecmp(optarg, GetRmgrDesc(rmid)->rm_name) == 0)
        {
            config.filter_by_rmgr[rmid] = true;
            config.filter_by_rmgr_enabled = true;
            break;
        }
    if (rmid > RM_MAX_BUILTIN_ID)
        /* error: resource manager does not exist */ ;
}
```

Resolving the name to an integer ID at parse time is what lets the main loop’s
filter check be a single O(1) array index (config.filter_by_rmgr[record->xl_rmid])
rather than a string comparison per record. Repeating --rmgr sets multiple
slots, so the filter is a union (records matching any listed rmgr pass).
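The parse-time versus scan-time split can be shown with a toy version of the same idea. The name table below is a hypothetical, abbreviated stand-in for the real rmgr list, and enable_rmgr_filter / record_passes are illustrative names rather than pg_waldump functions.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, abbreviated name table; the real list is much longer. */
static const char *const sketch_rmgr_names[] = {
    "XLOG", "Transaction", "Heap", "Btree"
};
#define SKETCH_N_RMGRS \
    (sizeof(sketch_rmgr_names) / sizeof(sketch_rmgr_names[0]))

static bool filter_by_rmgr[SKETCH_N_RMGRS];

/* Case-insensitive string equality. */
static bool
names_equal(const char *a, const char *b)
{
    for (; *a && *b; a++, b++)
        if (tolower((unsigned char) *a) != tolower((unsigned char) *b))
            return false;
    return *a == *b;            /* both must end at the same position */
}

/* Parse time: resolve a name to its slot; returns false if unknown. */
static bool
enable_rmgr_filter(const char *name)
{
    for (size_t id = 0; id < SKETCH_N_RMGRS; id++)
        if (names_equal(name, sketch_rmgr_names[id]))
        {
            filter_by_rmgr[id] = true;
            return true;
        }
    return false;
}

/* Scan time: a single array index per record, no string work. */
static bool
record_passes(unsigned rmid)
{
    assert(rmid < SKETCH_N_RMGRS);
    return filter_by_rmgr[rmid];
}
```

Calling enable_rmgr_filter twice with different names sets two slots, which gives the union behaviour described above.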
Display: rm_desc and rm_identify
XLogDumpDisplayRecord produces one line per record. It calls
GetRmgrDesc(XLogRecGetRmid(record)) to obtain the RmgrDescData for this
record’s rmgr, then prints the fixed fields, then calls the two text callbacks:

```c
// XLogDumpDisplayRecord, pg_waldump.c (condensed)
const RmgrDescData *desc = GetRmgrDesc(XLogRecGetRmid(record));

XLogRecGetLen(record, &rec_len, &fpi_len);

printf("rmgr: %-11s len (rec/tot): %6u/%6u, tx: %10u, lsn: %X/%08X, prev %X/%08X, ",
       desc->rm_name, rec_len, XLogRecGetTotalLen(record),
       XLogRecGetXid(record),
       LSN_FORMAT_ARGS(record->ReadRecPtr),
       LSN_FORMAT_ARGS(xl_prev));

id = desc->rm_identify(info);
if (id == NULL)
    printf("desc: UNKNOWN (%x) ", info & ~XLR_INFO_MASK);
else
    printf("desc: %s ", id);

initStringInfo(&s);
desc->rm_desc(&s, record);
printf("%s", s.data);

resetStringInfo(&s);            /* clear the buffer before reusing it */
XLogRecGetBlockRefInfo(record, true, config->bkp_details, &s, NULL);
printf("%s", s.data);
```

rm_identify maps the rmgr-specific opcode (the upper nibble of xl_info) to a
string such as "INSERT", "UPDATE", or "LOCK". rm_desc appends the payload
summary: for a Heap INSERT, this is the target block number and offset; for a
checkpoint, it is the redo pointer and timeline. Both callbacks are defined in
the rmgr’s *desc.c file (e.g. heapdesc.c, nbtdesc.c). Those files are compiled
into both the server and pg_waldump, and pg_waldump’s rmgrdesc.c wires them
into its table through the same PG_RMGR X-macro expansion.
XLogRecGetBlockRefInfo appends the block reference list — a comma-separated
blk N: rel T/D/R fork main blk B summary for each registered block, plus
FPW if the block carries a full-page image.
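The opcode lookup amounts to masking off the low nibble of the info byte and switching on what remains. The sketch below uses a made-up opcode table (the values and the name sketch_identify are assumptions for illustration, not a copy of any real rmgr’s callback) and returns NULL for unknown opcodes, which is the case the display code renders as UNKNOWN.

```c
#include <stddef.h>
#include <stdint.h>

/* Low nibble: assumed to be reserved for the generic WAL machinery. */
#define SKETCH_INFO_MASK 0x0F

/* Map the rmgr-specific opcode (upper nibble of info) to a name. */
static const char *
sketch_identify(uint8_t info)
{
    switch (info & 0xF0)        /* same as clearing SKETCH_INFO_MASK bits */
    {
        case 0x00:
            return "INSERT";
        case 0x10:
            return "DELETE";
        case 0x20:
            return "UPDATE";
        default:
            return NULL;        /* caller prints UNKNOWN plus the raw bits */
    }
}
```

Because the low nibble is ignored, 0x20 and 0x2A identify as the same operation in this sketch.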
Statistics mode
When --stats is supplied, XLogDumpDisplayRecord is replaced by
XLogRecStoreStats (from src/include/access/xlogstats.h) which increments
XLogStats.rmgr_stats[rmid] and optionally record_stats[rmid][xl_info >> 4].
At the end of the scan, XLogDumpDisplayStats prints the summary:

```c
// XLogDumpDisplayStats, pg_waldump.c (condensed)
for (ri = 0; ri <= RM_MAX_ID; ri++)
{
    if (!RmgrIdIsValid(ri))
        continue;
    desc = GetRmgrDesc(ri);
    if (!config->stats_per_record)
    {
        count = stats->rmgr_stats[ri].count;
        /* ... */
        XLogDumpStatsRow(desc->rm_name, count, total_count, rec_len,
                         total_rec_len, fpi_len, total_fpi_len, tot_len,
                         total_len);
    }
    else
    {
        for (rj = 0; rj < MAX_XLINFO_TYPES; rj++)
        {
            /* skip zero-count entries */
            id = desc->rm_identify(rj << 4);
            XLogDumpStatsRow(psprintf("%s/%s", desc->rm_name, id), ...);
        }
    }
}
```

MAX_XLINFO_TYPES is 16 (the 4-bit opcode space per rmgr). The stats table
has four columns — record count (%), record-body bytes (%), FPI bytes (%),
combined bytes (%) — each shown with a percentage of the column total.
--stats=record produces per-rmgr/opcode rows; --stats alone produces
per-rmgr rows. The percentages in per-row cells are against the column total;
the final Total row shows FPI share as [xx.xx%] against the row total.
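The accumulate-then-summarize shape of statistics mode can be reduced to a few lines. This sketch is not the XLogRecStoreStats implementation; the SketchRecStats type, the four-slot table, and the store_stats / pct helpers are hypothetical and only mirror the {count, rec_len, fpi_len} bookkeeping described in the text.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_N_RMGRS 4

typedef struct SketchRecStats
{
    uint64_t    count;          /* number of records seen */
    uint64_t    rec_len;        /* record-body bytes */
    uint64_t    fpi_len;        /* full-page-image bytes */
} SketchRecStats;

static SketchRecStats rmgr_stats[SKETCH_N_RMGRS];

/* Stateful path: fold one record into its rmgr's slot. */
static void
store_stats(unsigned rmid, uint32_t rec_len, uint32_t fpi_len)
{
    assert(rmid < SKETCH_N_RMGRS);
    rmgr_stats[rmid].count++;
    rmgr_stats[rmid].rec_len += rec_len;
    rmgr_stats[rmid].fpi_len += fpi_len;
}

/* Share of part in total as a percentage; 0 when total is 0. */
static double
pct(uint64_t part, uint64_t total)
{
    return total == 0 ? 0.0 : 100.0 * (double) part / (double) total;
}
```

The summary pass then needs only two loops over the table: one to compute column totals and one to print each row with pct applied to its cells.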
Full-page image extraction
--save-fullpage=DIR enables XLogRecordSaveFPWs, which runs after every
matching record (regardless of display/stats mode). It iterates over block IDs,
skips any that lack XLogRecHasBlockImage, calls RestoreBlockImage to
decompress the stored image into a PGAlignedBlock buffer, then writes it to a
file named:
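
```text
<DIR>/<tli>-<LSN_hi>-<LSN_lo>.<spcOid>.<dbOid>.<relNumber>.<blk>_<fork>
```

The two halves of the LSN are joined with a hyphen, since a slash cannot appear
inside a file name. This is the same image that the recovery driver would apply to a data page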
during redo. The file is a raw 8 KB (or BLCKSZ-byte) page — it can be opened
with any tool that understands the PostgreSQL page layout (see
postgres-page-layout.md).
Custom rmgr fallback
pg_waldump runs without loading extension modules. When it encounters a record
with a custom rmgr ID (RM_MIN_CUSTOM_ID or above), GetRmgrDesc calls
initialize_custom_rmgrs on the first such lookup, which populates
CustomRmgrDesc[] with numeric names of the form custom### (where ### is the
rmgr ID, zero-padded to three digits) and stub default_desc / default_identify
callbacks. The stub rm_identify returns NULL (triggering "UNKNOWN" in the
display); default_desc appends just the raw rmgr ID. The --rmgr option accepts
"custom###" names in this same convention so a user can filter on a numeric
custom rmgr ID even without the module loaded.
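Lazy initialization of the placeholder names might look like the sketch below. It assumes the custom ID range is 128 through 255 and that the numeric suffix is the rmgr ID itself, which is what the "custom%03d" parsing followed by the RmgrIdIsCustom check implies; the constants and the custom_rmgr_name function are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Assumed custom rmgr ID range. */
#define SKETCH_MIN_CUSTOM_ID 128
#define SKETCH_MAX_CUSTOM_ID 255
#define SKETCH_N_CUSTOM (SKETCH_MAX_CUSTOM_ID - SKETCH_MIN_CUSTOM_ID + 1)

static char custom_names[SKETCH_N_CUSTOM][16];
static bool custom_initialized = false;

/* Return the placeholder name for a custom rmgr ID, building the whole
 * name table on the first call. */
static const char *
custom_rmgr_name(int rmid)
{
    assert(rmid >= SKETCH_MIN_CUSTOM_ID && rmid <= SKETCH_MAX_CUSTOM_ID);

    if (!custom_initialized)
    {
        for (int i = 0; i < SKETCH_N_CUSTOM; i++)
            snprintf(custom_names[i], sizeof(custom_names[i]),
                     "custom%03d", i + SKETCH_MIN_CUSTOM_ID);
        custom_initialized = true;
    }
    return custom_names[rmid - SKETCH_MIN_CUSTOM_ID];
}
```

Deferring the table build to the first custom lookup means a scan that only ever sees builtin rmgrs pays nothing for it.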
Follow mode
--follow keeps the loop alive after end-of-WAL rather than exiting. When
XLogReadRecord returns NULL (no further record is available yet) and
endptr_reached is false, the loop sleeps 1 second and retries. WALDumpOpenSegment also
retries up to 10 times with 500 ms delays when the next segment file does not
yet exist — covering the window after the server has written the final record
of the previous segment but before it has created the new one. SIGINT sets
the volatile sig_atomic_t time_to_stop flag, which the main loop checks at
the top of each iteration.
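The signal-flag pattern used for clean shutdown is standard C and can be shown in isolation. In this sketch, on_sigint, install_sigint_handler, and run_loop are hypothetical names, and the loop body is a placeholder for the read-and-display work.

```c
#include <assert.h>
#include <signal.h>

/* Written from the handler, read from the loop. */
static volatile sig_atomic_t time_to_stop = 0;

static void
on_sigint(int signo)
{
    (void) signo;
    time_to_stop = 1;           /* only set a flag inside the handler */
}

static void
install_sigint_handler(void)
{
    signal(SIGINT, on_sigint);
}

/* Run up to max_iterations, checking the flag at the top of each one.
 * Returns the number of iterations that actually ran. */
static int
run_loop(int max_iterations)
{
    int         done = 0;

    assert(max_iterations >= 0);
    while (done < max_iterations && !time_to_stop)
    {
        /* ... read and process one record here ... */
        done++;
    }
    return done;
}
```

Keeping the handler down to a single flag assignment is what makes it safe: all the real work, including cleanup, happens in the main loop after it notices the flag.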
Source Walkthrough
Anchor on symbol names, not line numbers. Line numbers in the table below are hints scoped to commit 273fe94 (REL_18_STABLE, 2026-06-06).
Main driver (pg_waldump.c)
- main: option parsing (getopt_long over 18 long options), directory discovery, XLogReaderAllocate, XLogFindNextRecord, main read loop, final XLogDumpDisplayStats if --stats, XLogReaderFree.
- sigint_handler: sets time_to_stop = true; checked at loop top.
- identify_target_directory: candidate search (explicit path, path + /pg_wal, then ., pg_wal, $PGDATA/pg_wal).
- search_directory: opens a WAL file (by name or by scanning for any IsXLogFileName match), reads first page to discover WalSegSz.
- verify_directory / open_file_in_directory / split_path: path helpers.
- create_fullpage_directory: create output dir for --save-fullpage via pg_check_dir / pg_mkdir_p.
- WALDumpOpenSegment: XLogReaderRoutine.segment_open callback; retries 10 × 500 ms in follow mode.
- WALDumpCloseSegment: XLogReaderRoutine.segment_close callback.
- WALDumpReadPage: XLogReaderRoutine.page_read callback; calls WALRead with endptr enforcement.
- XLogRecordMatchesRelationBlock: iterates block IDs via XLogRecGetBlockTagExtended; tests RelFileLocatorEquals, block number, fork number.
- XLogRecordHasFPW: iterates block IDs; tests XLogRecHasBlockImage.
- XLogRecordSaveFPWs: iterates block IDs, RestoreBlockImage, writes raw page to --save-fullpage dir.
- XLogDumpDisplayRecord: formats the per-record display line: rm_name, lengths, xid, LSN, prev LSN, rm_identify(info), rm_desc(s, record), XLogRecGetBlockRefInfo.
- XLogDumpStatsRow: formats one row of the stats table with four count/pct columns.
- XLogDumpDisplayStats: two-pass stats printer: first pass totals; second pass per-rmgr (or per-rmgr/opcode) rows; final Total row with FPI share.
- print_rmgr_list: prints builtin rmgr names for --rmgr=list.
- usage: help text listing all 18 options.
rmgr description layer (rmgrdesc.c, rmgrdesc.h)
- RmgrDescTable[RM_N_BUILTIN_IDS]: static array built by the PG_RMGR X-macro over access/rmgrlist.h; each entry holds rm_name, rm_desc, rm_identify. No redo, startup, or cleanup entries.
- CustomRmgrDesc[RM_N_CUSTOM_IDS]: lazy-initialized by initialize_custom_rmgrs on first custom-rmgr lookup; names follow the "custom###" pattern, where ### is the rmgr ID.
- GetRmgrDesc(rmid): returns &RmgrDescTable[rmid] for builtins; initializes and returns &CustomRmgrDesc[rmid - RM_MIN_CUSTOM_ID] for customs.
Key data structures
- XLogDumpPrivate: {timeline, startptr, endptr, endptr_reached}. Passed as private_data to XLogReaderState.
- XLogDumpConfig: all option state: display flags (quiet, bkp_details, follow, stats, stats_per_record), filter fields (filter_by_rmgr[], filter_by_xid, filter_by_relation, filter_by_relation_block, filter_by_relation_forknum, filter_by_fpw), stop_after_records, save_fullpage_path.
- XLogStats (from xlogstats.h): {count, startptr, endptr, rmgr_stats[RM_MAX_ID+1], record_stats[RM_MAX_ID+1][MAX_XLINFO_TYPES]}. Each XLogRecStats slot holds {count, rec_len, fpi_len}.
- RmgrDescData (from rmgrdesc.h in pg_waldump, mirroring rmgr.h): {rm_name, rm_desc, rm_identify}.
- XLogReaderRoutine (from xlogreader.h): {page_read, segment_open, segment_close} function pointers; XL_ROUTINE(...) initializes a compound literal.
Position hints (as of 2026-06-06, REL_18 273fe94)
| Symbol | File | Line |
|---|---|---|
| XLogDumpPrivate (typedef) | pg_waldump.c | 47 |
| XLogDumpConfig (typedef) | pg_waldump.c | 55 |
| sigint_handler | pg_waldump.c | 91 |
| print_rmgr_list | pg_waldump.c | 97 |
| verify_directory | pg_waldump.c | 113 |
| create_fullpage_directory | pg_waldump.c | 127 |
| split_path | pg_waldump.c | 160 |
| open_file_in_directory | pg_waldump.c | 188 |
| search_directory | pg_waldump.c | 209 |
| identify_target_directory | pg_waldump.c | 291 |
| WALDumpOpenSegment | pg_waldump.c | 337 |
| WALDumpCloseSegment | pg_waldump.c | 379 |
| WALDumpReadPage | pg_waldump.c | 388 |
| XLogRecordMatchesRelationBlock | pg_waldump.c | 437 |
| XLogRecordHasFPW | pg_waldump.c | 469 |
| XLogRecordSaveFPWs | pg_waldump.c | 489 |
| XLogDumpDisplayRecord | pg_waldump.c | 545 |
| XLogDumpStatsRow | pg_waldump.c | 584 |
| XLogDumpDisplayStats | pg_waldump.c | 625 |
| usage | pg_waldump.c | 755 |
| main | pg_waldump.c | 792 |
| RmgrDescTable[] | rmgrdesc.c | 37 |
| initialize_custom_rmgrs | rmgrdesc.c | 68 |
| GetRmgrDesc | rmgrdesc.c | 86 |
| XLogStats (struct) | xlogstats.h | 28 |
| XLogRecStoreStats | xlogstats.h | 41 |
| XLogReaderRoutine (struct) | xlogreader.h | 72 |
| XL_ROUTINE (macro) | xlogreader.h | 117 |
| XLogReaderAllocate | xlogreader.h | 331 |
Source verification (as of 2026-06-06)
Verified facts
- WalSegSz is always discovered from the first page of a segment file, never assumed from the compiled default. Verified in search_directory: it reads XLOG_BLCKSZ bytes, casts to XLogLongPageHeader, and reads xlp_seg_size. IsValidWalSegSize checks the power-of-two constraint before WalSegSz is used anywhere else.
- The three I/O callbacks are registered via XL_ROUTINE, which expands to a compound literal of XLogReaderRoutine. Verified in xlogreader.h: #define XL_ROUTINE(...) &(XLogReaderRoutine){__VA_ARGS__}. The XLogReaderAllocate signature takes XLogReaderRoutine *routine.
- --follow sleeps at two points: 500 ms in WALDumpOpenSegment (up to 10 retries for the next segment file) and 1 second in the main loop (when XLogReadRecord returns NULL). Verified in WALDumpOpenSegment (tries < 10, pg_usleep(500 * 1000)) and in the main loop (pg_usleep(1000000L)).
- The filter_by_rmgr[] check uses record->xl_rmid directly (the xl_rmid field of the XLogRecord header), not a secondary decoded field. Verified: !config.filter_by_rmgr[record->xl_rmid] in the main loop. XLogReadRecord returns the raw XLogRecord *, so xl_rmid is the actual header byte.
- --block requires --relation. Verified in main after getopt_long: if (config.filter_by_relation_block_enabled && !config.filter_by_relation_enabled) triggers a pg_log_error and goto bad_argument.
- Custom rmgr names in --rmgr are accepted as "custom###" three-digit IDs, consistent with rmgrdesc.c’s initialize_custom_rmgrs. Verified: the case 'r' branch calls sscanf(optarg, "custom%03d", &rmid) before the builtin-name loop, and initialize_custom_rmgrs uses the same "custom%03d" format.
- Statistics accumulation uses XLogRecStoreStats from xlogstats.h, not local code. Verified: XLogRecStoreStats(&stats, xlogreader_state) in the main loop; the function is declared extern in xlogstats.h and is shared with pg_walinspect (the contrib SQL-accessible analog, mentioned in the source comment at line 37 of pg_waldump.c).
- XLogRecordSaveFPWs calls RestoreBlockImage to decompress FPIs before writing. Verified: if (!RestoreBlockImage(record, block_id, page)) pg_fatal(...) before fwrite. The written file is the decompressed 8 KB page, not the compressed WAL representation.
- The stats table skips custom rmgr rows with zero count but prints all builtin rmgr rows (including zero-count ones). Verified: if (RmgrIdIsCustom(ri) && count == 0) continue; inside the stats loop. The guard applies only to custom rmgrs, so builtins always get a row.
Open questions
- pg_walinspect parity. The source comment at the top of pg_waldump.c says “it is highly recommended to give a thought about doing the same in pg_walinspect contrib module as well” for any code change or fix. The degree of code sharing (vs. duplication) between the two is not fully traced here; a pg_walinspect-focused follow-up would clarify the boundary.
- Timeline handling across history files. WALDumpOpenSegment advances *tli_p if the segment is not found on the current timeline. The logic for reading timeline history files (.history files in pg_wal) and switching timelines during a scan is inherited from the XLogReader infrastructure and is not walked in detail here. A follow-up anchored on XLogReadRecord’s internal timeline switching would complete the picture.
- --save-fullpage file naming for non-main forks. The filename includes forkname derived from forkNames[fork] with an underscore prefix. The behavior when a fork number outside [0, MAX_FORKNUM] appears is an immediate pg_fatal rather than a skip; whether this can be triggered by a legitimate FPW of an init fork is left as an open question.
Beyond PostgreSQL — Comparative Designs & Research Frontiers
MySQL mysqlbinlog
MySQL’s binary log is a logical log: it records row images (in ROW format) or
SQL statements (in STATEMENT format), not physical page edits. mysqlbinlog is
structurally similar to pg_waldump — a standalone binary, file-based I/O, a
filter pipeline — but because MySQL’s log is row-oriented rather than
page-oriented, mysqlbinlog can reconstruct readable DML (INSERT INTO … VALUES …) rather than rmgr-specific payload summaries. The trade-off is that
MySQL’s binlog is not suitable for crash recovery of the storage engine (InnoDB
uses its own redo log for that); MySQL therefore has two separate logs for the
two separate concerns. PostgreSQL’s WAL serves both crash recovery and logical
decoding from a single stream (the latter is covered in postgres-logical-decoding.md).
Oracle LogMiner
Oracle LogMiner (introduced in Oracle 8i) is the opposite architectural choice
from pg_waldump: it is a server-side PL/SQL package (DBMS_LOGMNR) that reads
redo log files and presents the decoded output through SQL views
(V$LOGMNR_CONTENTS). This means it has access to the full data dictionary,
can reconstruct SQL-level DML, and can filter by schema and object name — but
it requires a live database connection and cannot inspect redo logs from an
offline crashed instance without an auxiliary setup. pg_waldump requires only
files, making it usable in scenarios where the server cannot start.
pg_walinspect (PG15+)
pg_walinspect is a contrib extension that exposes pg_waldump-equivalent
functionality through SQL functions: pg_get_wal_records_info,
pg_get_wal_stats, and pg_get_wal_block_info. Because it runs inside the
server, it can join WAL record data against the live catalog and is accessible
to any client with the right privilege — useful for monitoring dashboards and
automated tooling. The source comment in pg_waldump.c explicitly calls out the
expectation that bug fixes in pg_waldump be ported to pg_walinspect. The two
share XLogStats and the XLogRecStoreStats accounting function, but they are
otherwise separate codebases.
Research: WAL analytics for performance tuning
The aggregate WAL statistics that --stats produces are a practical instance
of a broader research area: using log content as a workload signal. Papers in
this space include work on adaptive checkpoint scheduling (minimizing I/O
amplification from FPWs by tuning checkpoint frequency per workload mix) and
on WAL compression schemes that exploit the redundancy in FPIs across adjacent
checkpoints. Greenplum and CockroachDB have published engineering reports on
structured WAL analytics — characterizing the volume and distribution of record
types across workload phases — as a lever for capacity planning. pg_waldump’s
--stats=record mode is the manual version of this analysis; automated
collection is the direction pg_walinspect’s SQL interface enables.
Sources
- src/bin/pg_waldump/pg_waldump.c: main binary (REL_18_STABLE, commit 273fe94)
- src/bin/pg_waldump/rmgrdesc.c / rmgrdesc.h: rmgr description table for pg_waldump
- src/include/access/xlogreader.h: XLogReaderState, XLogReaderRoutine, XL_ROUTINE, XLogReaderAllocate, XLogFindNextRecord, XLogReadRecord, XLogReaderFree
- src/include/access/xlogstats.h: XLogStats, XLogRecStats, XLogRecStoreStats
- src/include/access/xlog_internal.h: XLogLongPageHeader, XLogLongPageHeaderData, IsXLogFileName, XLogFileName, XLogFromFileName, XLogSegNoOffsetToRecPtr, XLByteInSeg
- src/include/access/xlogrecord.h: XLogRecord header fields
- postgres-xlog-wal.md: WAL insertion, LSN mechanics, flush watermarks
- postgres-wal-records-rmgr.md: rmgr dispatch table, record anatomy, FPIs
- postgres-page-layout.md: page layout (for interpreting --save-fullpage output)
- postgres-logical-decoding.md: logical decoding, a separate WAL consumer