PostgreSQL WAL Archiving & Summarization — PITR and the Incremental-Backup Substrate
Contents:
- Theoretical Background
- Common DBMS Design
- PostgreSQL’s Approach
- Source Walkthrough
- Source verification (as of 2026-06-05)
- Beyond PostgreSQL — Comparative Designs & Research Frontiers
- Sources
Theoretical Background
A write-ahead log is, by construction, a complete and ordered record of every durable change a database has ever made. Once you have that record, two capabilities fall out almost for free, if you keep the log around long enough and if you can find the blocks it touched. Continuous archiving harvests the first; WAL summarization harvests the second.
Point-in-time recovery (PITR). Crash recovery replays WAL from the
last checkpoint’s redo pointer to the end of the log, recovering the
database to the moment it died. PITR generalizes this: take a base
backup (a copy of the data directory plus a record of the LSN at which
it was consistent), then replay an unbounded WAL history on top of it,
stopping at any chosen target — a wall-clock time, a named restore point,
an LSN, or a transaction id. The base backup is the “checkpoint”; the
archived WAL is the “log to replay.” The only new requirement over crash
recovery is that the WAL between the backup and the target must still
exist somewhere — and pg_wal is recycled aggressively, so it must be
copied off to durable, separate storage before recycling. That copying
is continuous archiving.
Incremental backup. A full base backup copies every block in the
cluster, even blocks that have not changed since the last backup. If you
know which blocks changed since a reference backup, you can copy only
those, plus enough metadata to reconstruct the rest by reference. The WAL
already names every modified block — each WAL record carries the
RelFileLocator, fork number, and block number it touched. WAL
summarization distills that information: it scans WAL once and writes
compact files saying “in LSN range X..Y, these blocks of these relation
forks were modified.” An incremental backup then unions the summaries
between the reference backup and now to learn the change set, without
re-reading the WAL itself.
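As a rough illustration of the union step, the sketch below models each summary as an LSN range plus a list of modified block numbers for a single relation fork, and collects the blocks from every summary that falls inside a requested window. The `Summary` struct, `collect_changed_blocks`, and the fixed-size `changed[]` array are hypothetical simplifications, not PostgreSQL's on-disk format or API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_BLOCKS 64               /* toy relation size, in blocks */

/* Hypothetical in-memory stand-in for one summary file (single fork). */
typedef struct Summary
{
    uint64_t        start_lsn;      /* inclusive */
    uint64_t        end_lsn;        /* exclusive */
    const uint32_t *blocks;         /* block numbers modified in the range */
    size_t          nblocks;
} Summary;

/*
 * Mark in changed[] every block named by a summary lying entirely within
 * [ref_lsn, target_lsn). Returns the number of distinct blocks marked.
 */
static size_t
collect_changed_blocks(const Summary *s, size_t n,
                       uint64_t ref_lsn, uint64_t target_lsn,
                       bool changed[MAX_BLOCKS])
{
    size_t count = 0;

    assert(ref_lsn <= target_lsn);
    memset(changed, 0, MAX_BLOCKS * sizeof(bool));

    for (size_t i = 0; i < n; i++)
    {
        if (s[i].start_lsn < ref_lsn || s[i].end_lsn > target_lsn)
            continue;               /* outside the requested window */
        for (size_t j = 0; j < s[i].nblocks; j++)
        {
            uint32_t b = s[i].blocks[j];

            if (b < MAX_BLOCKS && !changed[b])
            {
                changed[b] = true;
                count++;
            }
        }
    }
    return count;
}
```

The cost of the union depends on the number and size of the summaries in the window, not on the volume of WAL they describe.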
Database Internals (Petrov, 2019) frames the WAL as the system’s “source of truth”: the page cache and the data files are derived, reconstructable state, but the log is canonical. Both archiving and summarization are consequences of taking that framing seriously. Archiving extends the log’s lifetime beyond the recycling horizon so that an arbitrarily old base backup can still be rolled forward. Summarization builds a secondary index over the log — a block-reference table — so that the question “what changed?” can be answered in time proportional to the answer rather than to the whole log. ARIES (Mohan et al., 1992) supplies the redo discipline that makes roll-forward correct in the first place: every change is logged before the page is written (write-ahead), and redo is idempotent (replaying a record whose effect is already on the page is a no-op via the page LSN), so replaying archived WAL onto a base backup converges to a consistent state regardless of how far back the backup was taken.
The design space for archiving and summarization has a few axes:
- Push vs. pull. Does the database push completed log segments to the archive (an archiver daemon), or does an external tool pull them? Most systems push: the database knows precisely when a segment is complete and safe to copy.
- Whole-segment vs. record granularity. Archiving ships whole WAL segments (16 MB by default) because that is the unit pg_wal recycles. Summarization, by contrast, reads at record granularity but emits summaries at checkpoint granularity, because checkpoints are the only LSNs where redo can legally begin.
- Block-delta vs. file-delta vs. byte-delta incremental backup. PostgreSQL chose block-delta: the unit of change tracking is the 8 KB page. This matches the page-LSN redo model and the WAL’s own block-reference granularity.
- In-band vs. out-of-band change tracking. Some systems maintain a live “changed-block bitmap” updated synchronously on every write (in-band, low-latency but adds write-path cost). PostgreSQL chose out-of-band: a separate process derives the change set from the WAL after the fact, keeping the write path untouched at the cost of a summarization lag that incremental backup must wait out.
Common DBMS Design
Continuous archiving and incremental backup are old, well-trodden ideas; most mature systems converge on a similar skeleton. Naming the shared conventions first makes PostgreSQL’s specific symbols read as one set of choices within a common playbook.
A notification-file handshake between the log writer and the archiver
The process that fills WAL segments and the process that archives them are
decoupled. The universal pattern is a small status-file protocol: when a
segment is complete, the writer drops a marker (“this segment is ready”);
the archiver picks it up, copies the segment, and replaces the marker with
a different one (“this segment is archived”); a third party (the
checkpointer) reads the second marker to decide when the segment may be
recycled. PostgreSQL implements exactly this with archive_status/NNN.ready
and NNN.done files. The handshake decouples the three actors so none
blocks the others, and it survives crashes: a .ready that reappears after
a crash simply gets re-archived (so archive commands must be idempotent).
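The handshake can be modeled as a three-state marker per segment. The sketch below is a minimal in-memory model, assuming one marker per segment; `SegStatus`, `notify_ready`, `mark_done`, and `may_recycle` are hypothetical names, and the real protocol uses files in archive_status rather than an enum.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one segment's archive_status marker. */
typedef enum { ST_NONE, ST_READY, ST_DONE } SegStatus;

/* Writer side: a completed segment gets a "ready" marker. */
static void
notify_ready(SegStatus *st)
{
    assert(st != NULL);
    *st = ST_READY;
}

/* Archiver side: after a successful copy, "ready" becomes "done". */
static bool
mark_done(SegStatus *st)
{
    assert(st != NULL);
    if (*st != ST_READY)
        return false;               /* nothing pending for this segment */
    *st = ST_DONE;
    return true;
}

/*
 * Recycler side: only a "done" marker authorizes recycling. If no marker
 * exists at all, recreate "ready" so the segment is not silently lost.
 */
static bool
may_recycle(SegStatus *st)
{
    assert(st != NULL);
    if (*st == ST_DONE)
        return true;
    if (*st == ST_NONE)
        notify_ready(st);
    return false;
}
```

Because each actor only reads or flips the marker, none of them needs to wait on another; a crash between the copy and `mark_done` leaves the marker at "ready", which is why the copy itself has to tolerate being repeated.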
A dedicated archiver process with retry and back-pressure
Archiving talks to slow, possibly remote, possibly flaky storage. It must
not be on the commit path, and a transient archive failure must not crash
the server or lose WAL. The convention is a dedicated process that retries
with a bounded count, leaves the .ready file in place on failure so the
segment is not recycled, and applies natural back-pressure: if archiving
falls behind, .ready files accumulate, pg_wal is not recycled, and the
disk fills — a loud, recoverable failure rather than silent data loss.
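The bounded-retry convention might look like the following sketch. It assumes a caller-supplied copy function and a small fixed retry budget; `MAX_RETRIES`, `copy_fn`, `archive_with_retry`, and `flaky_copy` are illustrative names, not PostgreSQL symbols.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_RETRIES 3               /* hypothetical retry budget per cycle */

typedef bool (*copy_fn)(const char *segment, void *arg);

/*
 * Try to archive one segment. Returns true if the copy succeeded within
 * the retry budget. On false the caller leaves the "ready" marker alone,
 * so the segment is kept and retried in a later cycle.
 */
static bool
archive_with_retry(const char *segment, copy_fn copy, void *arg, int *attempts)
{
    assert(copy != NULL && attempts != NULL);

    *attempts = 0;
    while (*attempts < MAX_RETRIES)
    {
        (*attempts)++;
        if (copy(segment, arg))
            return true;
        /* A real archiver would sleep here before retrying. */
    }
    return false;
}

/* Example transport that fails until the counter behind arg hits zero. */
static bool
flaky_copy(const char *segment, void *arg)
{
    int *failures_left = arg;

    (void) segment;
    if (*failures_left > 0)
    {
        (*failures_left)--;
        return false;
    }
    return true;
}
```

Giving up after a bounded number of attempts keeps the process responsive to shutdown and configuration changes, while the untouched marker provides the back-pressure described above.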
Pluggable archive transport
Early systems hard-coded “run this shell command per file.” That is simple
but fork-per-file is slow and shell quoting is a security and correctness
hazard. The modern convention adds a module interface: a loadable library
exposes a callback invoked once per file in-process, avoiding the fork and
the shell. PostgreSQL keeps both: archive_command (shell) and
archive_library (module), with the shell path itself implemented as a
built-in module so the archiver only ever calls one callback.
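A minimal sketch of the single-callback idea follows. It assumes the loadable module is represented by an init function pointer passed in by the caller (standing in for dynamic loading); `ArchiveOps`, `load_archive_ops`, and the `shell_*` helpers are hypothetical names chosen for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical callback table; the field name is illustrative. */
typedef struct ArchiveOps
{
    bool (*archive_file)(const char *file, const char *path);
} ArchiveOps;

typedef const ArchiveOps *(*archive_init_fn)(void);

static int shell_calls = 0;

/* Built-in "shell" transport: a real one would build and run a command. */
static bool
shell_archive_file(const char *file, const char *path)
{
    (void) file;
    (void) path;
    shell_calls++;
    return true;
}

static const ArchiveOps shell_ops = { shell_archive_file };

static const ArchiveOps *
shell_init(void)
{
    return &shell_ops;
}

/*
 * Resolve the transport. An empty library name selects the built-in shell
 * module; otherwise module_init stands in for the loaded library's init
 * function. Returns NULL if both settings are given or no callback exists.
 */
static const ArchiveOps *
load_archive_ops(const char *command, const char *library,
                 archive_init_fn module_init)
{
    const ArchiveOps *ops;

    assert(command != NULL && library != NULL);
    if (command[0] != '\0' && library[0] != '\0')
        return NULL;                /* ambiguous configuration */
    if (library[0] == '\0')
        ops = shell_init();
    else
        ops = module_init ? module_init() : NULL;
    if (ops == NULL || ops->archive_file == NULL)
        return NULL;
    return ops;
}
```

Once the table is resolved, the caller never needs to know which transport is behind `archive_file`.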
A change-tracking index derived from the log
Incremental backup needs a “what changed since LSN X” oracle. Systems that
build it from the log (rather than a synchronous bitmap) run a background
reader that consumes WAL and materializes a compact change index. The index
is keyed by (relation, fork, block) and bounded by an LSN range so that a
backup can select exactly the ranges between its reference point and now.
PostgreSQL’s WAL summary files, each named TLI-startLSN-endLSN.summary
and containing a serialized BlockRefTable, are this index.
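The name above is schematic. The format string quoted later in the walkthrough builds it from five `%08X` fields with no separators: the timeline, then the high and low 32-bit halves of the start and end LSNs. The sketch below formats and parses such a name under that assumption; `format_summary_name` and `parse_summary_name` are hypothetical helpers, and the `sscanf`-based parser is deliberately lenient (it does not reject every malformed hex field).

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Format a summary file name; returns 0 on success, -1 if buf is too small. */
static int
format_summary_name(char *buf, size_t buflen, uint32_t tli,
                    uint64_t start_lsn, uint64_t end_lsn)
{
    int n;

    assert(start_lsn <= end_lsn);
    n = snprintf(buf, buflen,
                 "%08" PRIX32 "%08" PRIX32 "%08" PRIX32 "%08" PRIX32
                 "%08" PRIX32 ".summary",
                 tli,
                 (uint32_t) (start_lsn >> 32), (uint32_t) start_lsn,
                 (uint32_t) (end_lsn >> 32), (uint32_t) end_lsn);
    return (n < 0 || (size_t) n >= buflen) ? -1 : 0;
}

/* Parse a name produced above; returns 0 on success, -1 on mismatch. */
static int
parse_summary_name(const char *name, uint32_t *tli,
                   uint64_t *start_lsn, uint64_t *end_lsn)
{
    uint32_t f[5];

    assert(name != NULL);
    if (strlen(name) != 40 + strlen(".summary") ||
        strcmp(name + 40, ".summary") != 0)
        return -1;
    if (sscanf(name,
               "%8" SCNx32 "%8" SCNx32 "%8" SCNx32 "%8" SCNx32 "%8" SCNx32,
               &f[0], &f[1], &f[2], &f[3], &f[4]) != 5)
        return -1;
    *tli = f[0];
    *start_lsn = ((uint64_t) f[1] << 32) | f[2];
    *end_lsn = ((uint64_t) f[3] << 32) | f[4];
    return 0;
}
```

Fixed-width hex fields make the names sort lexicographically by timeline and then by start LSN, which is convenient when selecting the summaries that cover a given range.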
Boundary alignment at recoverable points
A change-tracking index is only useful if its LSN boundaries coincide with
points where recovery can actually start and stop. The convention is to cut
index segments at checkpoint redo points. PostgreSQL cuts summary files at
XLOG_CHECKPOINT_REDO and XLOG_CHECKPOINT_SHUTDOWN records, because those
are the only LSNs where redo may begin — guaranteeing that any incremental
backup’s reference LSN lands on a summary-file boundary.
flowchart LR
subgraph primary["Primary cluster"]
WAL["pg_wal/<br/>WAL segments"] -->|segment full| RDY["archive_status/<br/>NNN.ready"]
RDY -->|notify latch| ARCH["archiver process<br/>(PgArchiverMain)"]
ARCH -->|archive_file_cb| STORE["archive storage<br/>(command or library)"]
ARCH -->|rename| DONE["archive_status/<br/>NNN.done"]
WAL -->|xlogreader scan| SUMM["walsummarizer<br/>(WalSummarizerMain)"]
SUMM -->|BlockRefTable| SUMFILES["pg_wal/summaries/<br/>TLI-LSN-LSN.summary"]
end
STORE -->|restore_command| RESTORE["recovery / PITR<br/>(RestoreArchivedFile)"]
SUMFILES -->|change set| INCR["incremental backup<br/>(pg_basebackup)"]
Figure 1 — The two harvesting pipelines that share the WAL stream. The
archiver pushes completed segments to durable storage for PITR; the
summarizer indexes modified blocks for incremental backup. Both read from
pg_wal; neither is on the commit path.
PostgreSQL’s Approach
PostgreSQL splits the work across two independent auxiliary processes
forked by the postmaster — the archiver and the WAL summarizer —
plus a body of recovery-side restore logic in xlogarchive.c. All
three are gated by GUCs (archive_mode + archive_command/archive_library;
summarize_wal) and none of them sits on a backend’s commit path.
The archiver: latch, scan, prioritize, ship
The archiver is an auxiliary process whose entire job is to drain
pg_wal/archive_status of .ready files. PgArchiverMain sets up signal
handlers (notably SIGUSR2 for “do one final cycle and stop”), advertises
its proc number so backends can wake it, allocates a max-heap workspace,
loads the archive module, and enters pgarch_MainLoop. The main loop is a
latch wait that wakes on a notification, a 60-second autowake, or shutdown:

    // pgarch_MainLoop (src/backend/postmaster/pgarch.c)
    do
    {
        ResetLatch(MyLatch);
        time_to_stop = ready_to_stop;   /* set by SIGUSR2 handler */
        ProcessPgArchInterrupts();
        /* ... SIGTERM-but-no-SIGUSR2 grace handling ... */
        pgarch_ArchiverCopyLoop();      /* archive everything outstanding */
        if (!time_to_stop)
            rc = WaitLatch(MyLatch,
                           WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
                           PGARCH_AUTOWAKE_INTERVAL * 1000L,
                           WAIT_EVENT_ARCHIVER_MAIN);
    } while (!time_to_stop);

The wakeup comes from PgArchWakeup, which the WAL machinery calls via
XLogArchiveNotify whenever a new .ready file appears. The 60-second
autowake (PGARCH_AUTOWAKE_INTERVAL) is a safety net in case a wakeup is
missed — “the archiver exists to protect our data,” so it polls proactively.
pgarch_ArchiverCopyLoop is the per-cycle workhorse: it repeatedly asks
pgarch_readyXlog for the next file and archives it, with bounded retries.
A clever wrinkle is the orphan cleanup: if the .ready file names a
segment that no longer exists in pg_wal (a crash left a stale marker for
an already-recycled segment), the archiver simply unlinks the .ready and
moves on rather than failing forever:

    // pgarch_ArchiverCopyLoop (src/backend/postmaster/pgarch.c)
    while (pgarch_readyXlog(xlog))
    {
        int failures = 0, failures_orphan = 0;

        for (;;)
        {
            if (ShutdownRequestPending || !PostmasterIsAlive())
                return;
            ProcessPgArchInterrupts();      /* pick up config changes fast */

            snprintf(pathname, MAXPGPATH, XLOGDIR "/%s", xlog);
            if (stat(pathname, &stat_buf) != 0 && errno == ENOENT)
            {
                /* orphan .ready for an already-recycled segment: unlink it */
                StatusFilePath(xlogready, xlog, ".ready");
                if (unlink(xlogready) == 0)
                    break;
                /* ... bounded NUM_ORPHAN_CLEANUP_RETRIES ... */
            }

            if (pgarch_archiveXlog(xlog))   /* the actual copy */
            {
                pgarch_archiveDone(xlog);   /* .ready -> .done */
                pgstat_report_archiver(xlog, false);
                break;
            }
            else if (++failures >= NUM_ARCHIVE_RETRIES)
                return;                     /* give up this cycle, retry later */
        }
    }

Prioritization. When many segments are waiting, the archiver does not
naively pick the alphabetically first directory entry. pgarch_readyXlog
performs a directory scan that fills a bounded max-heap
(NUM_FILES_PER_DIRECTORY_SCAN = 64) so that one scan can serve up to 64
files in priority order, amortizing the (expensive) directory scan. The
priority order is encoded in ready_file_comparator: timeline-history
files always win (so a promoted standby’s new timeline is archived ASAP),
then oldest segment first (to keep the WAL chain contiguous and to free the
soonest-recyclable segments):

    // ready_file_comparator (src/backend/postmaster/pgarch.c)
    bool a_history = IsTLHistoryFileName(a_str);
    bool b_history = IsTLHistoryFileName(b_str);

    /* Timeline history files always have the highest priority. */
    if (a_history != b_history)
        return a_history ? -1 : 1;

    /* Priority is given to older files. */
    return strcmp(a_str, b_str);

PgArchForceDirScan lets the WAL machinery force the next scan to ignore
the cached heap — XLogArchiveNotify calls it for timeline-history files
so they jump the queue immediately rather than waiting behind 64 buffered
segment names.
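The ordering rule is easy to try outside the server. The sketch below applies the same two-step comparison with the standard `qsort`; `is_history_name` is a simplified stand-in for `IsTLHistoryFileName` that only checks the `.history` suffix, and `sort_ready_names` is a hypothetical helper, not part of pgarch.c.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Simplified stand-in for IsTLHistoryFileName: suffix check only. */
static bool
is_history_name(const char *name)
{
    size_t len = strlen(name);
    size_t suf = strlen(".history");

    return len > suf && strcmp(name + len - suf, ".history") == 0;
}

/* History files sort first; otherwise lexicographic order, oldest first. */
static int
ready_name_cmp(const void *pa, const void *pb)
{
    const char *a = *(const char *const *) pa;
    const char *b = *(const char *const *) pb;
    bool        a_history = is_history_name(a);
    bool        b_history = is_history_name(b);

    if (a_history != b_history)
        return a_history ? -1 : 1;
    return strcmp(a, b);
}

/* Sort pending names into the order in which they would be shipped. */
static void
sort_ready_names(const char **names, size_t n)
{
    assert(names != NULL || n == 0);
    qsort(names, n, sizeof(names[0]), ready_name_cmp);
}
```

With a mix of segments from timelines 1 and 2 plus `00000002.history`, the history file comes out first, followed by the timeline-1 segments in ascending order and then the timeline-2 segments.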
Pluggable transport: shell command vs. archive library
The archiver never runs system() directly. LoadArchiveLibrary resolves
a single ArchiveModuleCallbacks vtable: if archive_library is empty it
uses the built-in shell_archive_init (which wraps archive_command),
otherwise it dlopens the named library and calls its
_PG_archive_module_init. At most one of archive_command and
archive_library may be set; configuring both is an error:

    // LoadArchiveLibrary (src/backend/postmaster/pgarch.c)
    if (XLogArchiveLibrary[0] != '\0' && XLogArchiveCommand[0] != '\0')
        ereport(ERROR, ... "both \"archive_command\" and \"archive_library\" set" ...);

    if (XLogArchiveLibrary[0] == '\0')
        archive_init = shell_archive_init;
    else
        archive_init = (ArchiveModuleInit)
            load_external_function(XLogArchiveLibrary,
                                   "_PG_archive_module_init", false, NULL);

    ArchiveCallbacks = (*archive_init) ();
    if (ArchiveCallbacks->archive_file_cb == NULL)
        ereport(ERROR, ... "archive modules must register an archive callback" ...);

pgarch_archiveXlog invokes the callback inside its own sigsetjmp
exception barrier so that an ERROR raised by the archive module turns
into “return false, retry this file” rather than a FATAL that restarts the
whole archiver. The custom PG_exception_stack is installed around this one
call precisely to downgrade errors raised by the module.
Recovery side: restore_command and the .ready/.done protocol
The flip side of archiving is restoring. During archive recovery (PITR or
standby bootstrap), the startup process needs WAL segments that are no
longer in pg_wal. RestoreArchivedFile runs restore_command to copy a
named segment out of the archive into a temporary file in pg_wal,
crosschecks its size, and reports success/failure. Crucially, “restore
fails” is the normal signal for “end of WAL reached” — recovery rolls
forward until the restore fails:

    // RestoreArchivedFile (src/backend/access/transam/xlogarchive.c)
    xlogRestoreCmd = BuildRestoreCommand(recoveryRestoreCommand,
                                         xlogpath, xlogfname,
                                         lastRestartPointFname);
    PreRestoreCommand();            /* arm SIGTERM fast-exit */
    rc = system(xlogRestoreCmd);
    PostRestoreCommand();

    if (rc == 0 && stat(xlogpath, &stat_buf) == 0)
    {
        /* success: use xlogpath */
    }

    /* on signal death, punt; otherwise treat as "no such file, end of WAL" */
    if (wait_result_is_signal(rc, SIGTERM))
        proc_exit(1);

The .ready/.done status files are the contract between the WAL writer,
the archiver, and the checkpointer. XLogArchiveNotify creates NNN.ready
when a segment fills; pgarch_archiveDone renames it to NNN.done after a
successful copy; XLogArchiveCheckDone is what the checkpointer calls
before recycling a segment — it returns true (deletable) only if .done
exists, and re-creates a missing .ready if neither exists so the segment
is not lost:

    // XLogArchiveCheckDone (src/backend/access/transam/xlogarchive.c)
    if (!XLogArchivingActive())
        return true;                /* archive_mode=off: always deletable */

    StatusFilePath(archiveStatusPath, xlog, ".done");
    if (stat(archiveStatusPath, &stat_buf) == 0)
        return true;                /* archived: recycle it */

    StatusFilePath(archiveStatusPath, xlog, ".ready");
    if (stat(archiveStatusPath, &stat_buf) == 0)
        return false;               /* still pending: keep it */

    /* neither exists -> (re)create .ready and keep */
    XLogArchiveNotify(xlog);
    return false;

stateDiagram-v2
    [*] --> SegmentFull
    SegmentFull --> Ready: XLogArchiveNotify creates NNN.ready
    Ready --> Archiving: pgarch_readyXlog selects file
    Archiving --> Ready: archive_file_cb fails \n retry up to NUM_ARCHIVE_RETRIES
    Archiving --> Done: pgarch_archiveDone renames to NNN.done
    Done --> Recycled: checkpoint sees .done \n XLogArchiveCheckDone returns true
    Recycled --> [*]
Figure 2 — The lifecycle of one WAL segment through the archive status
protocol. A failed copy returns the segment to the Ready state; only a
.done marker authorizes the checkpointer to recycle the segment.
The WAL summarizer: indexing modified blocks
When summarize_wal is on, the postmaster runs a second auxiliary process,
WalSummarizerMain. It reads WAL with an XLogReaderState, accumulates a
BlockRefTable of modified (relation, fork, block) tuples, and periodically
writes a .summary file. The shared-memory WalSummarizerData advertises
summarized_lsn (durable on disk), pending_lsn (read into memory but not
yet written), and the timeline, all under WALSummarizerLock. Incremental
backup reads summarized_lsn to know how far the change index extends and
waits (via WaitForWalSummarization) if it needs the summarizer to catch up
to the backup’s start LSN.
The summarizer cuts a new summary file at every checkpoint redo point, skips
WAL written under wal_level=minimal (which is unsafe to base an incremental
backup on), follows timeline switches, and removes old summaries once the
underlying WAL is gone and the file is older than wal_summary_keep_time.
The next section walks the call flow in detail.
Source Walkthrough
This section follows the three subsystems by call flow: the archiver process
(pgarch.c), the recovery-side restore and status protocol
(xlogarchive.c), and the WAL summarizer (walsummarizer.c). Symbols are
the stable anchors; the position-hint table at the end maps each to a
(file, line) pair as observed on REL_18 at commit 273fe94.
Archiver process startup and main loop
PgArchiverMain is the entry point the postmaster calls when
XLogArchivingActive(). It sets MyBackendType = B_ARCHIVER, installs
signal handlers, publishes its proc number into shared memory so backends
can wake it, builds the per-scan workspace, and loads the archive module
before looping:

    // PgArchiverMain (src/backend/postmaster/pgarch.c)
    MyBackendType = B_ARCHIVER;
    AuxiliaryProcessMainCommon();
    pqsignal(SIGUSR2, pgarch_waken_stop);   /* "final cycle then exit" */
    Assert(XLogArchivingActive());
    on_shmem_exit(pgarch_die, 0);
    PgArch->pgprocno = MyProcNumber;        /* advertise for PgArchWakeup */

    arch_files = palloc(sizeof(struct arch_files_state));
    arch_files->arch_heap = binaryheap_allocate(NUM_FILES_PER_DIRECTORY_SCAN,
                                                ready_file_comparator, NULL);
    LoadArchiveLibrary();
    pgarch_MainLoop();

The wakeup path is PgArchWakeup, which sets the archiver’s latch via its
advertised proc number. It deliberately does not take ProcArrayLock —
procLatch is never freed, so a stale proc number at worst wakes the wrong
process, and the archiver is relaunched shortly anyway:

    // PgArchWakeup (src/backend/postmaster/pgarch.c)
    int arch_pgprocno = PgArch->pgprocno;

    if (arch_pgprocno != INVALID_PROC_NUMBER)
        SetLatch(&ProcGlobal->allProcs[arch_pgprocno].procLatch);

ProcessPgArchInterrupts is called from both loops; besides the usual
barrier/config handling, it specially detects a change to archive_library
and proc_exit(0)s the archiver so the new module is loaded on relaunch
(there is no library-unload mechanism, so restart is the only safe path).
File selection: directory scan, max-heap, priority
pgarch_readyXlog is the most intricate routine in pgarch.c. It returns
the single highest-priority .ready file, but to avoid scanning the
directory once per file it scans once and buffers up to 64 names. On entry
it first tries to serve a buffered name (re-stating to confirm the
.ready still exists), and only rescans when the buffer is empty or
force_dir_scan was set:

    // pgarch_readyXlog (src/backend/postmaster/pgarch.c, condensed)
    if (pg_atomic_exchange_u32(&PgArch->force_dir_scan, 0) == 1)
        arch_files->arch_files_size = 0;        /* forced rescan */

    while (arch_files->arch_files_size > 0)     /* serve buffered names */
    {
        arch_file = arch_files->arch_files[--arch_files->arch_files_size];
        StatusFilePath(status_file, arch_file, ".ready");
        if (stat(status_file, &st) == 0)
        {
            strcpy(xlog, arch_file);
            return true;
        }
    }

    /* else: scan archive_status/, push .ready basenames into the max-heap, ... */
    rldir = AllocateDir(XLogArchiveStatusDir);
    while ((rlde = ReadDir(rldir, XLogArchiveStatusDir)) != NULL)
    {
        /* validate name length/chars, require ".ready" suffix */
        if (arch_files->arch_heap->bh_size < NUM_FILES_PER_DIRECTORY_SCAN)
            binaryheap_add_unordered(arch_files->arch_heap,
                                     CStringGetDatum(arch_file));
        else if (ready_file_comparator(binaryheap_first(arch_files->arch_heap),
                                       CStringGetDatum(basename), NULL) > 0)
        {
            /* evict lowest-priority, insert this one */
        }
    }
    /* drain heap into arch_files[] in ascending priority, return highest */

The heap holds the 64 highest-priority candidates; remaining entries are
discovered on the next scan after these drain. ready_file_comparator
(shown in the previous section) defines priority: history files first, then
oldest segment. Because WAL filenames sort with timeline as the most
significant field, “smaller timeline = older = higher priority,” which
gives past timelines precedence — desirable for keeping the recovery chain
contiguous.
Shipping a file and marking it done
pgarch_archiveXlog is where the archive callback actually runs. The
sigsetjmp block is the notable detail: the archiver lives at the bottom of
the exception stack, so a bare ereport(ERROR) from a module would become
FATAL and restart the process. The hand-rolled handler catches it, runs a
minimal subset of the cleanup that transaction abort would perform
(LWLockReleaseAll, AtEOXact_Files, pgaio_error_cleanup, etc.), and returns false:

    // pgarch_archiveXlog (src/backend/postmaster/pgarch.c, condensed)
    if (sigsetjmp(local_sigjmp_buf, 1) != 0)
    {
        error_context_stack = NULL;
        HOLD_INTERRUPTS();
        EmitErrorReport();
        LWLockReleaseAll();
        ConditionVariableCancelSleep();
        pgaio_error_cleanup();
        ReleaseAuxProcessResources(false);
        AtEOXact_Files(false);
        MemoryContextSwitchTo(oldcontext);
        FlushErrorState();
        RESUME_INTERRUPTS();
        ret = false;                /* retry the file, don't restart */
    }
    else
    {
        PG_exception_stack = &local_sigjmp_buf;
        ret = ArchiveCallbacks->archive_file_cb(archive_module_state,
                                                xlog, pathname);
        PG_exception_stack = NULL;
    }

On success pgarch_archiveDone renames NNN.ready to NNN.done. The
rename is deliberately not durable (rename, not durable_rename) — if a
crash loses the rename, the .ready reappears and the segment is simply
re-archived, which is harmless because archive commands must be idempotent.
Recovery-side: RestoreArchivedFile and the status protocol
On the recovery path RestoreArchivedFile (walked in the previous section)
builds the command via BuildRestoreCommand, brackets system() with
PreRestoreCommand/PostRestoreCommand so a SIGTERM during the external
command can fast-exit, and treats a non-signal failure as “end of WAL.” A
successfully restored file is moved into place under its canonical name by
KeepFileRestoredFromArchive, which also forces a .done (so the restored
segment is not re-archived unless archive_mode=always) and wakes any
walsenders:

    // KeepFileRestoredFromArchive (src/backend/access/transam/xlogarchive.c)
    durable_rename(path, xlogfpath, ERROR);

    if (XLogArchiveMode != ARCHIVE_MODE_ALWAYS)
        XLogArchiveForceDone(xlogfname);
    else
        XLogArchiveNotify(xlogfname);

    if (reload)
        WalSndRqstFileReload();
    WalSndWakeup(true, false);

XLogArchiveNotify is the producer end of the status protocol: it creates
NNN.ready, force-scans for history files, and pokes the archiver. The
consumer ends are XLogArchiveCheckDone (checkpointer asks “may I recycle
this?”), XLogArchiveIsBusy (is it still unarchived? — treats a missing
segment as not-busy to handle the recycle race), and XLogArchiveCleanup
(unlink both status files when a segment is finally gone).
WAL summarizer: main loop and timeline handling
WalSummarizerMain mirrors the archiver’s shape (sigsetjmp error barrier,
shared-memory advertisement of its proc number), then enters an infinite
loop. Each iteration finds the safe upper bound (GetLatestLSN), computes a
timeline switch point if the current timeline has gone historic, summarizes
one file’s worth of WAL, and publishes progress:

    // WalSummarizerMain (src/backend/postmaster/walsummarizer.c, condensed)
    current_lsn = GetOldestUnsummarizedLSN(&current_tli, &exact);
    if (XLogRecPtrIsInvalid(current_lsn))
        proc_exit(0);               /* summarize_wal off */

    for (;;)
    {
        ProcessWalSummarizerInterrupts();
        MaybeRemoveOldWalSummaries();

        latest_lsn = GetLatestLSN(&latest_tli);
        if (current_tli != latest_tli && XLogRecPtrIsInvalid(switch_lsn))
            switch_lsn = tliSwitchPoint(current_tli,
                                        readTimeLineHistory(latest_tli),
                                        &switch_tli);

        end_of_summary_lsn = SummarizeWAL(current_tli, current_lsn, exact,
                                          switch_lsn, latest_lsn);
        current_lsn = end_of_summary_lsn;
        exact = true;

        LWLockAcquire(WALSummarizerLock, LW_EXCLUSIVE);
        WalSummarizerCtl->summarized_lsn = end_of_summary_lsn;
        WalSummarizerCtl->pending_lsn = end_of_summary_lsn;
        LWLockRelease(WALSummarizerLock);
        ConditionVariableBroadcast(&WalSummarizerCtl->summary_file_cv);
    }

GetLatestLSN picks the correct ceiling depending on role: on a primary
it is GetFlushRecPtr (never summarize unflushed WAL); during recovery it
is the greater of the walreceiver’s flush position and the replay position,
because replayed WAL is necessarily flushed. GetOldestUnsummarizedLSN
bootstraps summarized_lsn from the newest existing .summary file’s end,
or from the oldest segment on disk if none exist — and it is also callable
by non-summarizer backends, which read the value to avoid recycling WAL the
summarizer still needs.
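The role-dependent ceiling described above reduces to a small decision. The sketch below restates it with plain integers, assuming the three positions have already been sampled; `WalPositions` and `summarizable_upto` are hypothetical names, and `Lsn` is only a stand-in for `XLogRecPtr`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t Lsn;               /* stand-in for XLogRecPtr */

/* Hypothetical snapshot of the positions the ceiling depends on. */
typedef struct WalPositions
{
    bool in_recovery;
    Lsn  flush_lsn;                 /* primary: local WAL flush position */
    Lsn  recv_flush_lsn;            /* standby: walreceiver flush position */
    Lsn  replay_lsn;                /* standby: last replayed position */
} WalPositions;

/* Highest LSN that is safe to summarize for the given role. */
static Lsn
summarizable_upto(const WalPositions *p)
{
    assert(p != NULL);
    if (!p->in_recovery)
        return p->flush_lsn;        /* never read unflushed WAL */
    /* Replayed WAL must already be on disk, so either bound is safe. */
    return p->recv_flush_lsn > p->replay_lsn ? p->recv_flush_lsn
                                             : p->replay_lsn;
}
```

Taking the larger of the two standby positions lets summarization proceed both when WAL arrives by streaming and when it is being replayed from an archive with no walreceiver running.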
SummarizeWAL: the record-reading loop and BlockRefTable
SummarizeWAL does the real work for one summary file. It allocates an
XLogReaderState with summarizer_read_local_xlog_page as the page-read
callback, positions it (exactly, or by XLogFindNextRecord when exact is
false), and reads records one at a time. Record handling has three cases:
RM_XLOG_ID records (checkpoint/parameter/end-of-recovery) decide file
boundaries and fast-forward mode; RM_DBASE_ID/RM_SMGR_ID/RM_XACT_ID
records need limit-block bookkeeping; everything else just contributes its
block references:

    // SummarizeWAL (src/backend/postmaster/walsummarizer.c, condensed)
    BlockRefTable *brtab = CreateEmptyBlockRefTable();
    bool fast_forward = true;

    /* ... XLogBeginRead or XLogFindNextRecord to set summary_start_lsn ... */
    while (1)
    {
        record = XLogReadRecord(xlogreader, &errormsg);
        if (record == NULL)
        {
            /* end of WAL on historic TLI -> break */
        }
        if (!XLogRecPtrIsInvalid(switch_lsn) &&
            xlogreader->ReadRecPtr >= switch_lsn)
        {
            summary_end_lsn = switch_lsn;
            break;
        }

        rmid = XLogRecGetRmid(xlogreader);
        if (rmid == RM_XLOG_ID)
        {
            if (SummarizeXlogRecord(xlogreader, &new_fast_forward))
            {
                if (xlogreader->ReadRecPtr > summary_start_lsn)
                {
                    summary_end_lsn = xlogreader->ReadRecPtr;
                    break;                           /* cut file */
                }
                else
                    fast_forward = new_fast_forward; /* first record */
            }
        }
        else if (!fast_forward)
            switch (rmid)
            {
                case RM_DBASE_ID: SummarizeDbaseRecord(xlogreader, brtab); break;
                case RM_SMGR_ID:  SummarizeSmgrRecord(xlogreader, brtab);  break;
                case RM_XACT_ID:  SummarizeXactRecord(xlogreader, brtab);  break;
            }

        if (!fast_forward)
            for (block_id = 0; block_id <= XLogRecMaxBlockId(xlogreader); block_id++)
            {
                if (!XLogRecGetBlockTagExtended(xlogreader, block_id, &rlocator,
                                                &forknum, &blocknum, NULL))
                    continue;
                if (forknum != FSM_FORKNUM)
                    BlockRefTableMarkBlockModified(brtab, &rlocator,
                                                   forknum, blocknum);
            }

        summary_end_lsn = xlogreader->EndRecPtr;
        /* publish pending_lsn under WALSummarizerLock */
    }

When the loop ends, if any non-trivial range was covered and we are not
fast-forwarding, the BlockRefTable is serialized to a temp file and
durable_renamed to its final name
in pg_wal/summaries/, which consists of five %08X fields with no separators
(the TLI, then the high and low halves of the start and end LSNs) plus the
.summary suffix:

    // SummarizeWAL (src/backend/postmaster/walsummarizer.c)
    snprintf(final_path, MAXPGPATH,
             XLOGDIR "/summaries/%08X%08X%08X%08X%08X.summary",
             tli,
             LSN_FORMAT_ARGS(summary_start_lsn),
             LSN_FORMAT_ARGS(summary_end_lsn));
    io.file = PathNameOpenFile(temp_path, O_WRONLY | O_CREAT | O_TRUNC);
    WriteBlockRefTable(brtab, WriteWalSummary, &io);
    FileClose(io.file);
    durable_rename(temp_path, final_path, ERROR);

Limit blocks: handling drops, truncations, and new forks
A subtle correctness point: marking blocks “modified” is not enough. If a
relation is dropped, recreated, or truncated, an incremental backup must
not assume the old blocks are still valid. The summarizer encodes this with
limit blocks. SummarizeSmgrRecord sets the limit block to 0 when a fork
is created (everything is new) and to the truncation point on truncate:

    // SummarizeSmgrRecord (src/backend/postmaster/walsummarizer.c)
    if (info == XLOG_SMGR_CREATE)
    {
        xlrec = (xl_smgr_create *) XLogRecGetData(xlogreader);
        if (xlrec->forkNum != FSM_FORKNUM)
            BlockRefTableSetLimitBlock(brtab, &xlrec->rlocator,
                                       xlrec->forkNum, 0);
    }
    else if (info == XLOG_SMGR_TRUNCATE)
    {
        xlrec = (xl_smgr_truncate *) XLogRecGetData(xlogreader);
        if ((xlrec->flags & SMGR_TRUNCATE_HEAP) != 0)
            BlockRefTableSetLimitBlock(brtab, &xlrec->rlocator,
                                       MAIN_FORKNUM, xlrec->blkno);
        /* ... VM fork via visibilitymap_truncation_length ... */
    }

SummarizeDbaseRecord does the same at database/tablespace granularity for
CREATE DATABASE ... (FILE_COPY) and DROP DATABASE (which can create or
remove many relation files without per-file WAL), and SummarizeXactRecord
sets a limit block of 0 for relations dropped at commit or abort. The guiding
invariant, stated in the source, is “We can never lose data by marking
more stuff as needing to be backed up in full” — limit blocks are always
conservative.
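One way to picture the conservative rule is a per-fork record holding a limit block alongside the set of modified blocks. The sketch below is a simplified model, not the real BlockRefTable: it assumes the limit only ever moves downward and that every block at or beyond the limit is treated as needing a copy. `ForkChanges`, `NO_LIMIT`, and the helper functions are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define NBLOCKS  32                 /* toy fork size, in blocks */
#define NO_LIMIT UINT32_MAX         /* hypothetical "no create/truncate seen" */

/* Hypothetical per-fork change record for one LSN range. */
typedef struct ForkChanges
{
    uint32_t limit_block;
    bool     modified[NBLOCKS];
} ForkChanges;

static void
fork_changes_init(ForkChanges *fc)
{
    fc->limit_block = NO_LIMIT;
    memset(fc->modified, 0, sizeof fc->modified);
}

static void
mark_modified(ForkChanges *fc, uint32_t blk)
{
    assert(blk < NBLOCKS);
    fc->modified[blk] = true;
}

/* Lower the limit: 0 on create or drop, the new length on truncate. */
static void
set_limit_block(ForkChanges *fc, uint32_t limit)
{
    if (limit < fc->limit_block)
        fc->limit_block = limit;    /* assumed: the limit never increases */
}

/* A block needs copying if it was modified or lies at/after the limit. */
static bool
must_copy(const ForkChanges *fc, uint32_t blk)
{
    assert(blk < NBLOCKS);
    return blk >= fc->limit_block || fc->modified[blk];
}
```

Under this model an over-eager limit only causes extra blocks to be copied, which matches the invariant quoted above: marking too much is wasteful but never loses data.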
Fast-forward and wal_level=minimal
SummarizeXlogRecord inspects checkpoint, parameter-change, and
end-of-recovery records to extract the wal_level in effect, and decides
both whether to cut the file here (true for redo points and shutdown
checkpoints, since redo can only begin there) and whether to fast-forward
(skip emitting summaries when wal_level=minimal, because an incremental
backup spanning minimal-level WAL would be unsafe):

    // SummarizeXlogRecord (src/backend/postmaster/walsummarizer.c)
    if (info == XLOG_CHECKPOINT_REDO)
        memcpy(&record_wal_level, XLogRecGetData(xlogreader), sizeof(int));
    else if (info == XLOG_CHECKPOINT_SHUTDOWN)
    {
        /* ... rec_ckpt.wal_level ... */
    }
    else if (info == XLOG_PARAMETER_CHANGE)
    {
        /* ... xlrec.wal_level ... */
    }
    else if (info == XLOG_END_OF_RECOVERY)
    {
        /* ... xlrec.wal_level ... */
    }
    else
        return false;               /* not a boundary record */

    *new_fast_forward = (record_wal_level == WAL_LEVEL_MINIMAL);
    return true;

Catch-up waiting and end-of-WAL backoff
Two waiting routines tie the summarizer to its consumers. Inside record
reading, summarizer_read_local_xlog_page calls summarizer_wait_for_wal
when it reaches end-of-WAL on the current timeline; that routine implements
an adaptive backoff in 200 ms quanta (doubling when idle up to 30 s,
shrinking under load). Outside, WaitForWalSummarization is what an
incremental backup calls to block until summarized_lsn reaches the
backup’s start LSN, with a progress watchdog that errors out if
pending_lsn fails to advance for a full minute:

    // WaitForWalSummarization — src/backend/postmaster/walsummarizer.c (condensed)
    while (1)
    {
        if (!summarize_wal)
            return; /* disabled while waiting: give up */
        summarized_lsn = WalSummarizerCtl->summarized_lsn; /* under lock */
        pending_lsn = WalSummarizerCtl->pending_lsn;
        if (summarized_lsn >= lsn)
            break;
        /* per 10s cycle: track deadcycles; error after ~60s of no progress */
        if (deadcycles >= 6)
            ereport(ERROR, ... "WAL summarization is not progressing" ...);
        ConditionVariableTimedSleep(&WalSummarizerCtl->summary_file_cv,
                                    timeout_in_ms, WAIT_EVENT_WAL_SUMMARY_READY);
    }

Summary cleanup
MaybeRemoveOldWalSummaries runs once per checkpoint cycle (gated on the
redo pointer advancing). It removes a .summary file only when both its
underlying WAL is gone (ws->end_lsn <= oldest_lsn on disk) and the file
is older than wal_summary_keep_time — so a summary outlives its WAL only
long enough to satisfy the configured retention, and incremental backups
that still need it can complete.
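
The two-condition removal rule can be written as a single predicate. This is a minimal sketch, not the in-tree code: `ToySummary`, `ToyLsn`, and `summary_is_removable` are hypothetical names, the retention window is passed in seconds for simplicity, and treating a non-positive window as "never remove" is an assumption of this sketch.

```c
#include <stdbool.h>
#include <stdint.h>
#include <time.h>

typedef uint64_t ToyLsn;   /* stand-in for XLogRecPtr */

/* Hypothetical view of one summary file; field names are illustrative. */
typedef struct ToySummary {
    ToyLsn end_lsn;     /* end of the LSN range the file covers */
    time_t mtime;       /* file modification time */
} ToySummary;

/*
 * Both conditions must hold: the WAL behind the summary has already been
 * removed, and the file has aged past the retention window.
 * keep_seconds <= 0 is treated as "never remove" (assumption).
 */
static bool summary_is_removable(const ToySummary *s, ToyLsn oldest_wal_lsn,
                                 time_t now, long keep_seconds)
{
    if (keep_seconds <= 0)
        return false;
    if (s->end_lsn > oldest_wal_lsn)
        return false;                   /* underlying WAL is still on disk */
    return difftime(now, s->mtime) > (double) keep_seconds;
}
```

Requiring both conditions means neither an aggressive retention setting nor early WAL recycling alone is enough to delete a summary.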
flowchart TB
A["WalSummarizerMain loop"] --> B["GetLatestLSN<br/>(flush on primary;<br/>max of replay/recv flush on standby)"]
B --> C{current_tli<br/>!= latest_tli?}
C -- yes --> D["tliSwitchPoint<br/>compute switch_lsn"]
C -- no --> E["SummarizeWAL"]
D --> E
E --> F["XLogReadRecord loop"]
F --> G{RM_XLOG_ID<br/>boundary?}
G -- redo/shutdown --> H["cut file at ReadRecPtr<br/>or set fast_forward"]
G -- other rmgr --> I["BlockRefTableMarkBlockModified<br/>+ limit-block bookkeeping"]
H --> J["WriteBlockRefTable<br/>durable_rename .summary"]
I --> F
J --> K["publish summarized_lsn<br/>broadcast summary_file_cv"]
K --> A
Figure 3 — The summarizer’s per-file flow. Boundary records (redo points,
shutdown checkpoints) cut summary files and toggle fast-forward; all other
records feed block references into the in-memory BlockRefTable that is
serialized at the cut.
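
The end-of-WAL backoff described under "Catch-up waiting and end-of-WAL backoff" can be sketched as a pure function. Only the 200 ms quantum and the 30 s cap come from the text above. The function name, the macro names, and the exact shrink rule (subtract the number of pages read, never going below one quantum) are assumptions made for illustration.

```c
#include <assert.h>

#define TOY_MS_PER_QUANTUM 200   /* from the text: 200 ms quanta */
#define TOY_MAX_QUANTA     150   /* 150 * 200 ms = 30 s cap */

/*
 * Compute the next sleep length, in quanta, from the previous one and the
 * number of WAL pages read since the last sleep.
 */
static long next_sleep_quanta(long sleep_quanta, long pages_read_since_last_sleep)
{
    assert(sleep_quanta >= 1);
    if (pages_read_since_last_sleep == 0) {
        /* Idle: double the sleep, up to the cap. */
        sleep_quanta *= 2;
        if (sleep_quanta > TOY_MAX_QUANTA)
            sleep_quanta = TOY_MAX_QUANTA;
    } else if (pages_read_since_last_sleep > 1) {
        /* Busy: shrink by the number of pages read, never below 1. */
        if (sleep_quanta > pages_read_since_last_sleep)
            sleep_quanta -= pages_read_since_last_sleep;
        else
            sleep_quanta = 1;
    }
    /* Exactly one page read: leave the sleep time unchanged. */
    return sleep_quanta;
}

/* Convert quanta to the millisecond timeout handed to a wait primitive. */
static long sleep_ms(long sleep_quanta)
{
    return sleep_quanta * TOY_MS_PER_QUANTUM;
}
```

Starting from one quantum, an idle system reaches the 30 s cap after eight consecutive empty wake-ups (1, 2, 4, ... 128, then 150 quanta), while a burst of WAL pulls the sleep back toward 200 ms.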
Position hints (as of 2026-06-05, REL_18 273fe94)
| Symbol | File | Line |
|---|---|---|
| PgArchiverMain | src/backend/postmaster/pgarch.c | 218 |
| PgArchWakeup | src/backend/postmaster/pgarch.c | 281 |
| pgarch_MainLoop | src/backend/postmaster/pgarch.c | 311 |
| pgarch_ArchiverCopyLoop | src/backend/postmaster/pgarch.c | 381 |
| pgarch_archiveXlog | src/backend/postmaster/pgarch.c | 517 |
| pgarch_readyXlog | src/backend/postmaster/pgarch.c | 645 |
| ready_file_comparator | src/backend/postmaster/pgarch.c | 781 |
| PgArchForceDirScan | src/backend/postmaster/pgarch.c | 804 |
| pgarch_archiveDone | src/backend/postmaster/pgarch.c | 818 |
| LoadArchiveLibrary | src/backend/postmaster/pgarch.c | 913 |
| RestoreArchivedFile | src/backend/access/transam/xlogarchive.c | 54 |
| ExecuteRecoveryCommand | src/backend/access/transam/xlogarchive.c | 295 |
| KeepFileRestoredFromArchive | src/backend/access/transam/xlogarchive.c | 358 |
| XLogArchiveNotify | src/backend/access/transam/xlogarchive.c | 444 |
| XLogArchiveForceDone | src/backend/access/transam/xlogarchive.c | 510 |
| XLogArchiveCheckDone | src/backend/access/transam/xlogarchive.c | 565 |
| XLogArchiveIsBusy | src/backend/access/transam/xlogarchive.c | 619 |
| XLogArchiveCleanup | src/backend/access/transam/xlogarchive.c | 712 |
| WalSummarizerMain | src/backend/postmaster/walsummarizer.c | 214 |
| GetWalSummarizerState | src/backend/postmaster/walsummarizer.c | 451 |
| GetOldestUnsummarizedLSN | src/backend/postmaster/walsummarizer.c | 509 |
| WaitForWalSummarization | src/backend/postmaster/walsummarizer.c | 664 |
| GetLatestLSN | src/backend/postmaster/walsummarizer.c | 804 |
| SummarizeWAL | src/backend/postmaster/walsummarizer.c | 910 |
| SummarizeSmgrRecord | src/backend/postmaster/walsummarizer.c | 1319 |
| SummarizeXlogRecord | src/backend/postmaster/walsummarizer.c | 1429 |
| summarizer_read_local_xlog_page | src/backend/postmaster/walsummarizer.c | 1502 |
| summarizer_wait_for_wal | src/backend/postmaster/walsummarizer.c | 1616 |
| MaybeRemoveOldWalSummaries | src/backend/postmaster/walsummarizer.c | 1662 |
Source verification (as of 2026-06-05)
All excerpts were taken from REL_18_STABLE at commit
273fe94852b3a7e34fd171e8abdf1481beb302fa (PG 18.x). The following claims
were checked directly against the source tree:
- Archiver is a latch-driven auxiliary process, not on the commit path. Confirmed: PgArchiverMain sets MyBackendType = B_ARCHIVER and pgarch_MainLoop blocks in WaitLatch(... WAIT_EVENT_ARCHIVER_MAIN) with a PGARCH_AUTOWAKE_INTERVAL (60 s) timeout. Backends never call into pgarch.c to archive; they only drop .ready files and call PgArchWakeup.

- .ready/.done status protocol. Confirmed in xlogarchive.c: XLogArchiveNotify creates <xlog>.ready; pgarch_archiveDone (pgarch.c) renames .ready to .done; XLogArchiveCheckDone gates recycling on .done and re-creates a missing .ready. The done-rename is non-durable by design (plain rename, with a comment that re-archiving after a crash is acceptable).

- 64-file max-heap with history-first, oldest-first priority. Confirmed: NUM_FILES_PER_DIRECTORY_SCAN is 64; pgarch_readyXlog fills arch_files->arch_heap (a binaryheap ordered by ready_file_comparator), which returns history files before segments and older names before newer.

- Exactly one of archive_command/archive_library. Confirmed: both LoadArchiveLibrary and ProcessPgArchInterrupts ereport(ERROR) if both GUCs are non-empty; an empty archive_library selects the built-in shell_archive_init.

- Archive module errors are downgraded, not fatal. Confirmed: pgarch_archiveXlog installs its own sigsetjmp/PG_exception_stack and returns false on caught ERROR, with explicit cleanup (LWLockReleaseAll, pgaio_error_cleanup, AtEOXact_Files, …).

- restore_command failure means end-of-WAL. Confirmed: RestoreArchivedFile returns false (and recovery treats that as end-of-WAL) when system() returns non-zero for non-signal reasons; a SIGTERM triggers proc_exit(1) instead.

- Summarizer exists and is PG17+. Confirmed: walsummarizer.c defines WalSummarizerMain with MyBackendType = B_WAL_SUMMARIZER, gated by the summarize_wal GUC; the file’s header documents emitting summary files of modified blocks per LSN range. (Feature introduced in PG 17 and present on REL_18.)

- Summary file naming and content. Confirmed: SummarizeWAL writes pg_wal/summaries/%08X%08X%08X%08X%08X.summary (TLI + start LSN halves + end LSN halves) via WriteBlockRefTable then durable_rename. Block references come from BlockRefTableMarkBlockModified, skipping FSM_FORKNUM.

- File boundaries at checkpoint redo / shutdown; wal_level=minimal fast-forward. Confirmed: SummarizeXlogRecord returns true (cut here) for XLOG_CHECKPOINT_REDO, XLOG_CHECKPOINT_SHUTDOWN, XLOG_PARAMETER_CHANGE, XLOG_END_OF_RECOVERY, and sets *new_fast_forward = (record_wal_level == WAL_LEVEL_MINIMAL).

- Limit blocks for drops/truncations/creates. Confirmed in SummarizeSmgrRecord, SummarizeDbaseRecord, SummarizeXactRecord via BlockRefTableSetLimitBlock.

- Catch-up wait watchdog. Confirmed: WaitForWalSummarization ereport(ERROR ... "WAL summarization is not progressing") after deadcycles >= 6 (≈60 s with no pending_lsn advance).

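
The summary file naming confirmed in the list above (five zero-padded 8-digit hex fields: timeline, then the high and low 32-bit halves of the start and end LSNs) can be reproduced with plain snprintf. The function name `toy_summary_name` is hypothetical, and the sketch builds only the file name, not the pg_wal/summaries/ directory prefix.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Format "<TLI><start hi><start lo><end hi><end lo>.summary" into buf.
 * Returns the number of characters written (excluding the NUL), or -1 if
 * buf is too small or formatting failed.
 */
static int toy_summary_name(char *buf, size_t buflen, uint32_t tli,
                            uint64_t start_lsn, uint64_t end_lsn)
{
    assert(buf != NULL);
    int n = snprintf(buf, buflen,
                     "%08" PRIX32 "%08" PRIX32 "%08" PRIX32
                     "%08" PRIX32 "%08" PRIX32 ".summary",
                     tli,
                     (uint32_t) (start_lsn >> 32), (uint32_t) start_lsn,
                     (uint32_t) (end_lsn >> 32), (uint32_t) end_lsn);
    return (n < 0 || (size_t) n >= buflen) ? -1 : n;
}
```

Because every field is fixed-width, names sort lexicographically by timeline and then by start LSN, and the LSN range a file covers can be read from its name without opening it.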
Adjacent mechanisms are intentionally deferred to sibling docs and were not
re-verified here: the WAL record format, LSN model, and segment recycling
(postgres-xlog-wal.md); the consumer side of summaries in incremental
backup (postgres-incremental-backup.md); and base-backup mechanics
(postgres-backup-basebackup.md). The streaming transport that feeds a
standby’s restore/replay sits in postgres-wal-sender-receiver.md.
Scope note. basic_archive and other contrib/ archive modules are out of scope; only the in-core mechanism (the ArchiveModuleCallbacks vtable, built-in shell_archive) is analyzed. Third-party tools such as pgBackRest or WAL-G are named in the next section only as comparative examples, not analyzed from source.
Beyond PostgreSQL — Comparative Designs & Research Frontiers
PostgreSQL’s archiving and summarization are conservative, log-derived, out-of-band designs. Naming the alternatives sharpens what those choices buy and what they cost.
- External archivers vs. the in-core daemon — pgBackRest / WAL-G. The most common production setups replace the bare archive_command with a helper invoked as the archive command. pgBackRest and WAL-G still ride the .ready/.done handshake and the per-segment callback — they are fancier archive_file_cb payloads — but add compression, encryption, parallel transfer to object storage, and retention policy that the in-core archiver deliberately omits. The instructive contrast is the fork-per-file cost: PostgreSQL’s archive_command forks a shell per 16 MB segment, which caps single-stream throughput; archive_library (the module path analyzed here) was added precisely so a tool can stay resident and batch. A note mapping pgBackRest’s async-archive queue onto PostgreSQL’s max-heap pgarch_readyXlog priority would show where the community tool re-implements the same back-pressure the core already has.

- In-band changed-block tracking — Oracle BCT, SQL Server DCM. Oracle’s Block Change Tracking file and SQL Server’s differential changed-map (DCM) bitmaps update a change record synchronously on the write path, so an incremental backup reads a ready-made bitmap with near-zero latency. PostgreSQL chose the opposite: the WAL summarizer derives the change set after the fact, keeping the write path untouched at the cost of a summarization lag that pg_basebackup --incremental must wait out (the WaitForWalSummarization watchdog). The trade is write-amplification and recovery-correctness coupling (in-band) versus background-CPU and lag (out-of-band). PostgreSQL’s bet is that WAL already names every modified block, so a second synchronous structure would be redundant work on the hot path.

- Physical vs. logical change tracking. The BlockRefTable is a physical index — (relfilenode, fork, blocknumber) — so it composes with block-level PITR but says nothing about rows. Logical-replication change streams (postgres-logical-decoding.md) are the row-level analogue; they answer “what tuples changed” but cannot drive a block-incremental backup. The two indexes over the same WAL serve disjoint consumers, which is why PostgreSQL maintains both rather than unifying them.

- Continuous-archiving alternatives — streaming-only DR. A standby fed purely by streaming replication (postgres-wal-sender-receiver.md) needs no archive at all for HA, which tempts operators to drop archiving. The frontier question is gap recovery: a standby that falls behind the primary’s wal_keep_size must fall back to restore_command from the archive (or a replication slot must pin the WAL, at the cost of unbounded pg_wal growth). The archive and the slot are two answers to the same “don’t recycle WAL a consumer still needs” problem — one durable and out-of-cluster, one in-cluster and disk-pressure-prone.

- Summary-file granularity and the research frontier. Cutting summaries at checkpoint-redo boundaries guarantees every incremental backup’s reference LSN lands on a file boundary, but it ties summary resolution to checkpoint frequency. Finer-grained or overlapping summaries (so a backup could start mid-checkpoint) would shrink the wait but complicate the union logic in BlockRefTable merging. Likewise, the summarizer is a single process reading WAL serially; a parallel summarizer (partitioning the LSN range) is an unrealized scaling axis for very high WAL-rate clusters. Both are open design questions the current code’s comments gesture at but do not pursue.

- ARIES lineage. The whole edifice rests on ARIES (Mohan et al. 1992): write-ahead logging makes the WAL canonical, idempotent page-LSN redo makes roll-forward onto an arbitrarily old base backup correct, and the redo-from-checkpoint discipline is exactly what forces summary boundaries to checkpoint records. PostgreSQL’s contribution here is not the recovery theory but the secondary index over the log (the BlockRefTable) that turns “replay everything” into “copy only what changed” — a pragmatic, post-ARIES engineering layer rather than a new recovery algorithm.

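
The idempotent page-LSN redo rule named in the ARIES item is small enough to show in full. This is a toy model under stated assumptions: `ToyPage`, `ToyRecord`, and `toy_redo` are hypothetical, a "page" is a single integer, and a record simply overwrites it. Real redo routines apply record-specific changes to buffer pages.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t ToyLsn;

/* Hypothetical page and log record; real pages keep the LSN in a header. */
typedef struct ToyPage   { ToyLsn page_lsn; int value; } ToyPage;
typedef struct ToyRecord { ToyLsn lsn; int new_value; } ToyRecord;

/*
 * Apply a record only if the page has not already absorbed it.
 * Returns true when the page was changed.
 */
static bool toy_redo(ToyPage *page, const ToyRecord *rec)
{
    if (page->page_lsn >= rec->lsn)
        return false;           /* change already reflected: skip */
    page->value = rec->new_value;
    page->page_lsn = rec->lsn;  /* stamp the page with the record's LSN */
    return true;
}
```

Because replaying a record twice is a no-op, it does not matter whether a page in the base backup was copied before or after a given change reached disk: replay from the backup's start LSN converges to the same state either way.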
Sources
In-tree source files (REL_18_STABLE, commit 273fe94)
- src/backend/postmaster/pgarch.c — the archiver process: PgArchiverMain, pgarch_MainLoop, pgarch_ArchiverCopyLoop, pgarch_archiveXlog, pgarch_readyXlog, ready_file_comparator, PgArchForceDirScan, pgarch_archiveDone, LoadArchiveLibrary, the SIGUSR2-final-cycle protocol, the 64-file max-heap, and orphan-.ready cleanup.
- src/backend/postmaster/walsummarizer.c — the WAL summarizer: WalSummarizerMain, SummarizeWAL, SummarizeXlogRecord, SummarizeSmgrRecord/SummarizeDbaseRecord/SummarizeXactRecord, summarizer_read_local_xlog_page, summarizer_wait_for_wal, WaitForWalSummarization, GetOldestUnsummarizedLSN, MaybeRemoveOldWalSummaries, and the BlockRefTable population that skips FSM_FORKNUM.
- src/backend/access/transam/xlogarchive.c — the recovery/notification side: RestoreArchivedFile, ExecuteRecoveryCommand, KeepFileRestoredFromArchive, XLogArchiveNotify, XLogArchiveForceDone, XLogArchiveCheckDone, XLogArchiveIsBusy, XLogArchiveCleanup — the .ready/.done status protocol and restore_command execution.
- src/backend/archive/shell_archive.c — the built-in shell_archive_init module that wraps archive_command, so the archiver only ever calls one ArchiveModuleCallbacks vtable.
- src/include/postmaster/pgarch.h, src/include/postmaster/walsummarizer.h, src/include/access/xlogarchive.h, src/include/archive/archive_module.h — the public signatures, the ArchiveModuleCallbacks struct, and the GUC hooks (archive_mode, archive_command, archive_library, summarize_wal, wal_summary_keep_time).
Papers and textbook chapters
- Mohan, C., Haderle, D., Lindsay, B., Pirahesh, H. & Schwarz, P. (1992).
  “ARIES: A Transaction Recovery Method Supporting Fine-Granularity Locking
  and Partial Rollbacks Using Write-Ahead Logging.” ACM TODS 17(1):94-162.
  The write-ahead-logging and idempotent-redo discipline that makes
  roll-forward onto an old base backup correct. Captured in
  knowledge/research/dbms-papers/aries.md.
- Database Internals (Petrov, 2019), ch. 3 (WAL / log-structured storage)
  and the recovery chapters — the “WAL is the source of truth, data files are
  derived” framing both archiving and summarization rest on. Capture under
  knowledge/research/dbms-general/.
- Database System Concepts (Silberschatz, Korth, Sudarshan, 7e), ch. 19
  “Recovery System” — checkpoints, log-based recovery, and the
  fuzzy-checkpoint / redo-from-checkpoint model that fixes summary-file
  boundaries. Capture under
  knowledge/research/dbms-general/.
Sibling docs (cross-references — mechanism owned there, not duplicated here)
- postgres-xlog-wal.md — the WAL record format, LSN model, segment naming, and pg_wal recycling horizon that archiving extends and the summarizer reads.
- postgres-incremental-backup.md — the consumer of .summary files: how pg_basebackup --incremental unions BlockRefTables between a reference backup and now to copy only changed blocks.
- postgres-backup-basebackup.md — base-backup mechanics (pg_backup_start/pg_backup_stop, the backup label, pg_basebackup) that PITR rolls archived WAL forward on top of.
- postgres-checkpoint.md — the XLOG_CHECKPOINT_REDO/XLOG_CHECKPOINT_SHUTDOWN records that fix summary-file cut points and the redo pointer PITR replays from.
- postgres-recovery-redo.md — the recovery loop that calls RestoreArchivedFile to pull archived segments back during PITR.
- postgres-wal-sender-receiver.md — the streaming transport that is the archive’s alternative for feeding a standby, and the gap-recovery fallback to restore_command.
- postgres-aux-processes.md — where the archiver (B_ARCHIVER) and WAL summarizer (B_WAL_SUMMARIZER) sit in the postmaster’s auxiliary-process roster.