PostgreSQL WAL Archiving & Summarization — PITR and the Incremental-Backup Substrate

A write-ahead log is, by construction, a complete and ordered record of every durable change a database has ever made. Once you have that record, two capabilities fall out almost for free — if you keep the log around long enough and if you can find the blocks it touched. Continuous archiving harvests the first; WAL summarization harvests the second.

Point-in-time recovery (PITR). Crash recovery replays WAL from the last checkpoint’s redo pointer to the end of the log, recovering the database to the moment it died. PITR generalizes this: take a base backup (a copy of the data directory plus a record of the LSN at which it was consistent), then replay an unbounded WAL history on top of it, stopping at any chosen target — a wall-clock time, a named restore point, an LSN, or a transaction id. The base backup is the “checkpoint”; the archived WAL is the “log to replay.” The only new requirement over crash recovery is that the WAL between the backup and the target must still exist somewhere — and pg_wal is recycled aggressively, so it must be copied off to durable, separate storage before recycling. That copying is continuous archiving.

Incremental backup. A full base backup copies every block in the cluster, even blocks that have not changed since the last backup. If you know which blocks changed since a reference backup, you can copy only those, plus enough metadata to reconstruct the rest by reference. The WAL already names every modified block — each WAL record carries the RelFileLocator, fork number, and block number it touched. WAL summarization distills that information: it scans WAL once and writes compact files saying “in LSN range X..Y, these blocks of these relation forks were modified.” An incremental backup then unions the summaries between the reference backup and now to learn the change set, without re-reading the WAL itself.

Database Internals (Petrov, 2019) frames the WAL as the system’s “source of truth”: the page cache and the data files are derived, reconstructable state, but the log is canonical. Both archiving and summarization are consequences of taking that framing seriously. Archiving extends the log’s lifetime beyond the recycling horizon so that an arbitrarily old base backup can still be rolled forward. Summarization builds a secondary index over the log — a block-reference table — so that the question “what changed?” can be answered in time proportional to the answer rather than to the whole log. ARIES (Mohan et al., 1992) supplies the redo discipline that makes roll-forward correct in the first place: every change is logged before the page is written (write-ahead), and redo is idempotent (replaying a record whose effect is already on the page is a no-op via the page LSN), so replaying archived WAL onto a base backup converges to a consistent state regardless of how far back the backup was taken.
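The idempotent-redo rule mentioned above can be illustrated with a minimal sketch. The `DemoPage` and `DemoRecord` types and the `demo_redo` function are hypothetical stand-ins, not PostgreSQL structures; the only point is the page-LSN comparison that turns a repeated replay into a no-op.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-ins for a data page and a WAL record. */
typedef struct { uint64_t lsn; int value; } DemoPage;
typedef struct { uint64_t lsn; int new_value; } DemoRecord;

/*
 * Apply a record only if the page has not already absorbed it.
 * Returns true when the record actually changed the page.
 */
static bool
demo_redo(DemoPage *page, const DemoRecord *rec)
{
    assert(page != NULL && rec != NULL);
    if (page->lsn >= rec->lsn)
        return false;           /* effect already on the page: no-op */
    page->value = rec->new_value;
    page->lsn = rec->lsn;       /* stamp the page with the record's LSN */
    return true;
}
```

Replaying the same record twice leaves the page unchanged the second time, which is why it does not matter how much of the archived WAL had already reached the data files when the base backup was taken.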

The design space for archiving and summarization has a few axes:

  1. Push vs. pull. Does the database push completed log segments to the archive (an archiver daemon), or does an external tool pull them? Most systems push: the database knows precisely when a segment is complete and safe to copy.

  2. Whole-segment vs. record granularity. Archiving ships whole WAL segments (16 MB by default) because that is the unit pg_wal recycles. Summarization, by contrast, reads at record granularity but emits summaries at checkpoint granularity, because checkpoints are the only LSNs where redo can legally begin.

  3. Block-delta vs. file-delta vs. byte-delta incremental backup. PostgreSQL chose block-delta: the unit of change tracking is the 8 KB page. This matches the page-LSN redo model and the WAL’s own block-reference granularity.

  4. In-band vs. out-of-band change tracking. Some systems maintain a live “changed-block bitmap” updated synchronously on every write (in-band, low-latency but adds write-path cost). PostgreSQL chose out-of-band: a separate process derives the change set from the WAL after the fact, keeping the write path untouched at the cost of a summarization lag that incremental backup must wait out.

Continuous archiving and incremental backup are old, well-trodden ideas; most mature systems converge on a similar skeleton. Naming the shared conventions first makes PostgreSQL’s specific symbols read as one set of choices within a common playbook.

A notification-file handshake between the log writer and the archiver

The process that fills WAL segments and the process that archives them are decoupled. The universal pattern is a small status-file protocol: when a segment is complete, the writer drops a marker (“this segment is ready”); the archiver picks it up, copies the segment, and replaces the marker with a different one (“this segment is archived”); a third party (the checkpointer) reads the second marker to decide when the segment may be recycled. PostgreSQL implements exactly this with archive_status/NNN.ready and NNN.done files. The handshake decouples the three actors so none blocks the others, and it survives crashes: a .ready that reappears after a crash simply gets re-archived (so archive commands must be idempotent).

A dedicated archiver process with retry and back-pressure

Archiving talks to slow, possibly remote, possibly flaky storage. It must not be on the commit path, and a transient archive failure must not crash the server or lose WAL. The convention is a dedicated process that retries with a bounded count, leaves the .ready file in place on failure so the segment is not recycled, and applies natural back-pressure: if archiving falls behind, .ready files accumulate, pg_wal is not recycled, and the disk fills — a loud, recoverable failure rather than silent data loss.
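A minimal sketch of the retry convention, under stated assumptions: `DEMO_MAX_RETRIES`, `demo_archive_with_retry`, and the callback type are hypothetical names for illustration, and the "marker" is represented only by the returned outcome rather than by a real file.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define DEMO_MAX_RETRIES 3      /* assumption: illustrative bound only */

/* Hypothetical copy callback: returns true when the segment was stored. */
typedef bool (*demo_copy_fn)(const char *segment, void *arg);

typedef enum { DEMO_ARCHIVED, DEMO_STILL_READY } DemoOutcome;

/* Try the copy a bounded number of times; never give up the marker on failure. */
static DemoOutcome
demo_archive_with_retry(const char *segment, demo_copy_fn copy, void *arg)
{
    assert(segment != NULL && copy != NULL);
    for (int attempt = 0; attempt < DEMO_MAX_RETRIES; attempt++)
    {
        if (copy(segment, arg))
            return DEMO_ARCHIVED;   /* caller may now mark the segment done */
    }
    return DEMO_STILL_READY;        /* marker stays; segment is not recycled */
}

/* Demo callback: fails until the counter pointed to by arg reaches zero. */
static bool
demo_flaky_copy(const char *segment, void *arg)
{
    int *failures_left = arg;

    (void) segment;
    if (*failures_left > 0)
    {
        (*failures_left)--;
        return false;
    }
    return true;
}
```

A storage outage shorter than the retry budget is absorbed silently; a longer one leaves the segment pending, which is what produces the back-pressure described above.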

Early systems hard-coded “run this shell command per file.” That is simple but fork-per-file is slow and shell quoting is a security and correctness hazard. The modern convention adds a module interface: a loadable library exposes a callback invoked once per file in-process, avoiding the fork and the shell. PostgreSQL keeps both: archive_command (shell) and archive_library (module), with the shell path itself implemented as a built-in module so the archiver only ever calls one callback.

A change-tracking index derived from the log

Incremental backup needs a “what changed since LSN X” oracle. Systems that build it from the log (rather than a synchronous bitmap) run a background reader that consumes WAL and materializes a compact change index. The index is keyed by (relation, fork, block) and bounded by an LSN range so that a backup can select exactly the ranges between its reference point and now. PostgreSQL’s WAL summary files, each named after its timeline, start LSN, and end LSN (with a .summary suffix) and containing a serialized BlockRefTable, are this index.

A change-tracking index is only useful if its LSN boundaries coincide with points where recovery can actually start and stop. The convention is to cut index segments at checkpoint redo points. PostgreSQL cuts summary files at XLOG_CHECKPOINT_REDO and XLOG_CHECKPOINT_SHUTDOWN records, because those are the only LSNs where redo may begin — guaranteeing that any incremental backup’s reference LSN lands on a summary-file boundary.
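Because summaries are bounded by LSN ranges, a consumer can check whether a requested range is fully covered before trusting the change set. The following is a rough sketch of such a check; `DemoSummary` and `demo_range_is_summarized` are hypothetical names, and the input is assumed to be sorted by start LSN and to belong to a single timeline.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical in-memory view of one summary file's LSN range. */
typedef struct { uint64_t start_lsn; uint64_t end_lsn; } DemoSummary;

/*
 * Returns true if the summaries (sorted by start_lsn, one timeline)
 * cover [from_lsn, to_lsn) without a gap.
 */
static bool
demo_range_is_summarized(const DemoSummary *s, size_t n,
                         uint64_t from_lsn, uint64_t to_lsn)
{
    uint64_t covered_to = from_lsn;

    assert(from_lsn <= to_lsn);
    for (size_t i = 0; i < n && covered_to < to_lsn; i++)
    {
        if (s[i].end_lsn <= covered_to)
            continue;               /* entirely before the uncovered part */
        if (s[i].start_lsn > covered_to)
            return false;           /* gap in the index */
        covered_to = s[i].end_lsn;
    }
    return covered_to >= to_lsn;
}
```

A gap means the change set cannot be derived from summaries alone, which is the situation the checkpoint-aligned boundaries are meant to make easy to detect.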

flowchart LR
  subgraph primary["Primary cluster"]
    WAL["pg_wal/<br/>WAL segments"] -->|segment full| RDY["archive_status/<br/>NNN.ready"]
    RDY -->|notify latch| ARCH["archiver process<br/>(PgArchiverMain)"]
    ARCH -->|archive_file_cb| STORE["archive storage<br/>(command or library)"]
    ARCH -->|rename| DONE["archive_status/<br/>NNN.done"]
    WAL -->|xlogreader scan| SUMM["walsummarizer<br/>(WalSummarizerMain)"]
    SUMM -->|BlockRefTable| SUMFILES["pg_wal/summaries/<br/>TLI-LSN-LSN.summary"]
  end
  STORE -->|restore_command| RESTORE["recovery / PITR<br/>(RestoreArchivedFile)"]
  SUMFILES -->|change set| INCR["incremental backup<br/>(pg_basebackup)"]

Figure 1 — The two harvesting pipelines that share the WAL stream. The archiver pushes completed segments to durable storage for PITR; the summarizer indexes modified blocks for incremental backup. Both read from pg_wal, neither is on the commit path.

PostgreSQL splits the work across two independent auxiliary processes forked by the postmaster (the archiver and the WAL summarizer), plus a body of recovery-side restore logic in xlogarchive.c. All three are gated by GUCs (archive_mode + archive_command/archive_library; restore_command; summarize_wal) and none of them sits on a backend’s commit path.

The archiver: latch, scan, prioritize, ship

The archiver is an auxiliary process whose entire job is to drain pg_wal/archive_status of .ready files. PgArchiverMain sets up signal handlers (notably SIGUSR2 for “do one final cycle and stop”), advertises its proc number so backends can wake it, allocates a max-heap workspace, loads the archive module, and enters pgarch_MainLoop. The main loop is a latch wait that wakes on a notification, a 60-second autowake, or shutdown:

// pgarch_MainLoop — src/backend/postmaster/pgarch.c
do
{
    ResetLatch(MyLatch);
    time_to_stop = ready_to_stop; /* set by SIGUSR2 handler */
    ProcessPgArchInterrupts();
    /* ... SIGTERM-but-no-SIGUSR2 grace handling ... */
    pgarch_ArchiverCopyLoop(); /* archive everything outstanding */
    if (!time_to_stop)
        rc = WaitLatch(MyLatch,
                       WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
                       PGARCH_AUTOWAKE_INTERVAL * 1000L,
                       WAIT_EVENT_ARCHIVER_MAIN);
} while (!time_to_stop);

The wakeup comes from PgArchWakeup, which the WAL machinery calls via XLogArchiveNotify whenever a new .ready file appears. The 60-second autowake (PGARCH_AUTOWAKE_INTERVAL) is a safety net in case a wakeup is missed — “the archiver exists to protect our data,” so it polls proactively.

pgarch_ArchiverCopyLoop is the per-cycle workhorse: it repeatedly asks pgarch_readyXlog for the next file and archives it, with bounded retries. A clever wrinkle is the orphan cleanup: if the .ready file names a segment that no longer exists in pg_wal (a crash left a stale marker for an already-recycled segment), the archiver simply unlinks the .ready and moves on rather than failing forever:

// pgarch_ArchiverCopyLoop — src/backend/postmaster/pgarch.c
while (pgarch_readyXlog(xlog))
{
    int failures = 0, failures_orphan = 0;
    for (;;)
    {
        if (ShutdownRequestPending || !PostmasterIsAlive())
            return;
        ProcessPgArchInterrupts(); /* pick up config changes fast */
        snprintf(pathname, MAXPGPATH, XLOGDIR "/%s", xlog);
        if (stat(pathname, &stat_buf) != 0 && errno == ENOENT)
        {
            /* orphan .ready for an already-recycled segment: unlink it */
            StatusFilePath(xlogready, xlog, ".ready");
            if (unlink(xlogready) == 0) break;
            /* ... bounded NUM_ORPHAN_CLEANUP_RETRIES ... */
        }
        if (pgarch_archiveXlog(xlog)) /* the actual copy */
        {
            pgarch_archiveDone(xlog); /* .ready -> .done */
            pgstat_report_archiver(xlog, false);
            break;
        }
        else if (++failures >= NUM_ARCHIVE_RETRIES)
            return; /* give up this cycle, retry later */
    }
}

Prioritization. When many segments are waiting, the archiver does not naively pick the alphabetically first directory entry. pgarch_readyXlog performs a directory scan that fills a bounded max-heap (NUM_FILES_PER_DIRECTORY_SCAN = 64) so that one scan can serve up to 64 files in priority order, amortizing the (expensive) directory scan. The priority order is encoded in ready_file_comparator: timeline-history files always win (so a promoted standby’s new timeline is archived ASAP), then oldest segment first (to keep the WAL chain contiguous and to free the soonest-recyclable segments):

// ready_file_comparator — src/backend/postmaster/pgarch.c
bool a_history = IsTLHistoryFileName(a_str);
bool b_history = IsTLHistoryFileName(b_str);
/* Timeline history files always have the highest priority. */
if (a_history != b_history)
return a_history ? -1 : 1;
/* Priority is given to older files. */
return strcmp(a_str, b_str);
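
To see the ordering this comparator produces, here is a standalone sketch that can be used with `qsort`. It is an illustration, not the backend code: `demo_is_history_file` is a simplified stand-in for IsTLHistoryFileName that checks only the `.history` suffix, and the wrapper takes `qsort`-style pointers instead of Datums.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Simplified stand-in for IsTLHistoryFileName: checks only the suffix. */
static bool
demo_is_history_file(const char *name)
{
    const char *suffix = ".history";
    size_t      len;
    size_t      slen = strlen(suffix);

    assert(name != NULL);
    len = strlen(name);
    return len > slen && strcmp(name + len - slen, suffix) == 0;
}

/* qsort-compatible comparator: a negative result means "archive a first". */
static int
demo_ready_file_cmp(const void *pa, const void *pb)
{
    const char *a = *(const char *const *) pa;
    const char *b = *(const char *const *) pb;
    bool        a_history = demo_is_history_file(a);
    bool        b_history = demo_is_history_file(b);

    /* Timeline history files sort ahead of everything else. */
    if (a_history != b_history)
        return a_history ? -1 : 1;
    /* Otherwise plain string order, which puts older segments first. */
    return strcmp(a, b);
}
```

Sorting a mixed list of names with `qsort(files, n, sizeof(files[0]), demo_ready_file_cmp)` puts any `.history` file first and then the segments in ascending name order.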

PgArchForceDirScan lets the WAL machinery force the next scan to ignore the cached heap — XLogArchiveNotify calls it for timeline-history files so they jump the queue immediately rather than waiting behind 64 buffered segment names.

Pluggable transport: shell command vs. archive library

The archiver never runs system() directly. LoadArchiveLibrary resolves a single ArchiveModuleCallbacks vtable: if archive_library is empty it uses the built-in shell_archive_init (which wraps archive_command), otherwise it dlopens the named library and calls its _PG_archive_module_init. The two settings are mutually exclusive, and setting both raises an error:

// LoadArchiveLibrary — src/backend/postmaster/pgarch.c
if (XLogArchiveLibrary[0] != '\0' && XLogArchiveCommand[0] != '\0')
    ereport(ERROR, ... "both \"archive_command\" and \"archive_library\" set" ...);
if (XLogArchiveLibrary[0] == '\0')
    archive_init = shell_archive_init;
else
    archive_init = (ArchiveModuleInit)
        load_external_function(XLogArchiveLibrary,
                               "_PG_archive_module_init", false, NULL);
ArchiveCallbacks = (*archive_init) ();
if (ArchiveCallbacks->archive_file_cb == NULL)
    ereport(ERROR, ... "archive modules must register an archive callback" ...);

pgarch_archiveXlog invokes the callback inside its own sigsetjmp exception barrier so that an ERROR raised by the archive module turns into “return false, retry this file” rather than a FATAL that restarts the whole archiver. The custom PG_exception_stack is installed here specifically to downgrade errors; the WAL summarizer, described later, sets up a similar sigsetjmp barrier.

Recovery side: restore_command and the .ready/.done protocol

The flip side of archiving is restoring. During archive recovery (PITR or standby bootstrap), the startup process needs WAL segments that are no longer in pg_wal. RestoreArchivedFile runs restore_command to copy a named segment out of the archive into a temporary file in pg_wal, crosschecks its size, and reports success/failure. Crucially, “restore fails” is the normal signal for “end of WAL reached” — recovery rolls forward until the restore fails:

// RestoreArchivedFile — src/backend/access/transam/xlogarchive.c
xlogRestoreCmd = BuildRestoreCommand(recoveryRestoreCommand,
                                     xlogpath, xlogfname,
                                     lastRestartPointFname);
PreRestoreCommand(); /* arm SIGTERM fast-exit */
rc = system(xlogRestoreCmd);
PostRestoreCommand();
if (rc == 0 && stat(xlogpath, &stat_buf) == 0) { /* success: use xlogpath */ }
/* on signal death, punt; otherwise treat as "no such file, end of WAL" */
if (wait_result_is_signal(rc, SIGTERM))
    proc_exit(1);
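
The "roll forward until restore fails" rule can be sketched as a loop over segment numbers. This is a simplified illustration under stated assumptions: `demo_restore_fn`, `demo_replay_until_missing`, and `demo_archive_has` are hypothetical, segments are identified by a plain counter, and the actual replay of records is elided.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical restore callback: true if the segment could be fetched. */
typedef bool (*demo_restore_fn)(uint64_t segno, void *arg);

/*
 * Fetch consecutive segments until the first failure, which is treated as
 * the end of available WAL. Returns the number of segments fetched.
 */
static uint64_t
demo_replay_until_missing(uint64_t first_segno, demo_restore_fn restore, void *arg)
{
    uint64_t segno = first_segno;

    assert(restore != NULL);
    while (restore(segno, arg))
        segno++;                /* replaying the segment's records would go here */
    return segno - first_segno;
}

/* Demo archive that holds every segment up to and including *arg. */
static bool
demo_archive_has(uint64_t segno, void *arg)
{
    const uint64_t *last_archived = arg;

    return segno <= *last_archived;
}
```

One consequence of this convention is that a restore command which fails for a transient reason is indistinguishable from a genuine end of archive, so the command's exit behavior deserves care.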

The .ready/.done status files are the contract between the WAL writer, the archiver, and the checkpointer. XLogArchiveNotify creates NNN.ready when a segment fills; pgarch_archiveDone renames it to NNN.done after a successful copy; XLogArchiveCheckDone is what the checkpointer calls before recycling a segment — it returns true (deletable) only if .done exists, and re-creates a missing .ready if neither exists so the segment is not lost:

// XLogArchiveCheckDone — src/backend/access/transam/xlogarchive.c
if (!XLogArchivingActive())
    return true; /* archive_mode=off: always deletable */
StatusFilePath(archiveStatusPath, xlog, ".done");
if (stat(archiveStatusPath, &stat_buf) == 0)
    return true; /* archived: recycle it */
StatusFilePath(archiveStatusPath, xlog, ".ready");
if (stat(archiveStatusPath, &stat_buf) == 0)
    return false; /* still pending: keep it */
/* neither exists -> (re)create .ready and keep */
XLogArchiveNotify(xlog);
return false;

stateDiagram-v2
  [*] --> SegmentFull
  SegmentFull --> Ready: XLogArchiveNotify creates NNN.ready
  Ready --> Archiving: pgarch_readyXlog selects file
  Archiving --> Ready: archive_file_cb fails \n retry up to NUM_ARCHIVE_RETRIES
  Archiving --> Done: pgarch_archiveDone renames to NNN.done
  Done --> Recycled: checkpoint sees .done \n XLogArchiveCheckDone returns true
  Recycled --> [*]

Figure 2 — The lifecycle of one WAL segment through the archive status protocol. A failed copy returns the segment to the Ready state; only a .done marker authorizes the checkpointer to recycle the segment.
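
The lifecycle in Figure 2 can also be written as a small transition function. The enum names and `demo_next_state` are hypothetical; the sketch collapses the transient Archiving state into the copy events and models only the invariant that a checkpoint can recycle a segment solely from the Done state.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { SEG_READY, SEG_DONE, SEG_RECYCLED } DemoSegState;
typedef enum { EV_COPY_OK, EV_COPY_FAILED, EV_CHECKPOINT } DemoEvent;

/* Return the state a segment is in after the given event. */
static DemoSegState
demo_next_state(DemoSegState state, DemoEvent event)
{
    assert(state == SEG_READY || state == SEG_DONE || state == SEG_RECYCLED);
    switch (state)
    {
        case SEG_READY:
            /* Only a successful copy moves the marker from ready to done. */
            return (event == EV_COPY_OK) ? SEG_DONE : SEG_READY;
        case SEG_DONE:
            /* Only a checkpoint removes an archived segment. */
            return (event == EV_CHECKPOINT) ? SEG_RECYCLED : SEG_DONE;
        case SEG_RECYCLED:
            break;
    }
    return SEG_RECYCLED;        /* terminal state */
}
```

A checkpoint arriving while the segment is still Ready leaves it Ready, which is the property that prevents unarchived WAL from being recycled.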

The WAL summarizer: indexing modified blocks

When summarize_wal is on, the postmaster runs a second auxiliary process, WalSummarizerMain. It reads WAL with an XLogReaderState, accumulates a BlockRefTable of modified (relation, fork, block) tuples, and periodically writes a .summary file. The shared-memory WalSummarizerData advertises summarized_lsn (durable on disk), pending_lsn (read into memory but not yet written), and the timeline, all under WALSummarizerLock. Incremental backup reads summarized_lsn to know how far the change index extends and waits (via WaitForWalSummarization) if it needs the summarizer to catch up to the backup’s start LSN.

The summarizer cuts a new summary file at every checkpoint redo point, skips WAL written under wal_level=minimal (which is unsafe to base an incremental backup on), follows timeline switches, and removes old summaries once the underlying WAL is gone and the file is older than wal_summary_keep_time. The next section walks the call flow in detail.

This section follows the three subsystems by call flow: the archiver process (pgarch.c), the recovery-side restore and status protocol (xlogarchive.c), and the WAL summarizer (walsummarizer.c). Symbols are the stable anchors; the position-hint table at the end maps each to a (file, line) pair as observed on REL_18 at commit 273fe94.

PgArchiverMain is the entry point the postmaster calls when XLogArchivingActive(). It sets MyBackendType = B_ARCHIVER, installs signal handlers, publishes its proc number into shared memory so backends can wake it, builds the per-scan workspace, and loads the archive module before looping:

// PgArchiverMain — src/backend/postmaster/pgarch.c
MyBackendType = B_ARCHIVER;
AuxiliaryProcessMainCommon();
pqsignal(SIGUSR2, pgarch_waken_stop); /* "final cycle then exit" */
Assert(XLogArchivingActive());
on_shmem_exit(pgarch_die, 0);
PgArch->pgprocno = MyProcNumber; /* advertise for PgArchWakeup */
arch_files = palloc(sizeof(struct arch_files_state));
arch_files->arch_heap = binaryheap_allocate(NUM_FILES_PER_DIRECTORY_SCAN,
ready_file_comparator, NULL);
LoadArchiveLibrary();
pgarch_MainLoop();

The wakeup path is PgArchWakeup, which sets the archiver’s latch via its advertised proc number. It deliberately does not take ProcArrayLock: procLatch is never freed, so a stale proc number at worst wakes the wrong process, and the archiver is relaunched shortly anyway:

// PgArchWakeup — src/backend/postmaster/pgarch.c
int arch_pgprocno = PgArch->pgprocno;
if (arch_pgprocno != INVALID_PROC_NUMBER)
SetLatch(&ProcGlobal->allProcs[arch_pgprocno].procLatch);

ProcessPgArchInterrupts is called from both loops; besides the usual barrier/config handling, it specially detects a change to archive_library and proc_exit(0)s the archiver so the new module is loaded on relaunch (there is no library-unload mechanism, so restart is the only safe path).

File selection: directory scan, max-heap, priority

pgarch_readyXlog is the most intricate routine in pgarch.c. It returns the single highest-priority .ready file, but to avoid scanning the directory once per file it scans once and buffers up to 64 names. On entry it first tries to serve a buffered name (calling stat() again to confirm the .ready still exists), and only rescans when the buffer is empty or force_dir_scan was set:

// pgarch_readyXlog — src/backend/postmaster/pgarch.c (condensed)
if (pg_atomic_exchange_u32(&PgArch->force_dir_scan, 0) == 1)
    arch_files->arch_files_size = 0; /* forced rescan */
while (arch_files->arch_files_size > 0) /* serve buffered names */
{
    arch_file = arch_files->arch_files[--arch_files->arch_files_size];
    StatusFilePath(status_file, arch_file, ".ready");
    if (stat(status_file, &st) == 0) { strcpy(xlog, arch_file); return true; }
}
/* else: scan archive_status/, push .ready basenames into the max-heap, ... */
rldir = AllocateDir(XLogArchiveStatusDir);
while ((rlde = ReadDir(rldir, XLogArchiveStatusDir)) != NULL)
{
    /* validate name length/chars, require ".ready" suffix */
    if (arch_files->arch_heap->bh_size < NUM_FILES_PER_DIRECTORY_SCAN)
        binaryheap_add_unordered(arch_files->arch_heap, CStringGetDatum(arch_file));
    else if (ready_file_comparator(binaryheap_first(arch_files->arch_heap),
                                   CStringGetDatum(basename), NULL) > 0)
    { /* evict lowest-priority, insert this one */ }
}
/* drain heap into arch_files[] in ascending priority, return highest */

The heap holds the 64 highest-priority candidates; remaining entries are discovered on the next scan after these drain. ready_file_comparator (shown in the previous section) defines priority: history files first, then oldest segment. Because WAL filenames sort with timeline as the most significant field, “smaller timeline = older = higher priority,” which gives past timelines precedence — desirable for keeping the recovery chain contiguous.
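
The bounded-buffer idea (keep only the K best candidates seen during one directory scan) can be sketched without a heap. The version below is a deliberately simple stand-in that finds the current worst entry by linear scan rather than with a binary heap, uses plain `strcmp` order as the priority, and uses hypothetical names (`DEMO_K`, `DemoTopK`, `demo_topk_offer`, `demo_topk_contains`).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define DEMO_K 3                /* assumption: tiny bound for illustration */

typedef struct { const char *names[DEMO_K]; int count; } DemoTopK;

/* Offer one candidate; keep it only if it is among the K smallest so far. */
static void
demo_topk_offer(DemoTopK *t, const char *name)
{
    int worst = 0;

    assert(t != NULL && name != NULL);
    if (t->count < DEMO_K)
    {
        t->names[t->count++] = name;    /* buffer not full: always accept */
        return;
    }
    for (int i = 1; i < DEMO_K; i++)    /* find the lowest-priority entry */
        if (strcmp(t->names[i], t->names[worst]) > 0)
            worst = i;
    if (strcmp(name, t->names[worst]) < 0)
        t->names[worst] = name;         /* evict it in favor of the newcomer */
}

/* True if the buffer currently holds the given name. */
static bool
demo_topk_contains(const DemoTopK *t, const char *name)
{
    for (int i = 0; i < t->count; i++)
        if (strcmp(t->names[i], name) == 0)
            return true;
    return false;
}
```

A heap makes the eviction step logarithmic instead of linear in K, which matters more at 64 entries and many thousands of directory entries than in this toy.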

pgarch_archiveXlog is where the archive callback actually runs. The sigsetjmp block is the notable detail: the archiver lives at the bottom of the exception stack, so a bare ereport(ERROR) from a module would become FATAL and restart the process. The hand-rolled handler catches it, runs the cleanup that a PG_CATCH normally would (LWLockReleaseAll, AtEOXact_Files, pgaio_error_cleanup, etc.), and returns false:

// pgarch_archiveXlog — src/backend/postmaster/pgarch.c (condensed)
if (sigsetjmp(local_sigjmp_buf, 1) != 0)
{
    error_context_stack = NULL;
    HOLD_INTERRUPTS();
    EmitErrorReport();
    LWLockReleaseAll();
    ConditionVariableCancelSleep();
    pgaio_error_cleanup();
    ReleaseAuxProcessResources(false);
    AtEOXact_Files(false);
    MemoryContextSwitchTo(oldcontext);
    FlushErrorState();
    RESUME_INTERRUPTS();
    ret = false; /* retry the file, don't restart */
}
else
{
    PG_exception_stack = &local_sigjmp_buf;
    ret = ArchiveCallbacks->archive_file_cb(archive_module_state, xlog, pathname);
    PG_exception_stack = NULL;
}

On success pgarch_archiveDone renames NNN.ready to NNN.done. The rename is deliberately not durable (rename, not durable_rename) — if a crash loses the rename, the .ready reappears and the segment is simply re-archived, which is harmless because archive commands must be idempotent.

Recovery-side: RestoreArchivedFile and the status protocol

On the recovery path RestoreArchivedFile (walked in the previous section) builds the command via BuildRestoreCommand, brackets system() with PreRestoreCommand/PostRestoreCommand so a SIGTERM during the external command can fast-exit, and treats a non-signal failure as “end of WAL.” A successfully restored file is moved into place under its canonical name by KeepFileRestoredFromArchive, which also forces a .done (so the restored segment is not re-archived unless archive_mode=always) and wakes any walsenders:

// KeepFileRestoredFromArchive — src/backend/access/transam/xlogarchive.c
durable_rename(path, xlogfpath, ERROR);
if (XLogArchiveMode != ARCHIVE_MODE_ALWAYS)
    XLogArchiveForceDone(xlogfname);
else
    XLogArchiveNotify(xlogfname);
if (reload)
    WalSndRqstFileReload();
WalSndWakeup(true, false);

XLogArchiveNotify is the producer end of the status protocol: it creates NNN.ready, force-scans for history files, and pokes the archiver. The consumer ends are XLogArchiveCheckDone (checkpointer asks “may I recycle this?”), XLogArchiveIsBusy (is it still unarchived? — treats a missing segment as not-busy to handle the recycle race), and XLogArchiveCleanup (unlink both status files when a segment is finally gone).

WAL summarizer: main loop and timeline handling

WalSummarizerMain mirrors the archiver’s shape — sigsetjmp error barrier, shared-memory advertisement of its proc number — then enters an infinite loop. Each iteration finds the safe upper bound (GetLatestLSN), computes a timeline switch point if the current timeline has gone historic, summarizes one file’s worth of WAL, and publishes progress:

// WalSummarizerMain — src/backend/postmaster/walsummarizer.c (condensed)
current_lsn = GetOldestUnsummarizedLSN(&current_tli, &exact);
if (XLogRecPtrIsInvalid(current_lsn)) proc_exit(0); /* summarize_wal off */
for (;;)
{
    ProcessWalSummarizerInterrupts();
    MaybeRemoveOldWalSummaries();
    latest_lsn = GetLatestLSN(&latest_tli);
    if (current_tli != latest_tli && XLogRecPtrIsInvalid(switch_lsn))
        switch_lsn = tliSwitchPoint(current_tli,
                                    readTimeLineHistory(latest_tli), &switch_tli);
    end_of_summary_lsn = SummarizeWAL(current_tli, current_lsn, exact,
                                      switch_lsn, latest_lsn);
    current_lsn = end_of_summary_lsn;
    exact = true;
    LWLockAcquire(WALSummarizerLock, LW_EXCLUSIVE);
    WalSummarizerCtl->summarized_lsn = end_of_summary_lsn;
    WalSummarizerCtl->pending_lsn = end_of_summary_lsn;
    LWLockRelease(WALSummarizerLock);
    ConditionVariableBroadcast(&WalSummarizerCtl->summary_file_cv);
}

GetLatestLSN picks the correct ceiling depending on role: on a primary it is GetFlushRecPtr (never summarize unflushed WAL); during recovery it is the greater of the walreceiver’s flush position and the replay position, because replayed WAL is necessarily flushed. GetOldestUnsummarizedLSN bootstraps summarized_lsn from the newest existing .summary file’s end, or from the oldest segment on disk if none exist — and it is also callable by non-summarizer backends, which read the value to avoid recycling WAL the summarizer still needs.

SummarizeWAL: the record-reading loop and BlockRefTable

SummarizeWAL does the real work for one summary file. It allocates an XLogReaderState with summarizer_read_local_xlog_page as the page-read callback, positions it (exactly, or by XLogFindNextRecord when exact is false), and reads records one at a time. Record handling has three cases: RM_XLOG_ID records (checkpoint/parameter/end-of-recovery) decide file boundaries and fast-forward mode; RM_DBASE_ID/RM_SMGR_ID/RM_XACT_ID records need limit-block bookkeeping; everything else just contributes its block references:

// SummarizeWAL — src/backend/postmaster/walsummarizer.c (condensed)
BlockRefTable *brtab = CreateEmptyBlockRefTable();
bool fast_forward = true;
/* ... XLogBeginRead or XLogFindNextRecord to set summary_start_lsn ... */
while (1)
{
    record = XLogReadRecord(xlogreader, &errormsg);
    if (record == NULL) { /* end of WAL on historic TLI -> break */ }
    if (!XLogRecPtrIsInvalid(switch_lsn) && xlogreader->ReadRecPtr >= switch_lsn)
    { summary_end_lsn = switch_lsn; break; }
    rmid = XLogRecGetRmid(xlogreader);
    if (rmid == RM_XLOG_ID)
    {
        if (SummarizeXlogRecord(xlogreader, &new_fast_forward))
        {
            if (xlogreader->ReadRecPtr > summary_start_lsn)
            { summary_end_lsn = xlogreader->ReadRecPtr; break; } /* cut file */
            else
                fast_forward = new_fast_forward; /* first record */
        }
    }
    else if (!fast_forward)
        switch (rmid) {
            case RM_DBASE_ID: SummarizeDbaseRecord(xlogreader, brtab); break;
            case RM_SMGR_ID: SummarizeSmgrRecord(xlogreader, brtab); break;
            case RM_XACT_ID: SummarizeXactRecord(xlogreader, brtab); break;
        }
    if (!fast_forward)
        for (block_id = 0; block_id <= XLogRecMaxBlockId(xlogreader); block_id++)
        {
            if (!XLogRecGetBlockTagExtended(xlogreader, block_id, &rlocator,
                                            &forknum, &blocknum, NULL))
                continue;
            if (forknum != FSM_FORKNUM)
                BlockRefTableMarkBlockModified(brtab, &rlocator, forknum, blocknum);
        }
    summary_end_lsn = xlogreader->EndRecPtr;
    /* publish pending_lsn under WALSummarizerLock */
}

When the loop ends, if any non-trivial range was covered and we are not fast-forwarding, the BlockRefTable is serialized to a temp file and durable_renamed to its final name under pg_wal/summaries/. The name is five %08X fields followed by .summary: the TLI, then the high and low 32-bit halves of the start LSN, then the high and low halves of the end LSN:

// SummarizeWAL — src/backend/postmaster/walsummarizer.c
snprintf(final_path, MAXPGPATH,
XLOGDIR "/summaries/%08X%08X%08X%08X%08X.summary",
tli, LSN_FORMAT_ARGS(summary_start_lsn), LSN_FORMAT_ARGS(summary_end_lsn));
io.file = PathNameOpenFile(temp_path, O_WRONLY | O_CREAT | O_TRUNC);
WriteBlockRefTable(brtab, WriteWalSummary, &io);
FileClose(io.file);
durable_rename(temp_path, final_path, ERROR);
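
The naming scheme is easy to reproduce outside the backend. The sketch below formats the same five fields with standard C only; `demo_summary_name` is a hypothetical helper, and the LSN split mirrors what the format string above implies (high 32 bits, then low 32 bits).

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/*
 * Write "<TLI><start hi><start lo><end hi><end lo>.summary" into buf.
 * Returns the snprintf result (length the full name needs).
 */
static int
demo_summary_name(char *buf, size_t buflen,
                  uint32_t tli, uint64_t start_lsn, uint64_t end_lsn)
{
    assert(buf != NULL && buflen > 0);
    return snprintf(buf, buflen,
                    "%08" PRIX32 "%08" PRIX32 "%08" PRIX32
                    "%08" PRIX32 "%08" PRIX32 ".summary",
                    tli,
                    (uint32_t) (start_lsn >> 32), (uint32_t) start_lsn,
                    (uint32_t) (end_lsn >> 32), (uint32_t) end_lsn);
}
```

Because every field is fixed-width hexadecimal, the names sort lexicographically by timeline and then by start LSN, which is convenient when listing the directory.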

Limit blocks: handling drops, truncations, and new forks

A subtle correctness point: marking blocks “modified” is not enough. If a relation is dropped, recreated, or truncated, an incremental backup must not assume the old blocks are still valid. The summarizer encodes this with limit blocks. SummarizeSmgrRecord sets the limit block to 0 when a fork is created (everything is new) and to the truncation point on truncate:

// SummarizeSmgrRecord — src/backend/postmaster/walsummarizer.c
if (info == XLOG_SMGR_CREATE) {
    xlrec = (xl_smgr_create *) XLogRecGetData(xlogreader);
    if (xlrec->forkNum != FSM_FORKNUM)
        BlockRefTableSetLimitBlock(brtab, &xlrec->rlocator, xlrec->forkNum, 0);
} else if (info == XLOG_SMGR_TRUNCATE) {
    xlrec = (xl_smgr_truncate *) XLogRecGetData(xlogreader);
    if ((xlrec->flags & SMGR_TRUNCATE_HEAP) != 0)
        BlockRefTableSetLimitBlock(brtab, &xlrec->rlocator, MAIN_FORKNUM, xlrec->blkno);
    /* ... VM fork via visibilitymap_truncation_length ... */
}

SummarizeDbaseRecord does the same at database/tablespace granularity for CREATE DATABASE ... (FILE_COPY) and DROP DATABASE (which can create or remove many relation files without per-file WAL), and SummarizeXactRecord clears tracking for relations dropped at commit or abort. The guiding invariant, stated in the source, is “We can never lose data by marking more stuff as needing to be backed up in full” — limit blocks are always conservative.
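
One way to read the limit-block rule from the consumer's side is sketched below. This is an interpretation under stated assumptions, not the backup code: `DemoForkEntry`, `DEMO_NO_LIMIT`, and `demo_block_needs_backup` are hypothetical, the modified set is a plain array, and "no limit recorded" is modeled as the largest 32-bit value.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_NO_LIMIT UINT32_MAX    /* assumption: marks "no limit block set" */

/* Hypothetical per-fork entry: a limit block plus the modified block numbers. */
typedef struct
{
    uint32_t        limit_block;
    const uint32_t *modified;
    size_t          nmodified;
} DemoForkEntry;

/* Decide whether a block must be copied into the incremental backup. */
static bool
demo_block_needs_backup(const DemoForkEntry *e, uint32_t blkno)
{
    assert(e != NULL);
    /* At or past the limit: old contents cannot be trusted, copy it. */
    if (blkno >= e->limit_block)
        return true;
    /* Below the limit: copy only if the WAL touched it. */
    for (size_t i = 0; i < e->nmodified; i++)
        if (e->modified[i] == blkno)
            return true;
    return false;
}
```

A limit of 0 (fork created in the range) makes every block qualify, which matches the conservative invariant quoted above.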

SummarizeXlogRecord inspects checkpoint, parameter-change, and end-of-recovery records to extract the wal_level in effect, and decides both whether to cut the file here (true for redo points and shutdown checkpoints, since redo can only begin there) and whether to fast-forward (skip emitting summaries when wal_level=minimal, because an incremental backup spanning minimal-level WAL would be unsafe):

// SummarizeXlogRecord — src/backend/postmaster/walsummarizer.c
if (info == XLOG_CHECKPOINT_REDO)
    memcpy(&record_wal_level, XLogRecGetData(xlogreader), sizeof(int));
else if (info == XLOG_CHECKPOINT_SHUTDOWN) { /* ... rec_ckpt.wal_level ... */ }
else if (info == XLOG_PARAMETER_CHANGE) { /* ... xlrec.wal_level ... */ }
else if (info == XLOG_END_OF_RECOVERY) { /* ... xlrec.wal_level ... */ }
else
    return false; /* not a boundary record */
*new_fast_forward = (record_wal_level == WAL_LEVEL_MINIMAL);
return true;

Two waiting routines tie the summarizer to its consumers. Inside record reading, summarizer_read_local_xlog_page calls summarizer_wait_for_wal when it reaches end-of-WAL on the current timeline; that routine implements an adaptive backoff in 200 ms quanta (doubling when idle up to 30 s, shrinking under load). Outside, WaitForWalSummarization is what an incremental backup calls to block until summarized_lsn reaches the backup’s start LSN, with a progress watchdog that errors out if pending_lsn fails to advance for a full minute:

// WaitForWalSummarization — src/backend/postmaster/walsummarizer.c (condensed)
while (1)
{
    if (!summarize_wal)
        return;     /* disabled while waiting: give up */

    summarized_lsn = WalSummarizerCtl->summarized_lsn;  /* under lock */
    pending_lsn = WalSummarizerCtl->pending_lsn;
    if (summarized_lsn >= lsn)
        break;

    /* per 10s cycle: track deadcycles; error after ~60s of no progress */
    if (deadcycles >= 6)
        ereport(ERROR, ... "WAL summarization is not progressing" ...);

    ConditionVariableTimedSleep(&WalSummarizerCtl->summary_file_cv,
                                timeout_in_ms, WAIT_EVENT_WAL_SUMMARY_READY);
}
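Both waiting policies reduce to small pieces of state that can be modeled outside the server. The sketch below is illustrative only: `Watchdog`, `watchdog_cycle`, `next_sleep_quanta`, and the `SKETCH_` constants are hypothetical names. The doubling and the 30 s cap (150 quanta of 200 ms) follow the description above, while the exact shrink rule is an assumption chosen for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_MAX_QUANTA      150  /* 150 x 200 ms = 30 s cap from the text */
#define SKETCH_MAX_DEAD_CYCLES 6    /* 6 x 10 s cycles = about one minute */

/* Hypothetical watchdog state carried across 10-second wait cycles. */
typedef struct Watchdog
{
    uint64_t prior_pending_lsn;
    int      deadcycles;
} Watchdog;

/* Call once per completed cycle; returns true when the waiter should error out. */
bool
watchdog_cycle(Watchdog *wd, uint64_t pending_lsn)
{
    assert(wd != NULL);
    if (pending_lsn > wd->prior_pending_lsn)
    {
        wd->prior_pending_lsn = pending_lsn;    /* progress: reset the counter */
        wd->deadcycles = 0;
    }
    else
        wd->deadcycles++;
    return wd->deadcycles >= SKETCH_MAX_DEAD_CYCLES;
}

/*
 * Next sleep length in 200 ms quanta. Doubling when idle matches the text;
 * the shrink rule (one quantum per page read) is an assumption of this sketch.
 */
int
next_sleep_quanta(int quanta, int pages_read)
{
    assert(quanta >= 1 && pages_read >= 0);
    if (pages_read == 0)
        return (quanta * 2 > SKETCH_MAX_QUANTA) ? SKETCH_MAX_QUANTA : quanta * 2;
    if (pages_read >= quanta)
        return 1;
    return quanta - pages_read;
}
```

The watchdog keys on pending_lsn and not summarized_lsn, so a summarizer that is still absorbing records into memory but has not yet written a file does not count as stalled.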

MaybeRemoveOldWalSummaries runs once per checkpoint cycle (gated on the redo pointer advancing). It removes a .summary file only when both its underlying WAL is gone (ws->end_lsn <= oldest_lsn on disk) and the file is older than wal_summary_keep_time — so a summary outlives its WAL only long enough to satisfy the configured retention, and incremental backups that still need it can complete.
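The two-part removal condition can be written as a single predicate. This is a minimal sketch, assuming times in seconds and LSNs as plain 64-bit integers; `summary_is_removable` and its parameter names are hypothetical and do not appear in the source.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * A summary may be removed only when BOTH conditions hold:
 *   1. the WAL it describes is no longer on disk, and
 *   2. the file is older than the configured keep time.
 */
bool
summary_is_removable(uint64_t end_lsn, uint64_t oldest_wal_lsn,
                     int64_t file_mtime, int64_t now, int64_t keep_seconds)
{
    assert(keep_seconds >= 0 && now >= file_mtime);

    bool wal_gone = (end_lsn <= oldest_wal_lsn);
    bool expired = (now - file_mtime) > keep_seconds;

    return wal_gone && expired;
}
```

Requiring both conditions means a recent summary survives even after its WAL has been recycled, and an old summary survives for as long as its WAL is still present.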

flowchart TB
  A["WalSummarizerMain loop"] --> B["GetLatestLSN<br/>(flush on primary;<br/>max of replay/recv flush on standby)"]
  B --> C{current_tli<br/>!= latest_tli?}
  C -- yes --> D["tliSwitchPoint<br/>compute switch_lsn"]
  C -- no --> E["SummarizeWAL"]
  D --> E
  E --> F["XLogReadRecord loop"]
  F --> G{RM_XLOG_ID<br/>boundary?}
  G -- redo/shutdown --> H["cut file at ReadRecPtr<br/>or set fast_forward"]
  G -- other rmgr --> I["BlockRefTableMarkBlockModified<br/>+ limit-block bookkeeping"]
  H --> J["WriteBlockRefTable<br/>durable_rename .summary"]
  I --> F
  J --> K["publish summarized_lsn<br/>broadcast summary_file_cv"]
  K --> A

Figure 3 — The summarizer’s per-file flow. Boundary records (redo points, shutdown checkpoints) cut summary files and toggle fast-forward; all other records feed block references into the in-memory BlockRefTable that is serialized at the cut.

Position hints (as of 2026-06-05, REL_18 273fe94)

| Symbol | File | Line |
|---|---|---|
| PgArchiverMain | src/backend/postmaster/pgarch.c | 218 |
| PgArchWakeup | src/backend/postmaster/pgarch.c | 281 |
| pgarch_MainLoop | src/backend/postmaster/pgarch.c | 311 |
| pgarch_ArchiverCopyLoop | src/backend/postmaster/pgarch.c | 381 |
| pgarch_archiveXlog | src/backend/postmaster/pgarch.c | 517 |
| pgarch_readyXlog | src/backend/postmaster/pgarch.c | 645 |
| ready_file_comparator | src/backend/postmaster/pgarch.c | 781 |
| PgArchForceDirScan | src/backend/postmaster/pgarch.c | 804 |
| pgarch_archiveDone | src/backend/postmaster/pgarch.c | 818 |
| LoadArchiveLibrary | src/backend/postmaster/pgarch.c | 913 |
| RestoreArchivedFile | src/backend/access/transam/xlogarchive.c | 54 |
| ExecuteRecoveryCommand | src/backend/access/transam/xlogarchive.c | 295 |
| KeepFileRestoredFromArchive | src/backend/access/transam/xlogarchive.c | 358 |
| XLogArchiveNotify | src/backend/access/transam/xlogarchive.c | 444 |
| XLogArchiveForceDone | src/backend/access/transam/xlogarchive.c | 510 |
| XLogArchiveCheckDone | src/backend/access/transam/xlogarchive.c | 565 |
| XLogArchiveIsBusy | src/backend/access/transam/xlogarchive.c | 619 |
| XLogArchiveCleanup | src/backend/access/transam/xlogarchive.c | 712 |
| WalSummarizerMain | src/backend/postmaster/walsummarizer.c | 214 |
| GetWalSummarizerState | src/backend/postmaster/walsummarizer.c | 451 |
| GetOldestUnsummarizedLSN | src/backend/postmaster/walsummarizer.c | 509 |
| WaitForWalSummarization | src/backend/postmaster/walsummarizer.c | 664 |
| GetLatestLSN | src/backend/postmaster/walsummarizer.c | 804 |
| SummarizeWAL | src/backend/postmaster/walsummarizer.c | 910 |
| SummarizeSmgrRecord | src/backend/postmaster/walsummarizer.c | 1319 |
| SummarizeXlogRecord | src/backend/postmaster/walsummarizer.c | 1429 |
| summarizer_read_local_xlog_page | src/backend/postmaster/walsummarizer.c | 1502 |
| summarizer_wait_for_wal | src/backend/postmaster/walsummarizer.c | 1616 |
| MaybeRemoveOldWalSummaries | src/backend/postmaster/walsummarizer.c | 1662 |

All excerpts were taken from REL_18_STABLE at commit 273fe94852b3a7e34fd171e8abdf1481beb302fa (PG 18.x). The following claims were checked directly against the source tree:

  • Archiver is a latch-driven auxiliary process, not on the commit path. Confirmed: PgArchiverMain sets MyBackendType = B_ARCHIVER and pgarch_MainLoop blocks in WaitLatch(... WAIT_EVENT_ARCHIVER_MAIN) with a PGARCH_AUTOWAKE_INTERVAL (60 s) timeout. Backends never call into pgarch.c to archive; they only drop .ready files and call PgArchWakeup.

  • .ready / .done status protocol. Confirmed in xlogarchive.c: XLogArchiveNotify creates <xlog>.ready; pgarch_archiveDone (pgarch.c) renames .ready to .done; XLogArchiveCheckDone gates recycling on .done and re-creates a missing .ready. The done-rename is non-durable by design (plain rename, with a comment that re-archiving after a crash is acceptable).

  • 64-file max-heap with history-first, oldest-first priority. Confirmed: NUM_FILES_PER_DIRECTORY_SCAN is 64; pgarch_readyXlog fills arch_files->arch_heap (a binaryheap ordered by ready_file_comparator), which returns history files before segments and older names before newer.

  • Exactly one of archive_command / archive_library. Confirmed: both LoadArchiveLibrary and ProcessPgArchInterrupts ereport(ERROR) if both GUCs are non-empty; an empty archive_library selects the built-in shell_archive_init.

  • Archive module errors are downgraded, not fatal. Confirmed: pgarch_archiveXlog installs its own sigsetjmp/PG_exception_stack and returns false on caught ERROR, with explicit cleanup (LWLockReleaseAll, pgaio_error_cleanup, AtEOXact_Files, …).

  • restore_command failure means end-of-WAL. Confirmed: RestoreArchivedFile returns false (and recovery treats that as end-of-WAL) when system() returns non-zero for non-signal reasons; a SIGTERM triggers proc_exit(1) instead.

  • Summarizer exists and is PG17+. Confirmed: walsummarizer.c defines WalSummarizerMain with MyBackendType = B_WAL_SUMMARIZER, gated by the summarize_wal GUC; the file’s header documents emitting summary files of modified blocks per LSN range. (Feature introduced in PG 17 and present on REL_18.)

  • Summary file naming and content. Confirmed: SummarizeWAL writes pg_wal/summaries/%08X%08X%08X%08X%08X.summary (TLI + start LSN halves + end LSN halves) via WriteBlockRefTable then durable_rename. Block references come from BlockRefTableMarkBlockModified, skipping FSM_FORKNUM.

  • File boundaries at checkpoint redo / shutdown; wal_level=minimal fast-forward. Confirmed: SummarizeXlogRecord returns true (cut here) for XLOG_CHECKPOINT_REDO, XLOG_CHECKPOINT_SHUTDOWN, XLOG_PARAMETER_CHANGE, XLOG_END_OF_RECOVERY, and sets *new_fast_forward = (record_wal_level == WAL_LEVEL_MINIMAL).

  • Limit blocks for drops/truncations/creates. Confirmed in SummarizeSmgrRecord, SummarizeDbaseRecord, SummarizeXactRecord via BlockRefTableSetLimitBlock.

  • Catch-up wait watchdog. Confirmed: WaitForWalSummarization ereport(ERROR ... "WAL summarization is not progressing") after deadcycles >= 6 (≈60 s with no pending_lsn advance).
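The summary file naming confirmed in the list above (TLI, then the high and low halves of the start LSN, then the high and low halves of the end LSN, each as eight uppercase hex digits) can be reproduced with a short formatter. This is a sketch: `summary_file_name` is a hypothetical helper, it builds only the file name and not the pg_wal/summaries/ directory prefix, and it treats LSNs as plain 64-bit integers.

```c
#include <assert.h>
#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Format "<TLI><start hi><start lo><end hi><end lo>.summary" into buf.
 * Returns the snprintf result: 48 when the name fits (40 hex digits plus
 * the 8-character suffix).
 */
int
summary_file_name(char *buf, size_t buflen, uint32_t tli,
                  uint64_t start_lsn, uint64_t end_lsn)
{
    assert(buf != NULL && start_lsn <= end_lsn);

    return snprintf(buf, buflen,
                    "%08" PRIX32 "%08" PRIX32 "%08" PRIX32
                    "%08" PRIX32 "%08" PRIX32 ".summary",
                    tli,
                    (uint32_t) (start_lsn >> 32), (uint32_t) start_lsn,
                    (uint32_t) (end_lsn >> 32), (uint32_t) end_lsn);
}
```

For timeline 1 and the LSN range 1/02000028 to 1/03000000, this yields 0000000100000001020000280000000103000000.summary. Fixed-width hex fields mean that, within one timeline, names sort lexically in LSN order.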

Adjacent mechanisms are intentionally deferred to sibling docs and were not re-verified here: the WAL record format, LSN model, and segment recycling (postgres-xlog-wal.md); the consumer side of summaries in incremental backup (postgres-incremental-backup.md); and base-backup mechanics (postgres-backup-basebackup.md). The streaming transport that feeds a standby’s restore/replay sits in postgres-wal-sender-receiver.md.

Scope note. basic_archive and other contrib/ archive modules are out of scope; only the in-core mechanism (the ArchiveModuleCallbacks vtable, built-in shell_archive) is analyzed. Third-party tools such as pgBackRest or WAL-G are named in the next section only as comparative examples, not analyzed from source.

Beyond PostgreSQL — Comparative Designs & Research Frontiers


PostgreSQL’s archiving and summarization are conservative, log-derived, out-of-band designs. Naming the alternatives sharpens what those choices buy and what they cost.

  • External archivers vs. the in-core daemon — pgBackRest / WAL-G. The most common production setups replace the bare archive_command with a helper invoked as the archive command. pgBackRest and WAL-G still ride the .ready/.done handshake and the per-segment callback — they are fancier archive_file_cb payloads — but add compression, encryption, parallel transfer to object storage, and retention policy that the in-core archiver deliberately omits. The instructive contrast is the fork-per-file cost: PostgreSQL’s archive_command forks a shell per 16 MB segment, which caps single-stream throughput; archive_library (the module path analyzed here) was added precisely so a tool can stay resident and batch. A note mapping pgBackRest’s async-archive queue onto PostgreSQL’s max-heap pgarch_readyXlog priority would show where the community tool re-implements the same back-pressure the core already has.

  • In-band changed-block tracking — Oracle BCT, SQL Server DCM. Oracle’s Block Change Tracking file and SQL Server’s differential changed-map (DCM) bitmaps update a change record synchronously on the write path, so an incremental backup reads a ready-made bitmap with near-zero latency. PostgreSQL chose the opposite: the WAL summarizer derives the change set after the fact, keeping the write path untouched at the cost of a summarization lag that pg_basebackup --incremental must wait out (the WaitForWalSummarization watchdog). The trade is write-amplification and recovery-correctness coupling (in-band) versus background-CPU and lag (out-of-band). PostgreSQL’s bet is that WAL already names every modified block, so a second synchronous structure would be redundant work on the hot path.

  • Physical vs. logical change tracking. The BlockRefTable is a physical index — (relfilenode, fork, blocknumber) — so it composes with block-level PITR but says nothing about rows. Logical-replication change streams (postgres-logical-decoding.md) are the row-level analogue; they answer “what tuples changed” but cannot drive a block-incremental backup. The two indexes over the same WAL serve disjoint consumers, which is why PostgreSQL maintains both rather than unifying them.

  • Continuous-archiving alternatives — streaming-only DR. A standby fed purely by streaming replication (postgres-wal-sender-receiver.md) needs no archive at all for HA, which tempts operators to drop archiving. The frontier question is gap recovery: a standby that falls behind the primary’s wal_keep_size must fall back to restore_command from the archive (or a replication slot must pin the WAL, at the cost of unbounded pg_wal growth). The archive and the slot are two answers to the same “don’t recycle WAL a consumer still needs” problem — one durable and out-of-cluster, one in-cluster and disk-pressure-prone.

  • Summary-file granularity and the research frontier. Cutting summaries at checkpoint-redo boundaries guarantees every incremental backup’s reference LSN lands on a file boundary, but it ties summary resolution to checkpoint frequency. Finer-grained or overlapping summaries (so a backup could start mid-checkpoint) would shrink the wait but complicate the union logic in BlockRefTable merging. Likewise, the summarizer is a single process reading WAL serially; a parallel summarizer (partitioning the LSN range) is an unrealized scaling axis for very high WAL-rate clusters. Both are open design questions the current code’s comments gesture at but do not pursue.

  • ARIES lineage. The whole edifice rests on ARIES (Mohan et al. 1992): write-ahead logging makes the WAL canonical, idempotent page-LSN redo makes roll-forward onto an arbitrarily old base backup correct, and the redo-from-checkpoint discipline is exactly what forces summary boundaries to checkpoint records. PostgreSQL’s contribution here is not the recovery theory but the secondary index over the log (the BlockRefTable) that turns “replay everything” into “copy only what changed” — a pragmatic, post-ARIES engineering layer rather than a new recovery algorithm.

In-tree source files (REL_18_STABLE, commit 273fe94)

  • src/backend/postmaster/pgarch.c — the archiver process: PgArchiverMain, pgarch_MainLoop, pgarch_ArchiverCopyLoop, pgarch_archiveXlog, pgarch_readyXlog, ready_file_comparator, PgArchForceDirScan, pgarch_archiveDone, LoadArchiveLibrary, the SIGUSR2-final-cycle protocol, the 64-file max-heap, and orphan-.ready cleanup.
  • src/backend/postmaster/walsummarizer.c — the WAL summarizer: WalSummarizerMain, SummarizeWAL, SummarizeXlogRecord, SummarizeSmgrRecord/SummarizeDbaseRecord/SummarizeXactRecord, summarizer_read_local_xlog_page, summarizer_wait_for_wal, WaitForWalSummarization, GetOldestUnsummarizedLSN, MaybeRemoveOldWalSummaries, and the BlockRefTable population that skips FSM_FORKNUM.
  • src/backend/access/transam/xlogarchive.c — the recovery/notification side: RestoreArchivedFile, ExecuteRecoveryCommand, KeepFileRestoredFromArchive, XLogArchiveNotify, XLogArchiveForceDone, XLogArchiveCheckDone, XLogArchiveIsBusy, XLogArchiveCleanup — the .ready/.done status protocol and restore_command execution.
  • src/backend/archive/shell_archive.c — the built-in shell_archive_init module that wraps archive_command, so the archiver only ever calls one ArchiveModuleCallbacks vtable.
  • src/include/postmaster/pgarch.h, src/include/postmaster/walsummarizer.h, src/include/access/xlogarchive.h, src/include/archive/archive_module.h — the public signatures, the ArchiveModuleCallbacks struct, and the GUC hooks (archive_mode, archive_command, archive_library, summarize_wal, wal_summary_keep_time).
  • Mohan, C., Haderle, D., Lindsay, B., Pirahesh, H. & Schwarz, P. (1992). “ARIES: A Transaction Recovery Method Supporting Fine-Granularity Locking and Partial Rollbacks Using Write-Ahead Logging.” ACM TODS 17(1):94-162. The write-ahead-logging and idempotent-redo discipline that makes roll-forward onto an old base backup correct. Captured in knowledge/research/dbms-papers/aries.md.
  • Database Internals (Petrov, 2019), ch. 3 (WAL / log-structured storage) and the recovery chapters — the “WAL is the source of truth, data files are derived” framing both archiving and summarization rest on. Capture under knowledge/research/dbms-general/.
  • Database System Concepts (Silberschatz, Korth, Sudarshan, 7e), ch. 19 “Recovery System” — checkpoints, log-based recovery, and the fuzzy-checkpoint / redo-from-checkpoint model that fixes summary-file boundaries. Capture under knowledge/research/dbms-general/.

Sibling docs (cross-references — mechanism owned there, not duplicated here)

  • postgres-xlog-wal.md — the WAL record format, LSN model, segment naming, and pg_wal recycling horizon that archiving extends and the summarizer reads.
  • postgres-incremental-backup.md — the consumer of .summary files: how pg_basebackup --incremental unions BlockRefTables between a reference backup and now to copy only changed blocks.
  • postgres-backup-basebackup.md — base-backup mechanics (pg_backup_start/pg_backup_stop, the backup label, pg_basebackup) that PITR rolls archived WAL forward on top of.
  • postgres-checkpoint.md — the XLOG_CHECKPOINT_REDO / XLOG_CHECKPOINT_SHUTDOWN records that fix summary-file cut points and the redo pointer PITR replays from.
  • postgres-recovery-redo.md — the recovery loop that calls RestoreArchivedFile to pull archived segments back during PITR.
  • postgres-wal-sender-receiver.md — the streaming transport that is the archive’s alternative for feeding a standby, and the gap-recovery fallback to restore_command.
  • postgres-aux-processes.md — where the archiver (B_ARCHIVER) and WAL summarizer (B_WAL_SUMMARIZER) sit in the postmaster’s auxiliary-process roster.