PostgreSQL Recovery & Redo — Crash Recovery, PITR, and Hot Standby
Contents:
- Theoretical Background
- Common DBMS Design
- PostgreSQL’s Approach
- Source Walkthrough
- Source verification (as of 2026-06-05)
- Beyond PostgreSQL — Comparative Designs & Research Frontiers
- Sources
Theoretical Background
Recovery is the answer to the question: after a crash, how does the
engine get back to a state in which every committed transaction is
visible and every uncommitted transaction is invisible? The canonical
answer is ARIES (Mohan et al., ARIES: A Transaction Recovery
Method Supporting Fine-Granularity Locking and Partial Rollbacks Using
Write-Ahead Logging, ACM TODS 1992; captured at
knowledge/research/dbms-papers/aries.md). PostgreSQL’s recovery
machinery follows ARIES’s three principles, adopting the first two
directly and adapting the third to MVCC:
- Write-ahead logging. Every change to a data page is first recorded in the WAL. The WAL record at LSN L must reach stable storage before the data page whose PageLSN is L is written. This invariant, enforced by the XLogFlush call that the buffer manager makes before writing a dirty page, makes both the steal and no-force policies safe: a dirty page can leave the cache before its transaction commits (steal), and a committed transaction’s pages need not be flushed at commit time (no-force), because the log alone is sufficient to redo or undo either case.
- Repeating history during redo. On restart, every WAL record since the last checkpoint is replayed in LSN order, including records belonging to transactions that ultimately aborted, to reconstruct the exact page state at the moment of the crash. In ARIES, only after this “repeat history” pass does the engine roll back uncommitted transactions. PostgreSQL takes the redo half of this principle literally: PerformWalRecovery replays forward from the checkpoint’s redo pointer without skipping any record type, letting each rmgr’s rm_redo callback reapply its own record so that the page ends up as it was during normal operation.
- Logging undo actions (compensation log records). In ARIES, when a transaction is rolled back, the undo actions themselves are logged as compensation log records so that a crash during undo does not lose progress. PostgreSQL has no per-row undo pass and therefore writes no CLRs in the ARIES sense; the closest analogue is the abort record, which xact_redo replays to mark the transaction aborted so that MVCC visibility rules hide its tuple versions.
The key data structure tying together all three principles is the
LSN (Log Sequence Number) — a 64-bit byte offset into the WAL
stream. Every heap and index page carries a PageLSN field in its
header (see postgres-xlog-wal.md and postgres-page-layout.md).
Two comparisons do all the work:
- WAL rule: a data page may be written to disk only when flushedLSN >= PageLSN(page). Enforced in FlushBuffer.
- Idempotent redo: a WAL record at LSN L is skipped if PageLSN(page) >= L. This makes crash-restart safe to retry.
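The two comparisons can be written down as a pair of predicates. This is a minimal sketch, not PostgreSQL code: `lsn_t`, `wal_rule_allows_write`, and `redo_is_needed` are hypothetical names chosen for illustration, and the LSN is modeled as a plain 64-bit integer.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy stand-in for XLogRecPtr: a 64-bit byte offset into the WAL. */
typedef uint64_t lsn_t;

/* WAL rule: a page may go to disk only once the log is durable up to its PageLSN. */
bool
wal_rule_allows_write(lsn_t flushed_lsn, lsn_t page_lsn)
{
    return flushed_lsn >= page_lsn;
}

/* Idempotent redo: a record is applied only if the page has not seen it yet. */
bool
redo_is_needed(lsn_t page_lsn, lsn_t record_lsn)
{
    return page_lsn < record_lsn;
}
```

Because both checks are pure comparisons on monotonically increasing offsets, replaying the same stretch of WAL twice leaves the second pass with nothing to do.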
Beyond pure crash recovery, PostgreSQL generalizes the replay loop to
two further modes. Point-in-Time Recovery (PITR) stops replay at a
user-specified target — an XID, a timestamp, a named restore point, or
an LSN — rather than at end-of-WAL. Hot standby keeps replay
running indefinitely, consuming WAL from a primary via streaming
replication or archive fetch, and opens read-only query connections
once the database reaches a consistent state. All three modes share the
same PerformWalRecovery loop; the differences lie in where WAL comes
from and when the loop stops.
Database Internals (Petrov, ch. 5, “Transaction Processing and Recovery”) frames the design space around two axes that shape the implementation:
- Granularity of logging: physical (byte ranges), logical (operations), or physiological (page-scoped operations). PostgreSQL is physiological: each rmgr describes a page edit in terms its handler understands, not raw byte diffs. This makes redo handlers simpler to write and makes FPI (full-page image) redundancy straightforward.
- Redo-only vs. undo+redo: does the engine need a separate undo pass, or is the WAL self-contained? PostgreSQL uses redo-only crash recovery (the “repeat history” principle), with undo handled by replaying abort records rather than a separate undo log.
Common DBMS Design
The textbook gives the model; the following patterns are the engineering conventions that nearly every ARIES-family engine (Oracle, InnoDB, SQL Server, DB2, and PostgreSQL) converges on in some form.
Three-phase recovery: analysis, redo, undo
ARIES prescribes three passes over the log on restart:
- Analysis phase: scan forward from the last checkpoint to reconstruct the dirty-page table (which pages were modified and not yet flushed) and the active transaction table (which transactions were in flight at the crash). This pass determines the redo start point (the oldest dirty-page LSN) and the undo set (transactions that never committed).
- Redo phase: replay every log record from the redo start point, applying changes to pages that need it (those whose PageLSN is below the record LSN). After this pass the buffer pool reflects exactly what it held at the crash.
- Undo phase: roll back every transaction that was active at the crash, logging compensation records for each undo action.
PostgreSQL simplifies the analysis phase: because the WAL contains
explicit checkpoint records (each carrying a redo pointer plus XID
bookkeeping such as nextXid and oldestActiveXid), the engine takes the
redo start point directly from the checkpoint record instead of
rebuilding a dirty-page table in a separate analysis scan. The undo
phase is implicit: aborted transactions are handled as their ABORT
records arrive in the redo phase, and transactions that left no commit
record are never treated as committed, so MVCC visibility rules hide
their changes without a per-row rollback pass.
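The "implicit undo" idea can be illustrated with a toy forward-only replay. This is a sketch under simplifying assumptions: the record kinds, the fixed-size status table, and the names `toy_replay` and `toy_is_visible` are all hypothetical, and real PostgreSQL tracks transaction status in far richer structures.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record kinds for this sketch. */
typedef enum { REC_CHANGE, REC_COMMIT, REC_ABORT } rec_kind;
typedef enum { XACT_IN_PROGRESS, XACT_COMMITTED, XACT_ABORTED } xact_status;

typedef struct {
    rec_kind kind;
    uint32_t xid;   /* small integer, used as a table index in this toy */
} toy_record;

#define MAX_XID 16

/*
 * Forward-only replay: every record is visited in log order. Data changes
 * are "applied" (counted here) no matter how their transaction ends;
 * commit and abort records only set the transaction's outcome.
 */
size_t
toy_replay(const toy_record *wal, size_t n, xact_status status[MAX_XID])
{
    size_t applied = 0;

    for (size_t i = 0; i < MAX_XID; i++)
        status[i] = XACT_IN_PROGRESS;

    for (size_t i = 0; i < n; i++) {
        if (wal[i].xid >= MAX_XID)
            continue;               /* outside the toy status table */
        switch (wal[i].kind) {
        case REC_CHANGE: applied++; break;
        case REC_COMMIT: status[wal[i].xid] = XACT_COMMITTED; break;
        case REC_ABORT:  status[wal[i].xid] = XACT_ABORTED; break;
        }
    }
    return applied;
}

/* Only a commit record makes changes visible; aborted and crashed look alike. */
int
toy_is_visible(const xact_status status[MAX_XID], uint32_t xid)
{
    return xid < MAX_XID && status[xid] == XACT_COMMITTED;
}
```

In this model a transaction that was still running at the crash needs no explicit rollback: its changes are on the pages, but nothing ever marks it committed, so no reader treats them as visible.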
Recovery modes and signal files
Production engines distinguish crash recovery (automatic, no
configuration) from archive recovery / PITR (operator-initiated,
requires a restore command) from standby mode (long-running,
continuously consuming WAL). The standard mechanism is a signal file:
the presence of recovery.signal or standby.signal in $PGDATA
selects the mode, and the absence of both leaves the engine in crash
recovery mode. GUC parameters (recovery_target, primary_conninfo,
restore_command) are the external knobs; the signal file is the mode
selector.
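Mode selection by signal file reduces to two existence checks. The sketch below is illustrative only: the enum and function names are hypothetical, and the precedence of standby.signal when both files exist is an assumption of this sketch rather than something stated above.

```c
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

typedef enum {
    MODE_CRASH_RECOVERY,
    MODE_ARCHIVE_RECOVERY,
    MODE_STANDBY
} recovery_mode;

/* Pure decision: which mode do the two signal files select? */
recovery_mode
select_recovery_mode(bool have_standby_signal, bool have_recovery_signal)
{
    if (have_standby_signal)
        return MODE_STANDBY;          /* assumption: wins if both files exist */
    if (have_recovery_signal)
        return MODE_ARCHIVE_RECOVERY;
    return MODE_CRASH_RECOVERY;       /* neither file present */
}

/* Returns true if <pgdata>/<name> exists. */
bool
signal_file_exists(const char *pgdata, const char *name)
{
    char path[1024];
    int  n = snprintf(path, sizeof(path), "%s/%s", pgdata, name);

    if (n < 0 || (size_t) n >= sizeof(path))
        return false;                 /* path too long for this sketch */
    return access(path, F_OK) == 0;
}

recovery_mode
recovery_mode_for_datadir(const char *pgdata)
{
    return select_recovery_mode(signal_file_exists(pgdata, "standby.signal"),
                                signal_file_exists(pgdata, "recovery.signal"));
}
```

Keeping the decision separate from the file-system probe makes the rule easy to test without touching a real data directory.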
Timelines and forking
When a standby is promoted or a PITR target is reached and the engine starts writing new WAL, the new WAL stream must be distinguishable from the old stream at the same LSN positions, because another replica may have already consumed the old stream up to the promotion point, and the two streams diverge from there. The universal solution is a timeline ID: a monotonically increasing integer that forms the leading part of every WAL segment file name. Each promotion increments the timeline counter and writes a .history file recording the LSN at which this timeline branched from its parent. Recovery consults the history chain to identify which timeline’s WAL to consume for any given LSN range.
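To see how the timeline ID keeps two streams apart at the same LSN, here is a small sketch of the segment naming scheme: 24 hexadecimal digits, the first 8 of which are the timeline. The function name and the fixed 16 MiB segment size are assumptions of this sketch; real installations can be initialized with a different segment size.

```c
#include <stdint.h>
#include <stdio.h>

#define TOY_WAL_SEGMENT_SIZE (16u * 1024 * 1024)   /* assumed: 16 MiB segments */

/* Writes a 24-hex-digit name: 8 digits of timeline, 16 of segment position. */
void
toy_wal_file_name(char *buf, size_t buflen, uint32_t tli, uint64_t lsn)
{
    uint64_t segno = lsn / TOY_WAL_SEGMENT_SIZE;
    uint64_t segs_per_id = UINT64_C(0x100000000) / TOY_WAL_SEGMENT_SIZE;

    snprintf(buf, buflen, "%08X%08X%08X",
             (unsigned) tli,
             (unsigned) (segno / segs_per_id),
             (unsigned) (segno % segs_per_id));
}
```

The same LSN on timelines 1 and 2 maps to file names that differ only in the first eight digits, which is what lets an old primary and a promoted standby write the same LSN range without overwriting each other's segments in a shared archive.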
Consistency point and hot standby
A standby cannot serve queries until it has reached a state where every
visible tuple’s inserting transaction is known to be committed, and the
minimum recovery point (the LSN recorded in pg_control that replay
must reach before the data files can be considered consistent) has been passed. Engines
that support read queries during recovery track this “consistency
point” and signal it to connection-accepting components.
Read-ahead and prefetching during replay
Replay is I/O bound when the working set exceeds the buffer pool: every
rm_redo call that needs a page not already in the pool must wait for
a synchronous read. The standard mitigation is WAL prefetching: a
separate thread or coroutine scans ahead in the decoded WAL stream,
identifies pages that will be needed soon, and issues non-blocking
readahead(2) or posix_fadvise(POSIX_FADV_WILLNEED) hints to prime
the kernel’s page cache before the redo loop reaches those records.
Theory ↔ PostgreSQL mapping
| Theory concept | PostgreSQL name |
|---|---|
| Checkpoint redo pointer | CheckPoint.redo (in CheckPoint struct) |
| Dirty-page table / redo start | RedoStartLSN (from checkpoint record) |
| Active-transaction set at crash | oldestActiveXid in CheckPoint struct, plus XLOG_RUNNING_XACTS records |
| Analysis phase | implicit — read from checkpoint record in InitWalRecovery |
| Redo phase | PerformWalRecovery / ApplyWalRecord loop |
| Undo phase | replay of ABORT records + post-recovery RecoverPreparedTransactions |
| Compensation log record | no direct equivalent; closest is XLOG_XACT_ABORT replayed by xact_redo |
| LSN (byte offset) | XLogRecPtr (64-bit, pg_lsn type for SQL) |
| Page LSN | PageHeaderData.pd_lsn (first 8 bytes of every page) |
| Timeline ID | TimeLineID (uint32; incremented on every promotion) |
| Timeline history file | <tli>.history in pg_wal/ |
| Consistency point | minRecoveryPoint in pg_control; reachedConsistency flag |
| Hot standby activation | PMSIGNAL_BEGIN_HOT_STANDBY from CheckRecoveryConsistency |
| WAL prefetch hint | XLogPrefetcherNextBlock → PrefetchSharedBuffer |
PostgreSQL’s Approach
Recovery modes and startup orchestration
PostgreSQL’s recovery machinery is concentrated in a single source file,
xlogrecovery.c, and runs entirely inside the startup process — the
special backend type launched by postmaster on every restart. Three
functions divide the work cleanly:
```c
// InitWalRecovery (src/backend/access/transam/xlogrecovery.c)
void
InitWalRecovery(ControlFileData *ControlFile, bool *wasShutdown_ptr,
                bool *haveBackupLabel_ptr, bool *haveTblspcMap_ptr)
```
InitWalRecovery reads pg_control, calls readRecoverySignalFile()
to detect recovery.signal or standby.signal, and calls
validateRecoveryParameters() to check that the GUC recovery target
settings are consistent with the chosen mode. It allocates the
XLogReaderState and wraps it in an XLogPrefetcher. The checkpoint
record’s location comes from pg_control (or from backup_label for
base-backup restores), and RedoStartLSN is established from the
checkpoint’s redo field. The function returns with the system
positioned at the first WAL record to replay.
```c
// PerformWalRecovery (src/backend/access/transam/xlogrecovery.c)
void PerformWalRecovery(void)
```
PerformWalRecovery is the main loop. It initializes the shared
XLogRecoveryCtlData watermarks, calls RmgrStartup() to run each
resource manager’s rm_startup callback, then enters the
do { ... } while (record != NULL) replay loop. Each iteration checks
recoveryStopsBefore / recoveryStopsAfter for PITR target hits, calls
ApplyWalRecord, and then calls ReadRecord (via the prefetcher) to
fetch the next decoded record. The loop exits when there is no more
WAL or a recovery target is reached.
```c
// FinishWalRecovery (src/backend/access/transam/xlogrecovery.c)
EndOfWalRecoveryInfo *FinishWalRecovery(void)
```
FinishWalRecovery shuts down the WAL receiver, determines the end of
the last valid record (endOfLog), and returns an
EndOfWalRecoveryInfo struct that the caller (StartupXLOG in
xlog.c) uses to initialize the WAL for writing. Once StartupXLOG has
completed that initialization, the engine is ready to accept new writes.
The shared control block: XLogRecoveryCtlData
All three functions communicate through a fixed shared-memory struct
allocated in XLogRecoveryShmemInit:
```c
// XLogRecoveryCtlData (src/backend/access/transam/xlogrecovery.c)
typedef struct XLogRecoveryCtlData
{
    bool        SharedHotStandbyActive;
    bool        SharedPromoteIsTriggered;
    Latch       recoveryWakeupLatch;

    XLogRecPtr  lastReplayedReadRecPtr; /* start of last replayed record */
    XLogRecPtr  lastReplayedEndRecPtr;  /* end+1 of last replayed record */
    TimeLineID  lastReplayedTLI;

    /* during rm_redo call: end+1 of record being replayed */
    XLogRecPtr  replayEndRecPtr;
    TimeLineID  replayEndTLI;

    TimestampTz recoveryLastXTime;
    TimestampTz currentChunkStartTime;
    RecoveryPauseState recoveryPauseState;
    ConditionVariable recoveryNotPausedCV;

    slock_t     info_lck;
} XLogRecoveryCtlData;
```
The two watermarks lastReplayedEndRecPtr (updated after a record is
successfully replayed) and replayEndRecPtr (updated before the
rm_redo call begins) serve different consumers. replayEndRecPtr is
read by XLogFlush so that minRecoveryPoint is updated correctly
even mid-record. lastReplayedEndRecPtr is the value reported by
GetXLogReplayRecPtr and used by CheckRecoveryConsistency to decide
when to open hot-standby connections.
The recoveryPauseState / recoveryNotPausedCV pair implements the
pg_wal_replay_pause() / pg_wal_replay_resume() SQL functions by
suspending PerformWalRecovery mid-loop in a wait on that condition variable.
ApplyWalRecord: dispatch and idempotency
```c
// ApplyWalRecord (src/backend/access/transam/xlogrecovery.c)
static void
ApplyWalRecord(XLogReaderState *xlogreader, XLogRecord *record,
               TimeLineID *replayTLI)
{
    // ... condensed ...
    AdvanceNextFullTransactionIdPastXid(record->xl_xid);

    /* Check for timeline switch */
    if (record->xl_rmid == RM_XLOG_ID)
    {
        // detect XLOG_CHECKPOINT_SHUTDOWN / XLOG_END_OF_RECOVERY
        // update *replayTLI if newReplayTLI != *replayTLI
    }

    /* Update replayEndRecPtr BEFORE calling rm_redo */
    XLogRecoveryCtl->replayEndRecPtr = xlogreader->EndRecPtr;

    /* Track XIDs seen in WAL so hot standby can build snapshots */
    if (standbyState >= STANDBY_INITIALIZED &&
        TransactionIdIsValid(record->xl_xid))
        RecordKnownAssignedTransactionIds(record->xl_xid);

    /* Dispatch: RM_XLOG_ID records get recovery-side handling first */
    if (record->xl_rmid == RM_XLOG_ID)
        xlogrecovery_redo(xlogreader, *replayTLI);
    GetRmgr(record->xl_rmid).rm_redo(xlogreader);

    /* Page consistency check, only for records flagged XLR_CHECK_CONSISTENCY */
    if ((record->xl_info & XLR_CHECK_CONSISTENCY) != 0)
        verifyBackupPageConsistency(xlogreader);

    /* Update lastReplayedEndRecPtr AFTER rm_redo returns */
    XLogRecoveryCtl->lastReplayedEndRecPtr = xlogreader->EndRecPtr;
    // ... condensed ...
}
```
The idempotency invariant is enforced inside each rmgr’s rm_redo
callback, not in ApplyWalRecord itself. The standard pattern, used
by the heap and index rmgrs, is to call XLogReadBufferForRedo, which
compares the page’s PageLSN to the record’s LSN and returns
BLK_NEEDS_REDO only if the page requires the update. Handlers that
rebuild a page from scratch call XLogInitBufferForRedo instead; it
hands back a zeroed page that the handler always reinitializes, so
replaying the record again produces the same page. This means replay
is safe to restart from any checkpoint: any record whose effect is
already on the page is silently skipped.
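The effect of the LSN check is easiest to see with a change that is not idempotent on its own. The following is a toy model rather than PostgreSQL code: `toy_page`, `toy_record`, and the `toy_*` functions are hypothetical, and the "page" is reduced to a counter plus an LSN.

```c
#include <stdint.h>

typedef uint64_t lsn_t;

typedef enum { TOY_BLK_NEEDS_REDO, TOY_BLK_DONE } toy_redo_action;

typedef struct {
    lsn_t page_lsn;   /* plays the role of pd_lsn */
    int   counter;    /* stand-in for page contents */
} toy_page;

typedef struct {
    lsn_t end_lsn;    /* LSN stamped on the page once the record is applied */
    int   delta;      /* "add delta": not idempotent without the LSN guard */
} toy_record;

/* Mirrors the comparison described above: already applied means nothing to do. */
toy_redo_action
toy_read_buffer_for_redo(const toy_page *page, const toy_record *rec)
{
    return page->page_lsn >= rec->end_lsn ? TOY_BLK_DONE : TOY_BLK_NEEDS_REDO;
}

/* A redo handler in miniature: check, apply, then advance the page LSN. */
void
toy_redo(toy_page *page, const toy_record *rec)
{
    if (toy_read_buffer_for_redo(page, rec) == TOY_BLK_NEEDS_REDO) {
        page->counter += rec->delta;
        page->page_lsn = rec->end_lsn;
    }
}
```

Replaying the same two records a second time leaves both the counter and the page LSN unchanged, which is the property that makes restarting recovery from an older checkpoint harmless.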
AdvanceNextFullTransactionIdPastXid ensures that
TransamVariables->nextXid is always beyond any XID seen in a replayed
record, so that new transactions assigned after recovery cannot collide
with replayed ones.
Timeline management
```c
// readTimeLineHistory (src/backend/access/transam/timeline.c)
List *readTimeLineHistory(TimeLineID targetTLI)
```
A timeline history file <tli>.history in pg_wal/ contains one line
per ancestor timeline of the form <parentTLI>\t<switchpoint>, followed
by a free-text reason. Timeline 1 has no history file (it is the
root). readTimeLineHistory parses this file into a List * of
TimeLineHistoryEntry items; the caller (typically InitWalRecovery)
uses tliSwitchPoint to find the LSN at which the target timeline
diverged from any given ancestor.
```c
// writeTimeLineHistory (src/backend/access/transam/timeline.c)
void writeTimeLineHistory(TimeLineID newTLI, TimeLineID parentTLI,
                          XLogRecPtr switchpoint, char *reason)
```
writeTimeLineHistory is called at promotion time. It copies the
parent’s history into the new .history file and appends one line
recording (parentTLI, switchpoint).
Standbys and PITR nodes consult this history to decide which segment
files belong to which timeline: a segment file for TLI N at position
P is authoritative only up to the LSN where the history chain
switches from TLI N to its successor.
Figure 1 — Timeline fork and history chain:
```mermaid
flowchart LR
    TL1["TLI 1<br/>WAL 0/1 → 0/A0"]
    TL2["TLI 2<br/>WAL 0/A0 → 0/C0"]
    TL3["TLI 3<br/>WAL 0/B0 → ..."]
    HF2["2.history\n1\t0/A0"]
    HF3["3.history\n1\t0/A0\n2\t0/B0"]
    TL1 -->|"promote at 0/A0"| TL2
    TL2 -->|"promote at 0/B0"| TL3
    HF2 -. "read by replica".- TL2
    HF3 -. "read by replica".- TL3
```
Figure 1 — Each promotion increments the TLI counter and writes a
.history file that records the fork LSN. A replica reading TLI-3 WAL
chains through both history files to discover that it should read
TLI-1 WAL up to 0/A0, TLI-2 WAL from 0/A0 to 0/B0, and TLI-3
WAL from 0/B0 onward.
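The lookup that the caption describes can be sketched in a few lines. This is a simplified illustration, not the timeline.c implementation: the struct and function names are hypothetical, only the first two fields of a history line are parsed, and entries are assumed to be ordered oldest first.

```c
#include <stdint.h>
#include <stdio.h>

typedef struct {
    uint32_t tli;          /* ancestor timeline */
    uint64_t switchpoint;  /* LSN at which that timeline was left */
} toy_history_entry;

/* Parses one "<tli>\t<hi>/<lo>" line; returns 1 on success, 0 otherwise. */
int
toy_parse_history_line(const char *line, toy_history_entry *out)
{
    unsigned tli, hi, lo;

    if (sscanf(line, "%u\t%X/%X", &tli, &hi, &lo) != 3)
        return 0;
    out->tli = tli;
    out->switchpoint = ((uint64_t) hi << 32) | lo;
    return 1;
}

/*
 * Each ancestor covers the LSNs below its switchpoint; anything at or
 * past the last switchpoint belongs to the target timeline itself.
 */
uint32_t
toy_tli_of_lsn(const toy_history_entry *hist, int n,
               uint32_t target_tli, uint64_t lsn)
{
    for (int i = 0; i < n; i++)
        if (lsn < hist[i].switchpoint)
            return hist[i].tli;
    return target_tli;
}
```

Fed the two lines of 3.history from Figure 1, the lookup assigns LSNs below 0/A0 to TLI 1, the range from 0/A0 up to 0/B0 to TLI 2, and everything from 0/B0 onward to TLI 3.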
The recovery loop and PITR target evaluation
The main loop in PerformWalRecovery uses two boundary checks:
```c
// PerformWalRecovery loop body (src/backend/access/transam/xlogrecovery.c)
do
{
    /* stop BEFORE applying this record? */
    if (recoveryStopsBefore(xlogreader))
    {
        reachedRecoveryTarget = true;
        break;
    }

    /* optional apply delay (recovery_min_apply_delay) */
    if (recoveryApplyDelay(xlogreader))
    {
        /* the latch wait happened inside recoveryApplyDelay */
    }

    ApplyWalRecord(xlogreader, record, &replayTLI);

    /* stop AFTER applying this record? */
    if (recoveryStopsAfter(xlogreader))
    {
        reachedRecoveryTarget = true;
        break;
    }

    record = ReadRecord(xlogprefetcher, LOG, false, replayTLI);
} while (record != NULL);
```
recoveryStopsBefore handles exclusive targets (recovery_target_inclusive = false). recoveryStopsAfter handles inclusive targets. Both
functions compare the record against the five RecoveryTargetType
variants (XID, TIME, NAME, LSN, IMMEDIATE).
When reachedRecoveryTarget is true, the engine consults
recoveryTargetAction — pause, promote, or shutdown — and takes
the corresponding action. pause calls SetRecoveryPause(true) and
blocks until a pg_wal_replay_resume() call.
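The before/after split is clearest for an LSN target. The sketch below models only that one target type and uses hypothetical names throughout (`toy_target`, `toy_replay_until`); the XID, time, name, and immediate targets follow their own rules and are not covered here.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint64_t target_lsn;
    bool     inclusive;   /* models recovery_target_inclusive */
} toy_target;

/* Exclusive target: stop before applying the first record at or past it. */
static bool
stops_before(const toy_target *t, uint64_t rec_lsn)
{
    return !t->inclusive && rec_lsn >= t->target_lsn;
}

/* Inclusive target: apply the first record at or past it, then stop. */
static bool
stops_after(const toy_target *t, uint64_t rec_lsn)
{
    return t->inclusive && rec_lsn >= t->target_lsn;
}

/* Returns how many records were applied; *reached reports a target hit. */
size_t
toy_replay_until(const uint64_t *rec_lsns, size_t n,
                 const toy_target *t, bool *reached)
{
    size_t applied = 0;

    *reached = false;
    for (size_t i = 0; i < n; i++) {
        if (stops_before(t, rec_lsns[i])) {
            *reached = true;
            break;
        }
        applied++;                     /* stands in for ApplyWalRecord */
        if (stops_after(t, rec_lsns[i])) {
            *reached = true;
            break;
        }
    }
    return applied;
}
```

With records at LSNs 100, 200, 300, and 400 and a target of 300, the inclusive setting applies three records and the exclusive setting applies two; a target beyond the end of the log applies everything and never reports a hit.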
Consistency and hot-standby activation
```c
// CheckRecoveryConsistency (src/backend/access/transam/xlogrecovery.c)
void CheckRecoveryConsistency(void)
{
    // ... condensed ...
    if (!reachedConsistency && !backupEndRequired &&
        minRecoveryPoint <= lastReplayedEndRecPtr)
    {
        XLogCheckInvalidPages();
        CheckTablespaceDirectory();
        reachedConsistency = true;
        SendPostmasterSignal(PMSIGNAL_RECOVERY_CONSISTENT);
    }

    if (standbyState == STANDBY_SNAPSHOT_READY &&
        !LocalHotStandbyActive &&
        reachedConsistency &&
        IsUnderPostmaster)
    {
        XLogRecoveryCtl->SharedHotStandbyActive = true;
        LocalHotStandbyActive = true;
        SendPostmasterSignal(PMSIGNAL_BEGIN_HOT_STANDBY);
    }
}
```
Two signals go to postmaster. PMSIGNAL_RECOVERY_CONSISTENT tells it
the database is structurally sound (no invalid page references, no
dangling tablespace directories). PMSIGNAL_BEGIN_HOT_STANDBY enables
read-only connections: postmaster starts accepting client connections
only after receiving the second signal.
The standbyState == STANDBY_SNAPSHOT_READY gate reflects a second
condition: hot standby also requires that KnownAssignedXids (the
in-memory replica of the primary’s active-transaction set) is
sufficiently populated to answer visibility queries. standbyState
progresses from STANDBY_DISABLED → STANDBY_INITIALIZED →
STANDBY_SNAPSHOT_PENDING → STANDBY_SNAPSHOT_READY as XIDs
accumulate.
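The two gates can be modeled as a function that reports which signals a given call would send. This is a sketch with hypothetical names (`toy_recovery`, `toy_check_consistency`, the `TOY_*` constants); it leaves out the page and tablespace checks and the IsUnderPostmaster condition shown in the condensed source above.

```c
#include <stdbool.h>
#include <stdint.h>

typedef enum {
    TOY_STANDBY_DISABLED,
    TOY_STANDBY_INITIALIZED,
    TOY_STANDBY_SNAPSHOT_PENDING,
    TOY_STANDBY_SNAPSHOT_READY
} toy_standby_state;

typedef struct {
    uint64_t          min_recovery_point;
    uint64_t          last_replayed_end;
    bool              backup_end_required;
    bool              reached_consistency;
    bool              hot_standby_active;
    toy_standby_state standby_state;
} toy_recovery;

enum { TOY_SIG_CONSISTENT = 1, TOY_SIG_BEGIN_HOT_STANDBY = 2 };

/* Returns a bitmask of the signals this call would send to postmaster. */
int
toy_check_consistency(toy_recovery *r)
{
    int sent = 0;

    /* Gate 1: replay has passed the minimum recovery point. */
    if (!r->reached_consistency && !r->backup_end_required &&
        r->min_recovery_point <= r->last_replayed_end) {
        r->reached_consistency = true;
        sent |= TOY_SIG_CONSISTENT;
    }

    /* Gate 2: consistency plus a usable snapshot of running transactions. */
    if (r->standby_state == TOY_STANDBY_SNAPSHOT_READY &&
        !r->hot_standby_active && r->reached_consistency) {
        r->hot_standby_active = true;
        sent |= TOY_SIG_BEGIN_HOT_STANDBY;
    }
    return sent;
}
```

Because each gate flips a flag when it fires, calling the function after every replayed record is cheap and each signal is sent at most once.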
Figure 2 — Recovery state machine:
```mermaid
stateDiagram-v2
    [*] --> InitWalRecovery
    InitWalRecovery --> CrashRecovery : no signal file
    InitWalRecovery --> ArchiveRecovery : recovery.signal
    InitWalRecovery --> StandbyMode : standby.signal
    CrashRecovery --> PerformWalRecovery
    ArchiveRecovery --> PerformWalRecovery
    StandbyMode --> PerformWalRecovery
    PerformWalRecovery --> ConsistencyReached : minRecoveryPoint passed
    ConsistencyReached --> HotStandbyActive : standbyState SNAPSHOT_READY
    ConsistencyReached --> PITRTarget : recovery target hit
    PerformWalRecovery --> EndOfWAL : no more records
    PITRTarget --> FinishWalRecovery
    EndOfWAL --> FinishWalRecovery
    HotStandbyActive --> PromotionTriggered : promote signal
    PromotionTriggered --> FinishWalRecovery
    FinishWalRecovery --> [*]
```
Figure 2 — The three entry paths (crash, archive, standby) all feed
into the single PerformWalRecovery loop. Hot standby remains in the
loop; PITR and end-of-WAL exit to FinishWalRecovery. Promotion of a
standby also exits via FinishWalRecovery and then starts writing
new WAL on the next timeline.
WAL prefetching (XLogPrefetcher)
Replay throughput is limited by I/O when the working set exceeds the
buffer pool. PostgreSQL 15 introduced XLogPrefetcher, a thin wrapper
around XLogReaderState that decodes WAL records ahead of the replay
position and issues PrefetchSharedBuffer calls (which end in an
advisory read hint such as posix_fadvise(POSIX_FADV_WILLNEED) where
the platform supports one) for the pages those records will touch.
```c
// XLogPrefetcherAllocate (src/backend/access/transam/xlogprefetcher.c)
XLogPrefetcher *
XLogPrefetcherAllocate(XLogReaderState *reader)
{
    XLogPrefetcher *prefetcher = palloc0(sizeof(XLogPrefetcher));

    prefetcher->reader = reader;
    // filter_table: relations whose blocks must not be prefetched yet,
    // for example because their files may not exist on disk
    prefetcher->filter_table = hash_create("XLogPrefetcherFilterTable", 1024, ...);
    dlist_init(&prefetcher->filter_queue);
    prefetcher->reconfigure_count = XLogPrefetchReconfigureCount - 1;
    return prefetcher;
}
```
The prefetcher maintains an LsnReadQueue, a ring of pending I/O
slots indexed by LSN. XLogPrefetcherNextBlock scans ahead in the
decoded record queue, picks the next block reference that is not
filtered out (e.g., blocks in a database being created by a
file-copy CREATE DATABASE), and calls PrefetchSharedBuffer, which
also reports whether the block is already in the buffer pool.
```c
// XLogPrefetcherNextBlock, condensed (src/backend/access/transam/xlogprefetcher.c)
static LsnReadQueueNextStatus
XLogPrefetcherNextBlock(uintptr_t pgsr_private, XLogRecPtr *lsn)
{
    // ... condensed ...
    record = XLogReadAhead(prefetcher->reader, nonblocking);
    // check filter_table: skip blocks of filtered relations
    // call PrefetchSharedBuffer for main-fork blocks
    // suppress readahead across TLI-switch checkpoints:
    //     prefetcher->no_readahead_until = record->lsn;
    // return LRQ_NEXT_IO / LRQ_NEXT_NO_IO / LRQ_NEXT_AGAIN
}
```
The wal_decode_buffer_size GUC controls how far ahead the decoder
can run. The recovery_prefetch GUC (default try) enables
prefetching; setting it to off disables the lookahead but keeps
the decode-queue mechanism. Statistics are exposed via
pg_stat_recovery_prefetch (populated by
XLogPrefetcherComputeStats).
Two important suppression rules prevent incorrect prefetching:
- TLI-switch suppression. When the decoder encounters an XLOG_CHECKPOINT_SHUTDOWN or XLOG_END_OF_RECOVERY record that may carry a timeline switch, no_readahead_until is set to that record’s LSN. Readahead past a TLI switch could prefetch from the wrong segment file.
- Relation-creation filter. When an XLOG_DBASE_CREATE_FILE_COPY record is seen, all blocks in that database are added to filter_table until the record is replayed, preventing ENOENT errors from prefetching into a database that does not yet exist on disk.
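The two rules can be pictured as a decision made for each block reference found during lookahead. The model below is deliberately simplified and is not how xlogprefetcher.c is structured: the struct, the single filtered database, and the exact comparison against the replay position are all assumptions made for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

typedef enum {
    TOY_PREFETCH,
    TOY_SKIP_FILTERED,
    TOY_WAIT_FOR_REPLAY
} toy_decision;

typedef struct {
    uint64_t no_readahead_until;  /* barrier LSN; 0 means no barrier */
    uint32_t filtered_db;         /* database being created; 0 means none */
    uint64_t filter_until;        /* LSN of the record that creates it */
} toy_prefetcher;

toy_decision
toy_next_block(const toy_prefetcher *p, uint64_t replayed_lsn,
               uint64_t rec_lsn, uint32_t db)
{
    /* Barrier: do not look past a possible timeline switch until replay reaches it. */
    if (p->no_readahead_until != 0 &&
        replayed_lsn < p->no_readahead_until &&
        rec_lsn > p->no_readahead_until)
        return TOY_WAIT_FOR_REPLAY;

    /* Filter: the database's files may not exist until its creation is replayed. */
    if (p->filtered_db != 0 && db == p->filtered_db &&
        replayed_lsn < p->filter_until)
        return TOY_SKIP_FILTERED;

    return TOY_PREFETCH;
}
```

Both rules key off the replay position: once replay has passed the barrier or the creation record, the same block reference that was held back becomes eligible for prefetching.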
Source Walkthrough
InitWalRecovery and mode selection
- XLogRecoveryShmemInit: allocates XLogRecoveryCtlData in shared memory; called from CreateSharedMemoryAndSemaphores.
- InitWalRecovery: top-level entry point from StartupXLOG; reads pg_control, detects signal files, establishes RedoStartLSN and CheckPointLoc, allocates xlogreader and xlogprefetcher.
- readRecoverySignalFile: checks for standby.signal and recovery.signal in $PGDATA; sets StandbyModeRequested, ArchiveRecoveryRequested; rejects a leftover recovery.conf.
- validateRecoveryParameters: ensures recovery_target* GUCs are consistent with the chosen mode.
- read_backup_label: reads backup_label if present; provides CheckPointLoc and RedoStartLSN for base-backup restores.
- EnableStandbyMode: sets StandbyMode = true and initialises the hot-standby machinery.
PerformWalRecovery and the replay loop
- PerformWalRecovery: main redo loop; initializes XLogRecoveryCtlData watermarks; calls RmgrStartup; runs the ReadRecord / ApplyWalRecord loop.
- ApplyWalRecord: per-record dispatch: advances nextXid, detects TLI switches, updates replayEndRecPtr, calls RecordKnownAssignedTransactionIds for hot standby, dispatches to rm_redo, verifies FPI consistency if XLR_CHECK_CONSISTENCY is set, updates lastReplayedEndRecPtr.
- ReadRecord: thin wrapper over XLogPrefetcherReadRecord, which ultimately calls XLogReaderValidatePageHeader and DecodeXLogRecord.
- recoveryStopsBefore / recoveryStopsAfter: evaluate the five RecoveryTargetType variants against the current record.
- recoveryApplyDelay: implements recovery_min_apply_delay; waits on recoveryWakeupLatch.
- xlogrecovery_redo: handles RM_XLOG_ID records that are interpreted by the recovery machinery itself (e.g., XLOG_BACKUP_END).
- CheckRecoveryConsistency: checked after every record; signals PMSIGNAL_RECOVERY_CONSISTENT and PMSIGNAL_BEGIN_HOT_STANDBY.
- RmgrStartup / RmgrCleanup: call each rmgr’s rm_startup / rm_cleanup at the beginning / end of redo.
FinishWalRecovery
- FinishWalRecovery: shuts down the WAL receiver, locates the end of the last valid record, returns EndOfWalRecoveryInfo to StartupXLOG.
- ShutdownWalRecovery: frees the xlogprefetcher and xlogreader.
Timeline management
- readTimeLineHistory: parses <tli>.history into a List * of TimeLineHistoryEntry.
- writeTimeLineHistory: appends a new .history entry at promotion; archives it if archive_mode is set.
- tliSwitchPoint: returns the LSN at which a given TLI diverges from another in a history list.
- tliOfPointInHistory: given a List * and an LSN, returns the TLI that was active at that LSN.
- existsTimeLineHistory: checks whether a .history file exists (via archive or pg_wal/).
- checkTimeLineSwitch: validates that a TLI switch encountered during redo is consistent with the expected history chain.
WAL prefetcher
- XLogPrefetcherAllocate / XLogPrefetcherFree: lifecycle.
- XLogPrefetcherNextBlock: callback invoked by LsnReadQueue for each next decoded block reference; issues PrefetchSharedBuffer.
- XLogPrefetcherComputeStats: updates SharedStats fields read by pg_stat_recovery_prefetch.
- XLogPrefetcherAddFilter / XLogPrefetcherIsFiltered / XLogPrefetcherCompleteFilters: filter-table management.
- lrq_alloc / lrq_prefetch / lrq_complete_lsn: LsnReadQueue ring-buffer operations.
Position hints (commit 273fe94, 2026-06-05)
| Symbol | File | Line |
|---|---|---|
| XLogRecoveryCtlData | xlogrecovery.c | 311 |
| XLogRecoveryShmemInit | xlogrecovery.c | 465 |
| InitWalRecovery | xlogrecovery.c | 519 |
| readRecoverySignalFile | xlogrecovery.c | 1046 |
| PerformWalRecovery | xlogrecovery.c | 1671 |
| ApplyWalRecord | xlogrecovery.c | 1928 |
| FinishWalRecovery | xlogrecovery.c | 1477 |
| CheckRecoveryConsistency | xlogrecovery.c | 2196 |
| xlogrecovery_redo | xlogrecovery.c | 2092 |
| rm_redo_error_callback | xlogrecovery.c | 2297 |
| XLogPrefetcherAllocate | xlogprefetcher.c | 362 |
| XLogPrefetcherNextBlock | xlogprefetcher.c | 459 |
| XLogPrefetcherComputeStats | xlogprefetcher.c | 410 |
| readTimeLineHistory | timeline.c | 76 |
| writeTimeLineHistory | timeline.c | 304 |
| tliSwitchPoint | timeline.c | 572 |
| checkTimeLineSwitch | xlogrecovery.c | 2399 |
| RecoveryTargetType | xlogrecovery.h | 23 |
| EndOfWalRecoveryInfo | xlogrecovery.h | 91 |
| RecoveryPauseState | xlogrecovery.h | 44 |
Source verification (as of 2026-06-05)
Verified facts
- InitWalRecovery, PerformWalRecovery, and FinishWalRecovery are the three-function API exported by xlogrecovery.c. Verified by reading the function signatures at lines 519, 1671, and 1477 respectively, and the matching declarations in xlogrecovery.h.
- XLogRecoveryCtlData is allocated in shared memory, not on the stack. Verified: XLogRecoveryShmemInit calls ShmemInitStruct at line 469. The two watermarks lastReplayedEndRecPtr and replayEndRecPtr are distinct; replayEndRecPtr is updated before rm_redo is entered (line 1992) and lastReplayedEndRecPtr after it returns (line 2030).
- ApplyWalRecord advances TransamVariables->nextXid past the record’s XID before calling rm_redo. Verified at line 1942: AdvanceNextFullTransactionIdPastXid(record->xl_xid) runs unconditionally before the dispatch block.
- The idempotency check is inside each rmgr’s rm_redo callback, not in ApplyWalRecord. Verified by reading ApplyWalRecord in full: there is no PageLSN >= record-LSN comparison in xlogrecovery.c. The pattern is in XLogReadBufferForRedo in xlogutils.c, called by every heap/index rmgr redo handler.
- CheckRecoveryConsistency sends two distinct signals. Verified at lines 2267 (PMSIGNAL_RECOVERY_CONSISTENT) and 2289 (PMSIGNAL_BEGIN_HOT_STANDBY). Hot-standby activation requires both reachedConsistency and standbyState == STANDBY_SNAPSHOT_READY.
- Timeline 1 has no .history file. Verified in readTimeLineHistory at line 88: the function returns a synthetic single-entry list for targetTLI == 1 without attempting a file read.
- The WAL prefetcher suppresses readahead past TLI-switch checkpoints. Verified in XLogPrefetcherNextBlock at lines 537–553: XLOG_CHECKPOINT_SHUTDOWN and XLOG_END_OF_RECOVERY both set prefetcher->no_readahead_until = record->lsn.
- recovery.conf is explicitly rejected. Verified in readRecoverySignalFile (line 1046): if recovery.conf exists in $PGDATA, the startup process logs a FATAL error reporting that the recovery command file is not supported.
Open questions
- Parallel redo. The current redo loop is single-threaded inside the startup process. There is no parallel redo path in PG18 for crash recovery (unlike logical replication’s parallel apply). The investigation path: search for ParallelApply patterns in xlogrecovery.c and check the mailing list archives for parallel redo proposals.
- XLR_CHECK_CONSISTENCY trigger conditions. ApplyWalRecord calls verifyBackupPageConsistency when XLR_CHECK_CONSISTENCY is set in xl_info. The exact conditions under which a WAL record gets this flag set during normal operation (vs. only when the wal_consistency_checking GUC is active) are not fully traced. Investigation path: grep for XLR_CHECK_CONSISTENCY across xloginsert.c and rmgr source files.
- LsnReadQueue size and memory pressure. The LsnReadQueue ring (lrq_alloc) is sized when the prefetcher is reconfigured (from maintenance_io_concurrency, as far as the source indicates), while wal_decode_buffer_size bounds how far the decoder can run ahead of replay. The interaction between decode-buffer size, prefetch distance, and shared-buffer hit rate during recovery has not been profiled in the source comments. Investigation path: read XLogPrefetchReconfigure and the pg_stat_recovery_prefetch columns prefetch, hit, skip_init, skip_new, skip_fpw, and skip_rep.
Beyond PostgreSQL — Comparative Designs & Research Frontiers
- ARIES undo log vs. PostgreSQL’s no-undo approach. ARIES prescribes a full undo pass after redo, rolling back every loser transaction with compensation log records. PostgreSQL avoids a separate undo log by relying on MVCC: old row versions remain in heap pages, and VACUUM reclaims them asynchronously. Crash recovery merely replays abort records; there is no per-row undo traversal. The trade-off is that PostgreSQL must run VACUUM to reclaim space that an undo-based engine (Oracle, InnoDB) reclaims immediately at rollback. A comparative study: knowledge/research/dbms-papers/aries.md.
- Parallel redo. MySQL InnoDB and MariaDB introduced parallel redo (multi-threaded apply) to exploit multi-core hardware during crash recovery. PostgreSQL’s redo is currently single-threaded in the startup process. The XLogPrefetcher partially compensates by hiding I/O latency, but CPU-bound redo (e.g., index rebuilds) does not benefit. Parallel redo is an active area of PG development.
- Logical replication’s parallel apply. PG16 introduced parallel apply for logical replication (max_parallel_apply_workers_per_subscription), which is structurally analogous to parallel redo at the logical level. Its design (apply workers coordinated by a leader that tracks XID dependencies) could inform a future physical redo parallelization.
- Incremental / page-level checksums during recovery. The WAL summarization feature (PG17) produces per-block WAL summaries that support incremental base backups (pg_basebackup --incremental). The summarizer (walsummarizer.c) reads WAL during recovery and normal operation; its interaction with the recovery replay position is relevant context for postgres-archiving-walsummary.md.
- Instant recovery (Redo-only designs). Shasank Chavan et al., “Oracle Database In-Memory” and related work on “redo-only” engines (no undo log at all, relying entirely on MVCC) are a useful contrast to PostgreSQL’s hybrid. Closer to PostgreSQL: the discussion in Hellerstein et al. 2007 (Architecture of a DB System, dbms-papers/fntdb07-architecture.md) §“Storage and Buffer Management” frames the steal/no-force design space and the recovery cost of each choice.
Sources
Raw sources
- (no raw inbox files; doc synthesized directly from the source tree)
Source code
- src/backend/access/transam/xlogrecovery.c: recovery state machine, InitWalRecovery, PerformWalRecovery, FinishWalRecovery, ApplyWalRecord, CheckRecoveryConsistency
- src/backend/access/transam/xlogprefetcher.c: XLogPrefetcher, LsnReadQueue, XLogPrefetcherNextBlock
- src/backend/access/transam/timeline.c: timeline history read/write/query
- src/include/access/xlogrecovery.h: RecoveryTargetType, EndOfWalRecoveryInfo, RecoveryPauseState
- src/include/access/timeline.h: TimeLineHistoryEntry
- src/backend/access/transam/README: in-tree design notes for the transam subsystem
Research anchors
- Mohan et al. 1992, ARIES: A Transaction Recovery Method. Captured at knowledge/research/dbms-papers/aries.md
- Petrov 2019, Database Internals, ch. 5: WAL, recovery, LSN model
- Silberschatz et al. 2020, Database System Concepts, 7e, ch. 19: recovery theory
- Hellerstein et al. 2007, Architecture of a Database System. Captured at knowledge/research/dbms-papers/fntdb07-architecture.md