PostgreSQL Recovery & Redo — Crash Recovery, PITR, and Hot Standby

Recovery is the answer to the question: after a crash, how does the engine get back to a state in which every committed transaction is visible and every uncommitted transaction is invisible? The canonical answer is ARIES (Mohan et al., ARIES: A Transaction Recovery Method Supporting Fine-Granularity Locking and Partial Rollbacks Using Write-Ahead Logging, ACM TODS 1992; captured at knowledge/research/dbms-papers/aries.md). PostgreSQL’s recovery machinery is a direct, faithful instantiation of ARIES’s three principles:

  1. Write-ahead logging. Every change to a data page is first recorded in the WAL. The WAL record at LSN L must reach stable storage before the data page whose PageLSN is L is written. This invariant — enforced by XLogFlush in the buffer manager — makes both the steal and no-force policies safe: a dirty page can leave the cache before its transaction commits (steal), and a committed transaction’s pages need not be flushed at commit time (no-force), because the log alone is sufficient to redo or undo either case.

  2. Repeating history during redo. On restart, every WAL record since the last checkpoint is replayed in LSN order — including records belonging to transactions that ultimately aborted — to reconstruct the exact page state at the moment of the crash. Only after this “repeat history” pass does the engine roll back uncommitted transactions. PostgreSQL takes this principle literally: PerformWalRecovery replays forward from the checkpoint’s redo pointer without skipping any record type, letting each rmgr’s rm_redo callback handle its own record in exactly the same way it would during normal operation.

  3. Logging undo actions (compensation log records). In ARIES, when a transaction is rolled back, the undo actions themselves are logged as compensation log records so that a crash during undo does not lose progress. PostgreSQL departs from ARIES here: it writes no compensation records and keeps no undo log. A rollback writes an abort record (replayed by xact_redo during recovery), and MVCC visibility rules hide the aborted transaction’s tuple versions until VACUUM removes them.

The key data structure tying together all three principles is the LSN (Log Sequence Number) — a 64-bit byte offset into the WAL stream. Every heap and index page carries a PageLSN field in its header (see postgres-xlog-wal.md and postgres-page-layout.md). Two comparisons do all the work:

  • WAL rule: a data page may be written to disk only when flushedLSN >= PageLSN(page). Enforced in FlushBuffer.
  • Idempotent redo: a WAL record at LSN L is skipped if PageLSN(page) >= L. This makes crash-restart safe to retry.

Beyond pure crash recovery, PostgreSQL generalizes the replay loop to two further modes. Point-in-Time Recovery (PITR) stops replay at a user-specified target — an XID, a timestamp, a named restore point, or an LSN — rather than at end-of-WAL. Hot standby keeps replay running indefinitely, consuming WAL from a primary via streaming replication or archive fetch, and opens read-only query connections once the database reaches a consistent state. All three modes share the same PerformWalRecovery loop; the differences lie in where WAL comes from and when the loop stops.

Database Internals (Petrov, ch. 5, “Transaction Processing and Recovery”) frames the design space around two axes that shape the implementation:

  1. Granularity of logging — physical (byte ranges), logical (operations), or physiological (page-scoped operations). PostgreSQL is physiological: each rmgr describes a page edit in terms its handler understands, not raw byte diffs. This makes redo handlers simpler to write and makes FPI (full-page image) redundancy straightforward.

  2. Redo-only vs. undo+redo — does the engine need a separate undo pass, or is the WAL self-contained? PostgreSQL uses redo-only crash recovery (the “repeat history” principle), with undo handled by replaying abort records rather than a separate undo log.

The textbook gives the model; the following patterns are the engineering conventions that nearly every ARIES-family engine — Oracle, InnoDB, SQL Server, DB2, and PostgreSQL — converges on in some form.

Three-phase recovery: analysis, redo, undo

ARIES prescribes three passes over the log on restart:

  1. Analysis phase — scan forward from the last checkpoint to reconstruct the dirty-page table (which pages were modified and not yet flushed) and the active transaction table (which transactions were in flight at the crash). This pass determines the redo start point (the oldest dirty-page LSN) and the undo set (transactions that never committed).
  2. Redo phase — replay every log record from the redo start point, applying changes to pages that need it (those whose PageLSN is below the record LSN). After this pass the buffer pool reflects exactly what it held at the crash.
  3. Undo phase — roll back every transaction that was active at the crash, logging compensation records for each undo action.
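
The two outputs of the analysis phase can be sketched as follows. This is a textbook-style illustration under simplifying assumptions, not PostgreSQL code: DirtyPage, ActiveXact, redo_start_point, and collect_losers are hypothetical names, and both tables are assumed to be already reconstructed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t Lsn;

/* Dirty-page table entry: rec_lsn is the first LSN that dirtied the page. */
typedef struct DirtyPage { uint32_t page_id; Lsn rec_lsn; } DirtyPage;

/* Active-transaction table entry as left by the analysis pass. */
typedef struct ActiveXact { uint32_t xid; int committed; } ActiveXact;

/*
 * Redo starts at the oldest rec_lsn in the dirty-page table. With no dirty
 * pages there is nothing to redo, so the caller-supplied end_of_log is returned.
 */
Lsn
redo_start_point(const DirtyPage *dpt, size_t n, Lsn end_of_log)
{
    Lsn start = end_of_log;

    for (size_t i = 0; i < n; i++)
        if (dpt[i].rec_lsn < start)
            start = dpt[i].rec_lsn;
    return start;
}

/* Undo set: copy the XIDs of transactions that never committed into out. */
size_t
collect_losers(const ActiveXact *att, size_t n, uint32_t *out)
{
    size_t count = 0;

    assert(out != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        if (!att[i].committed)
            out[count++] = att[i].xid;
    return count;
}
```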

PostgreSQL simplifies the analysis phase: because the WAL contains explicit checkpoint records (with a redo pointer and a list of XIDs in flight at checkpoint time), the engine can derive the redo start point and the active-transaction set directly from the checkpoint record rather than rebuilding them with a separate analysis scan of the log. The undo phase is implicit: aborting transactions are replayed as their ABORT records arrive in the redo phase, and any transactions still active after full replay are detected and rolled back by post-recovery cleanup.

Production engines distinguish crash recovery (automatic, no configuration) from archive recovery / PITR (operator-initiated, requires a restore command) from standby mode (long-running, continuously consuming WAL). The standard mechanism is a signal file: the presence of recovery.signal or standby.signal in $PGDATA selects the mode, and the absence of both leaves the engine in crash recovery mode. GUC parameters (recovery_target, primary_conninfo, restore_command) are the external knobs; the signal file is the mode selector.
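
The mode selection reduces to a small decision function. The sketch below is illustrative only: RecoveryMode and select_recovery_mode are hypothetical names, and the two booleans stand for "the file exists in $PGDATA". It assumes the documented PostgreSQL behavior that standby.signal takes precedence when both files are present.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum RecoveryMode
{
    MODE_CRASH_RECOVERY,    /* neither signal file present */
    MODE_ARCHIVE_RECOVERY,  /* recovery.signal present */
    MODE_STANDBY            /* standby.signal present */
} RecoveryMode;

/* Map signal-file presence to a recovery mode. */
RecoveryMode
select_recovery_mode(bool have_standby_signal, bool have_recovery_signal)
{
    if (have_standby_signal)
        return MODE_STANDBY;            /* wins when both files exist */
    if (have_recovery_signal)
        return MODE_ARCHIVE_RECOVERY;
    return MODE_CRASH_RECOVERY;
}

/* Human-readable label, e.g. for a startup log line. */
const char *
recovery_mode_name(RecoveryMode mode)
{
    switch (mode)
    {
        case MODE_CRASH_RECOVERY:   return "crash recovery";
        case MODE_ARCHIVE_RECOVERY: return "archive recovery";
        case MODE_STANDBY:          return "standby";
    }
    assert(!"unknown recovery mode");
    return "unknown";
}
```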

When a standby is promoted or a PITR target is reached and the engine starts writing new WAL, the new WAL stream must be distinguishable from the old stream at the same LSN positions — because another replica may have already consumed the old stream up to the promotion point, and the two streams diverge from there. The universal solution is a timeline ID: a monotonically increasing integer prepended to WAL segment filenames. Each promotion increments the timeline counter and writes a .history file recording the LSN at which this timeline branched from its parent. Recovery consults the history chain to identify which timeline’s WAL to consume for any given LSN range.
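
As an illustration of how the timeline ID ends up in file names, the sketch below derives a WAL segment name and a history file name. It mirrors the naming scheme PostgreSQL uses (three 8-digit hex fields for segments, an 8-digit hex TLI plus ".history" for history files), but the function names are hypothetical and the 16 MB segment size is an assumption (it is the default and can be changed at initdb time).

```c
#include <assert.h>
#include <inttypes.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define WAL_SEGMENT_SIZE    (UINT64_C(16) * 1024 * 1024)   /* assumed default */
#define SEGMENTS_PER_XLOGID (UINT64_C(0x100000000) / WAL_SEGMENT_SIZE)

/* Build the 24-character segment file name that contains the given LSN. */
void
wal_segment_name(char *buf, size_t buflen, uint32_t tli, uint64_t lsn)
{
    uint64_t segno = lsn / WAL_SEGMENT_SIZE;

    assert(buflen >= 25);
    snprintf(buf, buflen, "%08" PRIX32 "%08" PRIX32 "%08" PRIX32,
             tli,
             (uint32_t) (segno / SEGMENTS_PER_XLOGID),
             (uint32_t) (segno % SEGMENTS_PER_XLOGID));
}

/* Build the history file name for a timeline. */
void
timeline_history_name(char *buf, size_t buflen, uint32_t tli)
{
    assert(buflen >= 17);
    snprintf(buf, buflen, "%08" PRIX32 ".history", tli);
}

/* True if name looks like a timeline history file name. */
bool
is_timeline_history_name(const char *name)
{
    return strlen(name) == 8 + strlen(".history") &&
           strspn(name, "0123456789ABCDEF") == 8 &&
           strcmp(name + 8, ".history") == 0;
}
```

Because the TLI is the leading field, segments from two timelines that cover the same LSN range get different file names and can coexist in pg_wal/ or the archive.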

A standby cannot serve queries until it has reached a state where every visible tuple’s inserting transaction is known to be committed, and the minimum recovery point (the LSN recorded in pg_control that replay must reach before the data directory is consistent) has been passed. Engines that support read queries during recovery track this “consistency point” and signal it to connection-accepting components.

Replay is I/O bound when the working set exceeds the buffer pool: every rm_redo call that needs a page not already in the pool must wait for a synchronous read. The standard mitigation is WAL prefetching: a separate thread or coroutine scans ahead in the decoded WAL stream, identifies pages that will be needed soon, and issues non-blocking readahead(2) or posix_fadvise(POSIX_FADV_WILLNEED) hints to prime the kernel’s page cache before the redo loop reaches those records.
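
The hint itself is a single system call. The following sketch shows the posix_fadvise form for one block of a relation file; block_offset and prefetch_block are hypothetical helpers, TOY_BLCKSZ assumes PostgreSQL's default 8 kB block size, and the target platform is assumed to provide posix_fadvise.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L     /* expose posix_fadvise under strict modes */
#endif

#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <sys/types.h>

#define TOY_BLCKSZ 8192             /* assumed block size (PostgreSQL default) */

/* Byte offset of a block within its file. */
off_t
block_offset(uint32_t blkno)
{
    return (off_t) blkno * TOY_BLCKSZ;
}

/*
 * Ask the kernel to start reading one block into the page cache.
 * The call does not block on the read itself. Returns 0 on success
 * or an errno value, as posix_fadvise does.
 */
int
prefetch_block(int fd, uint32_t blkno)
{
    assert(fd >= 0);
    return posix_fadvise(fd, block_offset(blkno), TOY_BLCKSZ,
                         POSIX_FADV_WILLNEED);
}
```

A redo loop that calls prefetch_block for the blocks of records several positions ahead gives the kernel time to complete the read before the synchronous read for that block is issued.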

| Theory concept | PostgreSQL name |
| --- | --- |
| Checkpoint redo pointer | CheckPoint.redo (in CheckPoint struct) |
| Dirty-page table / redo start | RedoStartLSN (from checkpoint record) |
| Active-transaction set at crash | in-flight XIDs in CheckPoint struct |
| Analysis phase | implicit; read from checkpoint record in InitWalRecovery |
| Redo phase | PerformWalRecovery / ApplyWalRecord loop |
| Undo phase | replay of ABORT records + post-recovery RecoverPreparedTransactions |
| Compensation log record | XLOG_XACT_ABORT and undo-logging in xact_redo |
| LSN (byte offset) | XLogRecPtr (64-bit, pg_lsn type for SQL) |
| Page LSN | PageHeaderData.pd_lsn (first 8 bytes of every page) |
| Timeline ID | TimeLineID (uint32; incremented on every promotion) |
| Timeline history file | <tli>.history in pg_wal/ |
| Consistency point | minRecoveryPoint in pg_control; reachedConsistency flag |
| Hot standby activation | PMSIGNAL_BEGIN_HOT_STANDBY from CheckRecoveryConsistency |
| WAL prefetch hint | XLogPrefetcherNextBlock → PrefetchSharedBuffer |

PostgreSQL’s recovery machinery is concentrated in a single source file, xlogrecovery.c, and runs entirely inside the startup process — the special backend type launched by postmaster on every restart. Three functions divide the work cleanly:

// InitWalRecovery — src/backend/access/transam/xlogrecovery.c
void InitWalRecovery(ControlFileData *ControlFile,
                     bool *wasShutdown_ptr,
                     bool *haveBackupLabel_ptr,
                     bool *haveTblspcMap_ptr)

InitWalRecovery reads pg_control, calls readRecoverySignalFile() to detect recovery.signal or standby.signal, and calls validateRecoveryParameters() to check that the GUC recovery target settings are consistent with the chosen mode. It allocates the XLogReaderState and wraps it in an XLogPrefetcher. The checkpoint record is read from pg_control (or from backup_label for base restores), and RedoStartLSN is established from the checkpoint’s redo field. The function returns with the system positioned at the first WAL record to replay.

// PerformWalRecovery — src/backend/access/transam/xlogrecovery.c
void PerformWalRecovery(void)

PerformWalRecovery is the main loop. It initializes the shared XLogRecoveryCtlData watermarks, calls RmgrStartup() to run each resource manager’s rm_startup hook, then enters the do { ... } while (record != NULL) replay loop. Each iteration checks recoveryStopsBefore / recoveryStopsAfter for PITR target hits, calls ApplyWalRecord, and then calls ReadRecord (via the prefetcher) to fetch the next decoded record. The loop exits when there is no more WAL or a recovery target is reached.

// FinishWalRecovery — src/backend/access/transam/xlogrecovery.c
EndOfWalRecoveryInfo *FinishWalRecovery(void)

FinishWalRecovery shuts down the WAL receiver, determines the end of the last valid record (endOfLog), and returns an EndOfWalRecoveryInfo struct that the caller (StartupXLOG in xlog.c) uses to initialize the WAL for writing. Once StartupXLOG has completed that initialization, the engine can accept new writes.

The shared control block: XLogRecoveryCtlData

All three functions communicate through a fixed shared-memory struct allocated in XLogRecoveryShmemInit:

// XLogRecoveryCtlData — src/backend/access/transam/xlogrecovery.c
typedef struct XLogRecoveryCtlData
{
    bool        SharedHotStandbyActive;
    bool        SharedPromoteIsTriggered;
    Latch       recoveryWakeupLatch;
    XLogRecPtr  lastReplayedReadRecPtr; /* start of last replayed record */
    XLogRecPtr  lastReplayedEndRecPtr;  /* end+1 of last replayed record */
    TimeLineID  lastReplayedTLI;
    /* during rm_redo call: end+1 of record being replayed */
    XLogRecPtr  replayEndRecPtr;
    TimeLineID  replayEndTLI;
    TimestampTz recoveryLastXTime;
    TimestampTz currentChunkStartTime;
    RecoveryPauseState recoveryPauseState;
    ConditionVariable recoveryNotPausedCV;
    slock_t     info_lck;
} XLogRecoveryCtlData;

The two watermarks lastReplayedEndRecPtr (updated after a record is successfully replayed) and replayEndRecPtr (updated before the rm_redo call begins) serve different consumers. replayEndRecPtr is read by XLogFlush so that minRecoveryPoint is updated correctly even mid-record. lastReplayedEndRecPtr is the value reported by GetXLogReplayRecPtr and used by CheckRecoveryConsistency to decide when to open hot-standby connections.

The recoveryPauseState / recoveryNotPausedCV pair implements the pg_wal_replay_pause() / pg_wal_replay_resume() SQL functions by suspending PerformWalRecovery mid-loop in a wait on the recoveryNotPausedCV condition variable.

// ApplyWalRecord — src/backend/access/transam/xlogrecovery.c
static void
ApplyWalRecord(XLogReaderState *xlogreader, XLogRecord *record,
               TimeLineID *replayTLI)
{
    // ... condensed ...
    AdvanceNextFullTransactionIdPastXid(record->xl_xid);

    /* Check for timeline switch */
    if (record->xl_rmid == RM_XLOG_ID)
    {
        // detect XLOG_CHECKPOINT_SHUTDOWN / XLOG_END_OF_RECOVERY
        // update *replayTLI if newReplayTLI != *replayTLI
    }

    /* Update replayEndRecPtr BEFORE calling rm_redo */
    XLogRecoveryCtl->replayEndRecPtr = xlogreader->EndRecPtr;

    /* Track in-flight XIDs for hot standby conflict resolution */
    if (standbyState >= STANDBY_INITIALIZED && TransactionIdIsValid(record->xl_xid))
        RecordKnownAssignedTransactionIds(record->xl_xid);

    /* Dispatch: recovery-specific handling first, then the rmgr's own redo */
    if (record->xl_rmid == RM_XLOG_ID)
        xlogrecovery_redo(xlogreader, *replayTLI);
    GetRmgr(record->xl_rmid).rm_redo(xlogreader);

    /* Consistency check after FPI records */
    if ((record->xl_info & XLR_CHECK_CONSISTENCY) != 0)
        verifyBackupPageConsistency(xlogreader);

    /* Update lastReplayedEndRecPtr AFTER rm_redo returns */
    XLogRecoveryCtl->lastReplayedEndRecPtr = xlogreader->EndRecPtr;
    // ... condensed ...
}

The idempotency invariant is enforced inside each rmgr’s rm_redo callback, not in ApplyWalRecord itself. The standard pattern, used by every heap and index rmgr, is to call XLogReadBufferForRedo, which compares the page’s PageLSN to the record’s LSN and returns BLK_NEEDS_REDO only if the page requires the update. Records that rebuild a page from scratch call XLogInitBufferForRedo instead, which returns a zeroed buffer and therefore needs no comparison. This means replay is safe to restart from any checkpoint: any record whose effect is already on the page is silently skipped.

AdvanceNextFullTransactionIdPastXid ensures that TransamVariables->nextXid is always beyond any XID seen in a replayed record, so that new transactions assigned after recovery cannot collide with replayed ones.

// readTimeLineHistory — src/backend/access/transam/timeline.c
List *readTimeLineHistory(TimeLineID targetTLI)

A timeline history file <tli>.history in pg_wal/ contains one line per ancestor timeline of the form <tli>\t<switchpoint>. Timeline 1 has no history file (it is the root). readTimeLineHistory parses this file into a List * of TimeLineHistoryEntry items; the caller (typically InitWalRecovery) uses tliSwitchPoint to find the LSN at which the target timeline diverged from any given ancestor.

// writeTimeLineHistory — src/backend/access/transam/timeline.c
void writeTimeLineHistory(TimeLineID newTLI, TimeLineID parentTLI,
XLogRecPtr switchpoint, char *reason)

writeTimeLineHistory is called at promotion time. It appends one line to the new .history file recording (parentTLI, switchpoint). Standbys and PITR nodes consult this history to decide which segment files belong to which timeline: a segment file for TLI N at position P is authoritative only up to the LSN where TLI N switched to TLI N+1.

Figure 1 — Timeline fork and history chain:

flowchart LR
    TL1["TLI 1<br/>WAL 0/1 → 0/A0"]
    TL2["TLI 2<br/>WAL 0/A0 → 0/C0"]
    TL3["TLI 3<br/>WAL 0/B0 → ..."]
    HF2["2.history\n1\t0/A0"]
    HF3["3.history\n1\t0/A0\n2\t0/B0"]
    TL1 -->|"promote at 0/A0"| TL2
    TL2 -->|"promote at 0/B0"| TL3
    HF2 -. "read by replica".- TL2
    HF3 -. "read by replica".- TL3

Figure 1 — Each promotion increments the TLI counter and writes a .history file that records the fork LSN. A replica reading TLI-3 WAL chains through both history files to discover that it should read TLI-1 WAL up to 0/A0, TLI-2 WAL from 0/A0 to 0/B0, and TLI-3 WAL from 0/B0 onward.
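
The lookup that Figure 1 describes can be sketched as a scan over half-open LSN ranges. This is a simplified model in the spirit of tliOfPointInHistory: TimelineEntry and timeline_of_lsn are hypothetical names, a plain array stands in for the List, and 0 is used to mean "unbounded" on either side of a range.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One timeline's span: WAL in [begin, end) belongs to tli; 0 = unbounded. */
typedef struct TimelineEntry
{
    uint32_t tli;
    uint64_t begin;
    uint64_t end;
} TimelineEntry;

/* Return the timeline that owns lsn, or 0 if the history does not cover it. */
uint32_t
timeline_of_lsn(const TimelineEntry *history, size_t n, uint64_t lsn)
{
    assert(history != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
    {
        const TimelineEntry *e = &history[i];

        if ((e->begin == 0 || e->begin <= lsn) &&
            (e->end == 0 || lsn < e->end))
            return e->tli;
    }
    return 0;
}
```

With the chain from Figure 1 (TLI 1 up to 0/A0, TLI 2 from 0/A0 to 0/B0, TLI 3 from 0/B0 onward), an LSN exactly at a switchpoint resolves to the newer timeline because each range excludes its end.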

The recovery loop and PITR target evaluation

The main loop in PerformWalRecovery uses two boundary checks:

// PerformWalRecovery (loop body) — src/backend/access/transam/xlogrecovery.c
do {
    /* stop BEFORE applying this record? */
    if (recoveryStopsBefore(xlogreader)) { reachedRecoveryTarget = true; break; }
    /* optional apply delay (recovery_min_apply_delay) */
    if (recoveryApplyDelay(xlogreader)) { /* latch wait */ }
    ApplyWalRecord(xlogreader, record, &replayTLI);
    /* stop AFTER applying this record? */
    if (recoveryStopsAfter(xlogreader)) { reachedRecoveryTarget = true; break; }
    /* fetch the next record; NULL means end of WAL */
    record = ReadRecord(xlogprefetcher, LOG, false, replayTLI);
} while (record != NULL);

recoveryStopsBefore handles exclusive targets (recovery_target_inclusive = false). recoveryStopsAfter handles inclusive targets. Both functions compare the record against the five RecoveryTargetType variants (XID, TIME, NAME, LSN, IMMEDIATE).

When reachedRecoveryTarget is true, the engine consults recoveryTargetAction (pause, promote, or shutdown) and takes the corresponding action. pause calls SetRecoveryPause(true) and blocks until a pg_wal_replay_resume() call.
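
For an LSN target, the before/after split can be modeled with two predicates and a loop. This is a simplified sketch under stated assumptions: LsnTarget, stops_before, stops_after, and replay_until_target are hypothetical names, each record is reduced to its start LSN, and only the LSN target type is modeled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* An LSN recovery target plus the recovery_target_inclusive setting. */
typedef struct LsnTarget
{
    uint64_t target_lsn;
    bool     inclusive;
} LsnTarget;

/* Exclusive target: stop before applying the first record at or past it. */
bool
stops_before(const LsnTarget *t, uint64_t record_start)
{
    return !t->inclusive && record_start >= t->target_lsn;
}

/* Inclusive target: stop after applying the first record at or past it. */
bool
stops_after(const LsnTarget *t, uint64_t record_start)
{
    return t->inclusive && record_start >= t->target_lsn;
}

/* Walk the records in order and return how many were applied. */
size_t
replay_until_target(const uint64_t *record_starts, size_t n, const LsnTarget *t)
{
    size_t applied = 0;

    assert(t != NULL);
    for (size_t i = 0; i < n; i++)
    {
        if (stops_before(t, record_starts[i]))
            break;
        applied++;                  /* stand-in for ApplyWalRecord */
        if (stops_after(t, record_starts[i]))
            break;
    }
    return applied;
}
```

With records starting at 100, 200, 300, and 400 and a target of 300, the inclusive setting applies three records and the exclusive setting applies two.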

// CheckRecoveryConsistency — src/backend/access/transam/xlogrecovery.c
void CheckRecoveryConsistency(void)
{
    // ... condensed ...
    if (!reachedConsistency && !backupEndRequired &&
        minRecoveryPoint <= lastReplayedEndRecPtr)
    {
        XLogCheckInvalidPages();
        CheckTablespaceDirectory();
        reachedConsistency = true;
        SendPostmasterSignal(PMSIGNAL_RECOVERY_CONSISTENT);
    }
    if (standbyState == STANDBY_SNAPSHOT_READY &&
        !LocalHotStandbyActive &&
        reachedConsistency && IsUnderPostmaster)
    {
        XLogRecoveryCtl->SharedHotStandbyActive = true;
        LocalHotStandbyActive = true;
        SendPostmasterSignal(PMSIGNAL_BEGIN_HOT_STANDBY);
    }
}

Two signals go to postmaster. PMSIGNAL_RECOVERY_CONSISTENT tells it the database is structurally sound (no invalid page references, no dangling tablespace directories). PMSIGNAL_BEGIN_HOT_STANDBY enables read-only connections: postmaster starts accepting client connections only after receiving the second signal.

The standbyState == STANDBY_SNAPSHOT_READY gate reflects a second condition: hot standby also requires that KnownAssignedXids (the in-memory replica of the primary’s active-transaction set) is sufficiently populated to answer visibility queries. standbyState progresses from STANDBY_DISABLED → STANDBY_INITIALIZED → STANDBY_SNAPSHOT_PENDING → STANDBY_SNAPSHOT_READY as XIDs accumulate.

Figure 2 — Recovery state machine:

stateDiagram-v2
    [*] --> InitWalRecovery
    InitWalRecovery --> CrashRecovery : no signal file
    InitWalRecovery --> ArchiveRecovery : recovery.signal
    InitWalRecovery --> StandbyMode : standby.signal
    CrashRecovery --> PerformWalRecovery
    ArchiveRecovery --> PerformWalRecovery
    StandbyMode --> PerformWalRecovery
    PerformWalRecovery --> ConsistencyReached : minRecoveryPoint passed
    ConsistencyReached --> HotStandbyActive : standbyState SNAPSHOT_READY
    ConsistencyReached --> PITRTarget : recovery target hit
    PerformWalRecovery --> EndOfWAL : no more records
    PITRTarget --> FinishWalRecovery
    EndOfWAL --> FinishWalRecovery
    HotStandbyActive --> PromotionTriggered : promote signal
    PromotionTriggered --> FinishWalRecovery
    FinishWalRecovery --> [*]

Figure 2 — The three entry paths (crash, archive, standby) all feed into the single PerformWalRecovery loop. Hot standby remains in the loop; PITR and end-of-WAL exit to FinishWalRecovery. Promotion of a standby also exits via FinishWalRecovery and then starts writing new WAL on the next timeline.

Replay throughput is limited by I/O when the working set exceeds the buffer pool. PostgreSQL 15 introduced XLogPrefetcher — a thin wrapper around XLogReaderState that decodes WAL records ahead of the replay position and issues PrefetchSharedBuffer calls (ultimately posix_fadvise(POSIX_FADV_WILLNEED) or readahead) for the pages those records will touch.

// XLogPrefetcherAllocate — src/backend/access/transam/xlogprefetcher.c
XLogPrefetcher *
XLogPrefetcherAllocate(XLogReaderState *reader)
{
    XLogPrefetcher *prefetcher = palloc0(sizeof(XLogPrefetcher));

    prefetcher->reader = reader;
    // filter_table: skip pages already in the buffer pool or not yet created
    prefetcher->filter_table = hash_create("XLogPrefetcherFilterTable", 1024, ...);
    dlist_init(&prefetcher->filter_queue);
    prefetcher->reconfigure_count = XLogPrefetchReconfigureCount - 1;
    return prefetcher;
}

The prefetcher maintains an LsnReadQueue — a ring of pending I/O slots indexed by LSN. XLogPrefetcherNextBlock scans ahead in the decoded record queue, picks the next block reference not already filtered out (e.g., pages in the buffer pool, pages in a database being CREATE DATABASE FILE COPY’d), and calls PrefetchSharedBuffer.

// XLogPrefetcherNextBlock (condensed) — src/backend/access/transam/xlogprefetcher.c
static LsnReadQueueNextStatus
XLogPrefetcherNextBlock(uintptr_t pgsr_private, XLogRecPtr *lsn)
{
    // ... condensed ...
    record = XLogReadAhead(prefetcher->reader, nonblocking);
    // check filter_table: skip if already buffered or filtered
    // call PrefetchSharedBuffer for main-fork blocks
    // suppress readahead across TLI-switch checkpoints:
    //     prefetcher->no_readahead_until = record->lsn;
    // return LRQ_NEXT_IO / LRQ_NEXT_NO_IO / LRQ_NEXT_AGAIN
}

The wal_decode_buffer_size GUC controls how far ahead the decoder can run. The recovery_prefetch GUC (default try) enables prefetching; setting it to off disables the lookahead but keeps the decode-queue mechanism. Statistics are exposed via pg_stat_recovery_prefetch (populated by XLogPrefetcherComputeStats).

Two important suppression rules prevent incorrect prefetching:

  1. TLI-switch suppression. When the decoder encounters an XLOG_CHECKPOINT_SHUTDOWN or XLOG_END_OF_RECOVERY record that may carry a timeline switch, no_readahead_until is set to that record’s LSN. Readahead past a TLI switch could prefetch from the wrong segment file.
  2. Relation-creation filter. When an XLOG_DBASE_CREATE_FILE_COPY record is seen, all blocks in that database are added to filter_table until the record is replayed, preventing ENOENT errors from prefetching into a database that does not yet exist on disk.

  • XLogRecoveryShmemInit — allocates XLogRecoveryCtlData in shared memory; called from CreateSharedMemoryAndSemaphores.
  • InitWalRecovery — top-level entry point from StartupXLOG; reads pg_control, detects signal files, establishes RedoStartLSN and CheckPointLoc, allocates xlogreader and xlogprefetcher.
  • readRecoverySignalFile: checks for standby.signal and recovery.signal in $PGDATA; sets StandbyModeRequested, ArchiveRecoveryRequested; rejects a leftover recovery.conf with a FATAL error.
  • validateRecoveryParameters: ensures recovery_target* GUCs are consistent with the chosen mode.
  • read_backup_label — reads backup_label if present; provides CheckPointLoc and RedoStartLSN for base-backup restores.
  • EnableStandbyMode — sets StandbyMode = true and initialises the hot-standby machinery.
  • PerformWalRecovery — main redo loop; initializes XLogRecoveryCtlData watermarks; calls RmgrStartup; ReadRecord / ApplyWalRecord loop.
  • ApplyWalRecord — per-record dispatch: advances nextXid, detects TLI switches, updates replayEndRecPtr, calls RecordKnownAssignedTransactionIds for hot standby, dispatches to rm_redo, verifies FPI consistency if XLR_CHECK_CONSISTENCY is set, updates lastReplayedEndRecPtr.
  • ReadRecord — thin wrapper over XLogPrefetcherReadRecord which ultimately calls XLogReaderValidatePageHeader and DecodeXLogRecord.
  • recoveryStopsBefore / recoveryStopsAfter — evaluate the five RecoveryTargetType variants against the current record.
  • recoveryApplyDelay — implements recovery_min_apply_delay; waits on recoveryWakeupLatch.
  • xlogrecovery_redo — handles RM_XLOG_ID records that are interpreted by the recovery machinery itself (e.g., XLOG_BACKUP_END).
  • CheckRecoveryConsistency — checked after every record; signals PMSIGNAL_RECOVERY_CONSISTENT and PMSIGNAL_BEGIN_HOT_STANDBY.
  • RmgrStartup / RmgrCleanup — call each rmgr’s rm_startup / rm_cleanup at beginning / end of redo.
  • FinishWalRecovery — shuts down WAL receiver, locates the end of the last valid record, returns EndOfWalRecoveryInfo to StartupXLOG.
  • ShutdownWalRecovery — frees the xlogprefetcher and xlogreader.
  • readTimeLineHistory — parses <tli>.history into List *<TimeLineHistoryEntry>.
  • writeTimeLineHistory — appends a new .history entry at promotion; archives it if archive_mode is set.
  • tliSwitchPoint — returns the LSN at which a given TLI diverges from another in a history list.
  • tliOfPointInHistory — given a List * and an LSN, returns the TLI that was active at that LSN.
  • existsTimeLineHistory — checks whether a .history file exists (via archive or pg_wal/).
  • checkTimeLineSwitch — validates that a TLI switch encountered during redo is consistent with the expected history chain.
  • XLogPrefetcherAllocate / XLogPrefetcherFree — lifecycle.
  • XLogPrefetcherNextBlock — callback invoked by LsnReadQueue for each next decoded block reference; issues PrefetchSharedBuffer.
  • XLogPrefetcherComputeStats — updates SharedStats fields read by pg_stat_recovery_prefetch.
  • XLogPrefetcherAddFilter / XLogPrefetcherIsFiltered / XLogPrefetcherCompleteFilters — filter-table management.
  • lrq_alloc / lrq_prefetch / lrq_complete_lsn: LsnReadQueue ring-buffer operations.

Position hints (commit 273fe94, 2026-06-05)

| Symbol | File | Line |
| --- | --- | --- |
| XLogRecoveryCtlData | xlogrecovery.c | 311 |
| XLogRecoveryShmemInit | xlogrecovery.c | 465 |
| InitWalRecovery | xlogrecovery.c | 519 |
| readRecoverySignalFile | xlogrecovery.c | 1046 |
| PerformWalRecovery | xlogrecovery.c | 1671 |
| ApplyWalRecord | xlogrecovery.c | 1928 |
| FinishWalRecovery | xlogrecovery.c | 1477 |
| CheckRecoveryConsistency | xlogrecovery.c | 2196 |
| xlogrecovery_redo | xlogrecovery.c | 2092 |
| rm_redo_error_callback | xlogrecovery.c | 2297 |
| XLogPrefetcherAllocate | xlogprefetcher.c | 362 |
| XLogPrefetcherNextBlock | xlogprefetcher.c | 459 |
| XLogPrefetcherComputeStats | xlogprefetcher.c | 410 |
| readTimeLineHistory | timeline.c | 76 |
| writeTimeLineHistory | timeline.c | 304 |
| tliSwitchPoint | timeline.c | 572 |
| checkTimeLineSwitch | xlogrecovery.c | 2399 |
| RecoveryTargetType | xlogrecovery.h | 23 |
| EndOfWalRecoveryInfo | xlogrecovery.h | 91 |
| RecoveryPauseState | xlogrecovery.h | 44 |

  • InitWalRecovery, PerformWalRecovery, and FinishWalRecovery are the three-function API exported by xlogrecovery.c. Verified by reading the function signatures at lines 519, 1671, and 1477 respectively, and the matching declarations in xlogrecovery.h.

  • XLogRecoveryCtlData is allocated in shared memory, not on the stack. Verified: XLogRecoveryShmemInit calls ShmemInitStruct at line 469. The two watermarks lastReplayedEndRecPtr and replayEndRecPtr are distinct; the former is updated after rm_redo returns, the latter before it is entered (line 1992 vs 2030).

  • ApplyWalRecord advances TransamVariables->nextXid past the record’s XID before calling rm_redo. Verified at line 1942: AdvanceNextFullTransactionIdPastXid(record->xl_xid) runs unconditionally before the dispatch block.

  • The idempotency check is inside each rmgr’s rm_redo callback, not in ApplyWalRecord. Verified by reading ApplyWalRecord in full: there is no PageLSN >= record-LSN comparison in xlogrecovery.c. The pattern is in XLogReadBufferForRedo in xlogutils.c, called by every heap/index rmgr redo handler.

  • CheckRecoveryConsistency sends two distinct signals. Verified at lines 2267 (PMSIGNAL_RECOVERY_CONSISTENT) and 2289 (PMSIGNAL_BEGIN_HOT_STANDBY). Hot-standby activation requires both reachedConsistency and standbyState == STANDBY_SNAPSHOT_READY.

  • Timeline 1 has no .history file. Verified in readTimeLineHistory at line 88: the function returns a synthetic single-entry list for targetTLI == 1 without attempting a file read.

  • The WAL prefetcher suppresses readahead past TLI-switch checkpoints. Verified in XLogPrefetcherNextBlock at lines 537–553: XLOG_CHECKPOINT_SHUTDOWN and XLOG_END_OF_RECOVERY both set prefetcher->no_readahead_until = record->lsn.

  • recovery.conf is explicitly rejected. Verified in readRecoverySignalFile (line 1046): if recovery.conf exists in $PGDATA, the startup process logs a FATAL with a message directing the user to recovery.signal.

  1. Parallel redo. The current redo loop is single-threaded inside the startup process. There is no parallel redo path in PG18 for crash recovery (unlike logical replication’s parallel apply). The investigation path: search for ParallelApply patterns in xlogrecovery.c and check the mailing list archives for parallel redo proposals.

  2. XLR_CHECK_CONSISTENCY trigger conditions. ApplyWalRecord calls verifyBackupPageConsistency when XLR_CHECK_CONSISTENCY is set in xl_info. The exact conditions under which a WAL record gets this flag set during normal operation (vs. only when wal_consistency_checking GUC is active) are not fully traced. Investigation path: grep for XLR_CHECK_CONSISTENCY across xloginsert.c and rmgr source files.

  3. LsnReadQueue size and memory pressure. The LsnReadQueue ring (lrq_alloc) bounds the number of in-flight prefetches, while wal_decode_buffer_size bounds how far ahead the decoder can run. The interaction between decode-buffer size, prefetch distance, and shared-buffer hit rate during recovery is not quantified in the source comments. Investigation path: read XLogPrefetchReconfigure and the pg_stat_recovery_prefetch columns prefetch, hit, skip_init, skip_new, skip_fpw, and skip_rep.

Beyond PostgreSQL — Comparative Designs & Research Frontiers

  • ARIES undo log vs. PostgreSQL’s no-undo approach. ARIES prescribes a full undo pass after redo, rolling back every loser transaction with compensation log records. PostgreSQL avoids a separate undo log by relying on MVCC: old row versions remain in heap pages, and VACUUM reclaims them asynchronously. Crash recovery merely replays abort records; there is no per-row undo traversal. The trade-off is that PostgreSQL must run VACUUM to reclaim space that an undo-based engine (Oracle, InnoDB) reclaims immediately at rollback. A comparative study: knowledge/research/dbms-papers/aries.md.

  • Parallel redo. MySQL InnoDB and MariaDB introduced parallel redo (multi-threaded apply) to exploit multi-core hardware during crash recovery. PostgreSQL’s redo is currently single-threaded in the startup process. The XLogPrefetcher partially compensates by hiding I/O latency, but CPU-bound redo (e.g., index rebuilds) does not benefit. Parallel redo is an active area of PG development.

  • Logical replication’s parallel apply. PG16 introduced parallel apply for logical replication (max_parallel_apply_workers_per_subscription), which is structurally analogous to parallel redo at the logical level. Its design, in which apply workers are coordinated by a leader that tracks XID dependencies, could inform a future physical redo parallelization.

  • Incremental backup and WAL summarization during recovery. The WAL summarization feature (PG17) produces per-block WAL summaries that support incremental base backups (pg_basebackup --incremental). The summarizer (walsummarizer.c) reads WAL during recovery and normal operation; its interaction with the recovery replay position is relevant context for postgres-archiving-walsummary.md.

  • Instant recovery (Redo-only designs). Shasank Chavan et al., “Oracle Database In-Memory” and related work on “redo-only” engines (no undo log at all, relying entirely on MVCC) are a useful contrast to PostgreSQL’s hybrid. Closer to PostgreSQL: the discussion in Hellerstein et al. 2007 (Architecture of a DB System, dbms-papers/fntdb07-architecture.md) §“Storage and Buffer Management” frames the steal/no-force design space and the recovery cost of each choice.

  • (no raw inbox files — doc synthesized directly from the source tree)
  • src/backend/access/transam/xlogrecovery.c — recovery state machine, InitWalRecovery, PerformWalRecovery, FinishWalRecovery, ApplyWalRecord, CheckRecoveryConsistency
  • src/backend/access/transam/xlogprefetcher.c: XLogPrefetcher, LsnReadQueue, XLogPrefetcherNextBlock
  • src/backend/access/transam/timeline.c — timeline history read/write/query
  • src/include/access/xlogrecovery.h: RecoveryTargetType, EndOfWalRecoveryInfo, RecoveryPauseState
  • src/include/access/timeline.h: TimeLineHistoryEntry
  • src/backend/access/transam/README — in-tree design notes for the transam subsystem
  • Mohan et al. 1992, ARIES: A Transaction Recovery Method — captured at knowledge/research/dbms-papers/aries.md
  • Petrov 2019, Database Internals, ch. 5 — WAL, recovery, LSN model
  • Silberschatz et al. 2020, Database System Concepts, 7e, ch. 19 — recovery theory
  • Hellerstein et al. 2007, Architecture of a Database System — captured at knowledge/research/dbms-papers/fntdb07-architecture.md