
PostgreSQL pg_rewind — Timeline Divergence Detection and Data Directory Resync


High-availability deployments of any DBMS depend on the ability to rejoin a node that has fallen behind or diverged. The difficulty is that two nodes starting from the same checkpoint can each accept writes after a network partition, after a mis-fenced primary, or after a planned switchover gone wrong. When the dust settles, one node holds the authoritative history; the other holds a diverged history that can never be merged with the new timeline and must be discarded.

PostgreSQL represents its WAL stream as a linear sequence of timelines. Each promotion increments the timeline ID (TLI) and records the fork point in a .history file: <TLI>.history maps each predecessor TLI to the LSN at which that timeline ended. The TimeLineHistoryEntry array built from that file is how the system answers “at which LSN did timelines A and B last share the same WAL byte?”.
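To make the history-file model concrete, here is a minimal sketch of turning history text into an entry array. It assumes the usual tab-separated line layout (timeline ID, switch-point LSN as hi/lo hex, free-text reason). The `HistoryEntry` type and `parse_history` function are hypothetical stand-ins for illustration, not the real `TimeLineHistoryEntry` parser, and an `end` of 0 is used here to mean "still open".

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical, simplified stand-in for TimeLineHistoryEntry. */
typedef struct
{
    uint32_t tli;    /* timeline ID */
    uint64_t begin;  /* first LSN that belongs to this timeline */
    uint64_t end;    /* LSN at which the timeline was left; 0 = still open */
} HistoryEntry;

/*
 * Parse lines of the form "<tli>\t<hi>/<lo>\t<reason>\n", then append a
 * tip entry for target_tli. Returns the number of entries written, or -1
 * on malformed input or when the output array is too small.
 */
static int
parse_history(const char *buf, uint32_t target_tli,
              HistoryEntry *out, int max_entries)
{
    int         n = 0;
    uint64_t    prev_end = 0;
    const char *line = buf;

    while (*line != '\0')
    {
        const char *eol = strchr(line, '\n');
        size_t      len = eol ? (size_t) (eol - line) : strlen(line);

        /* Skip blank lines and comment lines. */
        if (len > 0 && line[0] != '#')
        {
            unsigned int tli, hi, lo;

            if (n >= max_entries - 1 ||
                sscanf(line, "%u\t%X/%X", &tli, &hi, &lo) != 3)
                return -1;
            out[n].tli = tli;
            out[n].begin = prev_end;            /* starts where parent ended */
            out[n].end = ((uint64_t) hi << 32) | lo;
            prev_end = out[n].end;
            n++;
        }
        line = eol ? eol + 1 : line + len;
    }

    /* Tip entry: the timeline the cluster is currently on. */
    if (n >= max_entries)
        return -1;
    out[n].tli = target_tli;
    out[n].begin = prev_end;
    out[n].end = 0;
    return n + 1;
}
```

Feeding it the single line `1\t0/3000000\tno recovery target specified` with a target TLI of 2 yields two entries: TLI 1 covering LSNs up to 0/3000000, and an open-ended TLI 2 that starts there.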

The fundamental relationship is:

target LSN stream: ... [common ancestor checkpoint C] ---> [diverged WAL D]
source LSN stream: ... [common ancestor checkpoint C] ---> [new WAL after promotion]

After the divergence point the two streams are independent. A rejoin operation must:

  1. Identify C — the last common checkpoint before divergence.
  2. Discover the set of data pages the target modified in region D (diverged WAL) so those pages can be overwritten from the source.
  3. Copy the necessary files and pages from source to target.
  4. Leave a marker that tells the target where to begin WAL replay from the source, so the remaining gap is closed on next startup.

This is distinct from crash recovery (which replays a single node’s own WAL) and from streaming replication (which requires the target never to have diverged). pg_rewind fills the gap between those two: a diverged but otherwise healthy node that can be made current at lower cost than a full pg_basebackup.

A full base backup copies every byte of the data directory — typically tens to hundreds of gigabytes. pg_rewind copies only the pages the target dirtied after divergence, plus any non-relation files that differ. For a node that diverged recently (minutes to hours) the transfer is orders of magnitude smaller, and the former primary can rejoin as a standby without the network cost of a full reseed.

The trade-off: pg_rewind requires that the WAL from the last common checkpoint onward still exist on the target (or be recoverable from an archive), and that the target runs with data checksums or wal_log_hints = on. Without one of those settings, hint-bit updates are not WAL-logged, so the WAL scan would miss pages the target modified after divergence and leave them stale.

pg_rewind does not run WAL replay itself. It leaves the target in a state that looks like a base backup begun at checkpoint C: the control file’s minRecoveryPoint is set to the source’s current WAL end, and a backup_label file names checkpoint C as the starting point for replay. Normal PostgreSQL recovery (startup process) then replays the source’s WAL from C forward, closing the gap. This is the same model used by pg_basebackup --wal-method=stream: the backup tool transfers files, and recovery does the WAL work.

Most replication tools must handle at least two data sources: a live running server and a local directory. The universal pattern is a thin vtable (or interface) that wraps the difference. The caller only sees operations like “list files”, “fetch this byte range”, “get current WAL position”; the backend decides whether to satisfy the request via COPY queries over libpq or via direct file I/O. PostgreSQL uses this pattern in rewind_source.

The insight that makes partial resync possible: the target’s own WAL records precisely which pages it touched after divergence. Reading the WAL from C to the target’s end-of-WAL yields a complete set of (relfilelocator, forknum, blockno) tuples — the exact blocks that must be overwritten. This avoids block-level checksumming of the entire data directory, which would require reading every page on both sides.

Any resync tool must decide, for each path in the data directory, one of a fixed set of actions: copy the whole file, copy only a tail extension, truncate, remove, or leave alone. Relation data files and non-relation files get different treatment because WAL tracking is available only for the main fork of relation files; everything else must be copied in full when it differs.

Applying file actions in the wrong order can leave the directory in an unrecoverable state. The universal safe ordering is: create directories before writing their contents, copy files before removing old ones, remove leaf entries before their parent directories. Encoding the ordering directly in the action enum (so that sorting by action value yields the correct sequence) is a clean implementation technique.

pg_rewind is a standalone client-side binary (src/bin/pg_rewind/). It does not link against the server backend; it uses libpq for the live-source path and re-implements the handful of backend functions it needs (WAL reading, timeline history parsing) in pure frontend code. The main driver is pg_rewind.c; the four supporting modules are parsexlog.c (WAL scan), filemap.c (file decision table), file_ops.c (target-side I/O), and timeline.c (history parsing).

// main — src/bin/pg_rewind/pg_rewind.c
main()
├── init_libpq_source / init_local_source ← source vtable
├── ensureCleanShutdown ← single-user postgres if dirty
├── sanityChecks ← system_identifier, checksums
├── getTimelineHistory × 2 ← source + target TLI arrays
├── findCommonAncestorTimeline ← → divergerec
├── findLastCheckpoint ← → chkptrec / chkptredo
├── filehash_init + traverse_files × 2 ← populate hash
├── extractPageMap ← WAL scan → dirty-page bitmaps
├── decide_file_actions → filemap_t ← per-file action enum
└── perform_rewind ← copy + backup_label + control file

Phase 1 — preconditions and timeline divergence


main() first ensures the target is cleanly shut down. If the control file state is neither DB_SHUTDOWNED nor DB_SHUTDOWNED_IN_RECOVERY and --no-ensure-shutdown is not set, ensureCleanShutdown launches postgres --single to finish crash recovery before pg_rewind touches anything. This is a safety measure: an unclean target might have an inconsistent page state that confuses the subsequent WAL scan.

sanityChecks then verifies three invariants:

  1. system_identifier matches — source and target came from the same initdb cluster.
  2. pg_control_version and catalog_version_no match the compiled constant — same major PostgreSQL version.
  3. The target has checksums or wal_log_hints — without one of these, hint-bit updates are invisible in WAL and pg_rewind would miss modified pages.

// sanityChecks — src/bin/pg_rewind/pg_rewind.c
if (ControlFile_target.system_identifier != ControlFile_source.system_identifier)
    pg_fatal("source and target clusters are from different systems");
if (ControlFile_target.data_checksum_version != PG_DATA_CHECKSUM_VERSION &&
    !ControlFile_target.wal_log_hints)
    pg_fatal("target server needs to use either data checksums or "
             "\"wal_log_hints = on\"");

Timeline history is fetched for both sides via getTimelineHistory. For TLI 1 (no history file) a synthetic single-entry array is returned. For higher TLIs the .history file is fetched from the appropriate source and parsed by rewind_parseTimeLineHistory into a TimeLineHistoryEntry[] array.

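findCommonAncestorTimeline walks both arrays in parallel and stops at the first index where tli or begin differs. The last matching entry (one step back from the mismatch) is the shared ancestor timeline, and the MinXLogRecPtr of its two end fields is divergerec, the LSN at which the histories parted. If not even the first entries match, the clusters share no history and pg_rewind fails.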

// findCommonAncestorTimeline — src/bin/pg_rewind/pg_rewind.c
for (i = 0; i < n; i++)
{
    if (a_history[i].tli != b_history[i].tli ||
        a_history[i].begin != b_history[i].begin)
        break;
}
if (i > 0)
{
    i--;
    *recptr = MinXLogRecPtr(a_history[i].end, b_history[i].end);
    *tliIndex = i;
    return;
}

If target_wal_endrec <= divergerec the target is already an ancestor of the source (it never wrote beyond the fork point), so no rewind is needed. If target_wal_endrec > divergerec a rewind is required.
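The divergence test can be sketched as a self-contained function. The types and names below (`HistoryEntry`, `min_lsn`, `find_common_ancestor`, `rewind_needed`) are simplified, hypothetical stand-ins, and an `end` of 0 is assumed to mean an open-ended timeline that never limits the minimum.

```c
#include <stdint.h>

/* Hypothetical, simplified stand-in for TimeLineHistoryEntry. */
typedef struct
{
    uint32_t tli;
    uint64_t begin;
    uint64_t end;    /* 0 = timeline still open (no fork recorded) */
} HistoryEntry;

/* Minimum of two LSNs, where 0 means "unbounded". */
static uint64_t
min_lsn(uint64_t a, uint64_t b)
{
    if (a == 0)
        return b;
    if (b == 0)
        return a;
    return a < b ? a : b;
}

/*
 * Returns the index of the last entry both histories share and stores the
 * fork LSN in *divergerec. Returns -1 when the histories share nothing.
 */
static int
find_common_ancestor(const HistoryEntry *a, int a_n,
                     const HistoryEntry *b, int b_n,
                     uint64_t *divergerec)
{
    int n = a_n < b_n ? a_n : b_n;
    int i;

    for (i = 0; i < n; i++)
    {
        if (a[i].tli != b[i].tli || a[i].begin != b[i].begin)
            break;
    }
    if (i == 0)
        return -1;

    i--;                        /* step back to the last matching entry */
    *divergerec = min_lsn(a[i].end, b[i].end);
    return i;
}

/* A rewind is needed only if the target wrote WAL past the fork point. */
static int
rewind_needed(uint64_t target_wal_endrec, uint64_t divergerec)
{
    return target_wal_endrec > divergerec;
}
```

For example, if the source history is TLI 1 ending at 0/3000000 followed by an open TLI 2, and the target is still on an open TLI 1, the shared entry is index 0 and the fork LSN is 0/3000000. A target whose WAL ends at 0/3000100 must be rewound; one that stopped exactly at 0/3000000 need not be.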

findLastCheckpoint (in parsexlog.c) walks the target’s WAL backwards from divergerec, looking for the most recent XLOG_CHECKPOINT_SHUTDOWN or XLOG_CHECKPOINT_ONLINE record that precedes the fork. This is the rewind start point chkptrec; the checkpoint’s redo field gives chkptredo.

Walking backwards is correct because pg_rewind does not care about in-flight transactions; it only needs a checkpoint from which WAL replay can reconstruct a consistent state. The backward walk also records every WAL segment file it reads between the fork point and the checkpoint, adding each to the keepwal hash so that none is accidentally removed during the file reconciliation phase.

// findLastCheckpoint — src/bin/pg_rewind/parsexlog.c
info = XLogRecGetInfo(xlogreader) & ~XLR_INFO_MASK;
if (searchptr < forkptr &&
    XLogRecGetRmid(xlogreader) == RM_XLOG_ID &&
    (info == XLOG_CHECKPOINT_SHUTDOWN ||
     info == XLOG_CHECKPOINT_ONLINE))
{
    memcpy(&checkPoint, XLogRecGetData(xlogreader), sizeof(CheckPoint));
    *lastchkptrec = searchptr;
    *lastchkpttli = checkPoint.ThisTimeLineID;
    *lastchkptredo = checkPoint.redo;
    break;
}
/* Walk backwards to previous record. */
searchptr = record->xl_prev;
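
The backward walk can be modelled without a real WAL reader. The sketch below assumes an in-memory array of records linked by their `xl_prev` pointers, and assumes the fork pointer falls on a record start. `WalRecord`, `read_record`, and `find_last_checkpoint` are hypothetical names for illustration only.

```c
#include <stddef.h>
#include <stdint.h>

typedef enum { REC_OTHER, REC_CHECKPOINT_SHUTDOWN, REC_CHECKPOINT_ONLINE } RecKind;

/* Hypothetical, simplified WAL record: just enough to chain backwards. */
typedef struct
{
    uint64_t lsn;      /* start LSN of this record */
    uint64_t xl_prev;  /* start LSN of the previous record, 0 if none */
    RecKind  kind;
    uint64_t redo;     /* redo pointer; meaningful for checkpoints only */
} WalRecord;

/* Linear lookup standing in for a real WAL page reader. */
static const WalRecord *
read_record(const WalRecord *wal, size_t n, uint64_t lsn)
{
    for (size_t i = 0; i < n; i++)
    {
        if (wal[i].lsn == lsn)
            return &wal[i];
    }
    return NULL;
}

/*
 * Walk backwards from forkptr until a checkpoint record that starts
 * strictly before the fork is found. Returns 1 and fills the outputs on
 * success, 0 if the chain ends or breaks first.
 */
static int
find_last_checkpoint(const WalRecord *wal, size_t n, uint64_t forkptr,
                     uint64_t *chkptrec, uint64_t *chkptredo)
{
    uint64_t searchptr = forkptr;

    while (searchptr != 0)
    {
        const WalRecord *rec = read_record(wal, n, searchptr);

        if (rec == NULL)
            return 0;           /* required WAL is missing */
        if (searchptr < forkptr &&
            (rec->kind == REC_CHECKPOINT_SHUTDOWN ||
             rec->kind == REC_CHECKPOINT_ONLINE))
        {
            *chkptrec = searchptr;
            *chkptredo = rec->redo;
            return 1;
        }
        searchptr = rec->xl_prev;   /* step to the previous record */
    }
    return 0;
}
```

Note that a checkpoint sitting exactly at the fork pointer is skipped: only a checkpoint that precedes the fork is guaranteed to exist on both histories.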

Phase 3 — file inventory and WAL page map


traverse_files is called on the source (via the rewind_source vtable) and traverse_datadir on the target. Both call back into process_source_file and process_target_file respectively, each of which inserts or updates a file_entry_t record in the filehash hash table keyed by relative path.

// process_source_file — src/bin/pg_rewind/filemap.c
entry = insert_filehash_entry(path);
entry->source_exists = true;
entry->source_type = type;
entry->source_size = size;
entry->source_link_target = link_target ? pg_strdup(link_target) : NULL;

extractPageMap then reads the target’s WAL from chkptrec to target_wal_endrec using XLogReaderAllocate / XLogReadRecord. For each decoded record it calls extractPageInfo, which iterates over the record’s block references and calls process_target_wal_block_change for each MAIN_FORKNUM block.

// extractPageInfo — src/bin/pg_rewind/parsexlog.c
for (block_id = 0; block_id <= XLogRecMaxBlockId(record); block_id++)
{
    if (!XLogRecGetBlockTagExtended(record, block_id,
                                    &rlocator, &forknum, &blkno, NULL))
        continue;
    if (forknum != MAIN_FORKNUM)
        continue;
    process_target_wal_block_change(forknum, rlocator, blkno);
}

process_target_wal_block_change looks up the file entry for the corresponding data segment and adds blkno_inseg to the entry’s target_pages_to_overwrite bitmap, but only if the block lies within the bounds of both files (end_offset <= source_size and end_offset <= target_size). Blocks beyond the source EOF will be truncated away anyway, and blocks beyond the target EOF arrive with the copied tail, so there is no point fetching either individually.

// process_target_wal_block_change — src/bin/pg_rewind/filemap.c
end_offset = (blkno_inseg + 1) * BLCKSZ;
if (end_offset <= entry->source_size && end_offset <= entry->target_size)
    datapagemap_add(&entry->target_pages_to_overwrite, blkno_inseg);
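
As a worked example of the arithmetic, the sketch below maps a relation-wide block number to a segment file and an in-segment block, then applies the same bounds test. It assumes the default build constants (8 kB pages, 131072 blocks per 1 GB segment). `SegBlock`, `locate_block`, and `block_needs_fetch` are hypothetical helper names.

```c
#include <stdbool.h>
#include <stdint.h>

#define BLCKSZ      8192      /* assumed default page size in bytes */
#define RELSEG_SIZE 131072    /* assumed default blocks per segment file */

typedef struct
{
    uint32_t segno;        /* which segment file (0 = no suffix, 1 = ".1") */
    uint32_t blkno_inseg;  /* block number within that segment */
} SegBlock;

/* Split a relation-wide block number into segment and in-segment block. */
static SegBlock
locate_block(uint32_t blkno)
{
    SegBlock result;

    result.segno = blkno / RELSEG_SIZE;
    result.blkno_inseg = blkno % RELSEG_SIZE;
    return result;
}

/*
 * A block is fetched individually only when it ends inside both files.
 * Anything past either EOF is covered by the whole-file action instead.
 */
static bool
block_needs_fetch(uint32_t blkno_inseg, int64_t source_size, int64_t target_size)
{
    int64_t end_offset = ((int64_t) blkno_inseg + 1) * BLCKSZ;

    return end_offset <= source_size && end_offset <= target_size;
}
```

With these constants, relation block 131073 lands in segment 1 as in-segment block 1. In a segment that is 16384 bytes long on both sides, block 1 ends at byte 16384 and qualifies, while block 2 would end at byte 24576 and does not.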

The datapagemap_t is a compact byte bitmap: bit N represents block N within its segment. The bitmap grows on demand with a small headroom so that sequential block insertions do not trigger per-block realloc.
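
A growable bitmap of this kind can be sketched in a few lines. The names below (`PageMap`, `pagemap_add`, `PageMapIter`, `pagemap_next`) are hypothetical, and the headroom of 10 bytes is an illustrative choice rather than a statement about the real constant.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for datapagemap_t: one bit per block. */
typedef struct
{
    uint8_t *bitmap;
    size_t   bitmapsize;   /* allocated size in bytes */
} PageMap;

/* Set the bit for blkno, enlarging the array when needed. */
static void
pagemap_add(PageMap *map, uint32_t blkno)
{
    size_t offset = blkno / 8;
    int    bitno = blkno % 8;

    if (map->bitmapsize <= offset)
    {
        /* Grow past the needed byte so nearby blocks do not realloc again. */
        size_t   newsize = offset + 1 + 10;
        uint8_t *p = realloc(map->bitmap, newsize);

        if (p == NULL)
            abort();
        memset(p + map->bitmapsize, 0, newsize - map->bitmapsize);
        map->bitmap = p;
        map->bitmapsize = newsize;
    }
    map->bitmap[offset] |= (uint8_t) (1u << bitno);
}

typedef struct
{
    const PageMap *map;
    uint32_t       nextblkno;
} PageMapIter;

/* Return the next set block in ascending order, or false when exhausted. */
static bool
pagemap_next(PageMapIter *it, uint32_t *blkno)
{
    for (;;)
    {
        uint32_t b = it->nextblkno;

        if (b / 8 >= it->map->bitmapsize)
            return false;
        it->nextblkno++;
        if (it->map->bitmap[b / 8] & (1u << (b % 8)))
        {
            *blkno = b;
            return true;
        }
    }
}
```

Adding blocks 3, 100, and 3 again and then iterating yields 3 followed by 100: duplicates collapse into one bit, and iteration order is ascending regardless of insertion order.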

decide_file_actions() iterates the hash table, calling decide_file_action() for each entry. The decision logic is:

| Condition | Action |
| --- | --- |
| path matches excludeFiles list | REMOVE (if target-only) or NONE |
| source only (new on source) | COPY or CREATE (dir/symlink) |
| target only (deleted on source) | REMOVE, unless in keepwal |
| both exist, non-relation file | COPY |
| both exist, relation, target larger | TRUNCATE |
| both exist, relation, target smaller | COPY_TAIL |
| both exist, relation, same size | NONE (dirty pages handled separately) |

Special-cased paths include XLOG_CONTROL_FILE (deferred to the end of perform_rewind), PG_VERSION (never overwritten), .DS_Store (macOS metadata, ignored), and directories/symlinks that exist on both sides (no action needed for the entry itself).

// decide_file_action — src/bin/pg_rewind/filemap.c
if (strcmp(path, XLOG_CONTROL_FILE) == 0)
    return FILE_ACTION_NONE;        /* handled separately at end */
if (!entry->target_exists && entry->source_exists)
{
    switch (entry->source_type)
    {
        case FILE_TYPE_DIRECTORY: return FILE_ACTION_CREATE;
        case FILE_TYPE_REGULAR:   return FILE_ACTION_COPY;
        // ... symlink → CREATE
    }
}
else if (entry->target_exists && !entry->source_exists)
{
    if (keepwal_entry_exists(path))
        return FILE_ACTION_NONE;    /* WAL needed for recovery */
    return FILE_ACTION_REMOVE;
}

For a relation data file present on both sides, the action is decided purely by comparing the two segment sizes. A target that is larger has extra blocks the source truncated away, so it is truncated now (WAL replay would do the same later); a target that is smaller is missing a tail the source has, so the tail is copied; equal sizes need no whole-file action because per-block changes are already captured in the target_pages_to_overwrite bitmap from the WAL scan.

// decide_file_action — src/bin/pg_rewind/filemap.c
case FILE_TYPE_REGULAR:
    if (!entry->isrelfile)
        return FILE_ACTION_COPY;            /* non-data file: copy in toto */
    else
    {
        if (entry->target_size < entry->source_size)
            return FILE_ACTION_COPY_TAIL;   /* fetch the missing tail */
        else if (entry->target_size > entry->source_size)
            return FILE_ACTION_TRUNCATE;    /* drop the extra blocks */
        else
            return FILE_ACTION_NONE;        /* dirty pages handled via bitmap */
    }

The file_action_t enum is deliberately ordered so that qsort-by-action produces the safe execution sequence: CREATE first (so parent directories exist before children), then COPY/COPY_TAIL, then NONE, then TRUNCATE, then REMOVE. REMOVE entries are secondarily sorted in reverse-path order so foo/bar is removed before foo.
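
The ordering trick is easy to demonstrate in isolation. This sketch sorts an array of simplified entries with `qsort`. The `Action` and `Entry` types and the `entry_cmp` and `sort_entries` functions are hypothetical, and unlike the real filemap (described above as an array of pointers) it sorts the structs directly.

```c
#include <stdlib.h>
#include <string.h>

/* Enum values double as execution priority: lower runs earlier. */
typedef enum
{
    ACTION_UNDECIDED = 0,
    ACTION_CREATE,
    ACTION_COPY,
    ACTION_COPY_TAIL,
    ACTION_NONE,
    ACTION_TRUNCATE,
    ACTION_REMOVE
} Action;

typedef struct
{
    const char *path;
    Action      action;
} Entry;

/* Order by action first; within REMOVE, reverse path order so that
 * children sort ahead of their parent directories. */
static int
entry_cmp(const void *a, const void *b)
{
    const Entry *ea = a;
    const Entry *eb = b;

    if (ea->action != eb->action)
        return ea->action < eb->action ? -1 : 1;
    if (ea->action == ACTION_REMOVE)
        return strcmp(eb->path, ea->path);
    return strcmp(ea->path, eb->path);
}

static void
sort_entries(Entry *entries, size_t n)
{
    qsort(entries, n, sizeof(Entry), entry_cmp);
}
```

Given CREATE entries for `base` and `base/1`, a COPY for `base/1/100`, and REMOVE entries for `foo` and `foo/bar`, the sorted order is `base`, `base/1`, `base/1/100`, `foo/bar`, `foo`: parents are created before children, and children are removed before parents.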

Phase 5 — perform_rewind: copy and control file update


perform_rewind iterates the sorted filemap_t. For each entry with a non-empty target_pages_to_overwrite bitmap it iterates the bitmap and calls source->queue_fetch_range for each dirty block. Then it dispatches the per-file action: COPY → queue_fetch_file, COPY_TAIL → queue_fetch_range for the tail, TRUNCATE → truncate_target_file, REMOVE → remove_target, CREATE → create_target.

// perform_rewind — src/bin/pg_rewind/pg_rewind.c
/* per entry: queue one block-sized range fetch for every dirty page */
iter = datapagemap_iterate(&entry->target_pages_to_overwrite);
while (datapagemap_next(iter, &blkno))
{
    offset = blkno * BLCKSZ;
    source->queue_fetch_range(source, entry->path, offset, BLCKSZ);
}
// ... then switch(entry->action) for whole-file operations
/* after all entries are processed: flush the queued range fetches */
source->finish_fetch(source);

After all file operations, pg_rewind re-fetches the source’s control file and constructs a ControlFile_new:

  • state is set to DB_IN_ARCHIVE_RECOVERY.
  • minRecoveryPoint is set to the source’s current WAL insert LSN (for a live primary source) or its latest checkpoint (for a local directory source) — this is how far the target must replay before it can accept connections.
  • minRecoveryPointTLI is set to the source’s TLI.

A backup_label file is written naming chkptredo as the START WAL LOCATION. This causes the startup process to begin WAL replay from the last common checkpoint, exactly as if pg_rewind had been a base backup started at that point.

// createBackupLabel — src/bin/pg_rewind/pg_rewind.c
len = snprintf(buf, sizeof(buf),
               "START WAL LOCATION: %X/%X (file %s)\n"
               "CHECKPOINT LOCATION: %X/%X\n"
               "BACKUP METHOD: pg_rewind\n"
               "BACKUP FROM: standby\n"
               "START TIME: %s\n",
               LSN_FORMAT_ARGS(startpoint), xlogfilename,
               LSN_FORMAT_ARGS(checkpointloc),
               strfbuf);
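
To see what the resulting label looks like, the following sketch formats one from plain integers. It assumes the default 16 MB WAL segment size and the conventional 24-hex-digit segment file name. `wal_file_name` and `format_backup_label` are hypothetical helpers, and the LSN is split into high and low 32-bit halves by hand.

```c
#include <stdint.h>
#include <stdio.h>

#define WAL_SEGMENT_SIZE (16u * 1024 * 1024)   /* assumed default: 16 MB */

/* Build the 24-hex-digit WAL segment file name that contains lsn. */
static void
wal_file_name(char *buf, size_t buflen, uint32_t tli, uint64_t lsn)
{
    uint64_t segno = lsn / WAL_SEGMENT_SIZE;
    uint64_t segs_per_id = UINT64_C(0x100000000) / WAL_SEGMENT_SIZE;

    snprintf(buf, buflen, "%08X%08X%08X",
             (unsigned) tli,
             (unsigned) (segno / segs_per_id),
             (unsigned) (segno % segs_per_id));
}

/* Write a pg_rewind-style backup label into buf; returns snprintf's result. */
static int
format_backup_label(char *buf, size_t buflen, uint32_t tli,
                    uint64_t startpoint, uint64_t checkpointloc,
                    const char *start_time)
{
    char xlogfilename[32];

    wal_file_name(xlogfilename, sizeof(xlogfilename), tli, startpoint);
    return snprintf(buf, buflen,
                    "START WAL LOCATION: %X/%X (file %s)\n"
                    "CHECKPOINT LOCATION: %X/%X\n"
                    "BACKUP METHOD: pg_rewind\n"
                    "BACKUP FROM: standby\n"
                    "START TIME: %s\n",
                    (unsigned) (startpoint >> 32), (unsigned) startpoint,
                    xlogfilename,
                    (unsigned) (checkpointloc >> 32), (unsigned) checkpointloc,
                    start_time);
}
```

For a redo pointer of 0/3000028 on timeline 1, the first line reads `START WAL LOCATION: 0/3000028 (file 000000010000000000000003)`, because byte 0x3000028 falls in the fourth 16 MB segment (segment number 3).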

The rewind_source struct is a C vtable with seven function pointers:

// rewind_source — src/bin/pg_rewind/rewind_source.h
typedef struct rewind_source
{
    void        (*traverse_files)(struct rewind_source *,
                                  process_file_callback_t callback);
    char       *(*fetch_file)(struct rewind_source *, const char *path,
                              size_t *filesize);
    void        (*queue_fetch_range)(struct rewind_source *, const char *path,
                                     off_t offset, size_t len);
    void        (*queue_fetch_file)(struct rewind_source *, const char *path,
                                    size_t len);
    void        (*finish_fetch)(struct rewind_source *);
    XLogRecPtr  (*get_current_wal_insert_lsn)(struct rewind_source *);
    void        (*destroy)(struct rewind_source *);
} rewind_source;

init_libpq_source (libpq_source.c) implements the vtable using COPY queries and pg_read_binary_file() calls over a live libpq connection. It batches range-fetch requests and flushes them in finish_fetch. init_local_source (local_source.c) implements it with direct filesystem calls.

Both backends are transparent to the caller; all phases above use only the vtable interface.
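
The vtable pattern itself can be shown with a much smaller interface. The sketch below is not the real rewind_source API: `source`, `memory_source`, and `copy_block` are hypothetical, and the backend serves bytes from an in-memory buffer purely for illustration. The base struct is embedded as the first member of the backend struct so that a pointer to one can be cast to the other.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical two-method interface; callers see only this struct. */
typedef struct source source;
struct source
{
    size_t (*fetch_range)(source *self, size_t offset, size_t len, char *dst);
    void   (*destroy)(source *self);
};

/* Hypothetical backend that serves data from memory. */
typedef struct
{
    source      common;   /* must be first so the cast below is valid */
    const char *data;
    size_t      size;
} memory_source;

static size_t
memory_fetch_range(source *self, size_t offset, size_t len, char *dst)
{
    memory_source *src = (memory_source *) self;

    if (offset >= src->size)
        return 0;
    if (len > src->size - offset)
        len = src->size - offset;     /* clamp to end of data */
    memcpy(dst, src->data + offset, len);
    return len;
}

static void
memory_destroy(source *self)
{
    (void) self;                      /* nothing owned, nothing to free */
}

static void
init_memory_source(memory_source *src, const char *data, size_t size)
{
    src->common.fetch_range = memory_fetch_range;
    src->common.destroy = memory_destroy;
    src->data = data;
    src->size = size;
}

/* Backend-agnostic caller: works with any implementation of source. */
static size_t
copy_block(source *src, size_t blkno, size_t blcksz, char *dst)
{
    return src->fetch_range(src, blkno * blcksz, blcksz, dst);
}
```

A caller such as `copy_block` never learns which backend it is talking to. Swapping the in-memory backend for a network-backed one would change only the init function and the function pointers it installs.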

filemap.c maintains two exclusion lists. excludeDirContents names directories whose contents are always regenerated on server start (pg_stat_tmp, pg_replslot, pg_dynshmem, pg_notify, pg_serial, pg_snapshots, pg_subtrans). excludeFiles names specific filenames that should not be copied (postmaster.pid, postmaster.opts, backup_label, backup_manifest, pg_internal.init prefix, postgresql.auto.conf.tmp, current_logfiles.tmp). These lists are documented as needing to stay in sync with basebackup.c.

// excludeDirContents — src/bin/pg_rewind/filemap.c
static const char *const excludeDirContents[] = {
    "pg_stat_tmp", "pg_replslot", "pg_dynshmem",
    "pg_notify", "pg_serial", "pg_snapshots", "pg_subtrans",
    NULL
};
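
A path filter over two such lists might look like the sketch below. It assumes that directory entries exclude only their contents (the directory itself is kept) and that file entries match on the last path component, optionally as a prefix. The lowercase list names, `struct exclude_file`, and `path_is_excluded` are hypothetical, and the file list is abbreviated.

```c
#include <stdbool.h>
#include <string.h>

static const char *const exclude_dir_contents[] = {
    "pg_stat_tmp", "pg_replslot", "pg_dynshmem",
    "pg_notify", "pg_serial", "pg_snapshots", "pg_subtrans",
    NULL
};

struct exclude_file
{
    const char *name;
    bool        match_prefix;   /* true: any filename starting with name */
};

/* Abbreviated list for illustration. */
static const struct exclude_file exclude_files[] = {
    {"postmaster.pid", false},
    {"postmaster.opts", false},
    {"backup_label", false},
    {"pg_internal.init", true},
    {NULL, false}
};

static bool
path_is_excluded(const char *path)
{
    /* Compare file rules against the last path component only. */
    const char *filename = strrchr(path, '/');

    filename = filename ? filename + 1 : path;
    for (int i = 0; exclude_files[i].name != NULL; i++)
    {
        size_t cmplen = strlen(exclude_files[i].name);

        if (!exclude_files[i].match_prefix)
            cmplen++;           /* include the terminator: exact match */
        if (strncmp(exclude_files[i].name, filename, cmplen) == 0)
            return true;
    }

    /* Directory rules: exclude anything strictly below the directory. */
    for (int i = 0; exclude_dir_contents[i] != NULL; i++)
    {
        size_t dirlen = strlen(exclude_dir_contents[i]);

        if (strncmp(path, exclude_dir_contents[i], dirlen) == 0 &&
            path[dirlen] == '/')
            return true;
    }
    return false;
}
```

Under these rules `pg_stat_tmp/global.stat` and `base/1/pg_internal.init.1234` are excluded, while `pg_stat_tmp` itself, `base/1/16384`, and `postmaster.pid.bak` are not.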
flowchart TD
    A["main()<br/>parse args, connect"] --> B["ensureCleanShutdown<br/>(postgres --single if dirty)"]
    B --> C["sanityChecks<br/>system_identifier<br/>checksums/wal_log_hints"]
    C --> D["getTimelineHistory<br/>source + target"]
    D --> E["findCommonAncestorTimeline<br/>→ divergerec"]
    E --> F{rewind<br/>needed?}
    F -- no --> Z["exit 0"]
    F -- yes --> G["findLastCheckpoint<br/>→ chkptrec / chkptredo"]
    G --> H["traverse_files source<br/>+ traverse_datadir target<br/>→ filehash populated"]
    H --> I["extractPageMap<br/>WAL chkptrec → diverge-end<br/>→ dirty page bitmaps"]
    I --> J["decide_file_actions<br/>→ sorted filemap_t"]
    J --> K["perform_rewind<br/>queue_fetch_range dirty blocks<br/>+ file-level actions"]
    K --> L["createBackupLabel<br/>update_controlfile<br/>minRecoveryPoint = source WAL end"]
    L --> M["sync_target_dir<br/>Done"]

Divergence detection + changed-block copy flow


The first flowchart shows the top-level phase sequence. This second diagram zooms into the analytical core: how the two timeline histories yield divergerec, how the backward checkpoint walk yields chkptrec, and how the forward WAL scan from chkptrec turns each modified MAIN_FORKNUM block into either a queued fetch or a discard. Note that a block is added to the overwrite bitmap only when it lies within both the source and target file bounds — anything past the source EOF is left for the later TRUNCATE/COPY_TAIL whole-file decision rather than fetched block-by-block.

flowchart TD
    A["source history[]<br/>target history[]"] --> B["findCommonAncestorTimeline<br/>parallel-walk until tli/begin mismatch"]
    B --> C["divergerec =<br/>MinXLogRecPtr(end_a, end_b)"]
    C --> D{"target_wal_endrec<br/>&gt; divergerec ?"}
    D -- no --> E["target is ancestor<br/>no rewind"]
    D -- yes --> F["findLastCheckpoint<br/>walk WAL backwards from divergerec"]
    F --> G{"RM_XLOG_ID record &&<br/>SHUTDOWN or ONLINE<br/>&& searchptr &lt; forkptr ?"}
    G -- no --> H["searchptr = record-&gt;xl_prev<br/>keepwal_add_entry for segment"]
    H --> F
    G -- yes --> I["chkptrec = searchptr<br/>chkptredo = checkPoint.redo"]
    I --> J["extractPageMap<br/>XLogReadRecord chkptrec .. endpoint"]
    J --> K["extractPageInfo<br/>for each block_id"]
    K --> L{"forknum ==<br/>MAIN_FORKNUM ?"}
    L -- no --> M["skip<br/>FSM/VM/init copied in toto"]
    L -- yes --> N["process_target_wal_block_change"]
    N --> O{"end_offset &lt;= source_size<br/>&& &lt;= target_size ?"}
    O -- no --> P["ignore<br/>truncated/removed later"]
    O -- yes --> Q["datapagemap_add(blkno_inseg)<br/>queued for queue_fetch_range"]

| Symbol | Role |
| --- | --- |
| main | Top-level: argument parsing, source init, phase sequencing |
| perform_rewind | Executes filemap actions; writes backup_label and control file |
| sanityChecks | Validates system_identifier, versions, checksums/wal_log_hints |
| getTimelineHistory | Fetches .history file and wraps rewind_parseTimeLineHistory |
| findCommonAncestorTimeline | Parallel-walks two TimeLineHistoryEntry[] arrays to find divergerec |
| ensureCleanShutdown | Runs postgres --single to force crash recovery on dirty target |
| createBackupLabel | Writes START WAL LOCATION / CHECKPOINT LOCATION marker |
| digestControlFile | Reads and CRC-checks a ControlFileData buffer |
| getRestoreCommand | Invokes postgres -C restore_command to obtain archive restore command |
| progress_report | Throttled stderr progress output (once per second) |
| ControlFile_target | Global: target’s pg_control read at startup |
| ControlFile_source | Global: source’s pg_control read at startup |
| ControlFile_source_after | Global: source’s pg_control re-read after file copy (sanity check) |
| targetHistory / targetNentries | Global: target timeline history array (used by parsexlog.c) |

| Symbol | Role |
| --- | --- |
| extractPageMap | Drives XLogReaderAllocate / XLogReadRecord loop from chkptrec to endpoint |
| extractPageInfo | Iterates block refs in one decoded WAL record; calls process_target_wal_block_change |
| findLastCheckpoint | Backward WAL scan to find the most recent checkpoint before divergerec |
| readOneRecord | Reads one record at a given LSN; used to find target_wal_endrec |
| SimpleXLogPageRead | XLogReader page-read callback; handles segment switching and archive restore |
| XLogPageReadPrivate | Per-reader state: tliIndex and restoreCommand |

| Symbol | Role |
| --- | --- |
| filehash | simplehash-based hash table keyed by relative path; holds file_entry_t records |
| file_entry_t | Per-path record: source/target size, type, link target, dirty-page bitmap, action |
| filemap_t | Final sorted array of file_entry_t * ready for execution |
| file_action_t | Enum: UNDECIDED / CREATE / COPY / COPY_TAIL / NONE / TRUNCATE / REMOVE |
| filehash_init | Allocates the hash table |
| process_source_file | Callback: records source-side file metadata into filehash |
| process_target_file | Callback: records target-side file metadata into filehash |
| process_target_wal_block_change | Callback: adds one block to a file entry’s target_pages_to_overwrite bitmap |
| decide_file_actions | Calls decide_file_action for each entry; sorts into filemap_t |
| decide_file_action | Per-entry logic: size comparison, exclusion filters, keepwal check |
| keepwal | Secondary hash table protecting WAL segment files from removal |
| keepwal_init / keepwal_add_entry | Initialise and populate the keepwal table |
| check_file_excluded | Tests against excludeFiles and excludeDirContents lists |
| isRelDataFile | Path regex: detects global/<oid>, base/<db>/<oid>, pg_tblspc/... patterns |
| calculate_totals | Sums fetch_size and total_size for progress reporting |
| final_filemap_cmp | Sort comparator: action enum order, then path; REMOVE entries reversed |

| Symbol | Role |
| --- | --- |
| datapagemap_t | Struct: bitmap byte array + bitmapsize |
| datapagemap_add | Sets bit for one block number; grows array on demand |
| datapagemap_iterate | Allocates an iterator at block 0 |
| datapagemap_next | Advances iterator; returns next set block number |

| Symbol | Role |
| --- | --- |
| open_target_file / close_target_file | Maintain the currently-open target file descriptor (dstfd) |
| write_target_range | Seeks and writes a byte range; updates fetch_done for progress |
| remove_target / create_target | Dispatch to type-specific helpers (file/dir/symlink) |
| truncate_target_file | Calls ftruncate on the target path |
| slurpFile | Reads an entire file into a malloc’d buffer (used for pg_control, history files) |
| traverse_datadir | Recursive directory walker; calls process_file_callback_t for each entry |
| sync_target_dir | Calls sync_pgdata for a two-pass fsync of the whole data directory |

| Symbol | Role |
| --- | --- |
| rewind_parseTimeLineHistory | Parses a .history file buffer into a TimeLineHistoryEntry[] array; appends a tip entry for the target TLI |

Position hints for REL_18_STABLE commit 273fe94. Symbols are the stable anchors; line numbers are hints that decay.

| Symbol | File | Approx. line |
| --- | --- | --- |
| main | src/bin/pg_rewind/pg_rewind.c | 120 |
| perform_rewind | src/bin/pg_rewind/pg_rewind.c | 554 |
| sanityChecks | src/bin/pg_rewind/pg_rewind.c | 734 |
| findCommonAncestorTimeline | src/bin/pg_rewind/pg_rewind.c | 921 |
| createBackupLabel | src/bin/pg_rewind/pg_rewind.c | 963 |
| ensureCleanShutdown | src/bin/pg_rewind/pg_rewind.c | 1130 |
| getRestoreCommand | src/bin/pg_rewind/pg_rewind.c | 1057 |
| digestControlFile | src/bin/pg_rewind/pg_rewind.c | 1024 |
| extractPageMap | src/bin/pg_rewind/parsexlog.c | 65 |
| extractPageInfo | src/bin/pg_rewind/parsexlog.c | 388 |
| findLastCheckpoint | src/bin/pg_rewind/parsexlog.c | 167 |
| readOneRecord | src/bin/pg_rewind/parsexlog.c | 123 |
| SimpleXLogPageRead | src/bin/pg_rewind/parsexlog.c | 275 |
| filehash_init | src/bin/pg_rewind/filemap.c | 196 |
| process_source_file | src/bin/pg_rewind/filemap.c | 279 |
| process_target_file | src/bin/pg_rewind/filemap.c | 315 |
| process_target_wal_block_change | src/bin/pg_rewind/filemap.c | 353 |
| decide_file_actions | src/bin/pg_rewind/filemap.c | 860 |
| decide_file_action | src/bin/pg_rewind/filemap.c | 699 |
| final_filemap_cmp | src/bin/pg_rewind/filemap.c | 679 |
| isRelDataFile | src/bin/pg_rewind/filemap.c | 570 |
| keepwal_init | src/bin/pg_rewind/filemap.c | 242 |
| datapagemap_add | src/bin/pg_rewind/datapagemap.c | 31 |
| datapagemap_iterate | src/bin/pg_rewind/datapagemap.c | 74 |
| datapagemap_next | src/bin/pg_rewind/datapagemap.c | 87 |
| traverse_datadir | src/bin/pg_rewind/file_ops.c | 384 |
| sync_target_dir | src/bin/pg_rewind/file_ops.c | 318 |
| slurpFile | src/bin/pg_rewind/file_ops.c | 337 |
| rewind_parseTimeLineHistory | src/bin/pg_rewind/timeline.c | 28 |
| rewind_source (struct) | src/bin/pg_rewind/rewind_source.h | 23 |
| file_entry_t (struct) | src/bin/pg_rewind/filemap.h | 49 |
| filemap_t (struct) | src/bin/pg_rewind/filemap.h | 89 |
| file_action_t (enum) | src/bin/pg_rewind/filemap.h | 16 |
| datapagemap_t (struct) | src/bin/pg_rewind/datapagemap.h | |

Beyond PostgreSQL — Comparative Designs & Research Frontiers


MySQL’s high-availability tooling relies on mysqldump or Percona XtraBackup for full reseeds. For GTID-based replication, a diverged node can be rejoined without a full backup only if all missing transactions are still in the binary log of the source. There is no equivalent to pg_rewind’s WAL-scan-based partial-page resync.

Patroni, the most widely deployed PostgreSQL HA stack, calls pg_rewind automatically when a former primary is detected to have diverged after a failover. Before calling pg_rewind, Patroni ensures the former primary is shut down cleanly (matching ensureCleanShutdown) and that the cluster has wal_log_hints = on or checksums enabled, which is the same precondition sanityChecks enforces on the target. The use_pg_rewind configuration option was added precisely because full pg_basebackup reseeds were too slow for large clusters.

The algorithmic core of pg_rewind — comparing two timeline history arrays to find a divergence LSN — is a special case of a more general problem studied in distributed systems: finding the last common prefix of two diverging log sequences. In the context of Raft and Paxos-based replication (see Ongaro & Ousterhout, In Search of an Understandable Consensus Algorithm, USENIX ATC 2014), this is solved by the nextIndex / matchIndex protocol. PostgreSQL’s approach is simpler because timelines are a strict tree (no concurrent forks) and the divergence point is a physical byte offset, not a logical term+index tuple.

PostgreSQL 17 introduced WAL summarization (enabled with the summarize_wal setting) and the pg_basebackup --incremental mode. Both rely on per-block change tracking files (.summary files in pg_wal/summaries/) to identify which pages changed since the last backup, which is exactly the problem that pg_rewind solves by scanning WAL. The incremental backup approach avoids WAL scan latency at backup time but requires that WAL summarization has been running continuously since the last full backup. pg_rewind’s WAL scan approach works without any prior setup, at the cost of scanning the target’s diverged WAL at resync time. Both are in scope for REL_18 (WAL summarization shipped in PG17 and is present in REL_18_STABLE).

  1. Concurrent source modification. The libpq source path explicitly handles a source that may be modified during the copy (it re-reads the control file at the end and sets minRecoveryPoint to the current WAL insert LSN). The local source path asserts that the source has not changed (a memcmp of the two control file reads with a fatal error on mismatch). Whether the local-path assertion is strictly necessary, or whether the same “set minRecoveryPoint to latest checkpoint” logic would be safe, is noted as an XXX comment in perform_rewind.

  2. Non-main fork tracking. extractPageInfo explicitly skips all forks except MAIN_FORKNUM, copying FSM, VM, and init forks in full. This is conservative and correct but means that a table with a large free-space map gets its entire FSM copied even if only one page changed in the main fork.

  3. excludeFiles / basebackup.c sync. The comment in filemap.c notes that both exclusion lists (excludeDirContents and excludeFiles) should stay in sync with basebackup.c. There is no automated enforcement of this invariant.

Source files (REL_18_STABLE, commit 273fe94)

  • src/bin/pg_rewind/pg_rewind.c — main orchestrator
  • src/bin/pg_rewind/parsexlog.c — WAL reader (extractPageMap, findLastCheckpoint)
  • src/bin/pg_rewind/filemap.c — file decision table (filehash, decide_file_actions)
  • src/bin/pg_rewind/rewind_source.h — source vtable definition
  • src/bin/pg_rewind/file_ops.c — target-side I/O (traverse_datadir, sync_target_dir)
  • src/bin/pg_rewind/timeline.c — timeline history parser
  • src/bin/pg_rewind/datapagemap.c — block bitmap
  • src/bin/pg_rewind/filemap.h — file_entry_t, filemap_t, file_action_t
  • knowledge/code-analysis/postgres/postgres-recovery-redo.md — WAL replay on target startup after rewind
  • knowledge/code-analysis/postgres/postgres-xlog-wal.md — WAL insertion, LSN model, full-page writes
  • knowledge/code-analysis/postgres/postgres-checkpoint.md — checkpoint mechanics; chkptredo anchor used by rewind
  • knowledge/code-analysis/postgres/postgres-wal-records-rmgr.md — rmgr dispatch; WAL record structure decoded by parsexlog.c
  • knowledge/code-analysis/postgres/postgres-incremental-backup.md — WAL summarization; alternative change-tracking approach