PostgreSQL pg_rewind — Timeline Divergence Detection and Data Directory Resync
Contents:
- Theoretical Background
- Common DBMS Design
- PostgreSQL’s Approach
- Source Walkthrough
- Source verification (as of 2026-06-05)
- Beyond PostgreSQL — Comparative Designs & Research Frontiers
- Sources
Theoretical Background
High-availability deployments of any DBMS depend on the ability to rejoin a node that has fallen behind or diverged. The difficulty is that two nodes starting from the same checkpoint can each accept writes after a network partition, after a mis-fenced primary, or after a planned switchover gone wrong. When the dust settles, one node holds the authoritative history; the other holds a diverged history that can never be merged with the new timeline and must be discarded.
Timeline divergence
PostgreSQL represents its WAL stream as a linear sequence of
timelines. Each promotion increments the timeline ID (TLI) and
records the fork point in a .history file: <TLI>.history maps each
predecessor TLI to the LSN at which that timeline ended. The
TimeLineHistoryEntry array built from that file is how the system
answers “at which LSN did timelines A and B last share the same WAL
byte?”.
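As a rough illustration of the history lookup described above, the sketch below parses one line of a history file into a (TLI, end LSN) pair. It assumes the line layout `parentTLI<TAB>switchpoint<TAB>reason`, with the switchpoint printed as two hexadecimal halves; `history_entry` and `parse_history_line` are hypothetical names chosen for this example and are not the ones used in timeline.c.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-in for one parsed history line. */
typedef struct history_entry
{
    uint32_t tli;   /* timeline that ended */
    uint64_t end;   /* LSN at which it ended (the fork point) */
} history_entry;

/*
 * Parse one history line such as "1\t0/3000060\tsome reason".
 * Returns 1 on success, 0 if the line is blank, a comment, or malformed.
 */
static int
parse_history_line(const char *line, history_entry *out)
{
    unsigned int tli, hi, lo;

    /* Skip leading whitespace, then ignore blank lines and '#' comments. */
    while (*line == ' ' || *line == '\t')
        line++;
    if (*line == '\0' || *line == '\n' || *line == '#')
        return 0;

    /* The LSN is written as <high 32 bits>/<low 32 bits>, both in hex. */
    if (sscanf(line, "%u\t%X/%X", &tli, &hi, &lo) != 3)
        return 0;

    out->tli = tli;
    out->end = ((uint64_t) hi << 32) | lo;
    return 1;
}
```

Calling `parse_history_line` on each line of a history file in order would yield an array of entries in which each `end` value is the starting point of the next timeline.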
The fundamental relationship is:
```
target LSN stream: ... [common ancestor checkpoint C] ---> [diverged WAL D]
source LSN stream: ... [common ancestor checkpoint C] ---> [new WAL after promotion]
```
After the divergence point the two streams are independent. A rejoin operation must:
- Identify C — the last common checkpoint before divergence.
- Discover the set of data pages the target modified in region D (diverged WAL) so those pages can be overwritten from the source.
- Copy the necessary files and pages from source to target.
- Leave a marker that tells the target where to begin WAL replay from the source, so the remaining gap is closed on next startup.
This is distinct from crash recovery (which replays a single node’s
own WAL) and from streaming replication (which requires the target
never to have diverged). pg_rewind fills the gap between those two:
a diverged but otherwise healthy node that can be made current at lower
cost than a full pg_basebackup.
Why not just pg_basebackup again?
A full base backup copies every byte of the data directory — typically tens to hundreds of gigabytes. pg_rewind copies only the pages the target dirtied after divergence, plus any non-relation files that differ. For a node that diverged recently (minutes to hours) the transfer is orders of magnitude smaller, and the former primary can rejoin as a standby without the network cost of a full reseed.
The trade-off: pg_rewind requires that the WAL from the last common
checkpoint onward still exist on the target (or be recoverable from an
archive), and that the target uses data checksums or wal_log_hints = on,
so that hint-bit updates are WAL-logged and the WAL scan does not miss
pages changed only by hint bits.
Relationship to ARIES and WAL replay
pg_rewind does not run WAL replay itself. It leaves the target in a
state that looks like a base backup begun at checkpoint C: the control
file’s minRecoveryPoint is set to the source’s current WAL end, and
a backup_label file names checkpoint C as the starting point for
replay. Normal PostgreSQL recovery (startup process) then replays the
source’s WAL from C forward, closing the gap. This is the same model
used by pg_basebackup --wal-method=stream: the backup tool transfers
files, and recovery does the WAL work.
Common DBMS Design
Source abstraction (vtable pattern)
Most replication tools must handle at least two data sources: a live
running server and a local directory. The universal pattern is a thin
vtable (or interface) that wraps the difference. The caller only sees
operations like “list files”, “fetch this byte range”, “get current WAL
position”; the backend decides whether to satisfy the request via
SQL queries over libpq or via direct file I/O. PostgreSQL uses this
pattern in rewind_source.
Dirty-page tracking via WAL scan
The insight that makes partial resync possible: the target’s own WAL
records precisely which pages it touched after divergence. Reading the
WAL from C to the target’s end-of-WAL yields a complete set of
(relfilelocator, forknum, blockno) tuples — the exact blocks that
must be overwritten. This avoids block-level checksumming of the entire
data directory, which would require reading every page on both sides.
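To make the block tuple concrete, the sketch below maps a relation block number to the segment file and byte offset that would have to be overwritten. It assumes the default build constants (8 kB pages, 131072 blocks per 1 GB segment), both of which are configurable at build time; `SKETCH_BLCKSZ`, `SKETCH_RELSEG_SIZE`, `block_location`, and `locate_block` are hypothetical names for this example only.

```c
#include <stdint.h>

#define SKETCH_BLCKSZ      8192u     /* assumed default page size */
#define SKETCH_RELSEG_SIZE 131072u   /* assumed default: blocks per 1 GB segment */

typedef struct block_location
{
    uint32_t segno;        /* 0 => "<relfilenode>", n => "<relfilenode>.n" */
    uint32_t blkno_inseg;  /* block index within that segment file */
    uint64_t offset;       /* byte offset within the segment file */
} block_location;

/* Translate a relation-wide block number into a per-segment location. */
static block_location
locate_block(uint32_t blkno)
{
    block_location loc;

    loc.segno = blkno / SKETCH_RELSEG_SIZE;
    loc.blkno_inseg = blkno % SKETCH_RELSEG_SIZE;
    loc.offset = (uint64_t) loc.blkno_inseg * SKETCH_BLCKSZ;
    return loc;
}
```

Under these assumptions, block 131073 of a relation lands in the second segment file (suffix `.1`) at block 1, byte offset 8192.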
File-action decision table
Any resync tool must decide, for each path in the data directory, one of a fixed set of actions: copy the whole file, copy only a tail extension, truncate, remove, or leave alone. Relation data files and non-relation files get different treatment because WAL tracking is available only for the main fork of relation files; everything else must be copied in full when it differs.
Safe action ordering
Applying file actions in the wrong order can leave the directory in an unrecoverable state. The universal safe ordering is: create directories before writing their contents, copy files before removing old ones, remove leaf entries before their parent directories. Encoding the ordering directly in the action enum (so that sorting by action value yields the correct sequence) is a clean implementation technique.
PostgreSQL’s Approach
Architecture overview
pg_rewind is a standalone client-side binary (src/bin/pg_rewind/).
It does not link against the server backend; it uses libpq for the
live-source path and re-implements the handful of backend functions it
needs (WAL reading, timeline history parsing) in pure frontend code.
The main driver is pg_rewind.c; the principal supporting modules are
parsexlog.c (WAL scan), filemap.c (file decision table),
file_ops.c (target-side I/O), and timeline.c (history parsing).
```
// main (src/bin/pg_rewind/pg_rewind.c)
main()
 ├── init_libpq_source / init_local_source   ← source vtable
 ├── ensureCleanShutdown                     ← single-user postgres if dirty
 ├── sanityChecks                            ← system_identifier, checksums
 ├── getTimelineHistory × 2                  ← source + target TLI arrays
 ├── findCommonAncestorTimeline              ← → divergerec
 ├── findLastCheckpoint                      ← → chkptrec / chkptredo
 ├── filehash_init + traverse_files × 2      ← populate hash
 ├── extractPageMap                          ← WAL scan → dirty-page bitmaps
 ├── decide_file_actions → filemap_t         ← per-file action enum
 └── perform_rewind                          ← copy + backup_label + control file
```
Phase 1 — preconditions and timeline divergence
main() first ensures the target is cleanly shut down. If the control
file shows a state other than DB_SHUTDOWNED or DB_SHUTDOWNED_IN_RECOVERY
and --no-ensure-shutdown is not set, ensureCleanShutdown launches
postgres --single to finish crash recovery before pg_rewind touches
anything. This is a safety measure: an unclean target might have an
inconsistent page state that confuses the subsequent WAL scan.
sanityChecks then verifies three invariants:
- system_identifier matches: source and target came from the same initdb cluster.
- pg_control_version and catalog_version_no match the compiled constant: same major PostgreSQL version.
- The target has checksums or wal_log_hints: without one of these, hint-bit updates are invisible in WAL and pg_rewind would miss modified pages.
```c
// sanityChecks (src/bin/pg_rewind/pg_rewind.c)
if (ControlFile_target.system_identifier !=
    ControlFile_source.system_identifier)
    pg_fatal("source and target clusters are from different systems");

if (ControlFile_target.data_checksum_version != PG_DATA_CHECKSUM_VERSION &&
    !ControlFile_target.wal_log_hints)
    pg_fatal("target server needs to use either data checksums or "
             "\"wal_log_hints = on\"");
```
Timeline history is fetched for both sides via getTimelineHistory.
For TLI 1 (no history file) a synthetic single-entry array is
returned. For higher TLIs the .history file is fetched from the
appropriate source and parsed by rewind_parseTimeLineHistory into a
TimeLineHistoryEntry[] array.
findCommonAncestorTimeline walks both arrays in parallel, stopping at
the first tli or begin mismatch. It then steps back to the last matching
entry; the MinXLogRecPtr of the two end fields of that entry is
divergerec, the LSN at which the histories parted.
```c
// findCommonAncestorTimeline (src/bin/pg_rewind/pg_rewind.c)
for (i = 0; i < n; i++)
{
    if (a_history[i].tli != b_history[i].tli ||
        a_history[i].begin != b_history[i].begin)
        break;
}

if (i > 0)
{
    i--;
    *recptr = MinXLogRecPtr(a_history[i].end, b_history[i].end);
    *tliIndex = i;
    return;
}
```
If target_wal_endrec <= divergerec the target is already an ancestor
of the source (it never wrote beyond the fork point), so no rewind is
needed. If target_wal_endrec > divergerec a rewind is required.
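The parallel walk can be tried in isolation with the self-contained sketch below. It is a simplified model and not the pg_rewind code: `tl_entry`, `min_lsn`, and `find_common_ancestor` are hypothetical names, and the sketch assumes that an `end` value of 0 marks the open-ended tip of a timeline.

```c
#include <stdint.h>

/* Hypothetical, simplified timeline history entry. */
typedef struct tl_entry
{
    uint32_t tli;    /* timeline ID */
    uint64_t begin;  /* LSN where this timeline starts */
    uint64_t end;    /* LSN where it ends; 0 = still open (assumption) */
} tl_entry;

/* Minimum of two LSNs, treating 0 as "no end recorded". */
static uint64_t
min_lsn(uint64_t a, uint64_t b)
{
    if (a == 0)
        return b;
    if (b == 0)
        return a;
    return a < b ? a : b;
}

/*
 * Walk both histories in parallel. Returns the index of the last shared
 * entry and stores the divergence LSN, or returns -1 if nothing is shared.
 */
static int
find_common_ancestor(const tl_entry *a, int a_n,
                     const tl_entry *b, int b_n,
                     uint64_t *divergence)
{
    int n = a_n < b_n ? a_n : b_n;
    int i;

    for (i = 0; i < n; i++)
    {
        if (a[i].tli != b[i].tli || a[i].begin != b[i].begin)
            break;
    }
    if (i == 0)
        return -1;

    i--;    /* step back to the last matching entry */
    *divergence = min_lsn(a[i].end, b[i].end);
    return i;
}
```

For example, if the source history is TLI 1 ending at 0/3000000 followed by an open TLI 2, and the target is still on an open TLI 1, the function returns index 0 with a divergence LSN of 0/3000000.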
Phase 2 — last common checkpoint
findLastCheckpoint (in parsexlog.c) walks the target’s WAL
backwards from divergerec, looking for the most recent
XLOG_CHECKPOINT_SHUTDOWN or XLOG_CHECKPOINT_ONLINE record that
precedes the fork. This is the rewind start point chkptrec; the
checkpoint’s redo field gives chkptredo.
Walking backwards is correct because pg_rewind does not care about
in-flight transactions; it only needs a checkpoint from which WAL
replay can reconstruct a consistent state. The backward walk also
records each WAL segment file it reads, adding it to
the keepwal hash so that it is not accidentally removed during the file
reconciliation phase.
```c
// findLastCheckpoint (src/bin/pg_rewind/parsexlog.c)
info = XLogRecGetInfo(xlogreader) & ~XLR_INFO_MASK;
if (searchptr < forkptr &&
    XLogRecGetRmid(xlogreader) == RM_XLOG_ID &&
    (info == XLOG_CHECKPOINT_SHUTDOWN ||
     info == XLOG_CHECKPOINT_ONLINE))
{
    memcpy(&checkPoint, XLogRecGetData(xlogreader), sizeof(CheckPoint));
    *lastchkptrec = searchptr;
    *lastchkpttli = checkPoint.ThisTimeLineID;
    *lastchkptredo = checkPoint.redo;
    break;
}

/* Walk backwards to previous record. */
searchptr = record->xl_prev;
```
Phase 3 — file inventory and WAL page map
traverse_files is called on the source (via the rewind_source
vtable) and traverse_datadir on the target. Both call back into
process_source_file and process_target_file respectively, each of
which inserts or updates a file_entry_t record in the filehash hash
table keyed by relative path.
```c
// process_source_file (src/bin/pg_rewind/filemap.c)
entry = insert_filehash_entry(path);
entry->source_exists = true;
entry->source_type = type;
entry->source_size = size;
entry->source_link_target = link_target ? pg_strdup(link_target) : NULL;
```
extractPageMap then reads the target’s WAL from chkptrec to
target_wal_endrec using XLogReaderAllocate / XLogReadRecord.
For each decoded record it calls extractPageInfo, which iterates over
the record’s block references and calls
process_target_wal_block_change for each MAIN_FORKNUM block.
```c
// extractPageInfo (src/bin/pg_rewind/parsexlog.c)
for (block_id = 0; block_id <= XLogRecMaxBlockId(record); block_id++)
{
    if (!XLogRecGetBlockTagExtended(record, block_id,
                                    &rlocator, &forknum, &blkno, NULL))
        continue;

    /* Only the main fork is tracked; other forks are copied in full. */
    if (forknum != MAIN_FORKNUM)
        continue;

    process_target_wal_block_change(forknum, rlocator, blkno);
}
```
process_target_wal_block_change looks up the file entry for the
corresponding data segment and adds blkno_inseg to the entry’s
target_pages_to_overwrite bitmap, but only if the block lies within
both the source and the target file bounds (end_offset <= source_size
and end_offset <= target_size). Blocks beyond the source EOF will be
truncated away anyway, and blocks beyond the target EOF arrive with the
copied tail, so there is no point fetching either individually.
```c
// process_target_wal_block_change (src/bin/pg_rewind/filemap.c)
end_offset = (blkno_inseg + 1) * BLCKSZ;
if (end_offset <= entry->source_size && end_offset <= entry->target_size)
    datapagemap_add(&entry->target_pages_to_overwrite, blkno_inseg);
```
The datapagemap_t is a compact byte bitmap: bit N represents block N
within its segment. The bitmap grows on demand with a small headroom
so that sequential block insertions do not trigger per-block realloc.
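A grow-on-demand block bitmap of this kind can be sketched in a few lines. The version below is an illustration under stated assumptions, not the datapagemap.c code: `pagemap`, `pagemap_add`, and `pagemap_next` are hypothetical names, and the 10-byte headroom is simply a small constant chosen for the example.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical bitmap: bit N of the byte array stands for block N. */
typedef struct pagemap
{
    unsigned char *bits;
    size_t         size;   /* allocated size in bytes */
} pagemap;

/* Set the bit for blkno, growing the array if needed. Returns 0 on success. */
static int
pagemap_add(pagemap *map, uint32_t blkno)
{
    size_t byte = blkno / 8;

    if (byte >= map->size)
    {
        /* Grow with a little headroom so sequential adds rarely realloc. */
        size_t newsize = byte + 1 + 10;
        unsigned char *p = realloc(map->bits, newsize);

        if (p == NULL)
            return -1;
        memset(p + map->size, 0, newsize - map->size);
        map->bits = p;
        map->size = newsize;
    }
    map->bits[byte] |= (unsigned char) (1u << (blkno % 8));
    return 0;
}

/*
 * Return 1 and store the next set block at or after *cursor, advancing the
 * cursor past it; return 0 when no set bits remain.
 */
static int
pagemap_next(const pagemap *map, uint32_t *cursor, uint32_t *blkno)
{
    while ((size_t) (*cursor / 8) < map->size)
    {
        uint32_t b = (*cursor)++;

        if (map->bits[b / 8] & (1u << (b % 8)))
        {
            *blkno = b;
            return 1;
        }
    }
    return 0;
}
```

Starting from a zero-initialized `pagemap`, adding blocks 3, 100, and 3 again and then iterating from cursor 0 yields 3 and 100 exactly once each, since setting a bit twice has no further effect.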
Phase 4 — file action decision
decide_file_actions() iterates the hash table, calling
decide_file_action() for each entry. The decision logic is:
| Condition | Action |
|---|---|
| path matches excludeFiles list | REMOVE (if present on target) or NONE |
| source only (new on source) | COPY or CREATE (dir/symlink) |
| target only (deleted on source) | REMOVE, unless in keepwal |
| both exist, non-relation file | COPY |
| both exist, relation, target larger | TRUNCATE |
| both exist, relation, target smaller | COPY_TAIL |
| both exist, relation, same size | NONE (dirty pages handled separately) |
Special-cased paths include XLOG_CONTROL_FILE (deferred to end of
perform_rewind), PG_VERSION (never overwritten), .DS_Store, and
directories/symlinks that exist on both sides (no action needed for the
entry itself).
```c
// decide_file_action (src/bin/pg_rewind/filemap.c)
if (strcmp(path, XLOG_CONTROL_FILE) == 0)
    return FILE_ACTION_NONE;    /* handled separately at end */

if (!entry->target_exists && entry->source_exists)
{
    switch (entry->source_type)
    {
        case FILE_TYPE_DIRECTORY:
            return FILE_ACTION_CREATE;
        case FILE_TYPE_REGULAR:
            return FILE_ACTION_COPY;
        // ... symlink → CREATE
    }
}
else if (entry->target_exists && !entry->source_exists)
{
    if (keepwal_entry_exists(path))
        return FILE_ACTION_NONE;    /* WAL needed for recovery */
    return FILE_ACTION_REMOVE;
}
```
For a relation data file present on both sides, the action is decided
purely by comparing the two segment sizes. A target that is larger has
extra blocks the source truncated away, so it is truncated now (WAL
replay would do the same later); a target that is smaller is missing a
tail the source has, so the tail is copied; equal sizes need no
whole-file action because per-block changes are already captured in the
target_pages_to_overwrite bitmap from the WAL scan.
```c
// decide_file_action (src/bin/pg_rewind/filemap.c)
case FILE_TYPE_REGULAR:
    if (!entry->isrelfile)
        return FILE_ACTION_COPY;            /* non-data file: copy in toto */
    else
    {
        if (entry->target_size < entry->source_size)
            return FILE_ACTION_COPY_TAIL;   /* fetch the missing tail */
        else if (entry->target_size > entry->source_size)
            return FILE_ACTION_TRUNCATE;    /* drop the extra blocks */
        else
            return FILE_ACTION_NONE;        /* dirty pages handled via bitmap */
    }
```
The file_action_t enum is deliberately ordered so that
qsort-by-action produces the safe execution sequence: CREATE first
(so parent directories exist before children), then COPY/COPY_TAIL,
then NONE, then TRUNCATE, then REMOVE. REMOVE entries are secondarily
sorted in reverse-path order so foo/bar is removed before foo.
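The ordering rule can be illustrated with a small comparator. This is a sketch rather than the actual final_filemap_cmp: `sketch_action`, `sketch_entry`, `sketch_entry_cmp`, and `sort_actions` are hypothetical names, and the sketch sorts an array of structs directly for brevity.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical action enum; the numeric order is the execution order. */
typedef enum
{
    ACT_CREATE = 1,
    ACT_COPY,
    ACT_COPY_TAIL,
    ACT_NONE,
    ACT_TRUNCATE,
    ACT_REMOVE
} sketch_action;

typedef struct sketch_entry
{
    const char   *path;
    sketch_action action;
} sketch_entry;

/* Order by action first; within REMOVE, reverse the path order. */
static int
sketch_entry_cmp(const void *a, const void *b)
{
    const sketch_entry *fa = a;
    const sketch_entry *fb = b;

    if (fa->action != fb->action)
        return fa->action < fb->action ? -1 : 1;

    /* Children sort after parents, so reversing removes children first. */
    if (fa->action == ACT_REMOVE)
        return strcmp(fb->path, fa->path);
    return strcmp(fa->path, fb->path);
}

static void
sort_actions(sketch_entry *entries, size_t n)
{
    qsort(entries, n, sizeof(entries[0]), sketch_entry_cmp);
}
```

Sorting the entries `old` (REMOVE), `new/file` (COPY), `old/x` (REMOVE), and `new` (CREATE) produces `new`, `new/file`, `old/x`, `old`: the directory is created before its file is copied, and the child is removed before its parent.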
Phase 5 — perform_rewind: copy and control file update
perform_rewind iterates the sorted filemap_t. For each entry with a
non-empty target_pages_to_overwrite bitmap it iterates the bitmap and
calls source->queue_fetch_range for each dirty block. Then it
dispatches the per-file action: COPY → queue_fetch_file, COPY_TAIL →
queue_fetch_range for the tail, TRUNCATE → truncate_target_file,
REMOVE → remove_target, CREATE → create_target.
```c
// perform_rewind (src/bin/pg_rewind/pg_rewind.c)
iter = datapagemap_iterate(&entry->target_pages_to_overwrite);
while (datapagemap_next(iter, &blkno))
{
    offset = blkno * BLCKSZ;
    source->queue_fetch_range(source, entry->path, offset, BLCKSZ);
}
// ... then switch(entry->action) for whole-file operations

source->finish_fetch(source);
```
After all file operations, pg_rewind re-fetches the source’s control
file and constructs a ControlFile_new:
- state is set to DB_IN_ARCHIVE_RECOVERY.
- minRecoveryPoint is set to the source’s current WAL insert LSN (for a live primary source) or its latest checkpoint (for a local directory source); this is how far the target must replay before it can accept connections.
- minRecoveryPointTLI is set to the source’s TLI.
A backup_label file is written naming chkptredo as the START WAL LOCATION. This causes the startup process to begin WAL replay from the
last common checkpoint, exactly as if pg_rewind had been a base backup
started at that point.
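The two location fields in the label can be reproduced with a short sketch. It assumes the default 16 MB WAL segment size (configurable at initdb time) and uses hypothetical helper names, `format_lsn` and `wal_segment_name`; it only mirrors the textual formats and is not the pg_rewind code.

```c
#include <stdint.h>
#include <stdio.h>

#define SKETCH_WAL_SEG_SIZE (16u * 1024 * 1024)   /* assumed default: 16 MB */

/* Format an LSN as high and low 32-bit halves in hex, e.g. "0/3000028". */
static int
format_lsn(uint64_t lsn, char *buf, size_t buflen)
{
    return snprintf(buf, buflen, "%X/%X",
                    (unsigned int) (lsn >> 32), (unsigned int) lsn);
}

/* Name of the WAL segment file that contains lsn on timeline tli. */
static int
wal_segment_name(uint32_t tli, uint64_t lsn, char *buf, size_t buflen)
{
    uint64_t segno = lsn / SKETCH_WAL_SEG_SIZE;
    uint64_t segs_per_id = UINT64_C(0x100000000) / SKETCH_WAL_SEG_SIZE;

    return snprintf(buf, buflen, "%08X%08X%08X",
                    (unsigned int) tli,
                    (unsigned int) (segno / segs_per_id),
                    (unsigned int) (segno % segs_per_id));
}
```

With these assumptions, a redo pointer of 0x3000028 on timeline 1 formats as `0/3000028` and falls in segment file `000000010000000000000003`.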
```c
// createBackupLabel (src/bin/pg_rewind/pg_rewind.c)
len = snprintf(buf, sizeof(buf),
               "START WAL LOCATION: %X/%X (file %s)\n"
               "CHECKPOINT LOCATION: %X/%X\n"
               "BACKUP METHOD: pg_rewind\n"
               "BACKUP FROM: standby\n"
               "START TIME: %s\n",
               LSN_FORMAT_ARGS(startpoint), xlogfilename,
               LSN_FORMAT_ARGS(checkpointloc),
               strfbuf);
```
Source abstraction: rewind_source vtable
The rewind_source struct is a C vtable with seven function pointers:
```c
// rewind_source (src/bin/pg_rewind/rewind_source.h)
typedef struct rewind_source
{
    void (*traverse_files)(struct rewind_source *,
                           process_file_callback_t callback);
    char *(*fetch_file)(struct rewind_source *, const char *path,
                        size_t *filesize);
    void (*queue_fetch_range)(struct rewind_source *, const char *path,
                              off_t offset, size_t len);
    void (*queue_fetch_file)(struct rewind_source *, const char *path,
                             size_t len);
    void (*finish_fetch)(struct rewind_source *);
    XLogRecPtr (*get_current_wal_insert_lsn)(struct rewind_source *);
    void (*destroy)(struct rewind_source *);
} rewind_source;
```
init_libpq_source (libpq_source.c) implements the vtable using
SQL queries, including pg_read_binary_file() calls, over a live libpq
connection. It batches range-fetch requests and flushes them in
finish_fetch. init_local_source (local_source.c) implements it
with direct filesystem calls.
Both backends are transparent to the caller; all phases above use only the vtable interface.
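The pattern can be shown in miniature with a toy interface that has a single operation and an in-memory backend. Everything in this sketch is hypothetical (`sketch_source`, `mem_file`, `mem_fetch_range`, `make_mem_source`, `copy_block`); it is far smaller than rewind_source and fetches synchronously instead of queueing.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical one-method vtable: the caller never sees the backend. */
typedef struct sketch_source sketch_source;
struct sketch_source
{
    /* Copy up to len bytes at offset of path into buf; returns bytes copied. */
    size_t (*fetch_range)(sketch_source *src, const char *path,
                          size_t offset, char *buf, size_t len);
    void   *priv;   /* backend-specific state */
};

/* Toy backend: a single file held in memory. */
typedef struct mem_file
{
    const char *path;
    const char *data;
    size_t      size;
} mem_file;

static size_t
mem_fetch_range(sketch_source *src, const char *path,
                size_t offset, char *buf, size_t len)
{
    const mem_file *f = src->priv;

    if (strcmp(f->path, path) != 0 || offset >= f->size)
        return 0;
    if (len > f->size - offset)
        len = f->size - offset;     /* clamp a short read at EOF */
    memcpy(buf, f->data + offset, len);
    return len;
}

static sketch_source
make_mem_source(mem_file *f)
{
    sketch_source s = { mem_fetch_range, f };
    return s;
}

/* Caller code: depends only on the interface, not on the backend. */
static size_t
copy_block(sketch_source *src, const char *path,
           size_t blkno, size_t blcksz, char *out)
{
    return src->fetch_range(src, path, blkno * blcksz, out, blcksz);
}
```

A second backend (for example one that reads from disk) would supply a different `fetch_range` function and leave `copy_block` unchanged, which is the property the two real backends rely on.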
Excluded paths
filemap.c maintains two exclusion lists. excludeDirContents names
directories whose contents are skipped, typically because the server rebuilds them at startup
(pg_stat_tmp, pg_replslot, pg_dynshmem, pg_notify,
pg_serial, pg_snapshots, pg_subtrans). excludeFiles names
specific filenames that should not be copied (postmaster.pid,
postmaster.opts, backup_label, backup_manifest, pg_internal.init
prefix, postgresql.auto.conf.tmp, current_logfiles.tmp). These
lists are documented as needing to stay in sync with basebackup.c.
```c
// excludeDirContents (src/bin/pg_rewind/filemap.c)
static const char *const excludeDirContents[] = {
    "pg_stat_tmp",
    "pg_replslot",
    "pg_dynshmem",
    "pg_notify",
    "pg_serial",
    "pg_snapshots",
    "pg_subtrans",
    NULL
};
```
Data flow diagram
```mermaid
flowchart TD
    A["main()<br/>parse args, connect"] --> B["ensureCleanShutdown<br/>(postgres --single if dirty)"]
    B --> C["sanityChecks<br/>system_identifier<br/>checksums/wal_log_hints"]
    C --> D["getTimelineHistory<br/>source + target"]
    D --> E["findCommonAncestorTimeline<br/>→ divergerec"]
    E --> F{rewind<br/>needed?}
    F -- no --> Z["exit 0"]
    F -- yes --> G["findLastCheckpoint<br/>→ chkptrec / chkptredo"]
    G --> H["traverse_files source<br/>+ traverse_datadir target<br/>→ filehash populated"]
    H --> I["extractPageMap<br/>WAL chkptrec → diverge-end<br/>→ dirty page bitmaps"]
    I --> J["decide_file_actions<br/>→ sorted filemap_t"]
    J --> K["perform_rewind<br/>queue_fetch_range dirty blocks<br/>+ file-level actions"]
    K --> L["createBackupLabel<br/>update_controlfile<br/>minRecoveryPoint = source WAL end"]
    L --> M["sync_target_dir<br/>Done"]
```
Divergence detection + changed-block copy flow
The first flowchart shows the top-level phase sequence. This second
diagram zooms into the analytical core: how the two timeline histories
yield divergerec, how the backward checkpoint walk yields chkptrec,
and how the forward WAL scan from chkptrec turns each modified
MAIN_FORKNUM block into either a queued fetch or a discard. Note that
a block is added to the overwrite bitmap only when it lies within
both the source and target file bounds — anything past the source EOF
is left for the later TRUNCATE/COPY_TAIL whole-file decision rather than
fetched block-by-block.
```mermaid
flowchart TD
    A["source history[]<br/>target history[]"] --> B["findCommonAncestorTimeline<br/>parallel-walk until tli/begin mismatch"]
    B --> C["divergerec =<br/>MinXLogRecPtr(end_a, end_b)"]
    C --> D{"target_wal_endrec<br/>> divergerec ?"}
    D -- no --> E["target is ancestor<br/>no rewind"]
    D -- yes --> F["findLastCheckpoint<br/>walk WAL backwards from divergerec"]
    F --> G{"RM_XLOG_ID record &&<br/>SHUTDOWN or ONLINE<br/>&& searchptr < forkptr ?"}
    G -- no --> H["searchptr = record->xl_prev<br/>keepwal_add_entry for segment"]
    H --> F
    G -- yes --> I["chkptrec = searchptr<br/>chkptredo = checkPoint.redo"]
    I --> J["extractPageMap<br/>XLogReadRecord chkptrec .. endpoint"]
    J --> K["extractPageInfo<br/>for each block_id"]
    K --> L{"forknum ==<br/>MAIN_FORKNUM ?"}
    L -- no --> M["skip<br/>FSM/VM/init copied in toto"]
    L -- yes --> N["process_target_wal_block_change"]
    N --> O{"end_offset <= source_size<br/>&& <= target_size ?"}
    O -- no --> P["ignore<br/>truncated/removed later"]
    O -- yes --> Q["datapagemap_add(blkno_inseg)<br/>queued for queue_fetch_range"]
```
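The last decision node in the diagram above reduces to one comparison per block. The sketch below works it through with concrete sizes; it assumes the default 8 kB page size, and `SKETCH_BLCKSZ` and `block_needs_fetch` are hypothetical names for this example.

```c
#include <stdint.h>

#define SKETCH_BLCKSZ 8192u   /* assumed default page size */

/*
 * Return 1 if a block touched by the target's WAL must be fetched
 * individually, i.e. it lies fully inside both files; 0 otherwise.
 */
static int
block_needs_fetch(uint32_t blkno_inseg, uint64_t source_size,
                  uint64_t target_size)
{
    uint64_t end_offset = ((uint64_t) blkno_inseg + 1) * SKETCH_BLCKSZ;

    return end_offset <= source_size && end_offset <= target_size;
}
```

With a 3-block source (24576 bytes) and a 5-block target (40960 bytes), block 2 ends at byte 24576 and is fetched, while block 3 ends at byte 32768, past the source EOF, and is skipped because the target file will be truncated. With the sizes swapped, block 4 is skipped because it lies past the target EOF and is covered by the tail copy.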
Source Walkthrough
pg_rewind.c — main orchestrator
| Symbol | Role |
|---|---|
| main | Top-level: argument parsing, source init, phase sequencing |
| perform_rewind | Executes filemap actions; writes backup_label and control file |
| sanityChecks | Validates system_identifier, versions, checksums/wal_log_hints |
| getTimelineHistory | Fetches .history file and wraps rewind_parseTimeLineHistory |
| findCommonAncestorTimeline | Parallel-walks two TimeLineHistoryEntry[] arrays to find divergerec |
| ensureCleanShutdown | Runs postgres --single to force crash recovery on dirty target |
| createBackupLabel | Writes START WAL LOCATION / CHECKPOINT LOCATION marker |
| digestControlFile | Reads and CRC-checks a ControlFileData buffer |
| getRestoreCommand | Invokes postgres -C restore_command to obtain archive restore command |
| progress_report | Throttled stderr progress output (once per second) |
| ControlFile_target | Global: target’s pg_control read at startup |
| ControlFile_source | Global: source’s pg_control read at startup |
| ControlFile_source_after | Global: source’s pg_control re-read after file copy (sanity check) |
| targetHistory / targetNentries | Global: target timeline history array (used by parsexlog.c) |
parsexlog.c — WAL reader
| Symbol | Role |
|---|---|
| extractPageMap | Drives XLogReaderAllocate / XLogReadRecord loop from chkptrec to endpoint |
| extractPageInfo | Iterates block refs in one decoded WAL record; calls process_target_wal_block_change |
| findLastCheckpoint | Backward WAL scan to find the most recent checkpoint before divergerec |
| readOneRecord | Reads one record at a given LSN; used to find target_wal_endrec |
| SimpleXLogPageRead | XLogReader page-read callback; handles segment switching and archive restore |
| XLogPageReadPrivate | Per-reader state: tliIndex and restoreCommand |
filemap.c — file decision table
| Symbol | Role |
|---|---|
| filehash | simplehash-based hash table keyed by relative path; holds file_entry_t records |
| file_entry_t | Per-path record: source/target size, type, link target, dirty-page bitmap, action |
| filemap_t | Final sorted array of file_entry_t * ready for execution |
| file_action_t | Enum: UNDECIDED / CREATE / COPY / COPY_TAIL / NONE / TRUNCATE / REMOVE |
| filehash_init | Allocates the hash table |
| process_source_file | Callback: records source-side file metadata into filehash |
| process_target_file | Callback: records target-side file metadata into filehash |
| process_target_wal_block_change | Callback: adds one block to a file entry’s target_pages_to_overwrite bitmap |
| decide_file_actions | Calls decide_file_action for each entry; sorts into filemap_t |
| decide_file_action | Per-entry logic: size comparison, exclusion filters, keepwal check |
| keepwal | Secondary hash table protecting WAL segment files from removal |
| keepwal_init / keepwal_add_entry | Initialise and populate the keepwal table |
| check_file_excluded | Tests against excludeFiles and excludeDirContents lists |
| isRelDataFile | Path pattern check: detects global/<oid>, base/<db>/<oid>, pg_tblspc/... patterns |
| calculate_totals | Sums fetch_size and total_size for progress reporting |
| final_filemap_cmp | Sort comparator: action enum order, then path; REMOVE entries reversed |
datapagemap.c — block bitmap
| Symbol | Role |
|---|---|
| datapagemap_t | Struct: bitmap byte array + bitmapsize |
| datapagemap_add | Sets bit for one block number; grows array on demand |
| datapagemap_iterate | Allocates an iterator at block 0 |
| datapagemap_next | Advances iterator; returns next set block number |
file_ops.c — target-side I/O
| Symbol | Role |
|---|---|
| open_target_file / close_target_file | Maintain the currently-open target file descriptor (dstfd) |
| write_target_range | Seeks and writes a byte range; updates fetch_done for progress |
| remove_target / create_target | Dispatch to type-specific helpers (file/dir/symlink) |
| truncate_target_file | Calls ftruncate on the target path |
| slurpFile | Reads an entire file into a malloc’d buffer (used for pg_control, history files) |
| traverse_datadir | Recursive directory walker; calls process_file_callback_t for each entry |
| sync_target_dir | Calls sync_pgdata for a two-pass fsync of the whole data directory |
timeline.c — history parser
| Symbol | Role |
|---|---|
| rewind_parseTimeLineHistory | Parses a .history file buffer into a TimeLineHistoryEntry[] array; appends a tip entry for the target TLI |
Source verification (as of 2026-06-05)
Position hints for REL_18_STABLE commit 273fe94. Symbols are the stable anchors; line numbers are hints that decay.
| Symbol | File | Approx. line |
|---|---|---|
| main | src/bin/pg_rewind/pg_rewind.c | 120 |
| perform_rewind | src/bin/pg_rewind/pg_rewind.c | 554 |
| sanityChecks | src/bin/pg_rewind/pg_rewind.c | 734 |
| findCommonAncestorTimeline | src/bin/pg_rewind/pg_rewind.c | 921 |
| createBackupLabel | src/bin/pg_rewind/pg_rewind.c | 963 |
| ensureCleanShutdown | src/bin/pg_rewind/pg_rewind.c | 1130 |
| getRestoreCommand | src/bin/pg_rewind/pg_rewind.c | 1057 |
| digestControlFile | src/bin/pg_rewind/pg_rewind.c | 1024 |
| extractPageMap | src/bin/pg_rewind/parsexlog.c | 65 |
| extractPageInfo | src/bin/pg_rewind/parsexlog.c | 388 |
| findLastCheckpoint | src/bin/pg_rewind/parsexlog.c | 167 |
| readOneRecord | src/bin/pg_rewind/parsexlog.c | 123 |
| SimpleXLogPageRead | src/bin/pg_rewind/parsexlog.c | 275 |
| filehash_init | src/bin/pg_rewind/filemap.c | 196 |
| process_source_file | src/bin/pg_rewind/filemap.c | 279 |
| process_target_file | src/bin/pg_rewind/filemap.c | 315 |
| process_target_wal_block_change | src/bin/pg_rewind/filemap.c | 353 |
| decide_file_actions | src/bin/pg_rewind/filemap.c | 860 |
| decide_file_action | src/bin/pg_rewind/filemap.c | 699 |
| final_filemap_cmp | src/bin/pg_rewind/filemap.c | 679 |
| isRelDataFile | src/bin/pg_rewind/filemap.c | 570 |
| keepwal_init | src/bin/pg_rewind/filemap.c | 242 |
| datapagemap_add | src/bin/pg_rewind/datapagemap.c | 31 |
| datapagemap_iterate | src/bin/pg_rewind/datapagemap.c | 74 |
| datapagemap_next | src/bin/pg_rewind/datapagemap.c | 87 |
| traverse_datadir | src/bin/pg_rewind/file_ops.c | 384 |
| sync_target_dir | src/bin/pg_rewind/file_ops.c | 318 |
| slurpFile | src/bin/pg_rewind/file_ops.c | 337 |
| rewind_parseTimeLineHistory | src/bin/pg_rewind/timeline.c | 28 |
| rewind_source (struct) | src/bin/pg_rewind/rewind_source.h | 23 |
| file_entry_t (struct) | src/bin/pg_rewind/filemap.h | 49 |
| filemap_t (struct) | src/bin/pg_rewind/filemap.h | 89 |
| file_action_t (enum) | src/bin/pg_rewind/filemap.h | 16 |
| datapagemap_t (struct) | src/bin/pg_rewind/datapagemap.h | — |
Beyond PostgreSQL — Comparative Designs & Research Frontiers
MySQL/InnoDB: no equivalent tool
MySQL’s high-availability tooling relies on mysqldump or Percona XtraBackup for full reseeds. For GTID-based replication, a diverged
node can be rejoined without a full backup only if all missing
transactions are still in the binary log of the source. There is no
equivalent to pg_rewind’s WAL-scan-based partial-page resync.
Patroni and pg_rewind integration
Patroni, a widely deployed PostgreSQL HA stack, can call pg_rewind
automatically when a former primary is detected to have diverged after
a failover. Before calling pg_rewind Patroni ensures the former primary
is shut down cleanly (matching ensureCleanShutdown) and that the
cluster has wal_log_hints = on or checksums enabled. The
use_pg_rewind configuration option enables this path, which avoids full
pg_basebackup reseeds that can be slow for large clusters.
Research context: divergence detection
The algorithmic core of pg_rewind — comparing two timeline history
arrays to find a divergence LSN — is a special case of a more general
problem studied in distributed systems: finding the last common prefix
of two diverging log sequences. In the context of Raft and Paxos-based
replication (see Ongaro & Ousterhout, In Search of an Understandable
Consensus Algorithm, USENIX ATC 2014), this is solved by the
nextIndex / matchIndex protocol. PostgreSQL’s approach is simpler
because timelines are a strict tree (no concurrent forks) and the
divergence point is a physical byte offset, not a logical term+index
tuple.
Incremental backup synergy
PostgreSQL 17 introduced WAL summarization (the summarize_wal setting) and the
pg_basebackup --incremental mode. Both rely on per-block change
tracking files (WAL summary files in pg_wal/summaries/) to identify
which pages changed since the last backup — exactly the same problem
that pg_rewind solves by scanning WAL. The incremental backup approach
avoids WAL scan latency at backup time but requires that WAL
summarization has been running continuously since the last full backup.
pg_rewind’s WAL scan approach works without any prior setup, at the
cost of scanning the target’s diverged WAL at resync time. Both are in
scope for REL_18 (WAL summarization shipped in PG17 and is present in
REL_18_STABLE).
Open questions
- Concurrent source modification. The libpq source path explicitly handles a source that may be modified during the copy (it re-reads the control file at the end and sets minRecoveryPoint to the current WAL insert LSN). The local source path asserts that the source has not changed (a memcmp of the two control file reads with a fatal error on mismatch). Whether the local-path assertion is strictly necessary, or whether the same “set minRecoveryPoint to latest checkpoint” logic would be safe, is noted as an XXX comment in perform_rewind.
- Non-main fork tracking. extractPageInfo explicitly skips all forks except MAIN_FORKNUM, copying FSM, VM, and init forks in full. This is conservative and correct but means that a table with a large free-space map gets its entire FSM copied even if only one page changed in the main fork.
- excludeFiles / basebackup.c sync. The comment in filemap.c notes that excludeDirContents should stay in sync with basebackup.c. There is no automated enforcement of this invariant.
Sources
Source files (REL_18_STABLE, commit 273fe94)
- src/bin/pg_rewind/pg_rewind.c: main orchestrator
- src/bin/pg_rewind/parsexlog.c: WAL reader (extractPageMap, findLastCheckpoint)
- src/bin/pg_rewind/filemap.c: file decision table (filehash, decide_file_actions)
- src/bin/pg_rewind/rewind_source.h: source vtable definition
- src/bin/pg_rewind/file_ops.c: target-side I/O (traverse_datadir, sync_target_dir)
- src/bin/pg_rewind/timeline.c: timeline history parser
- src/bin/pg_rewind/datapagemap.c: block bitmap
- src/bin/pg_rewind/filemap.h: file_entry_t, filemap_t, file_action_t
Related knowledge-base docs
- knowledge/code-analysis/postgres/postgres-recovery-redo.md: WAL replay on target startup after rewind
- knowledge/code-analysis/postgres/postgres-xlog-wal.md: WAL insertion, LSN model, full-page writes
- knowledge/code-analysis/postgres/postgres-checkpoint.md: checkpoint mechanics; chkptredo anchor used by rewind
- knowledge/code-analysis/postgres/postgres-wal-records-rmgr.md: rmgr dispatch; WAL record structure decoded by parsexlog.c
- knowledge/code-analysis/postgres/postgres-incremental-backup.md: WAL summarization; alternative change-tracking approach