PostgreSQL Incremental Backup — WAL Summaries, the Manifest, and pg_combinebackup
Contents:
- Theoretical Background
- Common DBMS Design
- PostgreSQL’s Approach
- Source Walkthrough
- Source verification (as of 2026-06-05)
- Beyond PostgreSQL — Comparative Designs & Research Frontiers
- Sources
Theoretical Background
A physical backup of a database is a byte-for-byte copy of its on-disk state: the data files, plus enough write-ahead log (WAL) to bring that copy to a consistent point during recovery. A full backup copies every file. The cost of a full backup grows with the size of the cluster, not with the rate of change: a 10 TB warehouse that mutates 1 % of its pages per day still pays 10 TB of read, transfer, and storage for each backup. That mismatch between data size and change size is the entire motivation for incremental backup.
An incremental backup copies only the data that changed since some earlier reference backup, plus the metadata needed to reconstruct a full image later. The design question every system must answer is: how do we know which bytes changed? There are three classical answers, and they trade accuracy for bookkeeping cost:
1. Timestamp / mtime comparison. Copy any file whose modification time is newer than the reference backup. Cheap, but coarse (a one-byte change copies the whole file) and unsafe (clock skew, in-place writes that do not bump mtime, files removed and recreated with the same name). File-level granularity is far too coarse for a database where a 1 GB relation segment changes a handful of 8 KB pages.
2. Block-checksum diffing. Read every block of both the live file and the reference, compare checksums, copy the blocks that differ. Accurate at page granularity, but it must read the entire database to find the changes: it eliminates the transfer cost but not the read cost. This is what tools in the style of rsync --checksum effectively do.
3. Change tracking from the log. The database already records every page modification in its WAL for durability. If we summarize the WAL, distilling from it the set of (relation, fork, block) tuples that were touched between two LSNs, we can know exactly which blocks changed without reading the data files at all. The cost is proportional to the volume of WAL, i.e. to the change rate, which is exactly the quantity we wanted to be proportional to.
PostgreSQL 17 implements option 3. This is the same idea that underlies ARIES-style redo (Mohan 1992): the WAL is an authoritative, ordered record of every change, so any question of the form “what changed between LSN a and LSN b?” can in principle be answered by replaying the log between those points. Incremental backup replays the log not to apply changes but to enumerate them. Database Internals (Petrov, ch. 3 “File Formats” and the recovery discussion) frames the WAL as the system’s single source of truth for durability; incremental backup is a second consumer of that same truth, parallel to crash recovery and to streaming replication.
Two derived concepts make this concrete:
The block reference table. The summarized form of “what changed” is a
map from each (RelFileLocator, ForkNumber) to the set of block numbers
modified in the LSN window, plus a limit block marking truncations
(blocks at or above the limit no longer exist). PostgreSQL calls this a
BlockRefTable. Its in-memory representation converges to roughly one bit
per modified block for densely-modified relations, so even a large change
set stays compact.
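To make the shape of such a table concrete, here is a minimal sketch in C for a single relation fork, assuming a fixed 1024-block cap and a plain bitmap. The names ToyBlockRefEntry, toy_init, toy_mark_modified, toy_is_modified, and toy_set_limit are hypothetical; PostgreSQL's real BlockRefTable is keyed by relation and fork and is not limited to a fixed-size bitmap.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define TOY_MAX_BLOCKS 1024u        /* assumption: tiny fixed-size fork */
#define TOY_NO_LIMIT   UINT32_MAX   /* no truncation seen in the window */

/* Hypothetical stand-in for one (relation, fork) entry. */
typedef struct ToyBlockRefEntry
{
    uint32_t limit_block;               /* blocks >= limit were truncated away */
    uint8_t  bits[TOY_MAX_BLOCKS / 8];  /* one bit per block */
} ToyBlockRefEntry;

static void
toy_init(ToyBlockRefEntry *e)
{
    e->limit_block = TOY_NO_LIMIT;
    memset(e->bits, 0, sizeof(e->bits));
}

static void
toy_mark_modified(ToyBlockRefEntry *e, uint32_t blkno)
{
    if (blkno < TOY_MAX_BLOCKS)
        e->bits[blkno / 8] |= (uint8_t) (1u << (blkno % 8));
}

static bool
toy_is_modified(const ToyBlockRefEntry *e, uint32_t blkno)
{
    return blkno < TOY_MAX_BLOCKS &&
        ((e->bits[blkno / 8] >> (blkno % 8)) & 1u) != 0;
}

/* Record a truncation: keep the smallest limit, forget blocks at or beyond it. */
static void
toy_set_limit(ToyBlockRefEntry *e, uint32_t limit)
{
    if (limit < e->limit_block)
        e->limit_block = limit;
    for (uint32_t b = limit; b < TOY_MAX_BLOCKS; b++)
        e->bits[b / 8] &= (uint8_t) ~(1u << (b % 8));
}
```

Marking blocks 3 and 700 and then recording a truncation at block 500 leaves only block 3 set, with limit_block at 500.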
The reconstruction problem. An incremental backup is, by construction, not restorable on its own — it is a delta. To restore, you need the incremental backup plus the entire chain of backups it depends on, back to a full backup. Reconstruction walks that chain and, for every block of every file, decides which backup in the chain holds the authoritative copy. This is a classic most-recent-wins merge over a layered set of deltas, the same shape as a log-structured merge or a stack of filesystem overlay layers.
The design space PostgreSQL chose within:
- Summary granularity — file, segment, or block? PostgreSQL summarizes at block granularity but stores summaries per WAL LSN range, decoupled from any particular backup.
- Who does the diffing — the backup client, or the server? PostgreSQL does it on the server, because only the server has the WAL summaries and the timeline history.
- Where reconstruction happens — at backup time (synthetic full
backups) or at restore time? PostgreSQL defers it to restore time via a
separate frontend tool,
pg_combinebackup, keeping the server side purely about producing deltas.
Common DBMS Design
Incremental and differential backup is old technology; nearly every serious DBMS and storage system has a version of it, and they converge on a small set of structural choices. Naming them makes PostgreSQL’s specific symbols read as one set of points in a shared design space.
Incremental vs. differential vs. cumulative
The vocabulary is shared across the industry:
- A differential (or cumulative) backup records all changes since the last full backup. Restoration needs exactly two backups: the full plus the latest differential. The differential grows over time.
- An incremental backup records changes since the last backup of any kind. Restoration needs the full chain. Each increment is small, but the chain can be long.
PostgreSQL’s mechanism is genuinely incremental: a BASE_BACKUP ... INCREMENTAL is taken relative to whatever prior backup’s manifest you
upload, and that prior backup may itself have been incremental. The chain
length is bounded only by how often you take a full backup.
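The difference in restore cost can be sketched with a toy catalog in which each backup records the index of the backup it was taken against. ToyBackup and toy_chain_length are hypothetical names for illustration and do not correspond to anything in PostgreSQL.

```c
/* Hypothetical catalog entry: parent < 0 marks a full backup. */
typedef struct ToyBackup
{
    const char *name;
    int         parent;     /* index of the backup this one was taken against */
} ToyBackup;

/*
 * Count how many backups must be present to restore backups[idx]:
 * the backup itself plus every ancestor back to a full backup.
 */
static int
toy_chain_length(const ToyBackup *backups, int idx)
{
    int n = 0;

    while (idx >= 0)
    {
        n++;
        idx = backups[idx].parent;
    }
    return n;
}
```

With a full backup followed by three daily backups, an incremental scheme (each taken against the previous day) needs four backups to restore the last day, while a differential scheme (each taken against the full) needs two.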
A change-tracking side structure
Systems that diff from the log rather than from the data files all maintain some persistent side structure that records “blocks dirtied in this log window”:
- SQL Server keeps a Differential Changed Map (DCM), one bit per extent, updated as pages are written; differential backups read the DCM.
- Oracle RMAN maintains a block change tracking file recording changed blocks since the last backup, so an incremental backup need not scan every datafile.
- Db2 uses incremental backup driven by a tracking bitmap as well.
PostgreSQL’s analog is the set of WAL summary files under
pg_wal/summaries/, produced by a dedicated WAL summarizer background
worker (covered in the sibling note; here we only consume its output).
The crucial PostgreSQL-specific twist is that the summaries are keyed by
LSN range, not by backup — a summary covering 0/10000000 to
0/20000000 is reusable for any incremental backup whose chain crosses
that window. The side structure is decoupled from the backup schedule.
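Because summaries are keyed by LSN range, deciding whether a backup window is fully covered reduces to an interval-coverage check. The sketch below assumes summaries sorted by start LSN and represented as half-open ranges; ToySummary and toy_summaries_cover are hypothetical names, and this is not the server's actual completeness check.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical summary descriptor: covers the half-open range [start, end). */
typedef struct ToySummary
{
    uint64_t start_lsn;
    uint64_t end_lsn;
} ToySummary;

/*
 * Return true if the summaries, sorted by start_lsn, cover [from, to)
 * with no gap. Overlapping summaries are tolerated.
 */
static bool
toy_summaries_cover(const ToySummary *s, size_t n, uint64_t from, uint64_t to)
{
    uint64_t covered_to = from;

    for (size_t i = 0; i < n && covered_to < to; i++)
    {
        if (s[i].end_lsn <= covered_to)
            continue;           /* entirely before the point reached so far */
        if (s[i].start_lsn > covered_to)
            return false;       /* gap between covered_to and this summary */
        covered_to = s[i].end_lsn;
    }
    return covered_to >= to;
}
```

Two summaries covering 0/10000000 to 0/20000000 and 0/20000000 to 0/30000000 satisfy any window inside that span, whichever backup the window belongs to; remove part of the first and the check fails.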
A self-describing manifest
Every modern backup format ships a manifest: a list of files with sizes,
checksums, and the WAL range needed for consistency. The manifest serves
three roles — verification (pg_verifybackup), incremental referencing
(it tells the next incremental backup what the prior state was), and
reconstruction (it lets the combiner reuse stored checksums). PostgreSQL’s
backup_manifest is a JSON document; for incremental backups the client
uploads the prior manifest to the server before requesting the backup.
Reconstruction as a layered merge
Whether reconstruction happens eagerly (synthesizing a full backup at backup time, as some enterprise tools do) or lazily (at restore time), the algorithm is the same most-recent-wins overlay: for each block, scan the backup chain from newest to oldest and take the first copy you find; if no backup in the chain has the block but the block is below the file’s truncation point, it must be a zero-filled hole.

```mermaid
flowchart TD
  subgraph Theory["Change-tracking backup, abstractly"]
    A["Authoritative change log<br/>(WAL)"] --> B["Summarizer:<br/>distill changed (rel,fork,block)<br/>per LSN range"]
    B --> C["Persistent side structure<br/>(WAL summary files)"]
    C --> D["Backup time:<br/>diff = blocks changed since<br/>reference backup"]
    D --> E["Incremental backup<br/>(deltas + manifest)"]
    E --> F["Restore time:<br/>layered most-recent-wins merge<br/>over the backup chain"]
    F --> G["Full synthetic data directory"]
  end
```

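A minimal sketch of the most-recent-wins overlay in C, assuming each layer is described only by a presence flag per block. ToyLayer, toy_resolve_sources, and TOY_ZERO_FILL are hypothetical names for illustration, and the sketch ignores truncation handling beyond the caller-supplied output length.

```c
#include <stddef.h>

#define TOY_ZERO_FILL (-1)      /* no layer holds the block: emit zeroes */

/*
 * Hypothetical layer: has_block[b] is nonzero if this backup stores block b.
 * A full backup stores every block it has; an incremental stores only deltas.
 */
typedef struct ToyLayer
{
    const unsigned char *has_block;
    size_t               nblocks;
} ToyLayer;

/*
 * For each block below out_len, record the index of the newest layer that
 * holds it. layers[0] is the oldest (full) backup, layers[nlayers - 1] the
 * newest incremental.
 */
static void
toy_resolve_sources(const ToyLayer *layers, int nlayers,
                    int *source, size_t out_len)
{
    for (size_t b = 0; b < out_len; b++)
    {
        source[b] = TOY_ZERO_FILL;
        for (int i = nlayers - 1; i >= 0; i--)
        {
            if (b < layers[i].nblocks && layers[i].has_block[b])
            {
                source[b] = i;  /* newest copy wins; stop looking */
                break;
            }
        }
    }
}
```

With a three-block full backup and two incrementals that touch block 1 and block 0 respectively, block 0 resolves to the newest layer, block 1 to the middle one, block 2 to the full backup, and any block beyond all of them to a zero-filled hole.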
PostgreSQL’s Approach
PostgreSQL splits incremental backup across four cooperating pieces. The boundaries matter because this note covers the core backup mechanism. The first three pieces live in the server; the fourth is a standalone frontend tool:
1. WAL summarizer (a background worker, walsummarizer.c) continuously reads WAL as it is generated and writes block-level summary files into pg_wal/summaries/. It runs whenever summarize_wal = on, whether or not anyone ever takes an incremental backup. Out of scope here; see the sibling note on archiving/WAL summarization.
2. Manifest ingestion (basebackup_incremental.c + UploadManifest in walsender.c). Before requesting an incremental backup, the client sends UPLOAD_MANIFEST and streams the prior backup’s backup_manifest. The server parses it incrementally into an IncrementalBackupInfo object.
3. The incremental BASE_BACKUP (basebackup.c + PrepareForIncrementalBackup / GetFileBackupMethod). The server loads the WAL summaries spanning the prior backup’s start through the current backup’s start, merges them into one BlockRefTable, then for each relation file decides how to send it (full, incremental, or stub) and streams the chosen bytes.
4. Reconstruction (pg_combinebackup, a frontend tool under src/bin/). Given a chain of backup directories, it produces a single full data directory by merging blocks newest-to-oldest. The server is never involved.
Wiring the protocol: UPLOAD_MANIFEST then BASE_BACKUP
The client (pg_basebackup --incremental=PRIOR/backup_manifest) first
runs UPLOAD_MANIFEST. The walsender allocates an IncrementalBackupInfo
in a dedicated memory context, then loops over CopyData packets feeding
manifest bytes in:

```c
// UploadManifest (src/backend/replication/walsender.c)
mcxt = AllocSetContextCreate(CurrentMemoryContext,
                             "incremental backup information",
                             ALLOCSET_DEFAULT_SIZES);
ib = CreateIncrementalBackupInfo(mcxt);
/* ... send CopyInResponse ... */
while (HandleUploadManifestPacket(&buf, &offset, ib))
    ;
FinalizeIncrementalManifest(ib);
/* preserve ib across the later BASE_BACKUP in CacheMemoryContext */
MemoryContextSetParent(mcxt, CacheMemoryContext);
uploaded_manifest = ib;
```

The parsed-and-retained uploaded_manifest is handed to
perform_base_backup when the subsequent BASE_BACKUP ... INCREMENTAL
arrives. The incremental option itself is validated at parse time and
refuses to proceed unless WAL summarization is enabled — there can be no
summaries to diff against otherwise:

```c
// parse_basebackup_options (src/backend/backup/basebackup.c)
else if (strcmp(defel->defname, "incremental") == 0)
{
    opt->incremental = defGetBoolean(defel);
    if (opt->incremental && !summarize_wal)
        ereport(ERROR,
                (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
                 errmsg("incremental backups cannot be taken unless WAL summarization is enabled")));
}
```

Parsing the manifest incrementally
A manifest for a large cluster can be many megabytes (one entry per file).
CreateIncrementalBackupInfo wires up a streaming JSON parser with
callbacks, and seeds a simplehash of file entries sized for a realistic
cluster:

```c
// CreateIncrementalBackupInfo (src/backend/backup/basebackup_incremental.c)
ib->manifest_files = backup_file_create(mcxt, 10000, NULL);
context = palloc0(sizeof(JsonManifestParseContext));
context->private_data = ib;
context->version_cb = manifest_process_version;
context->system_identifier_cb = manifest_process_system_identifier;
context->per_file_cb = manifest_process_file;
context->per_wal_range_cb = manifest_process_wal_range;
context->error_cb = manifest_report_error;
ib->inc_state = json_parse_manifest_incremental_init(context);
```

The per-WAL-range callback is the load-bearing one for incremental logic:
it records the (tli, start_lsn, end_lsn) of each WAL range the prior
backup needs. The per-file callback retains only path and size — used for
sanity checks, not for deciding what changed (that comes from the WAL
summaries):

```c
// manifest_process_wal_range (basebackup_incremental.c)
range->tli = tli;
range->start_lsn = start_lsn;
range->end_lsn = end_lsn;
ib->manifest_wal_ranges = lappend(ib->manifest_wal_ranges, range);
```

The streaming design (AppendIncrementalManifestData triggers a parse
step whenever the buffer is about to exceed MAX_CHUNK = 128 KiB, keeping
the last MIN_CHUNK = 1 KiB so the trailing checksum line stays intact)
exists precisely so the whole manifest never has to be buffered at once.
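The buffering rule can be sketched as follows, assuming the JSON parser is replaced by a simple byte counter. ToyStream and toy_append are hypothetical; the two constants mirror the 128 KiB and 1 KiB values quoted above.

```c
#include <stddef.h>
#include <string.h>

#define TOY_MIN_CHUNK 1024u             /* tail always kept unparsed */
#define TOY_MAX_CHUNK (128u * 1024u)    /* buffer ceiling */

typedef struct ToyStream
{
    char   buf[TOY_MAX_CHUNK];
    size_t len;         /* bytes currently buffered */
    size_t parsed;      /* bytes already handed to the (imaginary) parser */
} ToyStream;

/*
 * Append len bytes; the caller must keep len at or below
 * TOY_MAX_CHUNK - TOY_MIN_CHUNK. If the append would cross the ceiling,
 * first "parse" everything except the last TOY_MIN_CHUNK bytes and slide
 * that tail to the front of the buffer.
 */
static void
toy_append(ToyStream *s, const char *data, size_t len)
{
    if (s->len > TOY_MIN_CHUNK && s->len + len > TOY_MAX_CHUNK)
    {
        size_t consumed = s->len - TOY_MIN_CHUNK;

        s->parsed += consumed;  /* a real parser would read buf[0..consumed) */
        memmove(s->buf, s->buf + consumed, TOY_MIN_CHUNK);
        s->len = TOY_MIN_CHUNK;
    }
    memcpy(s->buf + s->len, data, len);
    s->len += len;
}
```

Feeding the stream any number of 8 KiB packets keeps the buffered length at or below the ceiling, and the parsed count plus the buffered length always equals the total bytes received.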
Deciding which blocks are needed
PrepareForIncrementalBackup is the heart of the server side. It does
five things in order: (1) validate the manifest’s WAL ranges against this
server’s timeline history; (2) sanity-check the LSN boundaries; (3) wait
for the WAL summarizer to catch up; (4) gather the WAL summary files
covering the range of interest; (5) merge them into one BlockRefTable.
The timeline matching guards against taking an incremental backup relative to a backup that does not actually represent a prior state of this server:

```c
// PrepareForIncrementalBackup (basebackup_incremental.c)
expectedTLEs = readTimeLineHistory(backup_state->starttli);
/* ... match each manifest WAL range's TLI into expectedTLEs ... */
if (tlep[i] == NULL)
    ereport(ERROR,
            (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
             errmsg("timeline %u found in manifest, but not in this server's history",
                    range->tli)));
```

Once the LSN window is known, it blocks until summarization reaches the backup’s start point, then collects and filters the summary files, and finally streams each summary’s block references into the in-memory table:

```c
// PrepareForIncrementalBackup, summary merge (basebackup_incremental.c)
WaitForWalSummarization(backup_state->startpoint);
all_wslist = GetWalSummaries(0, earliest_wal_range_start_lsn,
                             backup_state->startpoint);
/* ... per-timeline FilterWalSummaries + WalSummariesAreComplete check ... */
ib->brtab = CreateEmptyBlockRefTable();
foreach(lc, required_wslist)
{
    /* open summary, read each relation fork ... */
    while (BlockRefTableReaderNextRelation(reader, &rlocator, &forknum,
                                           &limit_block))
    {
        BlockRefTableSetLimitBlock(ib->brtab, &rlocator, forknum, limit_block);
        while ((nblocks = BlockRefTableReaderGetBlocks(reader, blocks,
                                                       BLOCKS_PER_READ)) != 0)
            for (i = 0; i < nblocks; ++i)
                BlockRefTableMarkBlockModified(ib->brtab, &rlocator,
                                               forknum, blocks[i]);
    }
}
```

If any summary is missing for a required LSN range, the backup fails loudly rather than silently producing an unsafe delta: the whole point is that the change set must be complete.

```mermaid
flowchart TD
  C1["pg_basebackup --incremental=PRIOR/backup_manifest"] --> C2["UPLOAD_MANIFEST<br/>(stream prior manifest)"]
  C2 --> S1["CreateIncrementalBackupInfo<br/>+ streaming JSON parse"]
  S1 --> C3["BASE_BACKUP ... INCREMENTAL"]
  C3 --> S2["PrepareForIncrementalBackup"]
  S2 --> S2a["match WAL ranges vs<br/>timeline history"]
  S2a --> S2b["WaitForWalSummarization<br/>(start LSN)"]
  S2b --> S2c["GetWalSummaries +<br/>FilterWalSummaries +<br/>WalSummariesAreComplete"]
  S2c --> S2d["merge into one<br/>BlockRefTable (ib->brtab)"]
  S2d --> S3["sendDir: for each relation file<br/>GetFileBackupMethod"]
  S3 --> M1["BACK_UP_FILE_FULLY"]
  S3 --> M2["BACK_UP_FILE_INCREMENTALLY<br/>(header + changed blocks)"]
  M1 --> S4["sendFile streams bytes"]
  M2 --> S4
  S4 --> OUT["base.tar + backup_manifest<br/>(this backup)"]
```

Source Walkthrough
This section walks the server-side code top-down (manifest ingestion,
the per-file decision, the on-the-wire incremental format) and then
crosses into the pg_combinebackup frontend for reconstruction. Symbols
are the durable anchors; the position-hint table at the end pins each to a
(file, line) as observed on 2026-06-05.
Manifest ingestion: CreateIncrementalBackupInfo and the streaming parser
UploadManifest (in walsender.c) is the protocol entry point. It creates
an IncrementalBackupInfo, then drives the streaming JSON parser one
CopyData packet at a time and finalizes. The object is reparented into
CacheMemoryContext so it survives until the matching BASE_BACKUP:

```c
// UploadManifest (src/backend/replication/walsender.c)
mcxt = AllocSetContextCreate(CurrentMemoryContext,
                             "incremental backup information",
                             ALLOCSET_DEFAULT_SIZES);
ib = CreateIncrementalBackupInfo(mcxt);
/* ... CopyInResponse, then loop feeding bytes ... */
while (HandleUploadManifestPacket(&buf, &offset, ib))
    ;
FinalizeIncrementalManifest(ib);
MemoryContextSetParent(mcxt, CacheMemoryContext);
uploaded_manifest = ib;
```

The buffering discipline in AppendIncrementalManifestData is what keeps
an arbitrarily large manifest from being held in memory all at once. It
parses incrementally whenever the accumulated buffer would cross
MAX_CHUNK, always retaining the last MIN_CHUNK bytes so a checksum line
straddling a chunk boundary is never split mid-token:

```c
// AppendIncrementalManifestData (src/backend/backup/basebackup_incremental.c)
if (ib->buf.len > MIN_CHUNK && ib->buf.len + len > MAX_CHUNK)
{
    /* Parse all but the last MIN_CHUNK bytes of data we have so far. */
    json_parse_manifest_incremental_chunk(ib->inc_state, ib->buf.data,
                                          ib->buf.len - MIN_CHUNK, false);
    /*
     * Now shift the data that hasn't yet been parsed to the start of
     * the buffer.
     */
    memmove(ib->buf.data, ib->buf.data + (ib->buf.len - MIN_CHUNK),
            MIN_CHUNK + 1);
    ib->buf.len = MIN_CHUNK;
}
```

The only callback whose output drives the diff is
manifest_process_wal_range — the per-file callback merely records path
and size for the existence checks in GetFileBackupMethod. The WAL ranges
tell the server which timeline/LSN windows the prior backup spans, which is
exactly what PrepareForIncrementalBackup later validates against this
server’s own timeline history.
The per-file decision: GetFileBackupMethod
This is the function that turns “what changed” into “how do I send this
file.” It is called once per relation segment from sendDir in
basebackup.c. The early-out ladder is worth reading verbatim, because the
order of the bail-outs encodes correctness constraints, not just
optimizations.
First, two unconditional full-backup cases — a malformed size, and the free-space map fork, which is not WAL-logged and therefore cannot be reconstructed from WAL summaries:

```c
// GetFileBackupMethod (src/backend/backup/basebackup_incremental.c)
if ((size % BLCKSZ) != 0 || size / BLCKSZ > RELSEG_SIZE)
    return BACK_UP_FILE_FULLY;

/*
 * The free-space map fork is not properly WAL-logged, so we need to
 * backup the entire file every time.
 */
if (forknum == FSM_FORKNUM)
    return BACK_UP_FILE_FULLY;
```

Second, the “did this file exist in the prior backup?” guard. A file the
prior manifest never mentioned cannot be sent as a delta against it — and
critically, a file created after the current backup started will have no
WAL summary coverage, so it too must be sent fully. The code probes both
the plain path and the INCREMENTAL.* path (the prior backup may itself
have stored this segment incrementally):

```c
// GetFileBackupMethod (basebackup_incremental.c)
if (backup_file_lookup(ib->manifest_files, path) == NULL)
{
    char       *ipath;

    ipath = GetIncrementalFilePath(dboid, spcoid, relfilenumber,
                                   forknum, segno);
    if (backup_file_lookup(ib->manifest_files, ipath) == NULL)
        return BACK_UP_FILE_FULLY;
}
```

Third, the actual BlockRefTable lookup. A missing entry means the WAL
recorded no changes to this relation fork — so the file can be sent as a
zero-block incremental stub (header only). A present entry yields the set
of changed blocks plus a limit_block marking truncation:

```c
// GetFileBackupMethod (basebackup_incremental.c)
brtentry = BlockRefTableGetEntry(ib->brtab, &rlocator, forknum, &limit_block);
if (brtentry == NULL)
{
    if (size == 0)
        return BACK_UP_FILE_FULLY;
    *num_blocks_required = 0;
    *truncation_block_length = size / BLCKSZ;
    return BACK_UP_FILE_INCREMENTALLY;  /* a header-only stub */
}
```

Finally, the 90 % heuristic. BlockRefTableEntryGetBlocks fills the
caller’s array with the absolute block numbers in this segment’s range; if
that count would be more than 90 % of the file, the incremental encoding
saves nothing and the file is sent fully. Otherwise the block numbers are
sorted and rebased to segment-relative form, and truncation_block_length
is clamped against the limit block and the segment size:

```c
// GetFileBackupMethod (basebackup_incremental.c)
nblocks = BlockRefTableEntryGetBlocks(brtentry, start_blkno, stop_blkno,
                                      relative_block_numbers, RELSEG_SIZE);
/* If we'd need to send 90% of the blocks anyway, send the whole file. */
if (nblocks * BLCKSZ > size * 0.9)
    return BACK_UP_FILE_FULLY;

qsort(relative_block_numbers, nblocks, sizeof(BlockNumber),
      compare_block_numbers);
if (start_blkno != 0)
    for (i = 0; i < nblocks; ++i)
        relative_block_numbers[i] -= start_blkno;
*num_blocks_required = nblocks;
*truncation_block_length = size / BLCKSZ;
/* ... clamp truncation_block_length to [relative_limit, RELSEG_SIZE] ... */
return BACK_UP_FILE_INCREMENTALLY;
```

sendDir consumes the return value: on BACK_UP_FILE_INCREMENTALLY it
rewrites the tar member name to INCREMENTAL.<name> and shrinks
statbuf.st_size to the incremental file’s size before calling sendFile:

```c
// sendDir, incremental branch (src/backend/backup/basebackup.c)
method = GetFileBackupMethod(ib, lookup_path, dboid, relspcoid,
                             relfilenumber, relForkNum, segno,
                             statbuf.st_size,
                             &num_blocks_required,
                             relative_block_numbers,
                             &truncation_block_length);
if (method == BACK_UP_FILE_INCREMENTALLY)
{
    statbuf.st_size = GetIncrementalFileSize(num_blocks_required);
    snprintf(tarfilenamebuf, sizeof(tarfilenamebuf),
             "%s/INCREMENTAL.%s",
             path + basepathlen + 1, de->d_name);
    tarfilename = tarfilenamebuf;
}
```

```mermaid
flowchart TD
  G0["GetFileBackupMethod(path, segno, size)"] --> G1{"size not a BLCKSZ multiple<br/>or > RELSEG_SIZE?"}
  G1 -- yes --> FULL["BACK_UP_FILE_FULLY"]
  G1 -- no --> G2{"forknum == FSM_FORKNUM?"}
  G2 -- yes --> FULL
  G2 -- no --> G3{"path in prior manifest<br/>(plain or INCREMENTAL.*)?"}
  G3 -- no --> FULL
  G3 -- yes --> G4{"BlockRefTable entry<br/>for this relfilenode?"}
  G4 -- "none (no WAL changes)" --> STUB["INCREMENTALLY<br/>num_blocks = 0 (stub)"]
  G4 -- present --> G5["GetBlocks → nblocks"]
  G5 --> G6{"nblocks*BLCKSZ > 0.9*size?"}
  G6 -- yes --> FULL
  G6 -- no --> INC["INCREMENTALLY<br/>header + changed blocks"]
```

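The ladder can be condensed into a small pure function. This is a sketch, assuming the default 8 KB block size and 131072-block segments, with the manifest lookup and the BlockRefTable lookup reduced to boolean inputs; ToyMethod and toy_choose_method are hypothetical names, not the server's signature.

```c
#include <stdbool.h>
#include <stddef.h>

#define TOY_BLCKSZ      8192u       /* assumption: default block size */
#define TOY_RELSEG_SIZE 131072u     /* assumption: blocks per 1 GB segment */

typedef enum ToyMethod
{
    TOY_FULL,
    TOY_INCREMENTAL
} ToyMethod;

/*
 * has_entry says whether the merged change set mentions this relation fork;
 * nchanged is the number of changed blocks that fall inside this segment.
 */
static ToyMethod
toy_choose_method(size_t size, bool is_fsm, bool in_prior_manifest,
                  bool has_entry, size_t nchanged)
{
    if (size % TOY_BLCKSZ != 0 || size / TOY_BLCKSZ > TOY_RELSEG_SIZE)
        return TOY_FULL;        /* odd size: do not trust block arithmetic */
    if (is_fsm)
        return TOY_FULL;        /* FSM changes are not tracked reliably */
    if (!in_prior_manifest)
        return TOY_FULL;        /* nothing to take a delta against */
    if (!has_entry)
        return size == 0 ? TOY_FULL : TOY_INCREMENTAL;  /* header-only stub */
    if (nchanged * TOY_BLCKSZ > size * 0.9)
        return TOY_FULL;        /* delta would be nearly the whole file */
    return TOY_INCREMENTAL;
}
```

For a 100-block segment (819,200 bytes) the threshold is 737,280 bytes, so 91 changed blocks (745,472 bytes) falls back to a full copy while 50 changed blocks stays incremental.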
The on-the-wire incremental format: sendFile
When incremental_blocks != NULL, sendFile writes a header before any
block data. The header is three uint32 values written in the host's native
byte order (the code pushes the raw in-memory variables): INCREMENTAL_MAGIC
(0xd3ae1f0d), the block count, and the truncation block length. They are
followed by the array of relative block numbers, padded when the block
count is nonzero so block data starts on a BLCKSZ boundary:

```c
// sendFile, incremental header (src/backend/backup/basebackup.c)
if (incremental_blocks != NULL)
{
    unsigned    magic = INCREMENTAL_MAGIC;
    size_t      header_bytes_done = 0;

    push_to_sink(sink, &checksum_ctx, &header_bytes_done,
                 &magic, sizeof(magic));
    push_to_sink(sink, &checksum_ctx, &header_bytes_done,
                 &num_incremental_blocks, sizeof(num_incremental_blocks));
    push_to_sink(sink, &checksum_ctx, &header_bytes_done,
                 &truncation_block_length, sizeof(truncation_block_length));
    push_to_sink(sink, &checksum_ctx, &header_bytes_done,
                 incremental_blocks,
                 sizeof(BlockNumber) * num_incremental_blocks);
    /* ... pad to BLCKSZ if num_incremental_blocks > 0 ... */
}
```

The data loop then reads exactly the listed blocks, one at a time, seeking
to relative_blkno * BLCKSZ for each. A short read mid-block is treated as
a concurrent truncation and ends the loop — WAL replay during restore will
fix up the tail:

```c
// sendFile, incremental data loop (src/backend/backup/basebackup.c)
relative_blkno = incremental_blocks[ibindex++];
cnt = read_file_data_into_buffer(sink, readfilename, fd,
                                 relative_blkno * BLCKSZ,   /* seek */
                                 BLCKSZ,                    /* one block */
                                 relative_blkno + segno * RELSEG_SIZE,
                                 verify_checksum, &checksum_failures);
if (cnt < BLCKSZ)
    break;  /* transient truncation; WAL replay will fix it */
```

Inspecting summaries from SQL: walsummaryfuncs.c
The same BlockRefTable reader the backup path uses is exposed to SQL for
diagnostics. pg_available_wal_summaries lists the files;
pg_wal_summary_contents opens one file and emits one row per
(relfilenode, fork, block) — plus a synthetic limit_block row marking a
truncation point:

```c
// pg_wal_summary_contents (src/backend/backup/walsummaryfuncs.c)
io.file = OpenWalSummaryFile(&ws, false);
reader = CreateBlockRefTableReader(ReadWalSummary, &io,
                                   FilePathName(io.file),
                                   ReportWalSummaryError, NULL);
while (BlockRefTableReaderNextRelation(reader, &rlocator, &forknum,
                                       &limit_block))
{
    /* emit limit_block row if BlockNumberIsValid(limit_block) ... */
    /* then loop over blocks, MAX_BLOCKS_PER_CALL at a time ... */
}
```

This is the user-visible window onto exactly the data
PrepareForIncrementalBackup merges into ib->brtab, which makes it the
go-to tool for answering “why was this file sent fully?”
Reconstruction: pg_combinebackup and reconstruct_from_incremental_file
The frontend tool is where an incremental backup becomes restorable. For
each output file, reconstruct_from_incremental_file reads the newest
incremental file’s header to learn the reconstructed length, then builds a
sourcemap (which file holds each block) and an offsetmap (at what byte
offset). Blocks present in the newest file win outright:

```c
// reconstruct_from_incremental_file (src/bin/pg_combinebackup/reconstruct.c)
latest_source = make_incremental_rfile(input_filename);
source[n_prior_backups] = latest_source;
block_length = find_reconstructed_block_length(latest_source);
sourcemap = pg_malloc0(sizeof(rfile *) * block_length);
offsetmap = pg_malloc0(sizeof(off_t) * block_length);

for (i = 0; i < latest_source->num_blocks; ++i)
{
    BlockNumber b = latest_source->relative_block_numbers[i];

    sourcemap[b] = latest_source;
    offsetmap[b] = latest_source->header_length + (i * BLCKSZ);
}
```

make_incremental_rfile is the reader for the format sendFile wrote; it
validates INCREMENTAL_MAGIC and reconstructs header_length identically
to the server’s GetIncrementalHeaderSize, which is how the two halves of
the feature stay byte-compatible:

```c
// make_incremental_rfile (src/bin/pg_combinebackup/reconstruct.c)
read_bytes(rf, &magic, sizeof(magic));
if (magic != INCREMENTAL_MAGIC)
    pg_fatal("file \"%s\" has bad incremental magic number (0x%x, expected 0x%x)",
             filename, magic, INCREMENTAL_MAGIC);
read_bytes(rf, &rf->num_blocks, sizeof(rf->num_blocks));
read_bytes(rf, &rf->truncation_block_length,
           sizeof(rf->truncation_block_length));
/* ... read relative_block_numbers[] ... */
rf->header_length = sizeof(magic) + sizeof(rf->num_blocks) +
    sizeof(rf->truncation_block_length) +
    sizeof(BlockNumber) * rf->num_blocks;
```

The chain walk descends from newest to oldest prior backup. The moment it
hits a full copy of the file, it fills every still-unassigned block below
truncation_block_length from that full file and stops — older backups
cannot override a block already claimed by a newer layer (most-recent-wins):

```c
// reconstruct_from_incremental_file, full-file source (reconstruct.c)
blocklength = sb.st_size / BLCKSZ;
for (b = 0; b < latest_source->truncation_block_length; ++b)
{
    if (sourcemap[b] == NULL && b < blocklength)
    {
        sourcemap[b] = s;           /* fill from the full file */
        offsetmap[b] = b * BLCKSZ;
    }
}
/* ... then break: no older source can override these ... */
```

Any block still NULL after the walk but below the truncation length is a
zero-filled hole — a block the server extended into existence but never
WAL-logged, which write_reconstructed_file materializes as zeroes.

```mermaid
flowchart TD
  R0["reconstruct_from_incremental_file(output file)"] --> R1["make_incremental_rfile(newest):<br/>magic, num_blocks,<br/>truncation_block_length"]
  R1 --> R2["block_length =<br/>find_reconstructed_block_length"]
  R2 --> R3["claim blocks present in<br/>newest incremental → sourcemap"]
  R3 --> R4{"walk prior backups<br/>newest → oldest"}
  R4 -- "incremental layer" --> R5["claim still-unfilled blocks<br/>below truncation length"]
  R5 --> R4
  R4 -- "full file found" --> R6["fill remaining blocks<br/>from full file, then stop"]
  R6 --> R7["any still-NULL block<br/>below truncation = zero hole"]
  R7 --> R8["write_reconstructed_file:<br/>copy each block from its source"]
```

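As a worked example of the file layout that the server-side writer and the frontend reader must agree on, the sketch below computes header size, per-block offsets, and total file size under the padding rule described for sendFile. It assumes 8 KB blocks and 4-byte block numbers; the toy_ function names are hypothetical and are not the PostgreSQL helpers.

```c
#include <stddef.h>
#include <stdint.h>

#define TOY_BLCKSZ 8192u    /* assumption: default block size */

/* Header bytes: three uint32 fields, the block-number array, then padding. */
static size_t
toy_header_size(uint32_t num_blocks)
{
    size_t sz = 3 * sizeof(uint32_t) + sizeof(uint32_t) * (size_t) num_blocks;

    /* Pad so block data starts on a block boundary, but only if there is data. */
    if (num_blocks > 0 && sz % TOY_BLCKSZ != 0)
        sz += TOY_BLCKSZ - (sz % TOY_BLCKSZ);
    return sz;
}

/* Byte offset, within the incremental file, of the i-th stored block. */
static size_t
toy_block_offset(uint32_t num_blocks, uint32_t i)
{
    return toy_header_size(num_blocks) + (size_t) i * TOY_BLCKSZ;
}

/* Total size of an incremental file holding num_blocks blocks. */
static size_t
toy_file_size(uint32_t num_blocks)
{
    return toy_header_size(num_blocks) + (size_t) num_blocks * TOY_BLCKSZ;
}
```

Under these assumptions a header-only stub is 12 bytes, and a file carrying three blocks occupies four block-sized units: one padded header block plus three data blocks.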
Position hints (as of 2026-06-05, REL_18 273fe94)
| Symbol | File | Line |
|---|---|---|
| INCREMENTAL_MAGIC | src/include/backup/basebackup_incremental.h | 20 |
| MIN_CHUNK / MAX_CHUNK | basebackup_incremental.c | 39 |
| manifest_process_wal_range | basebackup_incremental.c | 138 |
| CreateIncrementalBackupInfo | basebackup_incremental.c | 152 |
| AppendIncrementalManifestData | basebackup_incremental.c | 194 |
| FinalizeIncrementalManifest | basebackup_incremental.c | 227 |
| PrepareForIncrementalBackup | basebackup_incremental.c | 263 |
| GetIncrementalFilePath | basebackup_incremental.c | 625 |
| GetFileBackupMethod | basebackup_incremental.c | 663 |
| GetIncrementalHeaderSize | basebackup_incremental.c | 881 |
| GetIncrementalFileSize | basebackup_incremental.c | 909 |
| parse_basebackup_options | basebackup.c | 698 |
| perform_base_backup | basebackup.c | 234 |
| sendFile | basebackup.c | 1573 |
| read_file_data_into_buffer | basebackup.c | 96 |
| push_to_sink | basebackup.c | 102 |
| pg_available_wal_summaries | walsummaryfuncs.c | 32 |
| pg_wal_summary_contents | walsummaryfuncs.c | 69 |
| UploadManifest | src/backend/replication/walsender.c | 667 |
| HandleUploadManifestPacket | walsender.c | 733 |
| reconstruct_from_incremental_file | src/bin/pg_combinebackup/reconstruct.c | 88 |
| make_incremental_rfile | reconstruct.c | 456 |
| find_reconstructed_block_length | reconstruct.c | 439 |
Source verification (as of 2026-06-05)
Verified facts
- The incremental file header is INCREMENTAL_MAGIC (0xd3ae1f0d), a block count, a truncation block length, and the relative block-number array. Verified in sendFile (server writer, basebackup.c) and make_incremental_rfile (frontend reader, reconstruct.c): both compute header_length the same way, and the magic constant is defined once in src/include/backup/basebackup_incremental.h.
- The free-space map fork is always backed up fully. Verified in GetFileBackupMethod: if (forknum == FSM_FORKNUM) return BACK_UP_FILE_FULLY;, with the in-source comment “The free-space map fork is not properly WAL-logged.” The FSM therefore ships in full on every backup, even when the relation's main fork ships incrementally.
- A relation with no WAL-logged changes is sent as a header-only stub, not skipped. Verified in GetFileBackupMethod: when BlockRefTableGetEntry returns NULL and the file is non-empty, it returns BACK_UP_FILE_INCREMENTALLY with *num_blocks_required = 0. The stub is what tells pg_combinebackup “this file is unchanged; take it whole from the prior backup.”
- The 90 % threshold downgrades an incremental file to a full one. Verified in GetFileBackupMethod: if (nblocks * BLCKSZ > size * 0.9) return BACK_UP_FILE_FULLY;. The in-source comment notes the threshold is not configurable and is a deliberate “don’t bother” guard, and that a file where every block changed is always sent fully.
- Incremental backups are refused unless summarize_wal is on. Verified in parse_basebackup_options: the incremental option errors with ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE (“incremental backups cannot be taken unless WAL summarization is enabled”) when summarize_wal is off.
- The manifest is parsed incrementally, retaining the trailing MIN_CHUNK bytes. Verified in AppendIncrementalManifestData: it triggers json_parse_manifest_incremental_chunk once the buffer would exceed MAX_CHUNK = 128 KiB, then memmoves the last MIN_CHUNK = 1 KiB to the buffer head. This keeps the closing checksum line intact across chunk boundaries.
- PrepareForIncrementalBackup waits for summarization and fails if any required summary is missing. Verified: WaitForWalSummarization(backup_state->startpoint) blocks until the summarizer reaches the backup start LSN, then GetWalSummaries plus per-timeline FilterWalSummaries collect the files, and a completeness check errors out on a gap rather than producing an unsafe partial delta.
- Reconstruction is most-recent-wins over the backup chain, with zero-fill for holes. Verified in reconstruct_from_incremental_file: the newest incremental claims its blocks first; the walk fills still-unassigned blocks from older layers and stops at the first full file; blocks left NULL below truncation_block_length are treated as zero-filled extensions.
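The per-file rules verified above can be condensed into one small decision function. What follows is a minimal C sketch, not the PostgreSQL source: sketch_backup_method, SketchMethod, and SKETCH_BLCKSZ are hypothetical names, the block size is assumed to be the default 8 KB, and the treatment of an empty file with no summary entry is an assumption. The real GetFileBackupMethod performs further checks that are left out here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_BLCKSZ 8192      /* assumption: default 8 KB block size */

/* Hypothetical stand-in for the server's full/incremental choice. */
typedef enum { SKETCH_FULL, SKETCH_INCREMENTAL } SketchMethod;

/*
 * Simplified per-file decision covering only the three rules listed above.
 * has_summary_entry is false when the WAL summaries recorded no change for
 * this relation fork; changed_blocks is the number of blocks they recorded.
 */
SketchMethod
sketch_backup_method(bool is_fsm_fork, bool has_summary_entry,
                     size_t changed_blocks, size_t file_size,
                     size_t *num_blocks_required)
{
    assert(num_blocks_required != NULL);
    *num_blocks_required = 0;

    /* Rule 1: the FSM fork is not properly WAL-logged, so send it whole. */
    if (is_fsm_fork)
        return SKETCH_FULL;

    /*
     * Rule 2: no recorded changes means a header-only stub with zero blocks.
     * Assumption: an empty file is simply sent as a (trivial) full file.
     */
    if (!has_summary_entry)
        return file_size == 0 ? SKETCH_FULL : SKETCH_INCREMENTAL;

    /* Rule 3: if more than 90 % of the bytes would be sent anyway, go full. */
    if ((double) changed_blocks * SKETCH_BLCKSZ > (double) file_size * 0.9)
        return SKETCH_FULL;

    *num_blocks_required = changed_blocks;
    return SKETCH_INCREMENTAL;
}
```

For a 100-block file, 10 changed blocks stay incremental while 95 changed blocks cross the threshold and fall back to a full copy; an unchanged non-empty file comes back as incremental with zero required blocks, which corresponds to the header-only stub.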
Cross-references / deferred
- WAL summarizer internals (the walsummarizer.c background worker, the pg_wal/summaries/ file format, BlockRefTable on-disk encoding) are out of scope here and live in postgres-archiving-walsummary.md. This doc treats the summaries purely as an input consumed by PrepareForIncrementalBackup.
- The full (non-incremental) BASE_BACKUP flow (the bbsink pipeline, perform_base_backup, tar framing, compression sinks, and the backup manifest writer) is covered in postgres-backup-basebackup.md. Here we only touch the incremental-specific branches of sendDir / sendFile.
- WAL itself as the source of truth (rmgr dispatch, record format, redo) is in postgres-wal-records-rmgr.md and postgres-recovery-redo.md.
- Not verified beyond reading the code: the actual end-to-end restore of a multi-link chain (pg_combinebackup producing a startable cluster) was read, not executed, for this revision.
Beyond PostgreSQL — Comparative Designs & Research Frontiers
PostgreSQL’s design lands at a specific, defensible point in the incremental-backup design space, and the contrasts are instructive.
Block change tracking: bitmap-at-write vs. summarize-from-log. Oracle
RMAN’s block change tracking file and SQL Server’s Differential Changed
Map are updated eagerly, in the write path, as pages are dirtied. That
makes the differential read cheap but imposes a small steady tax on every
write and couples the tracking structure to the live datafiles. PostgreSQL
instead derives the change set lazily, after the fact, by summarizing
WAL that it was already writing for durability. The cost moves off the hot
write path and onto a background worker, and the tracking artifact (WAL
summary files) is fully decoupled from both the datafiles and the backup
schedule — a summary for an LSN window is reusable by any backup whose
chain crosses it. The trade is latency: you cannot take an incremental
backup until the summarizer has caught up to the start LSN, which is
exactly why PrepareForIncrementalBackup must WaitForWalSummarization.
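Because the change set is derived from summaries, the backup is only safe if those summaries cover the whole LSN window between the prior backup and the new start LSN without a gap. The following is a minimal C sketch of such a coverage check under simplifying assumptions: a single timeline, summaries already sorted by start LSN, and hypothetical names (SketchSummary, sketch_summaries_cover). It illustrates the idea only and is not the logic of PrepareForIncrementalBackup.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical view of one WAL summary: it covers [start_lsn, end_lsn). */
typedef struct
{
    uint64_t    start_lsn;
    uint64_t    end_lsn;
} SketchSummary;

/*
 * Return true if the summaries, sorted by start_lsn, cover every LSN in
 * [from_lsn, to_lsn) with no gap. Assumption: one timeline only.
 */
bool
sketch_summaries_cover(const SketchSummary *s, size_t n,
                       uint64_t from_lsn, uint64_t to_lsn)
{
    uint64_t    covered = from_lsn;     /* everything below this is covered */

    assert(from_lsn <= to_lsn);

    for (size_t i = 0; i < n && covered < to_lsn; i++)
    {
        /* The next summary starts beyond what is covered so far: a gap. */
        if (s[i].start_lsn > covered)
            return false;

        /* Extend the covered prefix if this summary reaches further. */
        if (s[i].end_lsn > covered)
            covered = s[i].end_lsn;
    }

    return covered >= to_lsn;
}
```

With summaries [100, 200) and [200, 300), the window [100, 300) is covered; replace the second summary with [250, 300) and the check fails at LSN 200, which is the situation where refusing the backup is safer than shipping a partial delta.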
Eager vs. lazy reconstruction. Many enterprise tools (and Oracle’s
incrementally-updated backups) roll forward a synthetic full image at
backup time, so a restore is always a single full copy. PostgreSQL defers
reconstruction entirely to restore time in pg_combinebackup. This keeps
the server side purely a delta producer — no read amplification on the
primary to maintain a synthetic full — at the cost of a restore-time merge
whose work is proportional to chain length. The most-recent-wins overlay in
reconstruct_from_incremental_file is structurally identical to reading a
key through the levels of an LSM tree, or resolving a file through a
stack of overlay-filesystem layers: newest layer that has the block
wins, and a “tombstone” (here, the truncation_block_length clamp) bounds
what older layers may contribute.
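The most-recent-wins overlay can be made concrete with a small source-map builder. This is a hedged C sketch, not the code of reconstruct_from_incremental_file: SketchLayer, sketch_reconstruct, and SKETCH_ZERO_FILL are hypothetical names, layers are ordered newest first, and the output-length rule (the newest layer's truncation length, extended by any higher block that layer carries) is an assumption of this sketch.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_ZERO_FILL (-1)   /* no layer supplies this block */

/* Hypothetical description of one file version in the backup chain. */
typedef struct
{
    int             is_full;    /* 1 = full file, 0 = incremental */
    unsigned        nblocks;    /* full: length in blocks; incremental:
                                 * number of entries in blocks[] */
    const unsigned *blocks;     /* incremental only: changed block numbers */
    unsigned        truncation_block_length;    /* incremental only */
} SketchLayer;

/*
 * Fill source[b] with the index of the layer that supplies block b, or
 * SKETCH_ZERO_FILL. layers[0] is the newest and must be incremental.
 * Returns the reconstructed length in blocks, or 0 if it exceeds cap.
 */
unsigned
sketch_reconstruct(const SketchLayer *layers, size_t nlayers,
                   int *source, unsigned cap)
{
    assert(nlayers > 0 && !layers[0].is_full);

    const SketchLayer *latest = &layers[0];
    unsigned    length = latest->truncation_block_length;

    /* Assumed length rule: the clamp, extended by the newest layer's blocks. */
    for (unsigned i = 0; i < latest->nblocks; i++)
        if (latest->blocks[i] >= length)
            length = latest->blocks[i] + 1;
    if (length > cap)
        return 0;

    for (unsigned b = 0; b < length; b++)
        source[b] = SKETCH_ZERO_FILL;

    /* The newest layer claims its blocks first. */
    for (unsigned i = 0; i < latest->nblocks; i++)
        source[latest->blocks[i]] = 0;

    /* Older layers fill only unclaimed blocks below the truncation clamp. */
    for (size_t l = 1; l < nlayers; l++)
    {
        const SketchLayer *s = &layers[l];

        if (s->is_full)
        {
            for (unsigned b = 0;
                 b < latest->truncation_block_length && b < s->nblocks; b++)
                if (source[b] == SKETCH_ZERO_FILL)
                    source[b] = (int) l;
            break;              /* nothing older than a full file matters */
        }

        for (unsigned i = 0; i < s->nblocks; i++)
        {
            unsigned    b = s->blocks[i];

            if (b < latest->truncation_block_length &&
                source[b] == SKETCH_ZERO_FILL)
                source[b] = (int) l;
        }
    }

    return length;
}
```

Take a chain of a newest incremental (block 2, truncation length 5), an older incremental (blocks 1, 4, 5, truncation length 6), and a 4-block full file. The result is a 5-block file whose blocks 0 through 4 come from layers 2, 1, 0, 2, 1. Block 5 of the older incremental is discarded by the clamp, which plays the tombstone role described above.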
Granularity and the WAL-logging assumption. Block-granular tracking is
only sound for data that is fully WAL-logged; the FSM exception
(forknum == FSM_FORKNUM → full backup) is the visible seam where that
assumption breaks. A research-frontier question is whether unlogged or
minimally-logged objects could be incrementally captured via a different
channel; today PostgreSQL simply sends them whole. Relatedly, the file-level
“created after the prior backup” guard shows how change-tracking from the
log must be defended against the namespace changing underneath it (files
deleted and recreated under the same name), not just block contents.
Where the literature sits. The conceptual backbone is ARIES (Mohan et
al., 1992): the WAL is an authoritative, ordered, replayable record of every
page change, which is precisely what makes “enumerate what changed between
two LSNs” a well-posed question. Database Internals (Petrov, 2019) frames
the WAL as the single source of truth that recovery, replication, and now
backup-diffing all consume. Incremental backup is best understood as a
third consumer of redo information, parallel to crash recovery (which
applies the changes) and physical replication (which streams them) —
here the log is replayed not to apply or ship changes but to enumerate
them. The active frontiers are largely operational: bounding chain length
automatically, parallelizing pg_combinebackup, and integrating
incremental physical backup with delta-aware object storage so that the
on-disk delta and the archived delta are the same bytes.
Sources
- PostgreSQL source (REL_18_STABLE, commit 273fe94, 2026-06-05):
  - src/backend/backup/basebackup_incremental.c: manifest ingestion (CreateIncrementalBackupInfo, AppendIncrementalManifestData), PrepareForIncrementalBackup, and the per-file decision GetFileBackupMethod.
  - src/backend/backup/basebackup.c: parse_basebackup_options (the summarize_wal precondition), sendDir’s incremental branch, and sendFile’s incremental header + block-seek loop.
  - src/backend/backup/walsummaryfuncs.c: the SQL-callable pg_available_wal_summaries / pg_wal_summary_contents diagnostics.
  - src/backend/replication/walsender.c: UploadManifest / HandleUploadManifestPacket (the UPLOAD_MANIFEST protocol).
  - src/bin/pg_combinebackup/reconstruct.c: reconstruct_from_incremental_file, make_incremental_rfile, the sourcemap/offsetmap construction and most-recent-wins chain walk.
  - src/include/backup/basebackup_incremental.h: INCREMENTAL_MAGIC, FileBackupMethod, and the server-side declarations.
- Theory / textbook anchors:
  - C. Mohan et al., ARIES: A Transaction Recovery Method… (1992): WAL as the authoritative, replayable record of change. See knowledge/research/dbms-general/ and the bibliography plan dbms-papers/aries.md.
  - A. Petrov, Database Internals (2019), ch. 3 and the recovery discussion: WAL as the single source of truth across recovery, replication, and backup. Captured in knowledge/research/dbms-general/database-internals.md.
- Sibling code-analysis docs (this tree):
  - postgres-archiving-walsummary.md (the WAL summarizer that produces the summaries), postgres-backup-basebackup.md (the full backup pipeline and bbsink chain), postgres-wal-records-rmgr.md and postgres-recovery-redo.md (WAL as the underlying change record).