PostgreSQL Base Backup — BASE_BACKUP, Backup Label, and the Streaming Protocol

A backup is the second half of the durability story. The WAL (write-ahead log) protects committed transactions against a crash — but only relative to some baseline image of the database. If the disk holding the heap files is lost, replaying WAL against nothing reconstructs nothing. The baseline image is the backup, and the discipline of taking it while the system keeps running is what Database System Concepts (Silberschatz, 7e) treats in ch. 19 (“Recovery System”) under the heading of archival dumps.

The naive form of a dump is the quiescent one: stop all transactions, flush every buffer, copy the files, resume. DSC names the cost plainly — “the database must be brought to a halt” — and that halt is unacceptable for any system with a real availability requirement. The interesting design problem is the fuzzy dump (DSC §19.x, “archival dump”): copy the data files while transactions continue to modify them, accepting that the resulting image is internally inconsistent, and rely on the log to repair it. The textbook frames the dump as just another point in the same recovery framework that handles crashes:

“An archival dump of the contents of the database is a copy of the database contents at some point in time. … A fuzzy dump allows transactions to be active while the dump is in progress.”

Three theoretical properties make this work, and every base-backup implementation is an instance of them:

  1. The dump need not be consistent — it must only be recoverable. A file copied at wall-clock time t₁ and another copied at t₂ > t₁ reflect different commit frontiers; a single 8 KB page may even be read torn, half old and half new, if a writer overlaps the reader. None of this matters as long as the log contains every change from a known redo point (a checkpoint at or before the start of the dump) through the end of the dump, and replay is idempotent.

  2. Redo must be idempotent and physical. Replaying a logged page change over a torn page must produce the correct page regardless of what the torn copy contained. This is exactly the physiological / physical redo that ARIES (Mohan et al. 1992) guarantees, combined with full-page images: the first modification of a page after the redo point logs the entire page, so replay can overwrite a torn copy wholesale rather than trusting it.

  3. The recovery boundary must be self-describing. The restored image must carry, somewhere, the answer to “where does replay start, and how far must it go before the database is consistent?” A crash recovery reads this from the control file; a backup must inject an analogous breadcrumb, because the control file in the copy is itself part of the fuzzy image.

DSC also distinguishes the two consumers of a backup. One is disaster recovery — restore the dump, replay archived WAL, and you are back. The other is point-in-time recovery (PITR) — replay only up to a chosen target time/LSN, abandoning later transactions, which is how you undo an erroneous bulk DELETE. Both consume the identical artifact; only the replay stop condition differs. And a third consumer, peculiar to replication, bootstraps a streaming standby: the standby starts from the base backup and then continues replay indefinitely from the live WAL stream rather than stopping. PostgreSQL’s BASE_BACKUP serves all three because it produces the one artifact they share: a fuzzy physical image plus the metadata to drive idempotent redo over it.

A fourth idea is worth pulling forward from the recovery chapter, because it explains why the start operation forces a checkpoint rather than simply noting the current LSN. Recovery in an ARIES-style system is a three-pass affair — analysis, redo, undo — anchored at a checkpoint. The checkpoint bounds how far back redo must scan: replay can begin at the checkpoint’s redo point and trust that everything before it is already on disk. A base backup wants the smallest possible redo interval and a known origin, so it forces a fresh checkpoint at start and records that checkpoint’s location as the recovery origin. Without the forced checkpoint, the restore would have to begin redo at whatever stale checkpoint the copied control file happened to name — possibly far in the past, possibly referencing WAL the operator no longer has.

There is also a subtlety the textbook only gestures at but every real implementation must handle: the dump is fuzzy at the granularity of a single page, not just across files. A reader copying an 8 KB page can race a writer flushing that same page, producing a half-old/half-new image. The log-based fix (full-page images) handles this only if the page is logged in full at least once after the redo point — which is exactly why property (2) couples “force a checkpoint” with “force full-page writes”: the checkpoint bounds the interval, and the full-page writes guarantee that any page touched within it can be reconstructed wholesale.
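The interplay of full-page images and idempotent redo described above can be sketched in a few lines. This is a toy model, not PostgreSQL's redo code: the `Page` and `WalRecord` layouts, `DEMO_PAGE_SIZE`, and `redo_record` are all hypothetical names chosen for illustration. It shows only the two rules that matter here: a full-page image overwrites the page regardless of its contents, and a delta record applies only when the page's LSN says it has not been applied yet.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define DEMO_PAGE_SIZE 8192     /* stand-in for BLCKSZ; an assumption */

/* Hypothetical page layout: an LSN stamp followed by payload bytes. */
typedef struct Page {
    uint64_t lsn;
    uint8_t  data[DEMO_PAGE_SIZE];
} Page;

/* Hypothetical WAL record: either a full-page image or a one-byte delta. */
typedef struct WalRecord {
    uint64_t       lsn;
    bool           has_full_page_image;
    const uint8_t *image;       /* DEMO_PAGE_SIZE bytes when an image is present */
    uint32_t       offset;      /* delta target when no image is present */
    uint8_t        value;
} WalRecord;

/* Apply one record; replaying the same sequence again yields the same page. */
void redo_record(Page *page, const WalRecord *rec)
{
    if (rec->has_full_page_image) {
        /* Overwrite wholesale: whatever the torn copy held is irrelevant. */
        memcpy(page->data, rec->image, DEMO_PAGE_SIZE);
        page->lsn = rec->lsn;
        return;
    }
    assert(rec->offset < DEMO_PAGE_SIZE);
    /* A delta applies only if the page has not already absorbed it. */
    if (page->lsn < rec->lsn) {
        page->data[rec->offset] = rec->value;
        page->lsn = rec->lsn;
    }
}
```

Starting from a page filled with garbage (a stand-in for a torn copy), replaying the image and then the delta produces the correct page, and replaying the same pair a second time leaves it unchanged.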

The rest of this document traces how REL_18 realizes properties (1)–(3): the start/stop bracket that pins the redo point and forces full-page writes (property 2), the file walk that produces the fuzzy image (property 1), and backup_label (property 3).

The textbook gives the model: a fuzzy archival dump made recoverable by the log. This section names the engineering conventions that production systems converge on to make a fuzzy physical backup safe, online, and streamable. PostgreSQL’s specific choices in the “PostgreSQL's Approach” section are one set of dials within this shared space.

Physical vs. logical backup. A logical backup (PostgreSQL’s pg_dump) re-derives SQL statements or row data and is portable across versions and architectures but slow and unable to seed a binary-replicated standby. A physical backup copies the on-disk file format verbatim: fast, byte-exact, replayable, but version- and platform-locked. Base backup is the physical kind. The two are complementary, not competing.

Bracket the copy with start/stop markers. Every online physical backup brackets the file copy between a start and a stop operation. Start does three jobs: (a) pick a redo point — usually by forcing a checkpoint — so replay has a defined origin; (b) force full-page writes for the duration so torn pages are self-healing; (c) record the start LSN. Stop records the end LSN and ensures the WAL spanning [start, stop] is durably retained (archived), because that WAL is mandatory for the restore. The backup is useless without both endpoints and the WAL between them.

Copy at the file granularity, skip the volatile. The walker copies the data directory file by file, but a running cluster has files that must not be copied verbatim: temporary files, the relation-cache init file (rebuilt on startup), per-backend stats, replication-slot state, postmaster.pid. Each engine maintains an exclusion list. The WAL directory itself is special — its contents are reconstructed from the archive on restore, so only the empty directory (for permissions) is copied.

Stream, don’t stage. A backup of a multi-terabyte cluster cannot be staged to a temp file on the server. The bytes are streamed to the consumer — over a network socket to pg_basebackup, or to a server-side file — as they are read, with backpressure. A clean way to express this is a chain of sinks: read a buffer, hand it to a sink, and let each sink in a stack transform (compress), meter (throttle, progress), or transport (socket, file) the bytes before passing to the next. This is the same decorator pattern other systems use for backup pipelines.

Frame the stream so the consumer can parse it. A single byte stream must carry multiple logical objects: several tar archives (one per tablespace), a manifest, periodic progress reports. The framing protocol tags each chunk so the client can demultiplex — “new archive”, “archive data”, “manifest”, “progress”. PostgreSQL layers this on the COPY sub-protocol of the wire protocol.

Make the restore self-describing. Because the copied control file is part of the fuzzy image, the backup injects an out-of-band breadcrumb — a small text file naming the start LSN, the checkpoint redo location, the timeline, and whether the backup came from a standby. Recovery reads this instead of trusting the control file’s idea of where to start. PostgreSQL calls it backup_label.

Incremental backup needs a change oracle. A full physical backup re-copies unchanged data. To copy only what changed since a prior backup, the system needs an oracle that answers “which blocks of relation R changed in the LSN interval since the last backup?” The two common oracles are a maintained bitmap of dirty blocks and a scan of the WAL for block references. PostgreSQL chose the latter, materialized as WAL summaries feeding a block-reference table.

The restore is a guided crash recovery. The reason a physical backup can be fuzzy is that the restore is not a copy-back; it is a recovery. The operator lays the files down, deletes nothing, and starts the server; the startup process notices the recovery breadcrumb, fetches archived WAL (via restore_command or a pre-staged archive), and replays from the redo point. Because redo is idempotent and uses full-page images, it does not matter that file A was copied before file B or that some page was torn: the log overwrites whatever was there. The server declares the cluster consistent once it has replayed at least up to the backup’s end LSN; only then will it accept connections (for disaster recovery) or open read-only/hot-standby (for a replica). This is the single fact that unifies the three consumers: disaster recovery replays to the end of the archive, PITR replays to a configured recovery_target, and a standby replays to the end and then keeps going from the live stream, but all three start from the same fuzzy image plus the same breadcrumb. An operator who skips the WAL, or deletes backup_label thinking it is cruft, gets a silently corrupt database, so the design works to make that mistake difficult to commit by accident.

flowchart TD
  subgraph Bracket["Online backup bracket (every engine)"]
    S["START<br/>pick redo point (checkpoint)<br/>force full-page writes<br/>record start LSN"]
    C["Copy data files<br/>fuzzy, online, file-granular<br/>exclude volatile files"]
    E["STOP<br/>record end LSN<br/>ensure WAL [start,stop] archived"]
    S --> C --> E
  end
  Bracket --> R["Restore: lay down files<br/>replay WAL from redo point<br/>to end LSN -> consistent"]
  R --> U1["Disaster recovery<br/>replay to end of archive"]
  R --> U2["PITR<br/>replay to chosen target"]
  R --> U3["Standby bootstrap<br/>replay then follow live stream"]

PostgreSQL takes a base backup as a replication-protocol command. A client (canonically pg_basebackup, including its --incremental mode, though any libpq client speaking the replication protocol will do) connects in replication mode and issues BASE_BACKUP [ options ]. The walsender backend parses it into a BaseBackupCmd and calls SendBaseBackup. There is no SQL-callable equivalent that streams files; the SQL functions pg_backup_start() / pg_backup_stop() (the low-level API) only place the bracket and leave the file copy to the operator (tar, rsync, etc.). Both paths share the same C primitives, do_pg_backup_start and do_pg_backup_stop, in xlog.c.

The bracket: do_pg_backup_start. Start does the three textbook jobs. First it increments a shared counter, runningBackups, under the WAL insertion locks, which is what forces full-page writes globally — the torn-page defense (property 2):

// do_pg_backup_start — src/backend/access/transam/xlog.c
WALInsertLockAcquireExclusive();
XLogCtl->Insert.runningBackups++;
WALInsertLockRelease();

The comment in the source states the rationale exactly: “it’s quite possible for the backup dump to obtain a ‘torn’ (partially written) copy of a database page if it reads the page concurrently with our write to the same page. This can be fixed as long as the first write to the page in the WAL sequence is a full-page write.” Then it forces a WAL switch and a checkpoint to pin the redo point, and records state->startpoint / state->starttli. The checkpoint can be fast (immediate) or spread (throttled over checkpoint_completion_target), selected by the CHECKPOINT { fast | spread } option (default spread). Note REL_18 uses the spelling fast/spread; the parser still maps them onto the boolean opt->fastcheckpoint.

The bracket: do_pg_backup_stop. Stop records the end point and, unless the client asked not to wait, blocks until the WAL covering [start, stop] has been archived — because that WAL is mandatory for the restore. It decrements runningBackups, dropping the forced-full-page-write obligation.

The cleanup discipline around the bracket is strict. perform_base_backup wraps everything between start and stop in PG_ENSURE_ERROR_CLEANUP so that any error runs do_pg_abort_backup, which decrements runningBackups — a “leaked” backup counter would force full-page writes forever:

// perform_base_backup — src/backend/backup/basebackup.c
do_pg_backup_start(opt->label, opt->fastcheckpoint, &state.tablespaces,
                   backup_state, tablespace_map);
state.startptr = backup_state->startpoint;
state.starttli = backup_state->starttli;
PG_ENSURE_ERROR_CLEANUP(do_pg_abort_backup, BoolGetDatum(false));
{
    /* ... walk $PGDATA, stream every file ... */
    do_pg_backup_stop(backup_state, !opt->nowait);
}
PG_END_ENSURE_ERROR_CLEANUP(do_pg_abort_backup, BoolGetDatum(false));
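The cleanup discipline can be illustrated with a small stand-alone model. This is a sketch under stated assumptions, not the PostgreSQL macros: `running_backups`, `raise_error`, and `run_backup` are hypothetical names, and `setjmp`/`longjmp` stand in for the `ereport(ERROR)` unwinding that `PG_ENSURE_ERROR_CLEANUP` hooks into. The point it makes is narrow: whichever way the bracket exits, the counter returns to its prior value.

```c
#include <assert.h>
#include <setjmp.h>
#include <stdbool.h>

/* Hypothetical stand-in for XLogCtl->Insert.runningBackups. */
int running_backups = 0;

static jmp_buf error_context;

/* Full-page writes stay forced while any backup is counted as running. */
bool full_page_writes_forced(void)
{
    return running_backups > 0;
}

/* Stand-in for ereport(ERROR): unwind to the nearest handler. */
static void raise_error(void)
{
    longjmp(error_context, 1);
}

/* Returns true on success, false on failure; the counter is released either way. */
bool run_backup(bool fail_midway)
{
    running_backups++;                  /* start: force full-page writes */

    if (setjmp(error_context) != 0) {
        running_backups--;              /* error path: the abort callback's job */
        return false;
    }

    assert(full_page_writes_forced());
    if (fail_midway)
        raise_error();                  /* simulated failure during the file walk */

    running_backups--;                  /* normal path: stop releases the counter */
    return true;
}
```

Calling `run_backup(true)` and then checking `full_page_writes_forced()` shows the obligation has been dropped even though the backup never reached its stop step.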

backup_label: the self-describing breadcrumb (property 3). The very first file written into base.tar is backup_label, built from the BackupState by build_backup_content. It is text and names START WAL LOCATION (the redo point), CHECKPOINT LOCATION (the position of the checkpoint record itself), the backup method, the start time, the label string, and whether the backup was taken from a standby. On restore, the startup process reads backup_label instead of trusting the copied pg_control and learns from it where redo must begin. If the file is present but the checkpoint record it names cannot be read, startup fails rather than falling back to pg_control, which is precisely what stops an operator from accidentally treating a fuzzy image as a crashed-but-clean cluster. backup_label is sent first; pg_control is sent last so it reflects a consistent post-walk state:

// perform_base_backup — src/backend/backup/basebackup.c
bbsink_begin_archive(sink, "base.tar");
/* In the main tar, include the backup_label first... */
backup_label = build_backup_content(backup_state, false);
sendFileWithContent(sink, BACKUP_LABEL_FILE, backup_label, -1, &manifest);
/* Then the tablespace_map file, if required... */
if (opt->sendtblspcmapfile)
    sendFileWithContent(sink, TABLESPACE_MAP, tablespace_map->data, -1, &manifest);
/* Then the bulk of the files... */
sendDir(sink, ".", 1, false, state.tablespaces, sendtblspclinks, &manifest, InvalidOid, ib);
/* ... and pg_control after everything else. */
sendFile(sink, XLOG_CONTROL_FILE, XLOG_CONTROL_FILE, &statbuf, ...);

The exact text of backup_label is worth seeing, because it is the restore contract. build_backup_content formats it field by field from the BackupState. The two LSN fields are the load-bearing ones: START WAL LOCATION is the redo point of the forced checkpoint (state->startpoint, where replay begins), and CHECKPOINT LOCATION is the position of the checkpoint record itself (state->checkpointloc). BACKUP FROM tells recovery whether to expect a standby’s quirks, and the INCREMENTAL FROM lines appear only for incremental backups:

// build_backup_content — src/backend/access/transam/xlogbackup.c
appendStringInfo(result, "START WAL LOCATION: %X/%X (file %s)\n",
                 LSN_FORMAT_ARGS(state->startpoint), startxlogfile);
appendStringInfo(result, "CHECKPOINT LOCATION: %X/%X\n",
                 LSN_FORMAT_ARGS(state->checkpointloc));
appendStringInfoString(result, "BACKUP METHOD: streamed\n");
appendStringInfo(result, "BACKUP FROM: %s\n",
                 state->started_in_recovery ? "standby" : "primary");
appendStringInfo(result, "START TIME: %s\n", startstrbuf);
appendStringInfo(result, "LABEL: %s\n", state->name);
appendStringInfo(result, "START TIMELINE: %u\n", state->starttli);
/* ... STOP fields here only when ishistoryfile ... */
if (!XLogRecPtrIsInvalid(state->istartpoint))
{
    appendStringInfo(result, "INCREMENTAL FROM LSN: %X/%X\n",
                     LSN_FORMAT_ARGS(state->istartpoint));
    appendStringInfo(result, "INCREMENTAL FROM TLI: %u\n",
                     state->istarttli);
}

When recovery sees a backup_label on startup, it reads the checkpoint record at CHECKPOINT LOCATION and begins redo at that record's redo pointer, which matches START WAL LOCATION (not at the redo pointer in the copied pg_control, which may name a different checkpoint). It treats the cluster as in-backup-recovery until it has replayed past the backup end LSN; only then is the image consistent. This is the mechanism that makes property (3) of the theory section concrete: the breadcrumb overrides the untrustworthy control file. The same content, with STOP fields added (ishistoryfile = true), becomes the <startwalfile>.<offset>.backup history file that do_pg_backup_stop drops into pg_wal for archival bookkeeping.
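Reading the two LSN fields back out of the label text is a matter of matching the line prefix and parsing the `%X/%X` pair. The sketch below is an illustration only, not the server's own label reader: `LabelInfo`, `parse_lsn_field`, and `parse_backup_label` are hypothetical names, and the LSN values used with it are made up.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical holder for the two LSNs recovery needs from backup_label. */
typedef struct LabelInfo {
    uint64_t start_wal_location;
    uint64_t checkpoint_location;
} LabelInfo;

/* Combine the hi/lo halves printed by %X/%X into one 64-bit LSN. */
static uint64_t make_lsn(unsigned int hi, unsigned int lo)
{
    return ((uint64_t) hi << 32) | lo;
}

/* Find the line beginning with key and parse the %X/%X value after it. */
static bool parse_lsn_field(const char *label, const char *key, uint64_t *out)
{
    size_t      keylen = strlen(key);
    const char *line = label;

    while (line != NULL && *line != '\0') {
        if (strncmp(line, key, keylen) == 0) {
            unsigned int hi, lo;

            if (sscanf(line + keylen, " %X/%X", &hi, &lo) != 2)
                return false;
            *out = make_lsn(hi, lo);
            return true;
        }
        line = strchr(line, '\n');
        if (line != NULL)
            line++;                     /* step past the newline */
    }
    return false;
}

/* Returns true only if both load-bearing fields are present and well formed. */
bool parse_backup_label(const char *label, LabelInfo *info)
{
    assert(label != NULL && info != NULL);
    return parse_lsn_field(label, "START WAL LOCATION:", &info->start_wal_location)
        && parse_lsn_field(label, "CHECKPOINT LOCATION:", &info->checkpoint_location);
}
```

A label missing either field is rejected outright, which mirrors the contract in spirit: a restore that cannot name its starting point should not proceed on a guess.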

tablespace_map. A cluster may have tablespaces living outside $PGDATA, reached by symlinks under pg_tblspc/. The backup cannot blindly copy absolute symlinks (the restore target’s paths differ), so it records the symlink-to-path mapping in tablespace_map and emits each tablespace as its own tar archive named <oid>.tar. The main data directory is the archive whose ti->path == NULL, and it is always sent last so the WAL-append step (below) has a place to attach.

The sink chain. SendBaseBackup assembles the streaming pipeline as a stack of bbsink decorators, base sink innermost:

// SendBaseBackup — src/backend/backup/basebackup.c
sink = bbsink_copystream_new(opt.send_to_client);
if (opt.target_handle != NULL)
    sink = BaseBackupGetSink(opt.target_handle, sink);
if (opt.maxrate > 0)
    sink = bbsink_throttle_new(sink, opt.maxrate);
if (opt.compression == PG_COMPRESSION_GZIP)
    sink = bbsink_gzip_new(sink, &opt.compression_specification);
/* ... lz4, zstd ... */
sink = bbsink_progress_new(sink, opt.progress);

The outermost sink (bbsink_progress) is what basebackup.c calls; each sink does its job and forwards to the one it wraps. The bbsink_state struct carries the shared cursor — tablespaces, bytes_done, bytes_total, startptr, starttli — so any sink can report progress.

Framing on the wire. The innermost bbsink_copystream turns the byte stream into wire-protocol messages. It first sends a result set (the start LSN + tablespace list), then opens a single COPY OUT stream, and frames every subsequent chunk as a CopyData message whose first payload byte is a type tag: n = new archive, d = archive/manifest data, m = start of manifest, p = progress report. A clever alignment trick lets the type byte sit just before the aligned data buffer so the whole message ships in one pq_putmessage with no extra copy:

// bbsink_copystream_begin_backup — src/backend/backup/basebackup_copy.c
buf = palloc(mysink->base.bbs_buffer_length + MAXIMUM_ALIGNOF);
mysink->msgbuffer = buf + (MAXIMUM_ALIGNOF - 1);
mysink->base.bbs_buffer = buf + MAXIMUM_ALIGNOF;
mysink->msgbuffer[0] = 'd'; /* archive or manifest data */
SendXlogRecPtrResult(state->startptr, state->starttli);
SendTablespaceList(state->tablespaces);
pq_puttextmessage(PqMsg_CommandComplete, "SELECT");
SendCopyOutResponse();
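The alignment trick is easier to see in isolation. The following sketch reproduces only the pointer arithmetic, with `malloc` in place of `palloc`; `FramedBuffer`, `framed_buffer_init`, and `DEMO_ALIGNOF` (a stand-in for `MAXIMUM_ALIGNOF`, assumed to be 8 here) are hypothetical names. The payload pointer lands on an aligned address and the type byte occupies the single byte immediately before it, so tag plus payload form one contiguous message.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define DEMO_ALIGNOF 8          /* stand-in for MAXIMUM_ALIGNOF; an assumption */

/* Hypothetical buffer whose payload is aligned and preceded by one tag byte. */
typedef struct FramedBuffer {
    char  *base;                /* pointer to hand back to free() */
    char  *message;             /* tag byte, followed directly by the payload */
    char  *payload;             /* aligned region the producer fills */
    size_t payload_length;
} FramedBuffer;

/* Returns 0 on success, -1 if allocation fails. */
int framed_buffer_init(FramedBuffer *fb, size_t payload_length)
{
    assert(fb != NULL && payload_length > 0);

    /* Over-allocate by one alignment unit to make room for the tag byte. */
    fb->base = malloc(payload_length + DEMO_ALIGNOF);
    if (fb->base == NULL)
        return -1;

    fb->message = fb->base + (DEMO_ALIGNOF - 1);    /* last byte before payload */
    fb->payload = fb->base + DEMO_ALIGNOF;          /* aligned if base is */
    fb->payload_length = payload_length;
    fb->message[0] = 'd';                           /* archive or manifest data */
    return 0;
}

void framed_buffer_free(FramedBuffer *fb)
{
    free(fb->base);
    fb->base = NULL;
}
```

Sending `n` payload bytes then means sending `n + 1` bytes starting at `message`, with no intermediate copy to prepend the tag.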
flowchart TD
  Cmd["BASE_BACKUP cmd<br/>over replication protocol"] --> SBB["SendBaseBackup<br/>parse opts, build sink chain"]
  SBB --> PBB["perform_base_backup"]
  PBB --> Start["do_pg_backup_start<br/>runningBackups++, checkpoint<br/>= redo point + full-page writes"]
  Start --> Label["sendFileWithContent backup_label<br/>then tablespace_map"]
  Label --> Walk["sendDir walks PGDATA<br/>sendFile per tar member"]
  Walk --> Ctl["sendFile pg_control LAST"]
  Ctl --> Stop["do_pg_backup_stop<br/>end LSN, wait for WAL archived"]
  Stop --> WAL["if WAL: append pg_wal segments<br/>start..end to base.tar"]
  WAL --> Man["SendBackupManifest"]
  Man --> End["bbsink_end_backup<br/>CopyDone + end LSN/tli"]
  subgraph Sinks["bbsink chain (decorators)"]
    direction LR
    Prog["progress"] --> Comp["gzip/lz4/zstd"] --> Thr["throttle"] --> Tgt["target"] --> Cs["copystream -> wire"]
  end
  Walk -.bbsink_archive_contents.-> Sinks

This section follows the call flow from the replication command down to a single tar member, grouped by subsystem. Symbols are the durable anchor; line numbers live only in the position-hint table at the end.

The walsender receives BASE_BACKUP and dispatches to SendBaseBackup (in basebackup.c). It rejects a concurrent backup in the same session (get_backup_status() == SESSION_BACKUP_RUNNING), parses the options with parse_basebackup_options, flips the walsender into WALSNDSTATE_BACKUP, validates the incremental precondition, and builds the sink chain before handing off to perform_base_backup.

parse_basebackup_options walks the DefElem option list with a duplicate-detection boolean per option. The notable options: LABEL, PROGRESS, CHECKPOINT { fast | spread }, WAIT, WAL (include pg_wal segments), INCREMENTAL, MAX_RATE, TABLESPACE_MAP, VERIFY_CHECKSUMS, MANIFEST { yes | no | force-encode }, MANIFEST_CHECKSUMS, TARGET / TARGET_DETAIL, COMPRESSION / COMPRESSION_DETAIL. A guard rejects an incremental backup when WAL summarization is off:

// parse_basebackup_options — src/backend/backup/basebackup.c
opt->incremental = defGetBoolean(defel);
if (opt->incremental && !summarize_wal)
    ereport(ERROR,
            (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
             errmsg("incremental backups cannot be taken unless WAL summarization is enabled")));

Target resolution decides where the bytes go: no TARGET means stream to the client over COPY (send_to_client = true) and also sets use_copytblspc for backward compatibility; TARGET 'client' is explicit client streaming; any other target name resolves through BaseBackupGetTargetHandle to a server-side sink (e.g. server, blackhole). The parser also enforces the cross-option constraints that cannot be expressed per-option: MANIFEST_CHECKSUMS requires a manifest; COMPRESSION_DETAIL requires COMPRESSION; TARGET_DETAIL requires TARGET; and when no manifest is requested the checksum type is forced to CHECKSUM_TYPE_NONE. Compression specifications are parsed and validated here too (parse_compress_specification / validate_compress_specification), so an invalid zstd:level=99 is rejected before any file is read rather than failing mid-stream.
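The cross-option constraints amount to a handful of "B requires A" checks run after every option has been seen. A minimal sketch, assuming a flattened options struct: `BackupOptions` and `check_option_combinations` are hypothetical, and the returned strings are illustrative rather than the server's actual error messages.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, flattened view of the parsed options; not the real options struct. */
typedef struct BackupOptions {
    bool        manifest;               /* a manifest was requested */
    bool        manifest_checksums;     /* MANIFEST_CHECKSUMS was given */
    const char *compression;            /* NULL when COMPRESSION is absent */
    const char *compression_detail;     /* NULL when COMPRESSION_DETAIL is absent */
    const char *target;                 /* NULL when TARGET is absent */
    const char *target_detail;          /* NULL when TARGET_DETAIL is absent */
} BackupOptions;

/* Returns NULL when the combination is valid, else an illustrative message. */
const char *check_option_combinations(const BackupOptions *opt)
{
    assert(opt != NULL);

    if (opt->manifest_checksums && !opt->manifest)
        return "manifest checksums require a backup manifest";
    if (opt->compression_detail != NULL && opt->compression == NULL)
        return "compression detail requires compression";
    if (opt->target_detail != NULL && opt->target == NULL)
        return "target detail requires a target";
    return NULL;
}
```

Running these checks once, after the per-option loop, keeps each option's parsing independent of the order in which the client listed them.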

SendBaseBackup then guards the incremental contract: a full backup with a stray uploaded manifest just ignores it (ib = NULL), but an incremental backup with no prior UPLOAD_MANIFEST is an error, as is an incremental request when WAL summarization is disabled (caught earlier in option parsing). Only after these checks does it assemble the sink chain and call perform_base_backup inside a PG_TRY/PG_FINALLY that guarantees bbsink_cleanup runs even on error — the sink-level analogue of the backup-counter cleanup discipline.

2. The start/stop bracket (xlog.c, xlogbackup.c)

perform_base_backup first switches CurrentResourceOwner to the aux process resource owner (it needs one because the backup manifest is spooled through a BufFile), initializes the manifest, allocates a BackupState, and calls do_pg_backup_start. That function enforces wal_level >= replica, marks the backup active by bumping XLogCtl->Insert.runningBackups under the WAL insert locks (forcing full-page writes), forces an XLOG switch, runs the checkpoint, and fills BackupState->startpoint / starttli. The redo point recorded there is the LSN replay will start from.

build_backup_content (in xlogbackup.c) serializes the BackupState into the backup_label text — START WAL LOCATION, CHECKPOINT LOCATION, BACKUP METHOD, BACKUP FROM, START TIME, LABEL, START TIMELINE — and, when ishistoryfile is true, the additional STOP fields for the .backup history file written by do_pg_backup_stop.

do_pg_backup_stop records stoppoint / stoptli, decrements runningBackups, and (when waitforarchive) blocks until the WAL up to the stop LSN is archived. It re-acquires the WAL insert locks to decrement the counter atomically and clears sessionBackupState before releasing them, with the source warning that a CHECK_FOR_INTERRUPTS() between the two updates could leave them inconsistent and break a later do_pg_abort_backup:

// do_pg_backup_stop — src/backend/access/transam/xlog.c
WALInsertLockAcquireExclusive();
Assert(XLogCtl->Insert.runningBackups > 0);
XLogCtl->Insert.runningBackups--;
sessionBackupState = SESSION_BACKUP_NONE;
WALInsertLockRelease();

It also re-checks, for a backup taken on a standby, that the standby was not promoted during the copy: a promotion mid-backup invalidates the image because the timeline forked underneath it. do_pg_abort_backup is the error-path counterpart that performs just this counter decrement, registered via PG_ENSURE_ERROR_CLEANUP, so any ereport(ERROR) between start and stop still releases the forced-full-page-write obligation. A “leaked” counter would be a silent, cluster-wide performance regression with no visible cause, which is why the cleanup discipline is this strict.

3. Walking the data directory: sendDir / sendFile

sendDir recurses through a directory, calling _tarWriteHeader for each subdirectory entry and sendFile for each regular file, returning the cumulative size (the sizeonly pass uses the same code to pre-compute bytes_total for progress). Its responsibilities:

  • Recognize relation directories. A path whose last component is all digits under ./base or a tablespace version directory, or ./global, holds relation files; isRelationDir gates per-file relfilenode parsing via parse_filename_for_nontemp_relation.
  • Apply the exclusion lists. excludeFiles (e.g. postmaster.pid, the relcache init file, a stray backup_label/tablespace_map/backup_manifest) and excludeDirContents (e.g. pg_stat_tmp, pg_replslot, pg_notify, pg_serial, pg_subtrans; these are sent as empty directories to preserve permissions).
  • Skip the volatile. Temp files (PG_TEMP_FILE_PREFIX), unlogged-table non-init forks (when an _init fork exists), temp relations, and pg_control (sent last). pg_wal is emitted as an empty directory plus its archive_status and summaries subdirs — its contents come from the archive on restore.
  • Handle tablespace symlinks. Under ./pg_tblspc, a symlink is written into the tar with its link target (or, when tablespace_map carries the mapping, skipped here and recorded there instead).
  • Detect interruption / promotion. CHECK_FOR_INTERRUPTS() plus a RecoveryInProgress() != backup_started_in_recovery check aborts a backup if a standby was promoted mid-copy (the image would be corrupt).

The exclusion lists are not arbitrary; each entry encodes a correctness or cleanliness fact. pg_replslot is excluded because replication-slot state is node-specific and restoring it would let a restored copy hold WAL on the original primary’s behalf. pg_stat_tmp and pg_notify/pg_serial/pg_subtrans/pg_snapshots are excluded because their contents are recreated or zeroed at startup, so copying them is pointless (and they are kept as empty directories so file permissions survive). A stray backup_label, tablespace_map, or backup_manifest found in the live data directory is excluded because the backup injects its own correct versions; a copied-in one would describe a different backup and mislead recovery. postmaster.pid is excluded so the restored copy does not think the original postmaster is still alive. The source notes this list must be kept in sync with pg_rewind’s filemap, since both tools reason about which files are node-local versus part of the logical cluster image.

The same sendDir/sendFile code runs twice when PROGRESS is requested: first with sizeonly = true to sum bytes_total (so the client can show a percentage), then with sizeonly = false to actually stream. Running the identical traversal twice keeps the size estimate exactly consistent with what will be sent — at the cost of a second directory scan — rather than maintaining a separate, drift-prone size accounting path.

sendFile opens the file, writes the tar header with _tarWriteHeader, and streams its contents in bbs_buffer-sized chunks through read_file_data_into_buffer -> bbsink_archive_contents. The buffer size (SINK_BUFFER_LENGTH = Max(32768, BLCKSZ)) is deliberately a multiple of BLCKSZ so that checksum verification, which works in whole-page units, can operate on full buffers. When data checksums are on and the file is a relation file, sendFile verifies each page’s checksum (verify_page_checksum), counting failures into total_checksum_failures. A torn read during the backup is not immediately fatal: a page whose checksum fails is re-read once, and only a persistently bad page counts as a failure, because a page being written concurrently with the backup read is expected and will be repaired by full-page-write replay. A non-zero total at the very end of the backup raises ERRCODE_DATA_CORRUPTED — surfacing genuine on-disk corruption that would otherwise silently propagate into the backup. The dispatch from sendDir chooses the file method per relation file:

// sendDir — src/backend/backup/basebackup.c
method = GetFileBackupMethod(ib, lookup_path, dboid, relspcoid,
                             relfilenumber, relForkNum,
                             segno, statbuf.st_size,
                             &num_blocks_required,
                             relative_block_numbers,
                             &truncation_block_length);
if (method == BACK_UP_FILE_INCREMENTALLY)
{
    statbuf.st_size = GetIncrementalFileSize(num_blocks_required);
    snprintf(tarfilenamebuf, sizeof(tarfilenamebuf),
             "%s/INCREMENTAL.%s", path + basepathlen + 1, de->d_name);
    tarfilename = tarfilenamebuf;
}

3a. The bbsink abstraction (basebackup_sink.h)

The walker never knows where bytes go. It writes into a shared buffer (bbs_buffer, of bbs_buffer_length) and calls one of the bbsink_* inline wrappers, which dispatch through the sink’s bbs_ops vtable. Each sink holds a bbs_next pointer to the sink it wraps, so a call either handles the data locally or forwards it down the chain:

// struct bbsink — src/include/backup/basebackup_sink.h
struct bbsink
{
    const bbsink_ops *bbs_ops;
    char       *bbs_buffer;
    size_t      bbs_buffer_length;
    bbsink     *bbs_next;
    bbsink_state *bbs_state;
};

The lifecycle is a strict call sequence enforced by assertions in the wrappers: exactly one begin_backup, then per archive one begin_archive, many archive_contents, one end_archive; then optionally one begin_manifest / many manifest_contents / one end_manifest; then one end_backup; and finally cleanup (invoked on the error path too). A sink that has nothing to add for a given callback uses the bbsink_forward_* helper to pass straight through. This is what lets SendBaseBackup build arbitrary stacks — progress(compress(throttle(target(copystream)))) — with no sink aware of its neighbours beyond bbs_next. The buffer-fill contract (Assert(len > 0 && len <= bbs_buffer_length) and bbs_buffer_length % BLCKSZ == 0) is why SINK_BUFFER_LENGTH is block-aligned and why checksum verification can assume whole pages.
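The decorator shape described above can be reduced to a single callback to show how forwarding works. This is a simplified sketch, not the bbsink API: `Sink`, `SinkOps`, `CountingSink`, and `TransportSink` are hypothetical types, only one callback is modelled, and the shared buffer is replaced by a plain pointer argument.

```c
#include <assert.h>
#include <stddef.h>

typedef struct Sink Sink;

/* One-entry vtable; the real interface has a callback per lifecycle step. */
typedef struct SinkOps {
    void (*archive_contents)(Sink *sink, const char *data, size_t len);
} SinkOps;

struct Sink {
    const SinkOps *ops;
    Sink          *next;        /* the sink this one wraps, or NULL at the base */
};

/* Wrapper the producer calls; it never looks past the outermost sink. */
void sink_archive_contents(Sink *sink, const char *data, size_t len)
{
    assert(sink != NULL && len > 0);
    sink->ops->archive_contents(sink, data, len);
}

/* Pass-through helper for sinks that have nothing to add. */
void sink_forward_archive_contents(Sink *sink, const char *data, size_t len)
{
    sink_archive_contents(sink->next, data, len);
}

/* Metering sink: counts bytes, then forwards to the sink it wraps. */
typedef struct CountingSink {
    Sink   base;                /* first member, so the pointer casts are valid */
    size_t bytes_done;
} CountingSink;

static void counting_archive_contents(Sink *sink, const char *data, size_t len)
{
    ((CountingSink *) sink)->bytes_done += len;
    sink_forward_archive_contents(sink, data, len);
}

static const SinkOps counting_ops = { counting_archive_contents };

/* Base sink: end of the chain; here it only tallies what it would transmit. */
typedef struct TransportSink {
    Sink   base;
    size_t bytes_sent;
} TransportSink;

static void transport_archive_contents(Sink *sink, const char *data, size_t len)
{
    (void) data;                /* a real transport would write these bytes out */
    ((TransportSink *) sink)->bytes_sent += len;
}

static const SinkOps transport_ops = { transport_archive_contents };

void transport_sink_init(TransportSink *t)
{
    t->base.ops = &transport_ops;
    t->base.next = NULL;
    t->bytes_sent = 0;
}

void counting_sink_init(CountingSink *c, Sink *next)
{
    c->base.ops = &counting_ops;
    c->base.next = next;
    c->bytes_done = 0;
}
```

Stacking two counting sinks over one transport sink and pushing data into the outermost one updates all three tallies, and none of the sinks needs to know how deep the chain is.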

4. Framing on the wire (basebackup_copy.c)

bbsink_copystream implements the bbsink_ops vtable that turns sink callbacks into protocol messages. begin_archive sends a CopyData whose first byte is 'n' followed by the archive name and tablespace path; archive_contents ships the buffered chunk as a 'd' message and, every PROGRESS_REPORT_BYTE_INTERVAL bytes, checks the clock and may emit a 'p' progress report; begin_manifest/manifest_contents use 'm'/'d'. The end-of-backup handshake closes the COPY stream and sends the end LSN:

// bbsink_copystream_end_backup — src/backend/backup/basebackup_copy.c
static void
bbsink_copystream_end_backup(bbsink *sink, XLogRecPtr endptr, TimeLineID endtli)
{
    SendCopyDone();
    SendXlogRecPtrResult(endptr, endtli);
}

SendXlogRecPtrResult and SendTablespaceList build small result sets via begin_tup_output_tupdesc / do_tup_output against a DestRemoteSimple receiver — the start LSN and the {spcoid, spclocation, size} rows the client reads before the COPY stream begins.
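On the receiving side, the type byte is all a client needs to demultiplex the stream. The sketch below assumes each CopyData payload has already been extracted into a buffer; `StreamStats` and `consume_chunk` are hypothetical names, and the contents of the `n` and `p` messages are deliberately left unparsed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical running totals kept by a client while it reads the stream. */
typedef struct StreamStats {
    size_t archives;            /* 'n' messages seen */
    size_t archive_bytes;       /* 'd' payload bytes before the manifest */
    size_t manifest_bytes;      /* 'd' payload bytes after 'm' */
    size_t progress_reports;    /* 'p' messages seen */
    bool   in_manifest;         /* true once 'm' has arrived */
} StreamStats;

/* Dispatch one CopyData payload on its first byte; false means unknown tag. */
bool consume_chunk(StreamStats *stats, const char *payload, size_t len)
{
    assert(stats != NULL);
    if (payload == NULL || len == 0)
        return false;

    switch (payload[0]) {
    case 'n':                   /* new archive: later data belongs to it */
        stats->archives++;
        stats->in_manifest = false;
        return true;
    case 'm':                   /* start of manifest: later data belongs to it */
        stats->in_manifest = true;
        return true;
    case 'd':                   /* data for whichever object is currently open */
        if (stats->in_manifest)
            stats->manifest_bytes += len - 1;
        else
            stats->archive_bytes += len - 1;
        return true;
    case 'p':                   /* progress report */
        stats->progress_reports++;
        return true;
    default:
        return false;
    }
}
```

Because `d` is shared between archive and manifest data, the client has to remember which object was opened last; that is the one piece of state the framing requires.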

When WAL was requested (pg_basebackup -X fetch), the main data-directory archive is left open after the file walk — the source comment notes the main data directory is “always sent last” precisely so the WAL can be appended to its archive. perform_base_backup scans pg_wal, selecting every segment whose name falls in [firstoff, lastoff] (derived from startsegno and endsegno) regardless of timeline, plus all timeline history files. It sorts them oldest-first via compareWalFileNames (which ignores the timeline prefix and compares only the log/segment portion) to reduce the chance a segment is recycled before it is sent, then runs a gap-free sanity check: the first segment must cover startptr, each successive segment must be contiguous, and the last must cover endptr, with CheckXLogRemoved re-checked at each step in case a checkpoint recycled a needed segment mid-copy. Each segment is appended as a tar member and immediately followed by a synthetic .done archive-status file (sendFileWithContent(..., "", ...)) so that a node promoted from this backup will not redundantly re-archive segments it already has — matching what walreceiver does after a complete segment. Finally AddWALInfoToBackupManifest records the WAL range and SendBackupManifest streams the manifest (as 'm'/'d' messages), then bbsink_end_backup closes the COPY stream and sends the end LSN/timeline. The default pg_basebackup mode (-X stream) takes a different path — it opens a second replication connection and streams WAL concurrently rather than fetching it at the end — which is why WAL is an option, not a mandate; that streaming path lives in the walsender, covered by postgres-wal-sender-receiver.md.
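The sort-then-verify step can be sketched in isolation. This is an illustrative model, not the server code: the `toy_*` names are hypothetical, segment names are assumed to be the usual 24 hex digits (8 timeline, 8 log, 8 segment), and `TOY_SEGS_PER_XLOGID` assumes the default 16 MB segment size. The recycling re-check (CheckXLogRemoved) and the start/end LSN coverage tests are reduced to plain segment-number comparisons.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Assumption: 16 MB segments, so 0x100000000 / 16 MB = 256 per log id. */
#define TOY_SEGS_PER_XLOGID 256u

/* qsort comparator: skip the 8-digit timeline prefix, compare the rest. */
int
toy_compare_wal_names(const void *a, const void *b)
{
    const char *fna = *(const char *const *) a;
    const char *fnb = *(const char *const *) b;

    return strcmp(fna + 8, fnb + 8);
}

/* Decode the segment number from a WAL file name; 0 on success. */
int
toy_wal_segno(const char *name, uint64_t *segno)
{
    unsigned int tli, log, seg;

    assert(name != NULL && segno != NULL);
    if (strlen(name) != 24 ||
        sscanf(name, "%8X%8X%8X", &tli, &log, &seg) != 3)
        return -1;
    *segno = (uint64_t) log * TOY_SEGS_PER_XLOGID + seg;
    return 0;
}

/*
 * Sort the names oldest-first and check that they cover every segment
 * from startseg through endseg with no gap. Returns 1 if complete.
 */
int
toy_wal_range_is_complete(const char **names, size_t n,
                          uint64_t startseg, uint64_t endseg)
{
    uint64_t    expected = startseg;
    uint64_t    segno;
    size_t      i;

    if (n == 0)
        return 0;
    qsort(names, n, sizeof(names[0]), toy_compare_wal_names);
    for (i = 0; i < n; i++)
    {
        if (toy_wal_segno(names[i], &segno) != 0 || segno != expected)
            return 0;               /* unparsable name or a gap */
        expected++;
    }
    return expected == endseg + 1;
}
```

Because the comparator ignores the timeline prefix, a segment written on timeline 2 sorts after its predecessor from timeline 1, which is the property the contiguity walk depends on.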

6. Incremental backup (basebackup_incremental.c)


For an incremental backup the client first sends UPLOAD_MANIFEST with the prior backup’s manifest; the walsender streams it through CreateIncrementalBackupInfo / AppendIncrementalManifestData / FinalizeIncrementalManifest, which parse the manifest’s WAL ranges and file list into an IncrementalBackupInfo. At backup time, PrepareForIncrementalBackup matches the manifest’s WAL ranges against this server’s timeline history, determines the LSN interval since the prior backup, loads the relevant WAL summaries, and merges them into an in-memory block-reference table (ib->brtab) — the change oracle.
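The merge step is, at its core, a set union over the blocks named by each WAL summary. The sketch below models that idea with a flat bitmap for a single relation fork; it is a deliberately simplified stand-in, since the real block-reference table is keyed by relation and fork and also tracks truncation limits. All `toy_*` names and the `TOY_MAX_BLOCKS` cap are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TOY_MAX_BLOCKS 1024u        /* hypothetical cap for the demo */

/* One bit per block of a single relation fork. */
typedef struct toy_brtab
{
    uint8_t     bits[TOY_MAX_BLOCKS / 8];
} toy_brtab;

void
toy_brtab_init(toy_brtab *t)
{
    memset(t->bits, 0, sizeof(t->bits));
}

/* Merge one summary's changed-block list into the table (set union). */
void
toy_brtab_merge(toy_brtab *t, const uint32_t *blocks, unsigned n)
{
    unsigned    i;

    for (i = 0; i < n; i++)
    {
        assert(blocks[i] < TOY_MAX_BLOCKS);
        t->bits[blocks[i] / 8] |= (uint8_t) (1u << (blocks[i] % 8));
    }
}

/* Emit the changed blocks in ascending order; returns how many. */
unsigned
toy_brtab_collect(const toy_brtab *t, uint32_t *out, unsigned out_cap)
{
    unsigned    n = 0;
    uint32_t    b;

    for (b = 0; b < TOY_MAX_BLOCKS; b++)
    {
        if (t->bits[b / 8] & (1u << (b % 8)))
        {
            assert(n < out_cap);
            out[n++] = b;
        }
    }
    return n;
}
```

Merging is idempotent and order-independent, so a block touched in several summaries is reported once, and the summaries can be loaded in any order.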

GetFileBackupMethod is then consulted per relation file. It returns BACK_UP_FILE_FULLY when the file is suspicious (size not a multiple of BLCKSZ, or larger than a segment), is an FSM fork (not reliably WAL-logged), was absent from the prior manifest, belongs to a newly-created database/tablespace, or would need more than 90% of its size sent as changed blocks anyway. Otherwise it returns BACK_UP_FILE_INCREMENTALLY with the list of changed blocks and a truncation length:

// GetFileBackupMethod — src/backend/backup/basebackup_incremental.c
if (nblocks * BLCKSZ > size * 0.9)
    return BACK_UP_FILE_FULLY;
/* ... sort + relativize block numbers ... */
*num_blocks_required = nblocks;
*truncation_block_length = size / BLCKSZ;
/* ... clamp truncation_block_length against limit_block and RELSEG_SIZE ... */
return BACK_UP_FILE_INCREMENTALLY;

The decision logic in GetFileBackupMethod is conservative by design: every doubtful case falls back to a full copy, because a wrongly-incremental file silently corrupts the restored database whereas a wrongly-full file merely wastes space. The early returns encode that bias: a file whose size is not a clean multiple of BLCKSZ or exceeds a segment is structurally suspect; the FSM fork is not reliably WAL-logged, so its changes would not appear in the summaries; a file absent from the prior manifest cannot be diffed against anything; and a database/tablespace OID created since the prior backup (detected via a special relNumber = 0 entry in the block-reference table) means all its files are new. The 90% threshold is a pure optimization: if more than 90% of the file would be sent as changed blocks, the incremental file plus its header would be about as large as the full file but slower to reconstruct, so the server sends the full file instead.
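The bias toward a full copy can be condensed into a small decision function. This is a sketch of the rules as described here, not a transcription of GetFileBackupMethod: the `toy_*` names and the pre-computed boolean inputs are hypothetical, and the constants assume the default 8 KB block and 1 GB segment sizes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_BLCKSZ        8192u     /* assumption: default block size */
#define TOY_RELSEG_BLOCKS 131072u   /* assumption: 1 GB segment */

typedef enum toy_method
{
    TOY_BACKUP_FULLY,
    TOY_BACKUP_INCREMENTALLY
} toy_method;

/* Hypothetical summary of what the server knows about one file. */
typedef struct toy_file
{
    size_t      size;                     /* bytes on disk */
    bool        is_fsm_fork;              /* FSM is not reliably WAL-logged */
    bool        in_prior_manifest;        /* present in the prior backup? */
    bool        db_or_tablespace_is_new;  /* created since the prior backup */
    unsigned    changed_blocks;           /* blocks the change oracle reports */
} toy_file;

toy_method
toy_choose_method(const toy_file *f)
{
    assert(f != NULL);

    /* Structurally suspect: odd size, or larger than one segment. */
    if (f->size % TOY_BLCKSZ != 0 ||
        f->size > (size_t) TOY_RELSEG_BLOCKS * TOY_BLCKSZ)
        return TOY_BACKUP_FULLY;

    /* Changes to this fork would not show up in the WAL summaries. */
    if (f->is_fsm_fork)
        return TOY_BACKUP_FULLY;

    /* Nothing to diff against. */
    if (!f->in_prior_manifest || f->db_or_tablespace_is_new)
        return TOY_BACKUP_FULLY;

    /* More than 90% changed: the incremental form saves nothing. */
    if ((double) f->changed_blocks * TOY_BLCKSZ > (double) f->size * 0.9)
        return TOY_BACKUP_FULLY;

    return TOY_BACKUP_INCREMENTALLY;
}
```

Every branch except the last returns the full method, which mirrors the asymmetry in cost: a mistaken full copy wastes space, a mistaken incremental copy corrupts the restore.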

An incremental file is laid out by sendFile as a header — magic (INCREMENTAL_MAGIC), block count, truncation block length, then the relative block numbers — followed by the raw 8 KB images of just those blocks. The truncation block length is the key reconstruction hint: it tells pg_combinebackup the final file length, so blocks below it that are absent from the incremental are taken from the prior backup, while blocks at or above it that are absent are treated as truncated-away (zero-filled if needed). GetIncrementalFileSize / GetIncrementalHeaderSize compute the on-tar size (the header is padded to a BLCKSZ multiple only when block data follows, to keep zero-block incrementals tiny). Reconstruction back into a full file is the job of pg_combinebackup, not the server. The block numbers written into the file are relative to the segment start, computed by subtracting start_blkno after sorting, so each INCREMENTAL.<seg> file is self-contained. See the cross-ref doc postgres-incremental-backup.md for the WAL-summary side of the oracle and the reconstruction algorithm in full.
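The sizing and relativization rules can be worked through numerically. The sketch below assumes three 4-byte header fields followed by one 4-byte entry per block, padding to a block multiple only when block data follows, and the default 8 KB block size; the `toy_*` names are hypothetical stand-ins, not the server functions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_BLCKSZ 8192u            /* assumption: default block size */

/* Header: magic, block count, truncation length, then block numbers. */
size_t
toy_incremental_header_size(unsigned num_blocks)
{
    size_t      result;

    result = 3 * sizeof(uint32_t) + sizeof(uint32_t) * (size_t) num_blocks;

    /* Pad to a block boundary only if block images follow the header. */
    if (num_blocks > 0 && result % TOY_BLCKSZ != 0)
        result += TOY_BLCKSZ - (result % TOY_BLCKSZ);
    return result;
}

/* Total member size: header plus one full block image per entry. */
size_t
toy_incremental_file_size(unsigned num_blocks)
{
    return toy_incremental_header_size(num_blocks) +
        (size_t) TOY_BLCKSZ * num_blocks;
}

/* Rewrite absolute block numbers as offsets from the segment start. */
void
toy_relativize_blocks(uint32_t *blocks, unsigned n, uint32_t start_blkno)
{
    unsigned    i;

    for (i = 0; i < n; i++)
    {
        assert(blocks[i] >= start_blkno);
        blocks[i] -= start_blkno;
    }
}
```

Under these assumptions a zero-block incremental occupies 12 bytes, a one-block incremental occupies 16384 bytes (one padded header block plus one image), and the header grows to a second block once it holds more than 2045 entries.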

flowchart TD
  UM["UPLOAD_MANIFEST<br/>prior backup manifest"] --> CIB["CreateIncrementalBackupInfo<br/>Append/Finalize parse manifest"]
  CIB --> Prep["PrepareForIncrementalBackup<br/>match WAL ranges to timeline<br/>load WAL summaries -> brtab"]
  Prep --> GFBM["GetFileBackupMethod per file"]
  GFBM -->|"not WAL-logged / new db<br/>not in prior / >90% blocks"| Full["BACK_UP_FILE_FULLY<br/>send whole file"]
  GFBM -->|"few changed blocks"| Inc["BACK_UP_FILE_INCREMENTALLY<br/>INCREMENTAL.* file:<br/>header + changed blocks"]
  Full --> Tar["tar member in archive"]
  Inc --> Tar
  Tar --> PCB["pg_combinebackup<br/>reconstructs full files at restore"]

Position hints (as of 2026-06-05, REL_18 273fe94)

| Symbol | File | Line |
| --- | --- | --- |
| SendBaseBackup | src/backend/backup/basebackup.c | 990 |
| perform_base_backup | src/backend/backup/basebackup.c | 234 |
| parse_basebackup_options | src/backend/backup/basebackup.c | 698 |
| sendDir | src/backend/backup/basebackup.c | 1189 |
| sendFile | src/backend/backup/basebackup.c | 1574 |
| sendFileWithContent | src/backend/backup/basebackup.c | 1075 |
| excludeDirContents[] | src/backend/backup/basebackup.c | 151 |
| excludeFiles[] | src/backend/backup/basebackup.c | 191 |
| basebackup_options (struct) | src/backend/backup/basebackup.c | 62 |
| SINK_BUFFER_LENGTH | src/backend/backup/basebackup.c | 60 |
| bbsink_copystream_new | src/backend/backup/basebackup_copy.c | 108 |
| bbsink_copystream_begin_backup | src/backend/backup/basebackup_copy.c | 126 |
| bbsink_copystream_begin_archive | src/backend/backup/basebackup_copy.c | 165 |
| bbsink_copystream_archive_contents | src/backend/backup/basebackup_copy.c | 183 |
| bbsink_copystream_end_backup | src/backend/backup/basebackup_copy.c | 297 |
| SendXlogRecPtrResult | src/backend/backup/basebackup_copy.c | 341 |
| SendTablespaceList | src/backend/backup/basebackup_copy.c | 378 |
| bbsink_copystream_ops (vtable) | src/backend/backup/basebackup_copy.c | 92 |
| GetFileBackupMethod | src/backend/backup/basebackup_incremental.c | 663 |
| PrepareForIncrementalBackup | src/backend/backup/basebackup_incremental.c | 263 |
| CreateIncrementalBackupInfo | src/backend/backup/basebackup_incremental.c | 152 |
| AppendIncrementalManifestData | src/backend/backup/basebackup_incremental.c | 194 |
| FinalizeIncrementalManifest | src/backend/backup/basebackup_incremental.c | 227 |
| GetIncrementalFilePath | src/backend/backup/basebackup_incremental.c | 625 |
| GetIncrementalFileSize | src/backend/backup/basebackup_incremental.c | 909 |
| GetIncrementalHeaderSize | src/backend/backup/basebackup_incremental.c | 881 |
| IncrementalBackupInfo (struct) | src/backend/backup/basebackup_incremental.c | 74 |
| do_pg_backup_start | src/backend/access/transam/xlog.c | 8842 |
| do_pg_backup_stop | src/backend/access/transam/xlog.c | 9170 |
| do_pg_abort_backup | src/backend/access/transam/xlog.c | 9444 |
| build_backup_content | src/backend/access/transam/xlogbackup.c | 29 |
| bbsink_state (struct) | src/include/backup/basebackup_sink.h | 66 |

Verified against the REL_18 working tree at /data/hgryoo/references/postgres, commit 273fe94852b3a7e34fd171e8abdf1481beb302fa (PG 18.x).

  • Commit / branch. git log -1 reports 273fe94 dated 2026-06-05. All line numbers above were read directly from this tree.
  • Files exist at stated sizes. basebackup.c is 2136 lines, basebackup_copy.c 422, basebackup_incremental.c 1056.
  • Symbols present. SendBaseBackup, perform_base_backup, parse_basebackup_options, sendDir, sendFile, sendFileWithContent, GetFileBackupMethod, PrepareForIncrementalBackup, GetIncrementalFileSize, GetIncrementalHeaderSize, bbsink_copystream_new, SendXlogRecPtrResult, SendTablespaceList, and the bbsink_copystream_ops vtable were all confirmed by grep.
  • Bracket primitives. do_pg_backup_start, do_pg_backup_stop, and do_pg_abort_backup are in xlog.c; build_backup_content is in xlogbackup.c. The runningBackups++ under WAL insert locks and the full-page-write rationale were read in place.
  • Wire framing tags. The type bytes 'n' (new archive), 'd' (data), 'm' (manifest), 'p' (progress) appear in basebackup_copy.c exactly as described; PqMsg_CopyData / PqMsg_CopyOutResponse / PqMsg_CopyDone are the message constants used.
  • Option spelling. REL_18 accepts CHECKPOINT { fast | spread } and the boolean-ish MANIFEST { yes | no | force-encode }; the INCREMENTAL precondition on summarize_wal is enforced in parse_basebackup_options.
  • Incremental layout. INCREMENTAL_MAGIC, the INCREMENTAL.<name> tar member naming, and the 90%-of-blocks full-backup threshold were confirmed in basebackup_incremental.c / sendFile. No claim relies on a removed-in-18 symbol; the block-reference table API (BlockRefTableGetEntry, BlockRefTableEntryGetBlocks) is the REL_18 surface.
  • Scope boundaries. WAL streaming/replication transport, WAL summarization, and pg_combinebackup reconstruction are deferred to the cross-ref docs and not re-asserted here beyond the seam.
  • Checksum re-read. The claim that a failing page is re-read once before counting as a failure was confirmed against read_file_data_into_buffer, which holds a reread_cnt re-read path and only counts a checksum_failures++ after the second read still fails.
  • backup_label fields. The exact START WAL LOCATION / CHECKPOINT LOCATION / BACKUP METHOD: streamed / BACKUP FROM / START TIME / LABEL / START TIMELINE / optional INCREMENTAL FROM LSN + INCREMENTAL FROM TLI field set was read directly from build_backup_content; the .backup history-file variant is the same function called with ishistoryfile = true.

Beyond PostgreSQL — Comparative Designs & Research Frontiers


PostgreSQL’s base backup is one point in a broad design space. Naming the alternatives sharpens what PG actually chose.

Snapshot-based backup (copy-on-write filesystems / volumes). ZFS, LVM, and cloud block-store snapshots can take a crash-consistent point-in-time image of the whole volume atomically, sidestepping the file-by-file walk entirely. The trade is that the snapshot is consistent only at the block device layer; to make it a PostgreSQL-consistent backup you still need the WAL bracket — pg_backup_start / pg_backup_stop around the snapshot — so that the restored volume replays correctly. PG’s BASE_BACKUP and the low-level API are complementary here: the API exists precisely so snapshot tooling can wrap a non-PG copy mechanism. The lesson is that the recoverability metadata (backup_label, the WAL interval) is the load-bearing part; the byte-copy mechanism is interchangeable.

Block-change tracking vs. WAL scanning for incrementals. Oracle RMAN’s block change tracking maintains a persistent bitmap of changed blocks in a dedicated file, updated as blocks are dirtied; SQL Server’s differential backups track changed extents via the DCM (Differential Changed Map) bitmap page. Both pay a small write-time cost to make the incremental oracle a direct lookup. PostgreSQL instead chose to derive the oracle after the fact by scanning WAL into summaries (the WAL summarizer) and merging them into a block-reference table at backup time. The advantage is zero write-path overhead when summarization is on and no on-disk bitmap to keep coherent across crashes; the cost is that the summarizer must keep up and GetFileBackupMethod must reconstruct the change set per backup. This is a classic eager-vs-lazy materialization trade — the same axis the executor doc draws between materialized and pipelined evaluation, applied to change tracking.

Push vs. pull, and where compression lives. PG streams the backup pull-style: the client drives, the server reads and frames. Server-side compression (COMPRESSION gzip|lz4|zstd) is a sink in the chain, so the CPU cost lands on the primary; client-side compression (in pg_basebackup) moves it off the primary at the cost of network bytes. Percona XtraBackup, the long-standing physical backup tool for MySQL, copies InnoDB files plus a redo-log tail in a broadly similar bracket-and-copy shape, but as an external tool reading the files directly rather than through a server command. The bbsink decorator stack is PG’s clean answer to “where does each concern (meter, compress, transport) live”: each is one composable sink.

Torn pages without full-page writes. PG’s torn-page defense is full-page-write WAL during the backup window. Other engines avoid the problem differently: some use doublewrite buffers (InnoDB) so a torn page is always recoverable from the doublewrite area independent of any backup, and some rely on atomic 8 KB+ writes from the storage layer. PG’s choice keeps the heap write path simple at steady state and pays the cost only during backups (and after checkpoints) — but it is why a base backup of a busy cluster temporarily inflates WAL volume.

Research frontiers. Incremental-forever and synthetic full strategies (merge a chain of incrementals into a notional full without re-reading the primary) are well-explored in enterprise backup products and partially realized by pg_combinebackup. Open questions for PG specifically include making the in-memory block-reference table spill to disk for very large clusters (the source itself flags the memory concern), and tighter integration between WAL summarization, archiving, and backup so that the change oracle is always warm. The broader literature on fuzzy checkpointing and ARIES-family recovery (Mohan et al. 1992) remains the theoretical backbone: a base backup is just a fuzzy checkpoint taken to external media, and every refinement is about reducing the bytes copied or the overhead of the bracket while preserving idempotent physical redo.

A final design observation ties the comparison back to PostgreSQL’s particular taste. The base-backup subsystem is unusually composable for a piece of recovery machinery: the bracket primitives (do_pg_backup_start / do_pg_backup_stop) are reused verbatim by the SQL low-level API, snapshot tooling, and BASE_BACKUP; the sink chain lets compression, throttling, progress, and transport be mixed and matched without touching the walker; and the incremental path is a thin addition — one extra method enum and one oracle lookup — layered onto the same walk. That composability is the real lesson of the module: rather than building three backup engines (full, incremental, snapshot-assisted), PostgreSQL factored out the one invariant that all physical backups share — the recoverable fuzzy image bracketed by a redo-pinning start and an archive-ensuring stop — and made everything else a pluggable decision on top. The cross-referenced docs (postgres-incremental-backup.md, postgres-archiving-walsummary.md, postgres-wal-sender-receiver.md) each pick up one of those pluggable seams.

  • PostgreSQL REL_18 source (/data/hgryoo/references/postgres, commit 273fe94):
    • src/backend/backup/basebackup.c: SendBaseBackup, perform_base_backup, parse_basebackup_options, sendDir, sendFile, sendFileWithContent, WAL-append loop, exclusion lists.
    • src/backend/backup/basebackup_copy.c: bbsink_copystream vtable, COPY framing, SendXlogRecPtrResult, SendTablespaceList.
    • src/backend/backup/basebackup_incremental.c: GetFileBackupMethod, PrepareForIncrementalBackup, manifest parsing, incremental file sizing.
    • src/include/backup/basebackup_sink.h: bbsink, bbsink_state, bbsink_ops.
    • src/backend/access/transam/xlog.c: do_pg_backup_start, do_pg_backup_stop, do_pg_abort_backup, runningBackups.
    • src/backend/access/transam/xlogbackup.c: build_backup_content.
  • Theory anchor. Database System Concepts, Silberschatz, Korth & Sudarshan, 7th ed., ch. 19 “Recovery System” — archival dumps, fuzzy dump, and recovery from media failure (knowledge/research/dbms-general/database-system-concepts.md).
  • Recovery model. Mohan et al., “ARIES: A Transaction Recovery Method Supporting Fine-Granularity Locking and Partial Rollbacks Using Write-Ahead Logging,” 1992 — idempotent physiological redo and full-page images, the basis for repairing a fuzzy image (knowledge/research/dbms-papers/aries.md).
  • Cross-references (this tree):
    • postgres-wal-sender-receiver.md — the replication transport that carries BASE_BACKUP and the live WAL stream a standby follows.
    • postgres-archiving-walsummary.md — WAL archiving and the WAL summarizer that feeds the incremental change oracle.
    • postgres-incremental-backup.md — the block-reference table and pg_combinebackup reconstruction in depth.
    • postgres-xlog-wal.md, postgres-checkpoint.md, postgres-recovery-redo.md — the WAL, checkpoint, and redo machinery the bracket and restore rely on.
    • postgres-overview-replication-ha.md — the subsystem map this doc sits in.