PostgreSQL pg_dump / pg_restore — Catalog-Driven Logical Backup
Contents:
- Theoretical Background
- Common DBMS Design
- PostgreSQL’s Approach
- Source Walkthrough
- Source verification (as of 2026-06-06)
- Beyond PostgreSQL — Comparative Designs & Research Frontiers
- Sources
Theoretical Background
Section titled “Theoretical Background”A database backup answers one question — how do I reconstruct this data after a loss? — but there are two structurally different answers, and the choice colors everything downstream.
A physical backup copies the bytes of the storage layer: the data
files, the WAL, the control file. It is fast (sequential reads, no parsing),
it is exact (every tuple, every dead row, every index leaf), and it is
opaque — the copy is only meaningful to a server of the same major
version, same block size, same architecture. PostgreSQL’s physical path is
pg_basebackup / the WAL archive (see postgres-backup-basebackup.md).
A logical backup copies the meaning of the database: a stream of SQL
statements (CREATE TABLE, COPY, CREATE INDEX, ALTER TABLE … ADD CONSTRAINT) that, when replayed against an empty cluster, reconstructs an
equivalent database. It is slower (the server must re-parse, re-plan,
re-index everything) and it is not byte-exact (a restored table is freshly
packed, has no dead tuples, gets new OIDs). In exchange it is portable:
across major versions, across architectures, across forks, and it can be
restored selectively (one schema, one table, schema-only). pg_dump is
PostgreSQL’s logical-backup engine.
Three theoretical problems define the logical-backup design space:
-
Snapshot consistency. A dump runs for minutes to hours against a live database that keeps changing. Without isolation, the dump would see a table created after the catalog scan but miss its data, or capture a foreign key whose referenced table was dropped mid-dump. The dump must read the entire database — catalog and data — as of a single point in time. This is exactly the snapshot-isolation guarantee MVCC already provides; the dumper’s job is to acquire one snapshot and pin it for the whole run (
postgres-mvcc-snapshots.md). -
Reconstruction ordering — the topological-sort problem. SQL is not commutative under definition.
CREATE TABLE orders (cust_id REFERENCES customers)cannot execute beforecustomersexists; a view cannot be created before the tables it selects from; an index’s operator class must exist first. The set of “X must exist before Y” facts forms a directed graph, and a valid restore order is any topological sort of that graph. The Art of Computer Programming, Vol. 1 (Knuth, §2.2.3) gives the canonical algorithm: repeatedly emit a node with no unsatisfied predecessors, decrement the predecessor counts of its successors, repeat. The wrinkle is that real schemas contain cycles — a view that references itself through a rule, two tables with mutual foreign keys — and a cyclic graph has no topological order. The dumper must detect cycles and break them by splitting one object into two TOC entries (the table definition, then the constraint that closes the loop), restoring the second piece in a later pass. -
Format / engine separation. The decisions what to dump and in what order are independent of how the bytes are laid out on disk. A clean design factors the object-extraction logic (which talks SQL to the server) away from the archive-serialization logic (which writes a file), connected only by an abstract “archive entry” interface. This is the classic strategy pattern: one extraction front end, several interchangeable format back ends.
Database System Concepts (Silberschatz et al., “Recovery System”) frames
the backup/restore duality as the difference between dump (a snapshot of
state) and log (a record of changes), and notes that a logical dump is
fundamentally a re-derivation recipe rather than a copy. Database
Internals (Petrov, Part II) frames the same split as “logical vs. physical
replication,” where the logical form ships row/statement intent and the
physical form ships page images — the same trade between portability and
fidelity that separates pg_dump from pg_basebackup.
Common DBMS Design
Section titled “Common DBMS Design”Every mature engine ships a logical-export tool, and they converge on a small set of patterns. Naming them first makes PostgreSQL’s choices read as selections within a shared space.
Catalog as the source of truth
Section titled “Catalog as the source of truth”A logical dumper does not parse the DDL the user originally typed (that
text is not retained). It reconstructs DDL by querying the system catalog
— the tables that describe tables, columns, types, constraints, indexes. The
dumper is, in effect, a large catalog-to-DDL decompiler. Every engine has
one: MySQL’s mysqldump reads INFORMATION_SCHEMA + SHOW CREATE TABLE;
Oracle’s Data Pump reads the dictionary views; SQL Server’s SMO walks
sys.*. The fidelity of the dump is bounded by the completeness of the
catalog read — a column default stored in pg_attrdef that the dumper
forgets to query simply vanishes from the backup.
Snapshot pinning for a consistent read
Section titled “Snapshot pinning for a consistent read”The export must read catalog and data under one snapshot. The universal
mechanism is to open a transaction in the strictest isolation the engine
offers and hold it open for the whole dump: REPEATABLE READ / SERIALIZABLE
in PostgreSQL, START TRANSACTION WITH CONSISTENT SNAPSHOT in MySQL/InnoDB,
FLASHBACK SCNs in Oracle. The cost is a long-lived transaction that pins
old row versions (blocking vacuum) and holds AccessShareLock on every
dumped table for the duration.
An intermediate object model, not a streaming pipe
Section titled “An intermediate object model, not a streaming pipe”A naive dumper could stream CREATE … ; COPY … ; straight to a file as it
walks the catalog. But ordering forbids this: you cannot know the safe order
until you have collected all objects and all their dependency edges.
Mature tools therefore build an in-memory object graph first, sort it,
and only then serialize. This graph is also what enables selective restore
(restore one node and its transitive prerequisites) and parallelism
(dispatch independent subgraphs concurrently).
A table of contents (TOC) decoupling listing from content
Section titled “A table of contents (TOC) decoupling listing from content”A container format puts a table of contents at a known location: a list
of (id, kind, name, owner, dependency-list, offset-to-data) records. The
TOC lets a restore tool (a) list the archive’s contents without reading the
data, (b) reorder or filter entries, and (c) seek directly to one table’s
data for selective or parallel restore. pg_restore -l / -L, Oracle Data
Pump’s master table, and SQL Server’s backup header all serve this role.
Section phasing: schema-before-data-before-constraints
Section titled “Section phasing: schema-before-data-before-constraints”Bulk-loading data is dramatically faster when indexes and constraints do not
yet exist — each inserted row would otherwise pay for index maintenance and
constraint checks. Every fast loader therefore splits restore into phases:
create bare tables, bulk-load data, then build indexes and add
constraints. PostgreSQL formalizes this as three sections — pre-data,
data, post-data — and the section boundary is itself a node in the
dependency graph.
Parallelism bounded by the dependency graph
Section titled “Parallelism bounded by the dependency graph”Restore is embarrassingly parallel except where dependencies serialize it: two tables’ data can load concurrently, but an index cannot build before its table’s data is present, and a foreign key cannot be added before both endpoints exist. The general pattern is a worker pool fed by a ready queue: an item is “ready” when all its dependencies are done; workers pull ready items, and completion of one item may make several others ready.
PostgreSQL’s Approach
Section titled “PostgreSQL’s Approach”PostgreSQL splits the problem across a handful of files in
src/bin/pg_dump/. The cleanest way to read the codebase is as four
cooperating layers:
flowchart TB
subgraph extract["Extraction front end — talks SQL"]
A["pg_dump.c main()<br/>option parsing, connect,<br/>set snapshot"]
B["common.c getSchemaData()<br/>getTables / getTypes / getFuncs / …<br/>builds DumpableObject graph"]
end
subgraph order["Ordering"]
C["pg_dump_sort.c<br/>sortDumpableObjectsByTypeName<br/>then sortDumpableObjects (TopoSort)"]
end
subgraph emit["Archive back end — writes bytes"]
D["pg_backup_archiver.c<br/>ArchiveEntry() builds TOC<br/>CreateArchive() picks format"]
E["pg_backup_custom.c<br/>pg_backup_directory.c<br/>pg_backup_tar.c<br/>pg_backup_null.c (plain)"]
end
subgraph restore["pg_restore"]
F["OpenArchive + ReadToc<br/>RestoreArchive()<br/>serial or parallel"]
end
A --> B --> C --> D --> E
E -.archive file.-> F
Figure 1 — The four layers. Solid arrows are the dump path; the dashed
arrow is the file handed to pg_restore. Plain-text dump is the same path
with the archNull back end, and pg_dump itself drives RestoreArchive
in that case rather than writing a container.
Catalog-driven extraction under a pinned snapshot
Section titled “Catalog-driven extraction under a pinned snapshot”main() in pg_dump.c connects with a regular libpq connection, then opens
a transaction and (for a consistent dump) sets it to the appropriate
isolation level and exports a snapshot — the same machinery a parallel dump
uses to share one snapshot across worker connections. Inside that
transaction, getSchemaData (in common.c) runs a long, deliberately
ordered sequence of catalog reads. The ordering is not cosmetic; comments in
the function spell out the data dependencies:
// getSchemaData — common.c (condensed)/* extensions first: membership gates whether other objects dump */extinfo = getExtensions(fout, &numExtensions);getExtensionMembership(fout, extinfo, numExtensions);getNamespaces(fout);/* getTables ASAP to minimize the lock-acquisition window */tblinfo = getTables(fout, &numTables);getOwnedSeqs(fout, tblinfo, numTables);getFuncs(fout);getTypes(fout); /* must be after getTables and getFuncs *//* ... aggregates, operators, opclasses, collations, casts, ... */flagInhTables(fout, tblinfo, numTables, inhinfo, numInherits);getTableAttrs(fout, tblinfo, numTables); /* columns of interesting tables */getIndexes(fout, tblinfo, numTables);getConstraints(fout, tblinfo, numTables);getTriggers(fout, tblinfo, numTables);/* ... rules, policies, publications, subscriptions ... */Each getXxx runs a SELECT against the relevant catalog (pg_class,
pg_type, pg_proc, pg_constraint, …) and turns each row into a
heap-allocated *Info struct (TableInfo, TypeInfo, FuncInfo, …) whose
first member is a DumpableObject. The comment on getTables captures the
locking concern: every table that may be dumped is LOCK TABLE … IN ACCESS SHARE MODE-d to prevent it being dropped or rewritten mid-dump, and the
function runs early to keep the window between transaction start and lock
acquisition small.
The DumpableObject graph: CatalogId and DumpId
Section titled “The DumpableObject graph: CatalogId and DumpId”Every dumpable object carries two identities. The CatalogId is its
catalog coordinate — (tableoid, oid), i.e. which catalog and which row
— and is how the dumper cross-references objects it learned about from
different queries. The DumpId is a dense small integer assigned
sequentially as objects are created; it is the key used by the dependency
arrays and the sort. AssignDumpId wires both into the lookup structures:
// AssignDumpId — common.c (condensed)dobj->dumpId = ++lastDumpId;dobj->dump = DUMP_COMPONENT_ALL; /* default: dump everything */dobj->components = DUMP_COMPONENT_DEFINITION;dobj->dependencies = NULL; dobj->nDeps = 0; dobj->allocDeps = 0;
/* index by DumpId (dense array, grown by doubling) */dumpIdMap[dobj->dumpId] = dobj;
/* index by CatalogId (open-addressing hash) if it has one */if (OidIsValid(dobj->catId.tableoid)){ entry = catalogid_insert(catalogIdHash, dobj->catId, &found); entry->dobj = dobj;}Two indexes, two access patterns: findObjectByDumpId (array indexing, used
by the sort) and findObjectByCatalogId (hash lookup, used when one catalog
read references an OID discovered by another). Dependency edges are plain
DumpId lists attached to each object; addObjectDependency appends to a
doubling array, removeObjectDependency filters one out — these are the
primitives the loop-repair logic uses to rewrite the graph.
flowchart LR
subgraph idx["common.c lookup tables"]
DM["dumpIdMap[]<br/>DumpId to DumpableObject*"]
CH["catalogIdHash<br/>CatalogId to DumpableObject*"]
end
O1["TableInfo customers<br/>dumpId=12 catId=(1259,16400)"]
O2["TableInfo orders<br/>dumpId=15 catId=(1259,16410)<br/>deps=[12]"]
O3["ConstraintInfo orders_fk<br/>dumpId=22 deps=[15,12]"]
O1 --> DM
O2 --> DM
O3 --> DM
O1 --> CH
O2 --> CH
O2 -. depends on .-> O1
O3 -. depends on .-> O2
O3 -. depends on .-> O1
Figure 2 — The object graph. orders depends on customers (foreign key
target); the FK constraint depends on both endpoints. The sort must emit
customers before orders before orders_fk — and if a mutual FK created
a cycle, loop repair would split the constraint out so it lands in
post-data.
Explicit dependencies from pg_depend, plus synthetic edges
Section titled “Explicit dependencies from pg_depend, plus synthetic edges”Most edges come straight from the catalog: getDependencies (in
pg_dump.c) reads pg_depend, maps each (classid, objid) →
(refclassid, refobjid) row to a pair of DumpableObjects via
findObjectByCatalogId, and calls addObjectDependency. On top of these,
the extraction code adds synthetic edges that pg_depend does not model
— for example, a column default that must be emitted after its table
(addObjectDependency(&attrDef->dobj, tbinfo->dobj.dumpId) in common.c),
or the matview-refresh edges built by buildMatViewRefreshDependencies.
Ordering: type-name pre-sort, then topological sort
Section titled “Ordering: type-name pre-sort, then topological sort”pg_dump.c sorts in two stages. First sortDumpableObjectsByTypeName
imposes a stable, deterministic baseline order (by object-type priority,
then by schema and name) so that two logically identical databases dump
byte-identically — important for diffing dumps. Then sortDumpableObjects
runs the real topological sort, using the type-name order only as a
tie-breaker to “minimize unnecessary rearrangement”:
// sortDumpableObjects — pg_dump_sort.c (condensed)preDataBoundId = preBoundaryId; /* the pre-data/data section boundary */postDataBoundId = postBoundaryId; /* the data/post-data boundary */
ordering = pg_malloc(numObjs * sizeof(DumpableObject *));while (!TopoSort(objs, numObjs, ordering, &nOrdering)) findDependencyLoops(ordering, nOrdering, numObjs);
memcpy(objs, ordering, numObjs * sizeof(DumpableObject *));The while loop is the cycle-handling structure: TopoSort returns true
on success; if it returns false it has filled ordering with the objects
it could not place (those in or downstream of a cycle), and
findDependencyLoops finds an actual cycle among them and calls
repairDependencyLoop to break it, after which the sort is retried.
The two boundary IDs name dummy objects (DO_PRE_DATA_BOUNDARY,
DO_POST_DATA_BOUNDARY) that the sort treats as ordinary nodes. By making
every pre-data object a predecessor of the pre-data boundary, every data
object a successor of it and a predecessor of the post-data boundary, the
section split falls out of the same topological sort rather than needing a
separate pass.
Loop repair: splitting objects into multiple TOC entries
Section titled “Loop repair: splitting objects into multiple TOC entries”repairDependencyLoop is a catalog of known cycle shapes. A type and its
I/O function reference each other; a view and its ON SELECT rule reference
each other; two tables can have mutual foreign keys. Each case has a
hand-coded repair that removes one edge — usually by deferring part of the
object to post-data:
// repairDependencyLoop — pg_dump_sort.c (condensed)/* Datatype and one of its I/O or canonicalize functions */if (nLoop == 2 && loop[0]->objType == DO_TYPE && loop[1]->objType == DO_FUNC){ repairTypeFuncLoop(loop[0], loop[1]); return;}/* View (including matview) and its ON SELECT rule */if (nLoop == 2 && loop[0]->objType == DO_TABLE && loop[1]->objType == DO_RULE && (((TableInfo *) loop[0])->relkind == RELKIND_VIEW || … RELKIND_MATVIEW) && ((RuleInfo *) loop[1])->ev_type == '1' && … ){ repairViewRuleLoop(loop[0], loop[1]); /* dump view as CREATE+rule */ return;}/* … table-constraint loops, larger indirect loops, etc. … */This is why a self-referential view dumps as a bare CREATE VIEW … AS SELECT NULL (or a dummy table) in pre-data plus a CREATE RULE "_RETURN" in
post-data: the loop repair split it.
Building the TOC: ArchiveEntry
Section titled “Building the TOC: ArchiveEntry”Once sorted, pg_dump.c walks the ordered array and calls dumpDumpableObject
on each, which dispatches by type (dumpTableSchema, dumpType, dumpFunc,
…). Each dumpXxx builds the object’s SQL text and registers it as a TOC
entry via ArchiveEntry, supplying its tag, owner, section, the
CREATE/DROP SQL, an optional COPY statement, an optional data dumper
callback, and its dependency list:
// ArchiveEntry — pg_backup_archiver.c (condensed)newToc = pg_malloc0(sizeof(TocEntry));/* link into the circular TOC list */newToc->prev = AH->toc->prev; newToc->next = AH->toc;AH->toc->prev->next = newToc; AH->toc->prev = newToc;
newToc->dumpId = dumpId;newToc->section = opts->section; /* PRE_DATA / DATA / POST_DATA */newToc->tag = pg_strdup(opts->tag);newToc->desc = pg_strdup(opts->description); /* "TABLE", "INDEX", … */newToc->defn = opts->createStmt ? pg_strdup(opts->createStmt) : NULL;newToc->dropStmt = opts->dropStmt ? pg_strdup(opts->dropStmt) : NULL;newToc->copyStmt = opts->copyStmt ? pg_strdup(opts->copyStmt) : NULL;newToc->dataDumper = opts->dumpFn; /* table-data callback */memcpy(newToc->dependencies, opts->deps, opts->nDeps * sizeof(DumpId));The desc string (“TABLE”, “INDEX”, “FK CONSTRAINT”, “TABLE DATA”,
“DATABASE”, …) is load-bearing: the restore-time _tocEntryRequired filter
keys entirely off desc string comparisons, so the descriptor doubles as
the object’s restore-policy class.
Table data: the COPY hook
Section titled “Table data: the COPY hook”A table’s definition is a SQL string in the TOC; its data is not — it is
a callback (dataDumper) that streams rows on demand. For most tables that
callback is dumpTableData_copy, which issues a server-side COPY … TO stdout and relays the protocol’s CopyData rows straight into the archive:
// dumpTableData_copy — pg_dump.c (condensed)column_list = fmtCopyColumnList(tbinfo, clistBuf); /* explicit col order */if (tdinfo->filtercond || tbinfo->relkind == RELKIND_FOREIGN_TABLE) appendPQExpBuffer(q, "COPY (SELECT %s FROM %s %s) TO stdout;", …, fmtQualifiedDumpable(tbinfo), tdinfo->filtercond ? tdinfo->filtercond : "");else appendPQExpBuffer(q, "COPY %s %s TO stdout;", fmtQualifiedDumpable(tbinfo), column_list);Deferring data to a callback is what makes the archive formats pluggable: a
plain dump runs the callback and writes SQL inline; the custom/directory
formats run it during WriteDataChunks and store the compressed CopyData
block at a TOC-recorded offset, so restore can seek to it. The
--inserts mode swaps the callback for dumpTableData_insert, which emits
INSERT statements instead (portable to other engines, far slower).
Four archive formats behind one vtable
Section titled “Four archive formats behind one vtable”CreateArchive selects a format and _allocAH dispatches to that format’s
init function, which fills a vtable of function pointers
(WriteDataPtr, StartTocPtr, ReadTocPtr, ClosePtr, …). The rest of
the archiver calls through the vtable and never names a concrete format:
// _allocAH — pg_backup_archiver.c (condensed)switch (AH->format){ case archCustom: InitArchiveFmt_Custom(AH); break; case archNull: InitArchiveFmt_Null(AH); break; /* plain text */ case archDirectory: InitArchiveFmt_Directory(AH); break; case archTar: InitArchiveFmt_Tar(AH); break; default: pg_fatal("unrecognized file format \"%d\"", …);}The format is chosen from the -F letter in pg_dump’s main:
-F | enum | back end | seekable / parallel | TOC |
|---|---|---|---|---|
p / plain | archNull | pg_backup_null.c | no / no | none — pure SQL stream |
c / custom | archCustom | pg_backup_custom.c | yes (if fseeko) / restore-only | one file, compressed blocks |
d / directory | archDirectory | pg_backup_directory.c | yes / dump and restore | dir of per-table files + toc.dat |
t / tar | archTar | pg_backup_tar.c | limited / no | one tar, toc.dat member |
Only the directory format supports parallel dump (each worker writes its
own file; a tar or single custom file would need coordinated seeks). Custom,
directory, and tar all carry a real TOC and so support selective and
reordered restore via pg_restore; plain text is a one-shot SQL script that
psql replays with no listing or filtering. The plain path is special in
main: RestoreArchive is invoked immediately to emit SQL, whereas the
container formats defer all output to CloseArchive.
Restore: ReadToc, requirement filtering, and the section passes
Section titled “Restore: ReadToc, requirement filtering, and the section passes”pg_restore opens the archive (OpenArchive → _allocAH in read mode),
ReadToc rebuilds the TOC list from the serialized WriteToc output, and
ProcessArchiveRestoreOptions walks the list once to mark each entry with a
reqs bitmask via _tocEntryRequired. That function decides, per entry and
purely from desc/tag string comparisons plus the RestoreOptions,
whether the entry’s schema and/or data should be emitted:
// _tocEntryRequired — pg_backup_archiver.c (condensed)int res = REQ_SCHEMA | REQ_DATA;if (strcmp(te->desc, "ENCODING") == 0 || … "STDSTRINGS" … || … "SEARCHPATH" …) return REQ_SPECIAL;if (strcmp(te->desc, "STATISTICS DATA") == 0){ if (!ropt->dumpStatistics) return 0; res = REQ_STATS; }/* DATABASE / DATABASE PROPERTIES only when --create */if (strcmp(te->desc, "DATABASE") == 0 || … "DATABASE PROPERTIES" …) return ropt->createDB ? REQ_SCHEMA : 0;if (ropt->aclsSkip && _tocEntryIsACL(te)) return 0;if (ropt->no_comments && strcmp(te->desc, "COMMENT") == 0) return 0;/* … no_policies / no_publications / no_security_labels exclusions … */RestoreArchive then replays entries in TOC order, but restore_toc_entry
splits each into its schema part (run when REQ_SCHEMA is set —
_printTocEntry emits the defn) and its data part (run when REQ_DATA is
set — the data block is read back and applied). Because the dump already
ordered everything topologically and the section boundaries are real TOC
entries, serial restore is just an in-order walk.
Parallel restore: prefork, ready heap, multi-pass
Section titled “Parallel restore: prefork, ready heap, multi-pass”Parallel restore (-j N, container formats with --use-db) is a three-phase
engine. restore_toc_entries_prefork runs all pre-data items in the single
parent connection (they are cheap and order-sensitive), pushing data and
post-data items onto a pending_list. Then
restore_toc_entries_parallel drives a worker pool against a ready heap
ordered by data size (largest first, so the pool does not end starved on one
huge table):
// restore_toc_entries_parallel — pg_backup_archiver.c (condensed)ready_heap = binaryheap_allocate(AH->tocCount, TocEntrySizeCompareBinaryheap, NULL);AH->restorePass = RESTORE_PASS_MAIN;move_to_ready_heap(pending_list, ready_heap, AH->restorePass);for (;;){ next_work_item = pop_next_work_item(ready_heap, pstate); if (next_work_item != NULL) DispatchJobForTocEntry(AH, pstate, next_work_item, ACT_RESTORE, mark_restore_job_done, ready_heap); else if (IsEveryWorkerIdle(pstate)) { if (AH->restorePass == RESTORE_PASS_LAST) break; AH->restorePass++; /* advance MAIN→ACL→POST_ACL */ move_to_ready_heap(pending_list, ready_heap, AH->restorePass); continue; } WaitForWorkers(AH, pstate, next_work_item ? WFW_ONE_IDLE : WFW_GOT_STATUS);}Two correctness mechanisms ride along. pop_next_work_item refuses to
dispatch an item that would need a lock conflicting with an
already-running worker (has_lock_conflicts) — important because, e.g., an
ALTER TABLE … ADD CONSTRAINT takes a stronger lock than a concurrent index
build on the same table. And mark_restore_job_done calls
reduce_dependencies, which decrements the blocked-count of every item that
depended on the finished one and moves any now-unblocked item into the ready
heap. The restorePass counter (RESTORE_PASS_MAIN → RESTORE_PASS_ACL →
RESTORE_PASS_POST_ACL) forces a few classes (ACLs, then comments/security
labels on them) into strictly later passes regardless of the graph, because
their dependencies are not fully modeled.
pg_dumpall: cluster globals plus per-database delegation
Section titled “pg_dumpall: cluster globals plus per-database delegation”pg_dump connects to one database and can only see objects in that
database’s catalog — it cannot dump cluster-global objects, which live in
shared catalogs (pg_authid, pg_db_role_setting, pg_tablespace).
pg_dumpall fills that gap. Its main connects to one database (default
postgres), emits the globals directly as SQL, then shells out to pg_dump
once per database:
// dumpDatabases / runPgDump — pg_dumpall.c (condensed)/* globals first, written straight to the output script */dumpRoles(conn); /* CREATE ROLE … from pg_authid */dumpRoleMembership(conn); /* GRANT role TO role from pg_auth_members */dumpTablespaces(conn); /* CREATE TABLESPACE … from pg_tablespace *//* then each database via a child pg_dump process */for each datname: create_opts = "--create"; /* or "--clean --create" */ ret = runPgDump(dbname, create_opts); /* exec pg_dump -Fp / -Fa */runPgDump builds a command line invoking the pg_dump binary with -Fp
(plain) or, when writing to a file, the undocumented -Fa (plain-append) so
multiple databases concatenate into one script. The output is therefore
always plain SQL — pg_dumpall has no archive format of its own. Roles are
dumped before databases because CREATE DATABASE … OWNER and per-database
ALTER ROLE … SET references must resolve; template0 is skipped and
template1/postgres are assumed to pre-exist (the script \connects
rather than CREATEs them).
Source Walkthrough
Section titled “Source Walkthrough”Anchor on symbol names, not line numbers. Use
git grep -n '<symbol>' src/bin/pg_dump/to relocate; line numbers in the table below are hints scoped to commit273fe94.
Driver and option handling (pg_dump.c)
Section titled “Driver and option handling (pg_dump.c)”main— parse options,parseArchiveFormatthe-Fletter, connect,CreateArchive, drive extraction → sort →dumpDumpableObjectloop →SetArchiveOptions/ProcessArchiveRestoreOptions→RestoreArchive(plain) orCloseArchive(container).parseArchiveFormat— mapp/c/d/t(and long names) toarchNull/archCustom/archDirectory/archTar.getTableData/makeTableDataInfo— createDO_TABLE_DATAobjects and attach thedumpTableDatadata-dumper to each dumpable table.dumpTableData_copy/dumpTableData_insert— the two data-dumper callbacks (server-sideCOPY … TO stdoutvs. emittedINSERTs).dumpTableSchema— build a table’sCREATE TABLEtext andArchiveEntryit;dumpDumpableObjectis the type-dispatch hub.dumpDatabase— emit theCREATE DATABASE/DATABASE PROPERTIESentries.getDependencies— readpg_depend, translate toaddObjectDependencyedges.selectDumpableTable/selectDumpableNamespace— policy subroutines that set each object’sdumpcomponent mask from include/exclude patterns.
Object model and catalog extraction (common.c)
Section titled “Object model and catalog extraction (common.c)”getSchemaData— the ordered sequence ofgetXxxcatalog reads; returns theTableInfoarray.AssignDumpId/createDumpId— assign aDumpId, register indumpIdMapandcatalogIdHash.recordAdditionalCatalogID— map a secondCatalogIdonto an existing object (e.g. array type ↔ base type).findObjectByDumpId/findObjectByCatalogId/findTableByOid— the three lookups.addObjectDependency/removeObjectDependency— mutate an object’sdependencies[]array.getDumpableObjects— flattendumpIdMapinto a modifiable array for the sort.flagInhTables/flagInhAttrs— wire child→parent links and suppress inherited columns/defaults.
Ordering (pg_dump_sort.c)
Section titled “Ordering (pg_dump_sort.c)”sortDumpableObjectsByTypeName/DOTypeNameCompare— stable pre-sort bydbObjectTypePriority[objType]then schema/name.dbObjectTypePriority[]— the per-DumpableObjectTypepriority array (designated initializers;StaticAssertDeclguards its length).sortDumpableObjects— set section-boundary IDs, loopTopoSort/findDependencyLoopsuntil acyclic.TopoSort— Knuth Vol. 1 topological sort overDumpIds using abinaryheappriority queue keyed to preserve input order.findDependencyLoops/repairDependencyLoop— detect a concrete cycle and dispatch to a shape-specific repair (repairTypeFuncLoop,repairViewRuleLoop, table-constraint splits, …).
Archive engine (pg_backup_archiver.c)
Section titled “Archive engine (pg_backup_archiver.c)”CreateArchive/OpenArchive/_allocAH— allocate theArchiveHandle, pick the format, callInitArchiveFmt_*._discoverArchiveFormat— sniff the magic of an input archive when-Fis unspecified for restore.ArchiveEntry— build aTocEntryand splice it into the circular TOC.WriteToc/ReadToc— serialize / deserialize the TOC (only entries whosereqsincludeREQ_SCHEMA|REQ_DATA|REQ_STATS|REQ_SPECIAL).WriteDataChunks/WriteDataChunksForTocEntry— invoke each entry’sdataDumper; parallel-dump path sorts by size andDispatchJobForTocEntry.ProcessArchiveRestoreOptions/_tocEntryRequired— compute per-entryreqsbitmask fromRestoreOptions.RestoreArchive/restore_toc_entry/_printTocEntry— serial restore: emit schema part then data part of each TOC entry in order.restore_toc_entries_prefork/restore_toc_entries_parallel— three-phase parallel restore driver.pop_next_work_item/move_to_ready_heap/reduce_dependencies/mark_restore_job_done— ready-heap scheduling and dependency decrement.parallel_restore— the worker-side entry that callsrestore_toc_entry.SortTocFromFile— reorder/filter the TOC from apg_restore -Llist file.
Cluster globals (pg_dumpall.c)
Section titled “Cluster globals (pg_dumpall.c)”main— connect, optionallydropDBs,dumpRoles,dumpRoleMembership,dumpTablespaces, thendumpDatabases.dumpRoles/dumpRoleMembership—CREATE ROLE/GRANTfrompg_authid/pg_auth_members.dumpTablespaces—CREATE TABLESPACEfrompg_tablespace.dumpDatabases/runPgDump— per-database delegation to thepg_dumpbinary with-Fp/-Fa.expand_dbname_patterns— resolve--databasepatterns againstpg_database.
Position hints (as of 2026-06-06, REL_18 273fe94)
Section titled “Position hints (as of 2026-06-06, REL_18 273fe94)”| Symbol | File | Line |
|---|---|---|
main | pg_dump.c | 414 |
selectDumpableTable | pg_dump.c | 2030 |
dumpTableData_copy | pg_dump.c | 2326 |
dumpTableData | pg_dump.c | 2816 |
getTableData | pg_dump.c | 2966 |
dumpDatabase | pg_dump.c | 3228 |
getTables | pg_dump.c | 7026 |
dumpDumpableObject | pg_dump.c | 11487 |
dumpTableSchema | pg_dump.c | 16919 |
getDependencies | pg_dump.c | 19803 |
getSchemaData | common.c | 98 |
flagInhTables | common.c | 269 |
AssignDumpId | common.c | 657 |
findObjectByCatalogId | common.c | 778 |
getDumpableObjects | common.c | 797 |
addObjectDependency | common.c | 818 |
dbObjectTypePriority[] | pg_dump_sort.c | 105 |
sortDumpableObjectsByTypeName | pg_dump_sort.c | 192 |
sortDumpableObjects | pg_dump_sort.c | 558 |
TopoSort | pg_dump_sort.c | 610 |
findDependencyLoops | pg_dump_sort.c | 759 |
repairDependencyLoop | pg_dump_sort.c | 1173 |
CreateArchive | pg_backup_archiver.c | 230 |
RestoreArchive | pg_backup_archiver.c | 343 |
restore_toc_entry | pg_backup_archiver.c | 843 |
ArchiveEntry | pg_backup_archiver.c | 1240 |
_allocAH | pg_backup_archiver.c | 2376 |
WriteDataChunks | pg_backup_archiver.c | 2496 |
WriteToc | pg_backup_archiver.c | 2610 |
ReadToc | pg_backup_archiver.c | 2708 |
_tocEntryRequired | pg_backup_archiver.c | 2986 |
restore_toc_entries_prefork | pg_backup_archiver.c | 4311 |
restore_toc_entries_parallel | pg_backup_archiver.c | 4434 |
pop_next_work_item | pg_backup_archiver.c | 4711 |
reduce_dependencies | pg_backup_archiver.c | 5060 |
dumpRoles | pg_dumpall.c | 823 |
dumpTablespaces | pg_dumpall.c | 1382 |
dumpDatabases | pg_dumpall.c | 1628 |
runPgDump | pg_dumpall.c | 1724 |
Source verification (as of 2026-06-06)
Section titled “Source verification (as of 2026-06-06)”Facts about the source at commit
273fe94, readable without external materials. Open questions follow.
Verified facts
Section titled “Verified facts”-
The four archive formats are exactly
archCustom,archTar,archNull,archDirectory. Verified inpg_backup.h: theArchiveFormatenum isarchUnknown = 0, archCustom = 1, archTar = 3, archNull = 4, archDirectory = 5(value 2 is intentionally absent). The-Flettersp/plainmap toarchNull, so the plain-text path reuses the null back end rather than a distinct format. Verified inparseArchiveFormat(pg_dump.c). -
_allocAHdispatches format selection throughInitArchiveFmt_*, and every later archiver operation goes through function-pointer members ofArchiveHandle. Verified in_allocAH: theswitch (AH->format)calls one of four init functions; the body of the archiver (e.g.WriteToc,RestoreArchive) callsAH->WriteExtraTocPtr,AH->ReadExtraTocPtr,AH->ClosePtr, etc., never a concrete format function. This is the strategy-pattern factoring. -
The dependency sort is wrapped in a retry loop that handles cycles by graph rewriting, not by erroring. Verified in
sortDumpableObjects:while (!TopoSort(...)) findDependencyLoops(...).TopoSortreturnsfalseand lists the unplaceable objects;findDependencyLoops→repairDependencyLoopmutates the graph (viaremoveObjectDependencyand by splitting objects) and the sort is retried until acyclic. -
The pre-data / data / post-data section split is implemented as ordinary nodes in the same topological sort. Verified:
sortDumpableObjectsstashespreDataBoundId/postDataBoundId(theDumpIds of theDO_PRE_DATA_BOUNDARY/DO_POST_DATA_BOUNDARYdummy objects) in statics, and the boundary objects participate inTopoSortlike any other. There is no separate section-ordering pass. -
WriteTocwrites only entries whosereqsbitmask is non-empty. Verified inWriteToc: it counts and then writes only entries with(te->reqs & (REQ_SCHEMA | REQ_DATA | REQ_STATS | REQ_SPECIAL)) != 0. Thereqsmask is computed earlier byProcessArchiveRestoreOptions/_tocEntryRequired, so the dumped TOC already reflects include/exclude decisions. -
_tocEntryRequiredkeys restore policy off thedescstring. Verified: the function is a cascade ofstrcmp(te->desc, "ENCODING"),"STATISTICS DATA","DATABASE",_tocEntryIsACL,"COMMENT","POLICY","PUBLICATION","SECURITY LABEL", … comparisons. The TOC entry’s human-readable descriptor is simultaneously its restore-class key. -
Parallel dump is only meaningful for the directory format; parallel restore additionally requires a real connection. Verified in
pg_dump’smain: anumWorkers > 1request errors unless the format isarchDirectory. Verified inRestoreArchive:parallel_mode = (AH->public.numWorkers > 1 && ropt->useDB). Single-file custom/tar dumps cannot be written by independent workers. -
Parallel restore respects both dependencies and lock conflicts. Verified:
reduce_dependencies(called frommark_restore_job_done) decrements blocked-counts and promotes newly-ready items;pop_next_work_itemskips any ready item whosehas_lock_conflictsagainst a currently-running worker would collide. The driver also advances throughrestorePassvalues, forcing ACL/post-ACL classes into later passes. -
pg_dumpallproduces plain SQL only and delegates table dumps to thepg_dumpbinary. Verified inrunPgDump(pg_dumpall.c): it builds a command line for the externalpg_dumpwith-Fp(or-Fawhen writing to a file).pg_dumpallitself only emits globals (dumpRoles,dumpRoleMembership,dumpTablespaces) inline; it has no archive format.
Open questions
Section titled “Open questions”-
Snapshot synchronization across parallel-dump workers. The consistent-snapshot guarantee for
-j Ndumps relies on the leader exporting a snapshot and each workerSET TRANSACTION SNAPSHOT-ing it. The exact handshake lives inparallel.c/pg_backup_db.cand is only summarized here; tracingset_archive_cancel_infoand the worker connection setup is left to aparallel.c-focused follow-up. -
Statistics dumping (
STATISTICS DATA). PG18 dumps optimizer statistics as a first-class TOC class (REQ_STATS, theDO_REL_STATSobject type,--statistics-only/--no-statistics). How the stats are serialized and re-applied (pg_restore_relation_stats) is referenced via_tocEntryRequiredhere but not walked end-to-end. -
Custom-format second TOC write.
WriteTochas special handling forte->defnLenset on a second call, which only the custom format makes (to rewrite data offsets in place oncefseekois confirmed usable). The exact close-time rewrite sequence lives inpg_backup_custom.c’s_CloseArchiveand is noted but not traced.
Beyond PostgreSQL — Comparative Designs & Research Frontiers
Section titled “Beyond PostgreSQL — Comparative Designs & Research Frontiers”Pointers, not analysis. Each bullet is a starting handle for a follow-up document, framed as “how does the same problem look elsewhere?”
-
mysqldump— streaming SQL, no intermediate graph. MySQL’s classic logical dumper readsINFORMATION_SCHEMAplusSHOW CREATE TABLEand streamsCREATE/INSERTSQL directly to stdout in catalog order, with no in-memoryDumpableObjectgraph and no topological sort. It leans onSET FOREIGN_KEY_CHECKS=0at restore time to side-step the ordering problem thatpg_dump_sort.csolves structurally. The trade is thatmysqldumpcannot offer selective restore, dependency-honoring parallelism, or a reorderable TOC — there is no container, only a script. The closest PostgreSQL analogue is the-Fp(plain) path, which is also a one-shot script; PostgreSQL’s extra machinery only pays off in the container formats. -
MySQL Shell
util.dumpInstance/dumpSchemas. The modern MySQL dumper (MySQL Shell 8.0+) is a deliberate move towardpg_dump’s container model: it writes a directory of per-table chunked files plus metadata, supports parallel dump and parallel load (util.loadDump), and usesLOCK INSTANCE FOR BACKUP+ a consistent snapshot for isolation. A side-by-side withpg_backup_directory.c’s per-worker file layout and the ready-heap scheduler would show two independent rediscoveries of the same “directory of chunks + dependency-aware loader” design. -
Oracle Data Pump (
expdp/impdp) — the master table. Data Pump materializes its TOC as an actual database table (the master table) in the dumping schema, recording every object, its metadata, and worker progress; this is what lets a Data Pump job be paused, resumed, and attached-to from another session. PostgreSQL’s TOC is in-archive and in-memory, not a live table, so apg_restorejob is not resumable mid-flight. A comparison of restart/resume semantics (Data Pump’s master table vs.pg_restore’s all-or-nothing--single-transaction) is a clean follow-up. -
SQL Server SMO + BACPAC. SQL Server’s logical export (
.bacpacvia SqlPackage / SMO) separates a schema model (DACPAC) from packaged data, and the SMO object model walkssys.*to script DDL — the same “catalog-to-DDL decompiler” roledumpTableSchemaplays. The schema model is validated as a unit before deployment, a design point PostgreSQL has no equivalent of (apg_restorefailure is detected statement-by-statement). -
Logical replication as a continuous dump.
pg_dumpis a point-in-time logical snapshot; logical decoding /pgoutput(seepostgres-logical-decoding.md,postgres-pgoutput.md) is the streaming form of the same “ship intent, not pages” idea. The initial table sync of a logical replication subscription is conceptually aCOPY-based dump of each table under an exported snapshot — the same snapshot-export mechanismpg_dump -juses. Tracing howtablesync.creuses snapshot export would unify the two stories. -
--filterand object-selection grammars. PG17 added a--filter=FILEoption (include/exclude rules read from a file) on top of the long-standing-t/-T/-n/-Npattern flags; the rules feed the sameselectDumpableTable/selectDumpableNamespacecomponent-mask machinery. A comparison with Data Pump’sINCLUDE/EXCLUDEmetadata filters andmysqldump’s coarse--ignore-tablewould map the expressiveness spectrum of object selection. -
Research frontier: parallel-safe consistent logical snapshots. The open question of how
-j Nworkers share one snapshot (SET TRANSACTION SNAPSHOTagainst a leader-exported snapshot id) is the practical face of a deeper topic — exporting and importing MVCC snapshots across sessions while holding back vacuum. The mechanics live inpostgres-mvcc-snapshots.mdandpostgres-procarray.md; the cost model (how long a multi-hour parallel dump pins old row versions cluster-wide) is a backup-planning concern worth its own note.
Sources
Section titled “Sources”PostgreSQL source (under /data/hgryoo/references/postgres, REL_18 273fe94)
Section titled “PostgreSQL source (under /data/hgryoo/references/postgres, REL_18 273fe94)”src/bin/pg_dump/pg_dump.c— driver:main,parseArchiveFormat,getTableData/makeTableDataInfo,dumpTableData_copy/dumpTableData_insert,dumpTableSchema,dumpDumpableObject,dumpDatabase,getDependencies,selectDumpableTable/selectDumpableNamespace.src/bin/pg_dump/common.c— object model:getSchemaData,AssignDumpId/createDumpId,recordAdditionalCatalogID,findObjectByDumpId/findObjectByCatalogId/findTableByOid,addObjectDependency/removeObjectDependency,getDumpableObjects,flagInhTables/flagInhAttrs.src/bin/pg_dump/pg_dump_sort.c— ordering:sortDumpableObjectsByTypeName/DOTypeNameCompare,dbObjectTypePriority[],sortDumpableObjects,TopoSort,findDependencyLoops/repairDependencyLoop(and the per-shaperepairTypeFuncLoop/repairViewRuleLoop/ table-constraint repairs).src/bin/pg_dump/pg_backup_archiver.c— archive engine:CreateArchive/OpenArchive/_allocAH,_discoverArchiveFormat,ArchiveEntry,WriteToc/ReadToc,WriteDataChunks,ProcessArchiveRestoreOptions/_tocEntryRequired,RestoreArchive/restore_toc_entry/_printTocEntry,restore_toc_entries_prefork/restore_toc_entries_parallel,pop_next_work_item/move_to_ready_heap/reduce_dependencies/mark_restore_job_done.src/bin/pg_dump/pg_dumpall.c— cluster globals:main,dumpRoles,dumpRoleMembership,dumpTablespaces,dumpDatabases/runPgDump,expand_dbname_patterns.src/bin/pg_dump/pg_backup.h— theArchiveFormatenum (archCustom/archTar/archNull/archDirectory),RestoreOptions, the publicArchivestruct.src/bin/pg_dump/pg_backup_archiver.h—ArchiveHandle,TocEntry, the function-pointer vtable members, theREQ_*andRESTORE_PASS_*constants.src/bin/pg_dump/parallel.c— (referenced)ParallelState,DispatchJobForTocEntry,WaitForWorkers, worker fork/connect and snapshot sharing.- Format back ends (referenced, not walked):
pg_backup_custom.c,pg_backup_directory.c,pg_backup_tar.c,pg_backup_null.c.
Textbook chapters (under knowledge/research/dbms-general/)
Section titled “Textbook chapters (under knowledge/research/dbms-general/)”- Database System Concepts (Silberschatz, Korth, Sudarshan), “Recovery System” — the dump-vs-log duality; a logical dump as a re-derivation recipe rather than a byte copy.
- Database Internals (Petrov), Part II — logical vs. physical replication,
the portability/fidelity trade that separates
pg_dumpfrompg_basebackup. - The Art of Computer Programming, Vol. 1 (Knuth), §2.2.3 — the
topological-sort algorithm
TopoSortimplements.
Cross-references (sibling module docs)
Section titled “Cross-references (sibling module docs)”postgres-system-catalogs.md— the catalogs (pg_class,pg_type,pg_proc,pg_constraint, …) that thegetXxxextractors read; defer catalog-layout detail there.postgres-dependency-tracking.md—pg_depend/pg_shdependsemantics thatgetDependenciestranslates into the dump graph; defer dependency-row detail there.postgres-mvcc-snapshots.md/postgres-procarray.md— the snapshot-export machinery that gives a dump (and a parallel dump) its consistent read.postgres-copy.md— theCOPY … TO stdoutprotocol thatdumpTableData_copyrides; the restore-sideCOPY … FROM stdin.postgres-backup-basebackup.md/postgres-incremental-backup.md— the physical backup path thatpg_dumpis the logical counterpart to.postgres-pg-upgrade.md—pg_upgradeusespg_dump --binary-upgrade/pg_dumpall --globals-onlyto carry schema across a major-version jump.postgres-ddl-execution.md— the server side that replays the DDL a dump emits.