Skip to content

PostgreSQL pg_dump / pg_restore — Catalog-Driven Logical Backup

Contents:

A database backup answers one question — how do I reconstruct this data after a loss? — but there are two structurally different answers, and the choice colors everything downstream.

A physical backup copies the bytes of the storage layer: the data files, the WAL, the control file. It is fast (sequential reads, no parsing), it is exact (every tuple, every dead row, every index leaf), and it is opaque — the copy is only meaningful to a server of the same major version, same block size, same architecture. PostgreSQL’s physical path is pg_basebackup / the WAL archive (see postgres-backup-basebackup.md).

A logical backup copies the meaning of the database: a stream of SQL statements (CREATE TABLE, COPY, CREATE INDEX, ALTER TABLE … ADD CONSTRAINT) that, when replayed against an empty cluster, reconstructs an equivalent database. It is slower (the server must re-parse, re-plan, re-index everything) and it is not byte-exact (a restored table is freshly packed, has no dead tuples, gets new OIDs). In exchange it is portable: across major versions, across architectures, across forks, and it can be restored selectively (one schema, one table, schema-only). pg_dump is PostgreSQL’s logical-backup engine.

Three theoretical problems define the logical-backup design space:

  1. Snapshot consistency. A dump runs for minutes to hours against a live database that keeps changing. Without isolation, the dump would see a table created after the catalog scan but miss its data, or capture a foreign key whose referenced table was dropped mid-dump. The dump must read the entire database — catalog and data — as of a single point in time. This is exactly the snapshot-isolation guarantee MVCC already provides; the dumper’s job is to acquire one snapshot and pin it for the whole run (postgres-mvcc-snapshots.md).

  2. Reconstruction ordering — the topological-sort problem. SQL is not commutative under definition. CREATE TABLE orders (cust_id REFERENCES customers) cannot execute before customers exists; a view cannot be created before the tables it selects from; an index’s operator class must exist first. The set of “X must exist before Y” facts forms a directed graph, and a valid restore order is any topological sort of that graph. The Art of Computer Programming, Vol. 1 (Knuth, §2.2.3) gives the canonical algorithm: repeatedly emit a node with no unsatisfied predecessors, decrement the predecessor counts of its successors, repeat. The wrinkle is that real schemas contain cycles — a view that references itself through a rule, two tables with mutual foreign keys — and a cyclic graph has no topological order. The dumper must detect cycles and break them by splitting one object into two TOC entries (the table definition, then the constraint that closes the loop), restoring the second piece in a later pass.

  3. Format / engine separation. The decisions what to dump and in what order are independent of how the bytes are laid out on disk. A clean design factors the object-extraction logic (which talks SQL to the server) away from the archive-serialization logic (which writes a file), connected only by an abstract “archive entry” interface. This is the classic strategy pattern: one extraction front end, several interchangeable format back ends.

Database System Concepts (Silberschatz et al., “Recovery System”) frames the backup/restore duality as the difference between dump (a snapshot of state) and log (a record of changes), and notes that a logical dump is fundamentally a re-derivation recipe rather than a copy. Database Internals (Petrov, Part II) frames the same split as “logical vs. physical replication,” where the logical form ships row/statement intent and the physical form ships page images — the same trade between portability and fidelity that separates pg_dump from pg_basebackup.

Every mature engine ships a logical-export tool, and they converge on a small set of patterns. Naming them first makes PostgreSQL’s choices read as selections within a shared space.

A logical dumper does not parse the DDL the user originally typed (that text is not retained). It reconstructs DDL by querying the system catalog — the tables that describe tables, columns, types, constraints, indexes. The dumper is, in effect, a large catalog-to-DDL decompiler. Every engine has one: MySQL’s mysqldump reads INFORMATION_SCHEMA + SHOW CREATE TABLE; Oracle’s Data Pump reads the dictionary views; SQL Server’s SMO walks sys.*. The fidelity of the dump is bounded by the completeness of the catalog read — a column default stored in pg_attrdef that the dumper forgets to query simply vanishes from the backup.

The export must read catalog and data under one snapshot. The universal mechanism is to open a transaction in the strictest isolation the engine offers and hold it open for the whole dump: REPEATABLE READ / SERIALIZABLE in PostgreSQL, START TRANSACTION WITH CONSISTENT SNAPSHOT in MySQL/InnoDB, FLASHBACK SCNs in Oracle. The cost is a long-lived transaction that pins old row versions (blocking vacuum) and holds AccessShareLock on every dumped table for the duration.

An intermediate object model, not a streaming pipe

Section titled “An intermediate object model, not a streaming pipe”

A naive dumper could stream CREATE … ; COPY … ; straight to a file as it walks the catalog. But ordering forbids this: you cannot know the safe order until you have collected all objects and all their dependency edges. Mature tools therefore build an in-memory object graph first, sort it, and only then serialize. This graph is also what enables selective restore (restore one node and its transitive prerequisites) and parallelism (dispatch independent subgraphs concurrently).

A table of contents (TOC) decoupling listing from content

Section titled “A table of contents (TOC) decoupling listing from content”

A container format puts a table of contents at a known location: a list of (id, kind, name, owner, dependency-list, offset-to-data) records. The TOC lets a restore tool (a) list the archive’s contents without reading the data, (b) reorder or filter entries, and (c) seek directly to one table’s data for selective or parallel restore. pg_restore -l / -L, Oracle Data Pump’s master table, and SQL Server’s backup header all serve this role.

Section phasing: schema-before-data-before-constraints

Section titled “Section phasing: schema-before-data-before-constraints”

Bulk-loading data is dramatically faster when indexes and constraints do not yet exist — each inserted row would otherwise pay for index maintenance and constraint checks. Every fast loader therefore splits restore into phases: create bare tables, bulk-load data, then build indexes and add constraints. PostgreSQL formalizes this as three sectionspre-data, data, post-data — and the section boundary is itself a node in the dependency graph.

Parallelism bounded by the dependency graph

Section titled “Parallelism bounded by the dependency graph”

Restore is embarrassingly parallel except where dependencies serialize it: two tables’ data can load concurrently, but an index cannot build before its table’s data is present, and a foreign key cannot be added before both endpoints exist. The general pattern is a worker pool fed by a ready queue: an item is “ready” when all its dependencies are done; workers pull ready items, and completion of one item may make several others ready.

PostgreSQL splits the problem across a handful of files in src/bin/pg_dump/. The cleanest way to read the codebase is as four cooperating layers:

flowchart TB
    subgraph extract["Extraction front end — talks SQL"]
        A["pg_dump.c main()<br/>option parsing, connect,<br/>set snapshot"]
        B["common.c getSchemaData()<br/>getTables / getTypes / getFuncs / …<br/>builds DumpableObject graph"]
    end
    subgraph order["Ordering"]
        C["pg_dump_sort.c<br/>sortDumpableObjectsByTypeName<br/>then sortDumpableObjects (TopoSort)"]
    end
    subgraph emit["Archive back end — writes bytes"]
        D["pg_backup_archiver.c<br/>ArchiveEntry() builds TOC<br/>CreateArchive() picks format"]
        E["pg_backup_custom.c<br/>pg_backup_directory.c<br/>pg_backup_tar.c<br/>pg_backup_null.c (plain)"]
    end
    subgraph restore["pg_restore"]
        F["OpenArchive + ReadToc<br/>RestoreArchive()<br/>serial or parallel"]
    end
    A --> B --> C --> D --> E
    E -.archive file.-> F

Figure 1 — The four layers. Solid arrows are the dump path; the dashed arrow is the file handed to pg_restore. Plain-text dump is the same path with the archNull back end, and pg_dump itself drives RestoreArchive in that case rather than writing a container.

Catalog-driven extraction under a pinned snapshot

Section titled “Catalog-driven extraction under a pinned snapshot”

main() in pg_dump.c connects with a regular libpq connection, then opens a transaction and (for a consistent dump) sets it to the appropriate isolation level and exports a snapshot — the same machinery a parallel dump uses to share one snapshot across worker connections. Inside that transaction, getSchemaData (in common.c) runs a long, deliberately ordered sequence of catalog reads. The ordering is not cosmetic; comments in the function spell out the data dependencies:

// getSchemaData — common.c (condensed)
/* extensions first: membership gates whether other objects dump */
extinfo = getExtensions(fout, &numExtensions);
getExtensionMembership(fout, extinfo, numExtensions);
getNamespaces(fout);
/* getTables ASAP to minimize the lock-acquisition window */
tblinfo = getTables(fout, &numTables);
getOwnedSeqs(fout, tblinfo, numTables);
getFuncs(fout);
getTypes(fout); /* must be after getTables and getFuncs */
/* ... aggregates, operators, opclasses, collations, casts, ... */
flagInhTables(fout, tblinfo, numTables, inhinfo, numInherits);
getTableAttrs(fout, tblinfo, numTables); /* columns of interesting tables */
getIndexes(fout, tblinfo, numTables);
getConstraints(fout, tblinfo, numTables);
getTriggers(fout, tblinfo, numTables);
/* ... rules, policies, publications, subscriptions ... */

Each getXxx runs a SELECT against the relevant catalog (pg_class, pg_type, pg_proc, pg_constraint, …) and turns each row into a heap-allocated *Info struct (TableInfo, TypeInfo, FuncInfo, …) whose first member is a DumpableObject. The comment on getTables captures the locking concern: every table that may be dumped is LOCK TABLE … IN ACCESS SHARE MODE-d to prevent it being dropped or rewritten mid-dump, and the function runs early to keep the window between transaction start and lock acquisition small.

The DumpableObject graph: CatalogId and DumpId

Section titled “The DumpableObject graph: CatalogId and DumpId”

Every dumpable object carries two identities. The CatalogId is its catalog coordinate — (tableoid, oid), i.e. which catalog and which row — and is how the dumper cross-references objects it learned about from different queries. The DumpId is a dense small integer assigned sequentially as objects are created; it is the key used by the dependency arrays and the sort. AssignDumpId wires both into the lookup structures:

// AssignDumpId — common.c (condensed)
dobj->dumpId = ++lastDumpId;
dobj->dump = DUMP_COMPONENT_ALL; /* default: dump everything */
dobj->components = DUMP_COMPONENT_DEFINITION;
dobj->dependencies = NULL; dobj->nDeps = 0; dobj->allocDeps = 0;
/* index by DumpId (dense array, grown by doubling) */
dumpIdMap[dobj->dumpId] = dobj;
/* index by CatalogId (open-addressing hash) if it has one */
if (OidIsValid(dobj->catId.tableoid))
{
entry = catalogid_insert(catalogIdHash, dobj->catId, &found);
entry->dobj = dobj;
}

Two indexes, two access patterns: findObjectByDumpId (array indexing, used by the sort) and findObjectByCatalogId (hash lookup, used when one catalog read references an OID discovered by another). Dependency edges are plain DumpId lists attached to each object; addObjectDependency appends to a doubling array, removeObjectDependency filters one out — these are the primitives the loop-repair logic uses to rewrite the graph.

flowchart LR
    subgraph idx["common.c lookup tables"]
        DM["dumpIdMap[]<br/>DumpId to DumpableObject*"]
        CH["catalogIdHash<br/>CatalogId to DumpableObject*"]
    end
    O1["TableInfo customers<br/>dumpId=12 catId=(1259,16400)"]
    O2["TableInfo orders<br/>dumpId=15 catId=(1259,16410)<br/>deps=[12]"]
    O3["ConstraintInfo orders_fk<br/>dumpId=22 deps=[15,12]"]
    O1 --> DM
    O2 --> DM
    O3 --> DM
    O1 --> CH
    O2 --> CH
    O2 -. depends on .-> O1
    O3 -. depends on .-> O2
    O3 -. depends on .-> O1

Figure 2 — The object graph. orders depends on customers (foreign key target); the FK constraint depends on both endpoints. The sort must emit customers before orders before orders_fk — and if a mutual FK created a cycle, loop repair would split the constraint out so it lands in post-data.

Explicit dependencies from pg_depend, plus synthetic edges

Section titled “Explicit dependencies from pg_depend, plus synthetic edges”

Most edges come straight from the catalog: getDependencies (in pg_dump.c) reads pg_depend, maps each (classid, objid)(refclassid, refobjid) row to a pair of DumpableObjects via findObjectByCatalogId, and calls addObjectDependency. On top of these, the extraction code adds synthetic edges that pg_depend does not model — for example, a column default that must be emitted after its table (addObjectDependency(&attrDef->dobj, tbinfo->dobj.dumpId) in common.c), or the matview-refresh edges built by buildMatViewRefreshDependencies.

Ordering: type-name pre-sort, then topological sort

Section titled “Ordering: type-name pre-sort, then topological sort”

pg_dump.c sorts in two stages. First sortDumpableObjectsByTypeName imposes a stable, deterministic baseline order (by object-type priority, then by schema and name) so that two logically identical databases dump byte-identically — important for diffing dumps. Then sortDumpableObjects runs the real topological sort, using the type-name order only as a tie-breaker to “minimize unnecessary rearrangement”:

// sortDumpableObjects — pg_dump_sort.c (condensed)
preDataBoundId = preBoundaryId; /* the pre-data/data section boundary */
postDataBoundId = postBoundaryId; /* the data/post-data boundary */
ordering = pg_malloc(numObjs * sizeof(DumpableObject *));
while (!TopoSort(objs, numObjs, ordering, &nOrdering))
findDependencyLoops(ordering, nOrdering, numObjs);
memcpy(objs, ordering, numObjs * sizeof(DumpableObject *));

The while loop is the cycle-handling structure: TopoSort returns true on success; if it returns false it has filled ordering with the objects it could not place (those in or downstream of a cycle), and findDependencyLoops finds an actual cycle among them and calls repairDependencyLoop to break it, after which the sort is retried.

The two boundary IDs name dummy objects (DO_PRE_DATA_BOUNDARY, DO_POST_DATA_BOUNDARY) that the sort treats as ordinary nodes. By making every pre-data object a predecessor of the pre-data boundary, every data object a successor of it and a predecessor of the post-data boundary, the section split falls out of the same topological sort rather than needing a separate pass.

Loop repair: splitting objects into multiple TOC entries

Section titled “Loop repair: splitting objects into multiple TOC entries”

repairDependencyLoop is a catalog of known cycle shapes. A type and its I/O function reference each other; a view and its ON SELECT rule reference each other; two tables can have mutual foreign keys. Each case has a hand-coded repair that removes one edge — usually by deferring part of the object to post-data:

// repairDependencyLoop — pg_dump_sort.c (condensed)
/* Datatype and one of its I/O or canonicalize functions */
if (nLoop == 2 &&
loop[0]->objType == DO_TYPE && loop[1]->objType == DO_FUNC)
{
repairTypeFuncLoop(loop[0], loop[1]);
return;
}
/* View (including matview) and its ON SELECT rule */
if (nLoop == 2 &&
loop[0]->objType == DO_TABLE && loop[1]->objType == DO_RULE &&
(((TableInfo *) loop[0])->relkind == RELKIND_VIEW || … RELKIND_MATVIEW) &&
((RuleInfo *) loop[1])->ev_type == '1' && … )
{
repairViewRuleLoop(loop[0], loop[1]); /* dump view as CREATE+rule */
return;
}
/* … table-constraint loops, larger indirect loops, etc. … */

This is why a self-referential view dumps as a bare CREATE VIEW … AS SELECT NULL (or a dummy table) in pre-data plus a CREATE RULE "_RETURN" in post-data: the loop repair split it.

Once sorted, pg_dump.c walks the ordered array and calls dumpDumpableObject on each, which dispatches by type (dumpTableSchema, dumpType, dumpFunc, …). Each dumpXxx builds the object’s SQL text and registers it as a TOC entry via ArchiveEntry, supplying its tag, owner, section, the CREATE/DROP SQL, an optional COPY statement, an optional data dumper callback, and its dependency list:

// ArchiveEntry — pg_backup_archiver.c (condensed)
newToc = pg_malloc0(sizeof(TocEntry));
/* link into the circular TOC list */
newToc->prev = AH->toc->prev; newToc->next = AH->toc;
AH->toc->prev->next = newToc; AH->toc->prev = newToc;
newToc->dumpId = dumpId;
newToc->section = opts->section; /* PRE_DATA / DATA / POST_DATA */
newToc->tag = pg_strdup(opts->tag);
newToc->desc = pg_strdup(opts->description); /* "TABLE", "INDEX", … */
newToc->defn = opts->createStmt ? pg_strdup(opts->createStmt) : NULL;
newToc->dropStmt = opts->dropStmt ? pg_strdup(opts->dropStmt) : NULL;
newToc->copyStmt = opts->copyStmt ? pg_strdup(opts->copyStmt) : NULL;
newToc->dataDumper = opts->dumpFn; /* table-data callback */
memcpy(newToc->dependencies, opts->deps, opts->nDeps * sizeof(DumpId));

The desc string (“TABLE”, “INDEX”, “FK CONSTRAINT”, “TABLE DATA”, “DATABASE”, …) is load-bearing: the restore-time _tocEntryRequired filter keys entirely off desc string comparisons, so the descriptor doubles as the object’s restore-policy class.

A table’s definition is a SQL string in the TOC; its data is not — it is a callback (dataDumper) that streams rows on demand. For most tables that callback is dumpTableData_copy, which issues a server-side COPY … TO stdout and relays the protocol’s CopyData rows straight into the archive:

// dumpTableData_copy — pg_dump.c (condensed)
column_list = fmtCopyColumnList(tbinfo, clistBuf); /* explicit col order */
if (tdinfo->filtercond || tbinfo->relkind == RELKIND_FOREIGN_TABLE)
appendPQExpBuffer(q, "COPY (SELECT %s FROM %s %s) TO stdout;",
…, fmtQualifiedDumpable(tbinfo),
tdinfo->filtercond ? tdinfo->filtercond : "");
else
appendPQExpBuffer(q, "COPY %s %s TO stdout;",
fmtQualifiedDumpable(tbinfo), column_list);

Deferring data to a callback is what makes the archive formats pluggable: a plain dump runs the callback and writes SQL inline; the custom/directory formats run it during WriteDataChunks and store the compressed CopyData block at a TOC-recorded offset, so restore can seek to it. The --inserts mode swaps the callback for dumpTableData_insert, which emits INSERT statements instead (portable to other engines, far slower).

CreateArchive selects a format and _allocAH dispatches to that format’s init function, which fills a vtable of function pointers (WriteDataPtr, StartTocPtr, ReadTocPtr, ClosePtr, …). The rest of the archiver calls through the vtable and never names a concrete format:

// _allocAH — pg_backup_archiver.c (condensed)
switch (AH->format)
{
case archCustom: InitArchiveFmt_Custom(AH); break;
case archNull: InitArchiveFmt_Null(AH); break; /* plain text */
case archDirectory: InitArchiveFmt_Directory(AH); break;
case archTar: InitArchiveFmt_Tar(AH); break;
default: pg_fatal("unrecognized file format \"%d\"", …);
}

The format is chosen from the -F letter in pg_dump’s main:

-Fenumback endseekable / parallelTOC
p / plainarchNullpg_backup_null.cno / nonone — pure SQL stream
c / customarchCustompg_backup_custom.cyes (if fseeko) / restore-onlyone file, compressed blocks
d / directoryarchDirectorypg_backup_directory.cyes / dump and restoredir of per-table files + toc.dat
t / tararchTarpg_backup_tar.climited / noone tar, toc.dat member

Only the directory format supports parallel dump (each worker writes its own file; a tar or single custom file would need coordinated seeks). Custom, directory, and tar all carry a real TOC and so support selective and reordered restore via pg_restore; plain text is a one-shot SQL script that psql replays with no listing or filtering. The plain path is special in main: RestoreArchive is invoked immediately to emit SQL, whereas the container formats defer all output to CloseArchive.

Restore: ReadToc, requirement filtering, and the section passes

Section titled “Restore: ReadToc, requirement filtering, and the section passes”

pg_restore opens the archive (OpenArchive_allocAH in read mode), ReadToc rebuilds the TOC list from the serialized WriteToc output, and ProcessArchiveRestoreOptions walks the list once to mark each entry with a reqs bitmask via _tocEntryRequired. That function decides, per entry and purely from desc/tag string comparisons plus the RestoreOptions, whether the entry’s schema and/or data should be emitted:

// _tocEntryRequired — pg_backup_archiver.c (condensed)
int res = REQ_SCHEMA | REQ_DATA;
if (strcmp(te->desc, "ENCODING") == 0 ||"STDSTRINGS"||"SEARCHPATH" …)
return REQ_SPECIAL;
if (strcmp(te->desc, "STATISTICS DATA") == 0)
{ if (!ropt->dumpStatistics) return 0; res = REQ_STATS; }
/* DATABASE / DATABASE PROPERTIES only when --create */
if (strcmp(te->desc, "DATABASE") == 0 ||"DATABASE PROPERTIES" …)
return ropt->createDB ? REQ_SCHEMA : 0;
if (ropt->aclsSkip && _tocEntryIsACL(te)) return 0;
if (ropt->no_comments && strcmp(te->desc, "COMMENT") == 0) return 0;
/* … no_policies / no_publications / no_security_labels exclusions … */

RestoreArchive then replays entries in TOC order, but restore_toc_entry splits each into its schema part (run when REQ_SCHEMA is set — _printTocEntry emits the defn) and its data part (run when REQ_DATA is set — the data block is read back and applied). Because the dump already ordered everything topologically and the section boundaries are real TOC entries, serial restore is just an in-order walk.

Parallel restore: prefork, ready heap, multi-pass

Section titled “Parallel restore: prefork, ready heap, multi-pass”

Parallel restore (-j N, container formats with --use-db) is a three-phase engine. restore_toc_entries_prefork runs all pre-data items in the single parent connection (they are cheap and order-sensitive), pushing data and post-data items onto a pending_list. Then restore_toc_entries_parallel drives a worker pool against a ready heap ordered by data size (largest first, so the pool does not end starved on one huge table):

// restore_toc_entries_parallel — pg_backup_archiver.c (condensed)
ready_heap = binaryheap_allocate(AH->tocCount, TocEntrySizeCompareBinaryheap, NULL);
AH->restorePass = RESTORE_PASS_MAIN;
move_to_ready_heap(pending_list, ready_heap, AH->restorePass);
for (;;)
{
next_work_item = pop_next_work_item(ready_heap, pstate);
if (next_work_item != NULL)
DispatchJobForTocEntry(AH, pstate, next_work_item, ACT_RESTORE,
mark_restore_job_done, ready_heap);
else if (IsEveryWorkerIdle(pstate))
{
if (AH->restorePass == RESTORE_PASS_LAST) break;
AH->restorePass++; /* advance MAIN→ACL→POST_ACL */
move_to_ready_heap(pending_list, ready_heap, AH->restorePass);
continue;
}
WaitForWorkers(AH, pstate, next_work_item ? WFW_ONE_IDLE : WFW_GOT_STATUS);
}

Two correctness mechanisms ride along. pop_next_work_item refuses to dispatch an item that would need a lock conflicting with an already-running worker (has_lock_conflicts) — important because, e.g., an ALTER TABLE … ADD CONSTRAINT takes a stronger lock than a concurrent index build on the same table. And mark_restore_job_done calls reduce_dependencies, which decrements the blocked-count of every item that depended on the finished one and moves any now-unblocked item into the ready heap. The restorePass counter (RESTORE_PASS_MAINRESTORE_PASS_ACLRESTORE_PASS_POST_ACL) forces a few classes (ACLs, then comments/security labels on them) into strictly later passes regardless of the graph, because their dependencies are not fully modeled.

pg_dumpall: cluster globals plus per-database delegation

Section titled “pg_dumpall: cluster globals plus per-database delegation”

pg_dump connects to one database and can only see objects in that database’s catalog — it cannot dump cluster-global objects, which live in shared catalogs (pg_authid, pg_db_role_setting, pg_tablespace). pg_dumpall fills that gap. Its main connects to one database (default postgres), emits the globals directly as SQL, then shells out to pg_dump once per database:

// dumpDatabases / runPgDump — pg_dumpall.c (condensed)
/* globals first, written straight to the output script */
dumpRoles(conn); /* CREATE ROLE … from pg_authid */
dumpRoleMembership(conn); /* GRANT role TO role from pg_auth_members */
dumpTablespaces(conn); /* CREATE TABLESPACE … from pg_tablespace */
/* then each database via a child pg_dump process */
for each datname:
create_opts = "--create"; /* or "--clean --create" */
ret = runPgDump(dbname, create_opts); /* exec pg_dump -Fp / -Fa */

runPgDump builds a command line invoking the pg_dump binary with -Fp (plain) or, when writing to a file, the undocumented -Fa (plain-append) so multiple databases concatenate into one script. The output is therefore always plain SQL — pg_dumpall has no archive format of its own. Roles are dumped before databases because CREATE DATABASE … OWNER and per-database ALTER ROLE … SET references must resolve; template0 is skipped and template1/postgres are assumed to pre-exist (the script \connects rather than CREATEs them).

Anchor on symbol names, not line numbers. Use git grep -n '<symbol>' src/bin/pg_dump/ to relocate; line numbers in the table below are hints scoped to commit 273fe94.

  • main — parse options, parseArchiveFormat the -F letter, connect, CreateArchive, drive extraction → sort → dumpDumpableObject loop → SetArchiveOptions / ProcessArchiveRestoreOptionsRestoreArchive (plain) or CloseArchive (container).
  • parseArchiveFormat — map p/c/d/t (and long names) to archNull / archCustom / archDirectory / archTar.
  • getTableData / makeTableDataInfo — create DO_TABLE_DATA objects and attach the dumpTableData data-dumper to each dumpable table.
  • dumpTableData_copy / dumpTableData_insert — the two data-dumper callbacks (server-side COPY … TO stdout vs. emitted INSERTs).
  • dumpTableSchema — build a table’s CREATE TABLE text and ArchiveEntry it; dumpDumpableObject is the type-dispatch hub.
  • dumpDatabase — emit the CREATE DATABASE / DATABASE PROPERTIES entries.
  • getDependencies — read pg_depend, translate to addObjectDependency edges.
  • selectDumpableTable / selectDumpableNamespace — policy subroutines that set each object’s dump component mask from include/exclude patterns.

Object model and catalog extraction (common.c)

Section titled “Object model and catalog extraction (common.c)”
  • getSchemaData — the ordered sequence of getXxx catalog reads; returns the TableInfo array.
  • AssignDumpId / createDumpId — assign a DumpId, register in dumpIdMap and catalogIdHash.
  • recordAdditionalCatalogID — map a second CatalogId onto an existing object (e.g. array type ↔ base type).
  • findObjectByDumpId / findObjectByCatalogId / findTableByOid — the three lookups.
  • addObjectDependency / removeObjectDependency — mutate an object’s dependencies[] array.
  • getDumpableObjects — flatten dumpIdMap into a modifiable array for the sort.
  • flagInhTables / flagInhAttrs — wire child→parent links and suppress inherited columns/defaults.
  • sortDumpableObjectsByTypeName / DOTypeNameCompare — stable pre-sort by dbObjectTypePriority[objType] then schema/name.
  • dbObjectTypePriority[] — the per-DumpableObjectType priority array (designated initializers; StaticAssertDecl guards its length).
  • sortDumpableObjects — set section-boundary IDs, loop TopoSort / findDependencyLoops until acyclic.
  • TopoSort — Knuth Vol. 1 topological sort over DumpIds using a binaryheap priority queue keyed to preserve input order.
  • findDependencyLoops / repairDependencyLoop — detect a concrete cycle and dispatch to a shape-specific repair (repairTypeFuncLoop, repairViewRuleLoop, table-constraint splits, …).
  • CreateArchive / OpenArchive / _allocAH — allocate the ArchiveHandle, pick the format, call InitArchiveFmt_*.
  • _discoverArchiveFormat — sniff the magic of an input archive when -F is unspecified for restore.
  • ArchiveEntry — build a TocEntry and splice it into the circular TOC.
  • WriteToc / ReadToc — serialize / deserialize the TOC (only entries whose reqs include REQ_SCHEMA|REQ_DATA|REQ_STATS|REQ_SPECIAL).
  • WriteDataChunks / WriteDataChunksForTocEntry — invoke each entry’s dataDumper; parallel-dump path sorts by size and DispatchJobForTocEntry.
  • ProcessArchiveRestoreOptions / _tocEntryRequired — compute per-entry reqs bitmask from RestoreOptions.
  • RestoreArchive / restore_toc_entry / _printTocEntry — serial restore: emit schema part then data part of each TOC entry in order.
  • restore_toc_entries_prefork / restore_toc_entries_parallel — three-phase parallel restore driver.
  • pop_next_work_item / move_to_ready_heap / reduce_dependencies / mark_restore_job_done — ready-heap scheduling and dependency decrement.
  • parallel_restore — the worker-side entry that calls restore_toc_entry.
  • SortTocFromFile — reorder/filter the TOC from a pg_restore -L list file.
  • main — connect, optionally dropDBs, dumpRoles, dumpRoleMembership, dumpTablespaces, then dumpDatabases.
  • dumpRoles / dumpRoleMembershipCREATE ROLE / GRANT from pg_authid / pg_auth_members.
  • dumpTablespacesCREATE TABLESPACE from pg_tablespace.
  • dumpDatabases / runPgDump — per-database delegation to the pg_dump binary with -Fp / -Fa.
  • expand_dbname_patterns — resolve --database patterns against pg_database.

Position hints (as of 2026-06-06, REL_18 273fe94)

Section titled “Position hints (as of 2026-06-06, REL_18 273fe94)”
SymbolFileLine
mainpg_dump.c414
selectDumpableTablepg_dump.c2030
dumpTableData_copypg_dump.c2326
dumpTableDatapg_dump.c2816
getTableDatapg_dump.c2966
dumpDatabasepg_dump.c3228
getTablespg_dump.c7026
dumpDumpableObjectpg_dump.c11487
dumpTableSchemapg_dump.c16919
getDependenciespg_dump.c19803
getSchemaDatacommon.c98
flagInhTablescommon.c269
AssignDumpIdcommon.c657
findObjectByCatalogIdcommon.c778
getDumpableObjectscommon.c797
addObjectDependencycommon.c818
dbObjectTypePriority[]pg_dump_sort.c105
sortDumpableObjectsByTypeNamepg_dump_sort.c192
sortDumpableObjectspg_dump_sort.c558
TopoSortpg_dump_sort.c610
findDependencyLoopspg_dump_sort.c759
repairDependencyLooppg_dump_sort.c1173
CreateArchivepg_backup_archiver.c230
RestoreArchivepg_backup_archiver.c343
restore_toc_entrypg_backup_archiver.c843
ArchiveEntrypg_backup_archiver.c1240
_allocAHpg_backup_archiver.c2376
WriteDataChunkspg_backup_archiver.c2496
WriteTocpg_backup_archiver.c2610
ReadTocpg_backup_archiver.c2708
_tocEntryRequiredpg_backup_archiver.c2986
restore_toc_entries_preforkpg_backup_archiver.c4311
restore_toc_entries_parallelpg_backup_archiver.c4434
pop_next_work_itempg_backup_archiver.c4711
reduce_dependenciespg_backup_archiver.c5060
dumpRolespg_dumpall.c823
dumpTablespacespg_dumpall.c1382
dumpDatabasespg_dumpall.c1628
runPgDumppg_dumpall.c1724

Facts about the source at commit 273fe94, readable without external materials. Open questions follow.

  • The four archive formats are exactly archCustom, archTar, archNull, archDirectory. Verified in pg_backup.h: the ArchiveFormat enum is archUnknown = 0, archCustom = 1, archTar = 3, archNull = 4, archDirectory = 5 (value 2 is intentionally absent). The -F letters p/plain map to archNull, so the plain-text path reuses the null back end rather than a distinct format. Verified in parseArchiveFormat (pg_dump.c).

  • _allocAH dispatches format selection through InitArchiveFmt_*, and every later archiver operation goes through function-pointer members of ArchiveHandle. Verified in _allocAH: the switch (AH->format) calls one of four init functions; the body of the archiver (e.g. WriteToc, RestoreArchive) calls AH->WriteExtraTocPtr, AH->ReadExtraTocPtr, AH->ClosePtr, etc., never a concrete format function. This is the strategy-pattern factoring.

  • The dependency sort is wrapped in a retry loop that handles cycles by graph rewriting, not by erroring. Verified in sortDumpableObjects: while (!TopoSort(...)) findDependencyLoops(...). TopoSort returns false and lists the unplaceable objects; findDependencyLoopsrepairDependencyLoop mutates the graph (via removeObjectDependency and by splitting objects) and the sort is retried until acyclic.

  • The pre-data / data / post-data section split is implemented as ordinary nodes in the same topological sort. Verified: sortDumpableObjects stashes preDataBoundId / postDataBoundId (the DumpIds of the DO_PRE_DATA_BOUNDARY / DO_POST_DATA_BOUNDARY dummy objects) in statics, and the boundary objects participate in TopoSort like any other. There is no separate section-ordering pass.

  • WriteToc writes only entries whose reqs bitmask is non-empty. Verified in WriteToc: it counts and then writes only entries with (te->reqs & (REQ_SCHEMA | REQ_DATA | REQ_STATS | REQ_SPECIAL)) != 0. The reqs mask is computed earlier by ProcessArchiveRestoreOptions / _tocEntryRequired, so the dumped TOC already reflects include/exclude decisions.

  • _tocEntryRequired keys restore policy off the desc string. Verified: the function is a cascade of strcmp(te->desc, "ENCODING"), "STATISTICS DATA", "DATABASE", _tocEntryIsACL, "COMMENT", "POLICY", "PUBLICATION", "SECURITY LABEL", … comparisons. The TOC entry’s human-readable descriptor is simultaneously its restore-class key.

  • Parallel dump is only meaningful for the directory format; parallel restore additionally requires a real connection. Verified in pg_dump’s main: a numWorkers > 1 request errors unless the format is archDirectory. Verified in RestoreArchive: parallel_mode = (AH->public.numWorkers > 1 && ropt->useDB). Single-file custom/tar dumps cannot be written by independent workers.

  • Parallel restore respects both dependencies and lock conflicts. Verified: reduce_dependencies (called from mark_restore_job_done) decrements blocked-counts and promotes newly-ready items; pop_next_work_item skips any ready item whose has_lock_conflicts against a currently-running worker would collide. The driver also advances through restorePass values, forcing ACL/post-ACL classes into later passes.

  • pg_dumpall produces plain SQL only and delegates table dumps to the pg_dump binary. Verified in runPgDump (pg_dumpall.c): it builds a command line for the external pg_dump with -Fp (or -Fa when writing to a file). pg_dumpall itself only emits globals (dumpRoles, dumpRoleMembership, dumpTablespaces) inline; it has no archive format.

  1. Snapshot synchronization across parallel-dump workers. The consistent-snapshot guarantee for -j N dumps relies on the leader exporting a snapshot and each worker SET TRANSACTION SNAPSHOT-ing it. The exact handshake lives in parallel.c / pg_backup_db.c and is only summarized here; tracing set_archive_cancel_info and the worker connection setup is left to a parallel.c-focused follow-up.

  2. Statistics dumping (STATISTICS DATA). PG18 dumps optimizer statistics as a first-class TOC class (REQ_STATS, the DO_REL_STATS object type, --statistics-only / --no-statistics). How the stats are serialized and re-applied (pg_restore_relation_stats) is referenced via _tocEntryRequired here but not walked end-to-end.

  3. Custom-format second TOC write. WriteToc has special handling for te->defnLen set on a second call, which only the custom format makes (to rewrite data offsets in place once fseeko is confirmed usable). The exact close-time rewrite sequence lives in pg_backup_custom.c’s _CloseArchive and is noted but not traced.

Beyond PostgreSQL — Comparative Designs & Research Frontiers

Section titled “Beyond PostgreSQL — Comparative Designs & Research Frontiers”

Pointers, not analysis. Each bullet is a starting handle for a follow-up document, framed as “how does the same problem look elsewhere?”

  • mysqldump — streaming SQL, no intermediate graph. MySQL’s classic logical dumper reads INFORMATION_SCHEMA plus SHOW CREATE TABLE and streams CREATE/INSERT SQL directly to stdout in catalog order, with no in-memory DumpableObject graph and no topological sort. It leans on SET FOREIGN_KEY_CHECKS=0 at restore time to side-step the ordering problem that pg_dump_sort.c solves structurally. The trade is that mysqldump cannot offer selective restore, dependency-honoring parallelism, or a reorderable TOC — there is no container, only a script. The closest PostgreSQL analogue is the -Fp (plain) path, which is also a one-shot script; PostgreSQL’s extra machinery only pays off in the container formats.

  • MySQL Shell util.dumpInstance / dumpSchemas. The modern MySQL dumper (MySQL Shell 8.0+) is a deliberate move toward pg_dump’s container model: it writes a directory of per-table chunked files plus metadata, supports parallel dump and parallel load (util.loadDump), and uses LOCK INSTANCE FOR BACKUP + a consistent snapshot for isolation. A side-by-side with pg_backup_directory.c’s per-worker file layout and the ready-heap scheduler would show two independent rediscoveries of the same “directory of chunks + dependency-aware loader” design.

  • Oracle Data Pump (expdp / impdp) — the master table. Data Pump materializes its TOC as an actual database table (the master table) in the dumping schema, recording every object, its metadata, and worker progress; this is what lets a Data Pump job be paused, resumed, and attached-to from another session. PostgreSQL’s TOC is in-archive and in-memory, not a live table, so a pg_restore job is not resumable mid-flight. A comparison of restart/resume semantics (Data Pump’s master table vs. pg_restore’s all-or-nothing --single-transaction) is a clean follow-up.

  • SQL Server SMO + BACPAC. SQL Server’s logical export (.bacpac via SqlPackage / SMO) separates a schema model (DACPAC) from packaged data, and the SMO object model walks sys.* to script DDL — the same “catalog-to-DDL decompiler” role dumpTableSchema plays. The schema model is validated as a unit before deployment, a design point PostgreSQL has no equivalent of (a pg_restore failure is detected statement-by-statement).

  • Logical replication as a continuous dump. pg_dump is a point-in-time logical snapshot; logical decoding / pgoutput (see postgres-logical-decoding.md, postgres-pgoutput.md) is the streaming form of the same “ship intent, not pages” idea. The initial table sync of a logical replication subscription is conceptually a COPY-based dump of each table under an exported snapshot — the same snapshot-export mechanism pg_dump -j uses. Tracing how tablesync.c reuses snapshot export would unify the two stories.

  • --filter and object-selection grammars. PG17 added a --filter=FILE option (include/exclude rules read from a file) on top of the long-standing -t / -T / -n / -N pattern flags; the rules feed the same selectDumpableTable / selectDumpableNamespace component-mask machinery. A comparison with Data Pump’s INCLUDE/EXCLUDE metadata filters and mysqldump’s coarse --ignore-table would map the expressiveness spectrum of object selection.

  • Research frontier: parallel-safe consistent logical snapshots. The open question of how -j N workers share one snapshot (SET TRANSACTION SNAPSHOT against a leader-exported snapshot id) is the practical face of a deeper topic — exporting and importing MVCC snapshots across sessions while holding back vacuum. The mechanics live in postgres-mvcc-snapshots.md and postgres-procarray.md; the cost model (how long a multi-hour parallel dump pins old row versions cluster-wide) is a backup-planning concern worth its own note.

PostgreSQL source (under /data/hgryoo/references/postgres, REL_18 273fe94)

Section titled “PostgreSQL source (under /data/hgryoo/references/postgres, REL_18 273fe94)”
  • src/bin/pg_dump/pg_dump.c — driver: main, parseArchiveFormat, getTableData / makeTableDataInfo, dumpTableData_copy / dumpTableData_insert, dumpTableSchema, dumpDumpableObject, dumpDatabase, getDependencies, selectDumpableTable / selectDumpableNamespace.
  • src/bin/pg_dump/common.c — object model: getSchemaData, AssignDumpId / createDumpId, recordAdditionalCatalogID, findObjectByDumpId / findObjectByCatalogId / findTableByOid, addObjectDependency / removeObjectDependency, getDumpableObjects, flagInhTables / flagInhAttrs.
  • src/bin/pg_dump/pg_dump_sort.c — ordering: sortDumpableObjectsByTypeName / DOTypeNameCompare, dbObjectTypePriority[], sortDumpableObjects, TopoSort, findDependencyLoops / repairDependencyLoop (and the per-shape repairTypeFuncLoop / repairViewRuleLoop / table-constraint repairs).
  • src/bin/pg_dump/pg_backup_archiver.c — archive engine: CreateArchive / OpenArchive / _allocAH, _discoverArchiveFormat, ArchiveEntry, WriteToc / ReadToc, WriteDataChunks, ProcessArchiveRestoreOptions / _tocEntryRequired, RestoreArchive / restore_toc_entry / _printTocEntry, restore_toc_entries_prefork / restore_toc_entries_parallel, pop_next_work_item / move_to_ready_heap / reduce_dependencies / mark_restore_job_done.
  • src/bin/pg_dump/pg_dumpall.c — cluster globals: main, dumpRoles, dumpRoleMembership, dumpTablespaces, dumpDatabases / runPgDump, expand_dbname_patterns.
  • src/bin/pg_dump/pg_backup.h — the ArchiveFormat enum (archCustom / archTar / archNull / archDirectory), RestoreOptions, the public Archive struct.
  • src/bin/pg_dump/pg_backup_archiver.hArchiveHandle, TocEntry, the function-pointer vtable members, the REQ_* and RESTORE_PASS_* constants.
  • src/bin/pg_dump/parallel.c — (referenced) ParallelState, DispatchJobForTocEntry, WaitForWorkers, worker fork/connect and snapshot sharing.
  • Format back ends (referenced, not walked): pg_backup_custom.c, pg_backup_directory.c, pg_backup_tar.c, pg_backup_null.c.

Textbook chapters (under knowledge/research/dbms-general/)

Section titled “Textbook chapters (under knowledge/research/dbms-general/)”
  • Database System Concepts (Silberschatz, Korth, Sudarshan), “Recovery System” — the dump-vs-log duality; a logical dump as a re-derivation recipe rather than a byte copy.
  • Database Internals (Petrov), Part II — logical vs. physical replication, the portability/fidelity trade that separates pg_dump from pg_basebackup.
  • The Art of Computer Programming, Vol. 1 (Knuth), §2.2.3 — the topological-sort algorithm TopoSort implements.
  • postgres-system-catalogs.md — the catalogs (pg_class, pg_type, pg_proc, pg_constraint, …) that the getXxx extractors read; defer catalog-layout detail there.
  • postgres-dependency-tracking.mdpg_depend / pg_shdepend semantics that getDependencies translates into the dump graph; defer dependency-row detail there.
  • postgres-mvcc-snapshots.md / postgres-procarray.md — the snapshot-export machinery that gives a dump (and a parallel dump) its consistent read.
  • postgres-copy.md — the COPY … TO stdout protocol that dumpTableData_copy rides; the restore-side COPY … FROM stdin.
  • postgres-backup-basebackup.md / postgres-incremental-backup.md — the physical backup path that pg_dump is the logical counterpart to.
  • postgres-pg-upgrade.mdpg_upgrade uses pg_dump --binary-upgrade / pg_dumpall --globals-only to carry schema across a major-version jump.
  • postgres-ddl-execution.md — the server side that replays the DDL a dump emits.