PostgreSQL ALTER TABLE — The Multi-Pass Machinery and Table Rewrites
Contents:
- Theoretical Background
- Common DBMS Design
- PostgreSQL’s Approach
- Source Walkthrough
- Source verification (as of 2026-06-05)
- Beyond PostgreSQL — Comparative Designs & Research Frontiers
- Sources
Theoretical Background
Section titled “Theoretical Background”ALTER TABLE is the surface syntax for schema evolution: changing the
logical shape of a relation (its columns, types, constraints, storage
properties) after rows already exist. Every relational system must answer
one foundational question — when the declared schema changes, what
happens to the bytes already on disk? — and the answer divides the entire
design space.
Database System Concepts (Silberschatz, Korth, Sudarshan) introduces the
relation as a set of tuples conforming to a schema R(A1, A2, …, An). The
schema is metadata; the tuples are data. A schema change can therefore be
realized in two fundamentally different ways:
-
Metadata-only (logical) change. The catalog description is updated and existing tuples are reinterpreted under the new description. No stored byte moves. This is cheap (O(1) in the number of rows) but is only possible when the on-disk representation is already compatible with the new schema — for example, dropping a column (the bytes stay but are no longer projected) or widening
varchar(10)tovarchar(20)(same physical layout, looser constraint). -
Physical (data) change — a rewrite. Every stored tuple must be read, transformed, and written back in the new layout. This is O(rows) work, produces WAL proportional to the table size, and — in the simplest implementation — holds an exclusive lock for the whole duration. It is unavoidable when the on-disk bytes themselves must change: converting
inttotext, changing a row’s physical width, or compacting after a storage-parameter change.
The art of an ALTER TABLE implementation is maximizing case 1 and
minimizing case 2. A naive engine rewrites the table on every change; a
mature one recognizes the large class of changes that are metadata-only
and reserves the rewrite for the irreducible cases. PostgreSQL has spent
two decades pushing operations out of case 2 into case 1: the attisdropped
logical-delete for DROP COLUMN (since the very beginning), the
binary-coercion shortcut for ALTER TYPE, and — the marquee feature added
in PostgreSQL 11 — the fast default, which lets ADD COLUMN ... DEFAULT <const> be a pure catalog change by storing the default value in
pg_attribute.attmissingval and synthesizing it at read time.
A second theoretical axis is atomicity of DDL. In many engines DDL is
auto-committing or only partially transactional. PostgreSQL treats DDL as
ordinary transactional work: the catalog rows are MVCC tuples, the heap
rewrite is a normal heap insert into a new relfilenode, and an error
anywhere — or an explicit ROLLBACK — undoes the whole ALTER TABLE with
no special cleanup. The header comment on AlterTable states it directly:
“Thanks to the magic of MVCC, an error anywhere along the way rolls back
the whole operation; we don’t have to do anything special to clean up.”
The third axis is concurrency. A schema change must coordinate with
concurrent readers and writers. The classic mechanism is a table-level
lock whose strength is chosen to be exactly as strong as the change’s
visibility requires: a change invisible to running SELECTs can take a
weak lock that allows reads to continue, whereas a change that rewrites the
heap or alters query plans needs AccessExclusiveLock. PostgreSQL encodes
this entire policy in one function, AlterTableGetLockLevel, and — crucially
— computes it before acquiring any lock, so the lock taken is correct on
the first try.
The design space an ALTER TABLE implementer chooses within:
-
One pass or many. A user can submit several subcommands in a single statement (
ALTER TABLE t DROP a, ADD b int, ALTER c TYPE bigint). Some subcommands must be ordered relative to others (drop before add, type-change before index re-add). The implementation can either reject combinations or impose a fixed ordering. PostgreSQL imposes a fixed pass ordering. -
Per-statement scan or per-subcommand scan. If three subcommands each need to examine every row, does the engine scan three times or once? PostgreSQL’s explicit goal is one pass over the data per table, no matter how many subcommands need it.
-
Rewrite in place or rewrite to a new file. PostgreSQL never mutates heap pages in place for a rewrite; it builds an entirely new relfilenode and atomically swaps the file pointers, reusing the
CLUSTER/VACUUM FULLmachinery (make_new_heap,finish_heap_swap).
Common DBMS Design
Section titled “Common DBMS Design”Schema-change engines across systems converge on a recognizable set of conventions. Naming them makes PostgreSQL’s specific symbols read as one set of choices within a shared playbook.
Separate “prepare” and “execute” phases
Section titled “Separate “prepare” and “execute” phases”A schema change cannot validate every precondition in a single pass,
because earlier subcommands change the catalog state that later ones
depend on. The universal pattern is to split into a preparation phase
(check permissions, resolve names, decide recursion, reject illegal
combinations) and an execution phase (mutate catalogs, then mutate data).
PostgreSQL’s ATPrepCmd / ATExecCmd split is exactly this; the prepare
phase is deliberately conservative about reading “table details that
another subcommand could change.”
A work queue keyed by target relation
Section titled “A work queue keyed by target relation”Inheritance and partitioning mean one statement can affect many tables.
The standard structure is a work queue — one entry per affected relation,
each carrying the list of operations to perform and the scratch state
gathered during preparation. PostgreSQL’s queue is a List of
AlteredTableInfo structs, found or created by ATGetQueueEntry.
Ordered phases so dependent operations interleave correctly
Section titled “Ordered phases so dependent operations interleave correctly”Because subcommands have ordering constraints, engines bucket them into
fixed phases and execute phase-by-phase across all tables in parallel
before advancing. This lets one table’s operation enqueue work for another
table’s later phase (PostgreSQL’s example: ALTER TYPE on a foreign key’s
primary-key column dispatching the re-add of the FK constraint to the
referencing table). PostgreSQL has 13 passes, AT_PASS_DROP first through
AT_PASS_MISC last.
Lock strength derived from change visibility
Section titled “Lock strength derived from change visibility”The lock a DDL takes should be the weakest that still serializes
correctly against concurrent activity that could observe the change.
Systems classify each operation: invisible-to-reads → weak lock;
plan-invalidating or heap-rewriting → exclusive lock. PostgreSQL’s
AlterTableGetLockLevel is a giant switch returning one of
ShareUpdateExclusiveLock, ShareRowExclusiveLock, or
AccessExclusiveLock per subtype, taking the maximum across subcommands.
Metadata-only changes via catalog flags and value synthesis
Section titled “Metadata-only changes via catalog flags and value synthesis”The cheap path for schema change is a catalog flag plus read-time
synthesis. Dropping a column flips a “deleted” flag and leaves the bytes;
adding a column with a default stores the default in the catalog and
materializes it only for rows that predate the column. PostgreSQL uses
pg_attribute.attisdropped for the former and attmissingval (the fast
default, PG 11+) for the latter.
Rewrite by build-new-file-and-swap, not in-place mutation
Section titled “Rewrite by build-new-file-and-swap, not in-place mutation”When a rewrite is truly required, the robust pattern is to build a fresh
physical file, copy transformed tuples into it, and atomically swap the
file identity — so a crash or rollback leaves the original intact. This is
the same mechanism CLUSTER and VACUUM FULL use. PostgreSQL shares the
code: make_new_heap builds the transient relation, ATRewriteTable
copies, finish_heap_swap swaps relfilenodes and rebuilds indexes.
Theory ↔ PostgreSQL mapping
Section titled “Theory ↔ PostgreSQL mapping”| Concept | PostgreSQL name |
|---|---|
| Top-level driver | AlterTable → ATController |
| Lock-level policy (pre-lock) | AlterTableGetLockLevel |
| Phase 1 — prepare | ATPrepCmd |
| Phase 2 — catalog update | ATRewriteCatalogs → ATExecCmd |
| Phase 3 — scan/rewrite | ATRewriteTables → ATRewriteTable |
| Work-queue entry (per table) | AlteredTableInfo |
| Work-queue lookup/create | ATGetQueueEntry |
| Fixed pass ordering | AlterTablePass enum (AT_PASS_DROP … AT_PASS_MISC) |
| Rewrite reason bitmask | tab->rewrite (AT_REWRITE_* flags) |
| New-column value to compute in Phase 3 | NewColumnValue |
| New constraint to verify in Phase 3 | NewConstraint |
| Rewrite-vs-metadata decision (type) | ATColumnChangeRequiresRewrite |
| Fast default (metadata-only ADD) | StoreAttrMissingVal / attmissingval |
| Logical column delete | RemoveAttributeById (attisdropped) |
| Build transient heap | make_new_heap (cluster.c) |
| Atomic file swap | finish_heap_swap / swap_relation_files |
| Inheritance recursion | ATSimpleRecursion, find_all_inheritors |
PostgreSQL’s Approach
Section titled “PostgreSQL’s Approach”One statement, one lock, three phases
Section titled “One statement, one lock, three phases”ALTER TABLE enters through AlterTable, but the lock is acquired by the
caller (utility.c) before AlterTable runs, using a lock level computed
by AlterTableGetLockLevel. The key invariant — stated in the function’s
own comment — is that the lock level must be decidable without looking at
the table, because the table is not yet locked:
// AlterTableGetLockLevel — src/backend/commands/tablecmds.cLOCKMODEAlterTableGetLockLevel(List *cmds){ ListCell *lcmd; LOCKMODE lockmode = ShareUpdateExclusiveLock;
foreach(lcmd, cmds) { AlterTableCmd *cmd = (AlterTableCmd *) lfirst(lcmd); LOCKMODE cmd_lockmode = AccessExclusiveLock; /* default */
switch (cmd->subtype) { case AT_AddColumn: /* may rewrite heap */ case AT_SetAccessMethod: /* must rewrite heap */ case AT_SetTableSpace: /* must rewrite heap */ case AT_AlterColumnType: /* must rewrite heap */ cmd_lockmode = AccessExclusiveLock; break; /* ... dozens of cases ... */ case AT_SetStatistics: case AT_ClusterOn: case AT_DropCluster: cmd_lockmode = ShareUpdateExclusiveLock; break; /* ... */ } /* Take the greatest lockmode from any subcommand */ if (cmd_lockmode > lockmode) lockmode = cmd_lockmode; } return lockmode;}The default floor is ShareUpdateExclusiveLock (the weakest level any
ALTER TABLE uses — it still excludes VACUUM, ANALYZE, and other DDL
but allows reads and writes). Most subcommands escalate to
AccessExclusiveLock; trigger enable/disable and FK addition land on the
intermediate ShareRowExclusiveLock. The maximum across all subcommands
wins, and that single lock is held until commit. A comment in the function
adds the Hot Standby caveat: because a standby only learns about
AccessExclusiveLocks on the primary, any change visible to standby
SELECTs must use AccessExclusiveLock even if a weaker lock would
otherwise suffice.
After the caller takes the lock, AlterTable opens the relation NoLock
(the lock is already held), runs a safety check, and delegates to
ATController, which is the clearest statement of the three-phase design:
// ATController — src/backend/commands/tablecmds.cstatic voidATController(AlterTableStmt *parsetree, Relation rel, List *cmds, bool recurse, LOCKMODE lockmode, AlterTableUtilityContext *context){ List *wqueue = NIL; ListCell *lcmd;
/* Phase 1: preliminary examination of commands, create work queue */ foreach(lcmd, cmds) { AlterTableCmd *cmd = (AlterTableCmd *) lfirst(lcmd); ATPrepCmd(&wqueue, rel, cmd, recurse, false, lockmode, context); }
/* Close the relation, but keep lock until commit */ relation_close(rel, NoLock);
/* Phase 2: update system catalogs */ ATRewriteCatalogs(&wqueue, lockmode, context);
/* Phase 3: scan/rewrite tables as needed, and run afterStmts */ ATRewriteTables(parsetree, &wqueue, lockmode, context);}The design intent (from the AlterTable header) is one pass over the
data: Phase 3 is skipped entirely unless some subcommand requires it, and
when it does run, it scans each table exactly once regardless of how many
subcommands contributed work.
flowchart TD
A["utility.c: AlterTableGetLockLevel(cmds)<br/>computes lock WITHOUT opening table"] --> B["acquire table lock<br/>(held until commit)"]
B --> C["AlterTable -> CheckAlterTableIsSafe -> ATController"]
C --> D["Phase 1: ATPrepCmd per subcommand<br/>permission/relkind checks, recurse to children<br/>build work queue of AlteredTableInfo<br/>bucket subcmds into passes"]
D --> E["relation_close(rel, NoLock)<br/>lock stays"]
E --> F["Phase 2: ATRewriteCatalogs<br/>for pass = DROP .. MISC:<br/>for each table: ATExecCmd per subcmd<br/>mutate catalogs, maybe set tab->rewrite"]
F --> G{"any table:<br/>tab->rewrite > 0<br/>or constraints/notnull<br/>to verify?"}
G -->|no| H["Phase 3 mostly no-op<br/>run afterStmts"]
G -->|"rewrite > 0"| I["make_new_heap -> ATRewriteTable<br/>copy+transform every tuple<br/>finish_heap_swap (swap relfilenodes)"]
G -->|"verify only"| J["ATRewriteTable(tab, InvalidOid)<br/>scan to check constraints, no rewrite"]
I --> K["validate FK constraints (final loop)<br/>run afterStmts"]
J --> K
H --> K
The work queue and the AlteredTableInfo struct
Section titled “The work queue and the AlteredTableInfo struct”Phase 1 builds a List of AlteredTableInfo, one entry per affected
relation. The struct carries everything Phases 2 and 3 need — the
per-pass subcommand lists, the pre-modification tuple descriptor (captured
the moment the entry is created, before anything changes it), the columns
to recompute (newvals), the constraints to verify (constraints), and
the rewrite reason:
// AlteredTableInfo (condensed) — src/backend/commands/tablecmds.ctypedef struct AlteredTableInfo{ Oid relid; /* Relation to work on */ char relkind; /* Its relkind */ TupleDesc oldDesc; /* Pre-modification tuple descriptor */ Relation rel; /* transiently set during Phase 2 */ List *subcmds[AT_NUM_PASSES]; /* Lists of AlterTableCmd */ List *constraints; /* List of NewConstraint */ List *newvals; /* List of NewColumnValue */ List *afterStmts; /* utility command parsetrees */ bool verify_new_notnull; /* T if we should recheck NOT NULL */ int rewrite; /* Reason for forced rewrite, if any */ bool chgAccessMethod; Oid newAccessMethod; Oid newTableSpace; bool chgPersistence; char newrelpersistence; Expr *partition_constraint; /* ... changedConstraintOids/Defs, changedIndexOids/Defs, etc. ... */} AlteredTableInfo;ATGetQueueEntry is the find-or-create accessor. Note how it snapshots the
tuple descriptor on creation — oldDesc is the old shape, which Phase 3
needs to read existing tuples even after Phase 2 has rewritten
pg_attribute:
// ATGetQueueEntry — src/backend/commands/tablecmds.cstatic AlteredTableInfo *ATGetQueueEntry(List **wqueue, Relation rel){ Oid relid = RelationGetRelid(rel); AlteredTableInfo *tab; ListCell *ltab;
foreach(ltab, *wqueue) { tab = (AlteredTableInfo *) lfirst(ltab); if (tab->relid == relid) return tab; /* already queued */ }
/* Not there, so add it (snapshot the OLD descriptor now) */ tab = (AlteredTableInfo *) palloc0(sizeof(AlteredTableInfo)); tab->relid = relid; tab->relkind = rel->rd_rel->relkind; tab->oldDesc = CreateTupleDescCopyConstr(RelationGetDescr(rel)); tab->newAccessMethod = InvalidOid; tab->newTableSpace = InvalidOid; tab->newrelpersistence = RELPERSISTENCE_PERMANENT; *wqueue = lappend(*wqueue, tab); return tab;}The pass ordering
Section titled “The pass ordering”The reason Phase 2 cannot just execute subcommands in user order is that
operations have dependencies: a DROP COLUMN must happen before an ADD COLUMN of the same name; an existing index must be re-added only after the
ALTER TYPE that forced its rebuild; an FK index-based constraint must
precede the FK itself. PostgreSQL resolves this with a fixed enum of
passes; Phase 1 files each subcommand into the right subcmds[pass] bucket:
// AlterTablePass — src/backend/commands/tablecmds.ctypedef enum AlterTablePass{ AT_PASS_UNSET = -1, /* UNSET will cause ERROR */ AT_PASS_DROP, /* DROP (all flavors) */ AT_PASS_ALTER_TYPE, /* ALTER COLUMN TYPE */ AT_PASS_ADD_COL, /* ADD COLUMN */ AT_PASS_SET_EXPRESSION, /* ALTER SET EXPRESSION */ AT_PASS_OLD_INDEX, /* re-add existing indexes */ AT_PASS_OLD_CONSTR, /* re-add existing constraints */ AT_PASS_ADD_CONSTR, /* ADD constraints (initial examination) */ AT_PASS_COL_ATTRS, /* set column attributes, eg NOT NULL */ AT_PASS_ADD_INDEXCONSTR, /* ADD index-based constraints */ AT_PASS_ADD_INDEX, /* ADD indexes */ AT_PASS_ADD_OTHERCONSTR, /* ADD other constraints, defaults */ AT_PASS_MISC, /* other stuff */} AlterTablePass;
#define AT_NUM_PASSES (AT_PASS_MISC + 1)DROP is pass 0, ALTER TYPE is pass 1 (it precedes the OLD_INDEX and
OLD_CONSTR passes that re-add the objects it dropped), and the ADD
family fans out across passes 7–11 so index-based constraints, plain
indexes, and other constraints land in dependency order. Anything
unclassified falls to AT_PASS_MISC.
In-place vs. full rewrite: the three cheap paths
Section titled “In-place vs. full rewrite: the three cheap paths”The heart of ALTER TABLE performance is the tab->rewrite bitmask. It
starts at 0; each subcommand that genuinely needs a heap rewrite ORs in a
reason flag:
// AT_REWRITE_* flags — src/include/commands/event_trigger.h#define AT_REWRITE_ALTER_PERSISTENCE 0x01#define AT_REWRITE_DEFAULT_VAL 0x02#define AT_REWRITE_COLUMN_REWRITE 0x04#define AT_REWRITE_ACCESS_METHOD 0x08If, after Phase 2, tab->rewrite is still 0 and there are no new
constraints to verify, Phase 3 does essentially nothing for that table.
Three subcommand families illustrate the cheap-vs-expensive decision:
DROP COLUMN — always metadata-only. Dropping a column never touches a
single heap page. ATExecDropColumn ultimately drives RemoveAttributeById
(in heap.c), which flips attisdropped, blanks the type, and renames the
column to a guaranteed-unique placeholder. The bytes stay in every tuple;
the executor simply stops projecting that attribute:
// RemoveAttributeById (condensed) — src/backend/catalog/heap.cvoidRemoveAttributeById(Oid relid, AttrNumber attnum){ /* ... open rel AccessExclusiveLock, fetch pg_attribute tuple ... */ attStruct->attisdropped = true; attStruct->atttypid = InvalidOid; /* type link no longer reliable */ attStruct->attnotnull = false; attStruct->attgenerated = '\0'; /* rename so the slot can't collide with a future ADD COLUMN */ snprintf(newattname, sizeof(newattname), "........pg.dropped.%d........", attnum); namestrcpy(&(attStruct->attname), newattname); /* ... clear attmissingval, attstattarget, attacl, attoptions ... */ CatalogTupleUpdate(attr_rel, &tuple->t_self, tuple);}ADD COLUMN — metadata-only when the default is a constant (fast
default). ATExecAddColumn inserts the pg_attribute row, then tries to
avoid a rewrite. If the column has a non-volatile, non-generated,
non-domain-constrained default, it evaluates the default once and stores
the resulting datum in pg_attribute.attmissingval via
StoreAttrMissingVal. Every pre-existing row then reads back that value
without the byte ever being written to the heap. Only if that fast path is
unavailable does it set AT_REWRITE_DEFAULT_VAL:
// ATExecAddColumn (rewrite decision, condensed) — tablecmds.cif (rel->rd_rel->relkind == RELKIND_RELATION && !colDef->generated && !has_domain_constraints && !contain_volatile_functions((Node *) defval)){ /* Evaluate the default once and store it outside the heap */ missingval = ExecEvalExpr(exprState, ..., &missingIsNull); if (!missingIsNull) { StoreAttrMissingVal(rel, attribute->attnum, missingval); has_missing = true; }}else{ /* Failed to use missing mode -> must rewrite to install the value */ if (colDef->generated != ATTRIBUTE_GENERATED_VIRTUAL) tab->rewrite |= AT_REWRITE_DEFAULT_VAL;}ADD COLUMN with no default at all is even cheaper: the comment notes that
heap access routines return NULL for any attnum beyond the stored tuple’s
attribute count, so a nullable column needs neither a rewrite nor a missing
value.
ALTER COLUMN TYPE — rewrite only if the cast isn’t a no-op.
ATPrepAlterColumnType plans the transformation expression and asks
ATColumnChangeRequiresRewrite whether it can be elided. If the expression
is just the column itself (varchar(10) → varchar(20), or any
binary-coercible cast), no rewrite is needed; the catalog type changes and
the bytes stay. The function walks the expression tree, stripping
relabel/coerce nodes, and returns false only if it bottoms out at the
original Var:
// ATColumnChangeRequiresRewrite (condensed) — tablecmds.cstatic boolATColumnChangeRequiresRewrite(Node *expr, AttrNumber varattno){ for (;;) { if (IsA(expr, Var) && ((Var *) expr)->varattno == varattno) return false; /* identity -> no rewrite */ else if (IsA(expr, RelabelType)) expr = (Node *) ((RelabelType *) expr)->arg; else if (IsA(expr, CoerceToDomain)) { CoerceToDomain *d = (CoerceToDomain *) expr; if (DomainHasConstraints(d->resulttype)) return true; /* must scan to enforce domain */ expr = (Node *) d->arg; } else if (IsA(expr, FuncExpr)) { /* a few timestamp casts are no-ops on some configs */ /* ... F_TIMESTAMPTZ_TIMESTAMP etc. ... */ return true; } else return true; /* any real computation -> rewrite */ }}When a rewrite is required, the plumbing in ATPrepAlterColumnType
queues a NewColumnValue (the transform expression) and sets the bit:
// ATPrepAlterColumnType (tail, condensed) — tablecmds.cnewval = (NewColumnValue *) palloc0(sizeof(NewColumnValue));newval->attnum = attnum;newval->expr = (Expr *) transform;tab->newvals = lappend(tab->newvals, newval);if (ATColumnChangeRequiresRewrite(transform, attnum)) tab->rewrite |= AT_REWRITE_COLUMN_REWRITE;flowchart TD
S["subcommand in Phase 2 (ATExecCmd)"] --> T{"which subtype?"}
T -->|DROP COLUMN| U["RemoveAttributeById<br/>attisdropped = true<br/>NEVER sets tab->rewrite"]
T -->|"ADD COLUMN DEFAULT c"| V{"const, non-volatile,<br/>no domain constraint,<br/>plain relation?"}
V -->|yes| W["StoreAttrMissingVal<br/>attmissingval = eval(default)<br/>metadata-only"]
V -->|no| X["tab->rewrite |= AT_REWRITE_DEFAULT_VAL"]
T -->|"ALTER TYPE"| Y{"ATColumnChangeRequiresRewrite?"}
Y -->|no, binary-coercible| Z["catalog type change only<br/>metadata-only"]
Y -->|yes| AA["tab->rewrite |= AT_REWRITE_COLUMN_REWRITE<br/>queue NewColumnValue"]
T -->|"SET LOGGED / UNLOGGED"| AB["tab->rewrite |= AT_REWRITE_ALTER_PERSISTENCE"]
T -->|"SET ACCESS METHOD"| AC["tab->rewrite |= AT_REWRITE_ACCESS_METHOD"]
U --> END["Phase 3 reads tab->rewrite"]
W --> END
X --> END
Z --> END
AA --> END
AB --> END
AC --> END
Phase 3: one scan, verify-or-rewrite
Section titled “Phase 3: one scan, verify-or-rewrite”ATRewriteTables is the third phase. It walks the same work queue and, for
each table with storage, decides among three outcomes from the tab->rewrite
bitmask and the pending-constraint lists. The branch is the clearest place
to see in-place vs. full-rewrite expressed as control flow:
// ATRewriteTables (branch, condensed) — src/backend/commands/tablecmds.cif (tab->rewrite > 0 && tab->relkind != RELKIND_SEQUENCE){ /* Build a temporary relation and copy data */ OIDNewHeap = make_new_heap(tab->relid, NewTableSpace, NewAccessMethod, persistence, lockmode); /* Copy the heap data with modifications, test new constraints */ ATRewriteTable(tab, OIDNewHeap); /* Swap the physical files, rebuild indexes, discard old heap */ finish_heap_swap(tab->relid, OIDNewHeap, false, false, true, !OidIsValid(tab->newTableSpace), RecentXmin, ReadNextMultiXactId(), persistence);}else{ /* No rewrite: scan only to verify new constraints, no new file */ if (tab->constraints != NIL || tab->verify_new_notnull || tab->partition_constraint != NULL) ATRewriteTable(tab, InvalidOid); /* SET TABLESPACE with no reconstruction is a block-by-block copy */ if (tab->newTableSpace) ATExecSetTableSpace(tab->relid, tab->newTableSpace, lockmode);}The single ATRewriteTable routine serves both roles, switched on whether
OIDNewHeap is a real OID or InvalidOid. The rewrite path builds a fresh
relfilenode with make_new_heap (the same function CLUSTER and VACUUM FULL call), copies every transformed tuple into it, and then
finish_heap_swap atomically swaps the two relations’ relfilenodes,
rebuilds the indexes against the new file, and drops the old one. Because the
new file is a separate physical relation until the swap, a crash or
ROLLBACK anywhere leaves the original file untouched — the MVCC-atomicity
property the AlterTable header promises. The verify-only path passes
InvalidOid, allocates no new file, and merely scans to throw an error if a
freshly added CHECK, NOT NULL, or partition constraint is violated by an
existing row.
Inside ATRewriteTable the scan reads each old tuple through tab->oldDesc
(the descriptor snapshotted in Phase 1) and, when rewriting, projects it into
a new slot built from the current descriptor, evaluating each queued
NewColumnValue expression:
// ATRewriteTable (per-tuple rewrite, condensed) — tablecmds.cwhile (table_scan_getnextslot(scan, ForwardScanDirection, oldslot)){ if (tab->rewrite > 0) { slot_getallattrs(oldslot); ExecClearTuple(newslot); /* copy unchanged attributes straight across */ memcpy(newslot->tts_values, oldslot->tts_values, sizeof(Datum) * oldslot->tts_nvalid); memcpy(newslot->tts_isnull, oldslot->tts_isnull, sizeof(bool) * oldslot->tts_nvalid); /* dropped columns become NULL in the new layout */ foreach(lc, dropped_attrs) newslot->tts_isnull[lfirst_int(lc)] = true; /* evaluate ALTER TYPE / new-default transform expressions */ econtext->ecxt_scantuple = oldslot; foreach(l, tab->newvals) { NewColumnValue *ex = lfirst(l); if (ex->is_generated) continue; newslot->tts_values[ex->attnum - 1] = ExecEvalExpr(ex->exprstate, econtext, &newslot->tts_isnull[ex->attnum - 1]); } ExecStoreVirtualTuple(newslot); insertslot = newslot; } else insertslot = oldslot; /* verify in place, no projection */
/* NOT NULL / CHECK / partition checks run for BOTH paths */ econtext->ecxt_scantuple = insertslot; foreach_int(attn, notnull_attrs) if (slot_attisnull(insertslot, attn)) ereport(ERROR, (errcode(ERRCODE_NOT_NULL_VIOLATION), ...)); foreach(l, tab->constraints) { /* CONSTR_CHECK -> ExecCheck */ } /* if (newrel) table_tuple_insert(newrel, insertslot, ...) */}Two details make this the linchpin of the whole design. First, the same
loop enforces new constraints whether or not a rewrite happens — the
verify-only path simply reuses oldslot as insertslot and never inserts.
Second, the loop runs once per table no matter how many subcommands
contributed newvals or constraints: a statement that changes three column
types and adds two CHECKs still scans the heap a single time. When the
relation is being rewritten, ATRewriteTable also calls
TransferPredicateLocksToHeapRelation(oldrel) up front, because moving tuples
to a new file invalidates any tuple- or page-level SSI predicate locks.
flowchart TD
A["ATRewriteTables: foreach table in wqueue"] --> B{"RELKIND_HAS_STORAGE?"}
B -->|no| A
B -->|yes| C{"tab->rewrite > 0?"}
C -->|"yes (not sequence)"| D["make_new_heap -> new relfilenode"]
D --> E["ATRewriteTable(tab, OIDNewHeap)<br/>scan oldDesc, project newDesc<br/>eval newvals, check constraints<br/>table_tuple_insert into new file"]
E --> F["finish_heap_swap<br/>swap_relation_files (relfilenode)<br/>rebuild indexes, drop old heap"]
C -->|"no, but constraints/notnull pending"| G["ATRewriteTable(tab, InvalidOid)<br/>scan only, no new file<br/>error if a row violates"]
C -->|"no, only SET TABLESPACE"| H["ATExecSetTableSpace<br/>block-by-block file copy"]
C -->|"nothing pending"| I["skip table entirely"]
F --> J["final loop: validate FK constraints<br/>execute tab->afterStmts"]
G --> J
H --> J
I --> J
Inheritance recursion and the work queue
Section titled “Inheritance recursion and the work queue”Phase 1 (ATPrepCmd) is also where a single statement fans out to child
tables. ATPrepCmd finds or creates the AlteredTableInfo via
ATGetQueueEntry, then — for recursable subtypes — drives ATSimpleRecursion,
which uses find_all_inheritors to enqueue one work-queue entry per
descendant and re-invokes the prep routine on each. This is why a DROP COLUMN on a partitioned parent silently drops the column on every partition:
each partition gets its own AlteredTableInfo, its own per-pass buckets, and
(in Phase 3) its own single heap scan. The recursion happens entirely in
Phase 1, so by the time Phase 2 runs, the queue already contains every
affected relation and ATRewriteCatalogs can iterate them “in parallel,
one pass at a time.”
Source Walkthrough
Section titled “Source Walkthrough”This section walks the symbols in call order, grouped by phase, so a reader
can follow ALTER TABLE from the utility hook to the final FK validation.
All symbols are anchored by name; line numbers appear only in the
position-hint table at the end.
Entry and lock-level policy
Section titled “Entry and lock-level policy”AlterTable— the top-level driver. Re-opens the relationNoLock(the caller already holds the lock), runsCheckAlterTableIsSafe, then callsATController. Its header comment is the canonical statement of the three-phase plan and the MVCC-rollback guarantee.AlterTableGetLockLevel— computes the lock before the table is opened, by inspecting only the parsed subcommand list. Returns the maximum of the per-subtype lock levels, floored atShareUpdateExclusiveLock. The giantswitchis the authoritative map of “whichALTERneeds which lock.” Hot Standby forcesAccessExclusiveLockfor any change a standbySELECTcould observe.ATController— sequences the three phases: aforeachover subcommands callingATPrepCmd(Phase 1), thenrelation_close(rel, NoLock), thenATRewriteCatalogs(Phase 2), thenATRewriteTables(Phase 3).
Phase 1 — prepare and recurse
Section titled “Phase 1 — prepare and recurse”ATPrepCmd— per-subcommand preparation. Finds/creates theAlteredTableInfo, validates the subtype against the relkind, decides the target pass, and recurses to children. Deliberately conservative: it does not read table details a later subcommand could change.ATGetQueueEntry— find-or-create accessor for the per-relation work queue entry. SnapshotsoldDesc(a constrained copy of the current tuple descriptor) at creation time — the descriptor Phase 3 reads old tuples through.ATSimpleRecursion/find_all_inheritors— fan the subcommand out to inheritance children / partitions, one queue entry each.AlteredTableInfo/AlterTablePass/AT_NUM_PASSES— the work-queue struct, the fixed pass enum, and the bucket-array size.subcmdsisList *[AT_NUM_PASSES].
Phase 2 — catalog mutation, pass by pass
Section titled “Phase 2 — catalog mutation, pass by pass”ATRewriteCatalogs— the outerfor (pass = 0; pass < AT_NUM_PASSES; pass++)loop wrapping aforeachover tables. Processing all tables “in parallel, one pass at a time” is what lets one table’s subcommand enqueue work into a later pass of another table (the FK re-add example). After theALTER TYPE/SET EXPRESSIONpasses it callsATPostAlterTypeCleanup; at the end it adds TOAST tables where newly needed.ATExecCmd— the dispatchswitchfromcmd->subtypeto the concrete executor (ATExecAddColumn,ATExecDropColumn,ATExecAlterColumnType,ATExecColumnDefault, …). This is where catalogs are actually mutated andtab->rewritebits are set.ATExecAddColumn— inserts thepg_attributerow; takes the fast-default path viaStoreAttrMissingValwhen the default is a non-volatile constant on a plain relation, else setsAT_REWRITE_DEFAULT_VAL.ATExecDropColumn→RemoveAttributeById(heap.c) — pure catalog logical delete:attisdropped = true, type cleared, name rewritten to the........pg.dropped.N........placeholder. Never setstab->rewrite.ATPrepAlterColumnType/ATColumnChangeRequiresRewrite— plan the transform expression, queue aNewColumnValue, and setAT_REWRITE_COLUMN_REWRITEonly when the cast is more than a relabel (binary-coercible casts andvarcharlength widening skip the rewrite).StoreAttrMissingVal(heap.c) — writes the evaluated default intopg_attribute.attmissingval/ setsatthasmissing, the fast-default storage.
Phase 3 — scan, rewrite, swap, validate
Section titled “Phase 3 — scan, rewrite, swap, validate”ATRewriteTables— the per-table branch ontab->rewriteand the pending-constraint lists. FiresEventTriggerTableRewritebefore a rewrite; rejects rewrites of system / catalog-used / other-backend-temp relations.make_new_heap(cluster.c) — builds the transient destination relation with the chosen tablespace, access method, and persistence. Shared withCLUSTER/VACUUM FULL.ATRewriteTable— the single heap scan. Reads throughtab->oldDesc, projects into the new descriptor, evaluatestab->newvals(ExecEvalExpr), nulls dropped attributes, enforcesnotnull_attrsandtab->constraints, and (whennewrel)table_tuple_inserts. CallsTransferPredicateLocksToHeapRelationwhen rewriting.finish_heap_swap→swap_relation_files(cluster.c) — atomically swaps the old and newrelfilenodes (and persistence), rebuilds indexes against the new file, and discards the old heap. UsesRecentXminas the newrelfrozenxidbecause every tuple was just rewritten.ATExecSetTableSpace— theSET TABLESPACE-without-reconstruction shortcut: a raw block-by-block file copy rather than a tuple rewrite.AT_REWRITE_ALTER_PERSISTENCE/AT_REWRITE_DEFAULT_VAL/AT_REWRITE_COLUMN_REWRITE/AT_REWRITE_ACCESS_METHOD— the four rewrite-reason bits (inevent_trigger.h), OR-ed intotab->rewrite.
After the per-table loop, ATRewriteTables runs a final pass that validates
new foreign-key constraints (deferred so both sides of an FK are fully
rewritten first) and executes each table’s tab->afterStmts.
Position hints (as of 2026-06-05, REL_18 273fe94)
Section titled “Position hints (as of 2026-06-05, REL_18 273fe94)”| Symbol | File | Line |
|---|---|---|
AlterTablePass (enum) | src/backend/commands/tablecmds.c | 148 |
AT_NUM_PASSES | src/backend/commands/tablecmds.c | 166 |
AlteredTableInfo (struct) | src/backend/commands/tablecmds.c | 178 |
AlterTable | src/backend/commands/tablecmds.c | 4534 |
AlterTableGetLockLevel | src/backend/commands/tablecmds.c | 4608 |
ATController | src/backend/commands/tablecmds.c | 4870 |
ATPrepCmd | src/backend/commands/tablecmds.c | 4905 |
ATRewriteCatalogs | src/backend/commands/tablecmds.c | 5302 |
ATExecCmd | src/backend/commands/tablecmds.c | 5376 |
ATRewriteTables | src/backend/commands/tablecmds.c | 5838 |
ATRewriteTable | src/backend/commands/tablecmds.c | 6126 |
ATGetQueueEntry | src/backend/commands/tablecmds.c | 6562 |
ATSimpleRecursion | src/backend/commands/tablecmds.c | 6816 |
ATExecAddColumn | src/backend/commands/tablecmds.c | 7217 |
ATExecDropColumn | src/backend/commands/tablecmds.c | 9284 |
ATPrepAlterColumnType | src/backend/commands/tablecmds.c | 14384 |
ATColumnChangeRequiresRewrite | src/backend/commands/tablecmds.c | 14690 |
RemoveAttributeById | src/backend/catalog/heap.c | 1700 |
StoreAttrMissingVal | src/backend/catalog/heap.c | 2047 |
make_new_heap | src/backend/commands/cluster.c | 705 |
swap_relation_files | src/backend/commands/cluster.c | 1063 |
finish_heap_swap | src/backend/commands/cluster.c | 1445 |
AT_REWRITE_* flags | src/include/commands/event_trigger.h | 40-43 |
Source verification (as of 2026-06-05)
Section titled “Source verification (as of 2026-06-05)”Verified facts
Section titled “Verified facts”-
The lock level is computed before the table is opened, from the parsed subcommand list only. Verified in
AlterTableGetLockLevel: the function takesList *cmdsand never touches the relation; the floor isShareUpdateExclusiveLockand each subtype’scmd_lockmodeis taken as a running maximum. The caller in utility.c acquires that lock, thenAlterTableopensNoLock. -
ALTER TABLEis three phases driven byATController. Verified: Phase 1 is theforeach/ATPrepCmdloop; the relation is thenrelation_closedNoLock; Phase 2 isATRewriteCatalogs; Phase 3 isATRewriteTables. The comment blocks inATControllerlabel all three. -
Phase 2 runs all tables “in parallel, one pass at a time.” Verified in
ATRewriteCatalogs: the outer loop isfor (AlterTablePass pass = 0; pass < AT_NUM_PASSES; pass++)and the innerforeachwalks the work queue. The header comment states work can only be propagated into later passes. -
There are 13 passes,
AT_PASS_DROPfirst throughAT_PASS_MISClast. Verified by reading theAlterTablePassenum:AT_PASS_UNSET = -1thenAT_PASS_DROP(0) …AT_PASS_MISC, withAT_NUM_PASSES = AT_PASS_MISC + 1.ALTER TYPE(pass 1) precedes theOLD_INDEX/OLD_CONSTRre-add passes. -
DROP COLUMNis a catalog-only logical delete. Verified inRemoveAttributeById(heap.c): it setsattisdropped = true, clearsatttypid,attnotnull,attgenerated, renames to........pg.dropped.%d........, and updates thepg_attributetuple. No heap page is touched, andATExecDropColumnnever ORs a bit intotab->rewrite. -
ADD COLUMNwith a non-volatile constant default is metadata-only via the fast default. Verified:ATExecAddColumnstores the evaluated datum withStoreAttrMissingVal(writingattmissingval/atthasmissing) when the default is non-volatile, non-generated, and the relation is a plainRELKIND_RELATION; otherwise it setsAT_REWRITE_DEFAULT_VAL. A nullableADD COLUMNwith no default needs neither. -
ALTER COLUMN TYPEskips the rewrite for binary-coercible casts. Verified inATColumnChangeRequiresRewrite: the loop returnsfalse(no rewrite) when the transform bottoms out at the originalVar, stripsRelabelType, and treats constraint-freeCoerceToDomainas transparent; it returnstruefor anyFuncExpr/real computation. The decision setsAT_REWRITE_COLUMN_REWRITEand queues aNewColumnValue. -
A rewrite builds a new relfilenode and swaps, never mutating in place. Verified in
ATRewriteTables:make_new_heap→ATRewriteTable(tab, OIDNewHeap)→finish_heap_swap(...).finish_heap_swapcallsswap_relation_filesand usesRecentXminas the newrelfrozenxid. The verify-only branch passesInvalidOidand allocates no file. -
Phase 3 scans each table at most once. Verified in
ATRewriteTable: a singletable_beginscan/table_scan_getnextslotloop both projects new tuples (whentab->rewrite) and checks allnotnull_attrs,tab->constraints, and the partition constraint; multiple subcommands share the one scan. -
The rewrite reason is a four-bit mask in
event_trigger.h. Verified:AT_REWRITE_ALTER_PERSISTENCE 0x01,AT_REWRITE_DEFAULT_VAL 0x02,AT_REWRITE_COLUMN_REWRITE 0x04,AT_REWRITE_ACCESS_METHOD 0x08. The mask is consumed inATRewriteTablesand broadcast byEventTriggerTableRewrite.
Open questions
Section titled “Open questions”-
The precise set of subcommand subtypes that land on
ShareRowExclusiveLockvs.AccessExclusiveLock. The doc names the three lock tiers and the floor, but the full per-subtype table is large and evolves; the authoritative list is theAlterTableGetLockLevelswitch as of this revision. Investigation path: enumerate everycaseand itscmd_lockmode, cross-checked against the user-facing docs’ “ALTER TABLE lock levels” note. -
How
ALTER TYPEreconstructs dependent objects (indexes, views, constraints) across the pass boundary. This doc treatsATPostAlterTypeCleanupand theOLD_INDEX/OLD_CONSTRre-add passes as a black box; the dependency-tracking that decides which objects must be dropped-and-recreated is owned bypostgres-ddl-execution.md. -
The exact interaction between a fast-default column and a later table-rewriting
ALTER. Onceattmissingvalis set, a subsequent rewrite must materialize the missing value into every tuple and clear the flag; the precise point whereatthasmissingis reset during a rewrite was not traced here.
Beyond PostgreSQL — Comparative Designs & Research Frontiers
Section titled “Beyond PostgreSQL — Comparative Designs & Research Frontiers”-
MySQL/InnoDB
ALGORITHM=INSTANT|INPLACE|COPY. InnoDB exposes the rewrite-vs-metadata choice as explicit DDL syntax.INSTANT ADD COLUMN(MySQL 8.0) is the direct analog of PostgreSQL’s fast default — the column metadata is recorded and old rows read a stored default — whileCOPYrebuilds the table under a metadata lock. A side-by-side of InnoDB’s instant column metadata againstattmissingvalwould sharpen what each engine can and cannot do without a rewrite (e.g., column position,DROP COLUMNinstant support). -
Online schema change tools (gh-ost, pt-online-schema-change). Because many engines historically held an exclusive lock for the whole rewrite, external tools build a shadow table, backfill it, capture concurrent writes via triggers or the binlog, then cut over with a brief lock. PostgreSQL’s in-process
make_new_heap+finish_heap_swapdoes the shadow-table-and-swap inside one transaction but still underAccessExclusiveLockfor the duration — so the same online-DDL pressure exists. Mapping gh-ost’s triggered cut-over against PostgreSQL’s single-lock rewrite frames why PostgreSQL users still reach forCREATE INDEX CONCURRENTLY-style patterns and logical-replication-based migrations. -
Google F1 / Spanner asynchronous schema change. Rae et al., Online, Asynchronous Schema Change in F1 (VLDB 2013), formalize schema evolution as a sequence of intermediate states (delete-only, write-only) so that nodes on different schema versions never corrupt each other’s data — a fundamentally distributed answer that PostgreSQL’s single-node, single-lock model sidesteps entirely. The contrast is instructive: PostgreSQL gets atomicity from MVCC and one lock; F1 gets availability from staged, version-tolerant states.
-
Aurora / Neon and storage-disaggregated rewrites. When storage is a separate service, a full heap rewrite is bandwidth across a network rather than local I/O, which changes the cost calculus of “just rewrite it.” How a disaggregated engine would implement
finish_heap_swap’s atomic relfilenode swap (a metadata flip vs. a data move) is an open comparative question. -
CUBRID
ALTER TABLE. CUBRID likewise distinguishes metadata-only alters from those forcing a reload, but its catalog and locking model differ; a comparison against the CUBRID schema-change path in the cubrid tree would highlight how much of PostgreSQL’s cheap-path machinery (fast default,attisdropped, binary-coercion shortcut) has a CUBRID counterpart.
Sources
Section titled “Sources”In-tree source files (REL_18_STABLE, commit 273fe94)
Section titled “In-tree source files (REL_18_STABLE, commit 273fe94)”src/backend/commands/tablecmds.c— the wholeALTER TABLEmachine:AlterTable,AlterTableGetLockLevel,ATController,ATPrepCmd,ATRewriteCatalogs,ATExecCmd,ATRewriteTables,ATRewriteTable,ATGetQueueEntry,ATSimpleRecursion,ATExecAddColumn,ATExecDropColumn,ATPrepAlterColumnType,ATColumnChangeRequiresRewrite, and theAlterTablePassenum /AlteredTableInfostruct.src/backend/commands/cluster.c—make_new_heap,swap_relation_files,finish_heap_swap, shared withCLUSTER/VACUUM FULL.src/backend/catalog/heap.c—RemoveAttributeById(theattisdroppedlogical delete) andStoreAttrMissingVal(the fast-default storage).src/include/commands/event_trigger.h— theAT_REWRITE_*reason-bit macros consumed byATRewriteTablesandEventTriggerTableRewrite.src/include/nodes/parsenodes.h—AlterTableStmt,AlterTableCmd, and theAlterTableType(AT_*subtype) enum the dispatch switches on.
Papers and textbook chapters
Section titled “Papers and textbook chapters”- Database System Concepts (Silberschatz, Korth, Sudarshan, 7e), ch. 4
“Intermediate SQL” / DDL, and the relation/schema foundation in ch. 2 — the
metadata-vs-data distinction that splits the
ALTER TABLEdesign space (knowledge/research/dbms-general/). - Database Internals (Petrov 2019) — the relfilenode / file-swap and WAL
framing behind a rewrite (
knowledge/research/dbms-general/). - Rae, I. et al. (2013). “Online, Asynchronous Schema Change in F1.” PVLDB 6(11):1045-1056. The distributed staged-state alternative to single-lock DDL.
Sibling docs (cross-references — mechanism owned there, not duplicated here)
Section titled “Sibling docs (cross-references — mechanism owned there, not duplicated here)”postgres-ddl-execution.md— the utility-command dispatch, dependency tracking, and object reconstruction thatALTER TYPE’sOLD_INDEX/OLD_CONSTRpasses rely on.postgres-constraints.md—CHECK/NOT NULL/foreign-key constraint representation and the FK validation thatALTER TABLE’s Phase-3 final loop invokes.postgres-index-creation.md— index build / rebuild, whichALTER TYPE’s index re-add passes andfinish_heap_swap’s index rebuild call into.postgres-heap-am.md—table_tuple_insert/ heap storage thatATRewriteTablewrites the rewritten tuples through.postgres-toast.md— TOAST table creation (AlterTableCreateToastTable) triggered at the tail ofATRewriteCatalogs.postgres-lock-manager.md— the heavyweight lock table behind theAlterTableGetLockLevellock levels.