Skip to content

PostgreSQL ALTER TABLE — The Multi-Pass Machinery and Table Rewrites

Contents:

ALTER TABLE is the surface syntax for schema evolution: changing the logical shape of a relation (its columns, types, constraints, storage properties) after rows already exist. Every relational system must answer one foundational question — when the declared schema changes, what happens to the bytes already on disk? — and the answer divides the entire design space.

Database System Concepts (Silberschatz, Korth, Sudarshan) introduces the relation as a set of tuples conforming to a schema R(A1, A2, …, An). The schema is metadata; the tuples are data. A schema change can therefore be realized in two fundamentally different ways:

  1. Metadata-only (logical) change. The catalog description is updated and existing tuples are reinterpreted under the new description. No stored byte moves. This is cheap (O(1) in the number of rows) but is only possible when the on-disk representation is already compatible with the new schema — for example, dropping a column (the bytes stay but are no longer projected) or widening varchar(10) to varchar(20) (same physical layout, looser constraint).

  2. Physical (data) change — a rewrite. Every stored tuple must be read, transformed, and written back in the new layout. This is O(rows) work, produces WAL proportional to the table size, and — in the simplest implementation — holds an exclusive lock for the whole duration. It is unavoidable when the on-disk bytes themselves must change: converting int to text, changing a row’s physical width, or compacting after a storage-parameter change.

The art of an ALTER TABLE implementation is maximizing case 1 and minimizing case 2. A naive engine rewrites the table on every change; a mature one recognizes the large class of changes that are metadata-only and reserves the rewrite for the irreducible cases. PostgreSQL has spent two decades pushing operations out of case 2 into case 1: the attisdropped logical-delete for DROP COLUMN (since the very beginning), the binary-coercion shortcut for ALTER TYPE, and — the marquee feature added in PostgreSQL 11 — the fast default, which lets ADD COLUMN ... DEFAULT <const> be a pure catalog change by storing the default value in pg_attribute.attmissingval and synthesizing it at read time.

A second theoretical axis is atomicity of DDL. In many engines DDL is auto-committing or only partially transactional. PostgreSQL treats DDL as ordinary transactional work: the catalog rows are MVCC tuples, the heap rewrite is a normal heap insert into a new relfilenode, and an error anywhere — or an explicit ROLLBACK — undoes the whole ALTER TABLE with no special cleanup. The header comment on AlterTable states it directly: “Thanks to the magic of MVCC, an error anywhere along the way rolls back the whole operation; we don’t have to do anything special to clean up.”

The third axis is concurrency. A schema change must coordinate with concurrent readers and writers. The classic mechanism is a table-level lock whose strength is chosen to be exactly as strong as the change’s visibility requires: a change invisible to running SELECTs can take a weak lock that allows reads to continue, whereas a change that rewrites the heap or alters query plans needs AccessExclusiveLock. PostgreSQL encodes this entire policy in one function, AlterTableGetLockLevel, and — crucially — computes it before acquiring any lock, so the lock taken is correct on the first try.

The design space an ALTER TABLE implementer chooses within:

  1. One pass or many. A user can submit several subcommands in a single statement (ALTER TABLE t DROP a, ADD b int, ALTER c TYPE bigint). Some subcommands must be ordered relative to others (drop before add, type-change before index re-add). The implementation can either reject combinations or impose a fixed ordering. PostgreSQL imposes a fixed pass ordering.

  2. Per-statement scan or per-subcommand scan. If three subcommands each need to examine every row, does the engine scan three times or once? PostgreSQL’s explicit goal is one pass over the data per table, no matter how many subcommands need it.

  3. Rewrite in place or rewrite to a new file. PostgreSQL never mutates heap pages in place for a rewrite; it builds an entirely new relfilenode and atomically swaps the file pointers, reusing the CLUSTER/VACUUM FULL machinery (make_new_heap, finish_heap_swap).

Schema-change engines across systems converge on a recognizable set of conventions. Naming them makes PostgreSQL’s specific symbols read as one set of choices within a shared playbook.

Separate “prepare” and “execute” phases

Section titled “Separate “prepare” and “execute” phases”

A schema change cannot validate every precondition in a single pass, because earlier subcommands change the catalog state that later ones depend on. The universal pattern is to split into a preparation phase (check permissions, resolve names, decide recursion, reject illegal combinations) and an execution phase (mutate catalogs, then mutate data). PostgreSQL’s ATPrepCmd / ATExecCmd split is exactly this; the prepare phase is deliberately conservative about reading “table details that another subcommand could change.”

Inheritance and partitioning mean one statement can affect many tables. The standard structure is a work queue — one entry per affected relation, each carrying the list of operations to perform and the scratch state gathered during preparation. PostgreSQL’s queue is a List of AlteredTableInfo structs, found or created by ATGetQueueEntry.

Ordered phases so dependent operations interleave correctly

Section titled “Ordered phases so dependent operations interleave correctly”

Because subcommands have ordering constraints, engines bucket them into fixed phases and execute phase-by-phase across all tables in parallel before advancing. This lets one table’s operation enqueue work for another table’s later phase (PostgreSQL’s example: ALTER TYPE on a foreign key’s primary-key column dispatching the re-add of the FK constraint to the referencing table). PostgreSQL has 13 passes, AT_PASS_DROP first through AT_PASS_MISC last.

Lock strength derived from change visibility

Section titled “Lock strength derived from change visibility”

The lock a DDL takes should be the weakest that still serializes correctly against concurrent activity that could observe the change. Systems classify each operation: invisible-to-reads → weak lock; plan-invalidating or heap-rewriting → exclusive lock. PostgreSQL’s AlterTableGetLockLevel is a giant switch returning one of ShareUpdateExclusiveLock, ShareRowExclusiveLock, or AccessExclusiveLock per subtype, taking the maximum across subcommands.

Metadata-only changes via catalog flags and value synthesis

Section titled “Metadata-only changes via catalog flags and value synthesis”

The cheap path for schema change is a catalog flag plus read-time synthesis. Dropping a column flips a “deleted” flag and leaves the bytes; adding a column with a default stores the default in the catalog and materializes it only for rows that predate the column. PostgreSQL uses pg_attribute.attisdropped for the former and attmissingval (the fast default, PG 11+) for the latter.

Rewrite by build-new-file-and-swap, not in-place mutation

Section titled “Rewrite by build-new-file-and-swap, not in-place mutation”

When a rewrite is truly required, the robust pattern is to build a fresh physical file, copy transformed tuples into it, and atomically swap the file identity — so a crash or rollback leaves the original intact. This is the same mechanism CLUSTER and VACUUM FULL use. PostgreSQL shares the code: make_new_heap builds the transient relation, ATRewriteTable copies, finish_heap_swap swaps relfilenodes and rebuilds indexes.

ConceptPostgreSQL name
Top-level driverAlterTableATController
Lock-level policy (pre-lock)AlterTableGetLockLevel
Phase 1 — prepareATPrepCmd
Phase 2 — catalog updateATRewriteCatalogsATExecCmd
Phase 3 — scan/rewriteATRewriteTablesATRewriteTable
Work-queue entry (per table)AlteredTableInfo
Work-queue lookup/createATGetQueueEntry
Fixed pass orderingAlterTablePass enum (AT_PASS_DROPAT_PASS_MISC)
Rewrite reason bitmasktab->rewrite (AT_REWRITE_* flags)
New-column value to compute in Phase 3NewColumnValue
New constraint to verify in Phase 3NewConstraint
Rewrite-vs-metadata decision (type)ATColumnChangeRequiresRewrite
Fast default (metadata-only ADD)StoreAttrMissingVal / attmissingval
Logical column deleteRemoveAttributeById (attisdropped)
Build transient heapmake_new_heap (cluster.c)
Atomic file swapfinish_heap_swap / swap_relation_files
Inheritance recursionATSimpleRecursion, find_all_inheritors

ALTER TABLE enters through AlterTable, but the lock is acquired by the caller (utility.c) before AlterTable runs, using a lock level computed by AlterTableGetLockLevel. The key invariant — stated in the function’s own comment — is that the lock level must be decidable without looking at the table, because the table is not yet locked:

// AlterTableGetLockLevel — src/backend/commands/tablecmds.c
LOCKMODE
AlterTableGetLockLevel(List *cmds)
{
ListCell *lcmd;
LOCKMODE lockmode = ShareUpdateExclusiveLock;
foreach(lcmd, cmds)
{
AlterTableCmd *cmd = (AlterTableCmd *) lfirst(lcmd);
LOCKMODE cmd_lockmode = AccessExclusiveLock; /* default */
switch (cmd->subtype)
{
case AT_AddColumn: /* may rewrite heap */
case AT_SetAccessMethod: /* must rewrite heap */
case AT_SetTableSpace: /* must rewrite heap */
case AT_AlterColumnType: /* must rewrite heap */
cmd_lockmode = AccessExclusiveLock;
break;
/* ... dozens of cases ... */
case AT_SetStatistics:
case AT_ClusterOn:
case AT_DropCluster:
cmd_lockmode = ShareUpdateExclusiveLock;
break;
/* ... */
}
/* Take the greatest lockmode from any subcommand */
if (cmd_lockmode > lockmode)
lockmode = cmd_lockmode;
}
return lockmode;
}

The default floor is ShareUpdateExclusiveLock (the weakest level any ALTER TABLE uses — it still excludes VACUUM, ANALYZE, and other DDL but allows reads and writes). Most subcommands escalate to AccessExclusiveLock; trigger enable/disable and FK addition land on the intermediate ShareRowExclusiveLock. The maximum across all subcommands wins, and that single lock is held until commit. A comment in the function adds the Hot Standby caveat: because a standby only learns about AccessExclusiveLocks on the primary, any change visible to standby SELECTs must use AccessExclusiveLock even if a weaker lock would otherwise suffice.

After the caller takes the lock, AlterTable opens the relation NoLock (the lock is already held), runs a safety check, and delegates to ATController, which is the clearest statement of the three-phase design:

// ATController — src/backend/commands/tablecmds.c
static void
ATController(AlterTableStmt *parsetree,
Relation rel, List *cmds, bool recurse, LOCKMODE lockmode,
AlterTableUtilityContext *context)
{
List *wqueue = NIL;
ListCell *lcmd;
/* Phase 1: preliminary examination of commands, create work queue */
foreach(lcmd, cmds)
{
AlterTableCmd *cmd = (AlterTableCmd *) lfirst(lcmd);
ATPrepCmd(&wqueue, rel, cmd, recurse, false, lockmode, context);
}
/* Close the relation, but keep lock until commit */
relation_close(rel, NoLock);
/* Phase 2: update system catalogs */
ATRewriteCatalogs(&wqueue, lockmode, context);
/* Phase 3: scan/rewrite tables as needed, and run afterStmts */
ATRewriteTables(parsetree, &wqueue, lockmode, context);
}

The design intent (from the AlterTable header) is one pass over the data: Phase 3 is skipped entirely unless some subcommand requires it, and when it does run, it scans each table exactly once regardless of how many subcommands contributed work.

flowchart TD
    A["utility.c: AlterTableGetLockLevel(cmds)<br/>computes lock WITHOUT opening table"] --> B["acquire table lock<br/>(held until commit)"]
    B --> C["AlterTable -> CheckAlterTableIsSafe -> ATController"]
    C --> D["Phase 1: ATPrepCmd per subcommand<br/>permission/relkind checks, recurse to children<br/>build work queue of AlteredTableInfo<br/>bucket subcmds into passes"]
    D --> E["relation_close(rel, NoLock)<br/>lock stays"]
    E --> F["Phase 2: ATRewriteCatalogs<br/>for pass = DROP .. MISC:<br/>for each table: ATExecCmd per subcmd<br/>mutate catalogs, maybe set tab->rewrite"]
    F --> G{"any table:<br/>tab->rewrite > 0<br/>or constraints/notnull<br/>to verify?"}
    G -->|no| H["Phase 3 mostly no-op<br/>run afterStmts"]
    G -->|"rewrite > 0"| I["make_new_heap -> ATRewriteTable<br/>copy+transform every tuple<br/>finish_heap_swap (swap relfilenodes)"]
    G -->|"verify only"| J["ATRewriteTable(tab, InvalidOid)<br/>scan to check constraints, no rewrite"]
    I --> K["validate FK constraints (final loop)<br/>run afterStmts"]
    J --> K
    H --> K

The work queue and the AlteredTableInfo struct

Section titled “The work queue and the AlteredTableInfo struct”

Phase 1 builds a List of AlteredTableInfo, one entry per affected relation. The struct carries everything Phases 2 and 3 need — the per-pass subcommand lists, the pre-modification tuple descriptor (captured the moment the entry is created, before anything changes it), the columns to recompute (newvals), the constraints to verify (constraints), and the rewrite reason:

// AlteredTableInfo (condensed) — src/backend/commands/tablecmds.c
typedef struct AlteredTableInfo
{
Oid relid; /* Relation to work on */
char relkind; /* Its relkind */
TupleDesc oldDesc; /* Pre-modification tuple descriptor */
Relation rel; /* transiently set during Phase 2 */
List *subcmds[AT_NUM_PASSES]; /* Lists of AlterTableCmd */
List *constraints; /* List of NewConstraint */
List *newvals; /* List of NewColumnValue */
List *afterStmts; /* utility command parsetrees */
bool verify_new_notnull; /* T if we should recheck NOT NULL */
int rewrite; /* Reason for forced rewrite, if any */
bool chgAccessMethod;
Oid newAccessMethod;
Oid newTableSpace;
bool chgPersistence;
char newrelpersistence;
Expr *partition_constraint;
/* ... changedConstraintOids/Defs, changedIndexOids/Defs, etc. ... */
} AlteredTableInfo;

ATGetQueueEntry is the find-or-create accessor. Note how it snapshots the tuple descriptor on creation — oldDesc is the old shape, which Phase 3 needs to read existing tuples even after Phase 2 has rewritten pg_attribute:

// ATGetQueueEntry — src/backend/commands/tablecmds.c
static AlteredTableInfo *
ATGetQueueEntry(List **wqueue, Relation rel)
{
Oid relid = RelationGetRelid(rel);
AlteredTableInfo *tab;
ListCell *ltab;
foreach(ltab, *wqueue)
{
tab = (AlteredTableInfo *) lfirst(ltab);
if (tab->relid == relid)
return tab; /* already queued */
}
/* Not there, so add it (snapshot the OLD descriptor now) */
tab = (AlteredTableInfo *) palloc0(sizeof(AlteredTableInfo));
tab->relid = relid;
tab->relkind = rel->rd_rel->relkind;
tab->oldDesc = CreateTupleDescCopyConstr(RelationGetDescr(rel));
tab->newAccessMethod = InvalidOid;
tab->newTableSpace = InvalidOid;
tab->newrelpersistence = RELPERSISTENCE_PERMANENT;
*wqueue = lappend(*wqueue, tab);
return tab;
}

The reason Phase 2 cannot just execute subcommands in user order is that operations have dependencies: a DROP COLUMN must happen before an ADD COLUMN of the same name; an existing index must be re-added only after the ALTER TYPE that forced its rebuild; an FK index-based constraint must precede the FK itself. PostgreSQL resolves this with a fixed enum of passes; Phase 1 files each subcommand into the right subcmds[pass] bucket:

// AlterTablePass — src/backend/commands/tablecmds.c
typedef enum AlterTablePass
{
AT_PASS_UNSET = -1, /* UNSET will cause ERROR */
AT_PASS_DROP, /* DROP (all flavors) */
AT_PASS_ALTER_TYPE, /* ALTER COLUMN TYPE */
AT_PASS_ADD_COL, /* ADD COLUMN */
AT_PASS_SET_EXPRESSION, /* ALTER SET EXPRESSION */
AT_PASS_OLD_INDEX, /* re-add existing indexes */
AT_PASS_OLD_CONSTR, /* re-add existing constraints */
AT_PASS_ADD_CONSTR, /* ADD constraints (initial examination) */
AT_PASS_COL_ATTRS, /* set column attributes, eg NOT NULL */
AT_PASS_ADD_INDEXCONSTR, /* ADD index-based constraints */
AT_PASS_ADD_INDEX, /* ADD indexes */
AT_PASS_ADD_OTHERCONSTR, /* ADD other constraints, defaults */
AT_PASS_MISC, /* other stuff */
} AlterTablePass;
#define AT_NUM_PASSES (AT_PASS_MISC + 1)

DROP is pass 0, ALTER TYPE is pass 1 (it precedes the OLD_INDEX and OLD_CONSTR passes that re-add the objects it dropped), and the ADD family fans out across passes 7–11 so index-based constraints, plain indexes, and other constraints land in dependency order. Anything unclassified falls to AT_PASS_MISC.

In-place vs. full rewrite: the three cheap paths

Section titled “In-place vs. full rewrite: the three cheap paths”

The heart of ALTER TABLE performance is the tab->rewrite bitmask. It starts at 0; each subcommand that genuinely needs a heap rewrite ORs in a reason flag:

// AT_REWRITE_* flags — src/include/commands/event_trigger.h
#define AT_REWRITE_ALTER_PERSISTENCE 0x01
#define AT_REWRITE_DEFAULT_VAL 0x02
#define AT_REWRITE_COLUMN_REWRITE 0x04
#define AT_REWRITE_ACCESS_METHOD 0x08

If, after Phase 2, tab->rewrite is still 0 and there are no new constraints to verify, Phase 3 does essentially nothing for that table. Three subcommand families illustrate the cheap-vs-expensive decision:

DROP COLUMN — always metadata-only. Dropping a column never touches a single heap page. ATExecDropColumn ultimately drives RemoveAttributeById (in heap.c), which flips attisdropped, blanks the type, and renames the column to a guaranteed-unique placeholder. The bytes stay in every tuple; the executor simply stops projecting that attribute:

// RemoveAttributeById (condensed) — src/backend/catalog/heap.c
void
RemoveAttributeById(Oid relid, AttrNumber attnum)
{
/* ... open rel AccessExclusiveLock, fetch pg_attribute tuple ... */
attStruct->attisdropped = true;
attStruct->atttypid = InvalidOid; /* type link no longer reliable */
attStruct->attnotnull = false;
attStruct->attgenerated = '\0';
/* rename so the slot can't collide with a future ADD COLUMN */
snprintf(newattname, sizeof(newattname),
"........pg.dropped.%d........", attnum);
namestrcpy(&(attStruct->attname), newattname);
/* ... clear attmissingval, attstattarget, attacl, attoptions ... */
CatalogTupleUpdate(attr_rel, &tuple->t_self, tuple);
}

ADD COLUMN — metadata-only when the default is a constant (fast default). ATExecAddColumn inserts the pg_attribute row, then tries to avoid a rewrite. If the column has a non-volatile, non-generated, non-domain-constrained default, it evaluates the default once and stores the resulting datum in pg_attribute.attmissingval via StoreAttrMissingVal. Every pre-existing row then reads back that value without the byte ever being written to the heap. Only if that fast path is unavailable does it set AT_REWRITE_DEFAULT_VAL:

// ATExecAddColumn (rewrite decision, condensed) — tablecmds.c
if (rel->rd_rel->relkind == RELKIND_RELATION &&
!colDef->generated &&
!has_domain_constraints &&
!contain_volatile_functions((Node *) defval))
{
/* Evaluate the default once and store it outside the heap */
missingval = ExecEvalExpr(exprState, ..., &missingIsNull);
if (!missingIsNull)
{
StoreAttrMissingVal(rel, attribute->attnum, missingval);
has_missing = true;
}
}
else
{
/* Failed to use missing mode -> must rewrite to install the value */
if (colDef->generated != ATTRIBUTE_GENERATED_VIRTUAL)
tab->rewrite |= AT_REWRITE_DEFAULT_VAL;
}

ADD COLUMN with no default at all is even cheaper: the comment notes that heap access routines return NULL for any attnum beyond the stored tuple’s attribute count, so a nullable column needs neither a rewrite nor a missing value.

ALTER COLUMN TYPE — rewrite only if the cast isn’t a no-op. ATPrepAlterColumnType plans the transformation expression and asks ATColumnChangeRequiresRewrite whether it can be elided. If the expression is just the column itself (varchar(10)varchar(20), or any binary-coercible cast), no rewrite is needed; the catalog type changes and the bytes stay. The function walks the expression tree, stripping relabel/coerce nodes, and returns false only if it bottoms out at the original Var:

// ATColumnChangeRequiresRewrite (condensed) — tablecmds.c
static bool
ATColumnChangeRequiresRewrite(Node *expr, AttrNumber varattno)
{
for (;;)
{
if (IsA(expr, Var) && ((Var *) expr)->varattno == varattno)
return false; /* identity -> no rewrite */
else if (IsA(expr, RelabelType))
expr = (Node *) ((RelabelType *) expr)->arg;
else if (IsA(expr, CoerceToDomain))
{
CoerceToDomain *d = (CoerceToDomain *) expr;
if (DomainHasConstraints(d->resulttype))
return true; /* must scan to enforce domain */
expr = (Node *) d->arg;
}
else if (IsA(expr, FuncExpr))
{
/* a few timestamp casts are no-ops on some configs */
/* ... F_TIMESTAMPTZ_TIMESTAMP etc. ... */
return true;
}
else
return true; /* any real computation -> rewrite */
}
}

When a rewrite is required, the plumbing in ATPrepAlterColumnType queues a NewColumnValue (the transform expression) and sets the bit:

// ATPrepAlterColumnType (tail, condensed) — tablecmds.c
newval = (NewColumnValue *) palloc0(sizeof(NewColumnValue));
newval->attnum = attnum;
newval->expr = (Expr *) transform;
tab->newvals = lappend(tab->newvals, newval);
if (ATColumnChangeRequiresRewrite(transform, attnum))
tab->rewrite |= AT_REWRITE_COLUMN_REWRITE;
flowchart TD
    S["subcommand in Phase 2 (ATExecCmd)"] --> T{"which subtype?"}
    T -->|DROP COLUMN| U["RemoveAttributeById<br/>attisdropped = true<br/>NEVER sets tab->rewrite"]
    T -->|"ADD COLUMN DEFAULT c"| V{"const, non-volatile,<br/>no domain constraint,<br/>plain relation?"}
    V -->|yes| W["StoreAttrMissingVal<br/>attmissingval = eval(default)<br/>metadata-only"]
    V -->|no| X["tab->rewrite |= AT_REWRITE_DEFAULT_VAL"]
    T -->|"ALTER TYPE"| Y{"ATColumnChangeRequiresRewrite?"}
    Y -->|no, binary-coercible| Z["catalog type change only<br/>metadata-only"]
    Y -->|yes| AA["tab->rewrite |= AT_REWRITE_COLUMN_REWRITE<br/>queue NewColumnValue"]
    T -->|"SET LOGGED / UNLOGGED"| AB["tab->rewrite |= AT_REWRITE_ALTER_PERSISTENCE"]
    T -->|"SET ACCESS METHOD"| AC["tab->rewrite |= AT_REWRITE_ACCESS_METHOD"]
    U --> END["Phase 3 reads tab->rewrite"]
    W --> END
    X --> END
    Z --> END
    AA --> END
    AB --> END
    AC --> END

ATRewriteTables is the third phase. It walks the same work queue and, for each table with storage, decides among three outcomes from the tab->rewrite bitmask and the pending-constraint lists. The branch is the clearest place to see in-place vs. full-rewrite expressed as control flow:

// ATRewriteTables (branch, condensed) — src/backend/commands/tablecmds.c
if (tab->rewrite > 0 && tab->relkind != RELKIND_SEQUENCE)
{
/* Build a temporary relation and copy data */
OIDNewHeap = make_new_heap(tab->relid, NewTableSpace, NewAccessMethod,
persistence, lockmode);
/* Copy the heap data with modifications, test new constraints */
ATRewriteTable(tab, OIDNewHeap);
/* Swap the physical files, rebuild indexes, discard old heap */
finish_heap_swap(tab->relid, OIDNewHeap,
false, false, true,
!OidIsValid(tab->newTableSpace),
RecentXmin, ReadNextMultiXactId(), persistence);
}
else
{
/* No rewrite: scan only to verify new constraints, no new file */
if (tab->constraints != NIL || tab->verify_new_notnull ||
tab->partition_constraint != NULL)
ATRewriteTable(tab, InvalidOid);
/* SET TABLESPACE with no reconstruction is a block-by-block copy */
if (tab->newTableSpace)
ATExecSetTableSpace(tab->relid, tab->newTableSpace, lockmode);
}

The single ATRewriteTable routine serves both roles, switched on whether OIDNewHeap is a real OID or InvalidOid. The rewrite path builds a fresh relfilenode with make_new_heap (the same function CLUSTER and VACUUM FULL call), copies every transformed tuple into it, and then finish_heap_swap atomically swaps the two relations’ relfilenodes, rebuilds the indexes against the new file, and drops the old one. Because the new file is a separate physical relation until the swap, a crash or ROLLBACK anywhere leaves the original file untouched — the MVCC-atomicity property the AlterTable header promises. The verify-only path passes InvalidOid, allocates no new file, and merely scans to throw an error if a freshly added CHECK, NOT NULL, or partition constraint is violated by an existing row.

Inside ATRewriteTable the scan reads each old tuple through tab->oldDesc (the descriptor snapshotted in Phase 1) and, when rewriting, projects it into a new slot built from the current descriptor, evaluating each queued NewColumnValue expression:

// ATRewriteTable (per-tuple rewrite, condensed) — tablecmds.c
while (table_scan_getnextslot(scan, ForwardScanDirection, oldslot))
{
if (tab->rewrite > 0)
{
slot_getallattrs(oldslot);
ExecClearTuple(newslot);
/* copy unchanged attributes straight across */
memcpy(newslot->tts_values, oldslot->tts_values,
sizeof(Datum) * oldslot->tts_nvalid);
memcpy(newslot->tts_isnull, oldslot->tts_isnull,
sizeof(bool) * oldslot->tts_nvalid);
/* dropped columns become NULL in the new layout */
foreach(lc, dropped_attrs)
newslot->tts_isnull[lfirst_int(lc)] = true;
/* evaluate ALTER TYPE / new-default transform expressions */
econtext->ecxt_scantuple = oldslot;
foreach(l, tab->newvals)
{
NewColumnValue *ex = lfirst(l);
if (ex->is_generated) continue;
newslot->tts_values[ex->attnum - 1] =
ExecEvalExpr(ex->exprstate, econtext,
&newslot->tts_isnull[ex->attnum - 1]);
}
ExecStoreVirtualTuple(newslot);
insertslot = newslot;
}
else
insertslot = oldslot; /* verify in place, no projection */
/* NOT NULL / CHECK / partition checks run for BOTH paths */
econtext->ecxt_scantuple = insertslot;
foreach_int(attn, notnull_attrs)
if (slot_attisnull(insertslot, attn))
ereport(ERROR, (errcode(ERRCODE_NOT_NULL_VIOLATION), ...));
foreach(l, tab->constraints) { /* CONSTR_CHECK -> ExecCheck */ }
/* if (newrel) table_tuple_insert(newrel, insertslot, ...) */
}

Two details make this the linchpin of the whole design. First, the same loop enforces new constraints whether or not a rewrite happens — the verify-only path simply reuses oldslot as insertslot and never inserts. Second, the loop runs once per table no matter how many subcommands contributed newvals or constraints: a statement that changes three column types and adds two CHECKs still scans the heap a single time. When the relation is being rewritten, ATRewriteTable also calls TransferPredicateLocksToHeapRelation(oldrel) up front, because moving tuples to a new file invalidates any tuple- or page-level SSI predicate locks.

flowchart TD
    A["ATRewriteTables: foreach table in wqueue"] --> B{"RELKIND_HAS_STORAGE?"}
    B -->|no| A
    B -->|yes| C{"tab->rewrite > 0?"}
    C -->|"yes (not sequence)"| D["make_new_heap -> new relfilenode"]
    D --> E["ATRewriteTable(tab, OIDNewHeap)<br/>scan oldDesc, project newDesc<br/>eval newvals, check constraints<br/>table_tuple_insert into new file"]
    E --> F["finish_heap_swap<br/>swap_relation_files (relfilenode)<br/>rebuild indexes, drop old heap"]
    C -->|"no, but constraints/notnull pending"| G["ATRewriteTable(tab, InvalidOid)<br/>scan only, no new file<br/>error if a row violates"]
    C -->|"no, only SET TABLESPACE"| H["ATExecSetTableSpace<br/>block-by-block file copy"]
    C -->|"nothing pending"| I["skip table entirely"]
    F --> J["final loop: validate FK constraints<br/>execute tab->afterStmts"]
    G --> J
    H --> J
    I --> J

Phase 1 (ATPrepCmd) is also where a single statement fans out to child tables. ATPrepCmd finds or creates the AlteredTableInfo via ATGetQueueEntry, then — for recursable subtypes — drives ATSimpleRecursion, which uses find_all_inheritors to enqueue one work-queue entry per descendant and re-invokes the prep routine on each. This is why a DROP COLUMN on a partitioned parent silently drops the column on every partition: each partition gets its own AlteredTableInfo, its own per-pass buckets, and (in Phase 3) its own single heap scan. The recursion happens entirely in Phase 1, so by the time Phase 2 runs, the queue already contains every affected relation and ATRewriteCatalogs can iterate them “in parallel, one pass at a time.”

This section walks the symbols in call order, grouped by phase, so a reader can follow ALTER TABLE from the utility hook to the final FK validation. All symbols are anchored by name; line numbers appear only in the position-hint table at the end.

  • AlterTable — the top-level driver. Re-opens the relation NoLock (the caller already holds the lock), runs CheckAlterTableIsSafe, then calls ATController. Its header comment is the canonical statement of the three-phase plan and the MVCC-rollback guarantee.
  • AlterTableGetLockLevel — computes the lock before the table is opened, by inspecting only the parsed subcommand list. Returns the maximum of the per-subtype lock levels, floored at ShareUpdateExclusiveLock. The giant switch is the authoritative map of “which ALTER needs which lock.” Hot Standby forces AccessExclusiveLock for any change a standby SELECT could observe.
  • ATController — sequences the three phases: a foreach over subcommands calling ATPrepCmd (Phase 1), then relation_close(rel, NoLock), then ATRewriteCatalogs (Phase 2), then ATRewriteTables (Phase 3).
  • ATPrepCmd — per-subcommand preparation. Finds/creates the AlteredTableInfo, validates the subtype against the relkind, decides the target pass, and recurses to children. Deliberately conservative: it does not read table details a later subcommand could change.
  • ATGetQueueEntry — find-or-create accessor for the per-relation work queue entry. Snapshots oldDesc (a constrained copy of the current tuple descriptor) at creation time — the descriptor Phase 3 reads old tuples through.
  • ATSimpleRecursion / find_all_inheritors — fan the subcommand out to inheritance children / partitions, one queue entry each.
  • AlteredTableInfo / AlterTablePass / AT_NUM_PASSES — the work-queue struct, the fixed pass enum, and the bucket-array size. subcmds is List *[AT_NUM_PASSES].

Phase 2 — catalog mutation, pass by pass

Section titled “Phase 2 — catalog mutation, pass by pass”
  • ATRewriteCatalogs — the outer for (pass = 0; pass < AT_NUM_PASSES; pass++) loop wrapping a foreach over tables. Processing all tables “in parallel, one pass at a time” is what lets one table’s subcommand enqueue work into a later pass of another table (the FK re-add example). After the ALTER TYPE/SET EXPRESSION passes it calls ATPostAlterTypeCleanup; at the end it adds TOAST tables where newly needed.
  • ATExecCmd — the dispatch switch from cmd->subtype to the concrete executor (ATExecAddColumn, ATExecDropColumn, ATExecAlterColumnType, ATExecColumnDefault, …). This is where catalogs are actually mutated and tab->rewrite bits are set.
  • ATExecAddColumn — inserts the pg_attribute row; takes the fast-default path via StoreAttrMissingVal when the default is a non-volatile constant on a plain relation, else sets AT_REWRITE_DEFAULT_VAL.
  • ATExecDropColumnRemoveAttributeById (heap.c) — pure catalog logical delete: attisdropped = true, type cleared, name rewritten to the ........pg.dropped.N........ placeholder. Never sets tab->rewrite.
  • ATPrepAlterColumnType / ATColumnChangeRequiresRewrite — plan the transform expression, queue a NewColumnValue, and set AT_REWRITE_COLUMN_REWRITE only when the cast is more than a relabel (binary-coercible casts and varchar length widening skip the rewrite).
  • StoreAttrMissingVal (heap.c) — writes the evaluated default into pg_attribute.attmissingval / sets atthasmissing, the fast-default storage.
  • ATRewriteTables — the per-table branch on tab->rewrite and the pending-constraint lists. Fires EventTriggerTableRewrite before a rewrite; rejects rewrites of system / catalog-used / other-backend-temp relations.
  • make_new_heap (cluster.c) — builds the transient destination relation with the chosen tablespace, access method, and persistence. Shared with CLUSTER / VACUUM FULL.
  • ATRewriteTable — the single heap scan. Reads through tab->oldDesc, projects into the new descriptor, evaluates tab->newvals (ExecEvalExpr), nulls dropped attributes, enforces notnull_attrs and tab->constraints, and (when newrel) table_tuple_inserts. Calls TransferPredicateLocksToHeapRelation when rewriting.
  • finish_heap_swapswap_relation_files (cluster.c) — atomically swaps the old and new relfilenodes (and persistence), rebuilds indexes against the new file, and discards the old heap. Uses RecentXmin as the new relfrozenxid because every tuple was just rewritten.
  • ATExecSetTableSpace — the SET TABLESPACE-without-reconstruction shortcut: a raw block-by-block file copy rather than a tuple rewrite.
  • AT_REWRITE_ALTER_PERSISTENCE / AT_REWRITE_DEFAULT_VAL / AT_REWRITE_COLUMN_REWRITE / AT_REWRITE_ACCESS_METHOD — the four rewrite-reason bits (in event_trigger.h), OR-ed into tab->rewrite.

After the per-table loop, ATRewriteTables runs a final pass that validates new foreign-key constraints (deferred so both sides of an FK are fully rewritten first) and executes each table’s tab->afterStmts.

Position hints (as of 2026-06-05, REL_18 273fe94)

Section titled “Position hints (as of 2026-06-05, REL_18 273fe94)”
SymbolFileLine
AlterTablePass (enum)src/backend/commands/tablecmds.c148
AT_NUM_PASSESsrc/backend/commands/tablecmds.c166
AlteredTableInfo (struct)src/backend/commands/tablecmds.c178
AlterTablesrc/backend/commands/tablecmds.c4534
AlterTableGetLockLevelsrc/backend/commands/tablecmds.c4608
ATControllersrc/backend/commands/tablecmds.c4870
ATPrepCmdsrc/backend/commands/tablecmds.c4905
ATRewriteCatalogssrc/backend/commands/tablecmds.c5302
ATExecCmdsrc/backend/commands/tablecmds.c5376
ATRewriteTablessrc/backend/commands/tablecmds.c5838
ATRewriteTablesrc/backend/commands/tablecmds.c6126
ATGetQueueEntrysrc/backend/commands/tablecmds.c6562
ATSimpleRecursionsrc/backend/commands/tablecmds.c6816
ATExecAddColumnsrc/backend/commands/tablecmds.c7217
ATExecDropColumnsrc/backend/commands/tablecmds.c9284
ATPrepAlterColumnTypesrc/backend/commands/tablecmds.c14384
ATColumnChangeRequiresRewritesrc/backend/commands/tablecmds.c14690
RemoveAttributeByIdsrc/backend/catalog/heap.c1700
StoreAttrMissingValsrc/backend/catalog/heap.c2047
make_new_heapsrc/backend/commands/cluster.c705
swap_relation_filessrc/backend/commands/cluster.c1063
finish_heap_swapsrc/backend/commands/cluster.c1445
AT_REWRITE_* flagssrc/include/commands/event_trigger.h40-43
  • The lock level is computed before the table is opened, from the parsed subcommand list only. Verified in AlterTableGetLockLevel: the function takes List *cmds and never touches the relation; the floor is ShareUpdateExclusiveLock and each subtype’s cmd_lockmode is taken as a running maximum. The caller in utility.c acquires that lock, then AlterTable opens NoLock.

  • ALTER TABLE is three phases driven by ATController. Verified: Phase 1 is the foreach/ATPrepCmd loop; the relation is then relation_closed NoLock; Phase 2 is ATRewriteCatalogs; Phase 3 is ATRewriteTables. The comment blocks in ATController label all three.

  • Phase 2 runs all tables “in parallel, one pass at a time.” Verified in ATRewriteCatalogs: the outer loop is for (AlterTablePass pass = 0; pass < AT_NUM_PASSES; pass++) and the inner foreach walks the work queue. The header comment states work can only be propagated into later passes.

  • There are 13 passes, AT_PASS_DROP first through AT_PASS_MISC last. Verified by reading the AlterTablePass enum: AT_PASS_UNSET = -1 then AT_PASS_DROP (0) … AT_PASS_MISC, with AT_NUM_PASSES = AT_PASS_MISC + 1. ALTER TYPE (pass 1) precedes the OLD_INDEX/OLD_CONSTR re-add passes.

  • DROP COLUMN is a catalog-only logical delete. Verified in RemoveAttributeById (heap.c): it sets attisdropped = true, clears atttypid, attnotnull, attgenerated, renames to ........pg.dropped.%d........, and updates the pg_attribute tuple. No heap page is touched, and ATExecDropColumn never ORs a bit into tab->rewrite.

  • ADD COLUMN with a non-volatile constant default is metadata-only via the fast default. Verified: ATExecAddColumn stores the evaluated datum with StoreAttrMissingVal (writing attmissingval/atthasmissing) when the default is non-volatile, non-generated, and the relation is a plain RELKIND_RELATION; otherwise it sets AT_REWRITE_DEFAULT_VAL. A nullable ADD COLUMN with no default needs neither.

  • ALTER COLUMN TYPE skips the rewrite for binary-coercible casts. Verified in ATColumnChangeRequiresRewrite: the loop returns false (no rewrite) when the transform bottoms out at the original Var, strips RelabelType, and treats constraint-free CoerceToDomain as transparent; it returns true for any FuncExpr/real computation. The decision sets AT_REWRITE_COLUMN_REWRITE and queues a NewColumnValue.

  • A rewrite builds a new relfilenode and swaps, never mutating in place. Verified in ATRewriteTables: make_new_heapATRewriteTable(tab, OIDNewHeap)finish_heap_swap(...). finish_heap_swap calls swap_relation_files and uses RecentXmin as the new relfrozenxid. The verify-only branch passes InvalidOid and allocates no file.

  • Phase 3 scans each table at most once. Verified in ATRewriteTable: a single table_beginscan / table_scan_getnextslot loop both projects new tuples (when tab->rewrite) and checks all notnull_attrs, tab->constraints, and the partition constraint; multiple subcommands share the one scan.

  • The rewrite reason is a four-bit mask in event_trigger.h. Verified: AT_REWRITE_ALTER_PERSISTENCE 0x01, AT_REWRITE_DEFAULT_VAL 0x02, AT_REWRITE_COLUMN_REWRITE 0x04, AT_REWRITE_ACCESS_METHOD 0x08. The mask is consumed in ATRewriteTables and broadcast by EventTriggerTableRewrite.

  1. The precise set of subcommand subtypes that land on ShareRowExclusiveLock vs. AccessExclusiveLock. The doc names the three lock tiers and the floor, but the full per-subtype table is large and evolves; the authoritative list is the AlterTableGetLockLevel switch as of this revision. Investigation path: enumerate every case and its cmd_lockmode, cross-checked against the user-facing docs’ “ALTER TABLE lock levels” note.

  2. How ALTER TYPE reconstructs dependent objects (indexes, views, constraints) across the pass boundary. This doc treats ATPostAlterTypeCleanup and the OLD_INDEX/OLD_CONSTR re-add passes as a black box; the dependency-tracking that decides which objects must be dropped-and-recreated is owned by postgres-ddl-execution.md.

  3. The exact interaction between a fast-default column and a later table-rewriting ALTER. Once attmissingval is set, a subsequent rewrite must materialize the missing value into every tuple and clear the flag; the precise point where atthasmissing is reset during a rewrite was not traced here.

Beyond PostgreSQL — Comparative Designs & Research Frontiers

Section titled “Beyond PostgreSQL — Comparative Designs & Research Frontiers”
  • MySQL/InnoDB ALGORITHM=INSTANT|INPLACE|COPY. InnoDB exposes the rewrite-vs-metadata choice as explicit DDL syntax. INSTANT ADD COLUMN (MySQL 8.0) is the direct analog of PostgreSQL’s fast default — the column metadata is recorded and old rows read a stored default — while COPY rebuilds the table under a metadata lock. A side-by-side of InnoDB’s instant column metadata against attmissingval would sharpen what each engine can and cannot do without a rewrite (e.g., column position, DROP COLUMN instant support).

  • Online schema change tools (gh-ost, pt-online-schema-change). Because many engines historically held an exclusive lock for the whole rewrite, external tools build a shadow table, backfill it, capture concurrent writes via triggers or the binlog, then cut over with a brief lock. PostgreSQL’s in-process make_new_heap+finish_heap_swap does the shadow-table-and-swap inside one transaction but still under AccessExclusiveLock for the duration — so the same online-DDL pressure exists. Mapping gh-ost’s triggered cut-over against PostgreSQL’s single-lock rewrite frames why PostgreSQL users still reach for CREATE INDEX CONCURRENTLY-style patterns and logical-replication-based migrations.

  • Google F1 / Spanner asynchronous schema change. Rae et al., Online, Asynchronous Schema Change in F1 (VLDB 2013), formalize schema evolution as a sequence of intermediate states (delete-only, write-only) so that nodes on different schema versions never corrupt each other’s data — a fundamentally distributed answer that PostgreSQL’s single-node, single-lock model sidesteps entirely. The contrast is instructive: PostgreSQL gets atomicity from MVCC and one lock; F1 gets availability from staged, version-tolerant states.

  • Aurora / Neon and storage-disaggregated rewrites. When storage is a separate service, a full heap rewrite is bandwidth across a network rather than local I/O, which changes the cost calculus of “just rewrite it.” How a disaggregated engine would implement finish_heap_swap’s atomic relfilenode swap (a metadata flip vs. a data move) is an open comparative question.

  • CUBRID ALTER TABLE. CUBRID likewise distinguishes metadata-only alters from those forcing a reload, but its catalog and locking model differ; a comparison against the CUBRID schema-change path in the cubrid tree would highlight how much of PostgreSQL’s cheap-path machinery (fast default, attisdropped, binary-coercion shortcut) has a CUBRID counterpart.

In-tree source files (REL_18_STABLE, commit 273fe94)

Section titled “In-tree source files (REL_18_STABLE, commit 273fe94)”
  • src/backend/commands/tablecmds.c — the whole ALTER TABLE machine: AlterTable, AlterTableGetLockLevel, ATController, ATPrepCmd, ATRewriteCatalogs, ATExecCmd, ATRewriteTables, ATRewriteTable, ATGetQueueEntry, ATSimpleRecursion, ATExecAddColumn, ATExecDropColumn, ATPrepAlterColumnType, ATColumnChangeRequiresRewrite, and the AlterTablePass enum / AlteredTableInfo struct.
  • src/backend/commands/cluster.cmake_new_heap, swap_relation_files, finish_heap_swap, shared with CLUSTER / VACUUM FULL.
  • src/backend/catalog/heap.cRemoveAttributeById (the attisdropped logical delete) and StoreAttrMissingVal (the fast-default storage).
  • src/include/commands/event_trigger.h — the AT_REWRITE_* reason-bit macros consumed by ATRewriteTables and EventTriggerTableRewrite.
  • src/include/nodes/parsenodes.hAlterTableStmt, AlterTableCmd, and the AlterTableType (AT_* subtype) enum the dispatch switches on.
  • Database System Concepts (Silberschatz, Korth, Sudarshan, 7e), ch. 4 “Intermediate SQL” / DDL, and the relation/schema foundation in ch. 2 — the metadata-vs-data distinction that splits the ALTER TABLE design space (knowledge/research/dbms-general/).
  • Database Internals (Petrov 2019) — the relfilenode / file-swap and WAL framing behind a rewrite (knowledge/research/dbms-general/).
  • Rae, I. et al. (2013). “Online, Asynchronous Schema Change in F1.” PVLDB 6(11):1045-1056. The distributed staged-state alternative to single-lock DDL.

Sibling docs (cross-references — mechanism owned there, not duplicated here)

Section titled “Sibling docs (cross-references — mechanism owned there, not duplicated here)”
  • postgres-ddl-execution.md — the utility-command dispatch, dependency tracking, and object reconstruction that ALTER TYPE’s OLD_INDEX/ OLD_CONSTR passes rely on.
  • postgres-constraints.mdCHECK/NOT NULL/foreign-key constraint representation and the FK validation that ALTER TABLE’s Phase-3 final loop invokes.
  • postgres-index-creation.md — index build / rebuild, which ALTER TYPE’s index re-add passes and finish_heap_swap’s index rebuild call into.
  • postgres-heap-am.mdtable_tuple_insert / heap storage that ATRewriteTable writes the rewritten tuples through.
  • postgres-toast.md — TOAST table creation (AlterTableCreateToastTable) triggered at the tail of ATRewriteCatalogs.
  • postgres-lock-manager.md — the heavyweight lock table behind the AlterTableGetLockLevel lock levels.