PostgreSQL Transactions & Recovery — Section Overview

This subcategory is the transactional core: the two interlocking machines that make PostgreSQL both concurrent and durable, plus the background machinery that pays the bill for the design choice underneath them — a no-overwrite, multi-version heap.

The scope, in one frame:

  • Concurrency (MVCC). A reader sees a snapshot: the set of transactions whose effects are visible to it. GetSnapshotData computes a snapshot from the procarray, the in-shared-memory census of every live backend’s XID (storage/ipc/procarray.c). Tuple visibility against that snapshot is decided in access/heap/heapam_visibility.c, and the snapshot’s lifetime is managed by the snapshot manager (utils/time/snapmgr.c). The procarray is the seam: it lives in the shared-memory substrate that server-architecture owns, but every snapshot reads it, so this section owns both postgres-mvcc-snapshots.md and postgres-procarray.md, with procarray owning the structure internals and mvcc-snapshots owning visibility.
  • Commit state. Whether a given XID committed, aborted, or is still in progress is recorded in the commit log (clog). Alongside it sit subtransaction parentage (subtrans), optional commit timestamps (commit_ts), and shared-lock group membership (multixact). All four ride on the SLRU, a simple least-recently-used page cache for fixed-size, append-mostly logs, which is treated here as its own substrate doc because four clients share it.
  • The WAL durability spine. The write-ahead log is PostgreSQL’s single redo stream. access/transam/xact.c drives the transaction state machine and writes the commit record; xlog.c / xloginsert.c insert records; every record is tagged by a resource manager (rmgr) that supplies its redo callback. On startup the startup process replays the stream (xlogrecovery.c); the checkpointer periodically bounds how far back replay must begin. Two-phase commit persists a prepared transaction’s state so a second backend (or recovery) can commit it later.
  • Reclamation. Because the heap never overwrites, dead versions accumulate. Vacuum (commands/vacuum.c, access/heap/vacuumlazy.c) reclaims them; autovacuum schedules that work; and freeze / wraparound machinery (access/heap/heapam.c, access/transam/varsup.c) keeps the 32-bit XID counter from wrapping around onto XIDs that unfrozen tuples still reference.
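The snapshot described in the first bullet can be pictured as three values: a lower bound, an upper bound, and the set of XIDs that were in progress when the snapshot was taken. The sketch below is a simplified Python model of that idea, not PostgreSQL's C code. The `Snapshot` class, the `clog` dictionary, and the function names are hypothetical, and the model ignores subtransactions, hint bits, a transaction's own writes, and XID wraparound.

```python
from dataclasses import dataclass, field

COMMITTED, ABORTED, IN_PROGRESS = "committed", "aborted", "in_progress"


@dataclass(frozen=True)
class Snapshot:
    xmin: int   # every XID below this had already finished
    xmax: int   # XIDs at or above this are treated as still running
    xip: frozenset = field(default_factory=frozenset)  # running at snapshot time


def xid_in_snapshot(xid, snap):
    """True if xid had not finished when the snapshot was taken."""
    if xid < snap.xmin:
        return False
    if xid >= snap.xmax:
        return True
    return xid in snap.xip


def xid_visible(xid, snap, clog):
    """Effects of xid are visible if it finished before the snapshot and committed."""
    if xid_in_snapshot(xid, snap):
        return False
    return clog.get(xid, IN_PROGRESS) == COMMITTED


def tuple_visible(t_xmin, t_xmax, snap, clog):
    """A version is visible if its inserter is visible and its deleter (if any) is not."""
    if not xid_visible(t_xmin, snap, clog):
        return False
    if t_xmax is None:
        return True
    return not xid_visible(t_xmax, snap, clog)


# Toy commit log: XID 102 commits later, but was running when the snapshot was taken.
clog = {100: COMMITTED, 101: ABORTED, 102: COMMITTED, 103: COMMITTED}
snap = Snapshot(xmin=102, xmax=104, xip=frozenset({102}))

print(tuple_visible(100, None, snap, clog))  # True
print(tuple_visible(100, 102, snap, clog))   # True: the deleter was still running
print(tuple_visible(100, 103, snap, clog))   # False: the deleter committed first
```

The point of the model is the division of labor: the snapshot answers "had this XID finished when I looked?", and the commit log answers "did it commit?". Both answers are needed before a version counts as visible.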

Sharp boundaries — what this section is NOT:

  • The shared-memory segment, PGPROC/ProcGlobal, LWLocks, the heavyweight lock table, and SSI/predicate locking belong to server-architecture. This section consumes the procarray and takes locks; it does not describe the lock manager or the IPC substrate. (The procarray structure is the one piece pulled into this section, because the snapshot seam is inseparable from MVCC.)
  • Pages, buffers, smgr, the heap tuple format, HOT pruning, the table AM, the visibility map, and TOAST belong to storage-engine. This section decides which versions are visible and when they may be reclaimed; the storage section owns how a version is laid out on a page and flushed. The WAL-before-flush rule is the shared seam.
  • Streaming/logical replication, replication slots, archiving, base backup, and PITR belong to replication-ha. They are all consumers of the same WAL stream this section produces; the production side and the redo/recovery side live here, the distribution side lives there.
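The WAL-before-flush rule named above as the seam with storage-engine says that a modified page may not reach disk until the WAL record describing the modification is durable. A toy model of the rule follows. The `Wal`, `Page`, and `BufferPool` classes are hypothetical, and LSNs are plain record counters rather than byte positions in the log.

```python
class Wal:
    def __init__(self):
        self.records = []      # in-memory log tail
        self.flushed_lsn = 0   # everything at or below this LSN is durable

    def insert(self, payload):
        """Append a record and return its LSN (here a 1-based counter)."""
        self.records.append(payload)
        return len(self.records)

    def flush(self, upto_lsn):
        """Make the log durable up to upto_lsn; the flush point never moves back."""
        target = min(upto_lsn, len(self.records))
        self.flushed_lsn = max(self.flushed_lsn, target)


class Page:
    def __init__(self):
        self.data = None
        self.lsn = 0           # LSN of the last record that modified this page


class BufferPool:
    def __init__(self, wal):
        self.wal = wal
        self.disk = {}         # page_id -> data that has been written out

    def modify(self, page, data):
        page.lsn = self.wal.insert(("set", data))  # log the change first
        page.data = data                            # then apply it in memory

    def flush_page(self, page_id, page):
        # The rule: WAL up to the page's LSN must be durable before the page is.
        if self.wal.flushed_lsn < page.lsn:
            self.wal.flush(page.lsn)
        self.disk[page_id] = page.data
```

Stamping each page with the LSN of its latest change is what lets the two sections stay decoupled: the buffer side only needs to compare one number against the log's flush point.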

In short: this section owns snapshot computation, commit-state recording, WAL production + redo, and dead-version reclamation. It hands the substrate down to server-architecture and storage-engine, and hands the WAL stream sideways to replication-ha.
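Commit-state recording is compact: the commit log keeps two status bits per XID, so one byte covers four transactions. The sketch below shows how an XID maps to a page, a byte, and a bit offset. It is a hedged Python model. The constants follow those in access/transam/clog.c with the default 8 kB block size and should be checked against the source tree for a given build. `locate`, `set_status`, `get_status`, and the `pages` dictionary (standing in for SLRU buffers) are hypothetical names.

```python
BLCKSZ = 8192                                        # default block size (build-time option)
CLOG_BITS_PER_XACT = 2
CLOG_XACTS_PER_BYTE = 8 // CLOG_BITS_PER_XACT        # 4
CLOG_XACTS_PER_PAGE = BLCKSZ * CLOG_XACTS_PER_BYTE   # 32768

IN_PROGRESS, COMMITTED, ABORTED, SUB_COMMITTED = 0x00, 0x01, 0x02, 0x03


def locate(xid):
    """Return (page number, byte within page, bit shift within byte) for an XID."""
    page = xid // CLOG_XACTS_PER_PAGE
    index_in_page = xid % CLOG_XACTS_PER_PAGE
    byte = index_in_page // CLOG_XACTS_PER_BYTE
    shift = (xid % CLOG_XACTS_PER_BYTE) * CLOG_BITS_PER_XACT
    return page, byte, shift


def set_status(pages, xid, status):
    """Overwrite the two status bits for xid, allocating a zeroed page on demand."""
    page, byte, shift = locate(xid)
    buf = pages.setdefault(page, bytearray(BLCKSZ))
    cleared = buf[byte] & ~(0x03 << shift)
    buf[byte] = cleared | (status << shift)


def get_status(pages, xid):
    """Read the two status bits for xid; an untouched page reads as in progress."""
    page, byte, shift = locate(xid)
    buf = pages.get(page)
    if buf is None:
        return IN_PROGRESS
    return (buf[byte] >> shift) & 0x03


pages = {}
set_status(pages, 5, COMMITTED)
set_status(pages, 6, ABORTED)
print(locate(5))             # (0, 1, 2)
print(get_status(pages, 5))  # 1
```

Because the layout is fixed-size and indexed directly by XID, a lookup is pure arithmetic followed by one page read, which is why a small page cache is enough.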

The subcategory is best read as two machines over one substrate, joined at the commit. The left column is the MVCC / visibility machine; the right column is the WAL / durability machine; reclamation sits underneath, undoing what MVCC leaves behind.

flowchart TB
  subgraph SUBSTRATE["shared-memory substrate (owned by server-architecture)"]
    PARR["postgres-procarray<br/>live-XID census, GetSnapshotData"]
  end

  subgraph MVCC["concurrency machine — what is visible"]
    SNAP["postgres-mvcc-snapshots<br/>snapshots + tuple visibility"]
    CLOG["postgres-clog-commit-ts<br/>commit/abort state, commit timestamps"]
    MXACT["postgres-multixact<br/>shared row locks, lock groups"]
    SLRU["postgres-slru<br/>page-cache substrate for clog/subtrans/commit_ts/multixact"]
  end

  subgraph WAL["durability machine — the redo spine"]
    XACT["postgres-xact<br/>transaction state machine, commit record"]
    XLOG["postgres-xlog-wal<br/>WAL insert, segments, control state"]
    RMGR["postgres-wal-records-rmgr<br/>per-rmgr record formats + redo callbacks"]
    TWOPC["postgres-two-phase-commit<br/>PREPARE / COMMIT PREPARED state"]
    REDO["postgres-recovery-redo<br/>startup process: crash + archive replay"]
    CKPT["postgres-checkpoint<br/>checkpointer: bounds redo start"]
  end

  subgraph RECLAIM["reclamation — undo the no-overwrite cost"]
    VAC["postgres-vacuum<br/>lazy vacuum: prune, freeze, reap"]
    AV["postgres-autovacuum<br/>launcher + workers: scheduling"]
    FREEZE["postgres-xid-wraparound-freeze<br/>freeze + 32-bit XID wraparound defense"]
  end

  PARR --> SNAP
  SNAP --> CLOG
  CLOG --> SLRU
  MXACT --> SLRU
  SNAP -. "reads commit state" .-> CLOG

  XACT --> XLOG
  XACT --> CLOG
  XLOG --> RMGR
  XACT --> TWOPC
  XLOG --> REDO
  RMGR -. "redo callbacks" .-> REDO
  CKPT --> XLOG
  CKPT -. "bounds replay window" .-> REDO

  SNAP -. "computes oldest visible XID" .-> VAC
  AV --> VAC
  VAC --> FREEZE
  FREEZE -. "advances frozen horizon" .-> CLOG

The two machines meet at commit: xact writes the WAL commit record and stamps the clog. The reclamation machine closes the cycle: vacuum uses the oldest snapshot’s visible-XID horizon (from the procarray/snapshot side) to decide which dead versions are safe to remove, and freeze advances the frozen-XID horizon so old clog pages can be truncated. SLRU is the quiet substrate under the whole left side.
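The horizon test that closes the cycle can be sketched in a few lines. This is a simplified Python model under stated assumptions: XIDs are compared circularly over a 32-bit space (the signed-difference trick used for normal XIDs), special low-numbered XIDs are ignored, and `oldest_xmin` and `removable` are hypothetical helper names, not PostgreSQL functions.

```python
def xid_precedes(a, b):
    """Circular comparison over 32 bits: True if a is logically older than b."""
    diff = (a - b) & 0xFFFFFFFF
    return diff >= 0x80000000  # the signed 32-bit difference is negative


def oldest_xmin(snapshot_xmins, next_xid):
    """Horizon: the oldest xmin among live snapshots, or next_xid if none are open."""
    horizon = next_xid
    for xmin in snapshot_xmins:
        if xid_precedes(xmin, horizon):
            horizon = xmin
    return horizon


def removable(deleter_xid, deleter_committed, horizon):
    """A dead version may go once its deleter committed and precedes the horizon."""
    return deleter_committed and xid_precedes(deleter_xid, horizon)


horizon = oldest_xmin([500, 480, 510], next_xid=520)
print(horizon)                        # 480
print(removable(470, True, horizon))  # True
print(removable(490, True, horizon))  # False: some snapshot may still need it
```

A single long-lived snapshot holds the horizon back for every table, which is why the snapshot side and the reclamation side cannot be reasoned about separately.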

Reading order, most cross-referenced first: read the snapshot seam and the WAL spine before the docs that depend on them.

  1. The snapshot seam: postgres-procarray.md, then postgres-mvcc-snapshots.md. The procarray is the structure; MVCC snapshots are what you compute from it. Read these as a pair.
  2. Commit state and its substrate: postgres-slru.md first (the page cache), then postgres-clog-commit-ts.md and postgres-multixact.md (its clients). Visibility decisions read clog, so this follows MVCC.
  3. The transaction + WAL spine: postgres-xact.md (the state machine and commit protocol), then postgres-xlog-wal.md (the log itself), then postgres-wal-records-rmgr.md (record formats and redo callbacks).
  4. Recovery and its bounds: postgres-recovery-redo.md (the startup process replaying the stream), then postgres-checkpoint.md (what bounds how far back replay starts).
  5. Distributed commit: postgres-two-phase-commit.md (builds on xact + WAL + recovery).
  6. Paying the no-overwrite bill: postgres-vacuum.md, then postgres-autovacuum.md (scheduling), then postgres-xid-wraparound-freeze.md (the wraparound defense that vacuum ultimately exists to serve).

If you only read three: procarray → mvcc-snapshots → xact. Those name the snapshot seam and the commit, which everything else hangs from. The in-tree access/transam/README (“The Transaction System”) is the canonical companion to step 3 and worth reading alongside postgres-xact.md.

Forward references — these module docs are planned; summaries are predictive, describing what each will own.

| Module doc | What it covers (one line) |
| --- | --- |
| postgres-mvcc-snapshots.md | What a snapshot is, how GetSnapshotData builds one from the procarray, and how HeapTupleSatisfiesMVCC decides tuple visibility against it. |
| postgres-procarray.md | The shared-memory live-XID census: ProcArray structure, XID assignment/retirement, snapshot computation, and the xmin horizon vacuum relies on. |
| postgres-xact.md | The three-layer transaction state machine (StartTransactionCommand/CommitTransactionCommand), subtransactions, savepoints, and the WAL commit record + clog stamp. |
| postgres-xlog-wal.md | WAL record insertion (XLogInsert), LSNs, WAL segments in pg_wal/, the control file, and the WAL-before-flush durability rule. |
| postgres-wal-records-rmgr.md | The resource-manager table: how each record is tagged by rmgr and dispatched to its redo/desc/decode callbacks; generic WAL. |
| postgres-slru.md | The Simple LRU page-cache substrate (SimpleLruInit) shared by clog, subtrans, commit_ts, and multixact: buffering, I/O, and truncation. |
| postgres-clog-commit-ts.md | The commit log (committed/aborted/in-progress per XID), subtransaction parentage, and optional per-transaction commit timestamps. |
| postgres-multixact.md | MultiXact IDs: how PostgreSQL represents a row locked/shared by multiple transactions, and the offset/member SLRUs behind it. |
| postgres-two-phase-commit.md | PREPARE TRANSACTION / COMMIT PREPARED: the on-disk TwoPhaseFileHeader state, 2PC rmgr records, and recovery of prepared transactions. |
| postgres-recovery-redo.md | The startup process: crash vs archive recovery, the redo loop over the WAL stream, timelines, recovery targets (PITR), and WAL prefetch. |
| postgres-checkpoint.md | What a checkpoint is, the checkpointer process (CheckpointerMain), how CreateCheckPoint bounds the redo start point, and checkpoint pacing. |
| postgres-vacuum.md | Lazy vacuum’s three phases (prune+freeze, index vacuum, reap) in vacuumlazy.c, dead-TID tracking, and the parallel-vacuum dispatch. |
| postgres-autovacuum.md | The autovacuum launcher + workers: thresholds, per-table scheduling, and the wraparound-emergency path that forces vacuums. |
| postgres-xid-wraparound-freeze.md | The 32-bit XID space, freezing tuples (heap_prepare_freeze_tuple), varsup.c XID assignment + limits, and wraparound defense. |

| Section overview | Why it borders this one |
| --- | --- |
| postgres-overview-server-architecture.md | Owns the shared-memory substrate this section sits on: PGPROC/ProcGlobal, LWLocks, the heavyweight lock table, SSI/predicate locking, and the checkpointer/bgwriter/startup aux processes. The procarray structure is the one piece this section pulls inward; everything else under it stays there. |
| postgres-overview-storage-engine.md | Owns the page, buffer pool, smgr, heap tuple format, HOT pruning, the table AM, the visibility map, and TOAST. This section decides which tuple versions are visible and when they may be reclaimed; the storage section owns how a version is laid out and flushed. The WAL-before-flush rule and the visibility map are the shared seams. |
| postgres-overview-replication-ha.md | The downstream consumer of the WAL stream this section produces and redoes: physical streaming, logical decoding, archiving, base backup, and PITR all read the same log. Recovery/redo lives here; distribution lives there. |
| postgres-overview-query-processing.md | A lighter border: the executor opens snapshots and the transaction blocks delimited here wrap query execution. Snapshot lifetime (snapmgr) is the touch point; the executor itself stays in query-processing. |