PostgreSQL Transactions & Recovery — Section Overview
What this section covers
This subcategory is the transactional core: the two interlocking machines that make PostgreSQL both concurrent and durable, plus the background machinery that pays the bill for the design choice underneath them (a no-overwrite, multi-version heap).
The scope, in one frame:
- Concurrency (MVCC). A reader sees a snapshot: the set of transactions whose effects are visible to it. `GetSnapshotData` computes a snapshot from the procarray, the shared-memory census of every live backend’s XID (`storage/ipc/procarray.c`). Tuple visibility against that snapshot is decided in `access/heap/heapam_visibility.c`, and the snapshot’s lifetime is managed by the snapshot manager (`utils/time/snapmgr.c`). The procarray is the seam: it is owned by the server-architecture substrate but read by every snapshot, so this section owns both `postgres-mvcc-snapshots.md` and `postgres-procarray.md`, with procarray owning the structure internals and mvcc-snapshots owning visibility.
- Commit state. Whether a given XID committed, aborted, or is still in progress lives in the commit log (clog), plus subtransaction parentage (`subtrans`), optional commit timestamps (`commit_ts`), and shared-lock group membership (`multixact`). All four ride on the SLRU, a simple least-recently-used page cache for fixed-size, append-mostly logs, treated here as its own substrate doc because four clients share it.
- The WAL durability spine. The write-ahead log is PostgreSQL’s single redo stream. `access/transam/xact.c` drives the transaction state machine and writes the commit record; `xlog.c` / `xloginsert.c` insert records; every record is tagged by a resource manager (rmgr) that supplies its `redo` callback. On startup the startup process replays the stream (`xlogrecovery.c`); the checkpointer periodically bounds how far back replay must begin. Two-phase commit persists a prepared transaction’s state so a second backend (or recovery) can commit it later.
- Reclamation. Because the heap never overwrites, dead versions accumulate. Vacuum (`commands/vacuum.c`, `access/heap/vacuumlazy.c`) reclaims them; autovacuum schedules that work; and freeze / wraparound machinery (`access/heap/heapam.c`, `access/transam/varsup.c`) keeps the circular 32-bit XID counter from wrapping around onto XIDs that live tuples still reference.
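The snapshot idea in the first bullet can be made concrete with a small model. What follows is a minimal sketch, not PostgreSQL's code: the `Snapshot` fields mirror the commonly described `xmin` / `xmax` / in-progress-list shape, while `take_snapshot`, `xid_visible`, `tuple_visible`, and the dict standing in for the clog are hypothetical names for illustration. It assumes plain integer XIDs (no wraparound), and it ignores subtransactions, a transaction's view of its own writes, and hint bits.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class XidStatus(Enum):
    IN_PROGRESS = 0
    COMMITTED = 1
    ABORTED = 2


@dataclass(frozen=True)
class Snapshot:
    xmin: int             # every XID below this had finished at snapshot time
    xmax: int             # XIDs at or above this are treated as still running
    xip: frozenset        # XIDs in progress at snapshot time


def take_snapshot(running_xids, next_xid):
    """Build a snapshot from a toy 'procarray': the set of running XIDs.

    Simplification: xmax is taken to be the next unassigned XID.
    """
    running = frozenset(running_xids)
    return Snapshot(xmin=min(running, default=next_xid),
                    xmax=next_xid,
                    xip=running)


def xid_visible(xid, snap, clog):
    """True if the effects of `xid` are visible to `snap`."""
    if xid >= snap.xmax:          # started after the snapshot was taken
        return False
    if xid in snap.xip:           # was still running at snapshot time
        return False
    # Finished before the snapshot: visible only if it committed.
    return clog.get(xid) is XidStatus.COMMITTED


def tuple_visible(t_xmin, t_xmax: Optional[int], snap, clog):
    """A version is visible if its creator is visible and its deleter is not."""
    if not xid_visible(t_xmin, snap, clog):
        return False
    if t_xmax is None:
        return True
    return not xid_visible(t_xmax, snap, clog)


# Toy commit log: XID -> status (stand-in for the clog).
clog = {100: XidStatus.COMMITTED, 101: XidStatus.ABORTED,
        102: XidStatus.IN_PROGRESS, 103: XidStatus.COMMITTED}
snap = take_snapshot(running_xids=[102], next_xid=104)
```

Note that a later commit by XID 102 does not change what `snap` sees: membership in `xip` is decided once, when the snapshot is taken, which is what makes the view stable for the snapshot's lifetime.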
Sharp boundaries — what this section is NOT:
- The shared-memory segment, `PGPROC` / `ProcGlobal`, LWLocks, the heavyweight lock table, and SSI/predicate locking belong to server-architecture. This section consumes the procarray and takes locks; it does not describe the lock manager or the IPC substrate. (The procarray structure is the one piece pulled into this section, because the snapshot seam is inseparable from MVCC.)
- Pages, buffers, smgr, the heap tuple format, HOT pruning, the table AM, the visibility map, and TOAST belong to storage-engine. This section decides which versions are visible and when they may be reclaimed; the storage section owns how a version is laid out on a page and flushed. The WAL-before-flush rule is the shared seam.
- Streaming/logical replication, replication slots, archiving, base backup, and PITR belong to replication-ha. They are all consumers of the same WAL stream this section produces; the production side and the redo/recovery side live here, the distribution side lives there.
In short: this section owns snapshot computation, commit-state recording, WAL production + redo, and dead-version reclamation. It hands the substrate down to server-architecture and storage-engine, and hands the WAL stream sideways to replication-ha.
The layering
The subcategory is best read as two machines over one substrate, joined at the commit. The left column is the MVCC / visibility machine; the right column is the WAL / durability machine; reclamation sits underneath, undoing what MVCC leaves behind.
```mermaid
flowchart TB
    subgraph SUBSTRATE["shared-memory substrate (owned by server-architecture)"]
        PARR["postgres-procarray<br/>live-XID census, GetSnapshotData"]
    end
    subgraph MVCC["concurrency machine — what is visible"]
        SNAP["postgres-mvcc-snapshots<br/>snapshots + tuple visibility"]
        CLOG["postgres-clog-commit-ts<br/>commit/abort state, commit timestamps"]
        MXACT["postgres-multixact<br/>shared row locks, lock groups"]
        SLRU["postgres-slru<br/>page-cache substrate for clog/subtrans/commit_ts/multixact"]
    end
    subgraph WAL["durability machine — the redo spine"]
        XACT["postgres-xact<br/>transaction state machine, commit record"]
        XLOG["postgres-xlog-wal<br/>WAL insert, segments, control state"]
        RMGR["postgres-wal-records-rmgr<br/>per-rmgr record formats + redo callbacks"]
        TWOPC["postgres-two-phase-commit<br/>PREPARE / COMMIT PREPARED state"]
        REDO["postgres-recovery-redo<br/>startup process: crash + archive replay"]
        CKPT["postgres-checkpoint<br/>checkpointer: bounds redo start"]
    end
    subgraph RECLAIM["reclamation — undo the no-overwrite cost"]
        VAC["postgres-vacuum<br/>lazy vacuum: prune, freeze, reap"]
        AV["postgres-autovacuum<br/>launcher + workers: scheduling"]
        FREEZE["postgres-xid-wraparound-freeze<br/>freeze + 32-bit XID wraparound defense"]
    end
    PARR --> SNAP
    SNAP --> CLOG
    CLOG --> SLRU
    MXACT --> SLRU
    SNAP -. "reads commit state" .-> CLOG
    XACT --> XLOG
    XACT --> CLOG
    XLOG --> RMGR
    XACT --> TWOPC
    XLOG --> REDO
    RMGR -. "redo callbacks" .-> REDO
    CKPT --> XLOG
    CKPT -. "bounds replay window" .-> REDO
    SNAP -. "computes oldest visible XID" .-> VAC
    AV --> VAC
    VAC --> FREEZE
    FREEZE -. "advances frozen horizon" .-> CLOG
```
The two machines meet at commit: xact writes the WAL commit record
and stamps the clog. The reclamation machine closes the cycle: vacuum uses
the oldest snapshot’s visible-XID horizon (from the procarray/snapshot side)
to decide which dead versions are safe to remove, and freeze advances the
frozen-XID horizon so old clog pages can be truncated. SLRU is the quiet
substrate under the whole left side.
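The horizon test that vacuum applies can be sketched in a few lines. This is an illustrative model under stated assumptions, not the logic in `vacuumlazy.c`: `TupleVersion`, `oldest_xmin`, and `is_removable` are hypothetical names, the `committed` and `aborted` sets stand in for clog lookups, and XIDs are compared as plain integers (the real system compares them modulo 2^32).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TupleVersion:
    xmin: int              # XID that created this version
    xmax: Optional[int]    # XID that deleted or replaced it, or None


def oldest_xmin(snapshot_xmins, next_xid):
    """Horizon: the smallest xmin any live snapshot still holds.

    With no live snapshots, nothing older than the next XID can be needed.
    """
    return min(snapshot_xmins, default=next_xid)


def is_removable(version, committed, aborted, horizon):
    """Decide whether a version can be reclaimed without breaking any reader."""
    if version.xmin in aborted:
        return True                       # its creator aborted: never visible
    if version.xmax is None or version.xmax not in committed:
        return False                      # still live (no committed deleter)
    # Deleted by a committed transaction: safe only once every live
    # snapshot is new enough to already see that delete.
    return version.xmax < horizon


committed, aborted = {100, 105}, {101}
horizon = oldest_xmin(snapshot_xmins=[104, 110], next_xid=120)
```

With a snapshot still holding `xmin = 104`, a version deleted by XID 105 stays, because that snapshot may still need the old version. Once the old snapshots are gone the horizon moves forward and the same version becomes reclaimable, which is why a long-running transaction holds back vacuum.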
Reading order
Cross-referenced-first: read the snapshot seam and WAL spine before the docs that depend on them.

1. The snapshot seam: `postgres-procarray.md`, then `postgres-mvcc-snapshots.md`. The procarray is the structure; MVCC snapshots are what you compute from it. Read these as a pair.
2. Commit state and its substrate: `postgres-slru.md` first (the page cache), then `postgres-clog-commit-ts.md` and `postgres-multixact.md` (its clients). Visibility decisions read clog, so this follows MVCC.
3. The transaction + WAL spine: `postgres-xact.md` (the state machine and commit protocol), then `postgres-xlog-wal.md` (the log itself), then `postgres-wal-records-rmgr.md` (record formats and redo callbacks).
4. Recovery and its bounds: `postgres-recovery-redo.md` (the startup process replaying the stream), then `postgres-checkpoint.md` (what bounds how far back replay starts).
5. Distributed commit: `postgres-two-phase-commit.md` (builds on xact + WAL + recovery).
6. Paying the no-overwrite bill: `postgres-vacuum.md`, then `postgres-autovacuum.md` (scheduling), then `postgres-xid-wraparound-freeze.md` (the wraparound defense that vacuum ultimately exists to serve).
If you only read three: procarray → mvcc-snapshots → xact. Those name
the snapshot seam and the commit, which everything else hangs from. The
in-tree access/transam/README (“The Transaction System”) is the canonical
companion to step 3 and worth reading alongside postgres-xact.md.
Detail-doc summaries
Forward references: these module docs are planned, and the summaries are predictive, describing what each will own.

| Module doc | What it covers (one line) |
|---|---|
| `postgres-mvcc-snapshots.md` | What a snapshot is, how `GetSnapshotData` builds one from the procarray, and how `HeapTupleSatisfiesMVCC` decides tuple visibility against it. |
| `postgres-procarray.md` | The shared-memory live-XID census: `ProcArray` structure, XID assignment/retirement, snapshot computation, and the xmin horizon vacuum relies on. |
| `postgres-xact.md` | The three-layer transaction state machine (`StartTransactionCommand` / `CommitTransactionCommand`), subtransactions, savepoints, and the WAL commit record + clog stamp. |
| `postgres-xlog-wal.md` | WAL record insertion (`XLogInsert`), LSNs, WAL segments in `pg_wal/`, the control file, and the WAL-before-flush durability rule. |
| `postgres-wal-records-rmgr.md` | The resource-manager table: how each record is tagged by rmgr and dispatched to its redo/desc/decode callbacks; generic WAL. |
| `postgres-slru.md` | The Simple LRU page-cache substrate (`SimpleLruInit`) shared by clog, subtrans, commit_ts, and multixact: buffering, I/O, and truncation. |
| `postgres-clog-commit-ts.md` | The commit log (committed/aborted/in-progress per XID), subtransaction parentage, and optional per-transaction commit timestamps. |
| `postgres-multixact.md` | MultiXact IDs: how PostgreSQL represents a row locked/shared by multiple transactions, and the offset/member SLRUs behind it. |
| `postgres-two-phase-commit.md` | `PREPARE TRANSACTION` / `COMMIT PREPARED`: the on-disk `TwoPhaseFileHeader` state, 2PC rmgr records, and recovery of prepared transactions. |
| `postgres-recovery-redo.md` | The startup process: crash vs archive recovery, the redo loop over the WAL stream, timelines, recovery targets (PITR), and WAL prefetch. |
| `postgres-checkpoint.md` | What a checkpoint is, the checkpointer process (`CheckpointerMain`), how `CreateCheckPoint` bounds the redo start point, and checkpoint pacing. |
| `postgres-vacuum.md` | Lazy vacuum’s three phases (prune+freeze, index vacuum, reap) in `vacuumlazy.c`, dead-TID tracking, and the parallel-vacuum dispatch. |
| `postgres-autovacuum.md` | The autovacuum launcher + workers: thresholds, per-table scheduling, and the wraparound-emergency path that forces vacuums. |
| `postgres-xid-wraparound-freeze.md` | The 32-bit XID space, freezing tuples (`heap_prepare_freeze_tuple`), `varsup.c` XID assignment + limits, and wraparound defense. |
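
The 32-bit XID space in the last row is circular, which is why "older" and "newer" need a modular comparison and why freezing exists at all. The sketch below models that comparison in the style of PostgreSQL's `TransactionIdPrecedes`: two normal XIDs are ordered by the sign of their 32-bit difference, so each XID has roughly two billion XIDs "behind" it and two billion "ahead". The function name `xid_precedes` and the Python emulation of signed 32-bit arithmetic are illustrative assumptions, not the server's code.

```python
FIRST_NORMAL_XID = 3          # XIDs below this are reserved special values
HALF_SPACE = 1 << 31          # 2^31: half of the 32-bit XID circle
MASK_32 = 0xFFFFFFFF


def xid_precedes(a: int, b: int) -> bool:
    """True if XID `a` is logically older than XID `b`."""
    if a < FIRST_NORMAL_XID or b < FIRST_NORMAL_XID:
        # Special XIDs are compared as plain unsigned numbers.
        return a < b
    # Emulate (int32)(a - b) < 0: the unsigned difference has its top bit set.
    return ((a - b) & MASK_32) >= HALF_SPACE


# Near the wrap point: 0xFFFFFFF0 is older than 10, even though 10 is the
# numerically smaller value, because the counter wrapped in between.
old_xid, new_xid = 0xFFFFFFF0, 10
```

The design consequence is that an unfrozen tuple whose XID falls more than about 2^31 XIDs behind the current counter would start to compare as being in the future. Freezing marks such tuples as unconditionally old before that can happen.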
Adjacent sections
| Section overview | Why it borders this one |
|---|---|
| `postgres-overview-server-architecture.md` | Owns the shared-memory substrate this section sits on: `PGPROC` / `ProcGlobal`, LWLocks, the heavyweight lock table, SSI/predicate locking, and the checkpointer/bgwriter/startup aux processes. The procarray structure is the one piece this section pulls inward; everything else under it stays there. |
| `postgres-overview-storage-engine.md` | Owns the page, buffer pool, smgr, heap tuple format, HOT pruning, the table AM, the visibility map, and TOAST. This section decides which tuple versions are visible and when they may be reclaimed; the storage section owns how a version is laid out and flushed. The WAL-before-flush rule and the visibility map are the shared seams. |
| `postgres-overview-replication-ha.md` | The downstream consumer of the WAL stream this section produces and redoes: physical streaming, logical decoding, archiving, base backup, and PITR all read the same log. Recovery/redo lives here; distribution lives there. |
| `postgres-overview-query-processing.md` | A lighter border: the executor opens snapshots and the transaction blocks delimited here wrap query execution. Snapshot lifetime (`snapmgr`) is the touch point; the executor itself stays in query-processing. |