PostgreSQL Two-Phase Commit — Global Transactions, Dummy PGPROCs, and WAL-First State Persistence
Contents:
- Theoretical Background
- Common DBMS Design
- PostgreSQL’s Approach
- Source Walkthrough
- Source verification (as of 2026-06-05)
- Beyond PostgreSQL — Comparative Designs & Research Frontiers
- Sources
Theoretical Background
Two-phase commit (2PC) is the classical protocol for achieving atomic commit across multiple, independently-failing resource managers. Jim Gray’s Notes on Database Operating Systems (1978) and the subsequent X/Open XA specification (1991) formalize the model: a transaction coordinator drives all participants through a two-round handshake, prepare (can you commit?) then commit (do commit), so that the outcome is either universally committed or universally aborted even if any single participant crashes between the two rounds. The correctness guarantee rests on a single invariant: once a participant votes “prepared,” it may not unilaterally abort; it must wait for the coordinator’s second-phase decision and honor it.
Database System Concepts (Silberschatz, Korth, Sudarshan, 7e, §19.4 “Atomic Commit Protocols”) states the canonical safety property: “if any site crashes after the coordinator sends commit, the transaction must still commit on recovery.” This lifts the durable instant to the coordinator’s decision log write, not the participant’s. The engineering consequence is that every participant must persist enough state at prepare time to re-execute the second phase — commit or abort — after any crash. That persistent state is the 2PC state record, and its durability requirements are strictly stronger than those of a single-backend commit: it must survive restarts that may come arbitrarily long after the prepare.
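The invariant can be made concrete as a small state machine. The following is a minimal sketch, not PostgreSQL code: the enum values and the functions participant_step and coordinator_decides_commit are hypothetical names chosen for illustration. It shows that a local abort is accepted before the vote and refused after it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical participant states and events, for illustration only. */
typedef enum { TX_ACTIVE, TX_PREPARED, TX_COMMITTED, TX_ABORTED } TxState;
typedef enum { EV_PREPARE, EV_LOCAL_ABORT, EV_COORD_COMMIT, EV_COORD_ABORT } TxEvent;

/*
 * Apply an event to a participant. Returns true and updates *state if the
 * transition is allowed; returns false and leaves *state alone otherwise.
 */
bool
participant_step(TxState *state, TxEvent ev)
{
    assert(state != NULL);
    switch (*state)
    {
        case TX_ACTIVE:
            if (ev == EV_PREPARE)     { *state = TX_PREPARED; return true; }
            if (ev == EV_LOCAL_ABORT) { *state = TX_ABORTED;  return true; }
            return false;
        case TX_PREPARED:
            /* After voting, only the coordinator's decision is accepted. */
            if (ev == EV_COORD_COMMIT) { *state = TX_COMMITTED; return true; }
            if (ev == EV_COORD_ABORT)  { *state = TX_ABORTED;   return true; }
            return false;
        default:
            return false;           /* terminal states accept nothing */
    }
}

/* Coordinator rule: commit only if every participant voted "prepared". */
bool
coordinator_decides_commit(const TxState *participants, int n)
{
    for (int i = 0; i < n; i++)
        if (participants[i] != TX_PREPARED)
            return false;
    return true;
}
```

A participant in TX_PREPARED that receives EV_LOCAL_ABORT stays in TX_PREPARED, which is exactly the state that must survive a crash.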
Three design choices follow from this model and shape every 2PC implementation:
- What must be durably recorded at prepare time? At minimum: the XID, which resource managers participated, and the resources they hold (locks, pending file drops, catalog invalidation messages). Anything not recorded is lost across a crash, making the second-phase completion incorrect.
- Where does the prepared state live? A WAL-primary approach writes the state record to WAL and reads it back for commit/rollback. A file-primary approach writes a separate state file immediately. A hybrid defers the file write to checkpoint time, using WAL as the primary backing store for the common fast path.
- How does the rest of the system see a prepared transaction? An XID in the “prepared” state must be treated as still running (it holds locks, its writes are not yet visible to others, and it must block XID wraparound cleanup) even though its original backend has detached. The engine needs a proxy entry in the shared process array so that lock conflicts, snapshot decisions, and XID-horizon tracking all work correctly without special-casing prepared XIDs everywhere.
PostgreSQL’s twophase.c answers all three: serialize into WAL at prepare,
promote to file at checkpoint, and maintain a dummy PGPROC for each
prepared XID so the rest of the engine sees it as an ordinary running
transaction.
The recovery substrate underneath 2PC is ARIES (Mohan et al., ARIES: A
Transaction Recovery Method Supporting Fine-Granularity Locking and Partial
Rollbacks Using Write-Ahead Logging, ACM TODS 1992; captured in
knowledge/research/dbms-papers/aries.md). ARIES’s repeat-history principle
is why replay of a PREPARE WAL record simply re-creates the TwoPhaseState
entry in shared memory — the prepared transaction is considered still running,
and the second-phase WAL record (commit or abort) will complete it during redo
if the second phase happened before the crash.
Common DBMS Design
Across Oracle (two-phase with in-doubt transaction table), IBM DB2 (prepared
transaction coordinator log), MySQL/InnoDB (XA with XA PREPARE), SQL Server
(DTC integration), and PostgreSQL, four engineering conventions recur.
A dummy process entry per prepared XID. The locking subsystem, the
snapshot engine (procarray), and the XID-horizon tracker all operate on the
shared process array. Rather than plumbing a special “prepared XID” path through
each of these subsystems, every major engine inserts a lightweight proxy entry
into the same array that normal backends use. PostgreSQL’s proxy is a full
PGPROC struct with pid = 0; Oracle uses an in-doubt transaction table entry
that also participates in lock conflict detection; InnoDB maintains a hash of
prepared XA transactions visible to the lock manager. The common insight is that
the locking and visibility machinery should not need to know whether a “running”
transaction has a live backend.
State blob serialized at prepare time, read back at second-phase time. No engine reconstructs the necessary cleanup state from the heap or WAL at commit time — the information that must be acted on (which relations to drop, which catalog caches to invalidate, which locks to release) is gathered once at prepare time and stored in a compact, versioned record. The blob’s schema is driven by the set of subsystems that register cleanup actions; those subsystems become “2PC resource managers” in XA terminology.
Extensible resource manager callback tables. XA formalizes this: any
resource manager that participates in a distributed transaction implements
prepare, commit, and rollback callbacks. PostgreSQL expresses this as
four static arrays of TwoPhaseCallback function pointers indexed by a
small TwoPhaseRmgrId (0–4): one array each for recovery, post-commit,
post-abort, and standby-recovery. The set of built-in registrants is
lock manager, pgstat, multixact, and predicate lock; the arrays are compiled
into twophase_rmgr.c.
Lazy promotion from write-ahead log to stable file. A prepared transaction
that lives only seconds does not need a dedicated file: the WAL record it wrote
at prepare time has not yet been recycled and can be read back by its LSN.
Writing a separate file for every prepare, even short-lived ones, would generate
small-file I/O on every PREPARE TRANSACTION call. Instead, engines write the
state to WAL (or an equivalent redo log) and promote to a dedicated file only at
checkpoint time, after the WAL that carries the state might otherwise be
recycled. This is the pattern PostgreSQL follows precisely: prepare_start_lsn
in the GlobalTransactionData is the WAL read-pointer; ondisk is the flag
that switches from WAL to file.
Theory ↔ PostgreSQL mapping
| Theory / convention | PostgreSQL name |
|---|---|
| Global transaction ID (GID) | gid[GIDSIZE] in GlobalTransactionData |
| Prepared XID in the running-transaction set | dummy PGPROC at gxact->pgprocno, inserted via ProcArrayAdd |
| Shared table of all prepared transactions | TwoPhaseStateData.prepXacts[], guarded by TwoPhaseStateLock |
| Prepare state blob (WAL primary) | XLOG_XACT_PREPARE record assembled by StartPrepare / EndPrepare |
| Prepare state blob (file secondary) | pg_twophase/<XID> file, promoted at checkpoint by CheckPointTwoPhase |
| 2PC resource manager callbacks | TwoPhaseCallback arrays in twophase_rmgr.c |
| Coordinator decision (commit) | FinishPreparedTransaction(gid, true) → RecordTransactionCommitPrepared |
| Coordinator decision (abort) | FinishPreparedTransaction(gid, false) → RecordTransactionAbortPrepared |
| Recovery of prepared transactions | restoreTwoPhaseData + RecoverPreparedTransactions |
PostgreSQL’s Approach
Shared memory layout: TwoPhaseStateData and GlobalTransactionData
The entire 2PC runtime state lives in a single fixed-size shared-memory region
allocated by TwoPhaseShmemInit. The root is TwoPhaseStateData:

```c
// TwoPhaseStateData: src/backend/access/transam/twophase.c
typedef struct TwoPhaseStateData
{
    GlobalTransaction freeGXacts;   /* linked list of free entries */
    int         numPrepXacts;       /* number of valid entries */
    GlobalTransaction prepXacts[FLEXIBLE_ARRAY_MEMBER]; /* max_prepared_xacts */
} TwoPhaseStateData;
```

The trailing flexible array holds exactly max_prepared_xacts pointers.
Every pointer points into a second allocation region — a flat array of
GlobalTransactionData structs — that is laid out immediately after the
pointer array in the same ShmemInitStruct call. On startup the structs are
linked into freeGXacts; allocation pops from the front.
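The free-list discipline described above can be sketched in isolation. This is a simplified illustration under stated assumptions, not the PostgreSQL code: Pool, Entry, pool_init, pool_alloc, and pool_free are hypothetical names, the capacity is a compile-time constant instead of max_prepared_xacts, and removal from the active array by swapping in the last element is a choice made for this sketch.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_PREPARED 4      /* stands in for max_prepared_xacts */

/* Hypothetical trimmed-down entry: a free-list link plus a payload. */
typedef struct Entry
{
    struct Entry *next;
    unsigned      xid;
} Entry;

typedef struct
{
    Entry *free_list;               /* head of the free list */
    int    num_active;              /* number of valid slots in active[] */
    Entry *active[MAX_PREPARED];    /* pointers to entries in use */
    Entry  storage[MAX_PREPARED];   /* the entries, in the same allocation */
} Pool;

/* Thread every entry onto the free list, as done once at startup. */
void
pool_init(Pool *p)
{
    p->free_list = NULL;
    p->num_active = 0;
    for (int i = 0; i < MAX_PREPARED; i++)
    {
        p->storage[i].next = p->free_list;
        p->free_list = &p->storage[i];
    }
}

/* Pop from the front of the free list; NULL means the pool is exhausted. */
Entry *
pool_alloc(Pool *p, unsigned xid)
{
    Entry *e = p->free_list;

    if (e == NULL)
        return NULL;
    p->free_list = e->next;
    e->xid = xid;
    p->active[p->num_active++] = e;
    return e;
}

/* Drop e from active[] (swap in the last slot), then push it back. */
void
pool_free(Pool *p, Entry *e)
{
    for (int i = 0; i < p->num_active; i++)
    {
        if (p->active[i] == e)
        {
            p->active[i] = p->active[--p->num_active];
            e->next = p->free_list;
            p->free_list = e;
            return;
        }
    }
    assert(0 && "entry not found in active[]");
}
```

Because the pool never grows, exhausting it is an ordinary error condition for the caller to report rather than an allocation failure.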
Each GlobalTransactionData carries both the 2PC metadata and a reference to
a dummy PGPROC from the PreparedXactProcs[] array (a sibling allocation
in InitProcGlobal):
```c
// GlobalTransactionData: src/backend/access/transam/twophase.c
typedef struct GlobalTransactionData
{
    GlobalTransaction next;         /* free-list link */
    int         pgprocno;           /* ID of associated dummy PGPROC */
    TimestampTz prepared_at;
    XLogRecPtr  prepare_start_lsn;  /* read state from WAL at this LSN */
    XLogRecPtr  prepare_end_lsn;    /* wait for sync replication up to here */
    TransactionId xid;
    Oid         owner;
    ProcNumber  locking_backend;    /* backend currently finishing it */
    bool        valid;              /* true once in ProcArray */
    bool        ondisk;             /* true once promoted to pg_twophase/ */
    bool        inredo;             /* true if added via WAL replay */
    char        gid[GIDSIZE];
} GlobalTransactionData;
```

The valid / ondisk / inredo trio drives almost all branching in the
subsystem. valid means the GXACT is ready for a second-phase command;
ondisk switches the data source from XlogReadTwoPhaseData to
ReadTwoPhaseFile; inredo marks entries that arrived through WAL replay
rather than normal PREPARE TRANSACTION.
Figure 1 — TwoPhaseState shared-memory layout
```mermaid
flowchart LR
    subgraph shmem["Shared memory (TwoPhaseStateLock)"]
        TS["TwoPhaseStateData<br/>freeGXacts → freelist<br/>numPrepXacts<br/>prepXacts[0..N-1]"]
        G0["GlobalTransactionData[0]<br/>gid / xid / valid / ondisk / inredo<br/>prepare_start_lsn<br/>pgprocno"]
        G1["GlobalTransactionData[1]<br/>..."]
        P0["dummy PGPROC[0]<br/>xid / databaseId / myProcLocks<br/>pid = 0"]
    end
    TS -- "prepXacts[0]" --> G0
    TS -- "prepXacts[1]" --> G1
    G0 -- "pgprocno" --> P0
    P0 -- "ProcArrayAdd" --> PA["ProcArray<br/>(global)"]
```
Figure 1 — TwoPhaseState maps GID-keyed GlobalTransactionData entries to
dummy PGPROC slots. The dummy proc’s presence in the ProcArray makes the
prepared XID visible to snapshot acquisition and lock conflict detection.
Phase 1: PREPARE TRANSACTION — the prepare pipeline
The SQL-level PrepareTransaction() in xact.c orchestrates phase one. It
calls into twophase.c in three steps:
Step 1 — MarkAsPreparing: reserve the slot and check for GID collisions.
```c
// MarkAsPreparing: src/backend/access/transam/twophase.c
GlobalTransaction
MarkAsPreparing(TransactionId xid, const char *gid,
                TimestampTz prepared_at, Oid owner, Oid databaseid)
{
    // ... length check, max_prepared_xacts guard ...
    LWLockAcquire(TwoPhaseStateLock, LW_EXCLUSIVE);

    /* Check for duplicate GID */
    for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
        if (strcmp(TwoPhaseState->prepXacts[i]->gid, gid) == 0)
            ereport(ERROR, ...);    /* duplicate GID */

    gxact = TwoPhaseState->freeGXacts;          /* pop from freelist */
    TwoPhaseState->freeGXacts = gxact->next;
    MarkAsPreparingGuts(gxact, xid, gid, ...);  /* init PGPROC + gxact fields */
    gxact->ondisk = false;
    TwoPhaseState->prepXacts[TwoPhaseState->numPrepXacts++] = gxact;
    LWLockRelease(TwoPhaseStateLock);
    return gxact;
}
```

MarkAsPreparingGuts initializes the dummy PGPROC (zeroing it, setting
pid = 0, copying the lxid for TwoPhaseGetXidByVirtualXID, setting up empty
myProcLocks[] partitions) and stores MyLockedGxact = gxact. The gxact is
in the prepXacts array but valid = false, so other backends cannot yet lock
it for second-phase operations.
Step 2 — StartPrepare: serialize the state blob.
```c
// StartPrepare: src/backend/access/transam/twophase.c
void
StartPrepare(GlobalTransaction gxact)
{
    TwoPhaseFileHeader hdr;

    // ... collect children, commitrels, abortrels, stats, invalmsgs ...
    hdr.magic = TWOPHASE_MAGIC;
    hdr.xid = gxact->xid;
    hdr.nsubxacts = xactGetCommittedChildren(&children);
    hdr.ncommitrels = smgrGetPendingDeletes(true, &commitrels);
    hdr.nabortrels = smgrGetPendingDeletes(false, &abortrels);
    hdr.ninvalmsgs = xactGetCommittedInvalidationMessages(&invalmsgs, ...);
    // ... save_state_data for each section ...
}
```

save_state_data appends to a static chain of StateFileChunk blocks
(records — a module-level linked list). The 2PC resource managers then call
RegisterTwoPhaseRecord to append their own records to the same chain; the
lock manager, pgstat, multixact, and predicate lock manager each serialize
their per-transaction state here.
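The append-with-padding idea behind save_state_data and RegisterTwoPhaseRecord can be sketched as follows. This is an illustration under stated assumptions: it uses one fixed-size buffer instead of a chain of chunks, 8-byte alignment instead of MAXALIGN, and hypothetical names (StateBuf, RecHeader, save_bytes, register_record) that do not appear in PostgreSQL.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define ALIGN8(n) (((n) + 7u) & ~(size_t) 7u)

typedef struct
{
    unsigned char data[256];
    size_t        used;     /* always a multiple of 8 */
} StateBuf;

/* Hypothetical record header; field names are illustrative. */
typedef struct
{
    uint32_t len;   /* payload length, excluding padding */
    uint8_t  rmid;  /* which resource manager owns the record */
} RecHeader;

/* Append len bytes, then zero-fill up to the next 8-byte boundary. */
void
save_bytes(StateBuf *b, const void *src, size_t len)
{
    size_t padded = ALIGN8(len);

    assert(b->used + padded <= sizeof(b->data));
    memcpy(b->data + b->used, src, len);
    memset(b->data + b->used + len, 0, padded - len);
    b->used += padded;
}

/* Append a header followed by an optional payload, each padded separately. */
void
register_record(StateBuf *b, uint8_t rmid, const void *payload, uint32_t len)
{
    RecHeader h;

    memset(&h, 0, sizeof(h));   /* do not copy indeterminate struct padding */
    h.len = len;
    h.rmid = rmid;
    save_bytes(b, &h, sizeof(h));
    if (len > 0)
        save_bytes(b, payload, len);
}
```

Keeping every section aligned means a reader can cast a pointer into the buffer to the section's type without copying, which is how the second phase walks the blob.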
Step 3 — EndPrepare: write to WAL, flush, mark valid.
```c
// EndPrepare: src/backend/access/transam/twophase.c
void
EndPrepare(GlobalTransaction gxact)
{
    /* append end-sentinel record */
    RegisterTwoPhaseRecord(TWOPHASE_RM_END_ID, 0, NULL, 0);

    /* fill total_len back into the header */
    hdr->total_len = records.total_len + sizeof(pg_crc32c);

    START_CRIT_SECTION();
    MyProc->delayChkptFlags |= DELAY_CHKPT_START;   /* prevent checkpoint race */

    XLogBeginInsert();
    for (record = records.head; record != NULL; record = record->next)
        XLogRegisterData(record->data, record->len);
    XLogSetRecordFlags(XLOG_INCLUDE_ORIGIN);
    gxact->prepare_end_lsn = XLogInsert(RM_XACT_ID, XLOG_XACT_PREPARE);
    XLogFlush(gxact->prepare_end_lsn);  /* durable before returning to client */

    gxact->prepare_start_lsn = ProcLastRecPtr;

    MarkAsPrepared(gxact, false);   /* valid = true, ProcArrayAdd(dummy PGPROC) */

    MyProc->delayChkptFlags &= ~DELAY_CHKPT_START;
    MyLockedGxact = gxact;
    END_CRIT_SECTION();

    SyncRepWaitForLSN(gxact->prepare_end_lsn, false);   /* sync replication */
}
```

The DELAY_CHKPT_START flag on MyProc prevents a checkpoint from completing
between WAL insert and MarkAsPrepared — without it the checkpoint would not
fsync the not-yet-valid gxact, violating the invariant that checkpoint sees all
prepared transactions. The flush is unconditional: there is no synchronous_commit = off shortcut for 2PC. The prepare record is durably on disk before the
client receives the PREPARE TRANSACTION acknowledgment.
After MarkAsPrepared, the dummy PGPROC is in the ProcArray. There is briefly
a window where the same XID appears twice in the ProcArray — once in MyProc
(the real backend, still clearing its own XID) and once in the dummy proc. This
double-presence is intentional: it prevents a gap where TransactionIdIsInProgress
would report the XID as not running, which could mislead concurrent snapshots.
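The hand-over from the real backend to the dummy proc can be modeled with two slots in a toy process array. This is a minimal sketch with hypothetical names (ProcSlot, xid_in_progress, hand_over_to_proxy); the real ProcArray carries far more state and is protected by locks that are omitted here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t Xid;
#define INVALID_XID 0

/* Hypothetical, heavily simplified stand-in for a process-array slot. */
typedef struct
{
    int pid;    /* 0 marks a proxy entry with no live backend */
    Xid xid;    /* INVALID_XID if the slot runs no transaction */
} ProcSlot;

/* True if any slot, live or proxy, currently owns xid. */
bool
xid_in_progress(const ProcSlot *slots, int nslots, Xid xid)
{
    assert(xid != INVALID_XID);
    for (int i = 0; i < nslots; i++)
        if (slots[i].xid == xid)
            return true;    /* pid is deliberately not consulted */
    return false;
}

/*
 * Move a transaction from its backend's slot to a proxy slot. The proxy is
 * filled in before the backend's slot is cleared, so the xid is visible in
 * at least one slot at every point in between.
 */
void
hand_over_to_proxy(ProcSlot *backend, ProcSlot *proxy)
{
    proxy->pid = 0;
    proxy->xid = backend->xid;      /* xid now appears twice */
    backend->xid = INVALID_XID;     /* and now only in the proxy */
}
```

The lookup never inspects pid, which is the point of the proxy design: code that asks whether an XID is running does not need to know whether a live backend stands behind it.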
Figure 2 — PREPARE TRANSACTION state transitions
```mermaid
stateDiagram-v2
    [*] --> Reserving : MarkAsPreparing\nvalid=false
    Reserving --> Serializing : StartPrepare\nassemble blob
    Serializing --> WALWritten : EndPrepare\nXLogInsert+Flush
    WALWritten --> Prepared : MarkAsPrepared\nvalid=true ProcArrayAdd
    Prepared --> [*] : COMMIT PREPARED\nor ROLLBACK PREPARED
```
Figure 2 — The four internal states a GXACT passes through during
PREPARE TRANSACTION. The transition to WALWritten is the durable point;
Prepared makes the dummy PGPROC visible to other backends.
Phase 2: COMMIT PREPARED / ROLLBACK PREPARED
Any backend connected to the same database, not only the one that prepared the transaction, can complete it. The
entry point is FinishPreparedTransaction(gid, isCommit):
```c
// FinishPreparedTransaction: src/backend/access/transam/twophase.c
void
FinishPreparedTransaction(const char *gid, bool isCommit)
{
    gxact = LockGXact(gid, GetUserId());    /* validate GID, set locking_backend */
    xid = gxact->xid;

    /* read state: from WAL if !ondisk, from file if ondisk */
    if (gxact->ondisk)
        buf = ReadTwoPhaseFile(xid, false);
    else
        XlogReadTwoPhaseData(gxact->prepare_start_lsn, &buf, NULL);

    /* disassemble header: children[], commitrels[], abortrels[],
     * commitstats[], abortstats[], invalmsgs[] */
    hdr = (TwoPhaseFileHeader *) buf;
    bufptr = buf + MAXALIGN(sizeof(TwoPhaseFileHeader));
    bufptr += MAXALIGN(hdr->gidlen);
    children = (TransactionId *) bufptr;     bufptr += ...;
    commitrels = (RelFileLocator *) bufptr;  bufptr += ...;
    abortrels = (RelFileLocator *) bufptr;   bufptr += ...;
    // ... stats, invalmsgs ...

    HOLD_INTERRUPTS();

    /* 1. Write commit/abort WAL record (always flushes) */
    if (isCommit)
        RecordTransactionCommitPrepared(xid, ...);
    else
        RecordTransactionAbortPrepared(xid, ...);

    /* 2. Remove dummy proc from ProcArray */
    ProcArrayRemove(proc, latestXid);

    gxact->valid = false;   /* protect against re-entry on failure */

    /* 3. Drop pending relations, execute stat drops.
     * delrels/ndelrels: commitrels on commit, abortrels on abort. */
    DropRelationFiles(delrels, ndelrels, false);
    pgstat_execute_transactional_drops(...);

    /* 4. Send cache invalidation messages (commit only) */
    if (isCommit)
        SendSharedInvalidMessages(invalmsgs, hdr->ninvalmsgs);

    /* 5. Run 2PC rmgr callbacks (lock release, pgstat, multixact, predlock) */
    LWLockAcquire(TwoPhaseStateLock, LW_EXCLUSIVE);
    if (isCommit)
        ProcessRecords(bufptr, xid, twophase_postcommit_callbacks);
    else
        ProcessRecords(bufptr, xid, twophase_postabort_callbacks);
    PredicateLockTwoPhaseFinish(xid, isCommit);
    RemoveGXact(gxact);     /* return to freelist */
    LWLockRelease(TwoPhaseStateLock);

    /* 6. Remove pg_twophase file if it was promoted to disk.
     * ondisk: gxact->ondisk, saved before the entry was released. */
    if (ondisk)
        RemoveTwoPhaseFile(xid, true);

    RESUME_INTERRUPTS();
}
```

The ordering is deliberate: WAL record first (the irreversible decision), then
ProcArray removal (XID becomes “not running”), then resource cleanup, then lock
release via the callback chain. This mirrors CommitTransaction’s ordering for
single-backend commits and ensures the same safety properties: no other backend
can observe a committed-but-not-cleaned state because ProcArray removal happens
before lock release.
LockGXact serializes concurrent COMMIT/ROLLBACK attempts on the same GID by
storing the calling backend’s ProcNumber into gxact->locking_backend; the
second attempt sees locking_backend != INVALID_PROC_NUMBER and raises an
error. It also enforces that the committing backend is in the same database as
the prepared transaction (a restriction imposed by NOTIFY and other
database-local state).
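The claim step performed by LockGXact can be sketched as a check-then-set over a small table. This is a simplified model with hypothetical names (Prepared, ClaimResult, claim_prepared): it returns status codes instead of raising errors, omits the ownership check, and assumes the caller already holds an exclusive lock on the table.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define INVALID_BACKEND (-1)

typedef enum
{
    CLAIM_OK,
    CLAIM_NOT_FOUND,
    CLAIM_BUSY,
    CLAIM_WRONG_DATABASE
} ClaimResult;

/* Hypothetical, reduced view of a prepared-transaction entry. */
typedef struct
{
    const char *gid;
    int         valid;              /* ready for the second phase */
    unsigned    database;           /* database it was prepared in */
    int         locking_backend;    /* INVALID_BACKEND if unclaimed */
} Prepared;

/*
 * Claim the entry named gid for backend `me`. The caller is assumed to hold
 * an exclusive lock on the table, which makes the check-then-set atomic.
 */
ClaimResult
claim_prepared(Prepared *tab, int n, const char *gid,
               int me, unsigned my_database, Prepared **out)
{
    assert(me != INVALID_BACKEND);
    for (int i = 0; i < n; i++)
    {
        Prepared *p = &tab[i];

        if (!p->valid || strcmp(p->gid, gid) != 0)
            continue;               /* not ready, or a different GID */
        if (p->locking_backend != INVALID_BACKEND)
            return CLAIM_BUSY;      /* someone else is finishing it */
        if (p->database != my_database)
            return CLAIM_WRONG_DATABASE;
        p->locking_backend = me;
        *out = p;
        return CLAIM_OK;
    }
    return CLAIM_NOT_FOUND;
}
```

Entries with valid unset are skipped entirely, so a transaction that is still being prepared looks the same to a second-phase caller as a GID that does not exist.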
The RecordTransactionCommitPrepared path always flushes (XLogFlush) — there
is no synchronous_commit shortcut — because the prepare record that justified
holding the GID slot was itself flushed. The abort path similarly always flushes
because the 2PC state file must be removed after abort, and the WAL record must
precede that removal.
State file format and the WAL ↔ file lifecycle
The serialized state blob (assembled by StartPrepare / RegisterTwoPhaseRecord)
has a fixed structure:

1. TwoPhaseFileHeader (= xl_xact_prepare, includes magic, xid, GID, counts)
2. TransactionId[] (sub-transaction XIDs)
3. RelFileLocator[] (relations to drop on commit)
4. RelFileLocator[] (relations to drop on abort)
5. xl_xact_stats_item[] (pgstat drops on commit)
6. xl_xact_stats_item[] (pgstat drops on abort)
7. SharedInvalidationMessage[] (cache inval messages)
8. TwoPhaseRecordOnDisk* (per-rmgr records: lock state, multixact, predlock)
9. TwoPhaseRecordOnDisk (end sentinel, rmid == TWOPHASE_RM_END_ID)
10. pg_crc32c (CRC-32C over all preceding bytes)

Items 1 to 9 are laid out identically in the WAL record body and in the pg_twophase/
file; the trailing CRC (item 10) is added only in the file, because a WAL record is already covered by WAL's own CRC. On commit, the blob
is re-read by FinishPreparedTransaction; the bufptr walk through the fixed
sections reaches the per-rmgr records, which are dispatched through
ProcessRecords.
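The walk over the per-rmgr records can be sketched as a loop over variable-length, aligned records that stops at the end sentinel. This is an illustration under stated assumptions: RecHeader, RecCallback, process_records, append_record, and count_bytes are hypothetical, the header layout is invented for the example, and 8-byte alignment stands in for MAXALIGN.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define ALIGN8(n)   (((n) + 7u) & ~(size_t) 7u)
#define RM_END_ID   0
#define RM_MAX_ID   2

typedef struct
{
    uint32_t len;   /* payload length, excluding padding */
    uint8_t  rmid;  /* index into the callback table */
} RecHeader;

typedef void (*RecCallback) (uint32_t xid, const void *payload, uint32_t len);

/* Walk records until the end sentinel; returns how many were dispatched. */
int
process_records(const unsigned char *p, uint32_t xid,
                const RecCallback callbacks[])
{
    int dispatched = 0;

    for (;;)
    {
        RecHeader h;

        memcpy(&h, p, sizeof(h));
        assert(h.rmid <= RM_MAX_ID);
        if (h.rmid == RM_END_ID)
            break;
        p += ALIGN8(sizeof(h));
        if (callbacks[h.rmid] != NULL)  /* a NULL slot means "nothing to do" */
        {
            callbacks[h.rmid] (xid, p, h.len);
            dispatched++;
        }
        p += ALIGN8(h.len);
    }
    return dispatched;
}

/* Helper for building test input: append a record, return the new offset. */
size_t
append_record(unsigned char *buf, size_t off, uint8_t rmid,
              const void *payload, uint32_t len)
{
    RecHeader h;

    memset(&h, 0, sizeof(h));
    h.len = len;
    h.rmid = rmid;
    memcpy(buf + off, &h, sizeof(h));
    off += ALIGN8(sizeof(h));
    if (len > 0)
        memcpy(buf + off, payload, len);
    return off + ALIGN8(len);
}

/* Example callback: totals the payload bytes it was handed. */
uint32_t bytes_seen = 0;

void
count_bytes(uint32_t xid, const void *payload, uint32_t len)
{
    (void) xid;
    (void) payload;
    bytes_seen += len;
}
```

Passing a different callback table to the same walker is what lets one serialized blob serve commit, abort, and recovery.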
The WAL-to-file promotion happens in CheckPointTwoPhase:
```c
// CheckPointTwoPhase: src/backend/access/transam/twophase.c
void
CheckPointTwoPhase(XLogRecPtr redo_horizon)
{
    LWLockAcquire(TwoPhaseStateLock, LW_SHARED);
    for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
    {
        GlobalTransaction gxact = TwoPhaseState->prepXacts[i];

        if ((gxact->valid || gxact->inredo) &&
            !gxact->ondisk &&
            gxact->prepare_end_lsn <= redo_horizon)
        {
            XlogReadTwoPhaseData(gxact->prepare_start_lsn, &buf, &len);
            RecreateTwoPhaseFile(gxact->xid, buf, len);     /* write + fsync */
            gxact->ondisk = true;
            gxact->prepare_start_lsn = InvalidXLogRecPtr;   /* WAL pointer cleared */
        }
    }
    LWLockRelease(TwoPhaseStateLock);
    fsync_fname(TWOPHASE_DIR, true);    /* fsync directory for removals too */
}
```

The redo_horizon threshold is the checkpoint’s redo LSN. If a gxact’s
prepare_end_lsn is at or below redo_horizon, its WAL record might be
recycled before the next recovery scan, so it must be promoted to a file. Gxacts
with newer LSNs stay WAL-backed; a transaction that is prepared and finished
between two checkpoints never gets a state file at all.
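The promotion rule and the resulting choice of data source can be isolated in a few lines. This is a sketch with hypothetical names (Gxact, needs_promotion, checkpoint_promote, state_source); the predicate follows the condition quoted above, while the actual file write is replaced by a comment.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t Lsn;
#define INVALID_LSN ((Lsn) 0)

typedef struct
{
    Lsn  prepare_start_lsn;
    Lsn  prepare_end_lsn;
    bool valid;
    bool inredo;
    bool ondisk;
} Gxact;

typedef enum { SRC_WAL, SRC_FILE } StateSource;

/* An entry is promoted once the redo horizon has passed its prepare record. */
bool
needs_promotion(const Gxact *g, Lsn redo_horizon)
{
    return (g->valid || g->inredo) &&
           !g->ondisk &&
           g->prepare_end_lsn <= redo_horizon;
}

/* Promote every qualifying entry; returns how many were promoted. */
int
checkpoint_promote(Gxact *tab, int n, Lsn redo_horizon)
{
    int promoted = 0;

    for (int i = 0; i < n; i++)
    {
        if (!needs_promotion(&tab[i], redo_horizon))
            continue;
        /* A real implementation would write and fsync the state file here. */
        tab[i].ondisk = true;
        tab[i].prepare_start_lsn = INVALID_LSN;
        promoted++;
    }
    return promoted;
}

/* Where the second phase should read the state from. */
StateSource
state_source(const Gxact *g)
{
    assert(g->ondisk || g->prepare_start_lsn != INVALID_LSN);
    return g->ondisk ? SRC_FILE : SRC_WAL;
}
```

With a redo horizon of 200, an entry whose prepare record ends at 180 is promoted and one ending at 380 stays WAL-backed; running the same checkpoint again promotes nothing.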
Figure 3 — State data lifecycle from prepare to commit
```mermaid
flowchart TD
    A["PREPARE TRANSACTION\nEndPrepare writes XLOG_XACT_PREPARE\ngxact.ondisk=false\ngxact.prepare_start_lsn=L"]
    B{"checkpoint\nredo_horizon ≥ L?"}
    C["CheckPointTwoPhase\nRecreateTwoPhaseFile\ngxact.ondisk=true"]
    D{"COMMIT / ROLLBACK\nPREPARED"}
    E["XlogReadTwoPhaseData\nread from WAL at L"]
    F["ReadTwoPhaseFile\nread from pg_twophase/"]
    G["FinishPreparedTransaction\napply callbacks, WAL record, cleanup"]
    A --> B
    B -- "no (fast path)" --> D
    B -- "yes" --> C
    C --> D
    D -- "!ondisk" --> E
    D -- "ondisk" --> F
    E --> G
    F --> G
```
Figure 3 — The state data starts WAL-backed. A checkpoint whose redo horizon
passes the prepare LSN promotes it to pg_twophase/. The second-phase path
reads from whichever store the ondisk flag indicates.
Recovery and replication
At startup, StartupXLOG calls restoreTwoPhaseData to scan pg_twophase/
and populate TwoPhaseState with entries that were already on disk before the
crash. WAL replay then calls PrepareRedoAdd for each XLOG_XACT_PREPARE
record encountered:
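```c
// PrepareRedoAdd: src/backend/access/transam/twophase.c
void
PrepareRedoAdd(char *buf, XLogRecPtr start_lsn, XLogRecPtr end_lsn,
               RepOriginId origin_id)
{
    // ... parse header and GID from buf ...
    gxact = TwoPhaseState->freeGXacts;          /* pop from freelist */
    TwoPhaseState->freeGXacts = gxact->next;
    // ... copy xid, owner, prepared_at, gid from the header ...
    gxact->prepare_start_lsn = start_lsn;
    gxact->prepare_end_lsn = end_lsn;
    gxact->valid = false;       /* no dummy PGPROC in the ProcArray yet */
    gxact->ondisk = (start_lsn == InvalidXLogRecPtr);   /* from disk = no LSN */
    gxact->inredo = true;
    TwoPhaseState->prepXacts[TwoPhaseState->numPrepXacts++] = gxact;
}
```

Entries added via PrepareRedoAdd have inredo = true and valid = false: they are tracked in TwoPhaseState, but their dummy PGPROC is not placed in the ProcArray until the end of recovery. If a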
XLOG_XACT_COMMIT_PREPARED or XLOG_XACT_ABORT_PREPARED record follows in
the WAL stream, PrepareRedoRemove drops the gxact. Entries still present at
end-of-recovery are restored by RecoverPreparedTransactions:
re-running MarkAsPreparingGuts, loading subxact data, calling
MarkAsPrepared to put the dummy PGPROC back in the ProcArray, then
dispatching twophase_recover_callbacks to re-acquire locks and multixact
state. After this, the prepared transactions are indistinguishable from ones
that were prepared on the now-running primary.
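The add-on-prepare, remove-on-finish bookkeeping during replay can be sketched with a small set of XIDs. This is a toy model with hypothetical names (WalRecord, PreparedSet, redo_add, redo_remove, replay); silently ignoring a finish record whose prepare is unknown is an assumption of this sketch.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_PREPARED 8

typedef enum { REC_PREPARE, REC_COMMIT_PREPARED, REC_ABORT_PREPARED } RecType;

typedef struct
{
    RecType  type;
    uint32_t xid;
} WalRecord;

typedef struct
{
    uint32_t xids[MAX_PREPARED];
    int      count;
} PreparedSet;

/* Track a prepared transaction seen in a state file or a PREPARE record. */
void
redo_add(PreparedSet *s, uint32_t xid)
{
    assert(s->count < MAX_PREPARED);
    s->xids[s->count++] = xid;
}

/* Forget a transaction once its second-phase record is replayed. */
void
redo_remove(PreparedSet *s, uint32_t xid)
{
    for (int i = 0; i < s->count; i++)
    {
        if (s->xids[i] == xid)
        {
            s->xids[i] = s->xids[--s->count];
            return;
        }
    }
    /* Assumption: an unknown xid is ignored rather than treated as an error. */
}

/* Replay a log; whatever remains in the set must be restored as prepared. */
void
replay(PreparedSet *s, const WalRecord *wal, int n)
{
    for (int i = 0; i < n; i++)
    {
        if (wal[i].type == REC_PREPARE)
            redo_add(s, wal[i].xid);
        else
            redo_remove(s, wal[i].xid);
    }
}
```

Seeding the set before calling replay models entries loaded from state files; the entries left at the end are the ones that need their dummy procs and locks rebuilt.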
Hot standby needs a lighter form: StandbyRecoverPreparedTransactions runs
ProcessTwoPhaseBuffer with setParent = true (to populate pg_subtrans for
correct snapshot behavior) but does not re-acquire locks — standby query
conflicts with prepared transactions are handled by the StandbyReleaseLockTree
path.
The 2PC resource manager callback tables in twophase_rmgr.c expose four
arrays:
```c
// twophase_rmgr.c: src/backend/access/transam/twophase_rmgr.c
const TwoPhaseCallback twophase_recover_callbacks[TWOPHASE_RM_MAX_ID + 1] = {
    NULL,                           /* END ID */
    lock_twophase_recover,          /* TWOPHASE_RM_LOCK_ID */
    NULL,                           /* TWOPHASE_RM_PGSTAT_ID */
    multixact_twophase_recover,     /* TWOPHASE_RM_MULTIXACT_ID */
    predicatelock_twophase_recover  /* TWOPHASE_RM_PREDICATELOCK_ID */
};
const TwoPhaseCallback twophase_postcommit_callbacks[...] =
    { ..., lock_twophase_postcommit, pgstat_twophase_postcommit, multixact_twophase_postcommit, NULL };
const TwoPhaseCallback twophase_postabort_callbacks[...] =
    { ..., lock_twophase_postabort, pgstat_twophase_postabort, multixact_twophase_postabort, NULL };
const TwoPhaseCallback twophase_standby_recover_callbacks[...] =
    { ..., lock_twophase_standby_recover, NULL, NULL, NULL };
```

predicatelock_twophase_recover re-establishes SSI predicate locks so that
serialization anomaly detection continues to work correctly after a crash
involving a prepared transaction. Post-commit/abort callbacks release the locks
(lock manager), update statistics (pgstat), and handle multixact cleanup.
Source Walkthrough
Shared memory and lifecycle
- TwoPhaseShmemSize / TwoPhaseShmemInit: size calculation and initialization; links PreparedXactProcs[] dummy procs into each GlobalTransactionData.pgprocno.
- MarkAsPreparing: pops a free gxact, checks GID uniqueness, calls MarkAsPreparingGuts to initialize the dummy PGPROC, inserts into prepXacts[].
- MarkAsPreparingGuts: zeroes and initializes the dummy PGPROC struct (pid=0, myProcLocks[], vxid clone); sets MyLockedGxact.
- MarkAsPrepared: sets valid = true, calls ProcArrayAdd.
- LockGXact: linear scan of prepXacts[] for GID, sets locking_backend to prevent concurrent second-phase.
- RemoveGXact: removes from prepXacts[] array, pushes back onto freelist.
- AtAbort_Twophase / PostPrepare_Twophase: hooks that end the preparing backend's hold on the gxact, on error and after a successful prepare respectively.
Prepare pipeline
- StartPrepare: allocates records chain, writes TwoPhaseFileHeader, calls GXactLoadSubxactData.
- save_state_data: appends MAXALIGN-padded bytes to records chain.
- RegisterTwoPhaseRecord: public API for 2PC resource managers; appends a TwoPhaseRecordOnDisk header plus data.
- EndPrepare: appends end sentinel, fills total_len, writes XLOG_XACT_PREPARE via XLogInsert + XLogFlush, calls MarkAsPrepared, waits for sync replication.
Second-phase completion
- FinishPreparedTransaction: main dispatch: LockGXact → read state → WAL commit/abort record → ProcArrayRemove → drop files → send invals → ProcessRecords (callbacks) → RemoveGXact → optional file removal.
- RecordTransactionCommitPrepared / RecordTransactionAbortPrepared: mirror of RecordTransactionCommit / RecordTransactionAbort in xact.c but always flushes; cannot use synchronous_commit = off.
- ProcessRecords: walks the TwoPhaseRecordOnDisk chain dispatching each record’s rmid to the appropriate callback array.
State file I/O
- ReadTwoPhaseFile: opens pg_twophase/<XID>, validates magic + CRC-32C, returns palloc’d buffer.
- XlogReadTwoPhaseData: allocates an XLogReaderState, reads the XLOG_XACT_PREPARE record at prepare_start_lsn, returns the data portion.
- RecreateTwoPhaseFile: writes content + CRC-32C to pg_twophase/<XID>, fsyncs (recovery end-checkpoint will not re-fsync).
- RemoveTwoPhaseFile: unlink(pg_twophase/<XID>).
- CheckPointTwoPhase: iterates prepXacts[], promotes any gxact whose prepare_end_lsn ≤ redo_horizon from WAL to file; fsyncs directory.
- TwoPhaseFilePath: formats pg_twophase/<16-hex-digit FullXID> path.
Recovery
- restoreTwoPhaseData: called at recovery start; scans pg_twophase/, calls ProcessTwoPhaseBuffer + PrepareRedoAdd for each valid file.
- PrepareRedoAdd: called during WAL replay of XLOG_XACT_PREPARE; creates gxact with inredo = true and valid = false, records LSN pointers, inserts into prepXacts[].
- PrepareRedoRemove: called during WAL replay of commit/abort prepared; removes gxact from TwoPhaseState, optionally removes file.
- PrescanPreparedTransactions: post-WAL-replay scan to compute oldest prepared XID (for pg_subtrans startup) and collect XIDs for TransamVariables->nextXid advancement.
- StandbyRecoverPreparedTransactions: hot standby setup: populates pg_subtrans via SubTransSetParent without re-acquiring locks.
- RecoverPreparedTransactions: end-of-recovery: re-initializes each dummy PGPROC, calls MarkAsPrepared, calls twophase_recover_callbacks to re-acquire locks and multixact state; calls PostPrepare_Twophase to unlock gxact.
- ProcessTwoPhaseBuffer: common helper: reads from disk or WAL, validates header, optionally advances nextXid, optionally sets subtrans parents.
Position hints (as of 2026-06-05, commit 273fe94)
| Symbol | File | Line |
|---|---|---|
| TwoPhaseStateData | src/backend/access/transam/twophase.c | 176 |
| GlobalTransactionData | src/backend/access/transam/twophase.c | 147 |
| TwoPhaseShmemSize | src/backend/access/transam/twophase.c | 237 |
| TwoPhaseShmemInit | src/backend/access/transam/twophase.c | 252 |
| MarkAsPreparing | src/backend/access/transam/twophase.c | 358 |
| MarkAsPreparingGuts | src/backend/access/transam/twophase.c | 432 |
| MarkAsPrepared | src/backend/access/transam/twophase.c | 529 |
| LockGXact | src/backend/access/transam/twophase.c | 551 |
| RemoveGXact | src/backend/access/transam/twophase.c | 627 |
| StartPrepare | src/backend/access/transam/twophase.c | 1050 |
| EndPrepare | src/backend/access/transam/twophase.c | 1143 |
| RegisterTwoPhaseRecord | src/backend/access/transam/twophase.c | 1264 |
| ReadTwoPhaseFile | src/backend/access/transam/twophase.c | 1287 |
| XlogReadTwoPhaseData | src/backend/access/transam/twophase.c | 1404 |
| FinishPreparedTransaction | src/backend/access/transam/twophase.c | 1487 |
| ProcessRecords | src/backend/access/transam/twophase.c | 1681 |
| RecreateTwoPhaseFile | src/backend/access/transam/twophase.c | 1727 |
| CheckPointTwoPhase | src/backend/access/transam/twophase.c | 1807 |
| restoreTwoPhaseData | src/backend/access/transam/twophase.c | 1888 |
| PrepareRedoAdd | src/backend/access/transam/twophase.c | 2469 |
| PrepareRedoRemove | src/backend/access/transam/twophase.c | 2572 |
| PrescanPreparedTransactions | src/backend/access/transam/twophase.c | 1952 |
| StandbyRecoverPreparedTransactions | src/backend/access/transam/twophase.c | 2033 |
| RecoverPreparedTransactions | src/backend/access/transam/twophase.c | 2073 |
| ProcessTwoPhaseBuffer | src/backend/access/transam/twophase.c | 2176 |
| twophase_recover_callbacks | src/backend/access/transam/twophase_rmgr.c | 24 |
| twophase_postcommit_callbacks | src/backend/access/transam/twophase_rmgr.c | 33 |
| twophase_postabort_callbacks | src/backend/access/transam/twophase_rmgr.c | 42 |
| twophase_standby_recover_callbacks | src/backend/access/transam/twophase_rmgr.c | 51 |
| TWOPHASE_RM_LOCK_ID | src/include/access/twophase_rmgr.h | 23 |
| GIDSIZE | src/include/access/xact.h | — |
Source verification (as of 2026-06-05)
Section titled “Source verification (as of 2026-06-05)”Verified facts
Section titled “Verified facts”-
`max_prepared_xacts = 0` disables 2PC entirely. `MarkAsPreparing` checks this before doing anything else and raises `ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE`. The default is 0, so users must set `max_prepared_transactions` > 0 explicitly. Verified in `MarkAsPreparing` (`twophase.c` ~373).
- State data is written to WAL first, files only at checkpoint. The comment at the top of `twophase.c` (lines 38–68) and the implementations of `EndPrepare`/`CheckPointTwoPhase` confirm this: on `PREPARE TRANSACTION` only a WAL record is written; `pg_twophase/` files appear only when `prepare_end_lsn ≤ redo_horizon` at checkpoint time. Prepared transactions on the fast path (committed before the next checkpoint) never touch the filesystem. Verified by reading both functions.
- The dummy PGPROC appears in the ProcArray before the preparing backend exits. `MarkAsPrepared` calls `ProcArrayAdd` before `EndPrepare` returns, so the XID is briefly present twice in the ProcArray. The comment at lines 1222–1230 explains that this is deliberate. Verified in `EndPrepare` and `MarkAsPrepared`.
- `COMMIT PREPARED` always flushes WAL, regardless of `synchronous_commit`. `RecordTransactionCommitPrepared` calls `XLogFlush(recptr)` unconditionally (`twophase.c` ~2367). The comment notes “there is no support for async commit of a prepared xact (the very idea is probably a contradiction).” Verified.
- Second-phase is restricted to the same database. `LockGXact` checks `MyDatabaseId != proc->databaseId` and raises an error (`twophase.c` ~595–599). NOTIFY and other database-local side effects are the stated reason. Verified.
- CRC-32C protects state files. `ReadTwoPhaseFile` validates both the magic number (`TWOPHASE_MAGIC = 0x57F94534`) and a CRC-32C over all preceding bytes. `RecreateTwoPhaseFile` recomputes and appends the CRC when writing. Verified in both functions.
- `twophase_recover_callbacks` does not include pgstat. The recover callback for `TWOPHASE_RM_PGSTAT_ID` is `NULL` in `twophase_rmgr.c` (line 28). pgstat does participate in the post-commit and post-abort paths, but recovery of pgstat state is not attempted. Verified by reading `twophase_rmgr.c`.
- `predicatelock_twophase_recover` re-establishes SSI locks after a crash. Verified by tracing `twophase_recover_callbacks[TWOPHASE_RM_PREDICATELOCK_ID]` → `predicatelock_twophase_recover` (in `storage/lmgr/predicate.c`). This ensures prepared transactions do not escape SSI tracking after recovery.
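The magic-plus-CRC check described in the CRC-32C item can be illustrated outside the server. The following is a minimal sketch, assuming a simplified blob layout (a 4-byte magic at offset 0 and a 4-byte CRC-32C trailer) and a slow bitwise CRC. The names `crc32c_bitwise`, `seal_state_blob`, and `check_state_blob` are hypothetical and are not PostgreSQL functions; the real file header carries more fields than the magic.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Magic value quoted in the verification notes. */
#define TWOPHASE_MAGIC 0x57F94534u

/* Bitwise CRC-32C (Castagnoli), reflected polynomial 0x82F63B78. */
static uint32_t crc32c_bitwise(const unsigned char *buf, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;

    for (size_t i = 0; i < len; i++) {
        crc ^= buf[i];
        for (int bit = 0; bit < 8; bit++)
            crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
    }
    return crc ^ 0xFFFFFFFFu;
}

/*
 * Append a CRC over all preceding bytes (assumed layout, see lead-in).
 * The buffer must have room for body_len + 4 bytes. Returns the total length.
 */
static size_t seal_state_blob(unsigned char *buf, size_t body_len)
{
    uint32_t crc = crc32c_bitwise(buf, body_len);

    memcpy(buf + body_len, &crc, sizeof crc);
    return body_len + sizeof crc;
}

/* Returns 1 only if the leading magic and the trailing CRC both match. */
static int check_state_blob(const unsigned char *buf, size_t total_len)
{
    uint32_t magic, stored;

    assert(buf != NULL);
    if (total_len < sizeof magic + sizeof stored)
        return 0;
    memcpy(&magic, buf, sizeof magic);
    if (magic != TWOPHASE_MAGIC)
        return 0;
    memcpy(&stored, buf + total_len - sizeof stored, sizeof stored);
    return stored == crc32c_bitwise(buf, total_len - sizeof stored);
}
```

A caller would write the magic and payload into a buffer, call `seal_state_blob`, and later reject any blob for which `check_state_blob` returns 0. Flipping a single payload bit is enough to make the check fail.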
Open questions
- `GIDSIZE` value. Used in `GlobalTransactionData.gid[GIDSIZE]` but not visible in `twophase.h`; it is defined in `src/include/access/xact.h`. The value is 200 (bytes including the null terminator). Confirm against `xact.h` if the position-hint table is updated.
- Maximum size of the state blob. `EndPrepare` checks `hdr->total_len > MaxAllocSize` and raises an error, but the maximum practical blob size for a transaction with many subxacts, many pending relation drops, and large lock state is not documented. An extreme workload (thousands of subxacts + thousands of locks) could conceivably approach the limit; the exact threshold requires measurement.
- `DELAY_CHKPT_START` vs. `DELAY_CHKPT_COMPLETE` interaction. The comment in `EndPrepare` explains the race being prevented, but the full interaction with `CheckPointTwoPhase`’s own lock protocol under concurrent checkpoint stress has not been traced in this document.
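For the state-blob question, a back-of-the-envelope estimate shows the shape of the arithmetic. This sketch assumes illustrative per-item sizes: the `ASSUMED_*` constants are guesses, not measured values, and the estimate ignores invalidation messages and other components. Only the `0x3fffffff` value of `MaxAllocSize` and the 4-byte `TransactionId` are taken from PostgreSQL; `estimate_state_blob` and `exceeds_max_alloc` are hypothetical helpers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* PostgreSQL's MaxAllocSize is 0x3fffffff (1 GB - 1). */
#define MAX_ALLOC_SIZE UINT64_C(0x3fffffff)

/* ASSUMED sizes, for illustration only; measure a real build instead. */
#define ASSUMED_HEADER_BYTES  UINT64_C(128) /* fixed header plus GID */
#define ASSUMED_SUBXACT_BYTES UINT64_C(4)   /* one 32-bit TransactionId each */
#define ASSUMED_RELDROP_BYTES UINT64_C(12)  /* one relation locator each */
#define ASSUMED_LOCK_BYTES    UINT64_C(32)  /* record header, tag, mode, padding */

/* Rough size of a state blob under the assumptions above. */
static uint64_t estimate_state_blob(size_t nsubxacts, size_t nreldrops,
                                    size_t nlocks)
{
    /* Sanity bound so the toy arithmetic cannot overflow 64 bits. */
    assert(nsubxacts < 100000000 && nreldrops < 100000000 &&
           nlocks < 100000000);

    return ASSUMED_HEADER_BYTES
         + (uint64_t) nsubxacts * ASSUMED_SUBXACT_BYTES
         + (uint64_t) nreldrops * ASSUMED_RELDROP_BYTES
         + (uint64_t) nlocks * ASSUMED_LOCK_BYTES;
}

/* Mirrors the shape of the total_len > MaxAllocSize check. */
static int exceeds_max_alloc(uint64_t total_len)
{
    return total_len > MAX_ALLOC_SIZE;
}
```

Under these assumed sizes, 10,000 subxacts, 1,000 relation drops, and 10,000 locks come to 372,128 bytes, which `exceeds_max_alloc` does not flag. A measured build may differ, so this does not settle the open question.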
Beyond PostgreSQL — Comparative Designs & Research Frontiers
- X/Open XA standard. The `PREPARE TRANSACTION`/`COMMIT PREPARED`/`ROLLBACK PREPARED` SQL syntax and the GID string map directly onto the XA interface (X/Open CAE Specification, Distributed Transaction Processing: The XA Specification, 1991). PostgreSQL’s `twophase_rmgr.c` callback arrays play the XA resource-manager role inside the server. A comparison of PostgreSQL’s crash-recovery behavior with XA’s “heuristic completion” concept (where a resource manager makes a unilateral decision after a timeout) would clarify what PostgreSQL does and does not guarantee without a TM.
- Distributed transaction coordinators (TM-less 2PC). PostgreSQL exposes 2PC primitives but ships no transaction manager; the application or middleware (Pgpool-II, `postgres_fdw` + `pg_prepared_xacts`, Patroni, Citus) acts as coordinator. Comparing with Spanner’s “Paxos-based 2PC” (Corbett et al., 2012) or CockroachDB’s parallel commit shows how removing the blocking second phase or the coordinator as a single point of failure changes the latency/availability tradeoff.
- Saga pattern as 2PC alternative. Long-lived distributed transactions that cannot afford to block between the prepare vote and the second-phase decision often use sagas (Garcia-Molina & Salem, Sagas, SIGMOD 1987): a sequence of local transactions, each with a compensating action. PostgreSQL’s 2PC is the classical alternative; a side-by-side of when each is appropriate would complement the `postgres-xact.md` discussion of subtransaction vs. savepoint.
- MySQL/InnoDB XA path. InnoDB exposes `XA PREPARE`/`XA COMMIT`/`XA ROLLBACK` with a hash table of prepared transactions in the lock manager layer. Unlike PostgreSQL’s dummy PGPROC approach, InnoDB’s prepared XA transactions do not maintain an entry in a shared process array visible to all backends for lock conflict detection; conflicts are resolved at the InnoDB transaction layer instead. The two designs represent different trade-off points between integration depth and code complexity.
- Incremental backup and 2PC file footprint. PostgreSQL’s incremental backup feature (PG17+, `pg_combinebackup`) must handle `pg_twophase/` files correctly, since they represent durable state that is not directly recoverable from WAL if the WAL segment was recycled. The interaction between `CheckPointTwoPhase`’s promotion logic and incremental backup’s changed-block tracking is a potential research area for correctness verification.
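The coordinator role that PostgreSQL leaves to the application can be sketched as a small simulation. This is not a real transaction manager: `Participant`, `run_two_phase`, and the mock callbacks are hypothetical, the participants are in-memory structs rather than database connections, and the durable decision log that a real coordinator needs is only marked by a comment.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { TXN_COMMITTED, TXN_ABORTED } TxnOutcome;
typedef enum { P_ACTIVE, P_PREPARED, P_COMMITTED, P_ABORTED } PartState;

/* Mock participant; can_prepare = 0 makes its prepare vote fail. */
typedef struct Participant {
    PartState state;
    int can_prepare;
} Participant;

/* Stands in for PREPARE TRANSACTION 'gid'; a failed prepare aborts. */
static int participant_prepare(Participant *p)
{
    p->state = p->can_prepare ? P_PREPARED : P_ABORTED;
    return p->state == P_PREPARED;
}

/* Stands in for COMMIT PREPARED 'gid'; only legal after a prepare. */
static void participant_commit(Participant *p)
{
    assert(p->state == P_PREPARED);
    p->state = P_COMMITTED;
}

/* Stands in for ROLLBACK PREPARED 'gid', or a plain ROLLBACK if unprepared. */
static void participant_rollback(Participant *p)
{
    p->state = P_ABORTED;
}

/* Drive all participants through prepare, then commit or roll back. */
static TxnOutcome run_two_phase(Participant *parts, size_t n)
{
    size_t i, prepared = 0;

    /* Phase 1: stop at the first participant that cannot prepare. */
    while (prepared < n && participant_prepare(&parts[prepared]))
        prepared++;

    /* A real coordinator must durably log its decision at this point. */

    /* Phase 2: commit only if every participant voted "prepared". */
    if (prepared < n) {
        for (i = 0; i < n; i++)
            participant_rollback(&parts[i]);
        return TXN_ABORTED;
    }
    for (i = 0; i < n; i++)
        participant_commit(&parts[i]);
    return TXN_COMMITTED;
}
```

With two participants that can both prepare, `run_two_phase` commits both. If any one of them refuses, every participant ends up aborted, including those that had already prepared.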
Sources
Source files
- `src/backend/access/transam/twophase.c`: main implementation (2753 lines, commit 273fe94)
- `src/backend/access/transam/twophase_rmgr.c`: 2PC resource manager callback tables (58 lines)
- `src/include/access/twophase.h`: public API declarations
- `src/include/access/twophase_rmgr.h`: `TwoPhaseCallback`, `TwoPhaseRmgrId`, `TWOPHASE_RM_*` constants
Cross-references in this knowledge base
- `knowledge/code-analysis/postgres/postgres-xact.md`: single-backend transaction lifecycle; `PrepareTransaction()` calls into `twophase.c`
- `knowledge/code-analysis/postgres/postgres-xlog-wal.md`: WAL insert mechanics (`XLogBeginInsert`/`XLogInsert`/`XLogFlush`)
- `knowledge/code-analysis/postgres/postgres-recovery-redo.md`: WAL replay, `PrepareRedoAdd`/`PrepareRedoRemove` in the redo path
- `knowledge/code-analysis/postgres/postgres-lock-manager.md`: lock manager 2PC callbacks (`lock_twophase_recover`, `lock_twophase_postcommit`, etc.)
- `knowledge/code-analysis/postgres/postgres-mvcc-snapshots.md`: procarray snapshot acquisition; dummy PGPROC visibility to `GetSnapshotData`
- `knowledge/research/dbms-papers/aries.md`: ARIES recovery theory