PostgreSQL Two-Phase Commit — Global Transactions, Dummy PGPROCs, and WAL-First State Persistence

Two-phase commit (2PC) is the classical protocol for achieving atomic commit across multiple, independently failing resource managers. Jim Gray’s Notes on Data Base Operating Systems (1978) and the subsequent X/Open XA specification (1991) formalize the model: a transaction coordinator drives all participants through a two-round handshake, prepare (can you commit?) then commit (do commit), so that the outcome is either universally committed or universally aborted even if any single participant crashes between the two rounds. The correctness guarantee rests on a single invariant: once a participant votes “prepared,” it may not unilaterally abort; it must wait for the coordinator’s second-phase decision and honor it.

Database System Concepts (Silberschatz, Korth, Sudarshan, 7e, §19.4 “Atomic Commit Protocols”) states the canonical safety property: “if any site crashes after the coordinator sends commit, the transaction must still commit on recovery.” This lifts the durable instant to the coordinator’s decision log write, not the participant’s. The engineering consequence is that every participant must persist enough state at prepare time to re-execute the second phase — commit or abort — after any crash. That persistent state is the 2PC state record, and its durability requirements are strictly stronger than those of a single-backend commit: it must survive restarts that may come arbitrarily long after the prepare.
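The participant-side invariant above (a prepared participant may no longer abort on its own) can be sketched as a small state machine. This is a minimal illustration rather than PostgreSQL code: the `TxState` and `TxEvent` enums and the `tx_step` function are hypothetical names introduced here.

```c
#include <stdbool.h>

/* Hypothetical participant-side states; names are illustrative only. */
typedef enum { TX_ACTIVE, TX_PREPARED, TX_COMMITTED, TX_ABORTED } TxState;

typedef enum
{
    EV_PREPARE,         /* coordinator asks: can you commit? */
    EV_LOCAL_ABORT,     /* participant wants to give up on its own */
    EV_GLOBAL_COMMIT,   /* coordinator's second-phase decision */
    EV_GLOBAL_ABORT
} TxEvent;

/*
 * Apply one event. Returns true and updates *state when the transition
 * is legal; returns false and leaves *state untouched otherwise.
 */
bool
tx_step(TxState *state, TxEvent ev)
{
    switch (*state)
    {
        case TX_ACTIVE:
            if (ev == EV_PREPARE)
            {
                *state = TX_PREPARED;
                return true;
            }
            if (ev == EV_LOCAL_ABORT || ev == EV_GLOBAL_ABORT)
            {
                *state = TX_ABORTED;
                return true;
            }
            return false;       /* cannot commit without preparing first */

        case TX_PREPARED:
            if (ev == EV_GLOBAL_COMMIT)
            {
                *state = TX_COMMITTED;
                return true;
            }
            if (ev == EV_GLOBAL_ABORT)
            {
                *state = TX_ABORTED;
                return true;
            }
            return false;       /* EV_LOCAL_ABORT is refused once prepared */

        default:
            return false;       /* committed and aborted are terminal */
    }
}
```

The only difference between the two non-terminal states is the treatment of `EV_LOCAL_ABORT`, which is exactly the freedom a participant gives up by voting “prepared.”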

Three design choices follow from this model and shape every 2PC implementation:

  1. What must be durably recorded at prepare time? At minimum: the XID, which resource managers participated, and the resources they hold (locks, pending file drops, catalog invalidation messages). Anything not recorded is lost across a crash, making the second-phase completion incorrect.

  2. Where does the prepared state live? A WAL-primary approach writes the state record to WAL and reads it back for commit/rollback. A file-primary approach writes a separate state file immediately. A hybrid defers the file write to checkpoint time, using WAL as the primary backing store for the common fast path.

  3. How does the rest of the system see a prepared transaction? An XID in the “prepared” state must be treated as still running — it holds locks, its writes are not yet visible to others, and it must block XID wraparound cleanup — but its original backend has detached. The engine needs a proxy entry in the shared process array so that lock conflicts, snapshot decisions, and XID-horizon tracking all work correctly without special-casing prepared XIDs everywhere.

PostgreSQL’s twophase.c answers all three: serialize into WAL at prepare, promote to file at checkpoint, and maintain a dummy PGPROC for each prepared XID so the rest of the engine sees it as an ordinary running transaction.

The recovery substrate underneath 2PC is ARIES (Mohan et al., ARIES: A Transaction Recovery Method Supporting Fine-Granularity Locking and Partial Rollbacks Using Write-Ahead Logging, ACM TODS 1992; captured in knowledge/research/dbms-papers/aries.md). ARIES’s repeat-history principle is why replay of a PREPARE WAL record simply re-creates the TwoPhaseState entry in shared memory — the prepared transaction is considered still running, and the second-phase WAL record (commit or abort) will complete it during redo if the second phase happened before the crash.

Across Oracle (two-phase with in-doubt transaction table), IBM DB2 (prepared transaction coordinator log), MySQL/InnoDB (XA with XA PREPARE), SQL Server (DTC integration), and PostgreSQL, four engineering conventions recur.

A dummy process entry per prepared XID. The locking subsystem, the snapshot engine (procarray), and the XID-horizon tracker all operate on the shared process array. Rather than plumbing a special “prepared XID” path through each of these subsystems, every major engine inserts a lightweight proxy entry into the same array that normal backends use. PostgreSQL’s proxy is a full PGPROC struct with pid = 0; Oracle uses an in-doubt transaction table entry that also participates in lock conflict detection; InnoDB maintains a hash of prepared XA transactions visible to the lock manager. The common insight is that the locking and visibility machinery should not need to know whether a “running” transaction has a live backend.

State blob serialized at prepare time, read back at second-phase time. No engine reconstructs the necessary cleanup state from the heap or WAL at commit time — the information that must be acted on (which relations to drop, which catalog caches to invalidate, which locks to release) is gathered once at prepare time and stored in a compact, versioned record. The blob’s schema is driven by the set of subsystems that register cleanup actions; those subsystems become “2PC resource managers” in XA terminology.

Extensible resource manager callback tables. XA formalizes this: any resource manager that participates in a distributed transaction implements prepare, commit, and rollback callbacks. PostgreSQL expresses this as four static arrays of TwoPhaseCallback function pointers indexed by a small TwoPhaseRmgrId (0–4): one array each for recovery, post-commit, post-abort, and standby-recovery. The set of built-in registrants is lock manager, pgstat, multixact, and predicate lock; the arrays are compiled into twophase_rmgr.c.

Lazy promotion from write-ahead log to stable file. A prepared transaction that lives only seconds does not need a dedicated file: the WAL record it wrote at prepare time is still retained in WAL and can be re-read directly by LSN. Writing a separate file for every prepare, even short-lived ones, would generate small-file I/O on every PREPARE TRANSACTION call. Instead, engines write the state to WAL (or an equivalent redo log) and promote it to a dedicated file only at checkpoint time, before the WAL that carries the state can be recycled. This is the pattern PostgreSQL follows: prepare_start_lsn in GlobalTransactionData is the WAL read pointer, and ondisk is the flag that switches the source from WAL to file.

Theory ↔ PostgreSQL mapping

| Theory / convention | PostgreSQL name |
| --- | --- |
| Global transaction ID (GID) | gid[GIDSIZE] in GlobalTransactionData |
| Prepared XID in the running-transaction set | dummy PGPROC at gxact->pgprocno, inserted via ProcArrayAdd |
| Shared table of all prepared transactions | TwoPhaseStateData.prepXacts[], guarded by TwoPhaseStateLock |
| Prepare state blob (WAL primary) | XLOG_XACT_PREPARE record assembled by StartPrepare / EndPrepare |
| Prepare state blob (file secondary) | pg_twophase/<XID> file, promoted at checkpoint by CheckPointTwoPhase |
| 2PC resource manager callbacks | TwoPhaseCallback arrays in twophase_rmgr.c |
| Coordinator decision (commit) | FinishPreparedTransaction(gid, true) → RecordTransactionCommitPrepared |
| Coordinator decision (abort) | FinishPreparedTransaction(gid, false) → RecordTransactionAbortPrepared |
| Recovery of prepared transactions | restoreTwoPhaseData + RecoverPreparedTransactions |

Shared memory layout: TwoPhaseStateData and GlobalTransactionData

The entire 2PC runtime state lives in a single fixed-size shared-memory region allocated by TwoPhaseShmemInit. The root is TwoPhaseStateData:

// TwoPhaseStateData — src/backend/access/transam/twophase.c
typedef struct TwoPhaseStateData
{
    GlobalTransaction freeGXacts;   /* linked list of free entries */
    int         numPrepXacts;       /* number of valid entries */
    GlobalTransaction prepXacts[FLEXIBLE_ARRAY_MEMBER]; /* max_prepared_xacts */
} TwoPhaseStateData;

The trailing flexible array holds exactly max_prepared_xacts pointers. Every pointer points into a second allocation region — a flat array of GlobalTransactionData structs — that is laid out immediately after the pointer array in the same ShmemInitStruct call. On startup the structs are linked into freeGXacts; allocation pops from the front.

Each GlobalTransactionData carries both the 2PC metadata and a reference to a dummy PGPROC from the PreparedXactProcs[] array (a sibling allocation in InitProcGlobal):

// GlobalTransactionData — src/backend/access/transam/twophase.c
typedef struct GlobalTransactionData
{
    GlobalTransaction next;         /* free-list link */
    int         pgprocno;           /* ID of associated dummy PGPROC */
    TimestampTz prepared_at;
    XLogRecPtr  prepare_start_lsn;  /* read state from WAL at this LSN */
    XLogRecPtr  prepare_end_lsn;    /* wait for sync replication up to here */
    TransactionId xid;
    Oid         owner;
    ProcNumber  locking_backend;    /* backend currently finishing it */
    bool        valid;              /* true once in ProcArray */
    bool        ondisk;             /* true once promoted to pg_twophase/ */
    bool        inredo;             /* true if added via WAL replay */
    char        gid[GIDSIZE];
} GlobalTransactionData;

The valid / ondisk / inredo trio drives almost all branching in the subsystem. valid means the GXACT is ready for a second-phase command; ondisk switches the data source from XlogReadTwoPhaseData to ReadTwoPhaseFile; inredo marks entries that arrived through WAL replay rather than normal PREPARE TRANSACTION.

Figure 1 — TwoPhaseState shared-memory layout

flowchart LR
    subgraph shmem["Shared memory (TwoPhaseStateLock)"]
        TS["TwoPhaseStateData<br/>freeGXacts → freelist<br/>numPrepXacts<br/>prepXacts[0..N-1]"]
        G0["GlobalTransactionData[0]<br/>gid / xid / valid / ondisk / inredo<br/>prepare_start_lsn<br/>pgprocno"]
        G1["GlobalTransactionData[1]<br/>..."]
        P0["dummy PGPROC[0]<br/>xid / databaseId / myProcLocks<br/>pid = 0"]
    end
    TS -- "prepXacts[0]" --> G0
    TS -- "prepXacts[1]" --> G1
    G0 -- "pgprocno" --> P0
    P0 -- "ProcArrayAdd" --> PA["ProcArray<br/>(global)"]

Figure 1 — TwoPhaseState maps GID-keyed GlobalTransactionData entries to dummy PGPROC slots. The dummy proc’s presence in the ProcArray makes the prepared XID visible to snapshot acquisition and lock conflict detection.

Phase 1: PREPARE TRANSACTION — the prepare pipeline

The SQL-level PrepareTransaction() in xact.c orchestrates phase one. It calls into twophase.c in three steps:

Step 1 — MarkAsPreparing: reserve the slot and check for GID collisions.

// MarkAsPreparing — src/backend/access/transam/twophase.c
GlobalTransaction
MarkAsPreparing(TransactionId xid, const char *gid,
                TimestampTz prepared_at, Oid owner, Oid databaseid)
{
    // ... length check, max_prepared_xacts guard ...
    LWLockAcquire(TwoPhaseStateLock, LW_EXCLUSIVE);
    /* Check for duplicate GID */
    for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
        if (strcmp(TwoPhaseState->prepXacts[i]->gid, gid) == 0)
            ereport(ERROR, ...);            /* duplicate GID */
    gxact = TwoPhaseState->freeGXacts;      /* pop from freelist */
    TwoPhaseState->freeGXacts = gxact->next;
    MarkAsPreparingGuts(gxact, xid, gid, ...);  /* init PGPROC + gxact fields */
    gxact->ondisk = false;
    TwoPhaseState->prepXacts[TwoPhaseState->numPrepXacts++] = gxact;
    LWLockRelease(TwoPhaseStateLock);
    return gxact;
}

MarkAsPreparingGuts initializes the dummy PGPROC (zeroing it, setting pid = 0, copying the lxid for TwoPhaseGetXidByVirtualXID, setting up empty myProcLocks[] partitions) and stores MyLockedGxact = gxact. The gxact is in the prepXacts array but valid = false, so other backends cannot yet lock it for second-phase operations.
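The layout and reservation step described above can be imitated outside shared memory. The following is a minimal sketch under stated assumptions: `DemoState`, `DemoGXact`, `demo_state_init`, and `demo_reserve` are hypothetical names, `calloc` stands in for the shared-memory allocator, no locking is shown, and 8-byte maximum alignment is assumed.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define DEMO_GIDSIZE 200                              /* GIDSIZE as documented */
#define DEMO_MAXALIGN(n) (((n) + 7u) & ~(size_t) 7u)  /* assumes 8-byte alignment */

typedef struct DemoGXact
{
    struct DemoGXact *next;     /* free-list link */
    unsigned int xid;
    int         valid;          /* stays 0 until the prepare is complete */
    char        gid[DEMO_GIDSIZE];
} DemoGXact;

typedef struct DemoState
{
    DemoGXact  *freeGXacts;     /* head of the free list */
    int         numPrepXacts;   /* number of active entries */
    DemoGXact  *prepXacts[];    /* max_xacts pointers */
} DemoState;

/* One allocation: header plus pointer array, then the flat struct array. */
DemoState *
demo_state_init(int max_xacts)
{
    size_t      ptrs = DEMO_MAXALIGN(offsetof(DemoState, prepXacts) +
                                     sizeof(DemoGXact *) * (size_t) max_xacts);
    size_t      total = ptrs + sizeof(DemoGXact) * (size_t) max_xacts;
    DemoState  *st = calloc(1, total);
    DemoGXact  *gxacts;
    int         i;

    if (st == NULL)
        return NULL;
    gxacts = (DemoGXact *) ((char *) st + ptrs);
    for (i = 0; i < max_xacts; i++)
    {
        gxacts[i].next = st->freeGXacts;    /* push onto the free list */
        st->freeGXacts = &gxacts[i];
    }
    return st;
}

/* Reserve a slot; NULL on over-long GID, duplicate GID, or no free slot. */
DemoGXact *
demo_reserve(DemoState *st, unsigned int xid, const char *gid)
{
    DemoGXact  *g;
    int         i;

    if (strlen(gid) >= DEMO_GIDSIZE)
        return NULL;
    for (i = 0; i < st->numPrepXacts; i++)
        if (strcmp(st->prepXacts[i]->gid, gid) == 0)
            return NULL;
    if (st->freeGXacts == NULL)
        return NULL;
    g = st->freeGXacts;                     /* pop from the free list */
    st->freeGXacts = g->next;
    g->xid = xid;
    g->valid = 0;
    strcpy(g->gid, gid);
    st->prepXacts[st->numPrepXacts++] = g;
    return g;
}
```

Because the pointer array and the struct array share one allocation, a single `free(st)` releases everything; in the real subsystem the region lives for the lifetime of the server and every access is guarded by TwoPhaseStateLock.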

Step 2 — StartPrepare: serialize the state blob.

// StartPrepare — src/backend/access/transam/twophase.c
void
StartPrepare(GlobalTransaction gxact)
{
    TwoPhaseFileHeader hdr;
    // ... collect children, commitrels, abortrels, stats, invalmsgs ...
    hdr.magic = TWOPHASE_MAGIC;
    hdr.xid = gxact->xid;
    hdr.nsubxacts = xactGetCommittedChildren(&children);
    hdr.ncommitrels = smgrGetPendingDeletes(true, &commitrels);
    hdr.nabortrels = smgrGetPendingDeletes(false, &abortrels);
    hdr.ninvalmsgs = xactGetCommittedInvalidationMessages(&invalmsgs, ...);
    // ... save_state_data for each section ...
}

save_state_data appends to a static chain of StateFileChunk blocks (records — a module-level linked list). The 2PC resource managers then call RegisterTwoPhaseRecord to append their own records to the same chain; the lock manager, pgstat, multixact, and predicate lock manager each serialize their per-transaction state here.
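The chunk chain that save_state_data appends to can be sketched as follows. This is an illustration under assumptions, not the PostgreSQL implementation: `DemoChunk`, `DemoRecords`, `demo_save`, `demo_flatten`, and `demo_free` are hypothetical names, `calloc`/`malloc` replace `palloc`, the 512-byte minimum chunk size is chosen for the demo, and padding is to 8 bytes.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define DEMO_MAXALIGN(n) (((n) + 7u) & ~(uint32_t) 7u)
#define DEMO_CHUNK_MIN   512u   /* assumed minimum chunk capacity */

typedef struct DemoChunk
{
    char       *data;
    uint32_t    len;            /* bytes used, always a multiple of 8 */
    struct DemoChunk *next;
} DemoChunk;

typedef struct DemoRecords
{
    DemoChunk  *head;
    DemoChunk  *tail;
    uint32_t    bytes_free;     /* free space left in the tail chunk */
    uint32_t    total_len;      /* padded bytes across all chunks */
} DemoRecords;

/* Append len bytes padded to 8; returns 0 on success, -1 on allocation failure. */
int
demo_save(DemoRecords *r, const void *data, uint32_t len)
{
    uint32_t    padlen = DEMO_MAXALIGN(len);

    if (r->tail == NULL || padlen > r->bytes_free)
    {
        uint32_t    cap = padlen > DEMO_CHUNK_MIN ? padlen : DEMO_CHUNK_MIN;
        DemoChunk  *c = calloc(1, sizeof(DemoChunk));

        if (c == NULL)
            return -1;
        c->data = calloc(1, cap);   /* zeroed so padding bytes are deterministic */
        if (c->data == NULL)
        {
            free(c);
            return -1;
        }
        if (r->tail != NULL)
            r->tail->next = c;
        else
            r->head = c;
        r->tail = c;
        r->bytes_free = cap;
    }
    if (len > 0)
        memcpy(r->tail->data + r->tail->len, data, len);
    r->tail->len += padlen;
    r->bytes_free -= padlen;
    r->total_len += padlen;
    return 0;
}

/* Copy the chain into one contiguous malloc'd buffer of total_len bytes. */
char *
demo_flatten(const DemoRecords *r)
{
    char       *out = malloc(r->total_len > 0 ? r->total_len : 1);
    uint32_t    off = 0;
    const DemoChunk *c;

    if (out == NULL)
        return NULL;
    for (c = r->head; c != NULL; c = c->next)
    {
        memcpy(out + off, c->data, c->len);
        off += c->len;
    }
    return out;
}

/* Release every chunk and reset the chain to empty. */
void
demo_free(DemoRecords *r)
{
    DemoChunk  *c = r->head;

    while (c != NULL)
    {
        DemoChunk  *next = c->next;

        free(c->data);
        free(c);
        c = next;
    }
    memset(r, 0, sizeof(*r));
}
```

Keeping every append padded means each section of the blob starts on an aligned boundary, which is what lets the reader later cast `bufptr` directly to typed arrays.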

Step 3 — EndPrepare: write to WAL, flush, mark valid.

// EndPrepare — src/backend/access/transam/twophase.c
void
EndPrepare(GlobalTransaction gxact)
{
    /* append end-sentinel record */
    RegisterTwoPhaseRecord(TWOPHASE_RM_END_ID, 0, NULL, 0);
    /* fill total_len back into the header, which is the first chunk */
    hdr = (TwoPhaseFileHeader *) records.head->data;
    hdr->total_len = records.total_len + sizeof(pg_crc32c);
    START_CRIT_SECTION();
    MyProc->delayChkptFlags |= DELAY_CHKPT_START;   /* prevent checkpoint race */
    XLogBeginInsert();
    for (record = records.head; record != NULL; record = record->next)
        XLogRegisterData(record->data, record->len);
    XLogSetRecordFlags(XLOG_INCLUDE_ORIGIN);
    gxact->prepare_end_lsn = XLogInsert(RM_XACT_ID, XLOG_XACT_PREPARE);
    XLogFlush(gxact->prepare_end_lsn);  /* durable before returning to client */
    gxact->prepare_start_lsn = ProcLastRecPtr;
    MarkAsPrepared(gxact, false);       /* valid = true, ProcArrayAdd(dummy PGPROC) */
    MyProc->delayChkptFlags &= ~DELAY_CHKPT_START;
    MyLockedGxact = gxact;
    END_CRIT_SECTION();
    SyncRepWaitForLSN(gxact->prepare_end_lsn, false);   /* sync replication */
}

The DELAY_CHKPT_START flag on MyProc closes a race with the checkpointer. Without it, a checkpoint that starts right after the WAL insert could place its redo point past the prepare record and run CheckPointTwoPhase before MarkAsPrepared has made the gxact valid; the gxact would be skipped, no state file would be fsynced, and a later recovery starting at that redo point would never replay the prepare record. The flush is unconditional: there is no synchronous_commit = off shortcut for 2PC. The prepare record is durably on disk before the client receives the PREPARE TRANSACTION acknowledgment.

After MarkAsPrepared, the dummy PGPROC is in the ProcArray. There is briefly a window where the same XID appears twice in the ProcArray — once in MyProc (the real backend, still clearing its own XID) and once in the dummy proc. This double-presence is intentional: it prevents a gap where TransactionIdIsInProgress would report the XID as not running, which could mislead concurrent snapshots.

Figure 2 — PREPARE TRANSACTION state transitions

stateDiagram-v2
    [*] --> Reserving : MarkAsPreparing\nvalid=false
    Reserving --> Serializing : StartPrepare\nassemble blob
    Serializing --> WALWritten : EndPrepare\nXLogInsert+Flush
    WALWritten --> Prepared : MarkAsPrepared\nvalid=true ProcArrayAdd
    Prepared --> [*] : COMMIT PREPARED\nor ROLLBACK PREPARED

Figure 2 — The four internal states a GXACT passes through during PREPARE TRANSACTION. The transition to WALWritten is the durable point; Prepared makes the dummy PGPROC visible to other backends.

Phase 2: COMMIT PREPARED / ROLLBACK PREPARED

Any backend, not only the one that prepared, can complete the transaction. The entry point is FinishPreparedTransaction(gid, isCommit):

// FinishPreparedTransaction — src/backend/access/transam/twophase.c
void
FinishPreparedTransaction(const char *gid, bool isCommit)
{
    /* ... declarations elided ... */
    gxact = LockGXact(gid, GetUserId());    /* validate GID, set locking_backend */
    proc = GetPGProcByNumber(gxact->pgprocno);  /* the dummy PGPROC */
    xid = gxact->xid;

    /* read state: from WAL if !ondisk, from file if ondisk */
    if (gxact->ondisk)
        buf = ReadTwoPhaseFile(xid, false);
    else
        XlogReadTwoPhaseData(gxact->prepare_start_lsn, &buf, NULL);

    /* disassemble header: children[], commitrels[], abortrels[],
       commitstats[], abortstats[], invalmsgs[] */
    hdr = (TwoPhaseFileHeader *) buf;
    bufptr = buf + MAXALIGN(sizeof(TwoPhaseFileHeader));
    bufptr += MAXALIGN(hdr->gidlen);
    children = (TransactionId *) bufptr; bufptr += ...;
    commitrels = (RelFileLocator *) bufptr; bufptr += ...;
    abortrels = (RelFileLocator *) bufptr; bufptr += ...;
    // ... stats, invalmsgs ...
    latestXid = TransactionIdLatest(xid, hdr->nsubxacts, children);

    HOLD_INTERRUPTS();

    /* 1. Write commit/abort WAL record (always flushes) */
    if (isCommit)
        RecordTransactionCommitPrepared(xid, ...);
    else
        RecordTransactionAbortPrepared(xid, ...);

    /* 2. Remove dummy proc from ProcArray */
    ProcArrayRemove(proc, latestXid);
    gxact->valid = false;       /* protect against re-entry on failure */

    /* 3. Drop pending relations, execute stat drops */
    delrels = isCommit ? commitrels : abortrels;
    ndelrels = isCommit ? hdr->ncommitrels : hdr->nabortrels;
    DropRelationFiles(delrels, ndelrels, false);
    pgstat_execute_transactional_drops(...);

    /* 4. Send cache invalidation messages (commit only) */
    if (isCommit)
        SendSharedInvalidMessages(invalmsgs, hdr->ninvalmsgs);

    /* 5. Run 2PC rmgr callbacks (lock release, pgstat, multixact), then predlock */
    LWLockAcquire(TwoPhaseStateLock, LW_EXCLUSIVE);
    if (isCommit)
        ProcessRecords(bufptr, xid, twophase_postcommit_callbacks);
    else
        ProcessRecords(bufptr, xid, twophase_postabort_callbacks);
    PredicateLockTwoPhaseFinish(xid, isCommit);
    ondisk = gxact->ondisk;     /* read under the lock, before the slot is freed */
    RemoveGXact(gxact);         /* return to freelist */
    LWLockRelease(TwoPhaseStateLock);

    /* 6. Remove pg_twophase file if it was promoted to disk */
    if (ondisk)
        RemoveTwoPhaseFile(xid, true);
    RESUME_INTERRUPTS();
}

The ordering is deliberate: WAL record first (the irreversible decision), then ProcArray removal (XID becomes “not running”), then resource cleanup, then lock release via the callback chain. This mirrors CommitTransaction’s ordering for single-backend commits and ensures the same safety properties: no other backend can observe a committed-but-not-cleaned state because ProcArray removal happens before lock release.

LockGXact serializes concurrent COMMIT/ROLLBACK attempts on the same GID by storing the calling backend’s ProcNumber into gxact->locking_backend; the second attempt sees locking_backend != INVALID_PROC_NUMBER and raises an error. It also enforces that the committing backend is in the same database as the prepared transaction (a restriction imposed by NOTIFY and other database-local state).
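The checks that LockGXact performs, as described above, can be sketched in isolation. This is a simplified illustration: `DemoPrepared`, `DemoLockResult`, and `demo_lock_gxact` are hypothetical names, errors are returned as codes rather than raised, the owner/superuser check is omitted, and the caller is assumed to hold whatever lock protects the array.

```c
#include <string.h>

#define DEMO_INVALID_PROC (-1)  /* stand-in for INVALID_PROC_NUMBER */

typedef struct DemoPrepared
{
    char        gid[200];
    int         valid;              /* only valid entries can be finished */
    int         locking_backend;    /* DEMO_INVALID_PROC when unlocked */
    unsigned int database;          /* database the transaction was prepared in */
} DemoPrepared;

typedef enum
{
    LOCK_OK,
    LOCK_NOT_FOUND,
    LOCK_BUSY,
    LOCK_WRONG_DB
} DemoLockResult;

/*
 * Find the valid entry with the given GID and claim it for my_proc.
 * On LOCK_OK, *out points at the claimed entry; otherwise nothing changes.
 */
DemoLockResult
demo_lock_gxact(DemoPrepared *xacts, int n, const char *gid,
                int my_proc, unsigned int my_db, DemoPrepared **out)
{
    int         i;

    for (i = 0; i < n; i++)
    {
        DemoPrepared *g = &xacts[i];

        if (!g->valid)
            continue;               /* still being prepared, or already finished */
        if (strcmp(g->gid, gid) != 0)
            continue;
        if (g->locking_backend != DEMO_INVALID_PROC)
            return LOCK_BUSY;       /* another backend is finishing it */
        if (g->database != my_db)
            return LOCK_WRONG_DB;   /* must connect to the preparing database */
        g->locking_backend = my_proc;
        *out = g;
        return LOCK_OK;
    }
    return LOCK_NOT_FOUND;
}
```

Skipping entries with `valid` unset is what keeps a half-prepared transaction from being committed, and it also hides an entry whose second phase has already passed the point of no return.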

The RecordTransactionCommitPrepared path always flushes (XLogFlush) — there is no synchronous_commit shortcut — because the prepare record that justified holding the GID slot was itself flushed. The abort path similarly always flushes because the 2PC state file must be removed after abort, and the WAL record must precede that removal.

State file format and the WAL ↔ file lifecycle

The serialized state blob (assembled by StartPrepare / RegisterTwoPhaseRecord) has a fixed structure:

1. TwoPhaseFileHeader (= xl_xact_prepare; includes magic, total_len, xid, gidlen, counts)
2. GID string (gidlen bytes, MAXALIGN-padded)
3. TransactionId[] (sub-transaction XIDs)
4. RelFileLocator[] (relations to drop on commit)
5. RelFileLocator[] (relations to drop on abort)
6. xl_xact_stats_item[] (pgstat drops on commit)
7. xl_xact_stats_item[] (pgstat drops on abort)
8. SharedInvalidationMessage[] (cache inval messages)
9. TwoPhaseRecordOnDisk* (per-rmgr records: lock state, multixact, predlock)
10. TwoPhaseRecordOnDisk (end sentinel, rmid == TWOPHASE_RM_END_ID)
11. pg_crc32c (CRC-32C over all preceding bytes; pg_twophase/ file only)

This same layout is used both in the WAL record body and in the pg_twophase/ file (the file adds a CRC on disk; WAL uses its own CRC). On commit, the blob is re-read by FinishPreparedTransaction; the bufptr walk through the fixed sections reaches the per-rmgr records, which are dispatched through ProcessRecords.
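The trailing checksum on the file form can be illustrated with a plain bitwise CRC-32C. This is a sketch only: `demo_crc32c`, `demo_seal`, and `demo_verify` are hypothetical helpers, and PostgreSQL itself uses its own `pg_crc32c` macros, which may be hardware-accelerated, rather than this loop.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Bitwise CRC-32C (Castagnoli), reflected polynomial 0x82F63B78. */
uint32_t
demo_crc32c(const void *data, size_t len)
{
    const unsigned char *p = data;
    uint32_t    crc = 0xFFFFFFFFu;
    size_t      i;
    int         bit;

    for (i = 0; i < len; i++)
    {
        crc ^= p[i];
        for (bit = 0; bit < 8; bit++)
            crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
    }
    return crc ^ 0xFFFFFFFFu;
}

/* Return a malloc'd copy of body followed by its CRC; *out_len gets the size. */
unsigned char *
demo_seal(const void *body, size_t len, size_t *out_len)
{
    unsigned char *buf = malloc(len + sizeof(uint32_t));
    uint32_t    crc;

    if (buf == NULL)
        return NULL;
    memcpy(buf, body, len);
    crc = demo_crc32c(buf, len);
    memcpy(buf + len, &crc, sizeof(crc));
    *out_len = len + sizeof(crc);
    return buf;
}

/* Return 1 if the trailing CRC matches the preceding bytes, else 0. */
int
demo_verify(const unsigned char *buf, size_t total)
{
    uint32_t    stored;

    if (total < sizeof(stored))
        return 0;
    memcpy(&stored, buf + total - sizeof(stored), sizeof(stored));
    return stored == demo_crc32c(buf, total - sizeof(stored));
}
```

A reader that finds a mismatch has no way to repair the blob, so the useful outcomes are only "accept" or "report corruption"; the sketch mirrors that by returning a flag and nothing else.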

The WAL-to-file promotion happens in CheckPointTwoPhase:

// CheckPointTwoPhase — src/backend/access/transam/twophase.c
void
CheckPointTwoPhase(XLogRecPtr redo_horizon)
{
    LWLockAcquire(TwoPhaseStateLock, LW_SHARED);
    for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
    {
        GlobalTransaction gxact = TwoPhaseState->prepXacts[i];
        if ((gxact->valid || gxact->inredo)
            && !gxact->ondisk
            && gxact->prepare_end_lsn <= redo_horizon)
        {
            XlogReadTwoPhaseData(gxact->prepare_start_lsn, &buf, &len);
            RecreateTwoPhaseFile(gxact->xid, buf, len);     /* write + fsync */
            gxact->ondisk = true;
            gxact->prepare_start_lsn = InvalidXLogRecPtr;   /* WAL pointer cleared */
        }
    }
    LWLockRelease(TwoPhaseStateLock);
    fsync_fname(TWOPHASE_DIR, true);    /* fsync directory for removals too */
}

The redo_horizon threshold is the checkpoint’s redo LSN. Crash recovery replays WAL only from that LSN onward, so a gxact whose prepare_end_lsn is at or below redo_horizon would not be rebuilt by replay, and the segment holding its record may be recycled; it must be promoted to a file. Gxacts with newer LSNs stay WAL-backed, so a transaction that is prepared and finished between two checkpoints never touches pg_twophase/ at all.
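The promotion test in CheckPointTwoPhase reduces to a single predicate over three flags and one LSN. The sketch below restates it as a standalone function; `DemoLsn`, `DemoGXactFlags`, and `demo_needs_promotion` are hypothetical names, and a plain 64-bit integer stands in for XLogRecPtr.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t DemoLsn;       /* stand-in for XLogRecPtr */

typedef struct DemoGXactFlags
{
    bool        valid;          /* fully prepared on a running server */
    bool        ondisk;         /* already promoted to a state file */
    bool        inredo;         /* added by WAL replay */
    DemoLsn     prepare_end_lsn;
} DemoGXactFlags;

/* True when a checkpoint with this redo horizon must write the state file. */
bool
demo_needs_promotion(const DemoGXactFlags *g, DemoLsn redo_horizon)
{
    return (g->valid || g->inredo)
        && !g->ondisk
        && g->prepare_end_lsn <= redo_horizon;
}
```

An entry that is neither `valid` nor `inredo` is still inside the prepare pipeline and has no complete WAL record to copy, which is why the first clause excludes it.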

Figure 3 — State data lifecycle from prepare to commit

flowchart TD
    A["PREPARE TRANSACTION\nEndPrepare writes XLOG_XACT_PREPARE\ngxact.ondisk=false\ngxact.prepare_start_lsn=L"]
    B{"checkpoint\nredo_horizon ≥ L?"}
    C["CheckPointTwoPhase\nRecreateTwoPhaseFile\ngxact.ondisk=true"]
    D{"COMMIT / ROLLBACK\nPREPARED"}
    E["XlogReadTwoPhaseData\nread from WAL at L"]
    F["ReadTwoPhaseFile\nread from pg_twophase/"]
    G["FinishPreparedTransaction\napply callbacks, WAL record, cleanup"]

    A --> B
    B -- "no (fast path)" --> D
    B -- "yes" --> C
    C --> D
    D -- "!ondisk" --> E
    D -- "ondisk" --> F
    E --> G
    F --> G

Figure 3 — The state data starts WAL-backed. A checkpoint whose redo horizon passes the prepare LSN promotes it to pg_twophase/. The second-phase path reads from whichever store the ondisk flag indicates.

At startup, StartupXLOG calls restoreTwoPhaseData to scan pg_twophase/ and populate TwoPhaseState with entries that were already on disk before the crash. WAL replay then calls PrepareRedoAdd for each XLOG_XACT_PREPARE record encountered:

// PrepareRedoAdd — src/backend/access/transam/twophase.c
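void
PrepareRedoAdd(char *buf, XLogRecPtr start_lsn,
               XLogRecPtr end_lsn, RepOriginId origin_id)
{
    TwoPhaseFileHeader *hdr = (TwoPhaseFileHeader *) buf;
    const char *gid = buf + MAXALIGN(sizeof(TwoPhaseFileHeader));
    // ... skip if the state file was already restored from pg_twophase/ ...
    gxact = TwoPhaseState->freeGXacts;      /* pop from freelist */
    TwoPhaseState->freeGXacts = gxact->next;
    gxact->prepared_at = hdr->prepared_at;
    gxact->prepare_start_lsn = start_lsn;
    gxact->prepare_end_lsn = end_lsn;
    gxact->xid = hdr->xid;
    gxact->owner = hdr->owner;
    gxact->locking_backend = INVALID_PROC_NUMBER;
    gxact->valid = false;                   /* no dummy PGPROC during redo */
    gxact->ondisk = (start_lsn == InvalidXLogRecPtr);   /* from disk = no LSN */
    gxact->inredo = true;
    strcpy(gxact->gid, gid);
    TwoPhaseState->prepXacts[TwoPhaseState->numPrepXacts++] = gxact;
}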

Entries added via PrepareRedoAdd have inredo = true. If an XLOG_XACT_COMMIT_PREPARED or XLOG_XACT_ABORT_PREPARED record follows in the WAL stream, PrepareRedoRemove drops the gxact. Entries still present at end-of-recovery are restored by RecoverPreparedTransactions, which re-runs MarkAsPreparingGuts, loads subxact data, calls MarkAsPrepared to put the dummy PGPROC back in the ProcArray, and then dispatches twophase_recover_callbacks to re-acquire locks and multixact state. After this, the prepared transactions are indistinguishable from ones that were prepared on the now-running primary.

Hot standby needs a lighter form: StandbyRecoverPreparedTransactions runs ProcessTwoPhaseBuffer with setParent = true (to populate pg_subtrans for correct snapshot behavior) but does not re-acquire locks — standby query conflicts with prepared transactions are handled by the StandbyReleaseLockTree path.

The 2PC resource manager callback tables in twophase_rmgr.c expose four arrays:

// twophase_rmgr.c — src/backend/access/transam/twophase_rmgr.c
const TwoPhaseCallback twophase_recover_callbacks[TWOPHASE_RM_MAX_ID + 1] = {
    NULL,                           /* END ID */
    lock_twophase_recover,          /* TWOPHASE_RM_LOCK_ID */
    NULL,                           /* TWOPHASE_RM_PGSTAT_ID */
    multixact_twophase_recover,     /* TWOPHASE_RM_MULTIXACT_ID */
    predicatelock_twophase_recover  /* TWOPHASE_RM_PREDICATELOCK_ID */
};
const TwoPhaseCallback twophase_postcommit_callbacks[...] = { ..., lock_twophase_postcommit, pgstat_twophase_postcommit, multixact_twophase_postcommit, NULL };
const TwoPhaseCallback twophase_postabort_callbacks[...] = { ..., lock_twophase_postabort, pgstat_twophase_postabort, multixact_twophase_postabort, NULL };
const TwoPhaseCallback twophase_standby_recover_callbacks[...] = { ..., lock_twophase_standby_recover, NULL, NULL, NULL };

predicatelock_twophase_recover re-establishes SSI predicate locks so that serialization anomaly detection continues to work correctly after a crash involving a prepared transaction. Post-commit/abort callbacks release the locks (lock manager), update statistics (pgstat), and handle multixact cleanup.
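How a callback table is driven by the per-rmgr records can be sketched as a walk over length-prefixed, aligned records that stops at the end sentinel. Everything here is illustrative: `DemoRecordHdr` only approximates the on-disk record header, the `RM_*` ids cover two resource managers instead of four, and `demo_put`, `demo_process`, `demo_lock_cb`, and `demo_postcommit_callbacks` are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEMO_MAXALIGN(n) (((n) + 7u) & ~(size_t) 7u)

enum { RM_END = 0, RM_LOCK = 1, RM_PGSTAT = 2, RM_MAX = 2 };

typedef struct DemoRecordHdr
{
    uint32_t    len;            /* payload length, unpadded */
    uint8_t     rmid;           /* which resource manager owns the record */
    uint16_t    info;           /* rmgr-private flag bits */
} DemoRecordHdr;

typedef void (*DemoCallback) (uint32_t xid, uint16_t info,
                              const void *data, uint32_t len);

/* Append one record at *off; the caller guarantees buf is large enough. */
void
demo_put(unsigned char *buf, size_t *off, uint8_t rmid, uint16_t info,
         const void *data, uint32_t len)
{
    DemoRecordHdr h;

    memset(&h, 0, sizeof(h));
    h.len = len;
    h.rmid = rmid;
    h.info = info;
    memcpy(buf + *off, &h, sizeof(h));
    *off += DEMO_MAXALIGN(sizeof(h));
    if (len > 0)
        memcpy(buf + *off, data, len);
    *off += DEMO_MAXALIGN(len);
}

/* Walk to the end sentinel; returns callbacks invoked, or -1 on a bad rmid. */
int
demo_process(const unsigned char *buf, uint32_t xid, const DemoCallback cbs[])
{
    size_t      off = 0;
    int         calls = 0;

    for (;;)
    {
        DemoRecordHdr h;

        memcpy(&h, buf + off, sizeof(h));
        if (h.rmid == RM_END)
            break;
        if (h.rmid > RM_MAX)
            return -1;
        off += DEMO_MAXALIGN(sizeof(h));
        if (cbs[h.rmid] != NULL)    /* a NULL slot means "nothing to do" */
        {
            cbs[h.rmid] (xid, h.info, buf + off, h.len);
            calls++;
        }
        off += DEMO_MAXALIGN(h.len);
    }
    return calls;
}

/* Demo registrant: counts how many lock records it was asked to release. */
int         demo_locks_released = 0;

void
demo_lock_cb(uint32_t xid, uint16_t info, const void *data, uint32_t len)
{
    (void) xid;
    (void) info;
    (void) data;
    (void) len;
    demo_locks_released++;
}

const DemoCallback demo_postcommit_callbacks[RM_MAX + 1] = {
    NULL,                       /* RM_END */
    demo_lock_cb,               /* RM_LOCK */
    NULL                        /* RM_PGSTAT: no action in this demo */
};
```

Passing a different table to the same walker is all it takes to switch between recovery, post-commit, and post-abort behavior, which is why the records themselves carry no notion of which phase will consume them.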

  • TwoPhaseShmemSize / TwoPhaseShmemInit — size calculation and initialization; links PreparedXactProcs[] dummy procs into each GlobalTransactionData.pgprocno.
  • MarkAsPreparing — pops a free gxact, checks GID uniqueness, calls MarkAsPreparingGuts to initialize the dummy PGPROC, inserts into prepXacts[].
  • MarkAsPreparingGuts — zeroes and initializes the dummy PGPROC struct (pid=0, myProcLocks[], vxid clone); sets MyLockedGxact.
  • MarkAsPrepared — sets valid = true, calls ProcArrayAdd.
  • LockGXact — linear scan of prepXacts[] for GID, sets locking_backend to prevent concurrent second-phase.
  • RemoveGXact — removes from prepXacts[] array, pushes back onto freelist.
  • AtAbort_Twophase / PostPrepare_Twophase — cleanup hooks for the preparing backend on error.
  • StartPrepare — allocates records chain, writes TwoPhaseFileHeader, calls GXactLoadSubxactData.
  • save_state_data — appends MAXALIGN-padded bytes to records chain.
  • RegisterTwoPhaseRecord — public API for 2PC resource managers; appends a TwoPhaseRecordOnDisk header plus data.
  • EndPrepare — appends end sentinel, fills total_len, writes XLOG_XACT_PREPARE via XLogInsert+XLogFlush, calls MarkAsPrepared, waits for sync replication.
  • FinishPreparedTransaction — main dispatch: LockGXact → read state → WAL commit/abort record → ProcArrayRemove → drop files → send invals → ProcessRecords (callbacks) → RemoveGXact → optional file removal.
  • RecordTransactionCommitPrepared / RecordTransactionAbortPrepared — mirror of RecordTransactionCommit / RecordTransactionAbort in xact.c but always flushes; cannot use synchronous_commit = off.
  • ProcessRecords — walks the TwoPhaseRecordOnDisk chain dispatching each record’s rmid to the appropriate callback array.
  • ReadTwoPhaseFile — opens pg_twophase/<XID>, validates magic + CRC-32C, returns palloc’d buffer.
  • XlogReadTwoPhaseData — allocates an XLogReaderState, reads the XLOG_XACT_PREPARE record at prepare_start_lsn, returns the data portion.
  • RecreateTwoPhaseFile — writes content + CRC-32C to pg_twophase/<XID>, fsyncs (recovery end-checkpoint will not re-fsync).
  • RemoveTwoPhaseFile — unlink(pg_twophase/<XID>).
  • CheckPointTwoPhase — iterates prepXacts[], promotes any gxact whose prepare_end_lsn ≤ redo_horizon from WAL to file; fsyncs directory.
  • TwoPhaseFilePath — formats pg_twophase/<16-hex-digit FullXID> path.
  • restoreTwoPhaseData — called at recovery start; scans pg_twophase/, calls ProcessTwoPhaseBuffer + PrepareRedoAdd for each valid file.
  • PrepareRedoAdd — called during WAL replay of XLOG_XACT_PREPARE; creates a gxact with inredo = true and valid = false and records the LSN pointers. The dummy PGPROC is set up later by RecoverPreparedTransactions.
  • PrepareRedoRemove — called during WAL replay of commit/abort prepared; removes gxact from TwoPhaseState, optionally removes file.
  • PrescanPreparedTransactions — post-WAL-replay scan to compute oldest prepared XID (for pg_subtrans startup) and collect XIDs for TransamVariables->nextXid advancement.
  • StandbyRecoverPreparedTransactions — hot standby setup: populates pg_subtrans via SubTransSetParent without re-acquiring locks.
  • RecoverPreparedTransactions — end-of-recovery: calls twophase_recover_callbacks to re-acquire locks and multixact state; calls PostPrepare_Twophase to unlock gxact.
  • ProcessTwoPhaseBuffer — common helper: reads from disk or WAL, validates header, optionally advances nextXid, optionally sets subtrans parents.
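The path format noted for TwoPhaseFilePath (16 hex digits of the full XID under pg_twophase/) can be sketched with `snprintf`. The split into an 8-digit epoch and an 8-digit XID, the uppercase hex, and the `demo_twophase_path` name are assumptions made for this example, not a transcription of the source.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define DEMO_TWOPHASE_DIR "pg_twophase"

/*
 * Format the state-file path for a 64-bit full XID.
 * Returns the length snprintf reports, as with snprintf itself.
 */
int
demo_twophase_path(char *path, size_t size, uint64_t full_xid)
{
    unsigned int epoch = (unsigned int) (full_xid >> 32);
    unsigned int xid = (unsigned int) (full_xid & 0xFFFFFFFFu);

    return snprintf(path, size, DEMO_TWOPHASE_DIR "/%08X%08X", epoch, xid);
}
```

Fixed-width names keep directory listings sortable by XID and make it cheap for a directory scan to reject anything that is not exactly 16 hex characters.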

Position hints (as of 2026-06-05, commit 273fe94)

| Symbol | File | Line |
| --- | --- | --- |
| TwoPhaseStateData | src/backend/access/transam/twophase.c | 176 |
| GlobalTransactionData | src/backend/access/transam/twophase.c | 147 |
| TwoPhaseShmemSize | src/backend/access/transam/twophase.c | 237 |
| TwoPhaseShmemInit | src/backend/access/transam/twophase.c | 252 |
| MarkAsPreparing | src/backend/access/transam/twophase.c | 358 |
| MarkAsPreparingGuts | src/backend/access/transam/twophase.c | 432 |
| MarkAsPrepared | src/backend/access/transam/twophase.c | 529 |
| LockGXact | src/backend/access/transam/twophase.c | 551 |
| RemoveGXact | src/backend/access/transam/twophase.c | 627 |
| StartPrepare | src/backend/access/transam/twophase.c | 1050 |
| EndPrepare | src/backend/access/transam/twophase.c | 1143 |
| RegisterTwoPhaseRecord | src/backend/access/transam/twophase.c | 1264 |
| ReadTwoPhaseFile | src/backend/access/transam/twophase.c | 1287 |
| XlogReadTwoPhaseData | src/backend/access/transam/twophase.c | 1404 |
| FinishPreparedTransaction | src/backend/access/transam/twophase.c | 1487 |
| ProcessRecords | src/backend/access/transam/twophase.c | 1681 |
| RecreateTwoPhaseFile | src/backend/access/transam/twophase.c | 1727 |
| CheckPointTwoPhase | src/backend/access/transam/twophase.c | 1807 |
| restoreTwoPhaseData | src/backend/access/transam/twophase.c | 1888 |
| PrepareRedoAdd | src/backend/access/transam/twophase.c | 2469 |
| PrepareRedoRemove | src/backend/access/transam/twophase.c | 2572 |
| PrescanPreparedTransactions | src/backend/access/transam/twophase.c | 1952 |
| StandbyRecoverPreparedTransactions | src/backend/access/transam/twophase.c | 2033 |
| RecoverPreparedTransactions | src/backend/access/transam/twophase.c | 2073 |
| ProcessTwoPhaseBuffer | src/backend/access/transam/twophase.c | 2176 |
| twophase_recover_callbacks | src/backend/access/transam/twophase_rmgr.c | 24 |
| twophase_postcommit_callbacks | src/backend/access/transam/twophase_rmgr.c | 33 |
| twophase_postabort_callbacks | src/backend/access/transam/twophase_rmgr.c | 42 |
| twophase_standby_recover_callbacks | src/backend/access/transam/twophase_rmgr.c | 51 |
| TWOPHASE_RM_LOCK_ID | src/include/access/twophase_rmgr.h | 23 |
| GIDSIZE | src/include/access/xact.h | |

  • max_prepared_transactions = 0 disables 2PC entirely. MarkAsPreparing checks the backing C variable max_prepared_xacts before doing anything and raises ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE when it is 0. The default is 0, so users must set max_prepared_transactions > 0 explicitly. Verified in MarkAsPreparing (twophase.c ~373).

  • State data is written to WAL first, files only at checkpoint. The comment at the top of twophase.c (lines 38–68) and the implementation of EndPrepare / CheckPointTwoPhase confirm: on PREPARE TRANSACTION only a WAL record is written; pg_twophase/ files appear only when prepare_end_lsn ≤ redo_horizon at checkpoint time. Fast-path committed prepared transactions (committed before the next checkpoint) never touch the filesystem. Verified by reading both functions.

  • The dummy PGPROC appears in the ProcArray before the preparing backend exits. MarkAsPrepared calls ProcArrayAdd before EndPrepare returns, so there is a brief double-presence of the XID in the ProcArray. The comment at line 1222–1230 explains this is deliberate. Verified in EndPrepare and MarkAsPrepared.

  • COMMIT PREPARED always flushes WAL, regardless of synchronous_commit. RecordTransactionCommitPrepared calls XLogFlush(recptr) unconditionally (twophase.c ~2367). The comment notes “there is no support for async commit of a prepared xact (the very idea is probably a contradiction).” Verified.

  • Second-phase is restricted to the same database. LockGXact checks MyDatabaseId != proc->databaseId and raises an error (twophase.c ~595–599). NOTIFY and other database-local side-effects are the stated reason. Verified.

  • CRC-32C protects state files. ReadTwoPhaseFile validates both magic (TWOPHASE_MAGIC = 0x57F94534) and a CRC-32C over all preceding bytes. RecreateTwoPhaseFile recomputes and appends the CRC when writing. Verified in both functions.

  • twophase_recover_callbacks does not include pgstat. The recover callback for TWOPHASE_RM_PGSTAT_ID is NULL in twophase_rmgr.c (line 28). pgstat does participate in post-commit and post-abort paths but recovery of pgstat state is not attempted. Verified by reading twophase_rmgr.c.

  • predicatelock_twophase_recover re-establishes SSI locks after crash. Verified by tracing twophase_recover_callbacks[TWOPHASE_RM_PREDICATELOCK_ID] → predicatelock_twophase_recover (in storage/lmgr/predicate.c). This ensures prepared transactions do not escape SSI tracking after recovery.

  1. GIDSIZE value. Used in GlobalTransactionData.gid[GIDSIZE] but not visible in twophase.h; it is defined in src/include/access/xact.h. The value is 200 (bytes including the null terminator). Confirm against xact.h if the position-hint table is updated.

  2. Maximum size of the state blob. EndPrepare checks hdr->total_len > MaxAllocSize and raises an error, but the maximum practical blob size for a transaction with many subxacts, many pending relation drops, and large lock state is not documented. An extreme workload (thousands of subxacts + thousands of locks) could conceivably approach the limit; exact threshold requires measurement.

  3. DELAY_CHKPT_START vs. DELAY_CHKPT_COMPLETE interaction. The comment in EndPrepare explains the race being prevented, but the full interaction with CheckPointTwoPhase’s own lock protocol under concurrent checkpoint stress has not been traced in this document.
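
For item 1, the practical consequence of GIDSIZE can be sketched as a length check. The value 200 is the one quoted above and should still be confirmed against xact.h; SKETCH_GIDSIZE and sketch_gid_fits are hypothetical names introduced only for this illustration, and the check assumes the limit counts the terminating null byte, as item 1 states.

```c
#include <stdbool.h>
#include <string.h>

#define SKETCH_GIDSIZE 200   /* assumed value, quoted in item 1; confirm in xact.h */

/*
 * A GID is stored in a fixed gid[GIDSIZE] array, so the string plus its
 * null terminator must fit: at most SKETCH_GIDSIZE - 1 bytes of text.
 */
static bool
sketch_gid_fits(const char *gid)
{
    return strlen(gid) < SKETCH_GIDSIZE;
}
```

Under this assumption a 199-byte identifier fits and a 200-byte identifier does not, which matters for coordinators that encode host names or timestamps into the GID.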

Beyond PostgreSQL — Comparative Designs & Research Frontiers

  • X/Open XA standard. The PREPARE TRANSACTION / COMMIT PREPARED / ROLLBACK PREPARED SQL syntax and the GID string correspond closely to the XA interface (X/Open CAE Specification, Distributed Transaction Processing: The XA Specification, 1991). PostgreSQL’s twophase_rmgr.c callback arrays play a role inside the server that is analogous to XA resource managers. A comparison of PostgreSQL’s crash-recovery behavior with XA’s “heuristic completion” concept (where a resource manager makes a unilateral decision after timeout) would clarify what PostgreSQL does and does not guarantee without a TM.

  • Distributed transaction coordinators (TM-less 2PC). PostgreSQL exposes 2PC primitives but ships no transaction manager; the application or middleware (Pgpool-II, postgres_fdw + pg_prepared_xacts, Citus) acts as coordinator. Comparing with Spanner’s “Paxos-based 2PC” (Corbett et al., 2012) or CockroachDB’s parallel commit shows how removing the blocking second phase, or removing the coordinator as a single point of failure, changes the latency/availability tradeoff.

  • Saga pattern as 2PC alternative. Long-lived distributed transactions that cannot afford to hold locks while participants wait in the prepared state often use sagas (Garcia-Molina & Salem, Sagas, SIGMOD 1987): a sequence of local transactions, each with a compensating action. PostgreSQL’s 2PC is the classical alternative; a side-by-side of when each is appropriate would complement the postgres-xact.md discussion of subtransaction vs. savepoint.

  • MySQL/InnoDB XA path. InnoDB exposes XA PREPARE / XA COMMIT / XA ROLLBACK with a hash table of prepared transactions in the lock manager layer. Unlike PostgreSQL’s dummy PGPROC approach, InnoDB’s prepared XA transactions do not maintain an entry in a shared process array visible to all backends for lock conflict detection; conflicts are resolved at the InnoDB transaction layer instead. The two designs represent different trade-off points between integration depth and code complexity.

  • Incremental backup and 2PC file footprint. PostgreSQL’s incremental backup feature (PG17+, pg_combinebackup) must handle pg_twophase/ files correctly since they represent durable state that is not directly recoverable from WAL if the WAL segment was recycled. The interaction between CheckPointTwoPhase’s promotion logic and incremental backup’s changed-block tracking is a potential research area for correctness verification.
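
Since PostgreSQL ships no transaction manager, the coordinator role described in the list above falls to the application. The following is a minimal in-memory sketch of that control flow, assuming hypothetical names throughout (SketchParticipant, SketchState, sketch_run_2pc). It issues no SQL; the comments mark where PREPARE TRANSACTION, COMMIT PREPARED, and ROLLBACK PREPARED would be sent, and the can_prepare flag stands in for whether the prepare succeeds on that node.

```c
#include <stdbool.h>
#include <stddef.h>

typedef enum
{
    P_ACTIVE,
    P_PREPARED,
    P_COMMITTED,
    P_ABORTED
} SketchState;

typedef struct SketchParticipant
{
    bool        can_prepare;   /* hypothetical: would PREPARE TRANSACTION succeed? */
    SketchState state;
} SketchParticipant;

/* Drives all participants through both phases; returns true on global commit. */
static bool
sketch_run_2pc(SketchParticipant *p, size_t n)
{
    bool all_prepared = true;

    /* Phase 1: send PREPARE TRANSACTION to each node, stop at the first failure. */
    for (size_t i = 0; i < n; i++)
    {
        if (!p[i].can_prepare)
        {
            all_prepared = false;
            break;
        }
        p[i].state = P_PREPARED;
    }

    /*
     * Decision point: a real coordinator must record its decision durably
     * here, before phase 2, so that it can finish after its own crash.
     */

    /* Phase 2: COMMIT PREPARED everywhere, or roll back everywhere. */
    for (size_t i = 0; i < n; i++)
        p[i].state = all_prepared ? P_COMMITTED : P_ABORTED;

    return all_prepared;
}
```

The sketch omits what makes real coordinators hard: retries against unreachable nodes, the coordinator's own decision log, and cleanup of orphaned prepared transactions left behind when the coordinator itself is lost.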

  • src/backend/access/transam/twophase.c — main implementation (2753 lines, commit 273fe94)
  • src/backend/access/transam/twophase_rmgr.c — 2PC resource manager callback tables (58 lines)
  • src/include/access/twophase.h — public API declarations
  • src/include/access/twophase_rmgr.h — TwoPhaseCallback, TwoPhaseRmgrId, TWOPHASE_RM_* constants
  • knowledge/code-analysis/postgres/postgres-xact.md — single-backend transaction lifecycle; PrepareTransaction() calls into twophase.c
  • knowledge/code-analysis/postgres/postgres-xlog-wal.md — WAL insert mechanics (XLogBeginInsert / XLogInsert / XLogFlush)
  • knowledge/code-analysis/postgres/postgres-recovery-redo.md — WAL replay, PrepareRedoAdd / PrepareRedoRemove in the redo path
  • knowledge/code-analysis/postgres/postgres-lock-manager.md — lock manager 2PC callbacks (lock_twophase_recover, lock_twophase_postcommit, etc.)
  • knowledge/code-analysis/postgres/postgres-mvcc-snapshots.md — procarray snapshot acquisition; dummy PGPROC visibility to GetSnapshotData
  • knowledge/research/dbms-papers/aries.md — ARIES recovery theory