PostgreSQL Transaction Management — The Commit State Machine, Subtransactions, and 2PC Hooks
Contents:
- Theoretical Background
- Common DBMS Design
- PostgreSQL’s Approach
- Source Walkthrough
- Source verification (as of 2026-06-05)
- Beyond PostgreSQL — Comparative Designs & Research Frontiers
- Sources
Theoretical Background
A transaction is the unit of atomicity and durability in a database: the ACID promise that a group of reads and writes either takes full effect or has no effect at all, and that once the system reports “committed” the effect survives a crash. Database System Concepts (Silberschatz, Korth, Sudarshan, 7e, ch. 17 “Transactions”) frames this with the classic transaction state diagram (active → partially committed → committed, or active → failed → aborted) and pins the durable instant to the moment the commit log record reaches stable storage. Everything before that instant is reversible; everything after is not.
That single sentence hides the design space this module lives in. The
textbook model says “write a commit record and make it durable,” but it does
not say when in the sequence of cleanup steps the record must be written,
how the engine names the transaction, or what happens to the half-finished
work of a nested operation that the user wants to undo without losing the
whole transaction. Three design choices follow, and PostgreSQL’s xact.c
turns each of them in a specific direction:
- How is a transaction named, and when? The textbook assumes a transaction identifier exists for the lifetime of the transaction. A real engine has to decide whether to allocate that identifier eagerly at BEGIN or lazily at first write: a read-only transaction needs a handle for locking and snapshots but arguably never needs to appear in the permanent commit log.
- What is the order of operations at commit? Commit is not one instruction; it is “fire deferred triggers, close cursors, run pre-commit callbacks, write the WAL commit record, flush, mark the commit log, leave the running-transaction set, release locks, run post-commit callbacks, free memory.” The textbook gives the durable instant; the engineering problem is ordering everything else around it so that (a) a crash before the instant looks like an abort, (b) a crash after it looks like a commit, and (c) no other backend can observe an inconsistent in-between.
- How do you undo part of a transaction? SQL savepoints (and PL/pgSQL exception blocks) require nested, partially-rollback-able units. The model for this is the nested transaction: a tree where a child’s effects become permanent only if it and every ancestor commit. Database System Concepts (§17.3, recoverable/cascadeless schedules; the nested-transaction discussion) supplies the correctness frame; the engineering frame is “how do you cheaply represent and unwind the tree.”
The durability mechanism underneath all of this is write-ahead logging, and
its canonical account is ARIES (Mohan et al., ARIES: A Transaction Recovery
Method Supporting Fine-Granularity Locking and Partial Rollbacks Using
Write-Ahead Logging, ACM TODS 1992; captured in dbms-papers/aries.md). ARIES
fixes three principles this module depends on: WAL (log the change before
the change reaches the data page), repeating history during redo, and
logging undo actions so that partial rollback (a savepoint, or an aborted
subtransaction) is itself a recoverable operation. PostgreSQL’s transaction
manager is the producer of the commit/abort records that ARIES-style recovery
later replays; the recovery side proper lives in postgres-recovery-redo.md,
and the WAL insert machinery in postgres-xlog-wal.md. What xact.c owns is
the moment of decision: it decides that the transaction commits, in what
order, and writes the record that makes the decision durable.
One PostgreSQL-specific subtlety that the textbook does not anticipate: because
PostgreSQL’s heap is no-overwrite MVCC, the commit record is not where the
data becomes visible. Visibility is decided by the snapshot machinery
(postgres-mvcc-snapshots.md) reading the commit status of an XID from the
commit log (pg_xact / CLOG, postgres-clog-commit-ts.md). So xact.c’s job is
narrower than in an in-place engine: it does not install new values; it flips a
two-bit status from “in progress” to “committed” (or “aborted”) and gets out of
the running-transaction set. Everyone else’s visibility decision keys off that
flip.
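The “two-bit flip” can be made concrete with a toy status array. This is a minimal sketch of the bit arithmetic only, not PostgreSQL’s CLOG code: the names ToyClog, toy_clog_get, toy_clog_set and the XS_* constants are hypothetical, and the real store is a paged SLRU with locking and WAL interlocks that this sketch ignores.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical status codes for the sketch; two bits are enough to hold them. */
enum { XS_IN_PROGRESS = 0, XS_COMMITTED = 1, XS_ABORTED = 2 };

#define BITS_PER_XACT   2
#define XACTS_PER_BYTE  4       /* 8 bits / 2 bits per transaction */
#define STATUS_MASK     0x03

typedef struct ToyClog
{
    uint8_t bytes[64];          /* room for 256 transaction ids */
} ToyClog;

/* All-zero bytes mean every transaction id reads as "in progress". */
void toy_clog_init(ToyClog *c)
{
    memset(c->bytes, 0, sizeof c->bytes);
}

/* Read the two-bit status of one transaction id. */
int toy_clog_get(const ToyClog *c, uint32_t xid)
{
    assert(xid < sizeof c->bytes * XACTS_PER_BYTE);
    unsigned shift = (xid % XACTS_PER_BYTE) * BITS_PER_XACT;
    return (c->bytes[xid / XACTS_PER_BYTE] >> shift) & STATUS_MASK;
}

/* Overwrite the two-bit status of one transaction id, leaving neighbours alone. */
void toy_clog_set(ToyClog *c, uint32_t xid, int status)
{
    assert(xid < sizeof c->bytes * XACTS_PER_BYTE);
    unsigned shift = (xid % XACTS_PER_BYTE) * BITS_PER_XACT;
    uint8_t *b = &c->bytes[xid / XACTS_PER_BYTE];
    *b = (uint8_t) ((*b & ~(STATUS_MASK << shift)) |
                    ((status & STATUS_MASK) << shift));
}
```

The point of the sketch is how little commit has to change: one two-bit field per transaction, with four transactions sharing each byte, and no data page touched at all.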
Common DBMS Design
Across PostgreSQL, Oracle, SQL Server, MySQL/InnoDB, and CUBRID, a recurring set of engineering conventions realizes the textbook transaction. None of them are in the textbook; all of them recur because the same three pressures (naming, commit ordering, nesting) push implementers toward the same shapes.
A monotonic transaction-id counter, allocated under a short lock. Every
engine has a global counter that hands out transaction identifiers in
increasing order; the order is load-bearing because visibility and recovery
both reason “this id is older than that id.” The counter is protected by a
narrow critical section (a latch held for a handful of instructions) because it
is one of the hottest shared resources in the system. PostgreSQL calls it
nextXid under XidGenLock; InnoDB has trx_sys->max_trx_id; CUBRID has its
own transaction-id allocator. The shared trick is to hold the lock for as
little as possible and to publish the new id into the shared
running-transaction registry before dropping the lock, so no concurrent
snapshot can miss it.
Lazy id assignment for read-only work. Mature engines avoid burning a
permanent id on transactions that never write. A read-only transaction still
needs some handle (to hold locks, to anchor a snapshot), so engines split the
identity in two: a cheap, backend-local handle assigned at start, and an
expensive, globally-ordered id assigned only at first write. PostgreSQL’s split
is VXID (virtual transaction id) vs XID: the VXID — (procNumber, localTransactionId) — is assigned with zero shared-memory contention at
StartTransaction, and the real XID is deferred to AssignTransactionId.
Commit as a fixed pipeline around one durable instant. Every engine turns “commit” into an ordered sequence with exactly one step that is the point of no return: writing-and-flushing the commit log record. Steps before it are abort-safe (an error reroutes to the abort path); steps after it are “noncritical cleanup” that must not fail the transaction. The universal ordering rule is: make changes visible-to-others (release row/page locks) only after the commit is durable, and leave the running set in a way that interlocks with snapshot-takers.
A separate “abort is free” path. Because the recovery rule is “a
transaction with no commit record is presumed aborted,” engines do not need to
flush an abort record. PostgreSQL writes an abort record (useful for hot
standby and for waking lock waiters) but deliberately does not XLogFlush it.
This asymmetry — commit pays for durability, abort does not — is universal.
Nested transactions as a stack with upward merge. Savepoints and exception blocks are implemented as a stack of per-level state records. A child that commits does not independently durably commit; instead its identity (and its own committed children) is merged upward into the parent, and only the top-level commit durably commits the whole tree in one atomic record. A child that aborts is marked aborted immediately (so lock waiters and visibility checks short-circuit) and unwound from the stack.
Resource ownership tied to transaction scope. Buffers pinned, files
opened, catalog-cache entries, locks held — all are tracked so they can be
released en masse at end-of-transaction in a defined order. PostgreSQL
centralizes this in the ResourceOwner tree (utils/resowner); the commit
and abort paths walk it in phases (before-locks, locks, after-locks).
Theory ↔ PostgreSQL mapping
| Theory / convention | PostgreSQL name (this module) |
|---|---|
| Transaction state (active / committed / aborted) | TransState enum — TRANS_DEFAULT, TRANS_START, TRANS_INPROGRESS, TRANS_COMMIT, TRANS_ABORT, TRANS_PREPARE |
| Client-block state (BEGIN/COMMIT/SAVEPOINT block) | TBlockState enum — TBLOCK_* |
| Per-transaction control block | TransactionStateData (the stack node) |
| Global monotonic id counter | TransamVariables->nextXid, under XidGenLock |
| Cheap start-time handle (read-only) | VXID — (procNumber, localTransactionId) |
| Lazy real id (first write) | XID — AssignTransactionId → GetNewTransactionId |
| Durable-instant record | XLOG_XACT_COMMIT written by XactLogCommitRecord |
| Commit-status store (the flip) | pg_xact / CLOG via TransactionIdCommitTree (see postgres-clog-commit-ts.md) |
| Leave the running set (interlock) | ProcArrayEndTransaction (see postgres-mvcc-snapshots.md) |
| Nested transaction node | a pushed TransactionStateData with parent link |
| Child→parent merge at subcommit | AtSubCommit_childXids → parent’s childXids[] |
| Per-scope resource cleanup | ResourceOwner release phases + AtEOXact_* / AtEOSubXact_* callbacks |
| Undo-logging for partial rollback (ARIES) | abort record + immediate CLOG abort mark; physical undo is MVCC, not logged |
By the time TransactionStateData and AssignTransactionId appear in the next
section, the reader already knows what kind of thing each is.
PostgreSQL’s Approach
Two state machines, one per layer
The in-tree README opens by calling the transaction system “a three-layer
system.” The bottom layer is the low-level routines (StartTransaction,
CommitTransaction, …); the middle layer is postgres.c’s per-query
StartTransactionCommand / CommitTransactionCommand; the top layer is the
SQL traffic-cop (BeginTransactionBlock, EndTransactionBlock, …). The two
states that thread through these layers are deliberately separate:
```c
// TransState, from src/backend/access/transam/xact.c
typedef enum TransState
{
    TRANS_DEFAULT,      /* idle */
    TRANS_START,        /* transaction starting */
    TRANS_INPROGRESS,   /* inside a valid transaction */
    TRANS_COMMIT,       /* commit in progress */
    TRANS_ABORT,        /* abort in progress */
    TRANS_PREPARE,      /* prepare in progress */
} TransState;
```

TransState is the engine’s view: what is this backend physically doing
right now. IsTransactionState() returns true only for TRANS_INPROGRESS —
the transient START/COMMIT/ABORT/PREPARE states are explicitly “too soon or too
late to do anything interesting,” so database access is only safe in
TRANS_INPROGRESS.
The second machine is the client’s view — what the user’s BEGIN/COMMIT block wants — and it has far more states because it has to remember intent across multiple query cycles:
```c
// TBlockState, from src/backend/access/transam/xact.c (condensed)
typedef enum TBlockState
{
    /* not-in-transaction-block states */
    TBLOCK_DEFAULT,             /* idle */
    TBLOCK_STARTED,             /* running single-query transaction */

    /* transaction block states */
    TBLOCK_BEGIN,               /* starting transaction block */
    TBLOCK_INPROGRESS,          /* live transaction */
    TBLOCK_IMPLICIT_INPROGRESS, /* live transaction after implicit BEGIN */
    TBLOCK_PARALLEL_INPROGRESS, /* live transaction inside parallel worker */
    TBLOCK_END,                 /* COMMIT received */
    TBLOCK_ABORT,               /* failed xact, awaiting ROLLBACK */
    TBLOCK_ABORT_END,           /* failed xact, ROLLBACK received */
    TBLOCK_ABORT_PENDING,       /* live xact, ROLLBACK received */
    TBLOCK_PREPARE,             /* live xact, PREPARE received */

    /* subtransaction states */
    TBLOCK_SUBBEGIN,
    TBLOCK_SUBINPROGRESS,
    TBLOCK_SUBRELEASE,
    TBLOCK_SUBCOMMIT,
    TBLOCK_SUBABORT,
    TBLOCK_SUBABORT_END,
    TBLOCK_SUBABORT_PENDING,
    TBLOCK_SUBRESTART,
    TBLOCK_SUBABORT_RESTART,
} TBlockState;
```

The reason for two machines is the subtlety the README spends a page on: a
single SQL COMMIT does not immediately close the transaction. When the user
types COMMIT, the traffic-cop’s EndTransactionBlock only moves the block to
TBLOCK_END; the real CommitTransaction() runs later, when
CommitTransactionCommand is next called by the main loop. This split is what
lets control leave xact.c with the transaction still open so the main loop
can keep processing inside the same transaction. The low-level TransState
tracks the physical commit; the high-level TBlockState remembers that a
COMMIT is pending.
The TBLOCK_* machine is driven almost entirely from one giant switch in
CommitTransactionCommandInternal (and its abort sibling). The hot transitions:
stateDiagram-v2
[*] --> TBLOCK_DEFAULT
TBLOCK_DEFAULT --> TBLOCK_STARTED: StartTransactionCommand
TBLOCK_STARTED --> TBLOCK_DEFAULT: CommitTransaction \n implicit single-query xact
TBLOCK_STARTED --> TBLOCK_BEGIN: BEGIN
TBLOCK_BEGIN --> TBLOCK_INPROGRESS: CommitTransactionCommand
TBLOCK_INPROGRESS --> TBLOCK_INPROGRESS: CommandCounterIncrement \n per statement
TBLOCK_INPROGRESS --> TBLOCK_END: COMMIT received
TBLOCK_END --> TBLOCK_DEFAULT: CommitTransaction
TBLOCK_INPROGRESS --> TBLOCK_ABORT_PENDING: ROLLBACK received
TBLOCK_ABORT_PENDING --> TBLOCK_DEFAULT: AbortTransaction then CleanupTransaction
TBLOCK_INPROGRESS --> TBLOCK_ABORT: error inside block
TBLOCK_ABORT --> TBLOCK_ABORT_END: ROLLBACK received
TBLOCK_ABORT_END --> TBLOCK_DEFAULT: CleanupTransaction
TBLOCK_INPROGRESS --> TBLOCK_PREPARE: PREPARE TRANSACTION
TBLOCK_PREPARE --> TBLOCK_DEFAULT: PrepareTransaction
Figure 1 — The high-level TBlockState machine for a top-level transaction.
The crucial asymmetry: an error inside a block lands in TBLOCK_ABORT and
waits for the user’s ROLLBACK (any other command is ignored), whereas an
explicit ROLLBACK on a healthy block goes through TBLOCK_ABORT_PENDING
where the engine still has to do the abort. Both converge on
CleanupTransaction and back to idle. COMMIT never commits immediately — it
parks in TBLOCK_END until the next CommitTransactionCommand.
The README’s worked example makes the two-phase nature concrete. For BEGIN; SELECT; INSERT; COMMIT; the main loop calls StartTransactionCommand /
CommitTransactionCommand around every statement, and only the BEGIN
statement actually runs StartTransaction(), only the COMMIT statement
actually runs CommitTransaction(); the in-between CommitTransactionCommands
just call CommandCounterIncrement() so later commands see earlier commands’
effects.
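The worked example can be traced with a toy model of the two layers. This is a hedged sketch, not the real xact.c control flow: ToySession, the TB_* states, and the toy_* functions are hypothetical stand-ins that keep only the five block states needed for a plain BEGIN ... COMMIT block and count how often each low-level routine would run.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical, reduced version of the block-state machine. */
typedef enum { TB_DEFAULT, TB_STARTED, TB_BEGIN, TB_INPROGRESS, TB_END } ToyBlockState;

typedef struct ToySession
{
    ToyBlockState block;
    int starts;     /* times the low-level StartTransaction would run */
    int commits;    /* times the low-level CommitTransaction would run */
    int cci;        /* times CommandCounterIncrement would run */
} ToySession;

/* Middle layer, called before every statement. */
void toy_start_command(ToySession *s)
{
    if (s->block == TB_DEFAULT)
    {
        s->starts++;
        s->block = TB_STARTED;
    }
    /* otherwise we are already inside a block: nothing to start */
}

/* Top layer: BEGIN and COMMIT only record intent in the block state. */
void toy_exec(ToySession *s, const char *stmt)
{
    assert(stmt != NULL);
    if (strcmp(stmt, "BEGIN") == 0 && s->block == TB_STARTED)
        s->block = TB_BEGIN;
    else if (strcmp(stmt, "COMMIT") == 0 && s->block == TB_INPROGRESS)
        s->block = TB_END;
}

/* Middle layer, called after every statement: acts on the recorded intent. */
void toy_commit_command(ToySession *s)
{
    switch (s->block)
    {
        case TB_STARTED:    s->commits++; s->block = TB_DEFAULT; break;
        case TB_BEGIN:      s->block = TB_INPROGRESS; break;
        case TB_INPROGRESS: s->cci++; break;
        case TB_END:        s->commits++; s->block = TB_DEFAULT; break;
        case TB_DEFAULT:    break;
    }
}

/* One trip around the main loop for a single statement. */
void toy_run(ToySession *s, const char *stmt)
{
    toy_start_command(s);
    toy_exec(s, stmt);
    toy_commit_command(s);
}
```

Feeding BEGIN, SELECT, INSERT, COMMIT through toy_run on a zeroed ToySession leaves starts and commits at 1 and cci at 2, which is the shape the README example describes: four statements, one physical transaction.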
XID assignment: lazy, virtual-first, under XidGenLock
A transaction begins with no XID. StartTransaction assigns only a VXID,
which costs nothing in shared memory:
```c
// StartTransaction, from src/backend/access/transam/xact.c (condensed)
s->state = TRANS_START;
s->fullTransactionId = InvalidFullTransactionId;    /* until assigned */
// ...
vxid.procNumber = MyProcNumber;
vxid.localTransactionId = GetNextLocalTransactionId();
VirtualXactLockTableInsert(vxid);
MyProc->vxid.lxid = vxid.localTransactionId;
```

The real XID is allocated the first time something needs one;
GetCurrentTransactionId() is the public door, and it lazily assigns:
```c
// GetCurrentTransactionId, from src/backend/access/transam/xact.c
TransactionId
GetCurrentTransactionId(void)
{
    TransactionState s = CurrentTransactionState;

    if (!FullTransactionIdIsValid(s->fullTransactionId))
        AssignTransactionId(s);
    return XidFromFullTransactionId(s->fullTransactionId);
}
```

AssignTransactionId does four things that the in-tree README’s “Transaction
and Subtransaction Numbering” section explains: (1) if this is a subxact, it
first assigns XIDs to all unassigned parents, iteratively (never recursing
deeper than one frame), to preserve the invariant child XID > parent XID; (2)
it calls GetNewTransactionId; (3) for a subxact it records the parent link in
pg_subtrans via SubTransSetParent; (4) it takes the transaction-XID lock
(XactLockTableInsert) charged to this level’s ResourceOwner, and for a
top-level xact registers the XID with the predicate-lock (SSI) system.
```c
// AssignTransactionId, from src/backend/access/transam/xact.c (condensed)
s->fullTransactionId = GetNewTransactionId(isSubXact);
if (!isSubXact)
    XactTopFullTransactionId = s->fullTransactionId;
if (isSubXact)
    SubTransSetParent(XidFromFullTransactionId(s->fullTransactionId),
                      XidFromFullTransactionId(s->parent->fullTransactionId));
if (!isSubXact)
    RegisterPredicateLockingXid(XidFromFullTransactionId(s->fullTransactionId));

// take XID lock charged to this level's ResourceOwner
currentOwner = CurrentResourceOwner;
CurrentResourceOwner = s->curTransactionOwner;
XactLockTableInsert(XidFromFullTransactionId(s->fullTransactionId));
CurrentResourceOwner = currentOwner;
```

The actual counter bump lives in varsup.c. GetNewTransactionId is the only
place nextXid advances. It takes XidGenLock exclusively, checks the
wraparound limits (xidVacLimit / xidWarnLimit / xidStopLimit — the
escalating defenses that nag, then warn, then refuse new XIDs), extends the
on-disk SLRUs for the new page, advances the counter, and — critically —
publishes the XID into the shared ProcArray before releasing the lock:
```c
// GetNewTransactionId, from src/backend/access/transam/varsup.c (condensed)
LWLockAcquire(XidGenLock, LW_EXCLUSIVE);
full_xid = TransamVariables->nextXid;
xid = XidFromFullTransactionId(full_xid);
// ... wraparound-limit checks (xidStopLimit -> ERROR, xidWarnLimit -> WARNING) ...
ExtendCLOG(xid);        /* zero a new pg_xact page if needed */
ExtendCommitTs(xid);
ExtendSUBTRANS(xid);
FullTransactionIdAdvance(&TransamVariables->nextXid);
if (!isSubXact)
{
    /* LWLockRelease acts as barrier */
    MyProc->xid = xid;
    ProcGlobal->xids[MyProc->pgxactoff] = xid;
}
else
{
    /* store into the PGPROC subxid cache, or set overflowed */
    if (nxids < PGPROC_MAX_CACHED_SUBXIDS) { ... MyProc->subxids.xids[nxids] = xid; ... }
    else
        MyProc->subxidStatus.overflowed = substat->overflowed = true;
}
LWLockRelease(XidGenLock);
```

flowchart TD
NEED["a write needs an XID<br/>GetCurrentTransactionId"] --> HAVE{fullTransactionId<br/>already valid?}
HAVE -- yes --> RET["return existing XID"]
HAVE -- no --> SUB{is this a subxact<br/>with unassigned parents?}
SUB -- yes --> PAR["assign parents first<br/>iteratively, child XID > parent XID"]
SUB -- no --> GNT
PAR --> GNT["GetNewTransactionId under XidGenLock"]
GNT --> EXT["ExtendCLOG / CommitTs / SUBTRANS<br/>then advance nextXid"]
EXT --> PUB["publish XID into ProcArray<br/>(before releasing XidGenLock)"]
PUB --> POST["SubTransSetParent (subxact)<br/>RegisterPredicateLockingXid (top)<br/>XactLockTableInsert"]
POST --> RET
Figure 2 — Lazy XID assignment. The ordering rule from the README’s
“Interlocking” section is the load-bearing detail: GetNewTransactionId must
store the new XID into the ProcArray before releasing XidGenLock, so that
every top-level XID ≤ latestCompletedXid is guaranteed present in the
ProcArray (or no longer running). Without that ordering, a concurrent backend
could allocate and commit a later XID, advancing latestCompletedXid past this
one before it became visible — breaking the oldest-xmin computation that vacuum
relies on.
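The parents-first rule is easy to get wrong, so here is a minimal single-threaded sketch of it. Everything in it is an assumption made for illustration: ToyXact, toy_next_xid, toy_get_xid and the fixed TOY_MAX_DEPTH bound are hypothetical, and the shared lock, ProcArray publication, and pg_subtrans update described above are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_INVALID_XID 0u
#define TOY_MAX_DEPTH   64      /* arbitrary nesting bound for the sketch */

typedef struct ToyXact
{
    uint32_t        xid;        /* TOY_INVALID_XID until the first write */
    struct ToyXact *parent;     /* NULL for the top-level transaction */
} ToyXact;

/* Stand-in for the shared counter; a real engine guards it with a lock. */
static uint32_t toy_next_xid = 100;

/*
 * Return the XID of level s, assigning one lazily. Ancestors that still
 * lack an XID are assigned first, outermost first, using an explicit
 * array instead of recursion, so that child XID > parent XID always holds.
 */
uint32_t toy_get_xid(ToyXact *s)
{
    ToyXact *pending[TOY_MAX_DEPTH];
    size_t   n = 0;

    /* Walk upward collecting every level that has no XID yet. */
    for (ToyXact *p = s; p != NULL && p->xid == TOY_INVALID_XID; p = p->parent)
    {
        assert(n < TOY_MAX_DEPTH);
        pending[n++] = p;
    }

    /* Assign from the outermost pending level down to s. */
    while (n > 0)
        pending[--n]->xid = toy_next_xid++;

    return s->xid;
}
```

With a three-level stack where no level has written yet, the first call on the innermost level hands out three consecutive ids from the top down; a later call on any of those levels returns the id it already holds without touching the counter.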
CommitTransaction: the order of operations
CommitTransaction() is a long, deliberately ordered routine. Its shape is
three phases: pre-commit (may run user code, may still error into abort),
the durable instant (RecordTransactionCommit, then leave the running
set), and post-commit cleanup (must not fail the transaction). The
abbreviated sequence:
```c
// CommitTransaction, from src/backend/access/transam/xact.c (heavily condensed)

/* --- pre-commit: user code may run, errors reroute to abort --- */
for (;;)
{
    AfterTriggerFireDeferred();
    if (!PreCommit_Portals(false))
        break;
}
CallXactCallbacks(XACT_EVENT_PRE_COMMIT);
AtEOXact_Parallel(true);
AfterTriggerEndXact(true);
PreCommit_on_commit_actions();
smgrDoPendingSyncs(true, is_parallel_worker);
AtEOXact_LargeObject(true);
PreCommit_Notify();
if (!is_parallel_worker)
    PreCommit_CheckForSerializationFailure();   /* SSI may abort here */

HOLD_INTERRUPTS();
s->state = TRANS_COMMIT;

/* --- the durable instant --- */
if (!is_parallel_worker)
    latestXid = RecordTransactionCommit();      /* writes + flushes + CLOG */

ProcArrayEndTransaction(MyProc, latestXid);     /* leave running set */

/* --- post-commit: noncritical cleanup, must not fail the xact --- */
CallXactCallbacks(XACT_EVENT_COMMIT);
ResourceOwnerRelease(TopTransactionResourceOwner, RESOURCE_RELEASE_BEFORE_LOCKS, true, true);
AtEOXact_Buffers(true);
AtEOXact_RelationCache(true);
AtEOXact_Inval(true);                           /* publish catalog invals */
ResourceOwnerRelease(TopTransactionResourceOwner, RESOURCE_RELEASE_LOCKS, true, true);
ResourceOwnerRelease(TopTransactionResourceOwner, RESOURCE_RELEASE_AFTER_LOCKS, true, true);
smgrDoPendingDeletes(true);                     /* drop files of dropped rels */
AtCommit_Notify();
/* ... AtEOXact_GUC / SPI / Namespace / PgStat / Snapshot ... */
AtCommit_Memory();                              /* free TopTransactionContext */
s->state = TRANS_DEFAULT;
RESUME_INTERRUPTS();
```

Two ordering rules in this sequence come straight from the README and are worth naming because every other DBMS has to solve the same problem:
- Leave the running set (ProcArrayEndTransaction) only after RecordTransactionCommit, and only before releasing locks. A concurrent snapshot-taker that no longer sees this XID as running must already be able to see it as committed in CLOG; and lock waiters must not be woken until the transaction is fully cleaned up from their point of view.
- Publish catalog invalidations (AtEOXact_Inval) after dropping relcache references but before releasing locks, so that any backend waiting on a lock for a relation this transaction modified learns about the catalog change before it starts using the relation.
RecordTransactionCommit: where durability happens
This is the one function in the module that contains the point of no return.
The structure is: gather what the commit record needs (dropped files, committed
children, dropped stats, invalidation messages), decide whether there is even an
XID to commit, and if so enter a critical section, write the record, then flush
or not based on synchronous_commit, then update CLOG.
```c
// RecordTransactionCommit, from src/backend/access/transam/xact.c (condensed)
TransactionId xid = GetTopTransactionIdIfAny();
bool markXidCommitted = TransactionIdIsValid(xid);
// ... gather nrels, nchildren, ndroppedstats, invalMessages ...

if (!markXidCommitted)
{
    /* no XID: nothing to commit. Only flush if we wrote WAL (e.g. HOT pruning) */
    if (!wrote_xlog)
        goto cleanup;
}
else
{
    /* force concurrent checkpoint to wait until pg_xact is updated */
    START_CRIT_SECTION();
    MyProc->delayChkptFlags |= DELAY_CHKPT_START;

    XactLogCommitRecord(GetCurrentTransactionStopTimestamp(),
                        nchildren, children, nrels, rels,
                        ndroppedstats, droppedstats,
                        nmsgs, invalMessages,
                        RelcacheInitFileInval, MyXactFlags,
                        InvalidTransactionId, NULL /* plain commit */);
    TransactionTreeSetCommitTsData(xid, nchildren, children, ...);
}

if ((wrote_xlog && markXidCommitted &&
     synchronous_commit > SYNCHRONOUS_COMMIT_OFF) ||
    forceSyncCommit || nrels > 0)
{
    XLogFlush(XactLastRecEnd);                  /* SYNCHRONOUS path */
    if (markXidCommitted)
        TransactionIdCommitTree(xid, nchildren, children);  /* CLOG = COMMITTED now */
}
else
{
    XLogSetAsyncXactLSN(XactLastRecEnd);        /* ASYNCHRONOUS path */
    if (markXidCommitted)
        TransactionIdAsyncCommitTree(xid, nchildren, children, XactLastRecEnd);
}

if (markXidCommitted)
{
    MyProc->delayChkptFlags &= ~DELAY_CHKPT_START;
    END_CRIT_SECTION();
}
if (wrote_xlog && markXidCommitted)
    SyncRepWaitForLSN(XactLastRecEnd, true);    /* wait for sync standby */
```

Three things to read off this:
- No XID, no commit record. A read-only transaction never got an XID, so markXidCommitted is false and there is literally nothing to commit; it skips to cleanup. The only reason such a transaction reaches the flush logic at all is that it wrote WAL for some side reason (HOT pruning), in which case the flush is for durability of that, not of a commit. A transaction that wrote only to temporary or unlogged tables is a different case: it does have an XID and does get a commit record, but because it wrote no WAL before that record (wrote_xlog is false) a synchronous flush is not forced.
- The commit critical section + DELAY_CHKPT_START. From just before the WAL record is written until CLOG has been updated, the backend holds DELAY_CHKPT_START so a concurrent checkpoint cannot move its redo pointer past the commit record while failing to flush the CLOG update. Without this interlock a crash just after checkpoint could lose a commit that the WAL already recorded.
- Synchronous vs asynchronous commit. With synchronous_commit on, the WAL is flushed (XLogFlush) and CLOG is updated synchronously (TransactionIdCommitTree). With synchronous_commit=off, the WAL is not flushed; instead the commit LSN is noted (XLogSetAsyncXactLSN) so the walwriter flushes it soon, and CLOG is updated asynchronously: the actual CLOG write defers until WAL up to that LSN is known flushed (the TransactionIdAsyncCommitTree path; see the README’s “Asynchronous Commit” section and postgres-clog-commit-ts.md). The order (record, then flush, then CLOG) is invariant; what async commit relaxes is when the flush completes, trading a small window of possible commit loss on crash for latency.
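The flush decision in the condensed listing is a pure boolean function of five inputs, which makes it easy to isolate and check. The sketch below restates that condition with hypothetical names (ToySyncLevel, toy_commit_must_flush); the three-value enum is a simplification that assumes only the comparison against the “off” level matters.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, reduced stand-in for the synchronous_commit setting. */
typedef enum { TOY_SYNC_OFF = 0, TOY_SYNC_LOCAL = 1, TOY_SYNC_ON = 2 } ToySyncLevel;

/*
 * True when the committing backend must flush WAL itself before
 * marking the commit, false when the flush may be left to a
 * background writer.
 */
bool toy_commit_must_flush(bool wrote_xlog, bool has_xid, ToySyncLevel level,
                           bool force_sync, int ndropped_rels)
{
    assert(ndropped_rels >= 0);
    return (wrote_xlog && has_xid && level > TOY_SYNC_OFF)
        || force_sync
        || ndropped_rels > 0;
}
```

Two cases are worth reading off the predicate: dropping a relation forces a synchronous flush even with the setting off, and a transaction that holds an XID but wrote no WAL before its commit record is not forced to flush even with the setting on.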
The record itself is built by XactLogCommitRecord, which is a study in
optional sub-records: a bare xl_xact_commit is just a timestamp, and each
extra payload (subxacts, dropped relfilelocators, dropped stats, invalidation
messages, two-phase xid/gid, replication origin) is appended only if present,
its presence flagged in an xinfo bitmask. This is why the same function
serves plain commits and COMMIT PREPARED — the only difference is whether
twophase_xid is valid, which flips the opcode from XLOG_XACT_COMMIT to
XLOG_XACT_COMMIT_PREPARED.
AbortTransaction: the mirror image, but free
Abort is structurally similar but reversed in spirit. It first releases lightweight resources as fast as possible (the README: “release all shared resources so that we do not delay other backends”), then records the abort, leaves the running set, then does the heavier cleanup:
```c
// AbortTransaction, from src/backend/access/transam/xact.c (condensed)
HOLD_INTERRUPTS();
AtAbort_Memory();
AtAbort_ResourceOwner();
LWLockReleaseAll();         /* drop LWLocks immediately */
UnlockBuffers();
XLogResetInsertion();       /* discard half-built WAL record */
LockErrorCleanup();
s->state = TRANS_ABORT;
SetUserIdAndSecContext(s->prevUser, s->prevSecContext);     /* undo SECURITY DEFINER */
// ... AtEOXact_Parallel(false), AfterTriggerEndXact(false), AtAbort_Portals() ...

if (!is_parallel_worker)
    latestXid = RecordTransactionAbort(false);  /* writes record, NO flush */

ProcArrayEndTransaction(MyProc, latestXid);

if (TopTransactionResourceOwner != NULL)
{
    CallXactCallbacks(XACT_EVENT_ABORT);
    ResourceOwnerRelease(..., RESOURCE_RELEASE_BEFORE_LOCKS, false, true);
    // ... buffers, relcache, inval, multixact ...
    ResourceOwnerRelease(..., RESOURCE_RELEASE_LOCKS, false, true);
    ResourceOwnerRelease(..., RESOURCE_RELEASE_AFTER_LOCKS, false, true);
    smgrDoPendingDeletes(false);
    // ... GUC / SPI / PgStat ...
}
/* State remains TRANS_ABORT until CleanupTransaction(). */
RESUME_INTERRUPTS();
```

The key difference is in RecordTransactionAbort: it writes the abort record
inside a critical section and immediately marks CLOG aborted via
TransactionIdAbortTree, but it never flushes and never sets
DELAY_CHKPT_START:
```c
// RecordTransactionAbort, from src/backend/access/transam/xact.c (condensed)
TransactionId xid = GetCurrentTransactionIdIfAny();

if (!TransactionIdIsValid(xid))
{
    /* no XID -> nobody cares we aborted */
    if (!isSubXact)
        XactLastRecEnd = 0;
    return InvalidTransactionId;
}
if (TransactionIdDidCommit(xid))    /* sanity: didn't half-commit */
    elog(PANIC, "cannot abort transaction %u, it was already committed", xid);

START_CRIT_SECTION();
XactLogAbortRecord(xact_time, nchildren, children, nrels, rels,
                   ndroppedstats, droppedstats, MyXactFlags,
                   InvalidTransactionId, NULL);
if (!isSubXact)
    XLogSetAsyncXactLSN(XactLastRecEnd);    /* nudge walwriter, but do not block */
TransactionIdAbortTree(xid, nchildren, children);   /* CLOG = ABORTED */
END_CRIT_SECTION();
```

The comment in the source is explicit: “We do not flush XLOG to disk here, since
the default assumption after a crash would be that we aborted, anyway.” This is
the universal “abort is free” asymmetry made concrete. Note also the two-phase
abort handling the README describes: AbortTransaction releases shared
resources immediately (so other backends aren’t delayed), but
CleanupTransaction — which finally tears down TopTransactionContext and
returns to TRANS_DEFAULT — does not run until the user actually issues
ROLLBACK. That is why a failed transaction block sits in TBLOCK_ABORT
ignoring everything until it sees a termination command.
Subtransactions: a stack with upward merge
Savepoints, PL/pgSQL exception blocks, and internal subtransactions are all
built on the same primitive: a stack of TransactionStateData nodes linked by
parent. The struct carries everything a level needs to be unwound
independently:
```c
// TransactionStateData, from src/backend/access/transam/xact.c (condensed)
typedef struct TransactionStateData
{
    FullTransactionId fullTransactionId;    /* my XID (lazy; may be invalid) */
    SubTransactionId subTransactionId;      /* my subxact ID */
    char       *name;                       /* savepoint name, if any */
    int         savepointLevel;
    TransState  state;                      /* low-level state */
    TBlockState blockState;                 /* high-level state */
    int         nestingLevel;               /* transaction nesting depth */
    int         gucNestLevel;
    MemoryContext curTransactionContext;    /* my xact-lifetime context */
    ResourceOwner curTransactionOwner;      /* my query resources */
    TransactionId *childXids;               /* subcommitted child XIDs, in XID order */
    int         nChildXids;
    int         maxChildXids;
    Oid         prevUser;
    int         prevSecContext;
    bool        prevXactReadOnly;
    bool        startedInRecovery;
    bool        didLogXid;
    // ... parallel-mode bookkeeping, chain flag ...
    struct TransactionStateData *parent;    /* back link to parent */
} TransactionStateData;
```

PushTransaction allocates a new node in TopTransactionContext, bumps the
backend-local currentSubTransactionId counter (the top level is
SubTransactionId 1; subxacts are 2 and up; the counter resets per top
transaction and is not an XID), links it to the current node, and sets
blockState = TBLOCK_SUBBEGIN. StartSubTransaction then initializes the
subsystems (AtSubStart_Memory, AtSubStart_ResourceOwner) and moves the node
to TRANS_INPROGRESS. Note the subxact node starts with no XID — like a top
transaction, a savepoint that only reads never burns an XID.
flowchart TB
TOP["top TransactionStateData<br/>subxid 1, XID 100 (lazy)<br/>childXids: [101, 102]"]
S1["subxact node<br/>subxid 2, XID 103<br/>parent ->"]
S2["subxact node (current)<br/>subxid 3, XID (none yet)<br/>parent ->"]
S2 --> S1 --> TOP
note["PushTransaction: allocate + link + TBLOCK_SUBBEGIN<br/>PopTransaction: relink CurrentTransactionState to parent, free node"]
Figure 3 — The subtransaction stack. Each node is an independent unwind unit
with its own ResourceOwner and memory context, but XIDs are still lazy per
node. The invariant from AssignTransactionId guarantees XIDs increase down
the stack (parent always gets an XID before a child needs one), which is why
childXids[] stays sorted.
The interesting moment is subcommit. A committing subtransaction does not
write its own commit record and does not mark CLOG committed. Instead
CommitSubTransaction merges its identity upward into the parent and releases
only its own XID lock; all other locks transfer to the parent’s ResourceOwner:
```c
// CommitSubTransaction, from src/backend/access/transam/xact.c (condensed)
s->state = TRANS_COMMIT;
CommandCounterIncrement();      /* make subxact's commands visible */

/* Prior to 8.4 we marked subcommit in clog here; now deferred to top-level. */
if (FullTransactionIdIsValid(s->fullTransactionId))
    AtSubCommit_childXids();    /* merge my XID + my children into parent */
// ... AfterTriggerEndSubXact(true), AtSubCommit_Portals(...), callbacks ...

CurrentResourceOwner = s->curTransactionOwner;
if (FullTransactionIdIsValid(s->fullTransactionId))
    XactLockTableDelete(XidFromFullTransactionId(s->fullTransactionId));
/* Other locks transfer to parent */
ResourceOwnerRelease(s->curTransactionOwner, RESOURCE_RELEASE_LOCKS, true, false);
// ...
CurrentResourceOwner = s->parent->curTransactionOwner;
PopTransaction();
```

AtSubCommit_childXids is the merge: it grows the parent’s childXids[] array
(doubling to amortize, capped at MaxAllocSize) and copies the subxact’s own
XID followed by its children, keeping the array sorted by relying on the
“child XID > parent XID” invariant. When the top transaction finally commits,
RecordTransactionCommit passes this whole children array to
XactLogCommitRecord and TransactionIdCommitTree, so a single WAL commit record
covers the entire tree and the tree’s CLOG status is updated as one atomic
step. The README’s “pg_xact and pg_subtrans” section describes the corner
case: when the tree’s status bits span multiple CLOG pages, an intermediate
“sub-committed” state keeps the multi-page update atomic; when they all fit on
a single page they are flipped to committed at once with no intermediate state.
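The upward merge described above can be sketched as an append onto a growable array. This is a minimal standalone illustration with hypothetical names (`ToyXidList`, `toy_merge_child_xids`, `toy_is_sorted`); it uses plain `realloc` and omits the MaxAllocSize cap, whereas the real function allocates in the transaction's memory context.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef uint32_t ToyXid;

typedef struct ToyXidList
{
    ToyXid *xids;   /* ascending XIDs of committed children */
    size_t  n;      /* entries in use */
    size_t  cap;    /* entries allocated */
} ToyXidList;

/*
 * Append the committing subxact's own XID, then its already-sorted
 * children, to the parent's list. Returns 0 on success, -1 if the
 * allocation fails.
 */
int toy_merge_child_xids(ToyXidList *parent, ToyXid own,
                         const ToyXid *children, size_t nchildren)
{
    size_t needed = parent->n + 1 + nchildren;

    if (needed > parent->cap)
    {
        size_t  newcap = needed * 2;    /* double to amortize regrowth */
        ToyXid *p = realloc(parent->xids, newcap * sizeof(ToyXid));

        if (p == NULL)
            return -1;
        parent->xids = p;
        parent->cap = newcap;
    }

    /* No sort is needed as long as: existing entries < own < children. */
    assert(parent->n == 0 || parent->xids[parent->n - 1] < own);
    assert(nchildren == 0 || own < children[0]);

    parent->xids[parent->n] = own;
    if (nchildren > 0)
        memcpy(&parent->xids[parent->n + 1], children,
               nchildren * sizeof(ToyXid));
    parent->n = needed;
    return 0;
}

/* Returns 1 if the list is strictly ascending, 0 otherwise. */
int toy_is_sorted(const ToyXidList *list)
{
    for (size_t i = 1; i < list->n; i++)
        if (list->xids[i - 1] >= list->xids[i])
            return 0;
    return 1;
}
```

The two assertions encode the ordering argument: because earlier siblings finished before this subxact started, and a child is always numbered after its parent, a plain append preserves sortedness.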
Subabort is eager where subcommit is lazy. AbortSubTransaction calls
RecordTransactionAbort(true) immediately (writing an abort record, marking
CLOG aborted, and calling XidCacheRemoveRunningXids to drop the failed XIDs
from the PGPROC running-children cache right away), then CleanupSubTransaction
pops the node. The reason for the asymmetry is the same as at top level: an
aborted subxact’s XID must be observably aborted now so that
XactLockTableWait and visibility checks short-circuit, whereas a committed
subxact’s fate is still contingent on its ancestors, so its CLOG mark waits for
the top.
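The asymmetry can be made concrete with a toy status table. The sketch below is an assumption-laden model, not the CLOG code: `toy_clog`, `ToyStatus`, and the `toy_*` functions are hypothetical, XIDs are plain array indexes, and WAL is not modeled at all.

```c
#include <assert.h>
#include <stddef.h>

typedef enum
{
    TOY_IN_PROGRESS = 0,
    TOY_COMMITTED,
    TOY_ABORTED
} ToyStatus;

#define TOY_CLOG_SIZE 256
static ToyStatus toy_clog[TOY_CLOG_SIZE];  /* zero-init: all in progress */

static void toy_set_status(const unsigned *xids, size_t n, ToyStatus status)
{
    for (size_t i = 0; i < n; i++)
    {
        assert(xids[i] < TOY_CLOG_SIZE);
        toy_clog[xids[i]] = status;
    }
}

ToyStatus toy_status(unsigned xid)
{
    assert(xid < TOY_CLOG_SIZE);
    return toy_clog[xid];
}

/* Subcommit: lazy. The status table is deliberately left untouched. */
void toy_subcommit(unsigned xid)
{
    assert(toy_status(xid) == TOY_IN_PROGRESS);
}

/* Subabort: eager. The failed XIDs are observably aborted right away. */
void toy_subabort(const unsigned *xids, size_t n)
{
    toy_set_status(xids, n, TOY_ABORTED);
}

/* Top-level commit: the top XID and all merged children flip together. */
void toy_top_commit(unsigned top, const unsigned *children, size_t n)
{
    toy_set_status(children, n, TOY_COMMITTED);
    toy_set_status(&top, 1, TOY_COMMITTED);
}
```

After `toy_subcommit(101)` the entry for 101 still reads as in progress, because its fate depends on the top level; after `toy_subabort` the failed entry reads as aborted immediately and stays that way even when the top level later commits.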
ROLLBACK TO <savepoint> is built from these primitives: the engine aborts every
subtransaction from the current level up through the named one, then re-creates
that level under the same name (“a completely new subtransaction as far as the
internals are concerned”). RELEASE simply commits (merges up) the named level
and every level nested inside it.
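A name stack is enough to illustrate how the two commands differ. The sketch below is hypothetical throughout (`ToySavepoints`, the `toy_*` functions, the fixed depth and name limits); it only counts aborted and merged levels rather than performing any real cleanup.

```c
#include <assert.h>
#include <string.h>

#define TOY_MAX_DEPTH 16
#define TOY_NAME_LEN  32

typedef struct ToySavepoints
{
    char     names[TOY_MAX_DEPTH][TOY_NAME_LEN];
    unsigned subxids[TOY_MAX_DEPTH];
    int      depth;        /* number of open savepoint levels */
    unsigned next_subxid;  /* backend-local counter; top level holds 1 */
    int      aborted;      /* running count of aborted levels */
    int      merged;       /* running count of levels merged upward */
} ToySavepoints;

void toy_init(ToySavepoints *sp)
{
    memset(sp, 0, sizeof(*sp));
    sp->next_subxid = 2;
}

/* Innermost savepoint with this name, or -1 if there is none. */
static int toy_find(const ToySavepoints *sp, const char *name)
{
    for (int i = sp->depth - 1; i >= 0; i--)
        if (strcmp(sp->names[i], name) == 0)
            return i;
    return -1;
}

int toy_savepoint(ToySavepoints *sp, const char *name)
{
    if (sp->depth >= TOY_MAX_DEPTH || strlen(name) >= TOY_NAME_LEN)
        return -1;
    strcpy(sp->names[sp->depth], name);
    sp->subxids[sp->depth] = sp->next_subxid++;
    sp->depth++;
    return 0;
}

/* ROLLBACK TO: abort down through the target, then re-create it. */
int toy_rollback_to(ToySavepoints *sp, const char *name)
{
    int target = toy_find(sp, name);

    if (target < 0)
        return -1;
    sp->aborted += sp->depth - target;
    sp->subxids[target] = sp->next_subxid++;  /* same name, new identity */
    sp->depth = target + 1;
    return 0;
}

/* RELEASE: merge the target and everything nested inside it upward. */
int toy_release(ToySavepoints *sp, const char *name)
{
    int target = toy_find(sp, name);

    if (target < 0)
        return -1;
    sp->merged += sp->depth - target;
    sp->depth = target;
    return 0;
}
```

With savepoints a, b, c open, rolling back to b aborts two levels (c and b) and leaves a fresh b with a new subxid; a later release of a then merges both remaining levels in one step.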
Resource-cleanup callbacks
The end-of-transaction work is dispatched through two parallel families of
callbacks plus the ResourceOwner phases. The AtEOXact_* functions
(buffers, relcache, inval, GUC, SPI, namespace, pgstat, snapshot, …) run at top
commit/abort with a boolean isCommit; the AtEOSubXact_* family does the same
at subtransaction boundaries, additionally taking the sub/parent subxids so
state can be re-parented rather than discarded. Extensions plug into the same
seam through registered callbacks:
```c
// RegisterXactCallback / the event enum, src/include/access/xact.h
typedef enum
{
    XACT_EVENT_COMMIT,
    XACT_EVENT_PARALLEL_COMMIT,
    XACT_EVENT_ABORT,
    XACT_EVENT_PARALLEL_ABORT,
    XACT_EVENT_PREPARE,
    XACT_EVENT_PRE_COMMIT,
    XACT_EVENT_PARALLEL_PRE_COMMIT,
    XACT_EVENT_PRE_PREPARE,
} XactEvent;

typedef void (*XactCallback) (XactEvent event, void *arg);

extern void RegisterXactCallback(XactCallback callback, void *arg);
```
The PRE_COMMIT events fire while user code can still run and an error can
still reroute to abort; the plain COMMIT/ABORT events fire in the
post-commit phase where failure is no longer survivable. This is why a callback
that might fail belongs on PRE_COMMIT, not COMMIT.
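The PRE_COMMIT versus COMMIT distinction can be shown with a toy dispatcher. Everything here is hypothetical (`ToyEvent`, `toy_register`, `toy_commit`, `toy_validate`), and it makes one loud simplification: toy callbacks signal failure with a nonzero return value, whereas real XactCallback functions return void and raise errors through the backend's error machinery.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { TOY_EVENT_PRE_COMMIT, TOY_EVENT_COMMIT, TOY_EVENT_ABORT } ToyEvent;
typedef enum { TOY_RESULT_COMMITTED, TOY_RESULT_ABORTED } ToyResult;

/* Toy convention: a nonzero return means "this callback raised an error". */
typedef int (*ToyCallback)(ToyEvent event, void *arg);

#define TOY_MAX_CALLBACKS 8
static struct { ToyCallback fn; void *arg; } toy_callbacks[TOY_MAX_CALLBACKS];
static size_t toy_ncallbacks = 0;

int toy_register(ToyCallback fn, void *arg)
{
    if (toy_ncallbacks >= TOY_MAX_CALLBACKS)
        return -1;
    toy_callbacks[toy_ncallbacks].fn = fn;
    toy_callbacks[toy_ncallbacks].arg = arg;
    toy_ncallbacks++;
    return 0;
}

/* Fire one event; the first error stops the sweep. */
static int toy_fire(ToyEvent event)
{
    for (size_t i = 0; i < toy_ncallbacks; i++)
        if (toy_callbacks[i].fn(event, toy_callbacks[i].arg) != 0)
            return 1;
    return 0;
}

/* *durable stands in for "the commit record has reached disk". */
ToyResult toy_commit(int *durable)
{
    if (toy_fire(TOY_EVENT_PRE_COMMIT) != 0)
    {
        (void) toy_fire(TOY_EVENT_ABORT);   /* error rerouted to abort */
        return TOY_RESULT_ABORTED;
    }
    *durable = 1;
    (void) toy_fire(TOY_EVENT_COMMIT);      /* too late to change the outcome */
    return TOY_RESULT_COMMITTED;
}

/* Example callback: fails in PRE_COMMIT while *(int *) arg is nonzero. */
int toy_validate(ToyEvent event, void *arg)
{
    const int *should_fail = arg;

    return (event == TOY_EVENT_PRE_COMMIT && *should_fail) ? 1 : 0;
}
```

In the toy, a failing validation keeps `*durable` at zero and yields an abort, while the return value of a COMMIT-phase callback is ignored because the outcome is already fixed by then.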
The ResourceOwner release happens in three ordered phases:
RESOURCE_RELEASE_BEFORE_LOCKS (buffer pins, relcache references: things
visible to other backends), RESOURCE_RELEASE_LOCKS (the heavyweight locks
themselves), and RESOURCE_RELEASE_AFTER_LOCKS (backend-local odds and ends such
as catcache references and temporary files). The ordering means locks are
dropped only after everything another backend could observe has been released,
so a waiter that acquires the lock sees this transaction’s externally visible
state as fully cleaned up. The mechanism itself is documented in
postgres-resource-owners.md (planned, base-infra); xact.c is its primary
caller.
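The phase ordering amounts to sweeping the owned resources once per phase. The following standalone sketch uses hypothetical names (`ToyPhase`, `ToyResource`, `toy_release_all`) and records only the order of release; it is not the ResourceOwner API.

```c
#include <assert.h>
#include <stddef.h>

typedef enum
{
    TOY_BEFORE_LOCKS = 0,   /* externally visible resources */
    TOY_LOCKS,              /* the locks themselves */
    TOY_AFTER_LOCKS,        /* backend-local resources */
    TOY_NPHASES
} ToyPhase;

typedef struct ToyResource
{
    const char *name;
    ToyPhase    phase;
    int         released_at;    /* sequence number, -1 while still held */
} ToyResource;

/*
 * Release everything phase by phase, regardless of registration order.
 * Returns the number of resources released.
 */
size_t toy_release_all(ToyResource *res, size_t n)
{
    size_t seq = 0;

    for (int phase = TOY_BEFORE_LOCKS; phase < TOY_NPHASES; phase++)
        for (size_t i = 0; i < n; i++)
            if (res[i].phase == (ToyPhase) phase)
                res[i].released_at = (int) seq++;
    return seq;
}
```

Registering a lock, a catcache reference, and a buffer pin in that order still releases the buffer pin first and the catcache reference last, which is the whole point of tagging each resource with a phase.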
The two-phase commit seam
xact.c contains PrepareTransaction, but two-phase commit proper (PREPARE
TRANSACTION, the on-disk twophase state files, COMMIT/ROLLBACK PREPARED,
recovery of prepared xacts) lives in twophase.c and is the subject
of postgres-two-phase-commit.md. The seam this document owns is narrow and
worth stating precisely so the boundary is clear:
- The TBLOCK_PREPARE block state routes to PrepareTransaction(), which is a near-twin of CommitTransaction() but, instead of RecordTransactionCommit, hands the transaction’s state to the twophase machinery (StartPrepare/EndPrepare) that durably writes a prepare record and a state file.
- XactLogCommitRecord and XactLogAbortRecord are shared between plain and prepared paths: passing a valid twophase_xid flips the opcode to XLOG_XACT_COMMIT_PREPARED / XLOG_XACT_ABORT_PREPARED and appends the xl_xact_twophase (and optionally GID) sub-record. So the WAL record format for commit/abort is defined here; the protocol that uses the prepared variants is defined there.
Everything else about 2PC — the resolver, the GID namespace, surviving a restart with prepared transactions still pending — is deliberately out of scope here.
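The opcode flip on the shared record writers can be sketched in isolation. The names below are hypothetical (`ToyRecordHeader`, `toy_commit_header`, `toy_abort_header`), and the opcode and flag values are illustrative stand-ins; xact.h remains the authority for the real numbers.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t ToyXid;
#define TOY_INVALID_XID 0u

/* Illustrative values only; the real definitions live in xact.h. */
enum
{
    TOY_XACT_COMMIT          = 0x00,
    TOY_XACT_ABORT           = 0x20,
    TOY_XACT_COMMIT_PREPARED = 0x30,
    TOY_XACT_ABORT_PREPARED  = 0x40
};
#define TOY_XINFO_HAS_TWOPHASE (1u << 4)

typedef struct ToyRecordHeader
{
    uint8_t  opcode;
    uint32_t xinfo;         /* which optional sub-records follow */
    ToyXid   twophase_xid;  /* meaningful only with HAS_TWOPHASE set */
} ToyRecordHeader;

/* Shared shape: a valid twophase_xid selects the *_PREPARED opcode. */
static ToyRecordHeader toy_header(uint8_t plain, uint8_t prepared,
                                  ToyXid twophase_xid)
{
    ToyRecordHeader h = {plain, 0, TOY_INVALID_XID};

    if (twophase_xid != TOY_INVALID_XID)
    {
        h.opcode = prepared;
        h.xinfo |= TOY_XINFO_HAS_TWOPHASE;
        h.twophase_xid = twophase_xid;
    }
    return h;
}

ToyRecordHeader toy_commit_header(ToyXid twophase_xid)
{
    return toy_header(TOY_XACT_COMMIT, TOY_XACT_COMMIT_PREPARED, twophase_xid);
}

ToyRecordHeader toy_abort_header(ToyXid twophase_xid)
{
    return toy_header(TOY_XACT_ABORT, TOY_XACT_ABORT_PREPARED, twophase_xid);
}
```

A plain commit passes the invalid XID and gets the plain opcode with no twophase flag; COMMIT PREPARED passes the prepared transaction's XID and gets both the alternate opcode and the extra sub-record flag from the same code path.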
Source Walkthrough
Grouped by sub-system. Symbols are the stable anchor; the position-hint table
at the end pins line numbers to commit 273fe94.
State and the top-level lifecycle
- TransState, TBlockState (enums, xact.c): the two state machines.
- TransactionStateData / TransactionState (xact.c): the stack node; TopTransactionStateData is the static top node, CurrentTransactionState the cursor into the stack.
- StartTransaction (xact.c): assigns VXID, resets per-xact counters, initializes memory/resource owner; leaves state TRANS_INPROGRESS.
- CommitTransaction (xact.c): the ordered commit pipeline.
- AbortTransaction (xact.c): fast resource release, record abort, cleanup.
- CleanupTransaction (xact.c): final teardown after abort; tears down TopTransactionContext, returns to TRANS_DEFAULT.
- PrepareTransaction (xact.c): the 2PC twin of CommitTransaction (seam).
- CommitTransactionCommandInternal / AbortCurrentTransactionInternal (xact.c): the TBLOCK_* switch that drives block-state transitions.
- IsTransactionState, IsAbortedTransactionBlockState (xact.c): state predicates used pervasively by the rest of the backend.
XID assignment and naming
- GetCurrentTransactionId, GetTopTransactionId (xact.c): public doors that lazily call AssignTransactionId.
- GetCurrentTransactionIdIfAny, GetTopTransactionIdIfAny (xact.c): the non-assigning variants (return Invalid if no XID yet).
- AssignTransactionId (xact.c): parent-first assignment, pg_subtrans link, XID lock, SSI registration, hot-standby XLOG_XACT_ASSIGNMENT batching.
- GetNewTransactionId (varsup.c): the only place nextXid advances; wraparound limits, SLRU extension, ProcArray publication under XidGenLock.
- ReadNextFullTransactionId, AdvanceNextFullTransactionIdPastXid (varsup.c): read / recovery-time advance.
- GetCurrentSubTransactionId, GetCurrentCommandId (xact.c): the backend-local sub/command counters.
The durable records
- RecordTransactionCommit (xact.c): the point of no return: crit section + DELAY_CHKPT_START, write, flush-or-async, CLOG update, sync-rep wait.
- RecordTransactionAbort (xact.c): write abort record, mark CLOG aborted, never flush.
- XactLogCommitRecord, XactLogAbortRecord (xact.c): assemble the variable WAL record from optional sub-records keyed by the xinfo mask; shared with the 2PC paths.
- xl_xact_commit, xl_xact_abort, xl_xact_xinfo, xl_xact_subxacts, xl_xact_assignment, the XACT_XINFO_HAS_* flags, the XLOG_XACT_* opcodes (xact.h): the on-disk record vocabulary.
- xact_redo (xact.c, declared in xact.h): the rmgr redo entry for these records (recovery side; detail in postgres-recovery-redo.md).
Subtransactions
- PushTransaction, PopTransaction (xact.c): stack push/pop; subxid counter bump and wraparound guard.
- StartSubTransaction, CommitSubTransaction, AbortSubTransaction, CleanupSubTransaction (xact.c): the subxact lifecycle.
- AtSubCommit_childXids (xact.c): the upward merge into parent childXids[].
- xactGetCommittedChildren (xact.c): hands the committed-children array to the record writers.
- DefineSavepoint, ReleaseSavepoint, RollbackToSavepoint, BeginInternalSubTransaction, ReleaseCurrentSubTransaction, RollbackAndReleaseCurrentSubTransaction (xact.c, declared in xact.h): the SQL and internal entry points.
Callbacks and resource cleanup
- RegisterXactCallback / UnregisterXactCallback, RegisterSubXactCallback / UnregisterSubXactCallback (xact.c): extension hook registration; CallXactCallbacks / CallSubXactCallbacks fire them.
- The AtEOXact_* and AtEOSubXact_* families (called from xact.c, defined across many modules): per-subsystem end-of-(sub)transaction cleanup.
- XactEvent, SubXactEvent (xact.h): the event enums.
Position hints (as of 2026-06-05, REL_18 273fe94)
| Symbol | File | Line |
|---|---|---|
| TransState (enum) | src/backend/access/transam/xact.c | 141 |
| TBlockState (enum) | src/backend/access/transam/xact.c | 157 |
| TransactionStateData (struct) | src/backend/access/transam/xact.c | 193 |
| TopTransactionStateData | src/backend/access/transam/xact.c | 247 |
| IsTransactionState | src/backend/access/transam/xact.c | 386 |
| GetTopTransactionId | src/backend/access/transam/xact.c | 426 |
| GetCurrentTransactionId | src/backend/access/transam/xact.c | 454 |
| AssignTransactionId | src/backend/access/transam/xact.c | 635 |
| RecordTransactionCommit | src/backend/access/transam/xact.c | 1315 |
| AtSubCommit_childXids | src/backend/access/transam/xact.c | 1664 |
| RecordTransactionAbort | src/backend/access/transam/xact.c | 1754 |
| StartTransaction | src/backend/access/transam/xact.c | 2064 |
| CommitTransaction | src/backend/access/transam/xact.c | 2228 |
| PrepareTransaction | src/backend/access/transam/xact.c | 2514 |
| AbortTransaction | src/backend/access/transam/xact.c | 2809 |
| CleanupTransaction | src/backend/access/transam/xact.c | 3008 |
| CommitTransactionCommandInternal | src/backend/access/transam/xact.c | 3175 |
| StartSubTransaction | src/backend/access/transam/xact.c | 5067 |
| CommitSubTransaction | src/backend/access/transam/xact.c | 5104 |
| AbortSubTransaction | src/backend/access/transam/xact.c | 5219 |
| CleanupSubTransaction | src/backend/access/transam/xact.c | 5383 |
| PushTransaction | src/backend/access/transam/xact.c | 5416 |
| PopTransaction | src/backend/access/transam/xact.c | 5478 |
| xactGetCommittedChildren | src/backend/access/transam/xact.c | 5790 |
| XactLogCommitRecord | src/backend/access/transam/xact.c | 5814 |
| XactLogAbortRecord | src/backend/access/transam/xact.c | 5986 |
| GetNewTransactionId | src/backend/access/transam/varsup.c | 77 |
| ReadNextFullTransactionId | src/backend/access/transam/varsup.c | 288 |
| TransState / TBlockState consumers | n/a | see grep |
| XLOG_XACT_COMMIT … opcodes | src/include/access/xact.h | 169 |
| XACT_XINFO_HAS_* flags | src/include/access/xact.h | 188 |
| XactEvent / XactCallback | src/include/access/xact.h | 126 |
Source verification (as of 2026-06-05)
Verified facts
- The transaction has two independent state enums, TransState and TBlockState, and database access is legal only in TRANS_INPROGRESS. Verified by reading the enum definitions and IsTransactionState in xact.c on 2026-06-05 (commit 273fe94). IsTransactionState returns s->state == TRANS_INPROGRESS and nothing else.
- XID assignment is lazy: StartTransaction assigns only a VXID, and the real XID is allocated on first call to GetCurrentTransactionId / GetTopTransactionId. Verified in StartTransaction (sets fullTransactionId = InvalidFullTransactionId) and the two getters, which call AssignTransactionId only when the id is invalid.
- GetNewTransactionId publishes the new XID into ProcGlobal->xids[] before releasing XidGenLock. Verified in varsup.c: the MyProc->xid = xid; ProcGlobal->xids[...] = xid; stores sit inside the XidGenLock-held region, and the comment cites access/transam/README for the interlock rationale. The README’s “Interlocking” section confirms this is required for correct ComputeXidHorizons / oldest-xmin tracking.
- Commit’s durable instant is RecordTransactionCommit, which writes one commit record, then (synchronously) XLogFlush + TransactionIdCommitTree, guarded by a critical section with DELAY_CHKPT_START. Verified by reading the function on 2026-06-05. The async path (synchronous_commit=off) replaces XLogFlush with XLogSetAsyncXactLSN and TransactionIdCommitTree with TransactionIdAsyncCommitTree.
- Abort never flushes WAL. Verified in RecordTransactionAbort: there is no XLogFlush call; the source comment states the post-crash assumption is “aborted anyway.” It does mark CLOG aborted (TransactionIdAbortTree) inside a critical section and nudges the walwriter with XLogSetAsyncXactLSN.
- Subcommit does not write a WAL record or mark CLOG; it merges children up into the parent and defers the CLOG mark to top-level commit. Verified in CommitSubTransaction (no record-writing call; calls AtSubCommit_childXids) and confirmed by the source comment “Prior to 8.4 we marked subcommit in clog at this point. We now only perform that step … as part of the atomic update of the whole transaction tree at top level commit or abort,” and by the README’s “pg_xact and pg_subtrans” section.
- Subabort is eager: AbortSubTransaction calls RecordTransactionAbort(true) and XidCacheRemoveRunningXids immediately. Verified in AbortSubTransaction and RecordTransactionAbort (the isSubXact branch removes the failed XIDs from the PGPROC running-children cache at once).
- XactLogCommitRecord / XactLogAbortRecord are shared with two-phase commit; a valid twophase_xid flips the opcode to the *_PREPARED variant. Verified in XactLogCommitRecord: info = !TransactionIdIsValid(twophase_xid) ? XLOG_XACT_COMMIT : XLOG_XACT_COMMIT_PREPARED, and the XACT_XINFO_HAS_TWOPHASE sub-record is appended only when the xid is valid.
- The subxid cache in PGPROC is bounded by PGPROC_MAX_CACHED_SUBXIDS; past that it sets an overflow flag and readers must consult pg_subtrans. Verified in the isSubXact branch of GetNewTransactionId (varsup.c) and the README “pg_xact and pg_subtrans” section.
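The bounded subxid cache from the last item can be sketched as a fixed array plus an overflow flag. The names are hypothetical (`ToySubxidCache`, `toy_cache_subxid`, `toy_cache_contains`), and the limit of 64 is an assumption chosen to mirror PGPROC_MAX_CACHED_SUBXIDS; the fallback lookup in pg_subtrans is not modeled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_MAX_CACHED_SUBXIDS 64   /* assumed to mirror the real limit */

typedef struct ToySubxidCache
{
    uint32_t xids[TOY_MAX_CACHED_SUBXIDS];
    int      count;
    bool     overflowed;    /* set once an XID could not be cached */
} ToySubxidCache;

/* Record a newly assigned subtransaction XID, or note the overflow. */
void toy_cache_subxid(ToySubxidCache *cache, uint32_t xid)
{
    if (cache->count < TOY_MAX_CACHED_SUBXIDS)
        cache->xids[cache->count++] = xid;
    else
        cache->overflowed = true;
}

/*
 * Returns true if xid is in the cache. On a miss, *must_check_subtrans
 * tells the reader whether the cache is incomplete, in which case the
 * answer has to come from the slower parent-link lookup instead.
 */
bool toy_cache_contains(const ToySubxidCache *cache, uint32_t xid,
                        bool *must_check_subtrans)
{
    *must_check_subtrans = false;
    for (int i = 0; i < cache->count; i++)
        if (cache->xids[i] == xid)
            return true;
    *must_check_subtrans = cache->overflowed;
    return false;
}
```

A miss is authoritative only while the overflow flag is clear; once a 65th subxid has been assigned in the toy, every miss has to be double-checked elsewhere.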
Open questions
- Exact async-commit CLOG hint-bit deferral. This doc states that with synchronous_commit=off the CLOG mark goes through TransactionIdAsyncCommitTree and the actual hint-bit setting on heap pages is deferred until WAL is flushed to the relevant LSN. The mechanics of the per-CLOG-page LSN cache (group size 32, the GROUP_LSNS machinery) live in the SLRU/clog layer, not in xact.c. Investigation path: cross-read postgres-clog-commit-ts.md and postgres-slru.md against the README’s “Asynchronous Commit” section to confirm the group-LSN size has not changed on REL_18.
- AbortCurrentTransactionInternal block-state coverage. This doc reads the commit-side TBLOCK_* switch in full but only summarizes the abort-side switch. Whether every abort-side block state has an exactly symmetric transition (especially the TBLOCK_SUBABORT_RESTART / TBLOCK_SUBRESTART re-create-savepoint path) was not exhaustively traced. Investigation path: read AbortCurrentTransactionInternal case-by-case and diagram the subxact abort/restart transitions.
- Cross-version drift of the commit record xinfo flags. The XACT_XINFO_HAS_DROPPED_STATS flag (PG15 cumulative stats) and XACT_XINFO_HAS_* set were read on REL_18; whether the bit assignments are stable across the versions a reader might compare against (PG14↔PG18) is not verified here. Investigation path: diff xact.h flag definitions across the relevant REL_*_STABLE tags.
Beyond PostgreSQL — Comparative Designs & Research Frontiers
- CUBRID’s transaction descriptor (TDES) vs PostgreSQL’s TransactionState stack. CUBRID centralizes per-transaction state in a LOG_TDES and tracks savepoints as save-LSN markers in the undo log rather than as a stack of independent control blocks (see cubrid-transaction.md). PostgreSQL’s per-level ResourceOwner + memory context per subxact is a heavier but more uniform unwind unit. A side-by-side of “savepoint rollback cost” between the two models would quantify what PostgreSQL pays for that uniformity.
- Physical undo logging (ARIES CLRs) vs MVCC-as-undo. ARIES (dbms-papers/aries.md) logs undo actions and writes compensation log records so partial rollback is itself recoverable. PostgreSQL sidesteps most physical undo: a rolled-back tuple version is simply never made visible (the abort flips CLOG, and vacuum reclaims the dead version later), so xact.c writes almost no undo. The zheap project (an undo-based PostgreSQL storage AM) is the natural comparison: it reintroduces ARIES-style undo into PostgreSQL and would change this module’s abort path substantially.
- Eager vs lazy XID and the 64-bit XID debate. PostgreSQL’s lazy 32-bit XID with epoch-extended FullTransactionId is a direct response to wraparound; engines with native 64-bit transaction ids (and InnoDB’s larger trx ids) avoid the xidStopLimit defenses entirely. The recurring community proposal to make the on-disk XID 64-bit would delete the wraparound machinery in varsup.c; tracking that proposal against this module is a useful frontier note.
- Group commit and commit latency. PostgreSQL’s synchronous_commit async path plus the walwriter approximate group commit; dedicated group-commit designs (the classic Helland/DeWitt group-commit work, and modern log-pipelining) coalesce many backends’ flushes more aggressively. How PostgreSQL’s per-backend DELAY_CHKPT_START + SyncRepWaitForLSN interact with a heavier group-commit scheme is an open comparison; postgres-xlog-wal.md owns the flush mechanics this would build on.
Sources
In-tree design docs:
- src/backend/access/transam/README: “The Transaction System” (three-layer model, the worked BEGIN/SELECT/INSERT/COMMIT example), “Subtransaction Handling,” “Transaction and Subtransaction Numbering,” “Interlocking Transaction Begin, Transaction End, and Snapshots,” “pg_xact and pg_subtrans,” “Asynchronous Commit,” “Transaction Emulation during Recovery.”

Source files (REL_18_STABLE, commit 273fe94):
- src/backend/access/transam/xact.c: the transaction manager: state machines, commit/abort pipelines, subtransactions, record writers, callbacks.
- src/include/access/xact.h: record formats, xinfo flags, opcodes, the public API, callback event enums.
- src/backend/access/transam/varsup.c: GetNewTransactionId, nextXid, the wraparound limits.

Textbook / paper anchors:
- Database System Concepts, Silberschatz, Korth, Sudarshan, 7e, ch. 17 (Transactions: ACID, the state diagram, recoverable/cascadeless schedules, nested transactions). Capture: knowledge/research/dbms-general/database-system-concepts.md.
- ARIES: A Transaction Recovery Method…, Mohan et al., ACM TODS 1992: WAL, repeating history, logging undo for partial rollback. Capture: knowledge/research/dbms-papers/aries.md.

Cross-references (do not duplicate their mechanism):
- postgres-xlog-wal.md: WAL insert/flush, XLogInsert, the LSN gate.
- postgres-clog-commit-ts.md: pg_xact commit-status store and commit timestamps; the async-commit CLOG hint deferral.
- postgres-two-phase-commit.md: PREPARE TRANSACTION, twophase state files, COMMIT/ROLLBACK PREPARED, recovery of prepared xacts.
- postgres-mvcc-snapshots.md: ProcArrayEndTransaction, snapshots, latestCompletedXid, the visibility decision that keys off the CLOG flip.
- postgres-recovery-redo.md: xact_redo and replay of commit/abort records.