PostgreSQL Transaction Management — The Commit State Machine, Subtransactions, and 2PC Hooks

A transaction is the unit of atomicity and durability in a database: the ACID promise that a group of reads and writes either takes full effect or has no effect at all, and that once the system reports “committed” the effect survives a crash. Database System Concepts (Silberschatz, Korth, Sudarshan, 7e, ch. 17 “Transactions”) frames this with the classic transaction state diagram — active → partially committed → committed, or active → failed → aborted — and pins the durable instant to the moment the commit log record reaches stable storage. Everything before that instant is reversible; everything after is not.

That single sentence hides the design space this module lives in. The textbook model says “write a commit record and make it durable,” but it does not say when in the sequence of cleanup steps the record must be written, how the engine names the transaction, or what happens to the half-finished work of a nested operation that the user wants to undo without losing the whole transaction. Three design choices follow, and PostgreSQL’s xact.c turns each of them in a specific direction:

  1. How is a transaction named, and when? The textbook assumes a transaction identifier exists for the lifetime of the transaction. A real engine has to decide whether to allocate that identifier eagerly at BEGIN or lazily at first write — a read-only transaction needs a handle for locking and snapshots but arguably never needs to appear in the permanent commit log.

  2. What is the order of operations at commit? Commit is not one instruction; it is “fire deferred triggers, close cursors, run pre-commit callbacks, write the WAL commit record, flush, mark the commit log, leave the running-transaction set, release locks, run post-commit callbacks, free memory.” The textbook gives the durable instant; the engineering problem is ordering everything else around it so that (a) a crash before the instant looks like an abort, (b) a crash after it looks like a commit, and (c) no other backend can observe an inconsistent in-between.

  3. How do you undo part of a transaction? SQL savepoints (and PL/pgSQL exception blocks) require nested, partially-rollback-able units. The model for this is the nested transaction: a tree where a child’s effects become permanent only if it and every ancestor commit. Database System Concepts (§17.3, recoverable/cascadeless schedules; the nested-transaction discussion) supplies the correctness frame; the engineering frame is “how do you cheaply represent and unwind the tree.”

The durability mechanism underneath all of this is write-ahead logging, and its canonical account is ARIES (Mohan et al., ARIES: A Transaction Recovery Method Supporting Fine-Granularity Locking and Partial Rollbacks Using Write-Ahead Logging, ACM TODS 1992; captured in dbms-papers/aries.md). ARIES fixes three principles this module depends on: WAL (log the change before the change reaches the data page), repeating history during redo, and logging undo actions so that partial rollback (a savepoint, or an aborted subtransaction) is itself a recoverable operation. PostgreSQL’s transaction manager is the producer of the commit/abort records that ARIES-style recovery later replays; the recovery side proper lives in postgres-recovery-redo.md, and the WAL insert machinery in postgres-xlog-wal.md. What xact.c owns is the moment of decision: it decides that the transaction commits, in what order, and writes the record that makes the decision durable.

One PostgreSQL-specific subtlety that the textbook does not anticipate: because PostgreSQL’s heap is no-overwrite MVCC, the commit record is not where the data becomes visible. Visibility is decided by the snapshot machinery (postgres-mvcc-snapshots.md) reading the commit status of an XID from the commit log (pg_xact / CLOG, postgres-clog-commit-ts.md). So xact.c’s job is narrower than in an in-place engine: it does not install new values; it flips a two-bit status from “in progress” to “committed” (or “aborted”) and gets out of the running-transaction set. Everyone else’s visibility decision keys off that flip.
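To make the "two-bit flip" concrete, here is a minimal sketch of a commit-status store that packs four transactions into each byte. The names (`toy_clog`, `toy_set_status`, `toy_tuple_visible`) and the fixed 64-entry capacity are assumptions made for illustration; the real pg_xact is a paged SLRU, and real visibility checks also consult a snapshot, which this toy omits.

```c
#include <assert.h>
#include <stdint.h>

/* Two bits of status per transaction, so four transactions fit in one byte. */
enum { XS_IN_PROGRESS = 0, XS_COMMITTED = 1, XS_ABORTED = 2 };

#define TOY_XACTS 64                      /* hypothetical toy capacity */
static uint8_t toy_clog[TOY_XACTS / 4];   /* zero-initialized: all "in progress" */

/* Overwrite the two status bits that belong to xid. */
void toy_set_status(uint32_t xid, int status)
{
    unsigned shift = (xid % 4) * 2;
    uint8_t *byte = &toy_clog[xid / 4];

    assert(xid < TOY_XACTS);
    *byte = (uint8_t) ((*byte & ~(3u << shift)) | ((unsigned) status << shift));
}

/* Read the two status bits that belong to xid. */
int toy_get_status(uint32_t xid)
{
    assert(xid < TOY_XACTS);
    return (toy_clog[xid / 4] >> ((xid % 4) * 2)) & 3;
}

/* Simplified reader: a row written by xmin counts only once xmin is committed. */
int toy_tuple_visible(uint32_t xmin)
{
    return toy_get_status(xmin) == XS_COMMITTED;
}
```

Calling `toy_set_status(10, XS_COMMITTED)` changes the answer of `toy_tuple_visible(10)` from 0 to 1 without touching any data page, which is the sense in which commit "installs" nothing.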

Across PostgreSQL, Oracle, SQL Server, MySQL/InnoDB, and CUBRID, a recurring set of engineering conventions realizes the textbook transaction. None of them are in the textbook; all of them recur because the same three pressures (naming, commit ordering, nesting) push implementers toward the same shapes.

A monotonic transaction-id counter, allocated under a short lock. Every engine has a global counter that hands out transaction identifiers in increasing order; the order is load-bearing because visibility and recovery both reason “this id is older than that id.” The counter is protected by a narrow critical section (a latch held for a handful of instructions) because it is one of the hottest shared resources in the system. PostgreSQL calls it nextXid under XidGenLock; InnoDB has trx_sys->max_trx_id; CUBRID has its own transaction-id allocator. The shared trick is to hold the lock for as little as possible and to publish the new id into the shared running-transaction registry before dropping the lock, so no concurrent snapshot can miss it.

Lazy id assignment for read-only work. Mature engines avoid burning a permanent id on transactions that never write. A read-only transaction still needs some handle (to hold locks, to anchor a snapshot), so engines split the identity in two: a cheap, backend-local handle assigned at start, and an expensive, globally-ordered id assigned only at first write. PostgreSQL’s split is VXID (virtual transaction id) vs XID: the VXID — (procNumber, localTransactionId) — is assigned with zero shared-memory contention at StartTransaction, and the real XID is deferred to AssignTransactionId.
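The split identity can be sketched in a few lines. This is a toy under stated assumptions: `ToyLazyXact`, `toy_start`, and `toy_current_xid` are hypothetical names, the starting counter value 100 is arbitrary, and the shared counter is a plain variable rather than something protected by a lock.

```c
#include <assert.h>
#include <stdint.h>

typedef struct ToyLazyXact
{
    uint32_t local_id;   /* cheap handle, assigned at start */
    uint32_t xid;        /* globally ordered id; 0 means "not assigned yet" */
} ToyLazyXact;

static uint32_t toy_next_local = 1;    /* backend-local, no contention */
static uint32_t toy_next_xid = 100;    /* stands in for the shared counter */

/* Start of transaction: only the cheap local handle is assigned. */
void toy_start(ToyLazyXact *x)
{
    x->local_id = toy_next_local++;
    x->xid = 0;
}

/* First write calls this; later calls return the same id. */
uint32_t toy_current_xid(ToyLazyXact *x)
{
    if (x->xid == 0)
        x->xid = toy_next_xid++;   /* a real engine does this under a lock */
    return x->xid;
}
```

A transaction that never calls `toy_current_xid` finishes with `xid == 0` and leaves the global counter untouched, which is the saving the lazy scheme is after.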

Commit as a fixed pipeline around one durable instant. Every engine turns “commit” into an ordered sequence with exactly one step that is the point of no return: writing-and-flushing the commit log record. Steps before it are abort-safe (an error reroutes to the abort path); steps after it are “noncritical cleanup” that must not fail the transaction. The universal ordering rule is: make changes visible-to-others (release row/page locks) only after the commit is durable, and leave the running set in a way that interlocks with snapshot-takers.

A separate “abort is free” path. Because the recovery rule is “a transaction with no commit record is presumed aborted,” engines do not need to flush an abort record. PostgreSQL writes an abort record (useful for hot standby and for waking lock waiters) but deliberately does not XLogFlush it. This asymmetry — commit pays for durability, abort does not — is universal.
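The presumed-abort rule is easy to state as code. The sketch below assumes a hypothetical flat log of `ToyRec` entries and a count of how many of them reached stable storage before the crash; it is not how any real recovery loop is structured, only the decision rule.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef enum { TOY_REC_DATA, TOY_REC_COMMIT, TOY_REC_ABORT } ToyRecType;

typedef struct
{
    ToyRecType type;
    uint32_t   xid;
} ToyRec;

/*
 * Decide a transaction's fate from the durable prefix of the log.
 * Returns 1 if a commit record for xid survived, 0 otherwise.
 */
int toy_recovered_as_committed(const ToyRec *log, size_t ndurable, uint32_t xid)
{
    for (size_t i = 0; i < ndurable; i++)
        if (log[i].type == TOY_REC_COMMIT && log[i].xid == xid)
            return 1;
    /* No commit record: presumed aborted, whether or not an abort record exists. */
    return 0;
}
```

Because the answer for an aborted transaction is the same with or without its abort record in the durable prefix, flushing that record buys no correctness.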

Nested transactions as a stack with upward merge. Savepoints and exception blocks are implemented as a stack of per-level state records. A child that commits does not independently durably commit; instead its identity (and its own committed children) is merged upward into the parent, and only the top-level commit durably commits the whole tree in one atomic record. A child that aborts is marked aborted immediately (so lock waiters and visibility checks short-circuit) and unwound from the stack.

Resource ownership tied to transaction scope. Buffers pinned, files opened, catalog-cache entries, locks held — all are tracked so they can be released en masse at end-of-transaction in a defined order. PostgreSQL centralizes this in the ResourceOwner tree (utils/resowner); the commit and abort paths walk it in phases (before-locks, locks, after-locks).

| Theory / convention | PostgreSQL name (this module) |
| --- | --- |
| Transaction state (active / committed / aborted) | TransState enum: TRANS_DEFAULT, TRANS_START, TRANS_INPROGRESS, TRANS_COMMIT, TRANS_ABORT, TRANS_PREPARE |
| Client-block state (BEGIN/COMMIT/SAVEPOINT block) | TBlockState enum: TBLOCK_* |
| Per-transaction control block | TransactionStateData (the stack node) |
| Global monotonic id counter | TransamVariables->nextXid, under XidGenLock |
| Cheap start-time handle (read-only) | VXID (procNumber, localTransactionId) |
| Lazy real id (first write) | XID, via AssignTransactionId → GetNewTransactionId |
| Durable-instant record | XLOG_XACT_COMMIT written by XactLogCommitRecord |
| Commit-status store (the flip) | pg_xact / CLOG via TransactionIdCommitTree (see postgres-clog-commit-ts.md) |
| Leave the running set (interlock) | ProcArrayEndTransaction (see postgres-mvcc-snapshots.md) |
| Nested transaction node | a pushed TransactionStateData with parent link |
| Child→parent merge at subcommit | AtSubCommit_childXids → parent’s childXids[] |
| Per-scope resource cleanup | ResourceOwner release phases + AtEOXact_* / AtEOSubXact_* callbacks |
| Undo-logging for partial rollback (ARIES) | abort record + immediate CLOG abort mark; physical undo is MVCC, not logged |

By the time TransactionStateData and AssignTransactionId appear in the next section, the reader already knows what kind of thing each is.

The in-tree README opens by calling the transaction system “a three-layer system.” The bottom layer is the low-level routines (StartTransaction, CommitTransaction, …); the middle layer is the pair that postgres.c calls around every query, StartTransactionCommand / CommitTransactionCommand; the top layer is the set of routines the SQL traffic cop calls for transaction-control statements (BeginTransactionBlock, EndTransactionBlock, …). The two states that thread through these layers are deliberately separate:

// TransState — src/backend/access/transam/xact.c
typedef enum TransState
{
    TRANS_DEFAULT,      /* idle */
    TRANS_START,        /* transaction starting */
    TRANS_INPROGRESS,   /* inside a valid transaction */
    TRANS_COMMIT,       /* commit in progress */
    TRANS_ABORT,        /* abort in progress */
    TRANS_PREPARE,      /* prepare in progress */
} TransState;

TransState is the engine’s view: what is this backend physically doing right now. IsTransactionState() returns true only for TRANS_INPROGRESS — the transient START/COMMIT/ABORT/PREPARE states are explicitly “too soon or too late to do anything interesting,” so database access is only safe in TRANS_INPROGRESS.

The second machine is the client’s view — what the user’s BEGIN/COMMIT block wants — and it has far more states because it has to remember intent across multiple query cycles:

// TBlockState — src/backend/access/transam/xact.c (condensed)
typedef enum TBlockState
{
    /* not-in-transaction-block states */
    TBLOCK_DEFAULT,             /* idle */
    TBLOCK_STARTED,             /* running single-query transaction */
    /* transaction block states */
    TBLOCK_BEGIN,               /* starting transaction block */
    TBLOCK_INPROGRESS,          /* live transaction */
    TBLOCK_IMPLICIT_INPROGRESS, /* live transaction after implicit BEGIN */
    TBLOCK_PARALLEL_INPROGRESS, /* live transaction inside parallel worker */
    TBLOCK_END,                 /* COMMIT received */
    TBLOCK_ABORT,               /* failed xact, awaiting ROLLBACK */
    TBLOCK_ABORT_END,           /* failed xact, ROLLBACK received */
    TBLOCK_ABORT_PENDING,       /* live xact, ROLLBACK received */
    TBLOCK_PREPARE,             /* live xact, PREPARE received */
    /* subtransaction states */
    TBLOCK_SUBBEGIN, TBLOCK_SUBINPROGRESS, TBLOCK_SUBRELEASE,
    TBLOCK_SUBCOMMIT, TBLOCK_SUBABORT, TBLOCK_SUBABORT_END,
    TBLOCK_SUBABORT_PENDING, TBLOCK_SUBRESTART, TBLOCK_SUBABORT_RESTART,
} TBlockState;

The reason for two machines is the subtlety the README spends a page on: a single SQL COMMIT does not immediately close the transaction. When the user types COMMIT, the traffic-cop’s EndTransactionBlock only moves the block to TBLOCK_END; the real CommitTransaction() runs later, when CommitTransactionCommand is next called by the main loop. This split is what lets control leave xact.c with the transaction still open so the main loop can keep processing inside the same transaction. The low-level TransState tracks the physical commit; the high-level TBlockState remembers that a COMMIT is pending.

The TBLOCK_* machine is driven almost entirely from one giant switch in CommitTransactionCommandInternal (and its abort sibling). The hot transitions:

stateDiagram-v2
    [*] --> TBLOCK_DEFAULT
    TBLOCK_DEFAULT --> TBLOCK_STARTED: StartTransactionCommand
    TBLOCK_STARTED --> TBLOCK_DEFAULT: CommitTransaction \n implicit single-query xact

    TBLOCK_STARTED --> TBLOCK_BEGIN: BEGIN
    TBLOCK_BEGIN --> TBLOCK_INPROGRESS: CommitTransactionCommand
    TBLOCK_INPROGRESS --> TBLOCK_INPROGRESS: CommandCounterIncrement \n per statement
    TBLOCK_INPROGRESS --> TBLOCK_END: COMMIT received
    TBLOCK_END --> TBLOCK_DEFAULT: CommitTransaction

    TBLOCK_INPROGRESS --> TBLOCK_ABORT_PENDING: ROLLBACK received
    TBLOCK_ABORT_PENDING --> TBLOCK_DEFAULT: AbortTransaction then CleanupTransaction

    TBLOCK_INPROGRESS --> TBLOCK_ABORT: error inside block
    TBLOCK_ABORT --> TBLOCK_ABORT_END: ROLLBACK received
    TBLOCK_ABORT_END --> TBLOCK_DEFAULT: CleanupTransaction

    TBLOCK_INPROGRESS --> TBLOCK_PREPARE: PREPARE TRANSACTION
    TBLOCK_PREPARE --> TBLOCK_DEFAULT: PrepareTransaction

Figure 1 — The high-level TBlockState machine for a top-level transaction. The crucial asymmetry: an error inside a block lands in TBLOCK_ABORT and waits for the user’s ROLLBACK (any other command is ignored), whereas an explicit ROLLBACK on a healthy block goes through TBLOCK_ABORT_PENDING where the engine still has to do the abort. Both converge on CleanupTransaction and back to idle. COMMIT never commits immediately — it parks in TBLOCK_END until the next CommitTransactionCommand.

The README’s worked example makes the two-phase nature concrete. For BEGIN; SELECT; INSERT; COMMIT; the main loop calls StartTransactionCommand / CommitTransactionCommand around every statement, and only the BEGIN statement actually runs StartTransaction(), only the COMMIT statement actually runs CommitTransaction(); the in-between CommitTransactionCommands just call CommandCounterIncrement() so later commands see earlier commands’ effects.
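The worked example can be replayed against a toy version of the block-state machine. The sketch below is an assumption-laden reduction: it keeps only five block states, uses hypothetical names (`ToyXactMachine`, `toy_start_command`, and so on), and replaces the real low-level routines with counters so the number of calls can be checked.

```c
#include <assert.h>

typedef enum { TB_DEFAULT, TB_STARTED, TB_BEGIN, TB_INPROGRESS, TB_END } ToyBlock;

typedef struct
{
    ToyBlock block;
    int starts;    /* times the low-level "start transaction" ran */
    int commits;   /* times the low-level "commit transaction" ran */
    int cci;       /* times the command counter was bumped */
} ToyXactMachine;

/* Called by the main loop before every statement. */
void toy_start_command(ToyXactMachine *m)
{
    if (m->block == TB_DEFAULT)
    {
        m->starts++;
        m->block = TB_STARTED;
    }
    /* Inside a block there is nothing to start. */
}

/* The SQL-level commands only record intent in the block state. */
void toy_sql_begin(ToyXactMachine *m)  { if (m->block == TB_STARTED) m->block = TB_BEGIN; }
void toy_sql_commit(ToyXactMachine *m) { if (m->block == TB_INPROGRESS) m->block = TB_END; }

/* Called by the main loop after every statement. */
void toy_commit_command(ToyXactMachine *m)
{
    switch (m->block)
    {
        case TB_STARTED:      /* single-statement transaction */
        case TB_END:          /* COMMIT was parked earlier; act on it now */
            m->commits++;
            m->block = TB_DEFAULT;
            break;
        case TB_BEGIN:
            m->block = TB_INPROGRESS;
            break;
        case TB_INPROGRESS:
            m->cci++;         /* later statements see earlier effects */
            break;
        case TB_DEFAULT:
            break;
    }
}

/* Replays BEGIN; SELECT; INSERT; COMMIT through the per-statement loop. */
ToyXactMachine toy_run_example(void)
{
    ToyXactMachine m = { TB_DEFAULT, 0, 0, 0 };

    toy_start_command(&m); toy_sql_begin(&m);  toy_commit_command(&m);
    toy_start_command(&m); /* SELECT */        toy_commit_command(&m);
    toy_start_command(&m); /* INSERT */        toy_commit_command(&m);
    toy_start_command(&m); toy_sql_commit(&m); toy_commit_command(&m);
    return m;
}
```

Running `toy_run_example()` leaves one start, one commit, and two command-counter bumps, with the block state back at idle: four statement cycles, but only one physical transaction.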

XID assignment: lazy, virtual-first, under XidGenLock

A transaction begins with no XID. StartTransaction assigns only a VXID, which costs nothing in shared memory:

// StartTransaction — src/backend/access/transam/xact.c (condensed)
s->state = TRANS_START;
s->fullTransactionId = InvalidFullTransactionId; /* until assigned */
// ...
vxid.procNumber = MyProcNumber;
vxid.localTransactionId = GetNextLocalTransactionId();
VirtualXactLockTableInsert(vxid);
MyProc->vxid.lxid = vxid.localTransactionId;

The real XID is allocated the first time something needs one — GetCurrentTransactionId() is the public door, and it lazily assigns:

// GetCurrentTransactionId — src/backend/access/transam/xact.c
TransactionId
GetCurrentTransactionId(void)
{
    TransactionState s = CurrentTransactionState;

    if (!FullTransactionIdIsValid(s->fullTransactionId))
        AssignTransactionId(s);
    return XidFromFullTransactionId(s->fullTransactionId);
}

AssignTransactionId does four things that the in-tree README’s “Transaction and Subtransaction Numbering” section explains: (1) if this is a subxact, it first assigns XIDs to all unassigned parents, iteratively (never recursing deeper than one frame), to preserve the invariant child XID > parent XID; (2) it calls GetNewTransactionId; (3) for a subxact it records the parent link in pg_subtrans via SubTransSetParent; (4) it takes the transaction-XID lock (XactLockTableInsert) charged to this level’s ResourceOwner, and for a top-level xact registers the XID with the predicate-lock (SSI) system.

// AssignTransactionId — src/backend/access/transam/xact.c (condensed)
s->fullTransactionId = GetNewTransactionId(isSubXact);
if (!isSubXact)
    XactTopFullTransactionId = s->fullTransactionId;
if (isSubXact)
    SubTransSetParent(XidFromFullTransactionId(s->fullTransactionId),
                      XidFromFullTransactionId(s->parent->fullTransactionId));
if (!isSubXact)
    RegisterPredicateLockingXid(XidFromFullTransactionId(s->fullTransactionId));
// take XID lock charged to this level's ResourceOwner
currentOwner = CurrentResourceOwner;
CurrentResourceOwner = s->curTransactionOwner;
XactLockTableInsert(XidFromFullTransactionId(s->fullTransactionId));
CurrentResourceOwner = currentOwner;

The actual counter bump lives in varsup.c. During normal running, GetNewTransactionId is the only place nextXid advances. It takes XidGenLock exclusively, checks the wraparound limits (xidVacLimit, xidWarnLimit, xidStopLimit: escalating defenses that first request an autovacuum, then warn, then refuse new XIDs), extends the on-disk SLRUs if the new XID starts a new page, advances the counter, and, critically, publishes the XID into the shared ProcArray before releasing the lock:

// GetNewTransactionId — src/backend/access/transam/varsup.c (condensed)
LWLockAcquire(XidGenLock, LW_EXCLUSIVE);
full_xid = TransamVariables->nextXid;
xid = XidFromFullTransactionId(full_xid);
// ... wraparound-limit checks (xidStopLimit -> ERROR, xidWarnLimit -> WARNING) ...
ExtendCLOG(xid);        /* zero a new pg_xact page if needed */
ExtendCommitTs(xid);
ExtendSUBTRANS(xid);
FullTransactionIdAdvance(&TransamVariables->nextXid);
if (!isSubXact)
{
    /* LWLockRelease acts as barrier */
    MyProc->xid = xid;
    ProcGlobal->xids[MyProc->pgxactoff] = xid;
}
else
{
    /* store into the PGPROC subxid cache, or set overflowed */
    if (nxids < PGPROC_MAX_CACHED_SUBXIDS) { ... MyProc->subxids.xids[nxids] = xid; ... }
    else MyProc->subxidStatus.overflowed = substat->overflowed = true;
}
LWLockRelease(XidGenLock);

flowchart TD
    NEED["a write needs an XID<br/>GetCurrentTransactionId"] --> HAVE{fullTransactionId<br/>already valid?}
    HAVE -- yes --> RET["return existing XID"]
    HAVE -- no --> SUB{is this a subxact<br/>with unassigned parents?}
    SUB -- yes --> PAR["assign parents first<br/>iteratively, child XID > parent XID"]
    SUB -- no --> GNT
    PAR --> GNT["GetNewTransactionId under XidGenLock"]
    GNT --> EXT["ExtendCLOG / CommitTs / SUBTRANS<br/>then advance nextXid"]
    EXT --> PUB["publish XID into ProcArray<br/>(before releasing XidGenLock)"]
    PUB --> POST["SubTransSetParent (subxact)<br/>RegisterPredicateLockingXid (top)<br/>XactLockTableInsert"]
    POST --> RET

Figure 2 — Lazy XID assignment. The ordering rule from the README’s “Interlocking” section is the load-bearing detail: GetNewTransactionId must store the new XID into the ProcArray before releasing XidGenLock, so that every top-level XID ≤ latestCompletedXid is guaranteed present in the ProcArray (or no longer running). Without that ordering, a concurrent backend could allocate and commit a later XID, advancing latestCompletedXid past this one before it became visible — breaking the oldest-xmin computation that vacuum relies on.

CommitTransaction: the order of operations

CommitTransaction() is a long, deliberately-ordered routine. Its shape is three phases: pre-commit (may run user code, may still error into abort), the durable instant (RecordTransactionCommit, then leave the running set), and post-commit cleanup (must not fail the transaction). The abbreviated sequence:

// CommitTransaction — src/backend/access/transam/xact.c (heavily condensed)
/* --- pre-commit: user code may run, errors reroute to abort --- */
for (;;)
{
    AfterTriggerFireDeferred();
    if (!PreCommit_Portals(false))
        break;
}
CallXactCallbacks(XACT_EVENT_PRE_COMMIT);
AtEOXact_Parallel(true);
AfterTriggerEndXact(true);
PreCommit_on_commit_actions();
smgrDoPendingSyncs(true, is_parallel_worker);
AtEOXact_LargeObject(true);
PreCommit_Notify();
if (!is_parallel_worker)
    PreCommit_CheckForSerializationFailure();   /* SSI may abort here */
HOLD_INTERRUPTS();
s->state = TRANS_COMMIT;
/* --- the durable instant --- */
if (!is_parallel_worker)
    latestXid = RecordTransactionCommit();      /* writes + flushes + CLOG */
ProcArrayEndTransaction(MyProc, latestXid);     /* leave running set */
/* --- post-commit: noncritical cleanup, must not fail the xact --- */
CallXactCallbacks(XACT_EVENT_COMMIT);
ResourceOwnerRelease(TopTransactionResourceOwner, RESOURCE_RELEASE_BEFORE_LOCKS, true, true);
AtEOXact_Buffers(true);
AtEOXact_RelationCache(true);
AtEOXact_Inval(true);                           /* publish catalog invals */
ResourceOwnerRelease(TopTransactionResourceOwner, RESOURCE_RELEASE_LOCKS, true, true);
ResourceOwnerRelease(TopTransactionResourceOwner, RESOURCE_RELEASE_AFTER_LOCKS, true, true);
smgrDoPendingDeletes(true);                     /* drop files of dropped rels */
AtCommit_Notify();
/* ... AtEOXact_GUC / SPI / Namespace / PgStat / Snapshot ... */
AtCommit_Memory();                              /* free TopTransactionContext */
s->state = TRANS_DEFAULT;
RESUME_INTERRUPTS();

Two ordering rules in this sequence come straight from the README and are worth naming because every other DBMS has to solve the same problem:

  • Leave the running set (ProcArrayEndTransaction) only after RecordTransactionCommit, and only before releasing locks. A concurrent snapshot-taker that no longer sees this XID as running must already be able to see it as committed in CLOG; and lock waiters must not be woken until the transaction is fully cleaned up from their point of view.
  • Publish catalog invalidations (AtEOXact_Inval) after dropping relcache references but before releasing locks, so that any backend waiting on a lock for a relation this transaction modified learns about the catalog change before it starts using the relation.

RecordTransactionCommit: where durability happens

This is the one function in the module that contains the point of no return. The structure is: gather what the commit record needs (dropped files, committed children, dropped stats, invalidation messages), decide whether there is even an XID to commit, and if so enter a critical section, write the record, then flush or not based on synchronous_commit, then update CLOG.

// RecordTransactionCommit — src/backend/access/transam/xact.c (condensed)
TransactionId xid = GetTopTransactionIdIfAny();
bool markXidCommitted = TransactionIdIsValid(xid);
// ... gather nrels, nchildren, ndroppedstats, invalMessages ...
if (!markXidCommitted) {
    /* no XID: nothing to commit. Only flush if we wrote WAL (e.g. HOT pruning) */
    if (!wrote_xlog) goto cleanup;
} else {
    /* force concurrent checkpoint to wait until pg_xact is updated */
    START_CRIT_SECTION();
    MyProc->delayChkptFlags |= DELAY_CHKPT_START;
    XactLogCommitRecord(GetCurrentTransactionStopTimestamp(),
                        nchildren, children, nrels, rels,
                        ndroppedstats, droppedstats,
                        nmsgs, invalMessages, RelcacheInitFileInval,
                        MyXactFlags,
                        InvalidTransactionId, NULL /* plain commit */);
    TransactionTreeSetCommitTsData(xid, nchildren, children, ...);
}
if ((wrote_xlog && markXidCommitted && synchronous_commit > SYNCHRONOUS_COMMIT_OFF)
    || forceSyncCommit || nrels > 0)
{
    XLogFlush(XactLastRecEnd);                  /* SYNCHRONOUS path */
    if (markXidCommitted)
        TransactionIdCommitTree(xid, nchildren, children);  /* CLOG = COMMITTED now */
}
else
{
    XLogSetAsyncXactLSN(XactLastRecEnd);        /* ASYNCHRONOUS path */
    if (markXidCommitted)
        TransactionIdAsyncCommitTree(xid, nchildren, children, XactLastRecEnd);
}
if (markXidCommitted) {
    MyProc->delayChkptFlags &= ~DELAY_CHKPT_START;
    END_CRIT_SECTION();
}
if (wrote_xlog && markXidCommitted)
    SyncRepWaitForLSN(XactLastRecEnd, true);    /* wait for sync standby */

Three things to read off this:

  1. No XID, no commit record. A read-only transaction never got an XID, so markXidCommitted is false and there is literally nothing to commit: it skips to cleanup. The only reason such a transaction flushes at all is if it wrote WAL for some side reason (HOT pruning), in which case the flush is for durability of that, not of a commit. A transaction that wrote only temporary or unlogged tables is a different case: it does have an XID and does get a commit record, but because wrote_xlog is false it takes the asynchronous path below.

  2. The commit critical section + DELAY_CHKPT_START. Between writing the WAL record and updating CLOG, the backend sets DELAY_CHKPT_START so a concurrent checkpoint cannot move its redo pointer past the commit record while failing to flush the CLOG update. Without this interlock a crash just after checkpoint could lose a commit that the WAL already recorded.

  3. Synchronous vs asynchronous commit. With synchronous_commit on, the WAL is flushed (XLogFlush) and CLOG is updated synchronously (TransactionIdCommitTree). With synchronous_commit=off, the WAL is not flushed; instead the commit LSN is noted (XLogSetAsyncXactLSN) so the walwriter flushes it soon, and CLOG is updated async — the actual CLOG write defers until WAL up to that LSN is known flushed (the TransactionIdAsyncCommitTree path; see the README’s “Asynchronous Commit” section and postgres-clog-commit-ts.md). The order — record, then flush, then CLOG — is invariant; what async commit relaxes is when the flush completes, trading a small window of possible commit loss on crash for latency.
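The flush decision in the condensed listing is a pure predicate, so it can be isolated and exercised on its own. The sketch below mirrors that condition; the function name `toy_commit_flushes_now` and the two-value `TOY_SYNC_COMMIT_*` enum are assumptions for illustration (the real setting has more levels), and the early exit for a transaction with neither an XID nor WAL is left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy subset of the synchronous_commit levels; only the ordering matters here. */
enum { TOY_SYNC_COMMIT_OFF = 0, TOY_SYNC_COMMIT_ON = 1 };

/*
 * Returns true when the commit must flush WAL before reporting success,
 * false when it may hand the flush to a background writer.
 */
bool toy_commit_flushes_now(bool wrote_xlog, bool has_xid, int synchronous_commit,
                            bool force_sync_commit, int nrels)
{
    return (wrote_xlog && has_xid && synchronous_commit > TOY_SYNC_COMMIT_OFF)
        || force_sync_commit
        || nrels > 0;       /* dropping files is never left to an async commit */
}
```

Two cases are worth reading off: with the setting off, a transaction that drops relation files (`nrels > 0`) still flushes, and with the setting on, a transaction that has an XID but wrote no WAL of its own still takes the asynchronous path.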

The record itself is built by XactLogCommitRecord, which is a study in optional sub-records: a bare xl_xact_commit is just a timestamp, and each extra payload (subxacts, dropped relfilelocators, dropped stats, invalidation messages, two-phase xid/gid, replication origin) is appended only if present, its presence flagged in an xinfo bitmask. This is why the same function serves plain commits and COMMIT PREPARED — the only difference is whether twophase_xid is valid, which flips the opcode from XLOG_XACT_COMMIT to XLOG_XACT_COMMIT_PREPARED.
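The "optional sub-records behind a bitmask" layout can be sketched as follows. Everything here is hypothetical: the `TOY_HAS_*` flag values, the `ToyCommitRecord` buffer, and the payload encoding are invented for the example and do not match the real xl_xact_commit format, which has more payload kinds than the three shown.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical flag bits for this sketch; not the real xinfo values. */
#define TOY_HAS_SUBXACTS  (1u << 0)
#define TOY_HAS_RELS      (1u << 1)
#define TOY_HAS_TWOPHASE  (1u << 2)

typedef enum { TOY_OP_COMMIT, TOY_OP_COMMIT_PREPARED } ToyOpcode;

typedef struct
{
    ToyOpcode opcode;
    uint32_t  xinfo;       /* which optional payloads follow the timestamp */
    size_t    len;         /* bytes used in buf */
    uint8_t   buf[256];
} ToyCommitRecord;

static void toy_append(ToyCommitRecord *r, const void *p, size_t n)
{
    assert(r->len + n <= sizeof r->buf);
    memcpy(r->buf + r->len, p, n);
    r->len += n;
}

/* Build one record; each optional payload is appended only if present. */
void toy_build_commit(ToyCommitRecord *r, int64_t commit_time,
                      const uint32_t *subxids, uint32_t nsub,
                      const uint32_t *rels, uint32_t nrels,
                      uint32_t twophase_xid /* 0 = plain commit */)
{
    r->opcode = twophase_xid != 0 ? TOY_OP_COMMIT_PREPARED : TOY_OP_COMMIT;
    r->xinfo = 0;
    r->len = 0;
    toy_append(r, &commit_time, sizeof commit_time);   /* always present */
    if (nsub > 0)
    {
        r->xinfo |= TOY_HAS_SUBXACTS;
        toy_append(r, &nsub, sizeof nsub);
        toy_append(r, subxids, nsub * sizeof *subxids);
    }
    if (nrels > 0)
    {
        r->xinfo |= TOY_HAS_RELS;
        toy_append(r, &nrels, sizeof nrels);
        toy_append(r, rels, nrels * sizeof *rels);
    }
    if (twophase_xid != 0)
    {
        r->xinfo |= TOY_HAS_TWOPHASE;
        toy_append(r, &twophase_xid, sizeof twophase_xid);
    }
}
```

A bare commit costs only the timestamp, and the same builder produces the prepared-commit variant when a two-phase XID is supplied, which is the property the paragraph above describes.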

AbortTransaction: the mirror image, but free

Abort is structurally similar but reversed in spirit. It first releases lightweight resources as fast as possible (the README: “release all shared resources so that we do not delay other backends”), then records the abort, leaves the running set, then does the heavier cleanup:

// AbortTransaction — src/backend/access/transam/xact.c (condensed)
HOLD_INTERRUPTS();
AtAbort_Memory();
AtAbort_ResourceOwner();
LWLockReleaseAll();             /* drop LWLocks immediately */
UnlockBuffers();
XLogResetInsertion();           /* discard half-built WAL record */
LockErrorCleanup();
s->state = TRANS_ABORT;
SetUserIdAndSecContext(s->prevUser, s->prevSecContext);   /* undo SECURITY DEFINER */
// ... AtEOXact_Parallel(false), AfterTriggerEndXact(false), AtAbort_Portals() ...
if (!is_parallel_worker)
    latestXid = RecordTransactionAbort(false);            /* writes record, NO flush */
ProcArrayEndTransaction(MyProc, latestXid);
if (TopTransactionResourceOwner != NULL) {
    CallXactCallbacks(XACT_EVENT_ABORT);
    ResourceOwnerRelease(..., RESOURCE_RELEASE_BEFORE_LOCKS, false, true);
    // ... buffers, relcache, inval, multixact ...
    ResourceOwnerRelease(..., RESOURCE_RELEASE_LOCKS, false, true);
    ResourceOwnerRelease(..., RESOURCE_RELEASE_AFTER_LOCKS, false, true);
    smgrDoPendingDeletes(false);
    // ... GUC / SPI / PgStat ...
}
/* State remains TRANS_ABORT until CleanupTransaction(). */
RESUME_INTERRUPTS();

The key difference is in RecordTransactionAbort: it writes the abort record inside a critical section and immediately marks CLOG aborted via TransactionIdAbortTree, but it never flushes and never sets DELAY_CHKPT_START:

// RecordTransactionAbort — src/backend/access/transam/xact.c (condensed)
TransactionId xid = GetCurrentTransactionIdIfAny();
if (!TransactionIdIsValid(xid)) {       /* no XID -> nobody cares we aborted */
    if (!isSubXact) XactLastRecEnd = 0;
    return InvalidTransactionId;
}
if (TransactionIdDidCommit(xid))        /* sanity: didn't half-commit */
    elog(PANIC, "cannot abort transaction %u, it was already committed", xid);
START_CRIT_SECTION();
XactLogAbortRecord(xact_time, nchildren, children, nrels, rels,
                   ndroppedstats, droppedstats, MyXactFlags,
                   InvalidTransactionId, NULL);
if (!isSubXact)
    XLogSetAsyncXactLSN(XactLastRecEnd);            /* nudge walwriter, but do not block */
TransactionIdAbortTree(xid, nchildren, children);   /* CLOG = ABORTED */
END_CRIT_SECTION();

The comment in the source is explicit: “We do not flush XLOG to disk here, since the default assumption after a crash would be that we aborted, anyway.” This is the universal “abort is free” asymmetry made concrete. Note also the two-phase abort handling the README describes: AbortTransaction releases shared resources immediately (so other backends aren’t delayed), but CleanupTransaction — which finally tears down TopTransactionContext and returns to TRANS_DEFAULT — does not run until the user actually issues ROLLBACK. That is why a failed transaction block sits in TBLOCK_ABORT ignoring everything until it sees a termination command.

Subtransactions: a stack with upward merge

Savepoints, PL/pgSQL exception blocks, and internal subtransactions are all built on the same primitive: a stack of TransactionStateData nodes linked by parent. The struct carries everything a level needs to be unwound independently:

// TransactionStateData — src/backend/access/transam/xact.c (condensed)
typedef struct TransactionStateData
{
    FullTransactionId fullTransactionId;    /* my XID (lazy; may be invalid) */
    SubTransactionId subTransactionId;      /* my subxact ID */
    char       *name;                       /* savepoint name, if any */
    int         savepointLevel;
    TransState  state;                      /* low-level state */
    TBlockState blockState;                 /* high-level state */
    int         nestingLevel;               /* transaction nesting depth */
    int         gucNestLevel;
    MemoryContext curTransactionContext;    /* my xact-lifetime context */
    ResourceOwner curTransactionOwner;      /* my query resources */
    TransactionId *childXids;               /* subcommitted child XIDs, in XID order */
    int         nChildXids;
    int         maxChildXids;
    Oid         prevUser;
    int         prevSecContext;
    bool        prevXactReadOnly;
    bool        startedInRecovery;
    bool        didLogXid;
    // ... parallel-mode bookkeeping, chain flag ...
    struct TransactionStateData *parent;    /* back link to parent */
} TransactionStateData;

PushTransaction allocates a new node in TopTransactionContext, bumps the backend-local currentSubTransactionId counter (the top level is SubTransactionId 1; subxacts are 2 and up; the counter resets per top transaction and is not an XID), links it to the current node, and sets blockState = TBLOCK_SUBBEGIN. StartSubTransaction then initializes the subsystems (AtSubStart_Memory, AtSubStart_ResourceOwner) and moves the node to TRANS_INPROGRESS. Note the subxact node starts with no XID — like a top transaction, a savepoint that only reads never burns an XID.

flowchart TB
    TOP["top TransactionStateData<br/>subxid 1, XID 100 (lazy)<br/>childXids: [101, 102]"]
    S1["subxact node<br/>subxid 2, XID 103<br/>parent ->"]
    S2["subxact node (current)<br/>subxid 3, XID (none yet)<br/>parent ->"]
    S2 --> S1 --> TOP
    note["PushTransaction: allocate + link + TBLOCK_SUBBEGIN<br/>PopTransaction: relink CurrentTransactionState to parent, free node"]

Figure 3 — The subtransaction stack. Each node is an independent unwind unit with its own ResourceOwner and memory context, but XIDs are still lazy per node. The invariant from AssignTransactionId guarantees XIDs increase down the stack (parent always gets an XID before a child needs one), which is why childXids[] stays sorted.

The interesting moment is subcommit. A committing subtransaction does not write its own commit record and does not mark CLOG committed. Instead CommitSubTransaction merges its identity upward into the parent and releases only its own XID lock; all other locks transfer to the parent’s ResourceOwner:

// CommitSubTransaction — src/backend/access/transam/xact.c (condensed)
s->state = TRANS_COMMIT;
CommandCounterIncrement();      /* make subxact's commands visible */
/* Prior to 8.4 we marked subcommit in clog here; now deferred to top-level. */
if (FullTransactionIdIsValid(s->fullTransactionId))
    AtSubCommit_childXids();    /* merge my XID + my children into parent */
// ... AfterTriggerEndSubXact(true), AtSubCommit_Portals(...), callbacks ...
CurrentResourceOwner = s->curTransactionOwner;
if (FullTransactionIdIsValid(s->fullTransactionId))
    XactLockTableDelete(XidFromFullTransactionId(s->fullTransactionId));
/* Other locks transfer to parent */
ResourceOwnerRelease(s->curTransactionOwner, RESOURCE_RELEASE_LOCKS, true, false);
// ...
CurrentResourceOwner = s->parent->curTransactionOwner;
PopTransaction();

AtSubCommit_childXids is the merge: it grows the parent’s childXids[] array (doubling to amortize, capped at MaxAllocSize) and copies the subxact’s own XID followed by its children, keeping the array sorted by relying on the “child XID > parent XID” invariant. When the top transaction finally commits, RecordTransactionCommit passes this whole children array to XactLogCommitRecord and TransactionIdCommitTree — so the entire tree is marked committed in CLOG atomically, in one record. The README’s “pg_xact and pg_subtrans” section describes the corner case: when the tree’s status spans multiple CLOG pages, an intermediate “sub-committed” state is used to keep the multi-page update atomic, but within a single page they are all flipped to committed at once with no intermediate state.
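A minimal sketch of the upward merge, assuming a stripped-down node type. `ToyLevel` and `toy_subcommit_merge` are hypothetical names, memory comes from `realloc` rather than a memory context, and out-of-memory is treated as fatal; only the shape of the operation (append own XID, then own children, with a doubling array) follows the description above.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef struct ToyLevel
{
    uint32_t  xid;          /* 0 = this level never needed an XID */
    uint32_t *child_xids;   /* subcommitted descendants, ascending */
    int       nchild;
    int       maxchild;
    struct ToyLevel *parent;
} ToyLevel;

/* Merge a committing child's XID and its own children into its parent. */
void toy_subcommit_merge(ToyLevel *child)
{
    ToyLevel *p = child->parent;
    int need;

    if (child->xid == 0)
        return;             /* no XID means no descendants with XIDs either */

    need = p->nchild + 1 + child->nchild;
    if (need > p->maxchild)
    {
        int newmax = p->maxchild > 0 ? p->maxchild : 4;
        uint32_t *grown;

        while (newmax < need)
            newmax *= 2;    /* doubling keeps repeated merges amortized O(1) */
        grown = realloc(p->child_xids, (size_t) newmax * sizeof(uint32_t));
        assert(grown != NULL);
        p->child_xids = grown;
        p->maxchild = newmax;
    }

    /* Child XID first, then its children: all larger than what is already there. */
    p->child_xids[p->nchild] = child->xid;
    if (child->nchild > 0)
        memcpy(&p->child_xids[p->nchild + 1], child->child_xids,
               (size_t) child->nchild * sizeof(uint32_t));
    p->nchild = need;

    free(child->child_xids);
    child->child_xids = NULL;
    child->nchild = child->maxchild = 0;
}
```

Because every appended XID is larger than everything already in the parent's array, plain appending keeps the array sorted with no explicit sort step.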

Subabort is eager where subcommit is lazy. AbortSubTransaction calls RecordTransactionAbort(true) immediately (writing an abort record, marking CLOG aborted, and calling XidCacheRemoveRunningXids to drop the failed XIDs from the PGPROC running-children cache right away), then CleanupSubTransaction pops the node. The reason for the asymmetry is the same as at top level: an aborted subxact’s XID must be observably aborted now so that XactLockTableWait and visibility checks short-circuit, whereas a committed subxact’s fate is still contingent on its ancestors, so its CLOG mark waits for the top.
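
The asymmetry is easiest to see as observable commit-log state over time. The following is a toy model under stated assumptions: toy_clog is a hypothetical in-memory array indexed by small integer XIDs, and none of the WAL, locking, or PGPROC work is modeled.

```c
#include <assert.h>

typedef enum { TOY_IN_PROGRESS, TOY_COMMITTED, TOY_ABORTED } ToyStatus;

#define TOY_NXIDS 16
static ToyStatus toy_clog[TOY_NXIDS];   /* zero-initialized: all in progress */

/* Subabort: the failed XIDs become visibly aborted immediately. */
static void
toy_subabort(const unsigned *xids, int n)
{
    for (int i = 0; i < n; i++)
    {
        assert(xids[i] < TOY_NXIDS);
        toy_clog[xids[i]] = TOY_ABORTED;
    }
}

/* Subcommit: deliberately no status change; the fate depends on ancestors. */
static void
toy_subcommit(const unsigned *xids, int n)
{
    (void) xids;
    (void) n;
}

/* Top-level commit: the surviving children and the top XID flip together. */
static void
toy_top_commit(unsigned top, const unsigned *children, int n)
{
    assert(top < TOY_NXIDS);
    for (int i = 0; i < n; i++)
    {
        assert(children[i] < TOY_NXIDS);
        toy_clog[children[i]] = TOY_COMMITTED;
    }
    toy_clog[top] = TOY_COMMITTED;
}
```

In this model a subcommitted XID reads as in progress until toy_top_commit runs, while an aborted one reads as aborted from the moment toy_subabort returns, which mirrors the reasoning in the paragraph above.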

ROLLBACK TO <savepoint> is built from these primitives: the engine aborts every subtransaction from the innermost level up through the named one, then re-creates that level with the same name (“a completely new subtransaction as far as the internals are concerned”). RELEASE simply commits (merges up) the named level and every level nested inside it.
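
A small standalone model shows how the two commands reduce to pops and a push. Everything here is hypothetical (ToySavepoints, the fixed depth of 16, the counters); it only tracks names and per-level ids, and it assumes the innermost savepoint wins when names repeat.

```c
#include <assert.h>
#include <string.h>

#define TOY_MAX_DEPTH 16

typedef struct
{
    char     name[32];
    unsigned subxid;         /* fresh id per level, like a SubTransactionId */
} ToyLevel;

typedef struct
{
    ToyLevel levels[TOY_MAX_DEPTH];
    int      depth;          /* number of open savepoint levels */
    unsigned next_subxid;
    int      aborted;        /* levels ended by subabort */
    int      merged;         /* levels ended by subcommit */
} ToySavepoints;

/* Find the innermost level with this name, or -1. */
static int
toy_find(const ToySavepoints *sp, const char *name)
{
    for (int i = sp->depth - 1; i >= 0; i--)
        if (strcmp(sp->levels[i].name, name) == 0)
            return i;
    return -1;
}

/* SAVEPOINT name: push a new level with a fresh id. */
static int
toy_savepoint(ToySavepoints *sp, const char *name)
{
    if (sp->depth >= TOY_MAX_DEPTH ||
        strlen(name) >= sizeof sp->levels[0].name)
        return -1;
    strcpy(sp->levels[sp->depth].name, name);
    sp->levels[sp->depth].subxid = sp->next_subxid++;
    sp->depth++;
    return 0;
}

/* ROLLBACK TO name: abort down through the target, then re-create it. */
static int
toy_rollback_to(ToySavepoints *sp, const char *name)
{
    char saved[32];
    int  target = toy_find(sp, name);

    if (target < 0)
        return -1;
    strcpy(saved, sp->levels[target].name);   /* copy before popping */
    sp->aborted += sp->depth - target;
    sp->depth = target;
    return toy_savepoint(sp, saved);          /* same name, new identity */
}

/* RELEASE name: subcommit the target and everything nested inside it. */
static int
toy_release(ToySavepoints *sp, const char *name)
{
    int target = toy_find(sp, name);

    if (target < 0)
        return -1;
    sp->merged += sp->depth - target;
    sp->depth = target;
    return 0;
}
```

After toy_rollback_to the savepoint still exists under its old name but carries a new subxid, which is the sense in which the re-created level is a new subtransaction internally.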

The end-of-transaction work is dispatched through two parallel families of callbacks plus the ResourceOwner phases. The AtEOXact_* functions (buffers, relcache, inval, GUC, SPI, namespace, pgstat, snapshot, …) run at top commit/abort with a boolean isCommit; the AtEOSubXact_* family does the same at subtransaction boundaries, additionally taking the sub/parent subxids so state can be re-parented rather than discarded. Extensions plug into the same seam through registered callbacks:

// RegisterXactCallback / the event enum — src/include/access/xact.h
typedef enum
{
    XACT_EVENT_COMMIT, XACT_EVENT_PARALLEL_COMMIT,
    XACT_EVENT_ABORT, XACT_EVENT_PARALLEL_ABORT,
    XACT_EVENT_PREPARE,
    XACT_EVENT_PRE_COMMIT, XACT_EVENT_PARALLEL_PRE_COMMIT,
    XACT_EVENT_PRE_PREPARE,
} XactEvent;
typedef void (*XactCallback) (XactEvent event, void *arg);
extern void RegisterXactCallback(XactCallback callback, void *arg);

The PRE_COMMIT events fire while user code can still run and an error can still reroute the transaction to abort. The plain COMMIT and ABORT events fire after the outcome has been decided, when a failure can no longer change it. This is why a callback that might fail belongs on PRE_COMMIT, not COMMIT.
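
The placement rule can be illustrated with a standalone toy registry. It is a sketch under loud assumptions: the Toy* names are hypothetical, and the callback returns false to signal failure, whereas the real XactCallback returns void and reports failure by raising an error.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum { TOY_PRE_COMMIT, TOY_COMMIT, TOY_ABORT } ToyEvent;
typedef enum { TOY_COMMITTED, TOY_ABORTED } ToyOutcome;

/* Assumption: failure is a false return, not a raised error. */
typedef bool (*ToyCallback) (ToyEvent event, void *arg);

#define TOY_MAX_CALLBACKS 8
static struct { ToyCallback fn; void *arg; } toy_callbacks[TOY_MAX_CALLBACKS];
static int toy_ncallbacks = 0;

static bool
toy_register(ToyCallback fn, void *arg)
{
    if (toy_ncallbacks >= TOY_MAX_CALLBACKS)
        return false;
    toy_callbacks[toy_ncallbacks].fn = fn;
    toy_callbacks[toy_ncallbacks].arg = arg;
    toy_ncallbacks++;
    return true;
}

/* Run every callback for one event; optionally stop at the first failure. */
static bool
toy_fire(ToyEvent ev, bool stop_on_failure)
{
    bool ok = true;

    for (int i = 0; i < toy_ncallbacks; i++)
    {
        if (!toy_callbacks[i].fn(ev, toy_callbacks[i].arg))
        {
            ok = false;
            if (stop_on_failure)
                break;
        }
    }
    return ok;
}

static ToyOutcome
toy_commit(void)
{
    /* Before the durable instant: a failure still reroutes to abort. */
    if (!toy_fire(TOY_PRE_COMMIT, true))
    {
        (void) toy_fire(TOY_ABORT, false);
        return TOY_ABORTED;
    }
    /* (the commit record would be written and flushed here) */
    /* After it: the result is ignored because nothing can be undone. */
    (void) toy_fire(TOY_COMMIT, false);
    return TOY_COMMITTED;
}

/* Example callback: a validation that may fail, so it acts on PRE_COMMIT. */
typedef struct { int pending_rows; int limit; ToyEvent last_event; } ToyQuota;

static bool
toy_quota_cb(ToyEvent ev, void *arg)
{
    ToyQuota *q = arg;

    q->last_event = ev;
    if (ev == TOY_PRE_COMMIT)
        return q->pending_rows <= q->limit;
    return true;            /* COMMIT/ABORT handling must not fail */
}
```

Registering toy_quota_cb and calling toy_commit with pending_rows above limit yields TOY_ABORTED, and the callback's last observed event is TOY_ABORT. Had the same check run on TOY_COMMIT, its failure would have been ignored.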

The ResourceOwner release happens in three ordered phases: RESOURCE_RELEASE_BEFORE_LOCKS (externally visible resources such as buffer pins), RESOURCE_RELEASE_LOCKS (the heavyweight locks themselves), and RESOURCE_RELEASE_AFTER_LOCKS (backend-internal resources such as cache references). The ordering ensures that by the time locks are dropped, a waiting backend sees everything externally visible about this transaction as already cleaned up. The mechanism itself is documented in postgres-resource-owners.md (planned, base-infra); xact.c is its primary caller.
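
The phase ordering can be sketched as three passes over a tagged resource list. This is a hypothetical miniature (ToyResource, toy_release_all), not the ResourceOwner API; it records only the order in which resources would be released.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum { TOY_BEFORE_LOCKS, TOY_LOCKS, TOY_AFTER_LOCKS } ToyPhase;

typedef struct
{
    const char *what;       /* description, for the release log */
    ToyPhase    phase;      /* phase in which this resource is released */
    bool        released;
} ToyResource;

/*
 * Release every pending resource registered for one phase, appending its
 * description to order[]. Returns the updated count of log entries.
 */
static int
toy_release_phase(ToyResource *res, int n, ToyPhase phase,
                  const char **order, int norder)
{
    for (int i = 0; i < n; i++)
    {
        if (!res[i].released && res[i].phase == phase)
        {
            res[i].released = true;
            order[norder++] = res[i].what;
        }
    }
    return norder;
}

/* Run the three phases in their fixed order; order[] must hold n entries. */
static int
toy_release_all(ToyResource *res, int n, const char **order)
{
    int norder = 0;

    norder = toy_release_phase(res, n, TOY_BEFORE_LOCKS, order, norder);
    norder = toy_release_phase(res, n, TOY_LOCKS, order, norder);
    norder = toy_release_phase(res, n, TOY_AFTER_LOCKS, order, norder);
    return norder;
}
```

Whatever order the resources were acquired in, the log always lists the externally visible ones first, the locks second, and the backend-local ones last.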

xact.c contains PrepareTransaction, but two-phase commit proper — PREPARE TRANSACTION, the on-disk twophase state files, COMMIT/ROLLBACK PREPARED, recovery of prepared xacts — lives in twophase.c and is the subject of postgres-two-phase-commit.md. The seam this document owns is narrow and worth stating precisely so the boundary is clear:

  • The TBLOCK_PREPARE block state routes to PrepareTransaction(), which is a near-twin of CommitTransaction() but, instead of RecordTransactionCommit, hands the transaction’s state to the twophase machinery (StartPrepare / EndPrepare) that durably writes a prepare record and a state file.
  • XactLogCommitRecord and XactLogAbortRecord are shared between plain and prepared paths: passing a valid twophase_xid flips the opcode to XLOG_XACT_COMMIT_PREPARED / XLOG_XACT_ABORT_PREPARED and appends the xl_xact_twophase (and optionally GID) sub-record. So the WAL record format for commit/abort is defined here; the protocol that uses the prepared variants is defined there.
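
The shared-writer behavior in the second bullet can be sketched as a small planning function. The TOY_* constants are illustrative values, not the real opcode or xinfo bit assignments from xact.h, and treating a non-NULL gid argument as the trigger for the GID sub-record is an assumption of this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t Xid;
#define TOY_INVALID_XID        0u

/* Illustrative values only; see xact.h for the real definitions. */
#define TOY_OP_COMMIT          1
#define TOY_OP_COMMIT_PREPARED 2
#define TOY_HAS_SUBXACTS       (1u << 0)
#define TOY_HAS_TWOPHASE       (1u << 1)
#define TOY_HAS_GID            (1u << 2)

typedef struct
{
    int      opcode;    /* which record type to emit */
    uint32_t xinfo;     /* which optional sub-records follow the header */
} ToyCommitHeader;

/* Decide opcode and optional sub-records for one commit record. */
static ToyCommitHeader
toy_plan_commit_record(int nsubxacts, Xid twophase_xid, const char *gid)
{
    ToyCommitHeader h;
    bool            prepared = (twophase_xid != TOY_INVALID_XID);

    /* A valid twophase_xid is the only thing that flips the opcode. */
    h.opcode = prepared ? TOY_OP_COMMIT_PREPARED : TOY_OP_COMMIT;
    h.xinfo = 0;

    if (nsubxacts > 0)
        h.xinfo |= TOY_HAS_SUBXACTS;
    if (prepared)
    {
        h.xinfo |= TOY_HAS_TWOPHASE;
        if (gid != NULL)            /* assumption: GID logged when supplied */
            h.xinfo |= TOY_HAS_GID;
    }
    return h;
}
```

A plain commit with two subtransactions yields TOY_OP_COMMIT with only TOY_HAS_SUBXACTS set; passing a valid twophase_xid changes the opcode and adds the two-phase flags without touching the rest of the record layout.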

Everything else about 2PC — the resolver, the GID namespace, surviving a restart with prepared transactions still pending — is deliberately out of scope here.

Grouped by sub-system. Symbols are the stable anchor; the position-hint table at the end pins line numbers to commit 273fe94.

  • TransState, TBlockState (enums, xact.c) — the two state machines.
  • TransactionStateData / TransactionState (xact.c) — the stack node; TopTransactionStateData is the static top node, CurrentTransactionState the cursor into the stack.
  • StartTransaction (xact.c) — assigns VXID, resets per-xact counters, initializes memory/resource owner; leaves state TRANS_INPROGRESS.
  • CommitTransaction (xact.c) — the ordered commit pipeline.
  • AbortTransaction (xact.c) — fast resource release, record abort, cleanup.
  • CleanupTransaction (xact.c) — final teardown after abort; tears down TopTransactionContext, returns to TRANS_DEFAULT.
  • PrepareTransaction (xact.c) — the 2PC twin of CommitTransaction (seam).
  • CommitTransactionCommandInternal / AbortCurrentTransactionInternal (xact.c) — the TBLOCK_* switch that drives block-state transitions.
  • IsTransactionState, IsAbortedTransactionBlockState (xact.c) — state predicates used pervasively by the rest of the backend.
  • GetCurrentTransactionId, GetTopTransactionId (xact.c) — public doors that lazily call AssignTransactionId.
  • GetCurrentTransactionIdIfAny, GetTopTransactionIdIfAny (xact.c) — the non-assigning variants (return Invalid if no XID yet).
  • AssignTransactionId (xact.c) — parent-first assignment, pg_subtrans link, XID lock, SSI registration, hot-standby XLOG_XACT_ASSIGNMENT batching.
  • GetNewTransactionId (varsup.c) — the only place nextXid advances during normal operation; wraparound limits, SLRU extension, ProcArray publication under XidGenLock.
  • ReadNextFullTransactionId, AdvanceNextFullTransactionIdPastXid (varsup.c) — read / recovery-time advance.
  • GetCurrentSubTransactionId, GetCurrentCommandId (xact.c) — the backend-local sub/command counters.
  • RecordTransactionCommit (xact.c) — the point of no return: crit section + DELAY_CHKPT_START, write, flush-or-async, CLOG update, sync-rep wait.
  • RecordTransactionAbort (xact.c) — write abort record, mark CLOG aborted, never flush.
  • XactLogCommitRecord, XactLogAbortRecord (xact.c) — assemble the variable WAL record from optional sub-records keyed by the xinfo mask; shared with the 2PC paths.
  • xl_xact_commit, xl_xact_abort, xl_xact_xinfo, xl_xact_subxacts, xl_xact_assignment, the XACT_XINFO_HAS_* flags, the XLOG_XACT_* opcodes (xact.h) — the on-disk record vocabulary.
  • xact_redo (xact.c, declared in xact.h) — the rmgr redo entry for these records (recovery side; detail in postgres-recovery-redo.md).
  • PushTransaction, PopTransaction (xact.c) — stack push/pop; subxid counter bump and wraparound guard.
  • StartSubTransaction, CommitSubTransaction, AbortSubTransaction, CleanupSubTransaction (xact.c) — the subxact lifecycle.
  • AtSubCommit_childXids (xact.c) — the upward merge into parent childXids[].
  • xactGetCommittedChildren (xact.c) — hands the committed-children array to the record writers.
  • DefineSavepoint, ReleaseSavepoint, RollbackToSavepoint, BeginInternalSubTransaction, ReleaseCurrentSubTransaction, RollbackAndReleaseCurrentSubTransaction (xact.c, in xact.h) — the SQL and internal entry points.
  • RegisterXactCallback / UnregisterXactCallback, RegisterSubXactCallback / UnregisterSubXactCallback (xact.c) — extension hook registration; CallXactCallbacks / CallSubXactCallbacks fire them.
  • The AtEOXact_* and AtEOSubXact_* families (called from xact.c, defined across many modules) — per-subsystem end-of-(sub)transaction cleanup.
  • XactEvent, SubXactEvent (xact.h) — the event enums.

Position hints (as of 2026-06-05, REL_18 273fe94)

| Symbol | File | Line |
|---|---|---|
| TransState (enum) | src/backend/access/transam/xact.c | 141 |
| TBlockState (enum) | src/backend/access/transam/xact.c | 157 |
| TransactionStateData (struct) | src/backend/access/transam/xact.c | 193 |
| TopTransactionStateData | src/backend/access/transam/xact.c | 247 |
| IsTransactionState | src/backend/access/transam/xact.c | 386 |
| GetTopTransactionId | src/backend/access/transam/xact.c | 426 |
| GetCurrentTransactionId | src/backend/access/transam/xact.c | 454 |
| AssignTransactionId | src/backend/access/transam/xact.c | 635 |
| RecordTransactionCommit | src/backend/access/transam/xact.c | 1315 |
| AtSubCommit_childXids | src/backend/access/transam/xact.c | 1664 |
| RecordTransactionAbort | src/backend/access/transam/xact.c | 1754 |
| StartTransaction | src/backend/access/transam/xact.c | 2064 |
| CommitTransaction | src/backend/access/transam/xact.c | 2228 |
| PrepareTransaction | src/backend/access/transam/xact.c | 2514 |
| AbortTransaction | src/backend/access/transam/xact.c | 2809 |
| CleanupTransaction | src/backend/access/transam/xact.c | 3008 |
| CommitTransactionCommandInternal | src/backend/access/transam/xact.c | 3175 |
| StartSubTransaction | src/backend/access/transam/xact.c | 5067 |
| CommitSubTransaction | src/backend/access/transam/xact.c | 5104 |
| AbortSubTransaction | src/backend/access/transam/xact.c | 5219 |
| CleanupSubTransaction | src/backend/access/transam/xact.c | 5383 |
| PushTransaction | src/backend/access/transam/xact.c | 5416 |
| PopTransaction | src/backend/access/transam/xact.c | 5478 |
| xactGetCommittedChildren | src/backend/access/transam/xact.c | 5790 |
| XactLogCommitRecord | src/backend/access/transam/xact.c | 5814 |
| XactLogAbortRecord | src/backend/access/transam/xact.c | 5986 |
| GetNewTransactionId | src/backend/access/transam/varsup.c | 77 |
| ReadNextFullTransactionId | src/backend/access/transam/varsup.c | 288 |
| TransState / TBlockState consumers | see grep | |
| XLOG_XACT_COMMIT … opcodes | src/include/access/xact.h | 169 |
| XACT_XINFO_HAS_* flags | src/include/access/xact.h | 188 |
| XactEvent / XactCallback | src/include/access/xact.h | 126 |

  • The transaction has two independent state enums, TransState and TBlockState, and database access is legal only in TRANS_INPROGRESS. Verified by reading the enum definitions and IsTransactionState in xact.c on 2026-06-05 (commit 273fe94). IsTransactionState returns s->state == TRANS_INPROGRESS and nothing else.

  • XID assignment is lazy: StartTransaction assigns only a VXID, and the real XID is allocated on first call to GetCurrentTransactionId / GetTopTransactionId. Verified in StartTransaction (sets fullTransactionId = InvalidFullTransactionId) and the two getters, which call AssignTransactionId only when the id is invalid.

  • GetNewTransactionId publishes the new XID into ProcGlobal->xids[] before releasing XidGenLock. Verified in varsup.c: the MyProc->xid = xid; ProcGlobal->xids[...] = xid; stores sit inside the XidGenLock-held region, and the comment cites access/transam/README for the interlock rationale. The README’s “Interlocking” section confirms this is required for correct ComputeXidHorizons / oldest-xmin tracking.

  • Commit’s durable instant is RecordTransactionCommit, which writes one commit record, then (synchronously) XLogFlush + TransactionIdCommitTree, guarded by a critical section with DELAY_CHKPT_START. Verified by reading the function on 2026-06-05. The async path (synchronous_commit=off) replaces XLogFlush with XLogSetAsyncXactLSN and TransactionIdCommitTree with TransactionIdAsyncCommitTree.

  • Abort never flushes WAL. Verified in RecordTransactionAbort: there is no XLogFlush call; the source comment states the post-crash assumption is “aborted anyway.” It does mark CLOG aborted (TransactionIdAbortTree) inside a critical section and nudges the walwriter with XLogSetAsyncXactLSN.

  • Subcommit does not write a WAL record or mark CLOG; it merges children up into the parent and defers the CLOG mark to top-level commit. Verified in CommitSubTransaction (no record-writing call; calls AtSubCommit_childXids) and confirmed by the source comment “Prior to 8.4 we marked subcommit in clog at this point. We now only perform that step … as part of the atomic update of the whole transaction tree at top level commit or abort,” and by the README’s “pg_xact and pg_subtrans” section.

  • Subabort is eager: AbortSubTransaction calls RecordTransactionAbort(true) and XidCacheRemoveRunningXids immediately. Verified in AbortSubTransaction and RecordTransactionAbort (the isSubXact branch removes the failed XIDs from the PGPROC running-children cache at once).

  • XactLogCommitRecord / XactLogAbortRecord are shared with two-phase commit; a valid twophase_xid flips the opcode to the *_PREPARED variant. Verified in XactLogCommitRecord: info = !TransactionIdIsValid(twophase_xid) ? XLOG_XACT_COMMIT : XLOG_XACT_COMMIT_PREPARED, and the XACT_XINFO_HAS_TWOPHASE sub-record is appended only when the xid is valid.

  • The subxid cache in PGPROC is bounded by PGPROC_MAX_CACHED_SUBXIDS; past that it sets an overflow flag and readers must consult pg_subtrans. Verified in the isSubXact branch of GetNewTransactionId (varsup.c) and the README “pg_xact and pg_subtrans” section.
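
The bounded subxid cache in the last item can be modeled in a few lines. This is a sketch with hypothetical names: TOY_MAX_CACHED_SUBXIDS is set to 4 purely to keep the example small and stands in for PGPROC_MAX_CACHED_SUBXIDS, and the lookup result TOY_MUST_CHECK_SUBTRANS represents the fallback to pg_subtrans rather than performing it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_MAX_CACHED_SUBXIDS 4    /* tiny on purpose */

typedef uint32_t Xid;

typedef struct
{
    Xid  xids[TOY_MAX_CACHED_SUBXIDS];
    int  count;
    bool overflowed;    /* once set, the cache is known to be incomplete */
} ToySubxidCache;

typedef enum
{
    TOY_FOUND,                  /* XID is a cached running subxact */
    TOY_NOT_RUNNING,            /* cache is complete and XID is absent */
    TOY_MUST_CHECK_SUBTRANS     /* cache overflowed: consult pg_subtrans */
} ToyLookup;

/* Record a newly assigned subtransaction XID, or note the overflow. */
static void
toy_cache_add(ToySubxidCache *c, Xid xid)
{
    if (c->count < TOY_MAX_CACHED_SUBXIDS)
        c->xids[c->count++] = xid;
    else
        c->overflowed = true;
}

/* A miss is conclusive only while the cache has never overflowed. */
static ToyLookup
toy_cache_lookup(const ToySubxidCache *c, Xid xid)
{
    for (int i = 0; i < c->count; i++)
        if (c->xids[i] == xid)
            return TOY_FOUND;
    return c->overflowed ? TOY_MUST_CHECK_SUBTRANS : TOY_NOT_RUNNING;
}
```

The design point is that overflow does not lose correctness, only speed: readers that miss in an overflowed cache cannot conclude anything and have to take the slower lookup path.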

  1. Exact async-commit CLOG hint-bit deferral. This doc states that with synchronous_commit=off the CLOG mark goes through TransactionIdAsyncCommitTree and the actual hint-bit setting on heap pages is deferred until WAL is flushed to the relevant LSN. The mechanics of the per-CLOG-page LSN cache (group size 32, the GROUP_LSNS machinery) live in the SLRU/clog layer, not in xact.c. Investigation path: cross-read postgres-clog-commit-ts.md and postgres-slru.md against the README’s “Asynchronous Commit” section to confirm the group-LSN size has not changed on REL_18.

  2. AbortCurrentTransactionInternal block-state coverage. This doc reads the commit-side TBLOCK_* switch in full but only summarizes the abort-side switch. Whether every abort-side block state has an exactly symmetric transition (especially the TBLOCK_SUBABORT_RESTART / TBLOCK_SUBRESTART re-create-savepoint path) was not exhaustively traced. Investigation path: read AbortCurrentTransactionInternal case-by-case and diagram the subxact abort/restart transitions.

  3. Cross-version drift of the commit record xinfo flags. The XACT_XINFO_HAS_DROPPED_STATS flag (PG15 cumulative stats) and XACT_XINFO_HAS_* set were read on REL_18; whether the bit assignments are stable across the versions a reader might compare against (PG14↔PG18) is not verified here. Investigation path: diff xact.h flag definitions across the relevant REL_*_STABLE tags.

Beyond PostgreSQL — Comparative Designs & Research Frontiers

  • CUBRID’s transaction descriptor (TDES) vs PostgreSQL’s TransactionState stack. CUBRID centralizes per-transaction state in a LOG_TDES and tracks savepoints as save-LSN markers in the undo log rather than as a stack of independent control blocks (see cubrid-transaction.md). PostgreSQL’s per-level ResourceOwner + memory context per subxact is a heavier but more uniform unwind unit. A side-by-side of “savepoint rollback cost” between the two models would quantify what PostgreSQL pays for that uniformity.

  • Physical undo logging (ARIES CLRs) vs MVCC-as-undo. ARIES (dbms-papers/aries.md) logs undo actions and writes compensation log records so partial rollback is itself recoverable. PostgreSQL sidesteps most physical undo: a rolled-back tuple version is simply never made visible (the abort flips CLOG, and vacuum reclaims the dead version later), so xact.c writes almost no undo. The zheap project (an undo-based PostgreSQL storage AM) is the natural comparison — it reintroduces ARIES-style undo into PostgreSQL and would change this module’s abort path substantially.

  • Eager vs lazy XID and the 64-bit XID debate. PostgreSQL’s lazy 32-bit XID with epoch-extended FullTransactionId is a direct response to wraparound; engines with native 64-bit transaction ids (and InnoDB’s larger trx ids) avoid the xidStopLimit defenses entirely. The recurring community proposal to make the on-disk XID 64-bit would delete the wraparound machinery in varsup.c; tracking that proposal against this module is a useful frontier note.

  • Group commit and commit latency. PostgreSQL’s synchronous_commit async path plus the walwriter approximate group commit; dedicated group-commit designs (the classic Helland/DeWitt group-commit work, and modern log-pipelining) coalesce many backends’ flushes more aggressively. How PostgreSQL’s per-backend DELAY_CHKPT_START + SyncRepWaitForLSN interact with a heavier group-commit scheme is an open comparison; postgres-xlog-wal.md owns the flush mechanics this would build on.

In-tree design docs:

  • src/backend/access/transam/README — “The Transaction System” (three-layer model, the worked BEGIN/SELECT/INSERT/COMMIT example), “Subtransaction Handling,” “Transaction and Subtransaction Numbering,” “Interlocking Transaction Begin, Transaction End, and Snapshots,” “pg_xact and pg_subtrans,” “Asynchronous Commit,” “Transaction Emulation during Recovery.”

Source files (REL_18_STABLE, commit 273fe94):

  • src/backend/access/transam/xact.c — the transaction manager: state machines, commit/abort pipelines, subtransactions, record writers, callbacks.
  • src/include/access/xact.h — record formats, xinfo flags, opcodes, the public API, callback event enums.
  • src/backend/access/transam/varsup.c — GetNewTransactionId, nextXid, the wraparound limits.

Textbook / paper anchors:

  • Database System Concepts, Silberschatz, Korth, Sudarshan, 7e — ch. 17 (Transactions: ACID, the state diagram, recoverable/cascadeless schedules, nested transactions). Capture: knowledge/research/dbms-general/database-system-concepts.md.
  • ARIES: A Transaction Recovery Method…, Mohan et al., ACM TODS 1992 — WAL, repeating history, logging undo for partial rollback. Capture: knowledge/research/dbms-papers/aries.md.

Cross-references (do not duplicate their mechanism):

  • postgres-xlog-wal.md — WAL insert/flush, XLogInsert, the LSN gate.
  • postgres-clog-commit-ts.md — pg_xact commit-status store and commit timestamps; the async-commit CLOG hint deferral.
  • postgres-two-phase-commit.md — PREPARE TRANSACTION, twophase state files, COMMIT/ROLLBACK PREPARED, recovery of prepared xacts.
  • postgres-mvcc-snapshots.md — ProcArrayEndTransaction, snapshots, latestCompletedXid, the visibility decision that keys off the CLOG flip.
  • postgres-recovery-redo.md — xact_redo and replay of commit/abort records.