
PostgreSQL CLOG & Commit Timestamp — Transaction Status Bitmaps, Subtransaction Parentage, and the commit_ts SLRU


When a transaction commits, two things must become permanently true: (1) the write-ahead log record that captures the commit decision is flushed to stable storage so recovery can replay it, and (2) some persistent indicator tells every concurrent and future reader that the transaction is committed. In a multi-version concurrency control (MVCC) engine like PostgreSQL, that indicator is especially critical because heap tuples carry XID stamps, not pre-resolved visibility bits. Every visibility decision reduces to the question “is XID X committed, aborted, or still in progress?” — a lookup that must be fast, cheap, and accurate.

Database System Concepts (Silberschatz et al., 7th ed., §17.6 “Implementation of Atomicity and Durability”) describes the commit record in the log as the atomic commit point: once the commit log record reaches stable storage, the transaction is committed regardless of what happens to the data pages. PostgreSQL honours this exactly. The WAL commit record is the decision; the two-bit status entry in pg_xact is the announcement of that decision to the rest of the system, updated synchronously (for sync commits) after the flush.

Two design choices shape every persistent transaction-status store:

  1. Granularity of what is stored. The minimum is a single committed/aborted bit per top-level XID. A richer design stores per-subtransaction status, a commit timestamp, and a replication-origin tag. Each additional field has a cost in storage and a benefit in observability.

  2. Durability requirement. A committed-status bitmap must survive crashes; recovery replays it from WAL. A parent-linkage index for subtransactions (needed only while those subtransactions are still open) does not need to survive crashes; it can be zeroed at startup because no XID older than TransactionXmin can be an open subtransaction.

PostgreSQL resolves these two choices into three separate SLRU stores: pg_xact (durable, 2 bits/XID), pg_subtrans (volatile, 4 bytes/XID), and pg_commit_ts (durable but optional, 10 bytes/XID). The SLRU substrate that all three share is documented in postgres-slru.md; this document focuses on what each client stores, how it is read and written, and where its lifecycle hooks live.

The ARIES recovery framework (ARIES: A Transaction Recovery Method Supporting Fine-Granularity Locking and Partial Rollbacks Using Write-Ahead Logging, Mohan et al., ACM TODS 1992; knowledge/research/dbms-papers/aries.md) is the theoretical basis for the WAL-before-data rule that makes pg_xact a safe announcement medium: because the commit WAL record is flushed before pg_xact is written (for sync commits), and because recovery replays the commit record on restart, the pg_xact bitmap is always consistent with the WAL.

Nearly every MVCC engine maintains some form of a transaction status table (TST) — a persistent, indexed structure that maps a transaction identifier to its final disposition (committed or aborted). The TST is the second of the two places a reader must consult: first the heap tuple’s XID stamps, then the TST to resolve whether those stamps are committed. The universal engineering conventions that shape every TST implementation are worth naming before diving into PostgreSQL’s choices.

Transaction IDs are sequential integers. The natural index into a TST is therefore a dense array: entry i stores the status of transaction i. Because status requires only three or four logical values (in-progress, committed, aborted, plus a sub-committed intermediate for multi-page atomicity), a 2-bit field suffices: four transactions per byte, 32,768 per 8 kB page. Compact representation means fewer I/Os and a smaller shared-memory buffer pool. Every major engine (PostgreSQL, Oracle’s transaction table slots in the undo header, MySQL/InnoDB’s purge status in the trx_sys segment) exploits this compactness.

Writing status for a batch of transactions on the same TST page requires only one lock acquisition rather than one per transaction. Production engines group updates that land on the same page and hold the page lock for the entire batch. This is especially valuable at commit time when many concurrent transactions share a page of the TST.

Separate volatile parentage index for subtransactions


SQL savepoints and PL exception blocks create subtransactions. A reader that encounters a subtransaction XID in a heap tuple needs to know the top-level parent’s status, not the subtransaction’s independent status, because a subtransaction commits only if all its ancestors also commit. The parentage data (child → parent links) is needed only while the subtransaction tree is open. Once the top-level transaction commits or aborts, all subtransaction XIDs either inherit committed status or are marked aborted. The parent-link index can therefore be volatile — zeroed at crash recovery startup — because no reader needs it for XIDs older than the oldest active XID.

Optional per-XID timestamp for replication and auditing


Some engines record a commit timestamp alongside the status. This is never required for visibility decisions (snapshot isolation needs only before/after ordering, not wall-clock time), but it is invaluable for logical replication conflict resolution, CDC consumers that want commit ordering, and audit queries (pg_xact_commit_timestamp). Because it is optional, implementations gate it behind a configuration flag and store it in a separate file rather than bloating the primary TST.

| Concept | PostgreSQL name |
| --- | --- |
| Transaction status table | pg_xact/ (CLOG), managed by clog.c |
| 2-bit status entry | XidStatus (0=in-progress, 1=committed, 2=aborted, 3=sub-committed) |
| Sub-committed intermediate | TRANSACTION_STATUS_SUB_COMMITTED (0x03) |
| Subtransaction parentage index | pg_subtrans/, managed by subtrans.c |
| Optional commit timestamp store | pg_commit_ts/, managed by commit_ts.c |
| SLRU page buffer pool | SlruCtlData / SimpleLru* API (slru.c) |
| Page-level lock | SLRU bank lock (SimpleLruGetBankLock) |
| Group commit batch | TransactionGroupUpdateXidStatus via clogGroupFirst linked list |

PostgreSQL’s transaction metadata lives in three co-operating SLRU volumes, each in its own subdirectory under $PGDATA:

pg_xact/ — 2 bits/XID — durable, WAL-redo-recoverable
pg_subtrans/ — 4 bytes/XID — volatile, zeroed at startup
pg_commit_ts/ — 10 bytes/XID — durable, optional (track_commit_timestamp)

All three are built on the same slru.c substrate. Each registers a SlruCtlData instance (XactCtlData, SubTransCtlData, CommitTsCtlData) and calls SimpleLruInit at postmaster startup. The difference is what they store per transaction and what their durability contract is.
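To make the per-XID sizes concrete, the sketch below estimates how many SLRU pages each store would need to cover a run of XIDs. It assumes the default 8 kB block size and a run that starts on a page boundary; `SKETCH_BLCKSZ` and `pages_needed` are illustrative names, not PostgreSQL symbols.

```c
#include <stdint.h>

/* Assumption: default 8 kB block size. */
#define SKETCH_BLCKSZ 8192

/* XIDs per page, derived from the per-XID sizes listed above. */
#define XACT_XIDS_PER_PAGE      (SKETCH_BLCKSZ * 4)   /* 2 bits per XID   */
#define SUBTRANS_XIDS_PER_PAGE  (SKETCH_BLCKSZ / 4)   /* 4 bytes per XID  */
#define COMMIT_TS_XIDS_PER_PAGE (SKETCH_BLCKSZ / 10)  /* 10 bytes per XID */

/* Hypothetical helper: pages needed to cover nxids consecutive XIDs,
 * assuming the run starts on a page boundary (ceiling division). */
static uint64_t
pages_needed(uint64_t nxids, uint64_t xids_per_page)
{
    return (nxids + xids_per_page - 1) / xids_per_page;
}
```

Under these assumptions, one million XIDs cover 31 pages of pg_xact, 489 pages of pg_subtrans, and 1222 pages of pg_commit_ts, which illustrates why the richer stores are kept separate from the status bitmap.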

Figure 1 — Three SLRU clients and their shared substrate

flowchart TD
    A[xact.c<br/>CommitTransaction] -->|TransactionIdCommitTree| B[transam.c<br/>TransactionIdSetTreeStatus]
    A -->|TransactionTreeSetCommitTsData| E[commit_ts.c<br/>CommitTsCtlData]
    B --> C[clog.c<br/>XactCtlData<br/>pg_xact/]
    A -->|SubTransSetParent at AssignTransactionId| D[subtrans.c<br/>SubTransCtlData<br/>pg_subtrans/]
    C --> F[slru.c SimpleLru API]
    D --> F
    E --> F
    F --> G[Shared memory page buffers]
    G --> H[Disk segments]

Figure 1 — xact.c drives all three SLRU clients at commit time. pg_subtrans is populated earlier, at sub-XID assignment, not at commit.

Each XID occupies exactly 2 bits. The encoding is defined in clog.h:

// XidStatus — src/include/access/clog.h
typedef int XidStatus;
#define TRANSACTION_STATUS_IN_PROGRESS 0x00
#define TRANSACTION_STATUS_COMMITTED 0x01
#define TRANSACTION_STATUS_ABORTED 0x02
#define TRANSACTION_STATUS_SUB_COMMITTED 0x03

Four transactions fit in one byte, so BLCKSZ * 4 transactions fit on one page (CLOG_XACTS_PER_PAGE = BLCKSZ * CLOG_XACTS_PER_BYTE, which is 32,768 with the default 8 kB block size). Addressing a specific XID into a page byte is pure arithmetic:

// TransactionIdToByte / TransactionIdToBIndex — clog.c
#define CLOG_BITS_PER_XACT 2
#define CLOG_XACTS_PER_BYTE 4
#define CLOG_XACT_BITMASK ((1 << CLOG_BITS_PER_XACT) - 1)
#define TransactionIdToPgIndex(xid) ((xid) % (TransactionId) CLOG_XACTS_PER_PAGE)
#define TransactionIdToByte(xid) (TransactionIdToPgIndex(xid) / CLOG_XACTS_PER_BYTE)
#define TransactionIdToBIndex(xid) ((xid) % (TransactionId) CLOG_XACTS_PER_BYTE)
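As a worked example of the macros above, the following standalone sketch computes the page number, byte offset, and bit shift for an XID. It assumes an 8 kB block size; the `ClogAddress` struct and `clog_address` function are hypothetical names introduced only for illustration.

```c
#include <stdint.h>

#define SKETCH_BLCKSZ  8192                      /* assumed default block size */
#define XACTS_PER_BYTE 4
#define BITS_PER_XACT  2
#define XACTS_PER_PAGE (SKETCH_BLCKSZ * XACTS_PER_BYTE)

/* Hypothetical container for the three coordinates of one status entry. */
typedef struct ClogAddress
{
    int64_t page;   /* which CLOG page holds the XID       */
    int     byte;   /* byte offset within that page        */
    int     shift;  /* bit shift of the 2-bit field (0..6) */
} ClogAddress;

static ClogAddress
clog_address(uint32_t xid)
{
    ClogAddress a;

    a.page  = xid / XACTS_PER_PAGE;
    a.byte  = (int) ((xid % XACTS_PER_PAGE) / XACTS_PER_BYTE);
    a.shift = (int) (xid % XACTS_PER_BYTE) * BITS_PER_XACT;
    return a;
}
```

For XID 100003 this yields page 3, byte 424, shift 6: the XID is entry 1699 on its page, which is the fourth 2-bit slot of byte 424.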

The read path is intentionally minimal. TransactionIdGetStatus computes the page and byte offset, calls SimpleLruReadPage_ReadOnly (which acquires the SLRU bank lock in shared mode and returns a buffer slot), extracts the 2-bit field, also returns the group LSN covering that XID (used by async-commit callers to know how far WAL must be flushed before the page can safely go to disk), and releases the bank lock:

// TransactionIdGetStatus — src/backend/access/transam/clog.c
XidStatus
TransactionIdGetStatus(TransactionId xid, XLogRecPtr *lsn)
{
    int64       pageno = TransactionIdToPage(xid);
    int         byteno = TransactionIdToByte(xid);
    int         bshift = TransactionIdToBIndex(xid) * CLOG_BITS_PER_XACT;
    int         slotno;
    char       *byteptr;
    XidStatus   status;

    slotno = SimpleLruReadPage_ReadOnly(XactCtl, pageno, xid);
    byteptr = XactCtl->shared->page_buffer[slotno] + byteno;

    status = (*byteptr >> bshift) & CLOG_XACT_BITMASK;

    *lsn = XactCtl->shared->group_lsn[GetLSNIndex(slotno, xid)];

    LWLockRelease(SimpleLruGetBankLock(XactCtl, pageno));

    return status;
}
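The shift-and-mask step in the function above can be tried in isolation. This is a minimal sketch with no SLRU or locking involved; `status_from_byte` is a hypothetical helper, and the byte value used in the note below is arbitrary.

```c
#include <stdint.h>

#define XACT_BITMASK 0x03   /* two bits per XID */

/* Hypothetical helper: read the 2-bit status stored in slot 0..3 of a byte.
 * Slot 0 occupies the two least significant bits. */
static int
status_from_byte(uint8_t byte, int slot)
{
    return (byte >> (slot * 2)) & XACT_BITMASK;
}
```

For the byte 0x9D (binary 10 01 11 01), slots 0 through 3 decode to committed (1), sub-committed (3), committed (1), and aborted (2).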

This function is the low-level primitive. The preferred entry point for callers is TransactionLogFetch in transam.c, which adds a single-slot cache (the last XID checked). Its callers, TransactionIdDidCommit and TransactionIdDidAbort, handle the TRANSACTION_STATUS_SUB_COMMITTED intermediate by following pg_subtrans to the parent and re-checking the parent's status.

Writing status: the multi-page atomicity protocol


Setting status is more elaborate because a subtransaction tree can span multiple CLOG pages and the commit must appear atomic to concurrent readers. The entry point from transam.c is TransactionIdCommitTree → TransactionIdSetTreeStatus. The protocol for a multi-page commit has three steps:

  1. Mark all sub-XIDs on other pages as SUB_COMMITTED.
  2. Atomically mark the top-level XID (and any sub-XIDs on its page) as COMMITTED.
  3. Mark the remaining sub-XIDs as COMMITTED.
// TransactionIdSetTreeStatus — src/backend/access/transam/clog.c
void
TransactionIdSetTreeStatus(TransactionId xid, int nsubxids,
                           TransactionId *subxids, XidStatus status,
                           XLogRecPtr lsn)
{
    int64       pageno = TransactionIdToPage(xid);
    int         i;

    /* count sub-xids on the same page as xid */
    for (i = 0; i < nsubxids; i++)
        if (TransactionIdToPage(subxids[i]) != pageno)
            break;

    if (i == nsubxids)
    {
        /* all on one page: single lock acquisition */
        TransactionIdSetPageStatus(xid, nsubxids, subxids, status, lsn,
                                   pageno, true);
    }
    else
    {
        /* multi-page: sub-commit others first, then commit top, then finalize */
        if (status == TRANSACTION_STATUS_COMMITTED)
            set_status_by_pages(nsubxids - i, subxids + i,
                                TRANSACTION_STATUS_SUB_COMMITTED, lsn);

        TransactionIdSetPageStatus(xid, i, subxids, status, lsn, pageno, false);

        set_status_by_pages(nsubxids - i, subxids + i, status, lsn);
    }
}

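A reader that sees SUB_COMMITTED cannot treat the sub-XID as committed on its own; it must look up the parent through pg_subtrans and check the parent's status. Before step 2 the top-level XID is still in progress, so the whole tree appears uncommitted. Once step 2 has completed, the top-level XID is committed and every sub-XID resolves as committed even if step 3 has not yet finished, so visibility is consistent.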

Figure 2 — Multi-page commit atomicity protocol

flowchart TD
    S[TransactionIdSetTreeStatus<br/>top XID on page p1] --> A{All subxids<br/>on page p1?}
    A -- yes --> B[single TransactionIdSetPageStatus<br/>lock p1 once]
    A -- no --> C[set_status_by_pages p2..pN<br/>SUB_COMMITTED]
    C --> D[TransactionIdSetPageStatus p1<br/>set top XID + same-page subs COMMITTED]
    D --> E[set_status_by_pages p2..pN<br/>COMMITTED]

Figure 2 — When a transaction tree spans multiple CLOG pages, sub-XIDs on remote pages pass through SUB_COMMITTED before the top-level commit is visible, preserving apparent atomicity.
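The three steps can be simulated on toy arrays to see what a concurrent reader would conclude at each point. This is a sketch under stated assumptions, not PostgreSQL code: the `status` and `parent` arrays stand in for pg_xact and pg_subtrans, XID 2 is the top-level transaction, XID 5 is a sub-XID assumed to live on a different page, and the reader is assumed to resolve SUB_COMMITTED by following the parent link.

```c
#include <stdint.h>
#include <string.h>

enum { IN_PROGRESS = 0, COMMITTED = 1, ABORTED = 2, SUB_COMMITTED = 3 };

#define NXIDS 8

static int      status[NXIDS];  /* toy stand-in for pg_xact                 */
static uint32_t parent[NXIDS];  /* toy stand-in for pg_subtrans, 0 = none   */

/* Reader side: follow parent links while the entry says SUB_COMMITTED. */
static int
resolve_committed(uint32_t xid)
{
    while (status[xid] == SUB_COMMITTED && parent[xid] != 0)
        xid = parent[xid];
    return status[xid] == COMMITTED;
}

/* Writer side: apply the first nsteps steps of the protocol, then report
 * whether a reader would consider the sub-XID committed. */
static int
sub_visible_after(int nsteps)
{
    const uint32_t top = 2, sub = 5;

    memset(status, 0, sizeof status);
    memset(parent, 0, sizeof parent);
    parent[sub] = top;              /* recorded when the sub-XID was assigned */

    if (nsteps >= 1)
        status[sub] = SUB_COMMITTED;   /* step 1: remote-page sub-XIDs */
    if (nsteps >= 2)
        status[top] = COMMITTED;       /* step 2: top-level XID        */
    if (nsteps >= 3)
        status[sub] = COMMITTED;       /* step 3: finalize sub-XIDs    */

    return resolve_committed(sub);
}
```

In this model the sub-XID is invisible after zero or one step and visible after two or three, so the switch happens exactly at the write of the top-level XID.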

At high concurrency, many backends commit simultaneously and all contend on the same SLRU bank lock for the current CLOG page. PostgreSQL uses a linked-list group-update mechanism to batch these writes: the first contending backend becomes the leader, collects all waiting backends into a singly-linked list through ProcGlobal->clogGroupFirst (a lock-free CAS chain via pg_atomic_compare_exchange_u32), acquires the bank lock once, writes status for the entire group, releases the lock, and wakes all followers via PGSemaphoreUnlock:

// TransactionGroupUpdateXidStatus (leader path) — clog.c
nextidx = pg_atomic_exchange_u32(&procglobal->clogGroupFirst,
                                 INVALID_PROC_NUMBER);

while (nextidx != INVALID_PROC_NUMBER)
{
    PGPROC     *nextproc = &ProcGlobal->allProcs[nextidx];

    TransactionIdSetPageStatusInternal(nextproc->clogGroupMemberXid,
                                       nextproc->subxidStatus.count,
                                       nextproc->subxids.xids,
                                       nextproc->clogGroupMemberXidStatus,
                                       nextproc->clogGroupMemberLsn,
                                       nextproc->clogGroupMemberPage);

    nextidx = pg_atomic_read_u32(&nextproc->clogGroupNext);
}
/* ... wake followers ... */

The group optimisation applies only when: (a) all group members are updating the same CLOG page, (b) the backend’s MyProc->xid matches the XID being written (i.e., it is the backend’s own commit, not a recovery replay), and (c) the subtransaction count does not exceed THRESHOLD_SUBTRANS_CLOG_OPT (5). When a group spans two bank-lock partitions (a race that leads different members to land on different pages), the leader switches bank locks mid-walk.
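The join-and-drain pattern behind the group update can be sketched with C11 atomics. This is an analogue rather than PostgreSQL's code: it uses `<stdatomic.h>` instead of the `pg_atomic_*` wrappers, leaves out semaphores, page checks, and locking, and the names `Member`, `group_join`, and `group_drain` are hypothetical.

```c
#include <stdatomic.h>
#include <stdint.h>

#define INVALID_IDX UINT32_MAX
#define NPROCS      4

typedef struct Member
{
    _Atomic uint32_t next;  /* index of the next member in the group list */
    int              done;  /* set by the leader once status is "written" */
} Member;

static Member           members[NPROCS];
static _Atomic uint32_t group_first = INVALID_IDX;

/* Push `self` onto the list head with a CAS loop.
 * Returns 1 if the list was empty, i.e. the caller becomes the leader. */
static int
group_join(uint32_t self)
{
    uint32_t head = atomic_load(&group_first);

    do
    {
        /* on CAS failure, head holds the fresh value and we relink */
        atomic_store(&members[self].next, head);
    } while (!atomic_compare_exchange_weak(&group_first, &head, self));

    return head == INVALID_IDX;
}

/* Leader: detach the whole list with one exchange, then walk it.
 * Returns the number of members processed. */
static int
group_drain(void)
{
    int      n = 0;
    uint32_t idx = atomic_exchange(&group_first, INVALID_IDX);

    while (idx != INVALID_IDX)
    {
        members[idx].done = 1;   /* stand-in for writing the member's status */
        idx = atomic_load(&members[idx].next);
        n++;
    }
    return n;
}
```

The design point is that followers never take the page lock: they pay for one CAS to enqueue, and the leader pays for one exchange to take ownership of every queued request at once.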

For async commits (synchronous_commit = off), the WAL commit record may not be flushed before pg_xact is updated. To honour the WAL rule on checkpoint (no dirty CLOG page may reach disk before its covering WAL is flushed), clog.c maintains a group_lsn array in the SLRU shared segment — one XLogRecPtr per group of CLOG_XACTS_PER_LSN_GROUP (32) transactions. TransactionIdSetStatusBit updates the group LSN when lsn is valid. The checkpointer consults these LSNs before writing dirty pages.
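The group-LSN bookkeeping reduces to an index computation and a "keep the maximum" update. The sketch below assumes an 8 kB block size and a flat LSN array laid out slot by slot; `lsn_index` and `advance_group_lsn` are illustrative names, and 0 stands in for an invalid LSN.

```c
#include <stdint.h>

#define SKETCH_BLCKSZ       8192   /* assumed default block size */
#define XACTS_PER_PAGE      (SKETCH_BLCKSZ * 4)
#define XACTS_PER_LSN_GROUP 32
#define LSNS_PER_PAGE       (XACTS_PER_PAGE / XACTS_PER_LSN_GROUP)

/* Index into a flat group-LSN array for buffer slot `slotno` and `xid`. */
static int
lsn_index(int slotno, uint32_t xid)
{
    return slotno * LSNS_PER_PAGE
        + (int) ((xid % XACTS_PER_PAGE) / XACTS_PER_LSN_GROUP);
}

/* Keep the highest commit LSN seen for a group.
 * A commit_lsn of 0 models "no LSN supplied" and leaves the entry alone. */
static uint64_t
advance_group_lsn(uint64_t current, uint64_t commit_lsn)
{
    if (commit_lsn != 0 && current < commit_lsn)
        return commit_lsn;
    return current;
}
```

With these constants each page carries 1024 group LSNs, so XIDs 0 through 31 share entry 0 of their slot and XID 32 starts entry 1.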

pg_subtrans stores one TransactionId (4 bytes) per XID — the immediate parent. A top-level transaction stores InvalidTransactionId; a savepoint-child stores the XID of the transaction that issued the SAVEPOINT statement.

The critical design property is volatility. Because subtransactions are, by definition, still open while their parent linkages are needed, and because recovery knows the parent links from the WAL anyway, pg_subtrans is not WAL-logged at all. At startup, StartupSUBTRANS zeroes every page that covers any XID between oldestActiveXID and nextXid:

// StartupSUBTRANS — src/backend/access/transam/subtrans.c
void
StartupSUBTRANS(TransactionId oldestActiveXID)
{
    FullTransactionId nextXid;
    int64       startPage, endPage;

    // ... acquire bank locks page by page ...
    for (;;)
    {
        (void) ZeroSUBTRANSPage(startPage);
        if (startPage == endPage)
            break;

        startPage++;
        if (startPage > TransactionIdToPage(MaxTransactionId))
            startPage = 0;
    }
    // ...
}
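The loop above has to cope with the XID space wrapping around, so the start page can be numerically larger than the end page. A minimal sketch of that iteration, assuming 8 kB blocks with 4-byte entries (2048 XIDs per page) and counting pages instead of zeroing them; `pages_visited` is a hypothetical helper:

```c
#include <stdint.h>

#define SUBTRANS_XIDS_PER_PAGE (8192 / 4)   /* assumed 8 kB blocks, 4-byte entries */
#define MAX_XID                UINT32_MAX
#define MAX_PAGE               ((int64_t) (MAX_XID / SUBTRANS_XIDS_PER_PAGE))

/* Count the pages visited from start_page to end_page inclusive,
 * wrapping to page 0 after MAX_PAGE. */
static int64_t
pages_visited(int64_t start_page, int64_t end_page)
{
    int64_t n = 0;

    for (;;)
    {
        n++;                        /* stand-in for zeroing the page */
        if (start_page == end_page)
            break;

        start_page++;
        if (start_page > MAX_PAGE)
            start_page = 0;         /* XID space wrapped around */
    }
    return n;
}
```

Starting on the last page and ending on page 1 visits three pages (the last page, page 0, page 1), which is the case a plain `start <= end` loop would miss.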

Writing a parent link is SubTransSetParent, called from AssignTransactionId (in xact.c) when a sub-XID is first assigned. Reading it is SubTransGetParent, used by SubTransGetTopmostTransaction which walks the chain child → parent → grandparent until it reaches InvalidTransactionId or an XID older than TransactionXmin:

// SubTransGetTopmostTransaction — subtrans.c
TransactionId
SubTransGetTopmostTransaction(TransactionId xid)
{
    TransactionId parentXid = xid, previousXid = xid;

    while (TransactionIdIsValid(parentXid))
    {
        previousXid = parentXid;
        if (TransactionIdPrecedes(parentXid, TransactionXmin))
            break;
        parentXid = SubTransGetParent(parentXid);
        if (!TransactionIdPrecedes(parentXid, previousXid))
            elog(ERROR, "pg_subtrans contains invalid entry: xid %u -> %u",
                 previousXid, parentXid);
    }
    return previousXid;
}

The loop guard against non-monotone parent chains prevents infinite loops on corrupted data. The TransactionXmin cut-off stops the walk before it reaches XIDs whose pg_subtrans pages may already have been truncated.
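The walk can be reproduced on a toy parent table. This sketch makes several simplifying assumptions: XIDs are compared with plain `<` (no wraparound), corruption is reported by returning 0 instead of raising an error, and `parent_of` and `topmost` are hypothetical names.

```c
#include <stdint.h>

#define NXIDS       16
#define INVALID_XID 0u

/* Toy parent table: parent_of[x] is the immediate parent, 0 = top-level. */
static const uint32_t parent_of[NXIDS] = {
    [10] = 0,    /* top-level transaction             */
    [11] = 10,   /* savepoint under 10                */
    [12] = 11,   /* nested savepoint under 11         */
    [13] = 14,   /* corrupt: parent newer than child  */
};

/* Walk child -> parent until a top-level XID or the xmin cut-off is reached.
 * Returns INVALID_XID if the chain is not strictly decreasing. */
static uint32_t
topmost(uint32_t xid, uint32_t xmin)
{
    uint32_t parent = xid, previous = xid;

    while (parent != INVALID_XID)
    {
        previous = parent;
        if (parent < xmin)
            break;                  /* older than the cut-off: stop here */
        parent = parent_of[parent];
        if (parent >= previous)
            return INVALID_XID;     /* sketch: report corruption to the caller */
    }
    return previous;
}
```

With a low cut-off, XID 12 resolves to 10. With a cut-off of 12 the walk stops at 11, which shows how the cut-off can yield an intermediate ancestor rather than the true top-level XID.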

pg_commit_ts: optional commit timestamp and origin


pg_commit_ts stores a 10-byte CommitTimestampEntry per XID:

// CommitTimestampEntry — src/backend/access/transam/commit_ts.c
typedef struct CommitTimestampEntry
{
    TimestampTz time;       /* 8 bytes */
    RepOriginId nodeid;     /* 2 bytes */
} CommitTimestampEntry;
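The 10-byte figure is the size of the meaningful fields, not `sizeof` the struct, which compilers usually pad for alignment. The sketch below shows that arithmetic with stand-in types (`int64_t` for TimestampTz, `uint16_t` for RepOriginId) and an assumed 8 kB page; all names in it are illustrative.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct SketchCommitTsEntry
{
    int64_t  time;     /* stand-in for TimestampTz, 8 bytes */
    uint16_t nodeid;   /* stand-in for RepOriginId, 2 bytes */
} SketchCommitTsEntry;

/* Stored size: everything up to and including nodeid, no trailing padding. */
#define ENTRY_SIZE \
    (offsetof(SketchCommitTsEntry, nodeid) + sizeof(uint16_t))

/* Entries per page, assuming an 8 kB block. */
#define ENTRIES_PER_PAGE (8192 / ENTRY_SIZE)

/* Page number holding the entry for xid. */
static int64_t
cts_page(uint32_t xid)
{
    return (int64_t) (xid / ENTRIES_PER_PAGE);
}

/* Byte offset of the entry for xid within its page. */
static size_t
cts_offset(uint32_t xid)
{
    return (xid % ENTRIES_PER_PAGE) * ENTRY_SIZE;
}
```

Under these assumptions a page holds 819 entries, so XID 819 is the first entry of page 1 and XID 820 sits at byte offset 10 of that page.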

RepOriginId identifies the replication origin — which primary or intermediate node the transaction originated on — enabling logical-replication consumers to filter or order commits by origin.

The entire module is gated by track_commit_timestamp (GUC). When disabled, all write calls are no-ops (commitTsShared->commitTsActive == false). When enabled, TransactionTreeSetCommitTsData is called from RecordTransactionCommit in xact.c alongside the WAL commit record write, recording the timestamp and origin for the top-level XID and all its sub-XIDs:

// TransactionTreeSetCommitTsData (write path) — commit_ts.c
void
TransactionTreeSetCommitTsData(TransactionId xid, int nsubxids,
                               TransactionId *subxids, TimestampTz timestamp,
                               RepOriginId nodeid)
{
    if (!commitTsShared->commitTsActive)
        return;

    headxid = xid;
    i = 0;
    for (;;)
    {
        int64       pageno = TransactionIdToCTsPage(headxid);

        /* find subxids on the same page ... */
        SetXidCommitTsInPage(headxid, j - i, subxids + i,
                             timestamp, nodeid, pageno);

        if (j >= nsubxids)
            break;

        headxid = subxids[j];
        i = j + 1;
    }

    /* update the in-memory cache of the most recent commit */
    LWLockAcquire(CommitTsLock, LW_EXCLUSIVE);
    commitTsShared->xidLastCommit = xid;
    commitTsShared->dataLastCommit.time = timestamp;
    commitTsShared->dataLastCommit.nodeid = nodeid;
    // ... update newestCommitTsXid ...
    LWLockRelease(CommitTsLock);
}
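The page-by-page batching in the excerpt above can be sketched as a standalone function that only counts how many per-page calls would be made. It assumes 819 entries per page (8 kB page, 10-byte entries) and sub-XIDs sorted in ascending order; `page_of` and `count_page_batches` are hypothetical names, and the inner loop is one plausible way to find the sub-XIDs sharing a page.

```c
#include <stdint.h>

#define CTS_XIDS_PER_PAGE 819   /* assumed: 8 kB page / 10-byte entry */

static int64_t
page_of(uint32_t xid)
{
    return xid / CTS_XIDS_PER_PAGE;
}

/* Count the per-page batches needed for a top-level xid plus its subxids. */
static int
count_page_batches(uint32_t xid, int nsubxids, const uint32_t *subxids)
{
    uint32_t headxid = xid;
    int      i = 0;
    int      batches = 0;

    for (;;)
    {
        int64_t pageno = page_of(headxid);
        int     j;

        /* extend the batch over subxids that share the head's page */
        for (j = i; j < nsubxids; j++)
            if (page_of(subxids[j]) != pageno)
                break;

        batches++;              /* stand-in for one per-page write call */

        if (j >= nsubxids)
            break;

        headxid = subxids[j];   /* first sub-XID on the next page heads the next batch */
        i = j + 1;
    }
    return batches;
}
```

A transaction with XID 800 and sub-XIDs 810, 820, 830 needs two batches, because 800 and 810 fall on page 0 while 820 and 830 fall on page 1.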

Unlike CLOG, pg_commit_ts does not have a group-commit mechanism; contention is not a bottleneck here because each entry is 10 bytes (fewer transactions per page) and commit timestamps are written once and rarely re-read on the critical path.

The read path — TransactionIdGetCommitTsData — first checks the in-memory CommitTimestampShared cache (xidLastCommit), then falls through to the SLRU if the requested XID is not the most recent. It validates the requested XID against [oldestCommitTsXid, newestCommitTsXid] before doing any page I/O.
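Because XIDs wrap around, the range check cannot be a plain numeric comparison. The following sketch shows a circular "precedes" test and a range check built on it; it omits the special handling PostgreSQL applies to reserved (non-normal) XIDs, and `xid_precedes` and `xid_in_range` are illustrative names.

```c
#include <stdint.h>

/* Circular comparison: a precedes b if the signed 32-bit difference is negative. */
static int
xid_precedes(uint32_t a, uint32_t b)
{
    return (int32_t) (a - b) < 0;
}

/* Returns 1 if xid lies within [oldest, newest] in circular XID order. */
static int
xid_in_range(uint32_t xid, uint32_t oldest, uint32_t newest)
{
    return !xid_precedes(xid, oldest) && !xid_precedes(newest, xid);
}
```

The check still works when the valid window straddles the wrap point, for example with the oldest XID just below 2^32 and the newest at 500.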

pg_commit_ts has an activation layer that the other two SLRUs lack, because its GUC can change between restarts and a standby must mirror the primary’s setting:

StartupCommitTs() — calls ActivateCommitTs()
CompleteCommitTsInitialization() — calls Activate or Deactivate based on GUC
CommitTsParameterChange() — called on WAL replay of XLOG_PARAMETER_CHANGE

ActivateCommitTs creates the initial segment if missing and sets commitTsActive = true. DeactivateCommitTs removes the pg_commit_ts segment files and sets commitTsActive = false.

All three SLRUs follow the same lifecycle hook pattern, invoked from fixed call sites in postmaster startup, recovery, and checkpoint code:

| Phase | CLOG | SUBTRANS | COMMIT_TS |
| --- | --- | --- | --- |
| Postmaster alloc | CLOGShmemSize / CLOGShmemInit | SUBTRANSShmemSize / SUBTRANSShmemInit | CommitTsShmemSize / CommitTsShmemInit |
| initdb | BootStrapCLOG | BootStrapSUBTRANS | BootStrapCommitTs (no-op) |
| Startup | StartupCLOG + TrimCLOG | StartupSUBTRANS | StartupCommitTs |
| Checkpoint | CheckPointCLOG | CheckPointSUBTRANS | CheckPointCommitTs |
| XID extend | ExtendCLOG | ExtendSUBTRANS | ExtendCommitTs |
| Truncate | TruncateCLOG | TruncateSUBTRANS | TruncateCommitTs |

TrimCLOG (called once after startup/recovery) zeroes the unused tail of the current CLOG page, preventing stale bits from a previous database lifecycle from being misread as legitimate status.
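Zeroing the unused tail comes down to masking the partially used byte and clearing every byte after it. The sketch below works on a plain byte array and assumes an 8 kB page; `trim_page_tail` and the demo helper `byte_after_trim` are hypothetical names, not the PostgreSQL functions.

```c
#include <stdint.h>
#include <string.h>

#define PAGE_BYTES     8192   /* assumed default block size */
#define XACTS_PER_BYTE 4
#define XACTS_PER_PAGE (PAGE_BYTES * XACTS_PER_BYTE)

/* Zero the status bits for next_xid and every later XID on the page. */
static void
trim_page_tail(uint8_t *page, uint32_t next_xid)
{
    uint32_t pgindex = next_xid % XACTS_PER_PAGE;

    if (pgindex == 0)
        return;                 /* next_xid starts a fresh page: nothing to trim */

    size_t byteno = pgindex / XACTS_PER_BYTE;
    int    bshift = (int) (next_xid % XACTS_PER_BYTE) * 2;

    /* keep only the slots below next_xid in the current byte */
    page[byteno] &= (uint8_t) ((1 << bshift) - 1);

    /* clear the rest of the page */
    memset(page + byteno + 1, 0, PAGE_BYTES - byteno - 1);
}

/* Demo helper: fill a page with `fill`, trim at next_xid, return byte `which`. */
static uint8_t
byte_after_trim(uint8_t fill, uint32_t next_xid, size_t which)
{
    static uint8_t page[PAGE_BYTES];

    memset(page, fill, sizeof page);
    trim_page_tail(page, next_xid);
    return page[which];
}
```

With a page full of 0xFF and a next XID of 6, byte 0 is untouched, byte 1 keeps only its two lowest slots (0x0F), and every later byte becomes zero.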

  • CLOGShmemInit — registers XactCtlData with SimpleLruInit, directory pg_xact, LWTranche LWTRANCHE_XACT_BUFFER/LWTRANCHE_XACT_SLRU.
  • TransactionIdSetTreeStatus — top-level entry point for writing status; handles single-page and multi-page trees; calls into TransactionIdSetPageStatus.
  • TransactionIdSetPageStatus — per-page dispatch; attempts group-update optimisation first (TransactionGroupUpdateXidStatus); falls back to direct LWLockAcquire + TransactionIdSetPageStatusInternal.
  • TransactionGroupUpdateXidStatus — group-commit leader/follower logic via ProcGlobal->clogGroupFirst CAS list; leader acquires bank lock, writes all members, wakes followers via PGSemaphoreUnlock.
  • TransactionIdSetPageStatusInternal — actual 2-bit write; calls TransactionIdSetStatusBit for each XID; marks the SLRU page dirty.
  • TransactionIdSetStatusBit — bit-level write: computes byte + shift, applies mask, updates group_lsn for async-commit tracking.
  • TransactionIdGetStatus — read path; uses SimpleLruReadPage_ReadOnly; returns status + group LSN.
  • StartupCLOG — records latest_page_number in shared memory.
  • TrimCLOG — zeroes unused bits in the current page after recovery.
  • CheckPointCLOG — calls SimpleLruWriteAll to flush dirty pages.
  • ExtendCLOG — zeroes a new page when nextXid crosses a page boundary (called under XidGenLock).
  • TruncateCLOG — calls SimpleLruTruncate to drop segments older than the cutoff; the cutoff is tied to the pg_database.datfrozenxid bookkeeping used for freeze horizon tracking.
  • TransactionLogFetch — single-slot cache wrapper around TransactionIdGetStatus; its callers (TransactionIdDidCommit, TransactionIdDidAbort) follow SUB_COMMITTED → pg_subtrans parent → recheck.
  • TransactionIdCommitTree / TransactionIdAsyncCommitTree — sync and async wrappers; pass InvalidXLogRecPtr vs. valid LSN to TransactionIdSetTreeStatus.
  • TransactionIdAbortTree — abort wrapper; always passes InvalidXLogRecPtr (abort records need not be flushed before status is written).
  • SUBTRANSShmemInit — registers SubTransCtlData with SimpleLruInit, directory pg_subtrans, no LSN array (volatile, no WAL), SYNC_HANDLER_NONE.
  • SubTransSetParent — writes one TransactionId slot; called by AssignTransactionId in xact.c when a sub-XID is assigned.
  • SubTransGetParent — read-only page lookup.
  • SubTransGetTopmostTransaction — walks child→parent chain stopping at TransactionXmin; used by MVCC visibility checks for sub-XID tuples.
  • StartupSUBTRANS — zeroes active-range pages; requires no WAL.
  • ExtendSUBTRANS — zeroes new page at XID page boundary; called under XidGenLock.
  • TruncateSUBTRANS — frees pages below TransactionXmin.
  • CommitTsShmemInit — registers CommitTsCtlData + allocates CommitTimestampShared shmem struct for the in-memory cache.
  • TransactionTreeSetCommitTsData — main write entry point; iterates pages, calls SetXidCommitTsInPage per page, updates commitTsShared cache.
  • SetXidCommitTsInPage — acquires bank lock, reads page, calls TransactionIdSetCommitTs per XID.
  • TransactionIdGetCommitTsData — read entry point; checks cache, validates XID range, falls through to SLRU read.
  • GetLatestCommitTsData — returns xidLastCommit + cached entry under CommitTsLock.
  • ActivateCommitTs / DeactivateCommitTs — toggle commitTsActive flag; called at startup and on WAL replay of XLOG_PARAMETER_CHANGE.
  • StartupCommitTs / CompleteCommitTsInitialization — startup lifecycle hooks.
  • TruncateCommitTs — called from vac_truncate_clog in vacuum.c.

Position-hint table (commit 273fe94, 2026-06-05)

| Symbol | File | Line |
| --- | --- | --- |
| TRANSACTION_STATUS_* | src/include/access/clog.h | 27–30 |
| TransactionIdSetTreeStatus | src/backend/access/transam/clog.c | 183 |
| TransactionIdSetPageStatus | src/backend/access/transam/clog.c | 293 |
| TransactionIdSetPageStatusInternal | src/backend/access/transam/clog.c | 364 |
| TransactionGroupUpdateXidStatus | src/backend/access/transam/clog.c | 441 |
| TransactionIdSetStatusBit | src/backend/access/transam/clog.c | 661 |
| TransactionIdGetStatus | src/backend/access/transam/clog.c | 735 |
| CLOGShmemInit | src/backend/access/transam/clog.c | 787 |
| StartupCLOG | src/backend/access/transam/clog.c | 877 |
| TrimCLOG | src/backend/access/transam/clog.c | 892 |
| CheckPointCLOG | src/backend/access/transam/clog.c | 937 |
| ExtendCLOG | src/backend/access/transam/clog.c | 959 |
| TruncateCLOG | src/backend/access/transam/clog.c | 1000 |
| TransactionLogFetch | src/backend/access/transam/transam.c | 52 |
| TransactionIdCommitTree | src/backend/access/transam/transam.c | 240 |
| TransactionIdAsyncCommitTree | src/backend/access/transam/transam.c | 252 |
| TransactionIdAbortTree | src/backend/access/transam/transam.c | 270 |
| SubTransSetParent | src/backend/access/transam/subtrans.c | 85 |
| SubTransGetParent | src/backend/access/transam/subtrans.c | 122 |
| SubTransGetTopmostTransaction | src/backend/access/transam/subtrans.c | 163 |
| SUBTRANSShmemInit | src/backend/access/transam/subtrans.c | 220 |
| StartupSUBTRANS | src/backend/access/transam/subtrans.c | 309 |
| CommitTimestampEntry | src/backend/access/transam/commit_ts.c | 55 |
| CommitTimestampShared | src/backend/access/transam/commit_ts.c | 98 |
| TransactionTreeSetCommitTsData | src/backend/access/transam/commit_ts.c | 141 |
| TransactionIdGetCommitTsData | src/backend/access/transam/commit_ts.c | 274 |
| CommitTsShmemInit | src/backend/access/transam/commit_ts.c | 530 |
| ActivateCommitTs | src/backend/access/transam/commit_ts.c | 705 |
| DeactivateCommitTs | src/backend/access/transam/commit_ts.c | 785 |
  • pg_xact uses exactly 2 bits per XID, with four status values. Verified from the clog.h constants and TransactionIdSetStatusBit in clog.c. The 4th value SUB_COMMITTED (0x03) is an intermediate used only during multi-page tree commits and is never externally visible as a final status.

  • The group-update optimisation is gated on THRESHOLD_SUBTRANS_CLOG_OPT = 5. Verified at clog.c:301–326. The check compares nsubxids and MyProc->subxidStatus.count as a safety condition. Transactions with more than 5 sub-XIDs bypass the group path and acquire the bank lock directly.

  • pg_subtrans is re-zeroed over the active XID range at every startup, not replayed from WAL. Verified in subtrans.c:309–349. The module header comment and StartupSUBTRANS both confirm: “we only need to remember pg_subtrans information for currently-open transactions … no need to preserve data over a crash and restart.” SYNC_HANDLER_NONE in SimpleLruInit confirms no sync callbacks are registered.

  • track_commit_timestamp = off makes all pg_commit_ts writes no-ops. Verified in TransactionTreeSetCommitTsData (commit_ts.c:157): the function returns immediately when !commitTsShared->commitTsActive. This is an unlocked read, safe because on a standby the flag is only changed by the recovery process (which also calls this function).

  • TrimCLOG zeroes the unused tail of the current CLOG page. Verified at clog.c:892–931. The comment explains why: WAL replay during recovery might settle on a nextXID less than what the previous lifecycle wrote, leaving potentially non-zero bits beyond the new frontier. TrimCLOG makes those safe.

  • The group LSN array in CLOG shared memory has one entry per 32 XIDs. Verified: CLOG_XACTS_PER_LSN_GROUP = 32 (clog.c:92), CLOG_LSNS_PER_PAGE = CLOG_XACTS_PER_PAGE / 32 (clog.c:93). This is the granularity at which async-commit LSN tracking operates.

  • CommitTimestampShared holds only the single most-recently-committed XID. Verified at commit_ts.c:98–103. The cache is a single-slot cache (xidLastCommit + dataLastCommit), not a ring or multi-slot structure. On a cache miss the code falls through to the SLRU page.
  1. TruncateCLOG and pg_database bookkeeping. TruncateCLOG calls AdvanceOldestClogXid and updates pg_database.datfrozenxid bookkeeping for freeze-horizon management. The exact interaction with vac_truncate_clog in vacuum.c and autovacuum’s freeze logic is not analysed here. The full chain is: autovacuum freeze → vac_truncate_clog → TruncateCLOG. That chain belongs in postgres-vacuum.md and postgres-xid-wraparound-freeze.md.

  2. Group-update page mismatch handling. When the group-update leader discovers that a follower’s page number differs from the initial page (the race described at clog.c:490–513), it switches bank locks mid-walk. The code path for multi-bank-lock group walks is present but the performance characteristics under high contention with large subtransaction trees are not experimentally characterised.

  3. pg_commit_ts and oldestCommitTsXid maintenance. TransamVariables->oldestCommitTsXid is used by TransactionIdGetCommitTsData to bound valid lookups. The code that advances this lower bound on truncation and the exact truncation trigger (checkpoint? autovacuum? both?) is in TruncateCommitTs (commit_ts.c:890) but the full orchestration path is unverified.

Beyond PostgreSQL — Comparative Designs & Research Frontiers

  • Oracle’s Interested Transaction List (ITL) and undo-based visibility. Oracle stores transaction slots directly in the heap block header (the ITL), not in a central status bitmap. Visibility is resolved by consulting the ITL slot or, for older transactions, chasing the undo chain. This eliminates the CLOG-equivalent hot spot at the cost of per-block transaction metadata. A side-by-side comparison would quantify the CLOG bank-lock contention that PostgreSQL’s group-update optimisation is designed to mitigate.

  • MySQL/InnoDB’s trx_sys and purge system. InnoDB’s transaction descriptor lives in trx_sys (a special segment), and visibility uses a “read view” that snapshots the active transaction list. Committed transaction records are purged lazily by the purge thread, which is the InnoDB analog of PostgreSQL vacuum. The CLOG equivalent is implicit in the undo log: a transaction whose undo space is reclaimed is definitively committed (from the purge system’s perspective).

  • MVCC without a central TST: HANA’s in-memory MVCC. SAP HANA and other in-memory engines avoid a persistent TST by keeping transaction status entirely in memory (since the entire database fits there) and rebuilding it at restart from a committed transaction log. This is possible only when restart time is bounded and the status table fits in DRAM. PostgreSQL’s pg_xact is the durable fallback for when neither condition holds.

  • pg_commit_ts and logical replication conflict resolution. The combination of commit timestamp + RepOriginId in pg_commit_ts is the enabling data structure for multi-master conflict resolution. A logical replication apply worker can compare timestamps across origins to implement last-write-wins. The paper Serializable Snapshot Isolation in PostgreSQL (Ports & Grittner, VLDB 2012; knowledge/research/dbms-papers/) does not discuss commit timestamps directly, but the architecture dovetails: SSI detects serialisation conflicts at commit time, and commit_ts provides the ordering evidence.

  • SUBTRANS volatility vs. full persistence. The design decision to not WAL-log pg_subtrans trades I/O cost for a constraint: the subtransaction chain cannot be walked for transactions older than TransactionXmin. This means SubTransGetTopmostTransaction can return an intermediate XID rather than the true topmost parent for very old sub-XIDs — documented and accepted in the comment at subtrans.c:151–159. A fully persistent design (as in some distributed MVCC systems that need global sub-XID resolution) would require WAL logging and a truncation-safe read path.

Source code (REL_18_STABLE, commit 273fe94)

  • src/backend/access/transam/clog.c — 1152 lines; CLOG SLRU client
  • src/backend/access/transam/subtrans.c — 447 lines; SUBTRANS SLRU client
  • src/backend/access/transam/commit_ts.c — 1073 lines; CommitTs SLRU client
  • src/backend/access/transam/transam.c — higher-level callers (TransactionIdCommitTree, TransactionLogFetch)
  • src/include/access/clog.h — XidStatus typedef and status constants
  • src/include/access/slru.h — SlruCtlData, SimpleLru* API declarations

Textbooks

  • Database System Concepts, Silberschatz, Korth, Sudarshan (7th ed.) — §17.6 “Implementation of Atomicity and Durability” (WAL, commit log record as commit point)
  • Database Internals, Petrov (2019) — ch. 5 MVCC and ch. 7 Log-Structured Storage (background on TST design patterns)

Papers

  • ARIES (Mohan et al., ACM TODS 1992) — WAL correctness and recovery protocol that underpins pg_xact durability guarantees; knowledge/research/dbms-papers/aries.md

Adjacent documents in this KB

  • postgres-slru.md — SLRU substrate (slru.c): buffer pool, bank locks, page lifecycle, SimpleLru* API (forward reference — doc not yet written as of 2026-06-05)
  • postgres-xact.md — transaction state machine and commit pipeline; the producer of CLOG and commit_ts writes
  • postgres-mvcc-snapshots.md — snapshot acquisition and visibility check; the primary consumer of TransactionLogFetch
  • postgres-recovery-redo.md — replays CLOG WAL records (XLOG_CLOG_ZEROPAGE, XLOG_CLOG_TRUNCATE) on restart
  • postgres-vacuum.md — drives TruncateCLOG and TruncateSUBTRANS via vac_truncate_clog