PostgreSQL CLOG & Commit Timestamp — Transaction Status Bitmaps, Subtransaction Parentage, and the commit_ts SLRU
Contents:
- Theoretical Background
- Common DBMS Design
- PostgreSQL’s Approach
- Source Walkthrough
- Source verification (as of 2026-06-05)
- Beyond PostgreSQL — Comparative Designs & Research Frontiers
- Sources
Theoretical Background
When a transaction commits, two things must become permanently true: (1) the write-ahead log record that captures the commit decision is flushed to stable storage so recovery can replay it, and (2) some persistent indicator tells every concurrent and future reader that the transaction is committed. In a multi-version concurrency control (MVCC) engine like PostgreSQL, that indicator is especially critical because heap tuples carry XID stamps, not pre-resolved visibility bits. Every visibility decision reduces to one question: “is XID X committed, aborted, or still in progress?” That lookup must be fast, cheap, and accurate.
Database System Concepts (Silberschatz et al., 7th ed., §17.6 “Implementation
of Atomicity and Durability”) describes the commit record in the log as the
atomic commit point: once the commit log record reaches stable storage, the
transaction is committed regardless of what happens to the data pages. PostgreSQL
honours this exactly. The WAL commit record is the decision; the two-bit status
entry in pg_xact is the announcement of that decision to the rest of the
system, updated synchronously (for sync commits) after the flush.
Two design choices shape every persistent transaction-status store:
- Granularity of what is stored. The minimum is a single committed/aborted bit per top-level XID. A richer design stores per-subtransaction status, a commit timestamp, and a replication-origin tag. Each additional field has a cost in storage and a benefit in observability.
- Durability requirement. A committed-status bitmap must survive crashes; recovery replays it from WAL. A parent-linkage index for subtransactions (needed only while those subtransactions are still open) does not need to survive crashes; it can be zeroed at startup because no XID older than TransactionXmin can be an open subtransaction.
PostgreSQL resolves these two choices into three separate SLRU stores: pg_xact
(durable, 2 bits/XID), pg_subtrans (volatile, 4 bytes/XID), and pg_commit_ts
(durable but optional, 10 bytes/XID). The SLRU substrate that all three share is
documented in postgres-slru.md; this document focuses on what each client
stores, how it is read and written, and where its lifecycle hooks live.
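The per-XID sizes above translate directly into how many XIDs each SLRU page covers. The following is a minimal sketch of that arithmetic, assuming the default 8192-byte block size; the constant and function names are illustrative and are not PostgreSQL identifiers.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: default block size; a custom build may use another value. */
#define SKETCH_BLCKSZ 8192u

/* pg_xact: 2 bits per XID, so 4 XIDs per byte. */
static uint32_t xact_xids_per_page(void) { return SKETCH_BLCKSZ * 4u; }

/* pg_subtrans: one 4-byte parent XID per XID. */
static uint32_t subtrans_xids_per_page(void) { return SKETCH_BLCKSZ / 4u; }

/* pg_commit_ts: one 10-byte entry per XID; the leftover bytes stay unused. */
static uint32_t commit_ts_xids_per_page(void) { return SKETCH_BLCKSZ / 10u; }

/* Page number that holds a given XID in a store with the given density. */
static uint64_t page_of(uint32_t xid, uint32_t xids_per_page)
{
    assert(xids_per_page > 0);
    return xid / xids_per_page;
}
```

Under this assumption one pg_xact page covers 32768 XIDs, while a pg_commit_ts page covers only 819, which is why the optional store grows roughly 40 times faster per transaction.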
The ARIES recovery framework (ARIES: A Transaction Recovery Method Supporting
Fine-Granularity Locking and Partial Rollbacks Using Write-Ahead Logging,
Mohan et al., ACM TODS 1992; knowledge/research/dbms-papers/aries.md) is the
theoretical basis for the WAL-before-data rule that makes pg_xact a safe
announcement medium: because the commit WAL record is flushed before
pg_xact is written (for sync commits), and because recovery replays the
commit record on restart, the pg_xact bitmap is always consistent with the
WAL.
Common DBMS Design
Nearly every MVCC engine maintains some form of a transaction status table (TST): a persistent, indexed structure that maps a transaction identifier to its final disposition (committed or aborted). The TST is the second of the two places a reader must consult: first the heap tuple’s XID stamps, then the TST to resolve whether those stamps are committed. The engineering conventions that shape most TST implementations are worth naming before diving into PostgreSQL’s choices.
Compact fixed-width entries
Transaction IDs are sequential integers. The natural index into a TST is therefore a dense array: entry i stores the status of transaction i. Because status requires only three or four logical values (in-progress, committed, aborted, plus a sub-committed intermediate for multi-page atomicity), a 2-bit field suffices: four transactions per byte, tens of thousands per 8 kB page. Compact representation means fewer I/Os and a smaller shared-memory buffer pool. Other engines keep comparably compact status in different places (Oracle’s transaction table slots in the undo header, MySQL/InnoDB’s purge status in the trx_sys segment).
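The dense 2-bit array can be sketched in a few lines. This is an illustrative, engine-neutral version (the names tst_get and tst_set are hypothetical), showing how an entry index maps to a byte and a shift within that byte.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { BITS_PER_ENTRY = 2, ENTRIES_PER_BYTE = 4, ENTRY_MASK = 0x3 };

/* Read the 2-bit status of entry i from a packed table. */
static unsigned tst_get(const uint8_t *table, size_t i)
{
    unsigned shift = (unsigned) (i % ENTRIES_PER_BYTE) * BITS_PER_ENTRY;
    return (table[i / ENTRIES_PER_BYTE] >> shift) & ENTRY_MASK;
}

/* Overwrite the 2-bit status of entry i, leaving its three neighbours intact. */
static void tst_set(uint8_t *table, size_t i, unsigned status)
{
    unsigned shift = (unsigned) (i % ENTRIES_PER_BYTE) * BITS_PER_ENTRY;
    uint8_t *b = &table[i / ENTRIES_PER_BYTE];

    assert(status <= ENTRY_MASK);
    *b = (uint8_t) ((*b & ~((unsigned) ENTRY_MASK << shift)) | (status << shift));
}
```

Because four entries share a byte, every write is a read-modify-write of that byte, which is one reason writers need a lock that covers the whole page rather than a per-entry latch.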
Page-at-a-time updates with a single lock
Writing status for a batch of transactions on the same TST page requires only one lock acquisition rather than one per transaction. Production engines group updates that land on the same page and hold the page lock for the entire batch. This is especially valuable at commit time when many concurrent transactions share a page of the TST.
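The batching idea can be illustrated by splitting an ascending XID list into same-page runs and visiting each page once. This is a sketch under stated assumptions: the page density is hard-coded for 8 kB pages at 2 bits per XID, and page_writer is a hypothetical callback standing in for "lock the page, write the run, unlock".

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define XIDS_PER_PAGE 32768u   /* assumption: 8 kB pages, 2 bits per XID */

/* Hypothetical per-page writer: receives one run of XIDs on a single page. */
typedef void (*page_writer)(uint32_t pageno, const uint32_t *xids,
                            size_t n, void *arg);

/*
 * Walk XIDs (assumed ascending) and call fn once per consecutive same-page
 * run. Returns the number of page visits, i.e. lock acquisitions needed.
 */
static size_t write_by_pages(const uint32_t *xids, size_t n,
                             page_writer fn, void *arg)
{
    size_t visits = 0;
    size_t i = 0;

    assert(xids != NULL || n == 0);
    while (i < n) {
        uint32_t pageno = xids[i] / XIDS_PER_PAGE;
        size_t j = i + 1;

        while (j < n && xids[j] / XIDS_PER_PAGE == pageno)
            j++;
        if (fn != NULL)
            fn(pageno, xids + i, j - i, arg);
        visits++;
        i = j;
    }
    return visits;
}
```

For six XIDs spread over three pages this makes three visits instead of six, which is the saving the paragraph describes.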
Separate volatile parentage index for subtransactions
SQL savepoints and PL exception blocks create subtransactions. A reader that encounters a subtransaction XID in a heap tuple needs to know the top-level parent’s status, not the subtransaction’s independent status, because a subtransaction commits only if all its ancestors also commit. The parentage data (child → parent links) is needed only while the subtransaction tree is open. Once the top-level transaction commits or aborts, all subtransaction XIDs either inherit committed status or are marked aborted. The parent-link index can therefore be volatile, zeroed at crash recovery startup, because no reader needs it for XIDs older than the oldest active XID.
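A toy in-memory parent index makes the child → parent walk concrete. This sketch uses a fixed array and hypothetical names (parent_of, set_parent, topmost); it relies on the property, also checked by the PostgreSQL code shown later in this document, that a parent's XID always precedes its child's.

```c
#include <assert.h>
#include <stdint.h>

#define INVALID_XID 0u
#define MAX_XID 64u

/* parent_of[x] is x's immediate parent, or INVALID_XID for a top-level XID. */
static uint32_t parent_of[MAX_XID];

/* Record a child -> parent link; parents receive their XID first. */
static void set_parent(uint32_t child, uint32_t parent)
{
    assert(child < MAX_XID);
    assert(parent != INVALID_XID && parent < child);
    parent_of[child] = parent;
}

/* Follow links upward until a top-level XID is reached. */
static uint32_t topmost(uint32_t xid)
{
    assert(xid < MAX_XID);
    while (parent_of[xid] != INVALID_XID)
        xid = parent_of[xid];
    return xid;
}
```

Since every link points to a strictly smaller XID, the walk always terminates; zeroing the array (the volatile reset) simply makes every XID look top-level again.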
Optional per-XID timestamp for replication and auditing
Some engines record a commit timestamp alongside the status. This is never
required for visibility decisions (snapshot isolation needs only before/after
ordering, not wall-clock time), but it is invaluable for logical replication
conflict resolution, CDC consumers that want commit ordering, and audit queries
(pg_xact_commit_timestamp). Because it is optional, implementations gate it
behind a configuration flag and store it in a separate file rather than bloating
the primary TST.
Theory ↔ PostgreSQL mapping
| Concept | PostgreSQL name |
|---|---|
| Transaction status table | pg_xact/ (CLOG), managed by clog.c |
| 2-bit status entry | XidStatus (0=in-progress, 1=committed, 2=aborted, 3=sub-committed) |
| Sub-committed intermediate | TRANSACTION_STATUS_SUB_COMMITTED (0x03) |
| Subtransaction parentage index | pg_subtrans/, managed by subtrans.c |
| Optional commit timestamp store | pg_commit_ts/, managed by commit_ts.c |
| SLRU page buffer pool | SlruCtlData / SimpleLru* API (slru.c) |
| Page-level lock | SLRU bank lock (SimpleLruGetBankLock) |
| Group commit batch | TransactionGroupUpdateXidStatus via clogGroupFirst linked list |
PostgreSQL’s Approach
The three-SLRU landscape
PostgreSQL’s transaction metadata lives in three co-operating SLRU volumes, each
in its own subdirectory under $PGDATA:
- pg_xact/: 2 bits/XID, durable, WAL-redo-recoverable
- pg_subtrans/: 4 bytes/XID, volatile, zeroed at startup
- pg_commit_ts/: 10 bytes/XID, durable, optional (track_commit_timestamp)
All three are built on the same slru.c substrate. Each registers a
SlruCtlData instance (XactCtlData, SubTransCtlData, CommitTsCtlData) and
calls SimpleLruInit at postmaster startup. The difference is what they store
per transaction and what their durability contract is.
Figure 1 — Three SLRU clients and their shared substrate
flowchart TD
A[xact.c<br/>CommitTransaction] -->|TransactionIdCommitTree| B[clog.c<br/>TransactionIdSetTreeStatus]
A -->|TransactionTreeSetCommitTsData| E[commit_ts.c<br/>CommitTsCtlData]
B --> C[clog.c<br/>XactCtlData<br/>pg_xact/]
A -->|SubTransSetParent at AssignTransactionId| D[subtrans.c<br/>SubTransCtlData<br/>pg_subtrans/]
C --> F[slru.c SimpleLru API]
D --> F
E --> F
F --> G[Shared memory page buffers]
G --> H[Disk segments]
Figure 1 — xact.c drives all three SLRU clients at commit time.
pg_subtrans is populated earlier, at sub-XID assignment, not at commit.
pg_xact (CLOG): 2-bit status bitmap
Storage layout
Each XID occupies exactly 2 bits. The encoding is defined in clog.h:
```c
/* XidStatus: src/include/access/clog.h */
typedef int XidStatus;

#define TRANSACTION_STATUS_IN_PROGRESS      0x00
#define TRANSACTION_STATUS_COMMITTED        0x01
#define TRANSACTION_STATUS_ABORTED          0x02
#define TRANSACTION_STATUS_SUB_COMMITTED    0x03
```
Four transactions fit in one byte, so BLCKSZ * 4 transactions fit on one page
(CLOG_XACTS_PER_PAGE = BLCKSZ * CLOG_XACTS_PER_BYTE). Addressing a specific XID
into a page byte is pure arithmetic:
```c
/* TransactionIdToByte / TransactionIdToBIndex: clog.c */
#define CLOG_BITS_PER_XACT  2
#define CLOG_XACTS_PER_BYTE 4
#define CLOG_XACTS_PER_PAGE (BLCKSZ * CLOG_XACTS_PER_BYTE)
#define CLOG_XACT_BITMASK   ((1 << CLOG_BITS_PER_XACT) - 1)

#define TransactionIdToPgIndex(xid) ((xid) % (TransactionId) CLOG_XACTS_PER_PAGE)
#define TransactionIdToByte(xid)    (TransactionIdToPgIndex(xid) / CLOG_XACTS_PER_BYTE)
#define TransactionIdToBIndex(xid)  ((xid) % (TransactionId) CLOG_XACTS_PER_BYTE)
```
Reading status: TransactionIdGetStatus
The read path is intentionally minimal. TransactionIdGetStatus computes the
page and byte offset, calls SimpleLruReadPage_ReadOnly (which acquires the SLRU
bank lock in shared mode and returns a buffer slot), and extracts the 2-bit field.
It also returns the group LSN for that XID (used by async-commit callers to know
how far WAL must be flushed before a result that depends on this status may be
made durable), then releases the bank lock:
```c
/* TransactionIdGetStatus: src/backend/access/transam/clog.c */
XidStatus
TransactionIdGetStatus(TransactionId xid, XLogRecPtr *lsn)
{
    int64       pageno = TransactionIdToPage(xid);
    int         byteno = TransactionIdToByte(xid);
    int         bshift = TransactionIdToBIndex(xid) * CLOG_BITS_PER_XACT;
    int         slotno;
    char       *byteptr;
    XidStatus   status;

    /* the bank lock is acquired inside SimpleLruReadPage_ReadOnly */
    slotno = SimpleLruReadPage_ReadOnly(XactCtl, pageno, xid);
    byteptr = XactCtl->shared->page_buffer[slotno] + byteno;
    status = (*byteptr >> bshift) & CLOG_XACT_BITMASK;
    *lsn = XactCtl->shared->group_lsn[GetLSNIndex(slotno, xid)];
    LWLockRelease(SimpleLruGetBankLock(XactCtl, pageno));
    return status;
}
```
This function is the low-level primitive. The preferred entry point for callers
is TransactionLogFetch in transam.c, which adds a single-slot cache (the
last XID checked). Its callers, TransactionIdDidCommit and TransactionIdDidAbort,
handle the TRANSACTION_STATUS_SUB_COMMITTED intermediate by following
pg_subtrans to the parent and re-checking the parent's status.
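The single-slot cache idea can be sketched as follows. This is an illustrative model, not the transam.c code: backing and slow_lookup are hypothetical stand-ins for the SLRU read, and the rule that only final outcomes are remembered is stated here as an assumption about why such a cache is safe.

```c
#include <assert.h>
#include <stdint.h>

enum status { IN_PROGRESS = 0, COMMITTED = 1, ABORTED = 2, SUB_COMMITTED = 3 };

/* Hypothetical backing store standing in for the SLRU page read. */
static enum status backing[16];
static unsigned backing_reads;

static enum status slow_lookup(uint32_t xid)
{
    assert(xid < 16);
    backing_reads++;
    return backing[xid];
}

static uint32_t cached_xid = UINT32_MAX;   /* sentinel: nothing cached yet */
static enum status cached_status;

/* Return the status of xid, consulting the one-entry cache first. */
static enum status fetch_status(uint32_t xid)
{
    enum status s;

    if (xid == cached_xid)
        return cached_status;
    s = slow_lookup(xid);
    /* Only final outcomes are safe to remember; the others can still change. */
    if (s == COMMITTED || s == ABORTED) {
        cached_xid = xid;
        cached_status = s;
    }
    return s;
}
```

Repeated checks of the same committed XID, which is the common case when scanning many tuples written by one transaction, then cost a single comparison instead of a page lookup.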
Writing status: the multi-page atomicity protocol
Setting status is more elaborate because a subtransaction tree can span multiple
CLOG pages and the commit must appear atomic to concurrent readers. The entry
point from transam.c is TransactionIdCommitTree → TransactionIdSetTreeStatus.
The protocol for a multi-page commit is a three-step dance:
1. Mark all sub-XIDs on other pages as SUB_COMMITTED.
2. Atomically mark the top-level XID (and any sub-XIDs on its page) as COMMITTED.
3. Mark the remaining sub-XIDs as COMMITTED.
```c
/* TransactionIdSetTreeStatus: src/backend/access/transam/clog.c */
void
TransactionIdSetTreeStatus(TransactionId xid, int nsubxids,
                           TransactionId *subxids, XidStatus status,
                           XLogRecPtr lsn)
{
    int64       pageno = TransactionIdToPage(xid);
    int         i;

    /* count sub-xids on the same page as xid */
    for (i = 0; i < nsubxids; i++)
        if (TransactionIdToPage(subxids[i]) != pageno)
            break;

    if (i == nsubxids)
    {
        /* all on one page: single lock acquisition */
        TransactionIdSetPageStatus(xid, nsubxids, subxids, status, lsn,
                                   pageno, true);
    }
    else
    {
        /* multi-page: sub-commit others first, then commit top, then finalize */
        if (status == TRANSACTION_STATUS_COMMITTED)
            set_status_by_pages(nsubxids - i, subxids + i,
                                TRANSACTION_STATUS_SUB_COMMITTED, lsn);
        TransactionIdSetPageStatus(xid, i, subxids, status, lsn,
                                   pageno, false);
        set_status_by_pages(nsubxids - i, subxids + i, status, lsn);
    }
}
```
A reader that sees SUB_COMMITTED does not treat it as final: it looks up the
parent and uses the parent's status instead. Before step 2 the top-level XID
still reads as in progress, so the whole tree appears uncommitted; once step 2
has completed, the whole tree appears committed even if step 3 has not yet
finished. Visibility therefore flips at a single point.
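The three steps can be simulated on a toy status array to see where visibility flips. This is a sketch, not the clog.c logic: pages hold only 4 XIDs so that a small tree spans two pages, the parent links live in a plain array, and all names (status_of, parent_of, step1 to step3, did_commit) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { IN_PROGRESS, COMMITTED, ABORTED, SUB_COMMITTED };

#define TOY_XIDS_PER_PAGE 4u    /* tiny pages so a small tree spans several */
#define TOY_MAX_XID 32u

static uint8_t  status_of[TOY_MAX_XID];
static uint32_t parent_of[TOY_MAX_XID];   /* 0 means top-level */

static uint32_t toy_page(uint32_t xid) { return xid / TOY_XIDS_PER_PAGE; }

/* Set st on the sub-XIDs that are (or are not) on the top XID's page. */
static void set_subs(uint32_t top, const uint32_t *subs, size_t n,
                     int on_top_page, uint8_t st)
{
    for (size_t i = 0; i < n; i++) {
        int same = toy_page(subs[i]) == toy_page(top);

        assert(subs[i] < TOY_MAX_XID);
        if (same == (on_top_page != 0))
            status_of[subs[i]] = st;
    }
}

static void step1(uint32_t top, const uint32_t *subs, size_t n)
{
    set_subs(top, subs, n, 0, SUB_COMMITTED);   /* remote pages only */
}

static void step2(uint32_t top, const uint32_t *subs, size_t n)
{
    status_of[top] = COMMITTED;                 /* the single flip point */
    set_subs(top, subs, n, 1, COMMITTED);
}

static void step3(uint32_t top, const uint32_t *subs, size_t n)
{
    set_subs(top, subs, n, 0, COMMITTED);       /* tidy up remote pages */
}

/* Reader: a SUB_COMMITTED entry defers to its parent's status. */
static int did_commit(uint32_t xid)
{
    assert(xid < TOY_MAX_XID);
    while (status_of[xid] == SUB_COMMITTED && parent_of[xid] != 0)
        xid = parent_of[xid];
    return status_of[xid] == COMMITTED;
}
```

Between step 1 and step 2 every XID in the tree still reads as uncommitted; immediately after step 2 every one reads as committed, including the remote sub-XID whose own entry still says SUB_COMMITTED.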
Figure 2 — Multi-page commit atomicity protocol
flowchart TD
S[TransactionIdSetTreeStatus<br/>top XID on page p1] --> A{All subxids<br/>on page p1?}
A -- yes --> B[single TransactionIdSetPageStatus<br/>lock p1 once]
A -- no --> C[set_status_by_pages p2..pN<br/>SUB_COMMITTED]
C --> D[TransactionIdSetPageStatus p1<br/>set top XID + same-page subs COMMITTED]
D --> E[set_status_by_pages p2..pN<br/>COMMITTED]
Figure 2 — When a transaction tree spans multiple CLOG pages, sub-XIDs on
remote pages pass through SUB_COMMITTED before the top-level commit is visible,
preserving apparent atomicity.
Group-commit optimisation
At high concurrency, many backends commit simultaneously and all contend on the
same SLRU bank lock for the current CLOG page. PostgreSQL uses a linked-list
group-update mechanism to batch these writes: the first contending backend
becomes the leader, collects all waiting backends into a singly-linked list
through ProcGlobal->clogGroupFirst (a lock-free CAS chain via
pg_atomic_compare_exchange_u32), acquires the bank lock once, writes status
for the entire group, releases the lock, and wakes all followers via
PGSemaphoreUnlock:
```c
/* TransactionGroupUpdateXidStatus (leader path): clog.c */
nextidx = pg_atomic_exchange_u32(&procglobal->clogGroupFirst,
                                 INVALID_PROC_NUMBER);
while (nextidx != INVALID_PROC_NUMBER)
{
    PGPROC     *nextproc = &ProcGlobal->allProcs[nextidx];

    TransactionIdSetPageStatusInternal(nextproc->clogGroupMemberXid,
                                       nextproc->subxidStatus.count,
                                       nextproc->subxids.xids,
                                       nextproc->clogGroupMemberXidStatus,
                                       nextproc->clogGroupMemberLsn,
                                       nextproc->clogGroupMemberPage);
    nextidx = pg_atomic_read_u32(&nextproc->clogGroupNext);
}
/* ... wake followers ... */
```
The group optimisation applies only when: (a) the XID and all of its sub-XIDs
fall on the same CLOG page, (b) the backend’s MyProc->xid matches the XID being
written (i.e., it is the backend’s own commit, not a recovery replay), and (c)
the subtransaction count does not exceed THRESHOLD_SUBTRANS_CLOG_OPT (5).
When a group spans two bank-lock partitions (a race that leads different members
to land on different pages), the leader switches bank locks mid-walk.
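The join-and-drain pattern behind the group list can be sketched with C11 atomics. This is a simplified model under stated assumptions: the names (procs, group_first, group_join, group_drain) are hypothetical, a plain flag replaces the real status write, and the semaphore sleep and wake-up of followers is omitted entirely.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define INVALID_IDX UINT32_MAX
#define NPROCS 8

struct proc {
    _Atomic uint32_t next;   /* index of the next group member */
    uint32_t xid;            /* XID this member wants written */
    int done;                /* set by the leader once written */
};

static struct proc procs[NPROCS];
static _Atomic uint32_t group_first = INVALID_IDX;

/*
 * Push ourselves onto the list head with a CAS loop.
 * Returns 1 if the list was empty, meaning the caller becomes the leader.
 */
static int group_join(uint32_t me, uint32_t xid)
{
    uint32_t head;

    assert(me < NPROCS);
    procs[me].xid = xid;
    procs[me].done = 0;
    head = atomic_load(&group_first);
    do {
        atomic_store(&procs[me].next, head);
    } while (!atomic_compare_exchange_weak(&group_first, &head, me));
    return head == INVALID_IDX;
}

/* Leader: detach the whole list in one exchange, then serve every member. */
static unsigned group_drain(void)
{
    unsigned n = 0;
    uint32_t idx = atomic_exchange(&group_first, INVALID_IDX);

    while (idx != INVALID_IDX) {
        procs[idx].done = 1;         /* stand-in for writing this member's status */
        n++;
        idx = atomic_load(&procs[idx].next);
    }
    return n;
}
```

The single atomic exchange in group_drain is what lets the leader take ownership of every queued member at once; backends that arrive afterwards start a fresh list and elect a new leader.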
Async-commit LSN tracking
For async commits (synchronous_commit = off), the WAL commit record may not be
flushed before pg_xact is updated. To honour the WAL-before-data rule (no
dirty CLOG page may reach disk before the WAL that covers it is flushed), clog.c
maintains a group_lsn array in the SLRU shared segment: one XLogRecPtr per
group of CLOG_XACTS_PER_LSN_GROUP (32) transactions. TransactionIdSetStatusBit
advances the group LSN when lsn is valid. Before a dirty page is written out, at
checkpoint or on buffer eviction, the SLRU layer consults these LSNs and flushes
WAL up to the highest one recorded for that page.
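The group-LSN bookkeeping reduces to an index computation and a running maximum. The sketch below models a single buffer slot, assuming 8 kB pages; lsn_t, note_async_commit and page_flush_horizon are hypothetical names, not clog.c or slru.c functions.

```c
#include <assert.h>
#include <stdint.h>

#define XACTS_PER_PAGE      32768u   /* assumption: 8 kB block, 2 bits per XID */
#define XACTS_PER_LSN_GROUP 32u
#define LSNS_PER_PAGE       (XACTS_PER_PAGE / XACTS_PER_LSN_GROUP)

typedef uint64_t lsn_t;              /* stands in for XLogRecPtr; 0 = invalid */

static lsn_t group_lsn[LSNS_PER_PAGE];   /* one buffer slot's worth of LSNs */

/* Which LSN group within the page covers this XID. */
static uint32_t lsn_index(uint32_t xid)
{
    return (xid % XACTS_PER_PAGE) / XACTS_PER_LSN_GROUP;
}

/* Remember the highest commit-record LSN seen for the XID's group. */
static void note_async_commit(uint32_t xid, lsn_t lsn)
{
    uint32_t i = lsn_index(xid);

    assert(i < LSNS_PER_PAGE);
    if (group_lsn[i] < lsn)
        group_lsn[i] = lsn;
}

/* WAL must be flushed at least this far before the page may be written. */
static lsn_t page_flush_horizon(void)
{
    lsn_t max = 0;

    for (uint32_t i = 0; i < LSNS_PER_PAGE; i++)
        if (group_lsn[i] > max)
            max = group_lsn[i];
    return max;
}
```

Grouping 32 XIDs per LSN keeps the array at 1024 entries per page instead of 32768, at the cost of occasionally flushing slightly more WAL than one particular XID strictly requires.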
pg_subtrans: volatile parent-XID links
pg_subtrans stores one TransactionId (4 bytes) per XID: the immediate
parent. A top-level transaction stores InvalidTransactionId; a savepoint-child
stores the XID of the transaction that issued the SAVEPOINT statement.
The critical design property is volatility. Because subtransactions are, by
definition, still open while their parent linkages are needed, and because
recovery knows the parent links from the WAL anyway, pg_subtrans is not
WAL-logged at all. At startup, StartupSUBTRANS zeroes every page that covers
any XID between oldestActiveXID and nextXid:
```c
/* StartupSUBTRANS: src/backend/access/transam/subtrans.c */
void
StartupSUBTRANS(TransactionId oldestActiveXID)
{
    FullTransactionId nextXid;
    int64       startPage,
                endPage;

    /* ... acquire bank locks page by page ... */
    for (;;)
    {
        (void) ZeroSUBTRANSPage(startPage);
        if (startPage == endPage)
            break;

        startPage++;
        /* wrap around at the end of the XID space */
        if (startPage > TransactionIdToPage(MaxTransactionId))
            startPage = 0;
    }
    /* ... */
}
```
Writing a parent link is SubTransSetParent, called from AssignTransactionId
(in xact.c) when a sub-XID is first assigned. Reading it is
SubTransGetParent, used by SubTransGetTopmostTransaction, which walks the chain
child → parent → grandparent until it reaches InvalidTransactionId or an XID
older than TransactionXmin:
```c
/* SubTransGetTopmostTransaction: subtrans.c */
TransactionId
SubTransGetTopmostTransaction(TransactionId xid)
{
    TransactionId parentXid = xid,
                previousXid = xid;

    while (TransactionIdIsValid(parentXid))
    {
        previousXid = parentXid;
        if (TransactionIdPrecedes(parentXid, TransactionXmin))
            break;
        parentXid = SubTransGetParent(parentXid);
        /* a parent must always precede its child */
        if (!TransactionIdPrecedes(parentXid, previousXid))
            elog(ERROR, "pg_subtrans contains invalid entry: xid %u -> %u",
                 previousXid, parentXid);
    }
    return previousXid;
}
```
The check that each parent XID precedes its child guards against infinite loops
on corrupted data. The TransactionXmin cut-off stops the walk before it tries to
read pg_subtrans pages for very old XIDs, which may already have been truncated.
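The startup zeroing loop shown earlier in this section has to cope with the XID counter wrapping around. A minimal sketch of that page walk, assuming 8 kB pages with 4-byte entries (2048 XIDs per page); pages_to_zero is a hypothetical helper that only counts the pages the loop would visit.

```c
#include <assert.h>
#include <stdint.h>

#define SUBTRANS_XIDS_PER_PAGE 2048u    /* assumption: 8192 / 4 */
#define MAX_PAGE (UINT32_MAX / SUBTRANS_XIDS_PER_PAGE)

/*
 * Count the pages visited when walking from the page of oldest_active to
 * the page of next_xid inclusive, wrapping from the last page back to 0.
 */
static uint64_t pages_to_zero(uint32_t oldest_active, uint32_t next_xid)
{
    uint32_t page = oldest_active / SUBTRANS_XIDS_PER_PAGE;
    uint32_t end = next_xid / SUBTRANS_XIDS_PER_PAGE;
    uint64_t count = 0;

    for (;;) {
        count++;                          /* stand-in for zeroing this page */
        assert(count <= (uint64_t) MAX_PAGE + 1);
        if (page == end)
            break;
        page = (page == MAX_PAGE) ? 0 : page + 1;
    }
    return count;
}
```

The end page is included because next_xid may sit partway through it; testing for equality rather than "less than" is what makes the loop correct across the wraparound point.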
pg_commit_ts: optional commit timestamp and origin
pg_commit_ts stores a 10-byte CommitTimestampEntry per XID:
```c
/* CommitTimestampEntry: src/backend/access/transam/commit_ts.c */
typedef struct CommitTimestampEntry
{
    TimestampTz time;       /* 8 bytes */
    RepOriginId nodeid;     /* 2 bytes */
} CommitTimestampEntry;
```
RepOriginId identifies the replication origin, that is, which primary or
intermediate node the transaction originated on. This lets logical-replication
consumers filter or order commits by origin.
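A struct with an 8-byte and a 2-byte member is normally padded beyond 10 bytes in memory, so the 10-byte on-page size has to be computed from the end of the last field rather than from sizeof. A sketch of that computation, using stand-in typedefs and hypothetical SKETCH_ names, and assuming an 8192-byte block:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef int64_t  sketch_timestamp;   /* stands in for TimestampTz */
typedef uint16_t sketch_origin;      /* stands in for RepOriginId */

typedef struct {
    sketch_timestamp time;
    sketch_origin    nodeid;
} sketch_entry;

/* On-page size: up to the end of the last field, ignoring tail padding. */
#define SKETCH_ENTRY_SIZE \
    (offsetof(sketch_entry, nodeid) + sizeof(sketch_origin))

#define SKETCH_BLCKSZ 8192u          /* assumption: default block size */
#define SKETCH_ENTRIES_PER_PAGE (SKETCH_BLCKSZ / SKETCH_ENTRY_SIZE)

/* Byte offset of an XID's entry within its page. */
static size_t entry_offset(uint32_t xid)
{
    size_t off = (xid % SKETCH_ENTRIES_PER_PAGE) * SKETCH_ENTRY_SIZE;

    assert(off + SKETCH_ENTRY_SIZE <= SKETCH_BLCKSZ);
    return off;
}
```

Entries packed this way cannot be accessed by indexing an array of structs; readers and writers would copy the 10 bytes in and out at the computed offset instead.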
The entire module is gated by track_commit_timestamp (GUC). When disabled, all
write calls are no-ops (commitTsShared->commitTsActive == false). When enabled,
TransactionTreeSetCommitTsData is called from RecordTransactionCommit in
xact.c alongside the WAL commit record write, recording the timestamp and
origin for the top-level XID and all its sub-XIDs:
```c
/* TransactionTreeSetCommitTsData (write path): commit_ts.c */
void
TransactionTreeSetCommitTsData(TransactionId xid, int nsubxids,
                               TransactionId *subxids, TimestampTz timestamp,
                               RepOriginId nodeid)
{
    int         i;
    TransactionId headxid;

    if (!commitTsShared->commitTsActive)
        return;

    headxid = xid;
    i = 0;
    for (;;)
    {
        int64       pageno = TransactionIdToCTsPage(headxid);
        int         j;

        /* ... elided: advance j past the subxids on headxid's page ... */
        SetXidCommitTsInPage(headxid, j - i, subxids + i,
                             timestamp, nodeid, pageno);
        if (j >= nsubxids)
            break;
        headxid = subxids[j];
        i = j + 1;
    }

    /* update the in-memory cache of the most recent commit */
    LWLockAcquire(CommitTsLock, LW_EXCLUSIVE);
    commitTsShared->xidLastCommit = xid;
    commitTsShared->dataLastCommit.time = timestamp;
    commitTsShared->dataLastCommit.nodeid = nodeid;
    /* ... update newestCommitTsXid ... */
    LWLockRelease(CommitTsLock);
}
```
Unlike CLOG, pg_commit_ts does not have a group-commit mechanism. Contention is
less of a concern here because each entry is 10 bytes (so far fewer transactions
share a page) and commit timestamps are written once and rarely re-read on the
critical path.
The read path, TransactionIdGetCommitTsData, first checks the in-memory
CommitTimestampShared cache (xidLastCommit), then falls through to the SLRU
if the requested XID is not the most recent. It validates the requested XID
against [oldestCommitTsXid, newestCommitTsXid] before doing any page I/O.
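The order of checks on the read path (cache, then range, then page) can be sketched as a small classifier. The names here (ts_state, classify_lookup, xid_precedes) are hypothetical; the comparison uses modulo-2^32 arithmetic, which is the usual way to order XIDs that may have wrapped, and the sketch ignores PostgreSQL's special reserved XIDs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Circular comparison: a precedes b if the signed 32-bit difference is negative. */
static bool xid_precedes(uint32_t a, uint32_t b)
{
    return (int32_t) (a - b) < 0;
}

struct ts_state {
    uint32_t last_xid;    /* most recently committed XID (the cache key) */
    uint32_t oldest;      /* oldest XID with a stored timestamp */
    uint32_t newest;      /* newest XID with a stored timestamp */
};

enum lookup { OUT_OF_RANGE, CACHE_HIT, NEED_PAGE_READ };

/* Decide how a timestamp lookup for xid would be served. */
static enum lookup classify_lookup(const struct ts_state *s, uint32_t xid)
{
    assert(s != NULL);
    if (xid == s->last_xid)
        return CACHE_HIT;
    if (xid_precedes(xid, s->oldest) || xid_precedes(s->newest, xid))
        return OUT_OF_RANGE;       /* no page I/O is attempted */
    return NEED_PAGE_READ;
}
```

Rejecting out-of-range XIDs before touching the SLRU matters because pages below the oldest bound may already have been truncated and pages above the newest bound may never have been created.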
Activate/deactivate lifecycle
pg_commit_ts has an activation layer that the other two SLRUs lack, because its
GUC can change between restarts and a standby must mirror the primary’s setting:
- StartupCommitTs(): calls ActivateCommitTs()
- CompleteCommitTsInitialization(): calls Activate or Deactivate based on the GUC
- CommitTsParameterChange(): called on WAL replay of XLOG_PARAMETER_CHANGE
ActivateCommitTs creates the initial segment if missing and sets
commitTsActive = true. DeactivateCommitTs removes the pg_commit_ts segment files
and sets commitTsActive = false.
Lifecycle hooks — all three SLRUs
All three SLRUs follow the same lifecycle hook pattern, invoked from fixed call sites in postmaster startup, recovery, and checkpoint code:
| Phase | CLOG | SUBTRANS | COMMIT_TS |
|---|---|---|---|
| Postmaster alloc | CLOGShmemSize / CLOGShmemInit | SUBTRANSShmemSize / SUBTRANSShmemInit | CommitTsShmemSize / CommitTsShmemInit |
| initdb | BootStrapCLOG | BootStrapSUBTRANS | BootStrapCommitTs (no-op) |
| Startup | StartupCLOG + TrimCLOG | StartupSUBTRANS | StartupCommitTs |
| Checkpoint | CheckPointCLOG | CheckPointSUBTRANS | CheckPointCommitTs |
| XID extend | ExtendCLOG | ExtendSUBTRANS | ExtendCommitTs |
| Truncate | TruncateCLOG | TruncateSUBTRANS | TruncateCommitTs |
TrimCLOG (called once after startup/recovery) zeroes the unused tail of the
current CLOG page, preventing stale bits from a previous database lifecycle from
being misread as legitimate status.
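The tail-zeroing step can be sketched on a raw page buffer. This is an illustrative version under the assumption of 8 kB pages and 2 bits per XID; trim_page_tail is a hypothetical name, and the real function additionally handles locking and marks the page dirty.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define PAGE_BYTES    8192u
#define XIDS_PER_BYTE 4u
#define XIDS_PER_PAGE (PAGE_BYTES * XIDS_PER_BYTE)

/*
 * Clear every status slot at or beyond next_xid on its page: first the
 * unused high bits of the current byte, then all following bytes.
 */
static void trim_page_tail(uint8_t *page, uint32_t next_xid)
{
    uint32_t pgindex = next_xid % XIDS_PER_PAGE;
    uint32_t byteno;
    unsigned bshift;

    assert(page != NULL);
    if (pgindex == 0)
        return;                 /* next XID starts a fresh page: nothing to trim */

    byteno = pgindex / XIDS_PER_BYTE;
    bshift = (pgindex % XIDS_PER_BYTE) * 2u;

    page[byteno] &= (uint8_t) ((1u << bshift) - 1u);    /* keep lower slots only */
    memset(page + byteno + 1, 0, PAGE_BYTES - byteno - 1);
}
```

With next_xid = 6, slots 4 and 5 in byte 1 survive while slots 6 and 7 and every later byte are cleared, so a stale "committed" pattern can never be attributed to an XID that has not been assigned yet.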
Source Walkthrough
clog.c — pg_xact management
- CLOGShmemInit: registers XactCtlData with SimpleLruInit, directory pg_xact, LWTranche LWTRANCHE_XACT_BUFFER / LWTRANCHE_XACT_SLRU.
- TransactionIdSetTreeStatus: top-level entry point for writing status; handles single-page and multi-page trees; calls into TransactionIdSetPageStatus.
- TransactionIdSetPageStatus: per-page dispatch; attempts group-update optimisation first (TransactionGroupUpdateXidStatus); falls back to direct LWLockAcquire + TransactionIdSetPageStatusInternal.
- TransactionGroupUpdateXidStatus: group-commit leader/follower logic via ProcGlobal->clogGroupFirst CAS list; leader acquires bank lock, writes all members, wakes followers via PGSemaphoreUnlock.
- TransactionIdSetPageStatusInternal: actual 2-bit write; calls TransactionIdSetStatusBit for each XID; marks the SLRU page dirty.
- TransactionIdSetStatusBit: bit-level write; computes byte + shift, applies mask, updates group_lsn for async-commit tracking.
- TransactionIdGetStatus: read path; uses SimpleLruReadPage_ReadOnly; returns status + group LSN.
- StartupCLOG: records latest_page_number in shared memory.
- TrimCLOG: zeroes unused bits in the current page after recovery.
- CheckPointCLOG: calls SimpleLruWriteAll to flush dirty pages.
- ExtendCLOG: zeroes a new page when nextXid crosses a page boundary (called under XidGenLock).
- TruncateCLOG: advances the oldest-XID bound via AdvanceOldestClogXid, then calls SimpleLruTruncate; the cutoff it receives derives from the caller's pg_database.datfrozenxid scan.
transam.c — higher-level callers
- TransactionLogFetch: single-slot cache wrapper around TransactionIdGetStatus. Its callers (TransactionIdDidCommit, TransactionIdDidAbort) resolve SUB_COMMITTED by following pg_subtrans to the parent and re-checking.
- TransactionIdCommitTree / TransactionIdAsyncCommitTree: sync and async wrappers; pass InvalidXLogRecPtr vs. a valid LSN to TransactionIdSetTreeStatus.
- TransactionIdAbortTree: abort wrapper; always passes InvalidXLogRecPtr (abort records need not be flushed before status is written).
subtrans.c — pg_subtrans management
- SUBTRANSShmemInit: registers SubTransCtlData with SimpleLruInit, directory pg_subtrans, no LSN array (volatile, no WAL), SYNC_HANDLER_NONE.
- SubTransSetParent: writes one TransactionId slot; called by AssignTransactionId in xact.c when a sub-XID is assigned.
- SubTransGetParent: read-only page lookup.
- SubTransGetTopmostTransaction: walks the child→parent chain, stopping at TransactionXmin; used by MVCC visibility checks for sub-XID tuples.
- StartupSUBTRANS: zeroes active-range pages; requires no WAL.
- ExtendSUBTRANS: zeroes a new page at an XID page boundary; called under XidGenLock.
- TruncateSUBTRANS: removes pages older than the oldest XID that could still be running, as supplied by its caller.
commit_ts.c — pg_commit_ts management
- CommitTsShmemInit: registers CommitTsCtlData and allocates the CommitTimestampShared shmem struct for the in-memory cache.
- TransactionTreeSetCommitTsData: main write entry point; iterates pages, calls SetXidCommitTsInPage per page, updates the commitTsShared cache.
- SetXidCommitTsInPage: acquires bank lock, reads page, calls TransactionIdSetCommitTs per XID.
- TransactionIdGetCommitTsData: read entry point; checks cache, validates XID range, falls through to SLRU read.
- GetLatestCommitTsData: returns xidLastCommit + cached entry under CommitTsLock.
- ActivateCommitTs / DeactivateCommitTs: toggle the commitTsActive flag; called at startup and on WAL replay of XLOG_PARAMETER_CHANGE.
- StartupCommitTs / CompleteCommitTsInitialization: startup lifecycle hooks.
- TruncateCommitTs: called from vac_truncate_clog in vacuum.c.
Position-hint table (commit 273fe94, 2026-06-05)
| Symbol | File | Line |
|---|---|---|
| TRANSACTION_STATUS_* | src/include/access/clog.h | 27–30 |
| TransactionIdSetTreeStatus | src/backend/access/transam/clog.c | 183 |
| TransactionIdSetPageStatus | src/backend/access/transam/clog.c | 293 |
| TransactionIdSetPageStatusInternal | src/backend/access/transam/clog.c | 364 |
| TransactionGroupUpdateXidStatus | src/backend/access/transam/clog.c | 441 |
| TransactionIdSetStatusBit | src/backend/access/transam/clog.c | 661 |
| TransactionIdGetStatus | src/backend/access/transam/clog.c | 735 |
| CLOGShmemInit | src/backend/access/transam/clog.c | 787 |
| StartupCLOG | src/backend/access/transam/clog.c | 877 |
| TrimCLOG | src/backend/access/transam/clog.c | 892 |
| CheckPointCLOG | src/backend/access/transam/clog.c | 937 |
| ExtendCLOG | src/backend/access/transam/clog.c | 959 |
| TruncateCLOG | src/backend/access/transam/clog.c | 1000 |
| TransactionLogFetch | src/backend/access/transam/transam.c | 52 |
| TransactionIdCommitTree | src/backend/access/transam/transam.c | 240 |
| TransactionIdAsyncCommitTree | src/backend/access/transam/transam.c | 252 |
| TransactionIdAbortTree | src/backend/access/transam/transam.c | 270 |
| SubTransSetParent | src/backend/access/transam/subtrans.c | 85 |
| SubTransGetParent | src/backend/access/transam/subtrans.c | 122 |
| SubTransGetTopmostTransaction | src/backend/access/transam/subtrans.c | 163 |
| SUBTRANSShmemInit | src/backend/access/transam/subtrans.c | 220 |
| StartupSUBTRANS | src/backend/access/transam/subtrans.c | 309 |
| CommitTimestampEntry | src/backend/access/transam/commit_ts.c | 55 |
| CommitTimestampShared | src/backend/access/transam/commit_ts.c | 98 |
| TransactionTreeSetCommitTsData | src/backend/access/transam/commit_ts.c | 141 |
| TransactionIdGetCommitTsData | src/backend/access/transam/commit_ts.c | 274 |
| CommitTsShmemInit | src/backend/access/transam/commit_ts.c | 530 |
| ActivateCommitTs | src/backend/access/transam/commit_ts.c | 705 |
| DeactivateCommitTs | src/backend/access/transam/commit_ts.c | 785 |
Source verification (as of 2026-06-05)
Verified facts
- pg_xact uses exactly 2 bits per XID, with four status values. Verified from the clog.h constants and TransactionIdSetStatusBit in clog.c. The fourth value, SUB_COMMITTED (0x03), is an intermediate used only during multi-page tree commits and is never externally visible as a final status.
- The group-update optimisation is gated on THRESHOLD_SUBTRANS_CLOG_OPT = 5. Verified at clog.c:301–326. The check also compares nsubxids with MyProc->subxidStatus.count as a safety condition. Transactions with more than 5 sub-XIDs bypass the group path and acquire the bank lock directly.
- pg_subtrans is zeroed across the active XID range on every startup, not replayed from WAL. Verified in subtrans.c:309–349. The module header comment and StartupSUBTRANS both confirm: “we only need to remember pg_subtrans information for currently-open transactions … no need to preserve data over a crash and restart.” SYNC_HANDLER_NONE in SimpleLruInit confirms no sync callbacks are registered.
- track_commit_timestamp = off makes all pg_commit_ts writes no-ops. Verified in TransactionTreeSetCommitTsData (commit_ts.c:157): the function returns immediately when !commitTsShared->commitTsActive. This is an unlocked read, safe because on a standby the flag is only changed by the recovery process (which also calls this function).
- TrimCLOG zeroes the unused tail of the current CLOG page. Verified at clog.c:892–931. The comment explains why: WAL replay during recovery might settle on a nextXID less than what the previous lifecycle wrote, leaving potentially non-zero bits beyond the new frontier. TrimCLOG makes those safe.
- The group LSN array in CLOG shared memory has one entry per 32 XIDs. Verified: CLOG_XACTS_PER_LSN_GROUP = 32 (clog.c:92), CLOG_LSNS_PER_PAGE = CLOG_XACTS_PER_PAGE / 32 (clog.c:93). This is the granularity at which async-commit LSN tracking operates.
- CommitTimestampShared holds only the single most-recently-committed XID. Verified at commit_ts.c:98–103. The cache is a single-slot cache (xidLastCommit, dataLastCommit), not a ring or multi-slot structure. On a cache miss the code falls through to the SLRU page.
Open questions
- TruncateCLOG and pg_database bookkeeping. TruncateCLOG calls AdvanceOldestClogXid; the cutoff it receives is tied to pg_database.datfrozenxid bookkeeping for freeze-horizon management. The exact interaction with vac_truncate_clog in vacuum.c and autovacuum’s freeze logic is not analysed here. The full chain is: autovacuum freeze → vac_truncate_clog → TruncateCLOG. That chain belongs in postgres-vacuum.md and postgres-xid-wraparound-freeze.md.
- Group-update page mismatch handling. When the group-update leader discovers that a follower’s page number differs from the initial page (the race described at clog.c:490–513), it switches bank locks mid-walk. The code path for multi-bank-lock group walks is present, but the performance characteristics under high contention with large subtransaction trees are not experimentally characterised.
- pg_commit_ts and oldestCommitTsXid maintenance. TransamVariables->oldestCommitTsXid is used by TransactionIdGetCommitTsData to bound valid lookups. The code that advances this lower bound on truncation is in TruncateCommitTs (commit_ts.c:890), but the exact truncation trigger (checkpoint? autovacuum? both?) and the full orchestration path are unverified.
Beyond PostgreSQL — Comparative Designs & Research Frontiers
- Oracle’s Interested Transaction List (ITL) and undo-based visibility. Oracle stores transaction slots directly in the heap block header (the ITL), not in a central status bitmap. Visibility is resolved by consulting the ITL slot or, for older transactions, chasing the undo chain. This eliminates the CLOG-equivalent hot spot at the cost of per-block transaction metadata. A side-by-side comparison would quantify the CLOG bank-lock contention that PostgreSQL’s group-update optimisation is designed to mitigate.
- MySQL/InnoDB’s trx_sys and purge system. InnoDB’s transaction descriptor lives in trx_sys (a special segment), and visibility uses a “read view” that snapshots the active transaction list. Committed transaction records are purged lazily by the purge thread, which is the InnoDB analog of PostgreSQL vacuum. The CLOG equivalent is implicit in the undo log: a transaction whose undo space is reclaimed is definitively committed (from the purge system’s perspective).
- MVCC without a central TST: HANA’s in-memory MVCC. SAP HANA and other in-memory engines avoid a persistent TST by keeping transaction status entirely in memory (since the entire database fits there) and rebuilding it at restart from a committed transaction log. This is possible only when restart time is bounded and the status table fits in DRAM. PostgreSQL’s pg_xact is the durable fallback for when neither condition holds.
- pg_commit_ts and logical replication conflict resolution. The combination of commit timestamp + RepOriginId in pg_commit_ts is the enabling data structure for multi-master conflict resolution. A logical replication apply worker can compare timestamps across origins to implement last-write-wins. The paper Serializable Snapshot Isolation in PostgreSQL (Ports & Grittner, VLDB 2012; knowledge/research/dbms-papers/) does not discuss commit timestamps directly, but the architecture dovetails: SSI detects serialisation conflicts at commit time, and commit_ts provides the ordering evidence.
- SUBTRANS volatility vs. full persistence. The design decision to not WAL-log pg_subtrans trades I/O cost for a constraint: the subtransaction chain cannot be walked for transactions older than TransactionXmin. This means SubTransGetTopmostTransaction can return an intermediate XID rather than the true topmost parent for very old sub-XIDs, a behaviour documented and accepted in the comment at subtrans.c:151–159. A fully persistent design (as in some distributed MVCC systems that need global sub-XID resolution) would require WAL logging and a truncation-safe read path.
Sources
Source code (REL_18_STABLE, commit 273fe94)
- src/backend/access/transam/clog.c: 1152 lines; CLOG SLRU client
- src/backend/access/transam/subtrans.c: 447 lines; SUBTRANS SLRU client
- src/backend/access/transam/commit_ts.c: 1073 lines; CommitTs SLRU client
- src/backend/access/transam/transam.c: higher-level callers (TransactionIdCommitTree, TransactionLogFetch)
- src/include/access/clog.h: XidStatus typedef and status constants
- src/include/access/slru.h: SlruCtlData, SimpleLru* API declarations
Textbooks
- Database System Concepts, Silberschatz, Korth, Sudarshan (7th ed.) — §17.6 “Implementation of Atomicity and Durability” (WAL, commit log record as commit point)
- Database Internals, Petrov (2019) — ch. 5 MVCC and ch. 7 Log-Structured Storage (background on TST design patterns)
Papers
- ARIES (Mohan et al., ACM TODS 1992), the WAL correctness and recovery protocol that underpins pg_xact durability guarantees; knowledge/research/dbms-papers/aries.md
Adjacent documents in this KB
- postgres-slru.md: SLRU substrate (slru.c): buffer pool, bank locks, page lifecycle, SimpleLru* API (forward reference; doc not yet written as of 2026-06-05)
- postgres-xact.md: transaction state machine and commit pipeline; the producer of CLOG and commit_ts writes
- postgres-mvcc-snapshots.md: snapshot acquisition and visibility check; the primary consumer of TransactionLogFetch
- postgres-recovery-redo.md: replays CLOG WAL records (XLOG_CLOG_ZEROPAGE, XLOG_CLOG_TRUNCATE) on restart
- postgres-vacuum.md: drives TruncateCLOG and TruncateCommitTs via vac_truncate_clog