PostgreSQL procarray — The Live-XID Census Behind Every Snapshot
Contents:
- Theoretical Background
- Common DBMS Design
- PostgreSQL’s Approach
- Source Walkthrough
- Source verification (as of 2026-06-05)
- Beyond PostgreSQL — Comparative Designs & Research Frontiers
- Sources
Theoretical Background
Section titled “Theoretical Background”Snapshot-isolation MVCC pushes concurrency control down to visibility: a transaction reads the version of a row that was committed as of the moment its snapshot was taken, and ignores versions written by transactions that were still in flight then. Database System Concepts (Silberschatz, 7e, §18.6 “Multiversion Schemes” and §18.7 “Snapshot Isolation”) frames the snapshot as the decision oracle for visibility: every row carries the id of the transaction that created it (and, if deleted, the one that deleted it), and the snapshot answers “was that transaction committed before I started, or not?” Database Internals (Petrov, ch. 5) makes the same point from the engine side — MVCC’s distinguishing property is that “reads can continue accessing older values until the new ones are committed,” so the engine must, at snapshot time, capture which transactions are still running.
That capture is exactly what this document is about. The visibility rules — how
a given row’s xmin/xmax are tested against a snapshot — belong to a sibling
doc (postgres-mvcc-snapshots.md). The procarray’s job is narrower and prior:
it is the live-XID census. Three quantities define a PostgreSQL snapshot:
xmin— the lowest transaction id still considered running. Anything older is definitely finished.xmax— the highest committed id plus one. Anything at or above is definitely still running (it hadn’t even started when the snapshot formed).xip(xid-in-progress array) — the explicit list of ids in the open interval[xmin, xmax)that were running at snapshot time. An id in that range is “running” iff it appears inxip.
Two design choices follow from needing to produce that triple cheaply, and they shape the whole module:
-
Where does the “currently running” set live, and how is it scanned? The set changes on every transaction begin/end, and a snapshot must read it atomically with respect to those changes. The naive answer — a lock-protected list walked per snapshot — is on the hottest path in the engine, so the representation and its latch discipline dominate scalability.
-
What is the oldest id any snapshot still cares about, and who needs it? This is the dual of
xmin: the global minimum xmin across all backends (plus replication holdbacks) is the boundary below which dead row versions can be physically reclaimed. Vacuum, pruning,pg_xact/pg_subtranstruncation, and hot-standby feedback all consume this xmin horizon.
PostgreSQL answers both from one shared-memory structure — the process array.
Common DBMS Design
Section titled “Common DBMS Design”The textbook gives the model (SI, snapshot triple, oldest-active horizon for GC). This section names the engineering conventions that nearly every MVCC engine — PostgreSQL, Oracle, MySQL/InnoDB, SQL Server Hekaton, CUBRID — adopts to realize it. PostgreSQL’s specific choices in the next section read as one set of dials within this shared space.
A central registry of active transactions
Section titled “A central registry of active transactions”Every MVCC engine keeps one shared structure that enumerates the transactions
currently in flight, because that is what a snapshot must read. The two recurring
shapes are a centralized active-transaction table (a fixed array of
per-session slots, each advertising that session’s current id and oldest
interest) and a commit-sequence counter that timestamps each commit so
visibility becomes a numeric comparison. PostgreSQL uses the array shape: one
slot per backend, scanned to build the snapshot. The slot count is bounded at
startup (max_connections + background workers + autovacuum + prepared-xact
slots), so the array is fixed-size and never reallocated at runtime.
Snapshot = copy the active set under a shared latch
Section titled “Snapshot = copy the active set under a shared latch”Acquiring a snapshot means reading the active set atomically with respect to transactions entering and leaving it. The universal pattern is a reader/writer latch: a snapshot taker holds the latch in shared mode and copies the active ids out; a transaction that ends takes the latch exclusively to remove its id. Shared-mode acquisition lets many backends snapshot concurrently, while the exclusive end-of-transaction step is the serialization point that defines a consistent “set of running transactions.”
Dense mirror arrays for the scan-hot fields
Section titled “Dense mirror arrays for the scan-hot fields”A snapshot scan touches only a few fields of each session record — the current id, the sub-transaction summary, a status byte — but a full session-control block is large and cache-cold. The convention is to mirror the scan-hot fields into separate, densely packed arrays so the snapshot loop streams through contiguous memory and never pulls a whole control block into cache. This was the central optimization of PostgreSQL’s “Snapshot scalability” rework (v14).
Subtransaction summaries with an overflow escape hatch
Section titled “Subtransaction summaries with an overflow escape hatch”Subtransactions each get their own id, so the active set is really a set of
trees. Tracking every subxid in shared memory is unbounded, so engines cache a
bounded number of subxids per session plus an “overflowed” flag. When the
cache overflows, snapshots fall back to a secondary structure (PostgreSQL’s
pg_subtrans) to resolve parentage. The flag lets the common, non-overflowed
case stay entirely in the fast array.
Batch the contended write
Section titled “Batch the contended write”When many sessions commit at once they all want the exclusive end-of-transaction latch, turning it into a convoy. A standard mitigation is group commit of the metadata update: the first arrival becomes a leader, drains a lock-free queue of waiters, and clears all their ids under a single latch acquisition. The others sleep on a semaphore and wake already cleared. This trades one latch hand-off per committer for one per batch.
Recovery emulation for standby reads
Section titled “Recovery emulation for standby reads”A physical-replication standby replays WAL but runs no real transactions of its
own, yet read-only queries on it still need snapshots. The convention is to
reconstruct the primary’s active set from the WAL stream — watching id
assignment and commit/abort records — into a standby-local structure that the
snapshot builder treats as the active set while in recovery. PostgreSQL calls
this KnownAssignedXids.
Theory ↔ PostgreSQL mapping
Section titled “Theory ↔ PostgreSQL mapping”| Theory / convention | PostgreSQL name |
|---|---|
| Central active-transaction registry | ProcArrayStruct (sorted pgprocnos[] index into allProcs[]) |
| Per-session slot | PGPROC (one per backend / prepared xact) + its pgxactoff |
| Scan-hot mirror arrays | ProcGlobal->xids[], subxidStates[], statusFlags[] |
| Snapshot triple | SnapshotData.xmin / xmax / xip[] (built by GetSnapshotData) |
| Snapshot latch (shared) | ProcArrayLock held LW_SHARED |
| End-of-transaction (exclusive) step | ProcArrayEndTransaction → …Internal under LW_EXCLUSIVE |
| Bounded subxid cache + overflow flag | PGPROC.subxids (64 ids) + XidCacheStatus.overflowed |
| Group commit of the metadata clear | ProcArrayGroupClearXid (lock-free queue + leader) |
| Oldest-active horizon for GC | ComputeXidHorizons → GetOldestNonRemovableTransactionId |
| Cached approximate horizon | GlobalVisState (definitely_needed / maybe_needed) |
| Replication holdback on the horizon | replication_slot_xmin / replication_slot_catalog_xmin |
| Standby active-set emulation | KnownAssignedXids[] + KnownAssignedXidsValid[] |
PostgreSQL’s Approach
Section titled “PostgreSQL’s Approach”procarray.c opens by stating its own remit: it “maintains arrays of PGPROC
substructures, as well as associated arrays in ProcGlobal, for all active
backends … the principal one is as a means of determining the set of currently
running transactions.” The module is the single producer of every MVCC snapshot
(GetSnapshotData) and every vacuum horizon (ComputeXidHorizons).
The distinguishing choices are: (1) the active set is a sorted array of
indexes into the global PGPROC pool, paired with dense mirror arrays of
the three scan-hot fields so the snapshot loop streams contiguous memory; (2)
the snapshot triple is built in one ProcArrayLock-shared pass with the
expensive horizon left approximate and recomputed lazily; (3) the contended
commit-time clear is batched via a lock-free group-clear queue; (4) the
vacuum horizon is four horizons, tuned per relation kind, all from one
ComputeXidHorizons scan; and (5) hot standby reuses the same snapshot builder
against a WAL-reconstructed KnownAssignedXids array.
The shared structure: a sorted index plus dense mirror arrays
Section titled “The shared structure: a sorted index plus dense mirror arrays”// ProcArrayStruct — src/backend/storage/ipc/procarray.ctypedef struct ProcArrayStruct{ int numProcs; /* number of valid procs entries */ int maxProcs; /* allocated size of procs array */
/* Known assigned XIDs handling */ int maxKnownAssignedXids; /* allocated size of array */ int numKnownAssignedXids; /* current # of valid entries */ int tailKnownAssignedXids; /* index of oldest valid element */ int headKnownAssignedXids; /* index of newest element, + 1 */
TransactionId lastOverflowedXid; /* highest subxid dropped from KAX */
/* oldest xmin of any replication slot */ TransactionId replication_slot_xmin; /* oldest catalog xmin of any replication slot */ TransactionId replication_slot_catalog_xmin;
/* indexes into allProcs[], has PROCARRAY_MAXPROCS entries */ int pgprocnos[FLEXIBLE_ARRAY_MEMBER];} ProcArrayStruct;The procarray itself stores almost no transaction data. pgprocnos[] is a list
of indexes into the global PGPROC pool (ProcGlobal->allProcs), kept sorted
by PGPROC address. The scan-hot transaction fields live in three parallel,
densely packed arrays hung off ProcGlobal:
// PROC_HDR (ProcGlobal) — src/include/storage/proc.htypedef struct PROC_HDR{ PGPROC *allProcs; /* Array of PGPROC structures */ TransactionId *xids; /* mirror of PGPROC.xid, indexed by pgxactoff */ XidCacheStatus *subxidStates; /* mirror of PGPROC.subxidStatus */ uint8 *statusFlags; /* mirror of PGPROC.statusFlags */ uint32 allProcCount; /* ... free lists, group-clear/clog-group queue heads ... */ pg_atomic_uint32 procArrayGroupFirst; /* group XID-clear queue head */ /* ... */} PROC_HDR;Each PGPROC carries its own copy of these fields plus the index pgxactoff
that locates its slot in the mirror arrays:
// PGPROC (excerpt) — src/include/storage/proc.hstruct PGPROC{ /* ... */ TransactionId xid; /* top-level xid, or Invalid; mirrored in * ProcGlobal->xids[pgxactoff] */ TransactionId xmin; /* minimal running XID when this xact started, * excluding LAZY VACUUM */ int pgxactoff; /* offset into the ProcGlobal mirror arrays */ /* ... */ uint8 statusFlags; /* mirrored in ProcGlobal->statusFlags[pgxactoff] */ XidCacheStatus subxidStatus;/* mirrored in ProcGlobal->subxidStates[pgxactoff] */ struct XidCache subxids; /* cache of up to 64 subtransaction XIDs */ /* group XID clearing */ bool procArrayGroupMember; pg_atomic_uint32 procArrayGroupNext; TransactionId procArrayGroupMemberXid;};The duplication is deliberate. The per-PGPROC copy is the source of truth a
backend updates; the ProcGlobal mirror is what a snapshot scan reads. The
header is explicit that “both copies need to be maintained coherently” and that
“the pgxactoff indexed value can never be accessed without holding locks.” The
point of the mirror is cache density: GetSnapshotData walks
ProcGlobal->xids[] — one TransactionId per backend, contiguous — instead of
striding through full PGPROC structs. This is the v14 “snapshot scalability”
layout.
flowchart LR
subgraph POOL["ProcGlobal->allProcs[] (fixed pool)"]
P0["PGPROC #5<br/>xid, xmin, subxids[64]<br/>pgxactoff=2"]
P1["PGPROC #2<br/>xid, xmin, subxids[64]<br/>pgxactoff=0"]
P2["PGPROC #9<br/>xid, xmin, subxids[64]<br/>pgxactoff=1"]
end
subgraph PA["ProcArrayStruct (sorted by PGPROC addr)"]
PN["pgprocnos[] = [2, 9, 5, ...]"]
end
subgraph MIR["ProcGlobal dense mirror arrays (indexed by pgxactoff)"]
XS["xids[] = [x@2, x@9, x@5, ...]"]
SS["subxidStates[]"]
FS["statusFlags[]"]
end
PN -- "pgprocnos[k] = procno" --> POOL
P1 -- "pgxactoff=0" --> XS
P2 -- "pgxactoff=1" --> XS
P0 -- "pgxactoff=2" --> XS
Figure 1 — The three-level structure. ProcArrayStruct.pgprocnos[] is a sorted
list of indexes into the fixed allProcs[] pool; the array stays sorted by
PGPROC address for cache locality. Each backend’s pgxactoff locates its slot
in the dense ProcGlobal mirror arrays (xids[], subxidStates[],
statusFlags[]), which is what the snapshot loop actually streams through. The
sorted-index array and the dense mirror are kept in lockstep — adding or
removing a backend memmoves both.
ProcArrayAdd / ProcArrayRemove: entering and leaving the census
Section titled “ProcArrayAdd / ProcArrayRemove: entering and leaving the census”A backend joins the census at startup (ProcArrayAdd) and leaves at exit
(ProcArrayRemove). Both run under both ProcArrayLock and XidGenLock
exclusively, and both keep pgprocnos[] and all three mirror arrays sorted by
PGPROC address via memmove, then fix up every following backend’s
pgxactoff:
// ProcArrayAdd — src/backend/storage/ipc/procarray.c (condensed)LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);LWLockAcquire(XidGenLock, LW_EXCLUSIVE);/* find sorted insertion point (keep procs array sorted by PGPROC *) */for (index = 0; index < arrayP->numProcs; index++) if (arrayP->pgprocnos[index] > pgprocno) break;movecount = arrayP->numProcs - index;memmove(&arrayP->pgprocnos[index + 1], &arrayP->pgprocnos[index], movecount * ...);memmove(&ProcGlobal->xids[index + 1], &ProcGlobal->xids[index], movecount * ...);memmove(&ProcGlobal->subxidStates[index + 1], &ProcGlobal->subxidStates[index], ...);memmove(&ProcGlobal->statusFlags[index + 1], &ProcGlobal->statusFlags[index], ...);arrayP->pgprocnos[index] = pgprocno;proc->pgxactoff = index;ProcGlobal->xids[index] = proc->xid;/* ... mirror subxidStatus, statusFlags ... */arrayP->numProcs++;for (index++; index < arrayP->numProcs; index++) /* renumber the tail */ allProcs[arrayP->pgprocnos[index]].pgxactoff = index;The sort is justified inline: “there is an increased likelihood of finding the
next PGPROC structure in the cache” during a ProcArray traversal, and “the
occurrence of adding/removing a proc is much lower than the access to the
ProcArray itself, [so] the overhead should be marginal.” The O(numProcs)
memmove-and-renumber cost is paid on connect/disconnect, not on the snapshot
path.
ProcArrayRemove is the mirror image. It takes the same two locks, advances the
global latestCompletedXid if a still-live xid is being removed (the 2PC case),
zeroes the slot’s mirror fields, memmoves the tail down over the hole, and
renumbers. The comment notes the two locks are released “in reversed acquisition
order, to reduce frequency of having to wait for XidGenLock while holding
ProcArrayLock.”
ProcArrayEndTransaction: clearing the XID at commit/abort
Section titled “ProcArrayEndTransaction: clearing the XID at commit/abort”ProcArrayEndTransaction is called at every commit and abort to mark the
transaction no longer running — i.e. to clear ProcGlobal->xids[pgxactoff]. This
is the exclusive-mode write that pairs with every shared-mode snapshot read, and
its correctness hinges on holding ProcArrayLock so a snapshot never observes a
half-cleared state:
// ProcArrayEndTransaction — src/backend/storage/ipc/procarray.c (condensed)if (TransactionIdIsValid(latestXid)){ /* We must lock ProcArrayLock while clearing our advertised XID, so * that we do not exit the set of "running" transactions while someone * else is taking a snapshot. */ if (LWLockConditionalAcquire(ProcArrayLock, LW_EXCLUSIVE)) { ProcArrayEndTransactionInternal(proc, latestXid); LWLockRelease(ProcArrayLock); } else ProcArrayGroupClearXid(proc, latestXid); /* contended: batch it */}else{ /* No XID: we needn't lock — we don't affect anyone's snapshot. */ proc->vxid.lxid = InvalidLocalTransactionId; proc->xmin = InvalidTransactionId; /* ... */}A read-only transaction never assigned an xid, so it skips the lock entirely —
the most common case is free. A transaction that did write tries
LWLockConditionalAcquire: if the lock is free it clears its own xid and is
done; if not, it defers to group clearing. The actual clearing in
ProcArrayEndTransactionInternal zeroes the mirror and the PGPROC copy
together, drops any subxid cache, and crucially advances two global counters
under the lock:
// ProcArrayEndTransactionInternal — procarray.c (condensed)ProcGlobal->xids[pgxactoff] = InvalidTransactionId;proc->xid = InvalidTransactionId;proc->xmin = InvalidTransactionId;/* clear subxid cache (both copies) ... *//* Also advance global latestCompletedXid while holding the lock */MaintainLatestCompletedXid(latestXid);/* Same with xactCompletionCount */TransamVariables->xactCompletionCount++;latestCompletedXid is what the next snapshot’s xmax is derived from;
xactCompletionCount is a monotonically increasing “something committed” tick
that lets GetSnapshotData cheaply prove a cached snapshot is still valid (see
GetSnapshotDataReuse below). Both must move atomically with the xid clear,
which is why they live inside the exclusive section.
ProcArrayGroupClearXid: group commit of the metadata clear
Section titled “ProcArrayGroupClearXid: group commit of the metadata clear”When ProcArrayLock is contended at commit time — many backends finishing at
once — handing the exclusive lock from one committer to the next serializes them.
ProcArrayGroupClearXid instead batches the clears. Each waiter pushes itself
onto a lock-free Treiber stack (procArrayGroupFirst via atomic
compare-exchange); the backend that finds the stack empty becomes the leader,
acquires the lock once, and clears everyone’s xid:
// ProcArrayGroupClearXid — src/backend/storage/ipc/procarray.c (condensed)proc->procArrayGroupMember = true;proc->procArrayGroupMemberXid = latestXid;nextidx = pg_atomic_read_u32(&procglobal->procArrayGroupFirst);while (true){ pg_atomic_write_u32(&proc->procArrayGroupNext, nextidx); if (pg_atomic_compare_exchange_u32(&procglobal->procArrayGroupFirst, &nextidx, (uint32) pgprocno)) break;}if (nextidx != INVALID_PROC_NUMBER){ /* Not the leader: sleep on our semaphore until the leader clears us. */ for (;;) { PGSemaphoreLock(proc->sem); if (!proc->procArrayGroupMember) break; } return;}/* We are the leader. Acquire the lock on behalf of everyone. */LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);nextidx = pg_atomic_exchange_u32(&procglobal->procArrayGroupFirst, INVALID_PROC_NUMBER);while (nextidx != INVALID_PROC_NUMBER){ PGPROC *nextproc = &allProcs[nextidx]; ProcArrayEndTransactionInternal(nextproc, nextproc->procArrayGroupMemberXid); nextidx = pg_atomic_read_u32(&nextproc->procArrayGroupNext);}LWLockRelease(ProcArrayLock);/* Then wake every follower (outside the lock, to keep hold time minimal). */The leader pops the whole list in one pg_atomic_exchange (the comment notes
that popping one-at-a-time “could lead to an ABA problem”), runs
ProcArrayEndTransactionInternal for each member, then wakes them after
releasing the lock — “the system calls we need to perform to wake other
processes up are probably much slower than the simple memory writes we did while
holding the lock.” A pg_write_barrier() before clearing procArrayGroupMember
ensures a woken follower sees its xid already cleared. The net effect: under a
commit storm, one lock acquisition clears N transactions instead of N
acquisitions.
ProcArrayClearTransaction is a related but distinct operation: after a backend
prepares a 2PC transaction, the gxact’s PGPROC stays in the array (still
“running”), so the backend only scrubs its own PGPROC fields without removing
the xid from the census.
GetSnapshotData: building xmin / xmax / xip in one pass
Section titled “GetSnapshotData: building xmin / xmax / xip in one pass”GetSnapshotData is the hottest function in the module. It produces the snapshot
triple by scanning ProcGlobal->xids[] under ProcArrayLock shared. The
structure: derive xmax from latestCompletedXid, seed xmin = xmax, then walk
the dense xid array, lowering xmin and appending each running xid to xip.
flowchart TD
A["GetSnapshotData(snapshot)"] --> B["LWLockAcquire(ProcArrayLock, LW_SHARED)"]
B --> C{"GetSnapshotDataReuse:<br/>xactCompletionCount<br/>unchanged?"}
C -- "yes" --> C1["reuse old xmin/xip,<br/>release lock, return"]
C -- "no" --> D["xmax = latestCompletedXid + 1<br/>xmin = xmax"]
D --> E{"RecoveryInProgress?"}
E -- "no (normal)" --> F["for each pgxactoff in 0..numProcs:"]
F --> G{"xid == Invalid?"}
G -- "yes" --> F
G -- "no" --> H{"own slot, or xid >= xmax,<br/>or PROC_IN_VACUUM /<br/>PROC_IN_LOGICAL_DECODING?"}
H -- "skip" --> F
H -- "include" --> I["xmin = min(xmin, xid)<br/>xip[count++] = xid<br/>copy subxids if room"]
I --> F
E -- "yes (hot standby)" --> J["KnownAssignedXidsGetAndSetXmin<br/>→ fill subxip[], lower xmin"]
F --> K["read replication_slot_xmin /<br/>_catalog_xmin"]
J --> K
K --> L["MyProc->xmin = xmin (if unset)<br/>release lock"]
L --> M["advance GlobalVis* bounds<br/>(definitely/maybe_needed)"]
M --> N["RecentXmin = xmin<br/>fill snapshot fields, return"]
Figure 2 — GetSnapshotData control flow. The fast exit is
GetSnapshotDataReuse: if no transaction with an xid has completed since this
snapshot struct was last filled (the xactCompletionCount tick is unchanged),
the previous contents are still correct and only MyProc->xmin is re-published.
Otherwise the normal path streams the dense xids[] array; the hot-standby path
reads the WAL-reconstructed KnownAssignedXids instead. In both cases the
expensive vacuum horizon is not computed here — only the cheap GlobalVis*
bounds are nudged.
The core loop, lifted from the source:
// GetSnapshotData — main scan, src/backend/storage/ipc/procarray.c (condensed)latest_completed = TransamVariables->latestCompletedXid;xmax = XidFromFullTransactionId(latest_completed);TransactionIdAdvance(xmax); /* xmax = latestCompletedXid + 1 */xmin = xmax; /* seed; lowered below */
if (TransactionIdIsNormal(myxid) && NormalTransactionIdPrecedes(myxid, xmin)) xmin = myxid; /* own xid counts toward xmin */
for (int pgxactoff = 0; pgxactoff < numProcs; pgxactoff++){ TransactionId xid = UINT32_ACCESS_ONCE(other_xids[pgxactoff]); if (likely(xid == InvalidTransactionId)) continue; /* no xid: skip */ if (pgxactoff == mypgxactoff) continue; /* not own xid in xip */ if (!NormalTransactionIdPrecedes(xid, xmax)) continue; /* >= xmax: running anyway */
statusFlags = allStatusFlags[pgxactoff]; if (statusFlags & (PROC_IN_LOGICAL_DECODING | PROC_IN_VACUUM)) continue; /* decoding manages xmin separately; vacuum ignored */
if (NormalTransactionIdPrecedes(xid, xmin)) xmin = xid; xip[count++] = xid; /* add to in-progress array */
/* copy this backend's subxids too, unless anyone has overflowed */ if (!suboverflowed) { /* memcpy proc->subxids.xids into snapshot->subxip */ }}Three filters matter. Own xid is excluded from xip (a transaction always
sees its own writes) but still pulls xmin down. xid >= xmax transactions
began after the snapshot’s cutoff and are treated as running by the
xmax boundary, so they need not be listed. PROC_IN_VACUUM is skipped
because a lazy vacuum holds no row versions that another snapshot must preserve,
so leaving it out of xmin lets vacuum reclaim more aggressively;
PROC_IN_LOGICAL_DECODING backends manage their xmin via replication slots
instead. Subxids are copied opportunistically into snapshot->subxip until some
backend’s cache has overflowed, at which point suboverflowed is set and
visibility checks must consult pg_subtrans.
GetSnapshotData does not compute the precise vacuum horizon. Instead it
cheaply advances the GlobalVisState bounds (definitely_needed /
maybe_needed) using xmin, oldestXid, and the replication-slot xmins read
while the lock is held — the lazy-horizon mechanism described next.
ComputeXidHorizons: the four vacuum xmin horizons
Section titled “ComputeXidHorizons: the four vacuum xmin horizons”The dual of xmin is the oldest xid any backend still cares about — the boundary
below which dead tuples are physically removable. Computing it precisely means a
second full ProcArray scan, which GetSnapshotData deliberately avoids.
ComputeXidHorizons is that scan, run on demand (by vacuum, pruning, and
pg_subtrans truncation). It produces not one horizon but a struct of them,
because a relation’s removability depends on which backends can see it:
// GlobalVisState — src/backend/storage/ipc/procarray.cstruct GlobalVisState{ FullTransactionId definitely_needed; /* XIDs >= are considered running */ FullTransactionId maybe_needed; /* XIDs < are not considered running */};flowchart TD
A["ComputeXidHorizons(h)"] --> B["LWLockAcquire(ProcArrayLock, LW_SHARED)"]
B --> C["seed all horizons with<br/>latestCompletedXid + 1<br/>temp = MyProc->xid (own)"]
C --> D["read replication_slot_xmin /<br/>_catalog_xmin"]
D --> E["for each proc in array:"]
E --> F["xmin = min(proc->xmin, proc->xid)"]
F --> G["oldest_considered_running =<br/>min(.., xmin) (incl. vacuum/decoding)"]
G --> H{"PROC_IN_VACUUM or<br/>PROC_IN_LOGICAL_DECODING?"}
H -- "yes" --> E
H -- "no" --> I["shared_oldest_nonremovable =<br/>min(.., xmin)"]
I --> J{"same DB, or MyDatabaseId unset,<br/>or PROC_AFFECTS_ALL_HORIZONS,<br/>or in recovery?"}
J -- "yes" --> K["data_oldest_nonremovable =<br/>min(.., xmin)"]
J -- "no" --> E
K --> E
E --> L["release lock<br/>(if recovery: fold in KnownAssignedXids min)"]
L --> M["apply slot_xmin to shared+data,<br/>slot_catalog_xmin to catalog+shared"]
M --> N["GlobalVisUpdateApply(h):<br/>publish into GlobalVis* states"]
Figure 3 — ComputeXidHorizons and the four-horizon split. Every backend’s
effective interest is min(xmin, xid). A backend in another database does not
constrain this database’s data tables, so its xmin only lowers the shared
and considered-running horizons, not data. Lazy vacuum and logical
decoding are skipped for the removability horizons (they hold no versions others
must keep) but still count toward oldest_considered_running, which governs
pg_subtrans truncation. Replication-slot xmins back the horizons up last.
The four removability horizons, ordered from most to least conservative:
Horizon (ComputeXidHorizonsResult) | Used for |
|---|---|
shared_oldest_nonremovable | shared catalogs (visible from all databases) |
catalog_oldest_nonremovable | catalog / logical-decoding-accessible relations |
data_oldest_nonremovable | ordinary user tables in the current database |
temp_oldest_nonremovable | this session’s temp tables (only own xid matters) |
A fifth field, oldest_considered_running, is not a removability horizon: it
is the oldest xid any backend (including vacuum and decoding) might still consider
running, used to decide how far pg_subtrans can be truncated. GetReplicationHorizons
exposes shared_oldest_nonremovable_raw (the shared horizon without the slot’s
catalog_xmin folded in) so hot-standby feedback can send the data and catalog
horizons separately to an upstream primary.
GetOldestNonRemovableTransactionId is the public face that vacuum calls; it
runs ComputeXidHorizons and picks the right horizon for the relation kind:
// GetOldestNonRemovableTransactionId — procarray.cTransactionIdGetOldestNonRemovableTransactionId(Relation rel){ ComputeXidHorizonsResult horizons; ComputeXidHorizons(&horizons); switch (GlobalVisHorizonKindForRel(rel)) { case VISHORIZON_SHARED: return horizons.shared_oldest_nonremovable; case VISHORIZON_CATALOG: return horizons.catalog_oldest_nonremovable; case VISHORIZON_DATA: return horizons.data_oldest_nonremovable; case VISHORIZON_TEMP: return horizons.temp_oldest_nonremovable; } return InvalidTransactionId;}The GlobalVisState pair caches an approximate version of these horizons. Pruning
hot paths call GlobalVisTestIsRemovableXid(state, xid): an xid below
maybe_needed is definitely removable, one at or above definitely_needed is
definitely needed, and only an xid between the two triggers a fresh
ComputeXidHorizons to tighten the bounds. GetSnapshotData advances these
bounds for free on every snapshot, so the expensive recompute is rare.
KnownAssignedXids: emulating the primary’s active set on a standby
Section titled “KnownAssignedXids: emulating the primary’s active set on a standby”On a hot-standby replica the local PGPROC array represents standby query
backends, “which by definition are not running transactions that have XIDs.” Yet
read-only queries there still need snapshots whose xip reflects the primary’s
in-flight transactions. PostgreSQL reconstructs that set from the WAL stream into
KnownAssignedXids — a sorted array with a parallel validity bitmap:
// KnownAssignedXids design — src/backend/storage/ipc/procarray.c (comment, condensed)// The XIDs are stored in an array in sorted order --- TransactionIdPrecedes// order --- to allow binary search. ...// To keep individual deletions cheap, we allow gaps in the array, marking// elements valid/invalid via the parallel boolean array KnownAssignedXidsValid[].// A deletion sets KnownAssignedXidsValid[i]=false *without* clearing the XID,// preserving sortedness for binary search. Periodically we compress out the// unused entries.// Valid items are those with tail <= i < head. When head reaches the array// end we force compression rather than wrapping.The startup (WAL-replay) process is the single writer: it appends new ids in
xid order as it sees assignment records (RecordKnownAssignedTransactionIds),
marks ids invalid as it sees commit/abort records, prunes old ids, and
periodically compresses. Standby query backends are readers under shared
ProcArrayLock. Append-at-the-right keeps the array sorted for free because
“new known XIDs are always reported in XID order.” Deletions only flip a validity
bit, so they are O(1) and never disturb sort order; the O(n) compaction is
deferred and batched (KnownAssignedXidsCompress, driven by KAXCompressReason).
When GetSnapshotData runs in recovery it takes the standby branch and fills the
snapshot directly from this array:
// GetSnapshotData — recovery branch (condensed)subcount = KnownAssignedXidsGetAndSetXmin(snapshot->subxip, &xmin, xmax);if (TransactionIdPrecedesOrEquals(xmin, procArray->lastOverflowedXid)) suboverflowed = true;Because the standby cannot tell top-level xids from subxids in the WAL stream,
all known ids are stored together in snapshot->subxip and xip is left empty —
“a design choice that greatly simplifies xid processing.” lastOverflowedXid
tracks the highest subxid that had to be dropped, so the standby knows when to
mark a snapshot suboverflowed and fall back to pg_subtrans. The same
ProcArrayLock, latestCompletedXid, and snapshot builder are reused — only the
source of the active set changes between normal running and recovery.
Source Walkthrough
Section titled “Source Walkthrough”Anchor on symbol names, not line numbers.
procarray.cis large and reorganizes between releases; function/struct names are the stable handle. Usegit grep -n '<symbol>' src/backend/storage/ipc/procarray.cto locate the current position. Line numbers in the table below were observed at commit 273fe94 (REL_18_STABLE) and are hints only.
Shared structures (src/include/storage/proc.h, procarray.c)
Section titled “Shared structures (src/include/storage/proc.h, procarray.c)”struct PGPROC(inproc.h) — the per-backend slot;xid,xmin,pgxactoff,statusFlags,subxidStatus,subxids, and the group-clear fields (procArrayGroupMember,procArrayGroupNext,procArrayGroupMemberXid).typedef struct PROC_HDR(inproc.h) —ProcGlobal: theallProcs[]pool plus the three dense mirror arrays (xids,subxidStates,statusFlags) and theprocArrayGroupFirstqueue head.struct XidCacheStatus,PGPROC_MAX_CACHED_SUBXIDS(inproc.h) — the 64-id subxid cache and itscount/overflowedsummary.PROC_IN_VACUUM,PROC_IN_LOGICAL_DECODING,PROC_AFFECTS_ALL_HORIZONS,PROC_VACUUM_STATE_MASK(status-flag macros, inproc.h).typedef struct ProcArrayStruct(inprocarray.c) — the sortedpgprocnos[]index, theKnownAssignedXidshead/tail bookkeeping,lastOverflowedXid, and the tworeplication_slot_*_xminfields.struct GlobalVisState,typedef struct ComputeXidHorizonsResult,enum GlobalVisHorizonKind(inprocarray.c) — the cached approximate horizon pair and the precise multi-horizon result.
Membership (procarray.c)
Section titled “Membership (procarray.c)”ProcArrayShmemInit— allocate and size the shared array at startup.ProcArrayAdd— insert a backend’sPGPROCindex, keeping the array sorted; renumberpgxactofffor the tail.ProcArrayRemove— remove a backend (or a finished 2PC gxact), advancinglatestCompletedXidfor the live-xid case.
End-of-transaction (procarray.c)
Section titled “End-of-transaction (procarray.c)”ProcArrayEndTransaction— clear an xid at commit/abort; fast no-lock path for xidless xacts, conditional-acquire fast path, group-clear fallback.ProcArrayEndTransactionInternal— the under-lock clear; advanceslatestCompletedXidandxactCompletionCount.ProcArrayGroupClearXid— lock-free group-clear queue + leader.ProcArrayClearTransaction— scrub ownPGPROCafter a 2PC prepare without leaving the census.MaintainLatestCompletedXid/MaintainLatestCompletedXidRecovery— bump the globallatestCompletedXidthat backsxmax.
Snapshot building (procarray.c)
Section titled “Snapshot building (procarray.c)”GetSnapshotData— buildxmin/xmax/xipin one shared-lock pass; advanceGlobalVis*.GetSnapshotDataReuse— thexactCompletionCountfast-path that reuses an unchanged snapshot.GetMaxSnapshotXidCount/GetMaxSnapshotSubxidCount— array sizing for snapmgr.c.ProcArrayInstallImportedXmin/ProcArrayInstallRestoredXmin— adopt an imported xmin while verifying the source xact still runs.
Horizons (procarray.c)
Section titled “Horizons (procarray.c)”ComputeXidHorizons— the precise multi-horizon scan.GetOldestNonRemovableTransactionId— vacuum’s entry point; picks the horizon per relation kind viaGlobalVisHorizonKindForRel.GetOldestTransactionIdConsideredRunning—pg_subtrans-truncation horizon.GetReplicationHorizons— data + catalog horizons for hot-standby feedback.GlobalVisTestFor/GlobalVisTestIsRemovableXid/GlobalVisCheckRemovableXid— the cached approximate-horizon test on the pruning hot path.
Membership queries (procarray.c)
Section titled “Membership queries (procarray.c)”TransactionIdIsInProgress— is a specific xid still running (with RecentXmin and single-entry caches up front).TransactionIdIsActive,ProcNumberGetTransactionIds,BackendXidGetPid,GetOldestActiveTransactionId.
Replication slot horizon (procarray.c)
Section titled “Replication slot horizon (procarray.c)”ProcArraySetReplicationSlotXmin/ProcArrayGetReplicationSlotXmin— publish / read the slot-imposed holdback on the horizon.
Hot standby / KnownAssignedXids (procarray.c)
Section titled “Hot standby / KnownAssignedXids (procarray.c)”ProcArrayInitRecovery/ProcArrayApplyRecoveryInfo/ProcArrayApplyXidAssignment— set up and update the standby active set from WAL.RecordKnownAssignedTransactionIds— note a newly seen primary xid.KnownAssignedXidsAdd/KnownAssignedXidsSearch— append (in sorted order) / binary-search-and-optionally-remove.KnownAssignedXidsGetAndSetXmin— fill a standby snapshot and lower its xmin.KnownAssignedXidsRemovePreceding/ExpireOldKnownAssignedTransactionIds/KnownAssignedXidsCompress— pruning and the deferred O(n) compaction.
Position hints (as of 2026-06-05, REL_18 273fe94)
Section titled “Position hints (as of 2026-06-05, REL_18 273fe94)”| Symbol | File | Line |
|---|---|---|
struct PGPROC | proc.h | 176 |
typedef struct XidCacheStatus | proc.h | 41 |
PGPROC_MAX_CACHED_SUBXIDS | proc.h | 39 |
typedef struct PROC_HDR | proc.h | 383 |
typedef struct ProcArrayStruct | procarray.c | 71 |
struct GlobalVisState | procarray.c | 167 |
ComputeXidHorizonsResult | procarray.c | 179 |
ProcArrayShmemInit | procarray.c | 418 |
ProcArrayAdd | procarray.c | 468 |
ProcArrayRemove | procarray.c | 565 |
ProcArrayEndTransaction | procarray.c | 667 |
ProcArrayEndTransactionInternal | procarray.c | 731 |
ProcArrayGroupClearXid | procarray.c | 792 |
ProcArrayClearTransaction | procarray.c | 907 |
MaintainLatestCompletedXid | procarray.c | 967 |
ProcArrayApplyRecoveryInfo | procarray.c | 1054 |
ProcArrayApplyXidAssignment | procarray.c | 1318 |
TransactionIdIsInProgress | procarray.c | 1402 |
ComputeXidHorizons | procarray.c | 1735 |
GetOldestNonRemovableTransactionId | procarray.c | 2005 |
GetOldestTransactionIdConsideredRunning | procarray.c | 2034 |
GetReplicationHorizons | procarray.c | 2047 |
GetSnapshotDataReuse | procarray.c | 2095 |
GetSnapshotData | procarray.c | 2175 |
ProcArrayInstallImportedXmin | procarray.c | 2531 |
GlobalVisTestFor | procarray.c | 4107 |
GlobalVisTestIsRemovableXid | procarray.c | 4264 |
GlobalVisCheckRemovableXid | procarray.c | 4300 |
RecordKnownAssignedTransactionIds | procarray.c | 4403 |
ExpireOldKnownAssignedTransactionIds | procarray.c | 4531 |
KnownAssignedXidsAdd | procarray.c | 4782 |
KnownAssignedXidsSearch | procarray.c | 4886 |
KnownAssignedXidsGetAndSetXmin | procarray.c | 5127 |
Source verification (as of 2026-06-05)
Section titled “Source verification (as of 2026-06-05)”Each entry leads with a fact about the current source at commit 273fe94 (REL_18_STABLE), readable without any external material. The trailing sentence records how it was checked. Open questions follow as recorded gaps.
Verified facts
Section titled “Verified facts”-
The active set is a sorted index array plus three dense mirror arrays, not a scan over full
PGPROCstructs.ProcArrayStruct.pgprocnos[]holds indexes intoProcGlobal->allProcs[], kept sorted byPGPROCaddress;GetSnapshotDataandComputeXidHorizonsstreamProcGlobal->xids[]/subxidStates[]/statusFlags[]indexed bypgxactoff. Verified by readingProcArrayStruct,PROC_HDR, and the scan loops. The per-PGPROCcopies are the writer’s source of truth;proc.hstates both copies “need to be maintained coherently.” -
xmaxislatestCompletedXid + 1, advanced underProcArrayLockexclusive at every commit/abort.GetSnapshotDatacomputesxmaxfromTransamVariables->latestCompletedXid;ProcArrayEndTransactionInternaladvanceslatestCompletedXidviaMaintainLatestCompletedXidinside the exclusive section. Verified by reading both. -
A read-only (xidless) transaction clears no shared state and takes no lock at commit.
ProcArrayEndTransaction’selsebranch (whenlatestXidis invalid) only resets localPGPROCfields, with a comment that it “won’t affect anyone else’s calculation of a snapshot.” Verified inProcArrayEndTransaction. -
The commit-time clear escalates to group clearing only when
ProcArrayLockcannot be acquired immediately.ProcArrayEndTransactionusesLWLockConditionalAcquirefirst and falls back toProcArrayGroupClearXid, which builds a lock-free Treiber stack onProcGlobal->procArrayGroupFirst(CAS loop), elects the first arrival as leader, and pops the whole list in onepg_atomic_exchange. Verified in both functions; the one-at-a-time pop is explicitly avoided to dodge an ABA problem. -
GetSnapshotDataexcludes the backend’s own xid fromxipbut includes it inxmin, and skipsPROC_IN_VACUUMandPROC_IN_LOGICAL_DECODINGbackends. Verified in the main scan loop: own-slotcontinue, thexid >= xmaxcontinue, and thestatusFlags & (PROC_IN_LOGICAL_DECODING | PROC_IN_VACUUM)continue, withxminseeded from the own xid before the loop. -
A snapshot is reused without a scan when
xactCompletionCountis unchanged.GetSnapshotDataReusecomparessnapshot->snapXactCompletionCounttoTransamVariables->xactCompletionCount; on a match it re-publishesMyProc->xmin/RecentXminand returns the prior contents. Verified inGetSnapshotDataReuse; the counter is bumped inProcArrayEndTransactionInternal. -
ComputeXidHorizonsproduces four removability horizons plusoldest_considered_running, differentiated by relation visibility scope.ComputeXidHorizonsResultcarriesshared_/catalog_/data_/temp_oldest_nonremovable,shared_oldest_nonremovable_raw, andoldest_considered_running. A backend in another database lowers only the shared (and considered-running) horizons unlessMyDatabaseIdis unset,PROC_AFFECTS_ALL_HORIZONSis set, or in recovery. Verified inComputeXidHorizonsandGetOldestNonRemovableTransactionId. -
The expensive horizon is cached approximately as a
(maybe_needed, definitely_needed)pair and recomputed only for in-between xids.GlobalVisStateholds the twoFullTransactionIdbounds;GetSnapshotDataadvances them on every snapshot, and a preciseComputeXidHorizonsruns only when a tested xid falls between them. Verified in theGlobalVis*maintenance block ofGetSnapshotDataand inGlobalVisTestFor. -
Replication slots back the horizons up but not the snapshot’s
xmindirectly.replication_slot_xmin/replication_slot_catalog_xminlive inProcArrayStruct, are read under the lock in bothGetSnapshotDataandComputeXidHorizons, and are folded into theGlobalVis*bounds / horizons — not into the snapshotxip. Verified by reading both functions andProcArraySetReplicationSlotXmin. -
In hot standby the snapshot is built from
KnownAssignedXids, a sorted array with a parallel validity bitmap and deferred compaction.GetSnapshotData’s recovery branch callsKnownAssignedXidsGetAndSetXminintosnapshot->subxip(leavingxipempty); deletions only flipKnownAssignedXidsValid[i], preserving sort order, with O(n) compaction deferred toKnownAssignedXidsCompress. Verified against the design comment andKnownAssignedXidsRemovePreceding/KnownAssignedXidsGetAndSetXmin. The startup process is the sole writer.
Open questions
Section titled “Open questions”-
Cost of the
ProcArrayAdd/Removememmove on very high connection churn. Each connect/disconnect memmovespgprocnos[]and all three mirror arrays and renumbers every following backend’spgxactoff—O(numProcs)under two exclusive LWLocks. The inline comment asserts the churn is “much lower than the access to the ProcArray itself,” but the crossover point with thousands of short-lived connections (no pooler) is not characterized here. Investigation path: microbenchmark connect/disconnect storms vs.max_connections, and compare against the snapshot-scan savings the sort buys. -
Exact
GlobalVis*recompute frequency under mixed workloads.GlobalVisTestShouldUpdatelimits how often the preciseComputeXidHorizonsruns (keyed onComputeXidHorizonsResultLastXmin), but the practical hit rate of the(maybe_needed, definitely_needed)window on a real pruning workload — and how often a tested xid lands in the in-between zone — is not measured here. Investigation path: instrumentGlobalVisTestIsRemovableXidcall/recompute counts under HOT-update-heavy and long-transaction workloads. -
lastOverflowedXidprecision on a busy standby. On a standby with many subtransaction-overflowing primary transactions,lastOverflowedXidcan mark snapshots suboverflowed conservatively, forcingpg_subtranslookups. WhetherExpireOldKnownAssignedTransactionIdsresets it aggressively enough to avoid sustained over-marking is not traced here. Investigation path: read thelastOverflowedXidreset logic alongside a subxid-overflow standby workload.
Beyond PostgreSQL — Comparative Designs & Research Frontiers
Section titled “Beyond PostgreSQL — Comparative Designs & Research Frontiers”Pointers, not analysis. Each bullet is a handle for a follow-up doc.
-
CUBRID’s active-MVCCID table (
cubrid-mvcc.md) — CUBRID tracks the live set withmvcc_active_tran, a bit array over a sliding MVCCID window plus an overflow list and cached low/high watermarks, rather than PostgreSQL’s scan-an-array-of-slots approach. PostgreSQL pays anO(numProcs)scan per snapshot (mitigated by the dense mirror arrays andxactCompletionCountreuse); CUBRID pays a bit-array copy. A side-by-side cost model of “scan slots” vs. “copy a windowed bit array” as a function of connection count and transaction rate is the natural follow-up. -
Commit-LSN / commit-timestamp visibility (Oracle SCN, SQL Server Hekaton, Spanner TrueTime) — engines that timestamp each commit with a monotonic sequence number and decide visibility by numeric comparison against a snapshot timestamp, avoiding an explicit in-progress list. Database Internals (Petrov, ch. 5) frames this as the commit-order alternative to active-set enumeration. PostgreSQL has a
commit_tsmodule but still drives MVCC visibility from the procarray’s active set; a comparison of where each design pays its cost (snapshot build vs. per-version timestamp storage) would be a focused note. -
Snapshot scalability rework (PostgreSQL 14, commit
1f51c17c66, Andres Freund) — the change that introduced the denseProcGlobal->xids[]mirror arrays analyzed here, replacing per-PGPROCfield scans. The before/after design and the measured throughput gain on many-core boxes deserve their own historical note; this doc describes the result, not the migration. -
Distributed / disaggregated active-set tracking (Aurora, Neon, Citus) — cloud-native Postgres derivatives that move the horizon/visibility authority off a single shared-memory segment. How they reconcile a per-node procarray with a global GC horizon (and whether they centralize
xmincomputation) is a distinct architecture worth a comparative doc once the single-node story here is settled. -
A Scalable Lock Manager for Multicores (Jung et al., SIGMOD 2013;
knowledge/research/dbms-papers/scalable-lock-manager.md) — diagnoses cache-line bouncing on a central latch as the multicore killer.ProcArrayLockis exactly such a central latch; the group-clear queue and shared-mode snapshot reads are PostgreSQL’s mitigations on the procarray side, parallel to the lock manager’s partitioning. A cross-walk of how the same contention lesson shows up in two different subsystems (postgres-lock-manager.mdand this doc) is a worthwhile synthesis.
Sources
Section titled “Sources”PostgreSQL source (under /data/hgryoo/references/postgres, REL_18 273fe94)
Section titled “PostgreSQL source (under /data/hgryoo/references/postgres, REL_18 273fe94)”src/backend/storage/ipc/procarray.c— the process-array module: shared structure, membership, snapshot building, horizon computation, group XID clearing, and the KnownAssignedXids hot-standby machinery.src/include/storage/proc.h—PGPROC,PROC_HDR(ProcGlobal),XidCacheStatus, and thePROC_*status-flag macros.src/include/storage/procarray.h— the module’s public API surface.
In-tree design notes
Section titled “In-tree design notes”src/backend/access/transam/README— the locking discipline for setting and clearing a backend’s xid relative to snapshot acquisition (referenced throughoutprocarray.c).- The
KnownAssignedXidsdesign comment block inprocarray.citself — sorted array, validity bitmap, deferred compaction, single-writer rule.
Textbook chapters (under knowledge/research/dbms-general/)
Section titled “Textbook chapters (under knowledge/research/dbms-general/)”- Database System Concepts (Silberschatz et al., 7e), ch. 18 — §18.6 multiversion schemes, §18.7 snapshot isolation.
- Database Internals (Petrov), ch. 5 — concurrency-control families; MVCC visibility vs. mutual exclusion; commit-order alternatives.
Papers (under knowledge/research/dbms-papers/)
Section titled “Papers (under knowledge/research/dbms-papers/)”scalable-lock-manager.md— Jung et al., A Scalable Lock Manager for Multicores, SIGMOD 2013 (cited for the central-latch contention frontier).
Sibling docs (cross-references, mechanism not duplicated here)
Section titled “Sibling docs (cross-references, mechanism not duplicated here)”postgres-mvcc-snapshots.md— the visibility rules that consume the snapshot triple this module produces. This doc owns the procarray internals; that doc references them rather than re-deriving them.postgres-architecture-overview.md— the shared-memory spine in which the procarray andProcGlobalarrays live.postgres-lock-manager.md— the other majorPGPROC-anchored shared structure and a parallel study in central-latch contention.