Skip to content

PostgreSQL procarray — The Live-XID Census Behind Every Snapshot

Contents:

Snapshot-isolation MVCC pushes concurrency control down to visibility: a transaction reads the version of a row that was committed as of the moment its snapshot was taken, and ignores versions written by transactions that were still in flight then. Database System Concepts (Silberschatz, 7e, §18.6 “Multiversion Schemes” and §18.7 “Snapshot Isolation”) frames the snapshot as the decision oracle for visibility: every row carries the id of the transaction that created it (and, if deleted, the one that deleted it), and the snapshot answers “was that transaction committed before I started, or not?” Database Internals (Petrov, ch. 5) makes the same point from the engine side — MVCC’s distinguishing property is that “reads can continue accessing older values until the new ones are committed,” so the engine must, at snapshot time, capture which transactions are still running.

That capture is exactly what this document is about. The visibility rules — how a given row’s xmin/xmax are tested against a snapshot — belong to a sibling doc (postgres-mvcc-snapshots.md). The procarray’s job is narrower and prior: it is the live-XID census. Three quantities define a PostgreSQL snapshot:

  1. xmin — the lowest transaction id still considered running. Anything older is definitely finished.
  2. xmax — the highest committed id plus one. Anything at or above is definitely still running (it hadn’t even started when the snapshot formed).
  3. xip (xid-in-progress array) — the explicit list of ids in the open interval [xmin, xmax) that were running at snapshot time. An id in that range is “running” iff it appears in xip.

Two design choices follow from needing to produce that triple cheaply, and they shape the whole module:

  1. Where does the “currently running” set live, and how is it scanned? The set changes on every transaction begin/end, and a snapshot must read it atomically with respect to those changes. The naive answer — a lock-protected list walked per snapshot — is on the hottest path in the engine, so the representation and its latch discipline dominate scalability.

  2. What is the oldest id any snapshot still cares about, and who needs it? This is the dual of xmin: the global minimum xmin across all backends (plus replication holdbacks) is the boundary below which dead row versions can be physically reclaimed. Vacuum, pruning, pg_xact/pg_subtrans truncation, and hot-standby feedback all consume this xmin horizon.

PostgreSQL answers both from one shared-memory structure — the process array.

The textbook gives the model (SI, snapshot triple, oldest-active horizon for GC). This section names the engineering conventions that nearly every MVCC engine — PostgreSQL, Oracle, MySQL/InnoDB, SQL Server Hekaton, CUBRID — adopts to realize it. PostgreSQL’s specific choices in the next section read as one set of dials within this shared space.

Every MVCC engine keeps one shared structure that enumerates the transactions currently in flight, because that is what a snapshot must read. The two recurring shapes are a centralized active-transaction table (a fixed array of per-session slots, each advertising that session’s current id and oldest interest) and a commit-sequence counter that timestamps each commit so visibility becomes a numeric comparison. PostgreSQL uses the array shape: one slot per backend, scanned to build the snapshot. The slot count is bounded at startup (max_connections + background workers + autovacuum + prepared-xact slots), so the array is fixed-size and never reallocated at runtime.

Snapshot = copy the active set under a shared latch

Section titled “Snapshot = copy the active set under a shared latch”

Acquiring a snapshot means reading the active set atomically with respect to transactions entering and leaving it. The universal pattern is a reader/writer latch: a snapshot taker holds the latch in shared mode and copies the active ids out; a transaction that ends takes the latch exclusively to remove its id. Shared-mode acquisition lets many backends snapshot concurrently, while the exclusive end-of-transaction step is the serialization point that defines a consistent “set of running transactions.”

Dense mirror arrays for the scan-hot fields

Section titled “Dense mirror arrays for the scan-hot fields”

A snapshot scan touches only a few fields of each session record — the current id, the sub-transaction summary, a status byte — but a full session-control block is large and cache-cold. The convention is to mirror the scan-hot fields into separate, densely packed arrays so the snapshot loop streams through contiguous memory and never pulls a whole control block into cache. This was the central optimization of PostgreSQL’s “Snapshot scalability” rework (v14).

Subtransaction summaries with an overflow escape hatch

Section titled “Subtransaction summaries with an overflow escape hatch”

Subtransactions each get their own id, so the active set is really a set of trees. Tracking every subxid in shared memory is unbounded, so engines cache a bounded number of subxids per session plus an “overflowed” flag. When the cache overflows, snapshots fall back to a secondary structure (PostgreSQL’s pg_subtrans) to resolve parentage. The flag lets the common, non-overflowed case stay entirely in the fast array.

When many sessions commit at once they all want the exclusive end-of-transaction latch, turning it into a convoy. A standard mitigation is group commit of the metadata update: the first arrival becomes a leader, drains a lock-free queue of waiters, and clears all their ids under a single latch acquisition. The others sleep on a semaphore and wake already cleared. This trades one latch hand-off per committer for one per batch.

A physical-replication standby replays WAL but runs no real transactions of its own, yet read-only queries on it still need snapshots. The convention is to reconstruct the primary’s active set from the WAL stream — watching id assignment and commit/abort records — into a standby-local structure that the snapshot builder treats as the active set while in recovery. PostgreSQL calls this KnownAssignedXids.

Theory / conventionPostgreSQL name
Central active-transaction registryProcArrayStruct (sorted pgprocnos[] index into allProcs[])
Per-session slotPGPROC (one per backend / prepared xact) + its pgxactoff
Scan-hot mirror arraysProcGlobal->xids[], subxidStates[], statusFlags[]
Snapshot tripleSnapshotData.xmin / xmax / xip[] (built by GetSnapshotData)
Snapshot latch (shared)ProcArrayLock held LW_SHARED
End-of-transaction (exclusive) stepProcArrayEndTransaction…Internal under LW_EXCLUSIVE
Bounded subxid cache + overflow flagPGPROC.subxids (64 ids) + XidCacheStatus.overflowed
Group commit of the metadata clearProcArrayGroupClearXid (lock-free queue + leader)
Oldest-active horizon for GCComputeXidHorizonsGetOldestNonRemovableTransactionId
Cached approximate horizonGlobalVisState (definitely_needed / maybe_needed)
Replication holdback on the horizonreplication_slot_xmin / replication_slot_catalog_xmin
Standby active-set emulationKnownAssignedXids[] + KnownAssignedXidsValid[]

procarray.c opens by stating its own remit: it “maintains arrays of PGPROC substructures, as well as associated arrays in ProcGlobal, for all active backends … the principal one is as a means of determining the set of currently running transactions.” The module is the single producer of every MVCC snapshot (GetSnapshotData) and every vacuum horizon (ComputeXidHorizons).

The distinguishing choices are: (1) the active set is a sorted array of indexes into the global PGPROC pool, paired with dense mirror arrays of the three scan-hot fields so the snapshot loop streams contiguous memory; (2) the snapshot triple is built in one ProcArrayLock-shared pass with the expensive horizon left approximate and recomputed lazily; (3) the contended commit-time clear is batched via a lock-free group-clear queue; (4) the vacuum horizon is four horizons, tuned per relation kind, all from one ComputeXidHorizons scan; and (5) hot standby reuses the same snapshot builder against a WAL-reconstructed KnownAssignedXids array.

The shared structure: a sorted index plus dense mirror arrays

Section titled “The shared structure: a sorted index plus dense mirror arrays”
// ProcArrayStruct — src/backend/storage/ipc/procarray.c
typedef struct ProcArrayStruct
{
int numProcs; /* number of valid procs entries */
int maxProcs; /* allocated size of procs array */
/* Known assigned XIDs handling */
int maxKnownAssignedXids; /* allocated size of array */
int numKnownAssignedXids; /* current # of valid entries */
int tailKnownAssignedXids; /* index of oldest valid element */
int headKnownAssignedXids; /* index of newest element, + 1 */
TransactionId lastOverflowedXid; /* highest subxid dropped from KAX */
/* oldest xmin of any replication slot */
TransactionId replication_slot_xmin;
/* oldest catalog xmin of any replication slot */
TransactionId replication_slot_catalog_xmin;
/* indexes into allProcs[], has PROCARRAY_MAXPROCS entries */
int pgprocnos[FLEXIBLE_ARRAY_MEMBER];
} ProcArrayStruct;

The procarray itself stores almost no transaction data. pgprocnos[] is a list of indexes into the global PGPROC pool (ProcGlobal->allProcs), kept sorted by PGPROC address. The scan-hot transaction fields live in three parallel, densely packed arrays hung off ProcGlobal:

// PROC_HDR (ProcGlobal) — src/include/storage/proc.h
typedef struct PROC_HDR
{
PGPROC *allProcs; /* Array of PGPROC structures */
TransactionId *xids; /* mirror of PGPROC.xid, indexed by pgxactoff */
XidCacheStatus *subxidStates; /* mirror of PGPROC.subxidStatus */
uint8 *statusFlags; /* mirror of PGPROC.statusFlags */
uint32 allProcCount;
/* ... free lists, group-clear/clog-group queue heads ... */
pg_atomic_uint32 procArrayGroupFirst; /* group XID-clear queue head */
/* ... */
} PROC_HDR;

Each PGPROC carries its own copy of these fields plus the index pgxactoff that locates its slot in the mirror arrays:

// PGPROC (excerpt) — src/include/storage/proc.h
struct PGPROC
{
/* ... */
TransactionId xid; /* top-level xid, or Invalid; mirrored in
* ProcGlobal->xids[pgxactoff] */
TransactionId xmin; /* minimal running XID when this xact started,
* excluding LAZY VACUUM */
int pgxactoff; /* offset into the ProcGlobal mirror arrays */
/* ... */
uint8 statusFlags; /* mirrored in ProcGlobal->statusFlags[pgxactoff] */
XidCacheStatus subxidStatus;/* mirrored in ProcGlobal->subxidStates[pgxactoff] */
struct XidCache subxids; /* cache of up to 64 subtransaction XIDs */
/* group XID clearing */
bool procArrayGroupMember;
pg_atomic_uint32 procArrayGroupNext;
TransactionId procArrayGroupMemberXid;
};

The duplication is deliberate. The per-PGPROC copy is the source of truth a backend updates; the ProcGlobal mirror is what a snapshot scan reads. The header is explicit that “both copies need to be maintained coherently” and that “the pgxactoff indexed value can never be accessed without holding locks.” The point of the mirror is cache density: GetSnapshotData walks ProcGlobal->xids[] — one TransactionId per backend, contiguous — instead of striding through full PGPROC structs. This is the v14 “snapshot scalability” layout.

flowchart LR
  subgraph POOL["ProcGlobal->allProcs[] (fixed pool)"]
    P0["PGPROC #5<br/>xid, xmin, subxids[64]<br/>pgxactoff=2"]
    P1["PGPROC #2<br/>xid, xmin, subxids[64]<br/>pgxactoff=0"]
    P2["PGPROC #9<br/>xid, xmin, subxids[64]<br/>pgxactoff=1"]
  end
  subgraph PA["ProcArrayStruct (sorted by PGPROC addr)"]
    PN["pgprocnos[] = [2, 9, 5, ...]"]
  end
  subgraph MIR["ProcGlobal dense mirror arrays (indexed by pgxactoff)"]
    XS["xids[] = [x@2, x@9, x@5, ...]"]
    SS["subxidStates[]"]
    FS["statusFlags[]"]
  end
  PN -- "pgprocnos[k] = procno" --> POOL
  P1 -- "pgxactoff=0" --> XS
  P2 -- "pgxactoff=1" --> XS
  P0 -- "pgxactoff=2" --> XS

Figure 1 — The three-level structure. ProcArrayStruct.pgprocnos[] is a sorted list of indexes into the fixed allProcs[] pool; the array stays sorted by PGPROC address for cache locality. Each backend’s pgxactoff locates its slot in the dense ProcGlobal mirror arrays (xids[], subxidStates[], statusFlags[]), which is what the snapshot loop actually streams through. The sorted-index array and the dense mirror are kept in lockstep — adding or removing a backend memmoves both.

ProcArrayAdd / ProcArrayRemove: entering and leaving the census

Section titled “ProcArrayAdd / ProcArrayRemove: entering and leaving the census”

A backend joins the census at startup (ProcArrayAdd) and leaves at exit (ProcArrayRemove). Both run under both ProcArrayLock and XidGenLock exclusively, and both keep pgprocnos[] and all three mirror arrays sorted by PGPROC address via memmove, then fix up every following backend’s pgxactoff:

// ProcArrayAdd — src/backend/storage/ipc/procarray.c (condensed)
LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
LWLockAcquire(XidGenLock, LW_EXCLUSIVE);
/* find sorted insertion point (keep procs array sorted by PGPROC *) */
for (index = 0; index < arrayP->numProcs; index++)
if (arrayP->pgprocnos[index] > pgprocno)
break;
movecount = arrayP->numProcs - index;
memmove(&arrayP->pgprocnos[index + 1], &arrayP->pgprocnos[index], movecount * ...);
memmove(&ProcGlobal->xids[index + 1], &ProcGlobal->xids[index], movecount * ...);
memmove(&ProcGlobal->subxidStates[index + 1], &ProcGlobal->subxidStates[index], ...);
memmove(&ProcGlobal->statusFlags[index + 1], &ProcGlobal->statusFlags[index], ...);
arrayP->pgprocnos[index] = pgprocno;
proc->pgxactoff = index;
ProcGlobal->xids[index] = proc->xid;
/* ... mirror subxidStatus, statusFlags ... */
arrayP->numProcs++;
for (index++; index < arrayP->numProcs; index++) /* renumber the tail */
allProcs[arrayP->pgprocnos[index]].pgxactoff = index;

The sort is justified inline: “there is an increased likelihood of finding the next PGPROC structure in the cache” during a ProcArray traversal, and “the occurrence of adding/removing a proc is much lower than the access to the ProcArray itself, [so] the overhead should be marginal.” The O(numProcs) memmove-and-renumber cost is paid on connect/disconnect, not on the snapshot path.

ProcArrayRemove is the mirror image. It takes the same two locks, advances the global latestCompletedXid if a still-live xid is being removed (the 2PC case), zeroes the slot’s mirror fields, memmoves the tail down over the hole, and renumbers. The comment notes the two locks are released “in reversed acquisition order, to reduce frequency of having to wait for XidGenLock while holding ProcArrayLock.”

ProcArrayEndTransaction: clearing the XID at commit/abort

Section titled “ProcArrayEndTransaction: clearing the XID at commit/abort”

ProcArrayEndTransaction is called at every commit and abort to mark the transaction no longer running — i.e. to clear ProcGlobal->xids[pgxactoff]. This is the exclusive-mode write that pairs with every shared-mode snapshot read, and its correctness hinges on holding ProcArrayLock so a snapshot never observes a half-cleared state:

// ProcArrayEndTransaction — src/backend/storage/ipc/procarray.c (condensed)
if (TransactionIdIsValid(latestXid))
{
/* We must lock ProcArrayLock while clearing our advertised XID, so
* that we do not exit the set of "running" transactions while someone
* else is taking a snapshot. */
if (LWLockConditionalAcquire(ProcArrayLock, LW_EXCLUSIVE))
{
ProcArrayEndTransactionInternal(proc, latestXid);
LWLockRelease(ProcArrayLock);
}
else
ProcArrayGroupClearXid(proc, latestXid); /* contended: batch it */
}
else
{
/* No XID: we needn't lock — we don't affect anyone's snapshot. */
proc->vxid.lxid = InvalidLocalTransactionId;
proc->xmin = InvalidTransactionId;
/* ... */
}

A read-only transaction never assigned an xid, so it skips the lock entirely — the most common case is free. A transaction that did write tries LWLockConditionalAcquire: if the lock is free it clears its own xid and is done; if not, it defers to group clearing. The actual clearing in ProcArrayEndTransactionInternal zeroes the mirror and the PGPROC copy together, drops any subxid cache, and crucially advances two global counters under the lock:

// ProcArrayEndTransactionInternal — procarray.c (condensed)
ProcGlobal->xids[pgxactoff] = InvalidTransactionId;
proc->xid = InvalidTransactionId;
proc->xmin = InvalidTransactionId;
/* clear subxid cache (both copies) ... */
/* Also advance global latestCompletedXid while holding the lock */
MaintainLatestCompletedXid(latestXid);
/* Same with xactCompletionCount */
TransamVariables->xactCompletionCount++;

latestCompletedXid is what the next snapshot’s xmax is derived from; xactCompletionCount is a monotonically increasing “something committed” tick that lets GetSnapshotData cheaply prove a cached snapshot is still valid (see GetSnapshotDataReuse below). Both must move atomically with the xid clear, which is why they live inside the exclusive section.

ProcArrayGroupClearXid: group commit of the metadata clear

Section titled “ProcArrayGroupClearXid: group commit of the metadata clear”

When ProcArrayLock is contended at commit time — many backends finishing at once — handing the exclusive lock from one committer to the next serializes them. ProcArrayGroupClearXid instead batches the clears. Each waiter pushes itself onto a lock-free Treiber stack (procArrayGroupFirst via atomic compare-exchange); the backend that finds the stack empty becomes the leader, acquires the lock once, and clears everyone’s xid:

// ProcArrayGroupClearXid — src/backend/storage/ipc/procarray.c (condensed)
proc->procArrayGroupMember = true;
proc->procArrayGroupMemberXid = latestXid;
nextidx = pg_atomic_read_u32(&procglobal->procArrayGroupFirst);
while (true)
{
pg_atomic_write_u32(&proc->procArrayGroupNext, nextidx);
if (pg_atomic_compare_exchange_u32(&procglobal->procArrayGroupFirst,
&nextidx, (uint32) pgprocno))
break;
}
if (nextidx != INVALID_PROC_NUMBER)
{
/* Not the leader: sleep on our semaphore until the leader clears us. */
for (;;) { PGSemaphoreLock(proc->sem); if (!proc->procArrayGroupMember) break; }
return;
}
/* We are the leader. Acquire the lock on behalf of everyone. */
LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
nextidx = pg_atomic_exchange_u32(&procglobal->procArrayGroupFirst, INVALID_PROC_NUMBER);
while (nextidx != INVALID_PROC_NUMBER)
{
PGPROC *nextproc = &allProcs[nextidx];
ProcArrayEndTransactionInternal(nextproc, nextproc->procArrayGroupMemberXid);
nextidx = pg_atomic_read_u32(&nextproc->procArrayGroupNext);
}
LWLockRelease(ProcArrayLock);
/* Then wake every follower (outside the lock, to keep hold time minimal). */

The leader pops the whole list in one pg_atomic_exchange (the comment notes that popping one-at-a-time “could lead to an ABA problem”), runs ProcArrayEndTransactionInternal for each member, then wakes them after releasing the lock — “the system calls we need to perform to wake other processes up are probably much slower than the simple memory writes we did while holding the lock.” A pg_write_barrier() before clearing procArrayGroupMember ensures a woken follower sees its xid already cleared. The net effect: under a commit storm, one lock acquisition clears N transactions instead of N acquisitions.

ProcArrayClearTransaction is a related but distinct operation: after a backend prepares a 2PC transaction, the gxact’s PGPROC stays in the array (still “running”), so the backend only scrubs its own PGPROC fields without removing the xid from the census.

GetSnapshotData: building xmin / xmax / xip in one pass

Section titled “GetSnapshotData: building xmin / xmax / xip in one pass”

GetSnapshotData is the hottest function in the module. It produces the snapshot triple by scanning ProcGlobal->xids[] under ProcArrayLock shared. The structure: derive xmax from latestCompletedXid, seed xmin = xmax, then walk the dense xid array, lowering xmin and appending each running xid to xip.

flowchart TD
  A["GetSnapshotData(snapshot)"] --> B["LWLockAcquire(ProcArrayLock, LW_SHARED)"]
  B --> C{"GetSnapshotDataReuse:<br/>xactCompletionCount<br/>unchanged?"}
  C -- "yes" --> C1["reuse old xmin/xip,<br/>release lock, return"]
  C -- "no" --> D["xmax = latestCompletedXid + 1<br/>xmin = xmax"]
  D --> E{"RecoveryInProgress?"}
  E -- "no (normal)" --> F["for each pgxactoff in 0..numProcs:"]
  F --> G{"xid == Invalid?"}
  G -- "yes" --> F
  G -- "no" --> H{"own slot, or xid &gt;= xmax,<br/>or PROC_IN_VACUUM /<br/>PROC_IN_LOGICAL_DECODING?"}
  H -- "skip" --> F
  H -- "include" --> I["xmin = min(xmin, xid)<br/>xip[count++] = xid<br/>copy subxids if room"]
  I --> F
  E -- "yes (hot standby)" --> J["KnownAssignedXidsGetAndSetXmin<br/>→ fill subxip[], lower xmin"]
  F --> K["read replication_slot_xmin /<br/>_catalog_xmin"]
  J --> K
  K --> L["MyProc->xmin = xmin (if unset)<br/>release lock"]
  L --> M["advance GlobalVis* bounds<br/>(definitely/maybe_needed)"]
  M --> N["RecentXmin = xmin<br/>fill snapshot fields, return"]

Figure 2 — GetSnapshotData control flow. The fast exit is GetSnapshotDataReuse: if no transaction with an xid has completed since this snapshot struct was last filled (the xactCompletionCount tick is unchanged), the previous contents are still correct and only MyProc->xmin is re-published. Otherwise the normal path streams the dense xids[] array; the hot-standby path reads the WAL-reconstructed KnownAssignedXids instead. In both cases the expensive vacuum horizon is not computed here — only the cheap GlobalVis* bounds are nudged.

The core loop, lifted from the source:

// GetSnapshotData — main scan, src/backend/storage/ipc/procarray.c (condensed)
latest_completed = TransamVariables->latestCompletedXid;
xmax = XidFromFullTransactionId(latest_completed);
TransactionIdAdvance(xmax); /* xmax = latestCompletedXid + 1 */
xmin = xmax; /* seed; lowered below */
if (TransactionIdIsNormal(myxid) && NormalTransactionIdPrecedes(myxid, xmin))
xmin = myxid; /* own xid counts toward xmin */
for (int pgxactoff = 0; pgxactoff < numProcs; pgxactoff++)
{
TransactionId xid = UINT32_ACCESS_ONCE(other_xids[pgxactoff]);
if (likely(xid == InvalidTransactionId)) continue; /* no xid: skip */
if (pgxactoff == mypgxactoff) continue; /* not own xid in xip */
if (!NormalTransactionIdPrecedes(xid, xmax)) continue; /* >= xmax: running anyway */
statusFlags = allStatusFlags[pgxactoff];
if (statusFlags & (PROC_IN_LOGICAL_DECODING | PROC_IN_VACUUM))
continue; /* decoding manages xmin separately; vacuum ignored */
if (NormalTransactionIdPrecedes(xid, xmin))
xmin = xid;
xip[count++] = xid; /* add to in-progress array */
/* copy this backend's subxids too, unless anyone has overflowed */
if (!suboverflowed) { /* memcpy proc->subxids.xids into snapshot->subxip */ }
}

Three filters matter. Own xid is excluded from xip (a transaction always sees its own writes) but still pulls xmin down. xid >= xmax transactions began after the snapshot’s cutoff and are treated as running by the xmax boundary, so they need not be listed. PROC_IN_VACUUM is skipped because a lazy vacuum holds no row versions that another snapshot must preserve, so leaving it out of xmin lets vacuum reclaim more aggressively; PROC_IN_LOGICAL_DECODING backends manage their xmin via replication slots instead. Subxids are copied opportunistically into snapshot->subxip until some backend’s cache has overflowed, at which point suboverflowed is set and visibility checks must consult pg_subtrans.

GetSnapshotData does not compute the precise vacuum horizon. Instead it cheaply advances the GlobalVisState bounds (definitely_needed / maybe_needed) using xmin, oldestXid, and the replication-slot xmins read while the lock is held — the lazy-horizon mechanism described next.

ComputeXidHorizons: the four vacuum xmin horizons

Section titled “ComputeXidHorizons: the four vacuum xmin horizons”

The dual of xmin is the oldest xid any backend still cares about — the boundary below which dead tuples are physically removable. Computing it precisely means a second full ProcArray scan, which GetSnapshotData deliberately avoids. ComputeXidHorizons is that scan, run on demand (by vacuum, pruning, and pg_subtrans truncation). It produces not one horizon but a struct of them, because a relation’s removability depends on which backends can see it:

// GlobalVisState — src/backend/storage/ipc/procarray.c
struct GlobalVisState
{
FullTransactionId definitely_needed; /* XIDs >= are considered running */
FullTransactionId maybe_needed; /* XIDs < are not considered running */
};
flowchart TD
  A["ComputeXidHorizons(h)"] --> B["LWLockAcquire(ProcArrayLock, LW_SHARED)"]
  B --> C["seed all horizons with<br/>latestCompletedXid + 1<br/>temp = MyProc->xid (own)"]
  C --> D["read replication_slot_xmin /<br/>_catalog_xmin"]
  D --> E["for each proc in array:"]
  E --> F["xmin = min(proc->xmin, proc->xid)"]
  F --> G["oldest_considered_running =<br/>min(.., xmin)  (incl. vacuum/decoding)"]
  G --> H{"PROC_IN_VACUUM or<br/>PROC_IN_LOGICAL_DECODING?"}
  H -- "yes" --> E
  H -- "no" --> I["shared_oldest_nonremovable =<br/>min(.., xmin)"]
  I --> J{"same DB, or MyDatabaseId unset,<br/>or PROC_AFFECTS_ALL_HORIZONS,<br/>or in recovery?"}
  J -- "yes" --> K["data_oldest_nonremovable =<br/>min(.., xmin)"]
  J -- "no" --> E
  K --> E
  E --> L["release lock<br/>(if recovery: fold in KnownAssignedXids min)"]
  L --> M["apply slot_xmin to shared+data,<br/>slot_catalog_xmin to catalog+shared"]
  M --> N["GlobalVisUpdateApply(h):<br/>publish into GlobalVis* states"]

Figure 3 — ComputeXidHorizons and the four-horizon split. Every backend’s effective interest is min(xmin, xid). A backend in another database does not constrain this database’s data tables, so its xmin only lowers the shared and considered-running horizons, not data. Lazy vacuum and logical decoding are skipped for the removability horizons (they hold no versions others must keep) but still count toward oldest_considered_running, which governs pg_subtrans truncation. Replication-slot xmins back the horizons up last.

The four removability horizons, ordered from most to least conservative:

Horizon (ComputeXidHorizonsResult)Used for
shared_oldest_nonremovableshared catalogs (visible from all databases)
catalog_oldest_nonremovablecatalog / logical-decoding-accessible relations
data_oldest_nonremovableordinary user tables in the current database
temp_oldest_nonremovablethis session’s temp tables (only own xid matters)

A fifth field, oldest_considered_running, is not a removability horizon: it is the oldest xid any backend (including vacuum and decoding) might still consider running, used to decide how far pg_subtrans can be truncated. GetReplicationHorizons exposes shared_oldest_nonremovable_raw (the shared horizon without the slot’s catalog_xmin folded in) so hot-standby feedback can send the data and catalog horizons separately to an upstream primary.

GetOldestNonRemovableTransactionId is the public face that vacuum calls; it runs ComputeXidHorizons and picks the right horizon for the relation kind:

// GetOldestNonRemovableTransactionId — procarray.c
TransactionId
GetOldestNonRemovableTransactionId(Relation rel)
{
ComputeXidHorizonsResult horizons;
ComputeXidHorizons(&horizons);
switch (GlobalVisHorizonKindForRel(rel))
{
case VISHORIZON_SHARED: return horizons.shared_oldest_nonremovable;
case VISHORIZON_CATALOG: return horizons.catalog_oldest_nonremovable;
case VISHORIZON_DATA: return horizons.data_oldest_nonremovable;
case VISHORIZON_TEMP: return horizons.temp_oldest_nonremovable;
}
return InvalidTransactionId;
}

The GlobalVisState pair caches an approximate version of these horizons. Pruning hot paths call GlobalVisTestIsRemovableXid(state, xid): an xid below maybe_needed is definitely removable, one at or above definitely_needed is definitely needed, and only an xid between the two triggers a fresh ComputeXidHorizons to tighten the bounds. GetSnapshotData advances these bounds for free on every snapshot, so the expensive recompute is rare.

KnownAssignedXids: emulating the primary’s active set on a standby

Section titled “KnownAssignedXids: emulating the primary’s active set on a standby”

On a hot-standby replica the local PGPROC array represents standby query backends, “which by definition are not running transactions that have XIDs.” Yet read-only queries there still need snapshots whose xip reflects the primary’s in-flight transactions. PostgreSQL reconstructs that set from the WAL stream into KnownAssignedXids — a sorted array with a parallel validity bitmap:

// KnownAssignedXids design — src/backend/storage/ipc/procarray.c (comment, condensed)
// The XIDs are stored in an array in sorted order --- TransactionIdPrecedes
// order --- to allow binary search. ...
// To keep individual deletions cheap, we allow gaps in the array, marking
// elements valid/invalid via the parallel boolean array KnownAssignedXidsValid[].
// A deletion sets KnownAssignedXidsValid[i]=false *without* clearing the XID,
// preserving sortedness for binary search. Periodically we compress out the
// unused entries.
// Valid items are those with tail <= i < head. When head reaches the array
// end we force compression rather than wrapping.

The startup (WAL-replay) process is the single writer: it appends new ids in xid order as it sees assignment records (RecordKnownAssignedTransactionIds), marks ids invalid as it sees commit/abort records, prunes old ids, and periodically compresses. Standby query backends are readers under shared ProcArrayLock. Append-at-the-right keeps the array sorted for free because “new known XIDs are always reported in XID order.” Deletions only flip a validity bit, so they are O(1) and never disturb sort order; the O(n) compaction is deferred and batched (KnownAssignedXidsCompress, driven by KAXCompressReason).

When GetSnapshotData runs in recovery it takes the standby branch and fills the snapshot directly from this array:

// GetSnapshotData — recovery branch (condensed)
subcount = KnownAssignedXidsGetAndSetXmin(snapshot->subxip, &xmin, xmax);
if (TransactionIdPrecedesOrEquals(xmin, procArray->lastOverflowedXid))
suboverflowed = true;

Because the standby cannot tell top-level xids from subxids in the WAL stream, all known ids are stored together in snapshot->subxip and xip is left empty — “a design choice that greatly simplifies xid processing.” lastOverflowedXid tracks the highest subxid that had to be dropped, so the standby knows when to mark a snapshot suboverflowed and fall back to pg_subtrans. The same ProcArrayLock, latestCompletedXid, and snapshot builder are reused — only the source of the active set changes between normal running and recovery.

Anchor on symbol names, not line numbers. procarray.c is large and reorganizes between releases; function/struct names are the stable handle. Use git grep -n '<symbol>' src/backend/storage/ipc/procarray.c to locate the current position. Line numbers in the table below were observed at commit 273fe94 (REL_18_STABLE) and are hints only.

Shared structures (src/include/storage/proc.h, procarray.c)

Section titled “Shared structures (src/include/storage/proc.h, procarray.c)”
  • struct PGPROC (in proc.h) — the per-backend slot; xid, xmin, pgxactoff, statusFlags, subxidStatus, subxids, and the group-clear fields (procArrayGroupMember, procArrayGroupNext, procArrayGroupMemberXid).
  • typedef struct PROC_HDR (in proc.h) — ProcGlobal: the allProcs[] pool plus the three dense mirror arrays (xids, subxidStates, statusFlags) and the procArrayGroupFirst queue head.
  • struct XidCacheStatus, PGPROC_MAX_CACHED_SUBXIDS (in proc.h) — the 64-id subxid cache and its count / overflowed summary.
  • PROC_IN_VACUUM, PROC_IN_LOGICAL_DECODING, PROC_AFFECTS_ALL_HORIZONS, PROC_VACUUM_STATE_MASK (status-flag macros, in proc.h).
  • typedef struct ProcArrayStruct (in procarray.c) — the sorted pgprocnos[] index, the KnownAssignedXids head/tail bookkeeping, lastOverflowedXid, and the two replication_slot_*_xmin fields.
  • struct GlobalVisState, typedef struct ComputeXidHorizonsResult, enum GlobalVisHorizonKind (in procarray.c) — the cached approximate horizon pair and the precise multi-horizon result.
  • ProcArrayShmemInit — allocate and size the shared array at startup.
  • ProcArrayAdd — insert a backend’s PGPROC index, keeping the array sorted; renumber pgxactoff for the tail.
  • ProcArrayRemove — remove a backend (or a finished 2PC gxact), advancing latestCompletedXid for the live-xid case.
  • ProcArrayEndTransaction — clear an xid at commit/abort; fast no-lock path for xidless xacts, conditional-acquire fast path, group-clear fallback.
  • ProcArrayEndTransactionInternal — the under-lock clear; advances latestCompletedXid and xactCompletionCount.
  • ProcArrayGroupClearXid — lock-free group-clear queue + leader.
  • ProcArrayClearTransaction — scrub own PGPROC after a 2PC prepare without leaving the census.
  • MaintainLatestCompletedXid / MaintainLatestCompletedXidRecovery — bump the global latestCompletedXid that backs xmax.
  • GetSnapshotData — build xmin/xmax/xip in one shared-lock pass; advance GlobalVis*.
  • GetSnapshotDataReuse — the xactCompletionCount fast-path that reuses an unchanged snapshot.
  • GetMaxSnapshotXidCount / GetMaxSnapshotSubxidCount — array sizing for snapmgr.c.
  • ProcArrayInstallImportedXmin / ProcArrayInstallRestoredXmin — adopt an imported xmin while verifying the source xact still runs.
  • ComputeXidHorizons — the precise multi-horizon scan.
  • GetOldestNonRemovableTransactionId — vacuum’s entry point; picks the horizon per relation kind via GlobalVisHorizonKindForRel.
  • GetOldestTransactionIdConsideredRunningpg_subtrans-truncation horizon.
  • GetReplicationHorizons — data + catalog horizons for hot-standby feedback.
  • GlobalVisTestFor / GlobalVisTestIsRemovableXid / GlobalVisCheckRemovableXid — the cached approximate-horizon test on the pruning hot path.
  • TransactionIdIsInProgress — is a specific xid still running (with RecentXmin and single-entry caches up front).
  • TransactionIdIsActive, ProcNumberGetTransactionIds, BackendXidGetPid, GetOldestActiveTransactionId.
  • ProcArraySetReplicationSlotXmin / ProcArrayGetReplicationSlotXmin — publish / read the slot-imposed holdback on the horizon.

Hot standby / KnownAssignedXids (procarray.c)

Section titled “Hot standby / KnownAssignedXids (procarray.c)”
  • ProcArrayInitRecovery / ProcArrayApplyRecoveryInfo / ProcArrayApplyXidAssignment — set up and update the standby active set from WAL.
  • RecordKnownAssignedTransactionIds — note a newly seen primary xid.
  • KnownAssignedXidsAdd / KnownAssignedXidsSearch — append (in sorted order) / binary-search-and-optionally-remove.
  • KnownAssignedXidsGetAndSetXmin — fill a standby snapshot and lower its xmin.
  • KnownAssignedXidsRemovePreceding / ExpireOldKnownAssignedTransactionIds / KnownAssignedXidsCompress — pruning and the deferred O(n) compaction.

Position hints (as of 2026-06-05, REL_18 273fe94)

Section titled “Position hints (as of 2026-06-05, REL_18 273fe94)”
SymbolFileLine
struct PGPROCproc.h176
typedef struct XidCacheStatusproc.h41
PGPROC_MAX_CACHED_SUBXIDSproc.h39
typedef struct PROC_HDRproc.h383
typedef struct ProcArrayStructprocarray.c71
struct GlobalVisStateprocarray.c167
ComputeXidHorizonsResultprocarray.c179
ProcArrayShmemInitprocarray.c418
ProcArrayAddprocarray.c468
ProcArrayRemoveprocarray.c565
ProcArrayEndTransactionprocarray.c667
ProcArrayEndTransactionInternalprocarray.c731
ProcArrayGroupClearXidprocarray.c792
ProcArrayClearTransactionprocarray.c907
MaintainLatestCompletedXidprocarray.c967
ProcArrayApplyRecoveryInfoprocarray.c1054
ProcArrayApplyXidAssignmentprocarray.c1318
TransactionIdIsInProgressprocarray.c1402
ComputeXidHorizonsprocarray.c1735
GetOldestNonRemovableTransactionIdprocarray.c2005
GetOldestTransactionIdConsideredRunningprocarray.c2034
GetReplicationHorizonsprocarray.c2047
GetSnapshotDataReuseprocarray.c2095
GetSnapshotDataprocarray.c2175
ProcArrayInstallImportedXminprocarray.c2531
GlobalVisTestForprocarray.c4107
GlobalVisTestIsRemovableXidprocarray.c4264
GlobalVisCheckRemovableXidprocarray.c4300
RecordKnownAssignedTransactionIdsprocarray.c4403
ExpireOldKnownAssignedTransactionIdsprocarray.c4531
KnownAssignedXidsAddprocarray.c4782
KnownAssignedXidsSearchprocarray.c4886
KnownAssignedXidsGetAndSetXminprocarray.c5127

Each entry leads with a fact about the current source at commit 273fe94 (REL_18_STABLE), readable without any external material. The trailing sentence records how it was checked. Open questions follow as recorded gaps.

  • The active set is a sorted index array plus three dense mirror arrays, not a scan over full PGPROC structs. ProcArrayStruct.pgprocnos[] holds indexes into ProcGlobal->allProcs[], kept sorted by PGPROC address; GetSnapshotData and ComputeXidHorizons stream ProcGlobal->xids[] / subxidStates[] / statusFlags[] indexed by pgxactoff. Verified by reading ProcArrayStruct, PROC_HDR, and the scan loops. The per-PGPROC copies are the writer’s source of truth; proc.h states both copies “need to be maintained coherently.”

  • xmax is latestCompletedXid + 1, advanced under ProcArrayLock exclusive at every commit/abort. GetSnapshotData computes xmax from TransamVariables->latestCompletedXid; ProcArrayEndTransactionInternal advances latestCompletedXid via MaintainLatestCompletedXid inside the exclusive section. Verified by reading both.

  • A read-only (xidless) transaction clears no shared state and takes no lock at commit. ProcArrayEndTransaction’s else branch (when latestXid is invalid) only resets local PGPROC fields, with a comment that it “won’t affect anyone else’s calculation of a snapshot.” Verified in ProcArrayEndTransaction.

  • The commit-time clear escalates to group clearing only when ProcArrayLock cannot be acquired immediately. ProcArrayEndTransaction uses LWLockConditionalAcquire first and falls back to ProcArrayGroupClearXid, which builds a lock-free Treiber stack on ProcGlobal->procArrayGroupFirst (CAS loop), elects the first arrival as leader, and pops the whole list in one pg_atomic_exchange. Verified in both functions; the one-at-a-time pop is explicitly avoided to dodge an ABA problem.

  • GetSnapshotData excludes the backend’s own xid from xip but includes it in xmin, and skips PROC_IN_VACUUM and PROC_IN_LOGICAL_DECODING backends. Verified in the main scan loop: own-slot continue, the xid >= xmax continue, and the statusFlags & (PROC_IN_LOGICAL_DECODING | PROC_IN_VACUUM) continue, with xmin seeded from the own xid before the loop.

  • A snapshot is reused without a scan when xactCompletionCount is unchanged. GetSnapshotDataReuse compares snapshot->snapXactCompletionCount to TransamVariables->xactCompletionCount; on a match it re-publishes MyProc->xmin/RecentXmin and returns the prior contents. Verified in GetSnapshotDataReuse; the counter is bumped in ProcArrayEndTransactionInternal.

  • ComputeXidHorizons produces four removability horizons plus oldest_considered_running, differentiated by relation visibility scope. ComputeXidHorizonsResult carries shared_/catalog_/data_/temp_oldest_nonremovable, shared_oldest_nonremovable_raw, and oldest_considered_running. A backend in another database lowers only the shared (and considered-running) horizons unless MyDatabaseId is unset, PROC_AFFECTS_ALL_HORIZONS is set, or in recovery. Verified in ComputeXidHorizons and GetOldestNonRemovableTransactionId.

  • The expensive horizon is cached approximately as a (maybe_needed, definitely_needed) pair and recomputed only for in-between xids. GlobalVisState holds the two FullTransactionId bounds; GetSnapshotData advances them on every snapshot, and a precise ComputeXidHorizons runs only when a tested xid falls between them. Verified in the GlobalVis* maintenance block of GetSnapshotData and in GlobalVisTestFor.

  • Replication slots back the horizons up but not the snapshot’s xmin directly. replication_slot_xmin / replication_slot_catalog_xmin live in ProcArrayStruct, are read under the lock in both GetSnapshotData and ComputeXidHorizons, and are folded into the GlobalVis* bounds / horizons — not into the snapshot xip. Verified by reading both functions and ProcArraySetReplicationSlotXmin.

  • In hot standby the snapshot is built from KnownAssignedXids, a sorted array with a parallel validity bitmap and deferred compaction. GetSnapshotData’s recovery branch calls KnownAssignedXidsGetAndSetXmin into snapshot->subxip (leaving xip empty); deletions only flip KnownAssignedXidsValid[i], preserving sort order, with O(n) compaction deferred to KnownAssignedXidsCompress. Verified against the design comment and KnownAssignedXidsRemovePreceding / KnownAssignedXidsGetAndSetXmin. The startup process is the sole writer.

  1. Cost of the ProcArrayAdd/Remove memmove on very high connection churn. Each connect/disconnect memmoves pgprocnos[] and all three mirror arrays and renumbers every following backend’s pgxactoffO(numProcs) under two exclusive LWLocks. The inline comment asserts the churn is “much lower than the access to the ProcArray itself,” but the crossover point with thousands of short-lived connections (no pooler) is not characterized here. Investigation path: microbenchmark connect/disconnect storms vs. max_connections, and compare against the snapshot-scan savings the sort buys.

  2. Exact GlobalVis* recompute frequency under mixed workloads. GlobalVisTestShouldUpdate limits how often the precise ComputeXidHorizons runs (keyed on ComputeXidHorizonsResultLastXmin), but the practical hit rate of the (maybe_needed, definitely_needed) window on a real pruning workload — and how often a tested xid lands in the in-between zone — is not measured here. Investigation path: instrument GlobalVisTestIsRemovableXid call/recompute counts under HOT-update-heavy and long-transaction workloads.

  3. lastOverflowedXid precision on a busy standby. On a standby with many subtransaction-overflowing primary transactions, lastOverflowedXid can mark snapshots suboverflowed conservatively, forcing pg_subtrans lookups. Whether ExpireOldKnownAssignedTransactionIds resets it aggressively enough to avoid sustained over-marking is not traced here. Investigation path: read the lastOverflowedXid reset logic alongside a subxid-overflow standby workload.

Beyond PostgreSQL — Comparative Designs & Research Frontiers

Section titled “Beyond PostgreSQL — Comparative Designs & Research Frontiers”

Pointers, not analysis. Each bullet is a handle for a follow-up doc.

  • CUBRID’s active-MVCCID table (cubrid-mvcc.md) — CUBRID tracks the live set with mvcc_active_tran, a bit array over a sliding MVCCID window plus an overflow list and cached low/high watermarks, rather than PostgreSQL’s scan-an-array-of-slots approach. PostgreSQL pays an O(numProcs) scan per snapshot (mitigated by the dense mirror arrays and xactCompletionCount reuse); CUBRID pays a bit-array copy. A side-by-side cost model of “scan slots” vs. “copy a windowed bit array” as a function of connection count and transaction rate is the natural follow-up.

  • Commit-LSN / commit-timestamp visibility (Oracle SCN, SQL Server Hekaton, Spanner TrueTime) — engines that timestamp each commit with a monotonic sequence number and decide visibility by numeric comparison against a snapshot timestamp, avoiding an explicit in-progress list. Database Internals (Petrov, ch. 5) frames this as the commit-order alternative to active-set enumeration. PostgreSQL has a commit_ts module but still drives MVCC visibility from the procarray’s active set; a comparison of where each design pays its cost (snapshot build vs. per-version timestamp storage) would be a focused note.

  • Snapshot scalability rework (PostgreSQL 14, commit 1f51c17c66, Andres Freund) — the change that introduced the dense ProcGlobal->xids[] mirror arrays analyzed here, replacing per-PGPROC field scans. The before/after design and the measured throughput gain on many-core boxes deserve their own historical note; this doc describes the result, not the migration.

  • Distributed / disaggregated active-set tracking (Aurora, Neon, Citus) — cloud-native Postgres derivatives that move the horizon/visibility authority off a single shared-memory segment. How they reconcile a per-node procarray with a global GC horizon (and whether they centralize xmin computation) is a distinct architecture worth a comparative doc once the single-node story here is settled.

  • A Scalable Lock Manager for Multicores (Jung et al., SIGMOD 2013; knowledge/research/dbms-papers/scalable-lock-manager.md) — diagnoses cache-line bouncing on a central latch as the multicore killer. ProcArrayLock is exactly such a central latch; the group-clear queue and shared-mode snapshot reads are PostgreSQL’s mitigations on the procarray side, parallel to the lock manager’s partitioning. A cross-walk of how the same contention lesson shows up in two different subsystems (postgres-lock-manager.md and this doc) is a worthwhile synthesis.

PostgreSQL source (under /data/hgryoo/references/postgres, REL_18 273fe94)

Section titled “PostgreSQL source (under /data/hgryoo/references/postgres, REL_18 273fe94)”
  • src/backend/storage/ipc/procarray.c — the process-array module: shared structure, membership, snapshot building, horizon computation, group XID clearing, and the KnownAssignedXids hot-standby machinery.
  • src/include/storage/proc.hPGPROC, PROC_HDR (ProcGlobal), XidCacheStatus, and the PROC_* status-flag macros.
  • src/include/storage/procarray.h — the module’s public API surface.
  • src/backend/access/transam/README — the locking discipline for setting and clearing a backend’s xid relative to snapshot acquisition (referenced throughout procarray.c).
  • The KnownAssignedXids design comment block in procarray.c itself — sorted array, validity bitmap, deferred compaction, single-writer rule.

Textbook chapters (under knowledge/research/dbms-general/)

Section titled “Textbook chapters (under knowledge/research/dbms-general/)”
  • Database System Concepts (Silberschatz et al., 7e), ch. 18 — §18.6 multiversion schemes, §18.7 snapshot isolation.
  • Database Internals (Petrov), ch. 5 — concurrency-control families; MVCC visibility vs. mutual exclusion; commit-order alternatives.

Papers (under knowledge/research/dbms-papers/)

Section titled “Papers (under knowledge/research/dbms-papers/)”
  • scalable-lock-manager.md — Jung et al., A Scalable Lock Manager for Multicores, SIGMOD 2013 (cited for the central-latch contention frontier).

Sibling docs (cross-references, mechanism not duplicated here)

Section titled “Sibling docs (cross-references, mechanism not duplicated here)”
  • postgres-mvcc-snapshots.md — the visibility rules that consume the snapshot triple this module produces. This doc owns the procarray internals; that doc references them rather than re-deriving them.
  • postgres-architecture-overview.md — the shared-memory spine in which the procarray and ProcGlobal arrays live.
  • postgres-lock-manager.md — the other major PGPROC-anchored shared structure and a parallel study in central-latch contention.