PostgreSQL Cumulative Statistics — The PG15 Shared-Memory Stats Subsystem

Contents:

Theoretical Background
Common DBMS Design
PostgreSQL’s Approach
Source Walkthrough
Source verification (as of 2026-06-05)
Beyond PostgreSQL — Comparative Designs & Research Frontiers
Sources

Theoretical Background

Every production DBMS keeps two distinct kinds of introspection state, and conflating them is a classic design error. The first is the live view: what is each session doing right now — which query, which wait event, how far through a VACUUM. The second is the cumulative view: what has happened over time — how many sequential scans this table has served, how many tuples were inserted, how many dead tuples accumulated since the last vacuum, how many WAL bytes were written. The live view is a snapshot that is meaningless a millisecond later; the cumulative view is a set of monotonically growing counters whose value is in the accumulation. This document is about the cumulative view — the subsystem behind the pg_stat_* family of views and, just as importantly, behind autovacuum’s decision of which table to vacuum next.

Database System Concepts (Silberschatz, 7e) frames the role of these counters in two places. In query optimization (ch. 16, “Query Optimization”), the catalog statistics — relation cardinalities, tuple sizes, value distributions — are the raw input to the cost model; without them the optimizer is blind. The textbook is careful to distinguish those planner statistics (gathered by ANALYZE, stored in pg_statistic, used to estimate selectivities) from the activity counters this document covers. They are different beasts: planner statistics are sampled estimates of data shape, refreshed occasionally; cumulative activity counters are exact event tallies, updated on every tuple touched. The overlap is that both feed autovacuum — the dead-tuple counter (cumulative) crossing a threshold relative to the live-tuple estimate (planner-ish) is what schedules an autovacuum, which is the engine’s self-maintenance loop (DSC ch. 25 §25.3 on the storage manager’s housekeeping).

The deeper architectural question the textbook raises only implicitly is where the authoritative copy of a shared counter lives, and who is allowed to write it. A counter like “tuples inserted into table T” is incremented by every backend that inserts into T, and read by any backend running SELECT * FROM pg_stat_user_tables. That is a classic many-writers/many-readers shared-state problem. The naive answer — one global mutable counter per object, locked on every increment — does not scale: the lock on a hot table’s counter would become a bottleneck worse than the data page itself. The Architecture of a Database System (Hellerstein, Stonebraker & Hamilton 2007, §6 on shared components and §2 on process models) names the governing tension directly: shared introspection state must be cheap enough to update that instrumentation does not perturb the thing being measured (the observer effect), yet consistent enough that the numbers are trustworthy. The standard resolution, which PostgreSQL adopts, is local accumulation with periodic aggregation: each writer keeps a private running tally and only occasionally folds it into the shared total. The shared total is then slightly stale but contention-free on the hot path; the staleness window is a tunable knob.

There is a fourth, quieter force that the textbook underplays but that dominates the implementation: transactional correctness of the counts. A row inserted in a transaction that later aborts did happen (work was done, a dead tuple exists) but did not logically add a live row. So tuple counts cannot simply be incremented at insert time — they must be staged per (sub)transaction and resolved at commit/abort, with the “attempted action” counts (which include aborted work) kept separate from the “net effect” counts (live/dead-tuple deltas that depend on the outcome). This is why PostgreSQL’s relation stats carry both a nontransactional PgStat_TableCounts and a stack of per-subxact PgStat_TableXactStatus records, and why the existence of a stats entry for a freshly-created table is itself transactional. A cumulative-stats system that ignored this would report phantom rows after every rollback.

Four design forces therefore shape any cumulative-stats subsystem, and they recur in PostgreSQL’s source:

Write amplification vs. accuracy. Folding local tallies into shared memory every N events or every T milliseconds trades freshness for throughput. PostgreSQL flushes at most once per second on the idle path and unconditionally at commit.
Object lifetime vs. counter lifetime. A table can be dropped while a backend is still mid-flush of counts for it, or while a pg_stat snapshot is being read. The counter’s storage must outlive the object just long enough for all observers to let go — a reference-counting problem.
Durability across restart. Cumulative counters are expensive to recompute (you cannot re-derive “lifetime sequential scans” from the data), so they are serialized to disk at shutdown and reloaded — but discarded after a crash, because a crash means some in-flight updates were lost and the totals can no longer be trusted.
Transactional resolution. Net-effect counts (live/dead tuples) must wait for commit/abort; attempted-action counts accrue regardless. The stats entry’s very lifetime is transactional — created on CREATE, dropped on DROP, both reversible by rollback and replayable on standbys via the commit/abort WAL record.

Common DBMS Design

The textbook gives the forces; this section names the engineering conventions production engines use to resolve them, so PostgreSQL’s specific choices read as one point in a known design space.

Per-worker accumulation buffers. Virtually every high-throughput engine gives each thread or process a private block of counters that it increments with plain (unlocked) arithmetic, then merges into a global aggregate on a slow cadence. Oracle’s V$ fixed tables, SQL Server’s DMVs, and MySQL’s performance_schema all rest on per-thread instrument buffers aggregated on read or on a timer. The merge is the only synchronized operation, so the cost of instrumentation on the hot path is a single non-atomic increment. PostgreSQL’s pending entries are exactly this buffer.

A keyed registry of shared counters. The shared aggregate is rarely a flat array — the set of objects (tables, functions) is dynamic, so it is a hashtable keyed by an object identifier. The registry must support concurrent lookup, insert (first time an object is touched), and delete (object dropped), which pushes designs toward partitioned/striped hash tables with per-partition locks rather than one global lock. PostgreSQL uses dshash, a dynamic-shared-memory hashtable with partitioned locking.

Reference-counted entry lifetime. Once counters live in a shared registry and can be dropped concurrently with being read or updated, the entry needs a lifetime independent of its presence in the hashtable. The canonical pattern is a refcount: the entry carries a count of outstanding users, a “logically deleted” tombstone flag, and is physically freed only when the count hits zero. This is the same pattern as Linux’s dput() dentry cache, Java’s PhantomReference-based cleanup, or any read-copy- update scheme. The subtlety is the reuse race: between a backend deciding to free an entry and actually freeing it, the slot can be reused for a new object with the same key (OID wraparound, replication-slot index reuse). A monotonic generation/epoch counter on the entry lets a late releaser detect “the thing I was about to free is not the thing that’s here now” and back off.

Snapshot isolation for readers. A reader scanning many counters wants a consistent picture, not a torn read where table A’s count is from time t1 and table B’s is from t2. Engines offer a materialized snapshot mode (copy everything once, read from the copy) for consistency, and a direct mode for freshness. PostgreSQL exposes this as the stats_fetch_consistency GUC with three settings (none, cache, snapshot).

Durability on the slow path. Because the totals are pure accumulation, they are flushed to disk lazily — typically by a dedicated background process at checkpoint or shutdown, never on the transaction commit path, since losing a few seconds of stats on a crash is acceptable (the engine discards all of them on crash anyway). This is the opposite of WAL’s synchronous discipline: stats durability is best-effort by design.

Single-writer fast paths. A second category of counter has exactly one writer — the checkpointer writes checkpointer stats, the archiver writes archiver stats, each backend writes only its own IO/WAL tallies. For these the engine can skip locking on the write side entirely and use a seqlock-style changecount protocol instead: the writer brackets its update between an odd→even pair of increments to a version counter with memory barriers, and a reader retries if it observed an odd (mid-write) count or saw the count change across the read. This is the classic optimistic-read pattern (Linux seqlocks, the RCU read side) and it makes the write — the performance-critical direction for these high-frequency counters — barrier-cheap and lock-free, at the cost of an occasional reader retry. PostgreSQL’s fixed-numbered stats use exactly this; the variable-numbered (per-object) stats instead take the entry’s LWLock, because they have many writers.

flowchart TD
  subgraph backend["Each backend (process-local)"]
    hot["hot path: pgstat_count_heap_insert etc.<br/>plain increments into pending entry"]
    pend["pending entry<br/>PgStat_EntryRef->pending"]
    lhash["local lookup hashtable<br/>pgStatEntryRefHash"]
    hot --> pend
    lhash -. caches refs to .-> pend
  end
  subgraph shmem["Shared memory (all backends)"]
    dshash["dshash table<br/>key = PgStat_HashKey<br/>entry = PgStatShared_HashEntry"]
    body["stats body in DSA<br/>PgStatShared_Relation etc."]
    fixed["fixed-numbered block<br/>PgStat_ShmemControl<br/>archiver/bgwriter/io/wal/slru"]
    dshash -- dsa_pointer body --> body
  end
  pend -- "pgstat_report_stat()<br/>flush, <= 1/sec or at commit" --> body
  lhash -- "miss: dshash_find_or_insert" --> dshash
  disk["pg_stat/pgstat.stat<br/>on-disk file"]
  body -. "checkpointer writes at shutdown" .-> disk
  fixed -. "checkpointer writes at shutdown" .-> disk
  disk -. "startup process reads<br/>(discard after crash)" .-> body

PostgreSQL’s Approach

Before PostgreSQL 15, cumulative stats lived in a single dedicated process, the stats collector. Backends serialized their counter deltas into UDP datagrams and fired them at the collector over a loopback socket; the collector owned the only in-memory copy of every counter and periodically wrote a snapshot file that readers slurped in whole. That design had three chronic problems: UDP packets could be dropped under load (silently losing counts), the single collector was a throughput ceiling, and every pg_stat read deserialized an entire file. PG15 (commit 5891c7a8e, Andres Freund) replaced it with the shared-memory cumulative statistics system documented here. The header comment in pgstat.c states the architecture directly:

// pgstat.c (file header) — src/backend/utils/activity/pgstat.c
// Fixed-numbered stats are stored in plain (non-dynamic) shared memory.
//
// Statistics for variable-numbered objects are stored in dynamic shared
// memory and can be found via a dshash hashtable. The statistics counters
// are not part of the dshash entry (PgStatShared_HashEntry) directly, but
// are separately allocated (PgStatShared_HashEntry->body)...
//
// To avoid contention on the shared hashtable, each backend has a
// backend-local hashtable (pgStatEntryRefHash) in front of the shared
// hashtable, containing references (PgStat_EntryRef) to shared hashtable
// entries.

Two storage classes, one façade. Every kind of statistic is described by a PgStat_KindInfo row in the static pgstat_kind_builtin_infos[] table. The fixed_amount flag splits the world in two. Fixed-numbered kinds (there is exactly one archiver, one bgwriter, one IO struct, one WAL struct, one SLRU array) live as embedded members of the single PgStat_ShmemControl struct in plain shared memory. Variable-numbered kinds (one entry per relation, per function, per replication slot, per subscription, per backend) live in the dshash. The kind table is the dispatch vtable for the whole subsystem:

// pgstat_kind_builtin_infos[] — src/backend/utils/activity/pgstat.c
[PGSTAT_KIND_RELATION] = {
    .name = "relation",
    .fixed_amount = false,
    .write_to_file = true,
    .shared_size = sizeof(PgStatShared_Relation),
    .shared_data_off = offsetof(PgStatShared_Relation, stats),
    .shared_data_len = sizeof(((PgStatShared_Relation *) 0)->stats),
    .pending_size = sizeof(PgStat_TableStatus),
    .flush_pending_cb = pgstat_relation_flush_cb,
    .delete_pending_cb = pgstat_relation_delete_pending_cb,
},

PGSTAT_KIND_RELATION is 2; built-in kinds run from PGSTAT_KIND_BUILTIN_MIN (= PGSTAT_KIND_DATABASE, 1) upward, and custom extension kinds occupy PGSTAT_KIND_CUSTOM_MIN (24) and above. The three shared_* offsets let the generic code in pgstat.c and pgstat_shmem.c slice the right sub-range out of a variable-size entry without knowing the kind’s concrete struct — pgstat_get_entry_data() just adds shared_data_off to the header pointer.

The shared hash entry is a thin handle, not the data. A dshash entry is fixed-size, but stats bodies vary (a relation entry is far larger than a subscription entry). So the dshash stores PgStatShared_HashEntry — key, flags, refcount, generation, and a dsa_pointer to the real body — and the body is allocated separately in DSA. The separation also means the dshash never has to be resized when a new kind is added.

// PgStatShared_HashEntry — src/include/utils/pgstat_internal.h
typedef struct PgStatShared_HashEntry
{
    PgStat_HashKey key;        /* {kind, dboid, objid} */
    bool        dropped;       /* logically deleted; no new refs allowed */
    pg_atomic_uint32 refcount; /* entry lifetime, not dshash-entry lifetime */
    pg_atomic_uint32 generation; /* bumped on reinit; detects reuse races */
    dsa_pointer body;          /* -> PgStatShared_Common + kind-specific stats */
} PgStatShared_HashEntry;

The key PgStat_HashKey is {PgStat_Kind kind, Oid dboid, uint64 objid}. For a per-relation entry, dboid is the database OID (or InvalidOid for shared catalogs) and objid is the table OID. The body always begins with a PgStatShared_Common header carrying a magic cross-check and a per-entry LWLock — so two backends flushing counts for two different tables never contend, and a flusher and a reader of the same table serialize only on that one table’s lock.

Notice the field that is not in the dshash entry: the actual counters. That choice is deliberate and worth dwelling on. A PgStatShared_Relation body is dozens of int64 counters plus timestamps; a PgStatShared_Subscription body is a handful. If every kind’s data were inlined into the dshash entry, the entry size would have to be the maximum over all kinds, wasting space for the small kinds and forcing a dshash resize whenever a larger kind is introduced. By storing only a dsa_pointer and allocating the body separately with the kind’s exact shared_size, the dshash stays kind-agnostic and right-sized. The price is one pointer indirection (dsa_get_address(body)) per access — which the PgStat_EntryRef caches away as a resolved local pointer, so it is paid once per backend per object, not once per update.

The fixed-numbered kinds avoid the dshash entirely. Their data is embedded directly in PgStat_ShmemControl, reached by shared_ctl_off, and they carry the seqlock changecount described above rather than an LWLock on the write path:

// PgStatShared_BgWriter / PgStat_ShmemControl — src/include/utils/pgstat_internal.h
typedef struct PgStatShared_BgWriter
{
    LWLock      lock;            /* protects reset_offset + reset_timestamp */
    uint32      changecount;     /* seqlock: odd = write in progress */
    PgStat_BgWriterStats stats;
    PgStat_BgWriterStats reset_offset;
} PgStatShared_BgWriter;

typedef struct PgStat_ShmemControl
{
    void       *raw_dsa_area;
    dshash_table_handle hash_handle;       /* the variable-numbered hash */
    bool        is_shutdown;
    pg_atomic_uint64 gc_request_count;     /* GC epoch */
    PgStatShared_Archiver archiver;        /* fixed kinds embedded inline */
    PgStatShared_BgWriter bgwriter;
    PgStatShared_Checkpointer checkpointer;
    PgStatShared_IO io;
    PgStatShared_SLRU slru;
    PgStatShared_Wal wal;
    void       *custom_data[PGSTAT_KIND_CUSTOM_SIZE];
} PgStat_ShmemControl;

Local pending + the reference cache. A backend almost never touches the dshash on the hot path. When a relation is first modified, the backend obtains a PgStat_EntryRef — a local handle that bundles a pointer to the shared PgStatShared_HashEntry, a resolved local pointer to the body (avoiding repeated dsa_get_address()), the body’s generation at acquire time, and a pending pointer to a process-local scratch buffer. These refs are cached in pgStatEntryRefHash, a backend-local simplehash. The hot path (pgstat_count_heap_insert, etc.) just bumps integers in the pending buffer. The transactional dance for tuple counts adds a layer: inserts/updates/deletes accumulate in a per-subtransaction PgStat_TableXactStatus so they can be rolled back, and only fold into the backend’s PgStat_TableStatus.counts at (sub)transaction end.

flowchart TD
  start["DML on table T<br/>pgstat_count_heap_insert(rel, n)"] --> chk{"pgstat_should_count_relation?"}
  chk -- no --> done1["return (untracked)"]
  chk -- yes --> ensure["ensure_tabstat_xact_level<br/>push PgStat_TableXactStatus<br/>for current nest level"]
  ensure --> bump["trans->tuples_inserted += n"]
  bump --> commitq{"(sub)xact end?"}
  commitq -- subxact commit --> up["fold counts into parent<br/>AtEOSubXact_PgStat_Relations"]
  commitq -- top commit --> fold["AtEOXact_PgStat_Relations<br/>fold trans -> counts<br/>derive delta_live/dead_tuples"]
  commitq -- abort --> abort["restore truncdrop counters<br/>inserted tuples become dead"]
  fold --> later["next pgstat_report_stat()"]
  later --> flush["pgstat_relation_flush_cb<br/>lock entry, add counts to shared,<br/>also bump database pending entry"]
  flush --> shared["PgStatShared_Relation.stats<br/>in DSA"]

Flush cadence. pgstat_report_stat(force) is the single drain point. It fast-exits if the pending list is empty and no fixed-numbered kind asked for a flush. Otherwise, unless forced, it throttles: it will not flush more often than PGSTAT_MIN_INTERVAL (1000 ms), and if updates have been pending longer than PGSTAT_MAX_INTERVAL (60000 ms) it forces a blocking flush. On the non-forced path it acquires per-entry locks with nowait and simply leaves un-flushable entries on the pending list for the next round, returning a suggested idle timeout (PGSTAT_IDLE_INTERVAL, 10000 ms). Commit, backend exit, and pg_stat_force_next_flush() force it.

Lifetime is reference-counted and the drop is transactional. When an object is dropped, the stats entry cannot be freed immediately — other backends may hold refs or be reading a snapshot. pgstat_init_entry() seeds refcount = 1 (a sentinel meaning “not dropped”); each backend that acquires a ref adds 1 more. A drop sets dropped = true and subtracts the sentinel; the body is dsa_free()d only when the count reaches 0, which can happen long after the drop, in whichever backend happens to release the last ref. Because OID/index reuse can resurrect a key, generation is bumped on reinit and a late releaser compares the entry’s current generation against the one it captured. And because the decision to drop is itself transactional (a DROP TABLE that rolls back must not drop the stats), pgstat_drop_transactional() only queues the drop; AtEOXact_PgStat_DroppedStats() executes it at commit and the drop is recorded in the commit/abort WAL record so standbys and crash recovery converge. The rest of this is traced symbol-by-symbol below.

Source Walkthrough

Shared-memory bootstrap and attach

StatsShmemSize() reserves sizeof(PgStat_ShmemControl) plus a fixed 256 KB DSA seed plus space for any custom fixed kinds. StatsShmemInit() runs once in the postmaster: it creates a DSA in place inside the reserved shared segment (the postmaster cannot use dynamic shared-memory segments), then — with the DSA size temporarily clamped to its initial size so the structure stays in plain shared memory — creates the dshash and records its handle. It then walks the kind table and calls each fixed-numbered kind’s init_shmem_cb.

// StatsShmemInit() — src/backend/utils/activity/pgstat_shmem.c
ctl->raw_dsa_area = p;
p += MAXALIGN(pgstat_dsa_init_size());
dsa = dsa_create_in_place(ctl->raw_dsa_area, pgstat_dsa_init_size(),
                          LWTRANCHE_PGSTATS_DSA, NULL);
dsa_pin(dsa);
/* clamp so dshash header lands in plain shared memory */
dsa_set_size_limit(dsa, pgstat_dsa_init_size());
dsh = dshash_create(dsa, &dsh_params, NULL);
ctl->hash_handle = dshash_get_hash_table_handle(dsh);
dsa_set_size_limit(dsa, -1);   /* lift limit */

Each backend then runs pgstat_attach_shmem() from pgstat_initialize() (called in BaseInit()): it attaches the DSA and dshash in place and pins the mapping for the backend’s lifetime. gc_request_count is initialized to 1 so the local-ref age comparison has a non-zero baseline. The matching pgstat_detach_shmem() first releases all local entry refs, then detaches — a process must not exit holding refs, or shared bodies could never be freed.

Acquiring a reference: pgstat_get_entry_ref

This is the funnel every stats access flows through. pgstat_get_entry_ref(kind, dboid, objid, create, created_entry) first checks the backend-local cache via pgstat_get_entry_ref_cached(), which always inserts a local hashtable slot first (so a cache miss costs one lookup, and so refcount increments never race with an out-of-memory error). On a local hit it returns immediately, touching no shared state. On a miss it probes the dshash with a shared lock (dshash_find), and only on a creating-and-absent path takes the insert path:

// pgstat_get_entry_ref() — src/backend/utils/activity/pgstat_shmem.c
shhashent = dshash_find(pgStatLocal.shared_hash, &key, false);
if (create && !shhashent)
{
    shhashent = dshash_find_or_insert(pgStatLocal.shared_hash, &key, &shfound);
    if (!shfound)
    {
        shheader = pgstat_init_entry(kind, shhashent);
        ...
        pgstat_acquire_entry_ref(entry_ref, shhashent, shheader);
        if (created_entry != NULL) *created_entry = true;
        return entry_ref;
    }
}

pgstat_init_entry() is where a fresh entry is born: it sets refcount = 1 (the not-dropped sentinel) and generation = 0, DSA_ALLOC_ZEROs the kind’s shared_size, stamps the 0xdeadbeef magic, and initializes the per-entry LWLock. pgstat_acquire_entry_ref() then adds the caller’s reference (a second increment), releases the dshash partition lock, and snapshots the entry’s generation into the local ref. The crucial property: the refcount is incremented while holding only a shared dshash lock — it is atomic precisely because many backends acquire references concurrently under shared locks.

If the found entry is dropped but create is true, pgstat_reinit_entry() resurrects it: it re-adds the sentinel refcount, bumps generation (so any backend still holding a stale ref to the old incarnation will notice), clears dropped, and zeroes the data.

The local cache (pgstat_get_entry_ref_cached()) deserves a closer look because its ordering is subtle. It inserts the local hashtable slot before touching shared memory, for two reasons stated in the code: it avoids a second hashtable lookup on a miss, and it sidesteps having to unwind a shared refcount increment if a later allocation throws out-of-memory. A freshly inserted slot has entry_ref->shared_stats == NULL, which the function reports as “not found” so the caller proceeds to the shared lookup; a populated slot is returned as a hit with assertions that its body magic is intact and refcount is positive. The local refs are allocated in pgStatSharedRefContext and the cache hashtable in pgStatEntryRefHashContext, both children of TopMemoryContext, so they survive across transactions — a backend that touches a table once keeps a cheap path to it for its whole life.

The vacuum/analyze path bypasses the pending buffer entirely. pgstat_report_vacuum() and pgstat_report_analyze() write directly to the shared entry under its LWLock (via pgstat_get_entry_ref_locked()), because these are infrequent, already-expensive operations and their results — last_vacuum_time, dead-tuple resets, ins_since_vacuum = 0 — should be visible immediately rather than waiting for a flush:

// pgstat_report_vacuum() — src/backend/utils/activity/pgstat_relation.c
entry_ref = pgstat_get_entry_ref_locked(PGSTAT_KIND_RELATION, dboid, tableoid, false);
shtabentry = (PgStatShared_Relation *) entry_ref->shared_stats;
tabentry = &shtabentry->stats;
tabentry->live_tuples = livetuples;
tabentry->dead_tuples = deadtuples;
tabentry->ins_since_vacuum = 0;
if (AmAutoVacuumWorkerProcess()) { tabentry->autovacuum_count++; ... }
pgstat_unlock_entry(entry_ref);

The pending buffer and the flush

pgstat_prep_pending_entry() lazily allocates the kind’s pending_size scratch buffer in pgStatPendingContext and links the ref onto the global pgStatPending dlist. The actual hot-path counters for relations go through pgstat_count_heap_insert() and friends in pgstat_relation.c, which route into the per-subxact PgStat_TableXactStatus:

// pgstat_count_heap_insert() — src/backend/utils/activity/pgstat_relation.c
if (pgstat_should_count_relation(rel))
{
    PgStat_TableStatus *pgstat_info = rel->pgstat_info;
    ensure_tabstat_xact_level(pgstat_info);
    pgstat_info->trans->tuples_inserted += n;
}

At top-level commit, AtEOXact_PgStat_Relations() folds the transactional counts into the nontransactional counts, deriving the live/dead-tuple deltas that autovacuum keys off — inserts add a live tuple, updates and deletes each make a dead one, and every action counts as a change event:

// AtEOXact_PgStat_Relations() — src/backend/utils/activity/pgstat_relation.c
tabstat->counts.delta_live_tuples +=
    trans->tuples_inserted - trans->tuples_deleted;
tabstat->counts.delta_dead_tuples +=
    trans->tuples_updated + trans->tuples_deleted;
tabstat->counts.changed_tuples +=
    trans->tuples_inserted + trans->tuples_updated + trans->tuples_deleted;

pgstat_report_stat() then drains the pending list. Its throttle and fast-exit are the performance heart of the subsystem:

// pgstat_report_stat() — src/backend/utils/activity/pgstat.c
/* Don't expend a clock check if nothing to do */
if (dlist_is_empty(&pgStatPending) && !pgstat_report_fixed)
    return 0;
...
nowait = !force;
partial_flush = false;
partial_flush |= pgstat_flush_pending_entries(nowait);
if (pgstat_report_fixed)
    for (kind = PGSTAT_KIND_MIN; kind <= PGSTAT_KIND_MAX; kind++)
        if (kind_info->flush_static_cb)
            partial_flush |= kind_info->flush_static_cb(nowait);

pgstat_flush_pending_entries() walks the dlist, calls each kind’s flush_pending_cb, and removes the entry from the pending list only if the flush succeeded; with nowait a lock-contended entry stays pending and the function returns have_pending = true. For relations, pgstat_relation_flush_cb() short-circuits if counts is all zeros (pg_memory_is_all_zeros — e.g. an index the planner opened but never scanned), takes the entry’s LWLock, adds every counter into the shared PgStat_StatTabEntry, clamps live/dead tuples non-negative, and — the cross-kind twist — also bumps the pending database entry so pg_stat_database stays consistent without a second pass:

// pgstat_relation_flush_cb() — src/backend/utils/activity/pgstat_relation.c
if (pg_memory_is_all_zeros(&lstats->counts, sizeof(struct PgStat_TableCounts)))
    return true;
if (!pgstat_lock_entry(entry_ref, nowait))
    return false;
tabentry = &shtabstats->stats;
tabentry->numscans += lstats->counts.numscans;
... /* all counters folded in */
pgstat_unlock_entry(entry_ref);
/* same counts also roll up to the database entry */
dbentry = pgstat_prep_database_pending(dboid);
dbentry->tuples_inserted += lstats->counts.tuples_inserted;

Reading: pgstat_fetch_entry and snapshot modes

pgstat_fetch_entry() is the read façade behind pg_stat_get_*. Its behaviour is governed by stats_fetch_consistency. In SNAPSHOT mode it first calls pgstat_build_snapshot(), which seq-scans the whole dshash once and copies every relevant entry into a local pgStatLocal.snapshot.stats simplehash (skipping dropped entries and other databases’ non-shared stats), giving the transaction a frozen view. In CACHE mode it copies each entry on first access and memoizes it; in NONE mode it copies fresh every time into the caller’s context:

// pgstat_fetch_entry() — src/backend/utils/activity/pgstat.c
entry_ref = pgstat_get_entry_ref(kind, dboid, objid, false, NULL);
if (entry_ref == NULL || entry_ref->shared_entry->dropped) { ... return NULL; }
...
(void) pgstat_lock_entry_shared(entry_ref, false);
memcpy(stats_data, pgstat_get_entry_data(kind, entry_ref->shared_stats),
       kind_info->shared_data_len);
pgstat_unlock_entry(entry_ref);

pgstat_build_snapshot() acquires each body’s LWLock directly (it has no PgStat_EntryRef, so it cannot use pgstat_lock_entry_shared) and copies under it, asserting the entry’s refcount is still positive — the snapshot scan relies on the dshash seq lock to keep entries alive during the copy. The fixed-numbered kinds are snapshotted by their snapshot_cb, which copies under the seqlock retry loop (pgstat_copy_changecounted_stats()) rather than a lock, mirroring the single-writer write side.

Fixed-numbered flush: the report-fixed flag

Variable-numbered kinds announce pending work simply by being on the pgStatPending list. Fixed-numbered kinds have no such list — each backend accumulates, say, its IO stats in a process-local struct — so they signal pgstat_report_stat() through the global pgstat_report_fixed boolean. When a backend records an IO or WAL event it sets the flag; on the next pgstat_report_stat() the loop over kinds calls each flush_static_cb (pgstat_io_flush_cb, pgstat_wal_flush_cb, pgstat_slru_flush_cb), which fold the local tallies into the embedded shared struct under the seqlock. The flag is what lets the common “nothing happened” call return in two comparisons:

// pgstat_report_stat() fast exit — src/backend/utils/activity/pgstat.c
if (dlist_is_empty(&pgStatPending) && !pgstat_report_fixed)
    return 0;

Only pgstat_report_stat() is allowed to clear pgstat_report_fixed, after a full round of flushes succeeded — the per-kind callbacks must never reset it, or a partially-flushed round under lock contention would lose the signal that more work remains.

Dropping and reference-counted reclamation

pgstat_drop_entry_internal() is the reclamation core. It sets dropped = true, then subtracts the sentinel refcount; if that reaches 0 (no backend holds a ref) it frees the body immediately, otherwise it leaves the tombstoned entry in the dshash for the last referrer to reap:

// pgstat_drop_entry_internal() — src/backend/utils/activity/pgstat_shmem.c
shent->dropped = true;
/* release refcount marking entry as not dropped */
if (pg_atomic_sub_fetch_u32(&shent->refcount, 1) == 0)
{
    pgstat_free_entry(shent, hstat);
    return true;
}
else
{
    if (!hstat) dshash_release_lock(pgStatLocal.shared_hash, shent);
    return false;   /* could not free yet */
}

The symmetric release path is pgstat_release_entry_ref(), run when a backend lets go of a local ref (at gc, detach, or explicit drop). It decrements the refcount; if it was the last one (fetch_sub returns 1) it re-finds the entry under an exclusive dshash lock and — only if the generation still matches — frees it. If the generation moved, the slot was reinitialized for a new object while this backend was releasing, so it just drops the lock and leaves the new incarnation alone:

// pgstat_release_entry_ref() — src/backend/utils/activity/pgstat_shmem.c
if (pg_atomic_fetch_sub_u32(&entry_ref->shared_entry->refcount, 1) == 1)
{
    shent = dshash_find(pgStatLocal.shared_hash, &entry_ref->shared_entry->key, true);
    if (pg_atomic_read_u32(&entry_ref->shared_entry->generation) == entry_ref->generation)
        pgstat_free_entry(shent, NULL);     /* same incarnation: safe to free */
    else
        dshash_release_lock(pgStatLocal.shared_hash, shent);  /* reused: back off */
}

When a drop could not free the body (some backend held a ref), pgstat_drop_entry() / pgstat_drop_matching_entries() call pgstat_request_entry_refs_gc(), which bumps the shared gc_request_count epoch. On its next pgstat_get_entry_ref(), every backend notices pgStatSharedRefAge != gc_request_count and runs pgstat_gc_entry_refs(), releasing any local refs whose entry is now dropped or whose generation drifted — which can be the release that finally hits refcount 0 and frees the body.

flowchart TD
  drop["DROP TABLE commits<br/>pgstat_drop_entry_internal"] --> tomb["dropped = true<br/>refcount -= 1 (sentinel)"]
  tomb --> z{"refcount == 0?"}
  z -- yes --> free1["pgstat_free_entry<br/>dshash_delete + dsa_free"]
  z -- no --> keep["leave tombstone<br/>pgstat_request_entry_refs_gc<br/>bump gc_request_count"]
  keep --> epoch["other backends:<br/>gc_request_count changed"]
  epoch --> gc["pgstat_gc_entry_refs<br/>release stale local refs"]
  gc --> rel["pgstat_release_entry_ref<br/>refcount -= 1"]
  rel --> z2{"last ref AND<br/>generation matches?"}
  z2 -- yes --> free2["pgstat_free_entry"]
  z2 -- no, reused --> backoff["release lock, keep entry"]

Transactional drop and crash safety

Object drops are transactional. pgstat_drop_transactional() (and its create twin) just push a PgStat_PendingDroppedStatsItem onto the current subxact’s pending_drops list via create_drop_transactional_internal(). AtEOXact_PgStat_DroppedStats() resolves them: on commit it executes the drops, on abort it executes the creates’ undo (drop the just-created entry). Subtransaction abort/commit propagates the list up or discards it via AtEOSubXact_PgStat_DroppedStats(). The same item set is handed to the commit/abort WAL record through pgstat_get_transactional_drops(), and pgstat_execute_transactional_drops() replays them during recovery and 2PC finish — that is what keeps a standby’s stats from accumulating orphan entries for tables dropped on the primary.

Durability

pgstat_before_server_shutdown() (run by the checkpointer at clean shutdown) forces a final flush and calls pgstat_write_statsfile(), which seq-scans the dshash and writes every write_to_file kind plus all fixed kinds to pg_stat/pgstat.stat, tagged with PGSTAT_FILE_FORMAT_ID. At startup the startup process calls pgstat_restore_stats() → pgstat_read_statsfile(). After a crash, pgstat_discard_stats() unlinks the file and resets everything — crash-surviving cumulative numbers would be untrustworthy, so they are deliberately thrown away.

The statsfile format is a flat self-describing stream: a format-ID header, then per-entry records tagged 'F' (fixed kind), 'S' (hash-keyed variable entry), or 'N' (name-keyed entry, used for replication-slot stats whose runtime objid is a slot index that is not stable across restart — the slot name is serialized instead and re-resolved on load via the kind’s from_serialized_name callback), terminated by 'E'. Only kinds with write_to_file = true are persisted; PGSTAT_KIND_BACKEND is explicitly not written, because per-backend stats are meaningless across a restart. Because the write happens in the single last process accessing shared memory, pgstat_write_statsfile() takes no locks — it documents and relies on that invariant.

What is gated, and where the knobs sit

The whole subsystem is conditional on pgstat_track_counts (track_counts GUC). pgstat_init_relation() honours it at relcache-open time: if counting is off it nulls the relation’s pgstat_info and sets pgstat_enabled = false, so the hot-path macro pgstat_should_count_relation() short-circuits with no branch into the stats system at all. Function-call timing is separately gated by track_functions, and block-IO timing by track_io_timing. These are the write-side knobs. stats_fetch_consistency is the only read-side knob and it is consulted entirely inside pgstat_fetch_entry() / pgstat_snapshot_fixed(); changing it sets force_stats_snapshot_clear so the next read discards any stale snapshot. The split matters: a DBA tuning monitoring overhead changes track_*; a DBA worried about cross-view consistency in a reporting query changes stats_fetch_consistency. They never interact.

Putting it together: one relation’s counter lifecycle

To anchor the pieces, follow a single counter end to end. A backend opens table T; pgstat_init_relation() marks it countable but allocates nothing. The first INSERT calls pgstat_assoc_relation() → pgstat_prep_relation_pending(), which calls pgstat_get_entry_ref(create = true); the dshash entry for {RELATION, mydb, T} is found or inserted, its refcount goes from 1 (sentinel) to 2 (this backend), a local PgStat_EntryRef is cached, and a zeroed PgStat_TableStatus pending buffer is linked onto pgStatPending. Each inserted tuple bumps trans->tuples_inserted in a per-subxact record — plain arithmetic, no locks, no shared memory. At COMMIT, AtEOXact_PgStat_Relations() folds the transactional counts into the pending buffer’s counts and derives the live/dead deltas. The next pgstat_report_stat() (forced at commit) calls pgstat_relation_flush_cb(), which takes T’s entry LWLock, adds the counts into the shared PgStat_StatTabEntry, and also rolls them up into the pending database entry. A concurrent SELECT ... FROM pg_stat_user_tables runs pgstat_fetch_entry(), which in the default cache mode copies T’s shared body once under a shared lock and memoizes it. Eventually someone runs DROP TABLE T: pgstat_drop_relation() queues a transactional drop; at that commit AtEOXact_PgStat_DroppedStats() calls pgstat_drop_entry(), which tombstones the entry and removes the sentinel refcount. If no other backend still holds a ref the body is freed immediately; otherwise the GC epoch is bumped and the last backend to pass through pgstat_get_entry_ref() reaps it — generation-checked against reuse. Every mechanism in this document appears once in that walkthrough.

Position hints (as of 2026-06-05, REL_18 273fe94)

Symbol	File	Line
`PgStatShared_HashEntry` (struct)	`src/include/utils/pgstat_internal.h`	65
`PgStat_EntryRef` (struct)	`src/include/utils/pgstat_internal.h`	135
`PgStat_KindInfo` (struct)	`src/include/utils/pgstat_internal.h`	202
`PgStat_ShmemControl` (struct)	`src/include/utils/pgstat_internal.h`	466
`PgStat_Snapshot` (struct)	`src/include/utils/pgstat_internal.h`	510
`PgStat_LocalState` (struct)	`src/include/utils/pgstat_internal.h`	548
`pgstat_kind_builtin_infos[]`	`src/backend/utils/activity/pgstat.c`	282
`pgstat_report_stat`	`src/backend/utils/activity/pgstat.c`	693
`pgstat_fetch_entry`	`src/backend/utils/activity/pgstat.c`	933
`pgstat_build_snapshot`	`src/backend/utils/activity/pgstat.c`	1122
`pgstat_prep_pending_entry`	`src/backend/utils/activity/pgstat.c`	1267
`pgstat_flush_pending_entries`	`src/backend/utils/activity/pgstat.c`	1341
`StatsShmemInit`	`src/backend/utils/activity/pgstat_shmem.c`	155
`pgstat_attach_shmem`	`src/backend/utils/activity/pgstat_shmem.c`	244
`pgstat_init_entry`	`src/backend/utils/activity/pgstat_shmem.c`	301
`pgstat_reinit_entry`	`src/backend/utils/activity/pgstat_shmem.c`	340
`pgstat_acquire_entry_ref`	`src/backend/utils/activity/pgstat_shmem.c`	381
`pgstat_get_entry_ref`	`src/backend/utils/activity/pgstat_shmem.c`	457
`pgstat_release_entry_ref`	`src/backend/utils/activity/pgstat_shmem.c`	603
`pgstat_gc_entry_refs`	`src/backend/utils/activity/pgstat_shmem.c`	758
`pgstat_drop_entry_internal`	`src/backend/utils/activity/pgstat_shmem.c`	888
`pgstat_drop_entry`	`src/backend/utils/activity/pgstat_shmem.c`	990
`pgstat_count_heap_insert`	`src/backend/utils/activity/pgstat_relation.c`	374
`AtEOXact_PgStat_Relations`	`src/backend/utils/activity/pgstat_relation.c`	551
`pgstat_relation_flush_cb`	`src/backend/utils/activity/pgstat_relation.c`	816
`pgstat_report_vacuum`	`src/backend/utils/activity/pgstat_relation.c`	210
`AtEOXact_PgStat_DroppedStats`	`src/backend/utils/activity/pgstat_xact.c`	67
`create_drop_transactional_internal`	`src/backend/utils/activity/pgstat_xact.c`	335
`pgstat_drop_transactional`	`src/backend/utils/activity/pgstat_xact.c`	384
`PGSTAT_KIND_RELATION` (macro)	`src/include/utils/pgstat_kind.h`	28

Source verification (as of 2026-06-05)

Verified against the REL_18_STABLE working tree at commit 273fe94852b.

Storage split confirmed. pgstat.c’s file header (lines 18-39) states fixed-numbered stats live in plain shared memory and variable-numbered stats in DSA reachable via dshash, with the counters separate from the dshash entry (PgStatShared_HashEntry->body). The PgStat_ShmemControl struct embeds archiver, bgwriter, checkpointer, io, slru, wal directly and carries hash_handle for the dshash. Confirmed.
Kind table and the five focus kinds. pgstat_kind_builtin_infos[] contains PGSTAT_KIND_RELATION and PGSTAT_KIND_FUNCTION (fixed_amount = false, with flush_pending_cb) and PGSTAT_KIND_IO, PGSTAT_KIND_WAL, PGSTAT_KIND_SLRU (fixed_amount = true, with flush_static_cb/init_shmem_cb/snapshot_cb). Confirmed at the cited lines.
Refcount sentinel + generation race. pgstat_init_entry() sets refcount = 1 / generation = 0; pgstat_reinit_entry() bumps generation; pgstat_release_entry_ref() compares generation == entry_ref->generation before freeing. Confirmed.
Transactional drop into WAL. pgstat_drop_transactional() → create_drop_transactional_internal() queues onto pending_drops; pgstat_get_transactional_drops() / pgstat_execute_transactional_drops() carry the items into the commit/abort record and recovery. Confirmed in pgstat_xact.c.
REL_18 sanity. No reference to a XLOG2 rmgr or B_DATACHECKSUMSWORKER_* backend type appears anywhere in these files; the PGSTAT_KIND_BACKEND kind and the per-backend IO/WAL flush flags (PGSTAT_BACKEND_FLUSH_IO/_WAL) are present, consistent with PG18.
Line numbers in the position-hint table were read directly from the tree on 2026-06-05; they are hints that decay — the symbol names are the durable anchors.
Scope boundary. Wait events, backend status, and progress reporting also live under utils/activity/ but are a different mechanism (the live view) and are deferred to postgres-wait-events-progress.md; autovacuum’s consumption of the dead-tuple counters is covered in postgres-autovacuum.md. This document does not re-derive those.

Beyond PostgreSQL — Comparative Designs & Research Frontiers

The UDP collector PostgreSQL left behind. The pre-15 stats collector is worth remembering as a cautionary point in the design space. Funneling all counter deltas to one process over loopback UDP gave perfect write isolation (only the collector mutated the totals) but at three costs the shared-memory design eliminates: lossiness (UDP datagrams could be dropped silently under load, so the numbers were approximate by construction), a single-process throughput ceiling, and read amplification (every pg_stat query slurped a whole serialized file). The PG15 redesign is a textbook migration from a message-passing shared-state model to a shared-memory with local aggregation model — the §2 process-model trade-off from Architecture of a Database System (Hellerstein et al. 2007) playing out in one subsystem’s history. The lesson generalizes: when the shared state is a commutative aggregate (counters that only add), local accumulation plus periodic conflict-free merge beats centralized serialization, because addition is associative and the merge order does not matter.

Per-thread instruments elsewhere. Oracle’s V$ fixed tables and SGA statistics, SQL Server’s DMVs, and MySQL/InnoDB’s performance_schema all converged on the same per-worker-buffer-plus-aggregate shape, differing mainly in the threading model (threads sharing an address space can use plain shared structs with atomics, where PostgreSQL’s process model forces DSA and dshash). MySQL’s performance_schema makes the freshness/overhead knob explicit and per-instrument (you can disable individual instruments), whereas PostgreSQL’s knob is coarser: track_counts, track_functions, track_io_timing gate whole categories, and stats_fetch_consistency governs only the read side.

Reference-counted reclamation as a recurring pattern. The refcount + tombstone + generation triad PostgreSQL uses for stats entries is the same shape as epoch-based reclamation (EBR) and hazard pointers in lock-free data-structure research (Michael 2004; Fraser 2004). The gc_request_count epoch is essentially a coarse-grained, cooperative EBR: instead of each reader publishing a precise epoch, a dropper bumps a global counter and trusts that every backend will eventually pass through pgstat_get_entry_ref() and notice. This is cheaper than true EBR (no per-access epoch publication) at the cost of unbounded reclamation latency — acceptable here because stale stats memory is small and bounded by the number of dropped-but-still-referenced objects. The generation field is exactly the ABA-problem guard that lock-free literature insists on: it turns “is this the same pointer?” into “is this the same incarnation?”

Consistency models for monitoring reads. The stats_fetch_consistency GUC is a small, pragmatic answer to a question the streaming/observability research community treats at length: what isolation should a monitoring read get? snapshot gives serializable-ish consistency across all counters at one instant (at the cost of copying the whole table); cache gives read-committed-per-object-with-memoization; none gives the freshest but torn-est read. Modern time-series and observability systems (Prometheus, the work around streaming aggregation) typically choose the equivalent of none — eventually-consistent, per-series — because cross-series consistency rarely matters for alerting. PostgreSQL’s default of cache is a middle ground tuned for the common case of a pg_stat query touching a handful of objects within one statement.

Frontier: pushing aggregation further down. A recurring research and engineering theme is moving aggregation closer to the event source to cut the merge cost — e.g. sketch-based approximate counters (Count-Min, HyperLogLog) for high-cardinality cases, or hardware-assisted atomics. For exact, low-cardinality activity counters PostgreSQL’s per-backend exact tallies are the right call; the interesting open question is the variable-numbered extreme — workloads with millions of short-lived tables/partitions, where the dshash and the per-object DSA bodies become the cost center and the reference-counting GC churns. Custom cumulative stats kinds (the pgstat_register_kind() extension point, reserved IDs at or above PGSTAT_KIND_CUSTOM_MIN) let extensions add their own kinds into exactly this machinery, which is where experimentation with denser representations would land.

Sources

Source tree: PostgreSQL REL_18_STABLE at commit 273fe94852b (2026-06-05):
- src/backend/utils/activity/pgstat.c — infrastructure: kind table, pgstat_report_stat(), pending list, snapshot build, statsfile I/O.
- src/backend/utils/activity/pgstat_shmem.c — shared-memory layer: DSA/dshash bootstrap, pgstat_get_entry_ref(), refcount/generation lifetime, drop and GC.
- src/backend/utils/activity/pgstat_relation.c — the relation kind: transactional tuple counts, vacuum/analyze reporting, flush callback.
- src/backend/utils/activity/pgstat_xact.c — transactional create/drop integration and WAL/2PC hand-off.
- src/include/utils/pgstat_internal.h — the shared/local struct definitions and the changecount inline helpers for single-writer fixed stats.
- src/include/pgstat.h, src/include/utils/pgstat_kind.h — public types, kind IDs, stats_fetch_consistency enum, file format ID.
Theory anchors:
- Database System Concepts (Silberschatz, Korth & Sudarshan, 7e) — ch. 16 (planner statistics vs. activity counters), ch. 25 §25.3 (storage manager housekeeping / autovacuum role). Captured in knowledge/research/dbms-general/database-system-concepts.md.
- Architecture of a Database System (Hellerstein, Stonebraker & Hamilton 2007) — §2 process models, §6 shared components: the message-passing-vs-shared-memory trade-off the PG15 redesign embodies. Anchored in .omc/plans/postgres-paper-bibliography.md.
Lock-free reclamation context (comparative section): epoch-based reclamation and hazard-pointer literature (Michael 2004; Fraser 2004) — cited for design lineage of the refcount/generation/epoch triad, not as PostgreSQL sources.
Cross-references within this code-analysis tree:
- postgres-wait-events-progress.md — the live view (status, wait events, progress) sharing the utils/activity/ home.
- postgres-autovacuum.md — consumer of the dead-tuple / change counters.
- postgres-shared-memory-ipc.md, postgres-lwlock-spinlock.md — the DSA, dshash, and LWLock substrate this subsystem is built on.
- postgres-overview-monitoring-stats.md — the subcategory router.