PostgreSQL Cumulative Statistics — The PG15 Shared-Memory Stats Subsystem
Contents:
- Theoretical Background
- Common DBMS Design
- PostgreSQL’s Approach
- Source Walkthrough
- Source verification (as of 2026-06-05)
- Beyond PostgreSQL — Comparative Designs & Research Frontiers
- Sources
Theoretical Background
Section titled “Theoretical Background”Every production DBMS keeps two distinct kinds of introspection state, and
conflating them is a classic design error. The first is the live view:
what is each session doing right now — which query, which wait event,
how far through a VACUUM. The second is the cumulative view: what has
happened over time — how many sequential scans this table has served,
how many tuples were inserted, how many dead tuples accumulated since the
last vacuum, how many WAL bytes were written. The live view is a snapshot
that is meaningless a millisecond later; the cumulative view is a set of
monotonically growing counters whose value is in the accumulation. This
document is about the cumulative view — the subsystem behind the
pg_stat_* family of views and, just as importantly, behind autovacuum’s
decision of which table to vacuum next.
Database System Concepts (Silberschatz, 7e) frames the role of these
counters in two places. In query optimization (ch. 16, “Query
Optimization”), the catalog statistics — relation cardinalities, tuple
sizes, value distributions — are the raw input to the cost model; without
them the optimizer is blind. The textbook is careful to distinguish those
planner statistics (gathered by ANALYZE, stored in pg_statistic,
used to estimate selectivities) from the activity counters this document
covers. They are different beasts: planner statistics are sampled
estimates of data shape, refreshed occasionally; cumulative activity
counters are exact event tallies, updated on every tuple touched. The
overlap is that both feed autovacuum — the dead-tuple counter (cumulative)
crossing a threshold relative to the live-tuple estimate (planner-ish) is
what schedules an autovacuum, which is the engine’s self-maintenance loop
(DSC ch. 25 §25.3 on the storage manager’s housekeeping).
The deeper architectural question the textbook raises only implicitly is
where the authoritative copy of a shared counter lives, and who is
allowed to write it. A counter like “tuples inserted into table T” is
incremented by every backend that inserts into T, and read by any
backend running SELECT * FROM pg_stat_user_tables. That is a classic
many-writers/many-readers shared-state problem. The naive answer —
one global mutable counter per object, locked on every increment — does
not scale: the lock on a hot table’s counter would become a bottleneck
worse than the data page itself. The Architecture of a Database System
(Hellerstein, Stonebraker & Hamilton 2007, §6 on shared components and
§2 on process models) names the governing tension directly: shared
introspection state must be cheap enough to update that instrumentation
does not perturb the thing being measured (the observer effect), yet
consistent enough that the numbers are trustworthy. The standard
resolution, which PostgreSQL adopts, is local accumulation with periodic
aggregation: each writer keeps a private running tally and only
occasionally folds it into the shared total. The shared total is then
slightly stale but contention-free on the hot path; the staleness window
is a tunable knob.
There is a fourth, quieter force that the textbook underplays but that
dominates the implementation: transactional correctness of the
counts. A row inserted in a transaction that later aborts did happen
(work was done, a dead tuple exists) but did not logically add a live
row. So tuple counts cannot simply be incremented at insert time — they
must be staged per (sub)transaction and resolved at commit/abort, with the
“attempted action” counts (which include aborted work) kept separate from
the “net effect” counts (live/dead-tuple deltas that depend on the
outcome). This is why PostgreSQL’s relation stats carry both a
nontransactional PgStat_TableCounts and a stack of per-subxact
PgStat_TableXactStatus records, and why the existence of a stats entry
for a freshly-created table is itself transactional. A cumulative-stats
system that ignored this would report phantom rows after every rollback.
Four design forces therefore shape any cumulative-stats subsystem, and they recur in PostgreSQL’s source:
- Write amplification vs. accuracy. Folding local tallies into shared memory every N events or every T milliseconds trades freshness for throughput. PostgreSQL flushes at most once per second on the idle path and unconditionally at commit.
- Object lifetime vs. counter lifetime. A table can be dropped while a
backend is still mid-flush of counts for it, or while a
pg_statsnapshot is being read. The counter’s storage must outlive the object just long enough for all observers to let go — a reference-counting problem. - Durability across restart. Cumulative counters are expensive to recompute (you cannot re-derive “lifetime sequential scans” from the data), so they are serialized to disk at shutdown and reloaded — but discarded after a crash, because a crash means some in-flight updates were lost and the totals can no longer be trusted.
- Transactional resolution. Net-effect counts (live/dead tuples) must
wait for commit/abort; attempted-action counts accrue regardless. The
stats entry’s very lifetime is transactional — created on
CREATE, dropped onDROP, both reversible by rollback and replayable on standbys via the commit/abort WAL record.
Common DBMS Design
Section titled “Common DBMS Design”The textbook gives the forces; this section names the engineering conventions production engines use to resolve them, so PostgreSQL’s specific choices read as one point in a known design space.
Per-worker accumulation buffers. Virtually every high-throughput
engine gives each thread or process a private block of counters that it
increments with plain (unlocked) arithmetic, then merges into a global
aggregate on a slow cadence. Oracle’s V$ fixed tables, SQL Server’s DMVs,
and MySQL’s performance_schema all rest on per-thread instrument buffers
aggregated on read or on a timer. The merge is the only synchronized
operation, so the cost of instrumentation on the hot path is a single
non-atomic increment. PostgreSQL’s pending entries are exactly this
buffer.
A keyed registry of shared counters. The shared aggregate is rarely a
flat array — the set of objects (tables, functions) is dynamic, so it is a
hashtable keyed by an object identifier. The registry must support
concurrent lookup, insert (first time an object is touched), and delete
(object dropped), which pushes designs toward partitioned/striped hash
tables with per-partition locks rather than one global lock. PostgreSQL
uses dshash, a dynamic-shared-memory hashtable with partitioned locking.
Reference-counted entry lifetime. Once counters live in a shared
registry and can be dropped concurrently with being read or updated, the
entry needs a lifetime independent of its presence in the hashtable. The
canonical pattern is a refcount: the entry carries a count of outstanding
users, a “logically deleted” tombstone flag, and is physically freed only
when the count hits zero. This is the same pattern as Linux’s dput()
dentry cache, Java’s PhantomReference-based cleanup, or any read-copy-
update scheme. The subtlety is the reuse race: between a backend
deciding to free an entry and actually freeing it, the slot can be reused
for a new object with the same key (OID wraparound, replication-slot index
reuse). A monotonic generation/epoch counter on the entry lets a
late releaser detect “the thing I was about to free is not the thing
that’s here now” and back off.
Snapshot isolation for readers. A reader scanning many counters wants
a consistent picture, not a torn read where table A’s count is from time
t1 and table B’s is from t2. Engines offer a materialized snapshot mode
(copy everything once, read from the copy) for consistency, and a
direct mode for freshness. PostgreSQL exposes this as the
stats_fetch_consistency GUC with three settings (none, cache,
snapshot).
Durability on the slow path. Because the totals are pure accumulation, they are flushed to disk lazily — typically by a dedicated background process at checkpoint or shutdown, never on the transaction commit path, since losing a few seconds of stats on a crash is acceptable (the engine discards all of them on crash anyway). This is the opposite of WAL’s synchronous discipline: stats durability is best-effort by design.
Single-writer fast paths. A second category of counter has exactly one writer — the checkpointer writes checkpointer stats, the archiver writes archiver stats, each backend writes only its own IO/WAL tallies. For these the engine can skip locking on the write side entirely and use a seqlock-style changecount protocol instead: the writer brackets its update between an odd→even pair of increments to a version counter with memory barriers, and a reader retries if it observed an odd (mid-write) count or saw the count change across the read. This is the classic optimistic-read pattern (Linux seqlocks, the RCU read side) and it makes the write — the performance-critical direction for these high-frequency counters — barrier-cheap and lock-free, at the cost of an occasional reader retry. PostgreSQL’s fixed-numbered stats use exactly this; the variable-numbered (per-object) stats instead take the entry’s LWLock, because they have many writers.
flowchart TD
subgraph backend["Each backend (process-local)"]
hot["hot path: pgstat_count_heap_insert etc.<br/>plain increments into pending entry"]
pend["pending entry<br/>PgStat_EntryRef->pending"]
lhash["local lookup hashtable<br/>pgStatEntryRefHash"]
hot --> pend
lhash -. caches refs to .-> pend
end
subgraph shmem["Shared memory (all backends)"]
dshash["dshash table<br/>key = PgStat_HashKey<br/>entry = PgStatShared_HashEntry"]
body["stats body in DSA<br/>PgStatShared_Relation etc."]
fixed["fixed-numbered block<br/>PgStat_ShmemControl<br/>archiver/bgwriter/io/wal/slru"]
dshash -- dsa_pointer body --> body
end
pend -- "pgstat_report_stat()<br/>flush, <= 1/sec or at commit" --> body
lhash -- "miss: dshash_find_or_insert" --> dshash
disk["pg_stat/pgstat.stat<br/>on-disk file"]
body -. "checkpointer writes at shutdown" .-> disk
fixed -. "checkpointer writes at shutdown" .-> disk
disk -. "startup process reads<br/>(discard after crash)" .-> body
PostgreSQL’s Approach
Section titled “PostgreSQL’s Approach”Before PostgreSQL 15, cumulative stats lived in a single dedicated
process, the stats collector. Backends serialized their counter deltas
into UDP datagrams and fired them at the collector over a loopback socket;
the collector owned the only in-memory copy of every counter and
periodically wrote a snapshot file that readers slurped in whole. That
design had three chronic problems: UDP packets could be dropped under load
(silently losing counts), the single collector was a throughput ceiling,
and every pg_stat read deserialized an entire file. PG15 (commit
5891c7a8e, Andres Freund) replaced it with the shared-memory cumulative
statistics system documented here. The header comment in pgstat.c
states the architecture directly:
// pgstat.c (file header) — src/backend/utils/activity/pgstat.c// Fixed-numbered stats are stored in plain (non-dynamic) shared memory.//// Statistics for variable-numbered objects are stored in dynamic shared// memory and can be found via a dshash hashtable. The statistics counters// are not part of the dshash entry (PgStatShared_HashEntry) directly, but// are separately allocated (PgStatShared_HashEntry->body)...//// To avoid contention on the shared hashtable, each backend has a// backend-local hashtable (pgStatEntryRefHash) in front of the shared// hashtable, containing references (PgStat_EntryRef) to shared hashtable// entries.Two storage classes, one façade. Every kind of statistic is described
by a PgStat_KindInfo row in the static pgstat_kind_builtin_infos[]
table. The fixed_amount flag splits the world in two. Fixed-numbered
kinds (there is exactly one archiver, one bgwriter, one IO struct, one
WAL struct, one SLRU array) live as embedded members of the single
PgStat_ShmemControl struct in plain shared memory. Variable-numbered
kinds (one entry per relation, per function, per replication slot, per
subscription, per backend) live in the dshash. The kind table is the
dispatch vtable for the whole subsystem:
// pgstat_kind_builtin_infos[] — src/backend/utils/activity/pgstat.c[PGSTAT_KIND_RELATION] = { .name = "relation", .fixed_amount = false, .write_to_file = true, .shared_size = sizeof(PgStatShared_Relation), .shared_data_off = offsetof(PgStatShared_Relation, stats), .shared_data_len = sizeof(((PgStatShared_Relation *) 0)->stats), .pending_size = sizeof(PgStat_TableStatus), .flush_pending_cb = pgstat_relation_flush_cb, .delete_pending_cb = pgstat_relation_delete_pending_cb,},PGSTAT_KIND_RELATION is 2; built-in kinds run from
PGSTAT_KIND_BUILTIN_MIN (= PGSTAT_KIND_DATABASE, 1) upward, and custom
extension kinds occupy PGSTAT_KIND_CUSTOM_MIN (24) and above. The three
shared_* offsets let the generic code in pgstat.c and pgstat_shmem.c
slice the right sub-range out of a variable-size entry without knowing the
kind’s concrete struct — pgstat_get_entry_data() just adds
shared_data_off to the header pointer.
The shared hash entry is a thin handle, not the data. A dshash entry is
fixed-size, but stats bodies vary (a relation entry is far larger than a
subscription entry). So the dshash stores PgStatShared_HashEntry — key,
flags, refcount, generation, and a dsa_pointer to the real body — and the
body is allocated separately in DSA. The separation also means the dshash
never has to be resized when a new kind is added.
// PgStatShared_HashEntry — src/include/utils/pgstat_internal.htypedef struct PgStatShared_HashEntry{ PgStat_HashKey key; /* {kind, dboid, objid} */ bool dropped; /* logically deleted; no new refs allowed */ pg_atomic_uint32 refcount; /* entry lifetime, not dshash-entry lifetime */ pg_atomic_uint32 generation; /* bumped on reinit; detects reuse races */ dsa_pointer body; /* -> PgStatShared_Common + kind-specific stats */} PgStatShared_HashEntry;The key PgStat_HashKey is {PgStat_Kind kind, Oid dboid, uint64 objid}.
For a per-relation entry, dboid is the database OID (or InvalidOid for
shared catalogs) and objid is the table OID. The body always begins with
a PgStatShared_Common header carrying a magic cross-check and a
per-entry LWLock — so two backends flushing counts for two different
tables never contend, and a flusher and a reader of the same table
serialize only on that one table’s lock.
Notice the field that is not in the dshash entry: the actual counters.
That choice is deliberate and worth dwelling on. A PgStatShared_Relation
body is dozens of int64 counters plus timestamps; a
PgStatShared_Subscription body is a handful. If every kind’s data were
inlined into the dshash entry, the entry size would have to be the maximum
over all kinds, wasting space for the small kinds and forcing a dshash
resize whenever a larger kind is introduced. By storing only a
dsa_pointer and allocating the body separately with the kind’s exact
shared_size, the dshash stays kind-agnostic and right-sized. The price is
one pointer indirection (dsa_get_address(body)) per access — which the
PgStat_EntryRef caches away as a resolved local pointer, so it is paid
once per backend per object, not once per update.
The fixed-numbered kinds avoid the dshash entirely. Their data is embedded
directly in PgStat_ShmemControl, reached by shared_ctl_off, and they
carry the seqlock changecount described above rather than an LWLock on
the write path:
// PgStatShared_BgWriter / PgStat_ShmemControl — src/include/utils/pgstat_internal.htypedef struct PgStatShared_BgWriter{ LWLock lock; /* protects reset_offset + reset_timestamp */ uint32 changecount; /* seqlock: odd = write in progress */ PgStat_BgWriterStats stats; PgStat_BgWriterStats reset_offset;} PgStatShared_BgWriter;
typedef struct PgStat_ShmemControl{ void *raw_dsa_area; dshash_table_handle hash_handle; /* the variable-numbered hash */ bool is_shutdown; pg_atomic_uint64 gc_request_count; /* GC epoch */ PgStatShared_Archiver archiver; /* fixed kinds embedded inline */ PgStatShared_BgWriter bgwriter; PgStatShared_Checkpointer checkpointer; PgStatShared_IO io; PgStatShared_SLRU slru; PgStatShared_Wal wal; void *custom_data[PGSTAT_KIND_CUSTOM_SIZE];} PgStat_ShmemControl;Local pending + the reference cache. A backend almost never touches the
dshash on the hot path. When a relation is first modified, the backend
obtains a PgStat_EntryRef — a local handle that bundles a pointer to the
shared PgStatShared_HashEntry, a resolved local pointer to the body
(avoiding repeated dsa_get_address()), the body’s generation at
acquire time, and a pending pointer to a process-local scratch buffer.
These refs are cached in pgStatEntryRefHash, a backend-local simplehash.
The hot path (pgstat_count_heap_insert, etc.) just bumps integers in the
pending buffer. The transactional dance for tuple counts adds a layer:
inserts/updates/deletes accumulate in a per-subtransaction
PgStat_TableXactStatus so they can be rolled back, and only fold into the
backend’s PgStat_TableStatus.counts at (sub)transaction end.
flowchart TD
start["DML on table T<br/>pgstat_count_heap_insert(rel, n)"] --> chk{"pgstat_should_count_relation?"}
chk -- no --> done1["return (untracked)"]
chk -- yes --> ensure["ensure_tabstat_xact_level<br/>push PgStat_TableXactStatus<br/>for current nest level"]
ensure --> bump["trans->tuples_inserted += n"]
bump --> commitq{"(sub)xact end?"}
commitq -- subxact commit --> up["fold counts into parent<br/>AtEOSubXact_PgStat_Relations"]
commitq -- top commit --> fold["AtEOXact_PgStat_Relations<br/>fold trans -> counts<br/>derive delta_live/dead_tuples"]
commitq -- abort --> abort["restore truncdrop counters<br/>inserted tuples become dead"]
fold --> later["next pgstat_report_stat()"]
later --> flush["pgstat_relation_flush_cb<br/>lock entry, add counts to shared,<br/>also bump database pending entry"]
flush --> shared["PgStatShared_Relation.stats<br/>in DSA"]
Flush cadence. pgstat_report_stat(force) is the single drain point.
It fast-exits if the pending list is empty and no fixed-numbered kind asked
for a flush. Otherwise, unless forced, it throttles: it will not flush more
often than PGSTAT_MIN_INTERVAL (1000 ms), and if updates have been
pending longer than PGSTAT_MAX_INTERVAL (60000 ms) it forces a blocking
flush. On the non-forced path it acquires per-entry locks with nowait and
simply leaves un-flushable entries on the pending list for the next round,
returning a suggested idle timeout (PGSTAT_IDLE_INTERVAL, 10000 ms).
Commit, backend exit, and pg_stat_force_next_flush() force it.
Lifetime is reference-counted and the drop is transactional. When an
object is dropped, the stats entry cannot be freed immediately — other
backends may hold refs or be reading a snapshot. pgstat_init_entry()
seeds refcount = 1 (a sentinel meaning “not dropped”); each backend
that acquires a ref adds 1 more. A drop sets dropped = true and
subtracts the sentinel; the body is dsa_free()d only when the count
reaches 0, which can happen long after the drop, in whichever backend
happens to release the last ref. Because OID/index reuse can resurrect a
key, generation is bumped on reinit and a late releaser compares the
entry’s current generation against the one it captured. And because the
decision to drop is itself transactional (a DROP TABLE that rolls back
must not drop the stats), pgstat_drop_transactional() only queues the
drop; AtEOXact_PgStat_DroppedStats() executes it at commit and the drop
is recorded in the commit/abort WAL record so standbys and crash recovery
converge. The rest of this is traced symbol-by-symbol below.
Source Walkthrough
Section titled “Source Walkthrough”Shared-memory bootstrap and attach
Section titled “Shared-memory bootstrap and attach”StatsShmemSize() reserves sizeof(PgStat_ShmemControl) plus a fixed
256 KB DSA seed plus space for any custom fixed kinds. StatsShmemInit()
runs once in the postmaster: it creates a DSA in place inside the
reserved shared segment (the postmaster cannot use dynamic shared-memory
segments), then — with the DSA size temporarily clamped to its initial
size so the structure stays in plain shared memory — creates the dshash
and records its handle. It then walks the kind table and calls each
fixed-numbered kind’s init_shmem_cb.
// StatsShmemInit() — src/backend/utils/activity/pgstat_shmem.cctl->raw_dsa_area = p;p += MAXALIGN(pgstat_dsa_init_size());dsa = dsa_create_in_place(ctl->raw_dsa_area, pgstat_dsa_init_size(), LWTRANCHE_PGSTATS_DSA, NULL);dsa_pin(dsa);/* clamp so dshash header lands in plain shared memory */dsa_set_size_limit(dsa, pgstat_dsa_init_size());dsh = dshash_create(dsa, &dsh_params, NULL);ctl->hash_handle = dshash_get_hash_table_handle(dsh);dsa_set_size_limit(dsa, -1); /* lift limit */Each backend then runs pgstat_attach_shmem() from pgstat_initialize()
(called in BaseInit()): it attaches the DSA and dshash in place and pins
the mapping for the backend’s lifetime. gc_request_count is initialized
to 1 so the local-ref age comparison has a non-zero baseline. The matching
pgstat_detach_shmem() first releases all local entry refs, then detaches
— a process must not exit holding refs, or shared bodies could never be
freed.
Acquiring a reference: pgstat_get_entry_ref
Section titled “Acquiring a reference: pgstat_get_entry_ref”This is the funnel every stats access flows through.
pgstat_get_entry_ref(kind, dboid, objid, create, created_entry) first
checks the backend-local cache via pgstat_get_entry_ref_cached(), which
always inserts a local hashtable slot first (so a cache miss costs one
lookup, and so refcount increments never race with an out-of-memory
error). On a local hit it returns immediately, touching no shared state. On
a miss it probes the dshash with a shared lock (dshash_find), and only
on a creating-and-absent path takes the insert path:
// pgstat_get_entry_ref() — src/backend/utils/activity/pgstat_shmem.cshhashent = dshash_find(pgStatLocal.shared_hash, &key, false);if (create && !shhashent){ shhashent = dshash_find_or_insert(pgStatLocal.shared_hash, &key, &shfound); if (!shfound) { shheader = pgstat_init_entry(kind, shhashent); ... pgstat_acquire_entry_ref(entry_ref, shhashent, shheader); if (created_entry != NULL) *created_entry = true; return entry_ref; }}pgstat_init_entry() is where a fresh entry is born: it sets
refcount = 1 (the not-dropped sentinel) and generation = 0,
DSA_ALLOC_ZEROs the kind’s shared_size, stamps the 0xdeadbeef magic,
and initializes the per-entry LWLock. pgstat_acquire_entry_ref() then
adds the caller’s reference (a second increment), releases the dshash
partition lock, and snapshots the entry’s generation into the local ref.
The crucial property: the refcount is incremented while holding only a
shared dshash lock — it is atomic precisely because many backends acquire
references concurrently under shared locks.
If the found entry is dropped but create is true, pgstat_reinit_entry()
resurrects it: it re-adds the sentinel refcount, bumps generation (so
any backend still holding a stale ref to the old incarnation will notice),
clears dropped, and zeroes the data.
The local cache (pgstat_get_entry_ref_cached()) deserves a closer look
because its ordering is subtle. It inserts the local hashtable slot
before touching shared memory, for two reasons stated in the code: it
avoids a second hashtable lookup on a miss, and it sidesteps having to
unwind a shared refcount increment if a later allocation throws
out-of-memory. A freshly inserted slot has entry_ref->shared_stats == NULL, which the function reports as “not found” so the caller proceeds to
the shared lookup; a populated slot is returned as a hit with assertions
that its body magic is intact and refcount is positive. The local refs are
allocated in pgStatSharedRefContext and the cache hashtable in
pgStatEntryRefHashContext, both children of TopMemoryContext, so they
survive across transactions — a backend that touches a table once keeps a
cheap path to it for its whole life.
The vacuum/analyze path bypasses the pending buffer entirely.
pgstat_report_vacuum() and pgstat_report_analyze() write directly to
the shared entry under its LWLock (via pgstat_get_entry_ref_locked()),
because these are infrequent, already-expensive operations and their
results — last_vacuum_time, dead-tuple resets, ins_since_vacuum = 0 —
should be visible immediately rather than waiting for a flush:
// pgstat_report_vacuum() — src/backend/utils/activity/pgstat_relation.centry_ref = pgstat_get_entry_ref_locked(PGSTAT_KIND_RELATION, dboid, tableoid, false);shtabentry = (PgStatShared_Relation *) entry_ref->shared_stats;tabentry = &shtabentry->stats;tabentry->live_tuples = livetuples;tabentry->dead_tuples = deadtuples;tabentry->ins_since_vacuum = 0;if (AmAutoVacuumWorkerProcess()) { tabentry->autovacuum_count++; ... }pgstat_unlock_entry(entry_ref);The pending buffer and the flush
Section titled “The pending buffer and the flush”pgstat_prep_pending_entry() lazily allocates the kind’s pending_size
scratch buffer in pgStatPendingContext and links the ref onto the global
pgStatPending dlist. The actual hot-path counters for relations go
through pgstat_count_heap_insert() and friends in pgstat_relation.c,
which route into the per-subxact PgStat_TableXactStatus:
// pgstat_count_heap_insert() — src/backend/utils/activity/pgstat_relation.cif (pgstat_should_count_relation(rel)){ PgStat_TableStatus *pgstat_info = rel->pgstat_info; ensure_tabstat_xact_level(pgstat_info); pgstat_info->trans->tuples_inserted += n;}At top-level commit, AtEOXact_PgStat_Relations() folds the transactional
counts into the nontransactional counts, deriving the live/dead-tuple
deltas that autovacuum keys off — inserts add a live tuple, updates and
deletes each make a dead one, and every action counts as a change event:
// AtEOXact_PgStat_Relations() — src/backend/utils/activity/pgstat_relation.ctabstat->counts.delta_live_tuples += trans->tuples_inserted - trans->tuples_deleted;tabstat->counts.delta_dead_tuples += trans->tuples_updated + trans->tuples_deleted;tabstat->counts.changed_tuples += trans->tuples_inserted + trans->tuples_updated + trans->tuples_deleted;pgstat_report_stat() then drains the pending list. Its throttle and
fast-exit are the performance heart of the subsystem:
// pgstat_report_stat() — src/backend/utils/activity/pgstat.c/* Don't expend a clock check if nothing to do */if (dlist_is_empty(&pgStatPending) && !pgstat_report_fixed) return 0;...nowait = !force;partial_flush = false;partial_flush |= pgstat_flush_pending_entries(nowait);if (pgstat_report_fixed) for (kind = PGSTAT_KIND_MIN; kind <= PGSTAT_KIND_MAX; kind++) if (kind_info->flush_static_cb) partial_flush |= kind_info->flush_static_cb(nowait);pgstat_flush_pending_entries() walks the dlist, calls each kind’s
flush_pending_cb, and removes the entry from the pending list only if the
flush succeeded; with nowait a lock-contended entry stays pending and the
function returns have_pending = true. For relations,
pgstat_relation_flush_cb() short-circuits if counts is all zeros
(pg_memory_is_all_zeros — e.g. an index the planner opened but never
scanned), takes the entry’s LWLock, adds every counter into the shared
PgStat_StatTabEntry, clamps live/dead tuples non-negative, and — the
cross-kind twist — also bumps the pending database entry so
pg_stat_database stays consistent without a second pass:
// pgstat_relation_flush_cb() — src/backend/utils/activity/pgstat_relation.cif (pg_memory_is_all_zeros(&lstats->counts, sizeof(struct PgStat_TableCounts))) return true;if (!pgstat_lock_entry(entry_ref, nowait)) return false;tabentry = &shtabstats->stats;tabentry->numscans += lstats->counts.numscans;... /* all counters folded in */pgstat_unlock_entry(entry_ref);/* same counts also roll up to the database entry */dbentry = pgstat_prep_database_pending(dboid);dbentry->tuples_inserted += lstats->counts.tuples_inserted;Reading: pgstat_fetch_entry and snapshot modes
Section titled “Reading: pgstat_fetch_entry and snapshot modes”pgstat_fetch_entry() is the read façade behind pg_stat_get_*. Its
behaviour is governed by stats_fetch_consistency. In SNAPSHOT mode it
first calls pgstat_build_snapshot(), which seq-scans the whole dshash
once and copies every relevant entry into a local
pgStatLocal.snapshot.stats simplehash (skipping dropped entries and
other databases’ non-shared stats), giving the transaction a frozen view.
In CACHE mode it copies each entry on first access and memoizes it; in
NONE mode it copies fresh every time into the caller’s context:
// pgstat_fetch_entry() — src/backend/utils/activity/pgstat.centry_ref = pgstat_get_entry_ref(kind, dboid, objid, false, NULL);if (entry_ref == NULL || entry_ref->shared_entry->dropped) { ... return NULL; }...(void) pgstat_lock_entry_shared(entry_ref, false);memcpy(stats_data, pgstat_get_entry_data(kind, entry_ref->shared_stats), kind_info->shared_data_len);pgstat_unlock_entry(entry_ref);pgstat_build_snapshot() acquires each body’s LWLock directly (it has no
PgStat_EntryRef, so it cannot use pgstat_lock_entry_shared) and copies
under it, asserting the entry’s refcount is still positive — the snapshot
scan relies on the dshash seq lock to keep entries alive during the copy.
The fixed-numbered kinds are snapshotted by their snapshot_cb, which
copies under the seqlock retry loop (pgstat_copy_changecounted_stats())
rather than a lock, mirroring the single-writer write side.
Fixed-numbered flush: the report-fixed flag
Section titled “Fixed-numbered flush: the report-fixed flag”Variable-numbered kinds announce pending work simply by being on the
pgStatPending list. Fixed-numbered kinds have no such list — each backend
accumulates, say, its IO stats in a process-local struct — so they signal
pgstat_report_stat() through the global pgstat_report_fixed boolean.
When a backend records an IO or WAL event it sets the flag; on the next
pgstat_report_stat() the loop over kinds calls each flush_static_cb
(pgstat_io_flush_cb, pgstat_wal_flush_cb, pgstat_slru_flush_cb),
which fold the local tallies into the embedded shared struct under the
seqlock. The flag is what lets the common “nothing happened” call return in
two comparisons:
// pgstat_report_stat() fast exit — src/backend/utils/activity/pgstat.cif (dlist_is_empty(&pgStatPending) && !pgstat_report_fixed) return 0;Only pgstat_report_stat() is allowed to clear pgstat_report_fixed,
after a full round of flushes succeeded — the per-kind callbacks must never
reset it, or a partially-flushed round under lock contention would lose the
signal that more work remains.
Dropping and reference-counted reclamation
Section titled “Dropping and reference-counted reclamation”pgstat_drop_entry_internal() is the reclamation core. It sets
dropped = true, then subtracts the sentinel refcount; if that reaches 0
(no backend holds a ref) it frees the body immediately, otherwise it leaves
the tombstoned entry in the dshash for the last referrer to reap:
// pgstat_drop_entry_internal() — src/backend/utils/activity/pgstat_shmem.cshent->dropped = true;/* release refcount marking entry as not dropped */if (pg_atomic_sub_fetch_u32(&shent->refcount, 1) == 0){ pgstat_free_entry(shent, hstat); return true;}else{ if (!hstat) dshash_release_lock(pgStatLocal.shared_hash, shent); return false; /* could not free yet */}The symmetric release path is pgstat_release_entry_ref(), run when a
backend lets go of a local ref (at gc, detach, or explicit drop). It
decrements the refcount; if it was the last one (fetch_sub returns 1) it
re-finds the entry under an exclusive dshash lock and — only if the
generation still matches — frees it. If the generation moved, the slot
was reinitialized for a new object while this backend was releasing, so it
just drops the lock and leaves the new incarnation alone:
// pgstat_release_entry_ref() — src/backend/utils/activity/pgstat_shmem.cif (pg_atomic_fetch_sub_u32(&entry_ref->shared_entry->refcount, 1) == 1){ shent = dshash_find(pgStatLocal.shared_hash, &entry_ref->shared_entry->key, true); if (pg_atomic_read_u32(&entry_ref->shared_entry->generation) == entry_ref->generation) pgstat_free_entry(shent, NULL); /* same incarnation: safe to free */ else dshash_release_lock(pgStatLocal.shared_hash, shent); /* reused: back off */}When a drop could not free the body (some backend held a ref),
pgstat_drop_entry() / pgstat_drop_matching_entries() call
pgstat_request_entry_refs_gc(), which bumps the shared
gc_request_count epoch. On its next pgstat_get_entry_ref(), every
backend notices pgStatSharedRefAge != gc_request_count and runs
pgstat_gc_entry_refs(), releasing any local refs whose entry is now
dropped or whose generation drifted — which can be the release that finally
hits refcount 0 and frees the body.
flowchart TD
drop["DROP TABLE commits<br/>pgstat_drop_entry_internal"] --> tomb["dropped = true<br/>refcount -= 1 (sentinel)"]
tomb --> z{"refcount == 0?"}
z -- yes --> free1["pgstat_free_entry<br/>dshash_delete + dsa_free"]
z -- no --> keep["leave tombstone<br/>pgstat_request_entry_refs_gc<br/>bump gc_request_count"]
keep --> epoch["other backends:<br/>gc_request_count changed"]
epoch --> gc["pgstat_gc_entry_refs<br/>release stale local refs"]
gc --> rel["pgstat_release_entry_ref<br/>refcount -= 1"]
rel --> z2{"last ref AND<br/>generation matches?"}
z2 -- yes --> free2["pgstat_free_entry"]
z2 -- no, reused --> backoff["release lock, keep entry"]
Transactional drop and crash safety
Section titled “Transactional drop and crash safety”Object drops are transactional. pgstat_drop_transactional() (and its
create twin) just push a PgStat_PendingDroppedStatsItem onto the current
subxact’s pending_drops list via
create_drop_transactional_internal(). AtEOXact_PgStat_DroppedStats()
resolves them: on commit it executes the drops, on abort it executes the
creates’ undo (drop the just-created entry). Subtransaction abort/commit
propagates the list up or discards it via
AtEOSubXact_PgStat_DroppedStats(). The same item set is handed to the
commit/abort WAL record through pgstat_get_transactional_drops(), and
pgstat_execute_transactional_drops() replays them during recovery and 2PC
finish — that is what keeps a standby’s stats from accumulating orphan
entries for tables dropped on the primary.
Durability
Section titled “Durability”pgstat_before_server_shutdown() (run by the checkpointer at clean
shutdown) forces a final flush and calls pgstat_write_statsfile(), which
seq-scans the dshash and writes every write_to_file kind plus all fixed
kinds to pg_stat/pgstat.stat, tagged with PGSTAT_FILE_FORMAT_ID. At
startup the startup process calls pgstat_restore_stats() →
pgstat_read_statsfile(). After a crash, pgstat_discard_stats() unlinks
the file and resets everything — crash-surviving cumulative numbers would
be untrustworthy, so they are deliberately thrown away.
The statsfile format is a flat self-describing stream: a format-ID header,
then per-entry records tagged 'F' (fixed kind), 'S' (hash-keyed
variable entry), or 'N' (name-keyed entry, used for replication-slot
stats whose runtime objid is a slot index that is not stable across
restart — the slot name is serialized instead and re-resolved on load via
the kind’s from_serialized_name callback), terminated by 'E'. Only
kinds with write_to_file = true are persisted; PGSTAT_KIND_BACKEND is
explicitly not written, because per-backend stats are meaningless across
a restart. Because the write happens in the single last process accessing
shared memory, pgstat_write_statsfile() takes no locks — it documents and
relies on that invariant.
What is gated, and where the knobs sit
Section titled “What is gated, and where the knobs sit”The whole subsystem is conditional on pgstat_track_counts
(track_counts GUC). pgstat_init_relation() honours it at relcache-open
time: if counting is off it nulls the relation’s pgstat_info and sets
pgstat_enabled = false, so the hot-path macro
pgstat_should_count_relation() short-circuits with no branch into the
stats system at all. Function-call timing is separately gated by
track_functions, and block-IO timing by track_io_timing. These are the
write-side knobs. stats_fetch_consistency is the only read-side knob
and it is consulted entirely inside pgstat_fetch_entry() /
pgstat_snapshot_fixed(); changing it sets force_stats_snapshot_clear so
the next read discards any stale snapshot. The split matters: a DBA tuning
monitoring overhead changes track_*; a DBA worried about cross-view
consistency in a reporting query changes stats_fetch_consistency. They
never interact.
Putting it together: one relation’s counter lifecycle
Section titled “Putting it together: one relation’s counter lifecycle”To anchor the pieces, follow a single counter end to end. A backend opens
table T; pgstat_init_relation() marks it countable but allocates
nothing. The first INSERT calls pgstat_assoc_relation() →
pgstat_prep_relation_pending(), which calls pgstat_get_entry_ref(create = true); the dshash entry for {RELATION, mydb, T} is found or inserted,
its refcount goes from 1 (sentinel) to 2 (this backend), a local
PgStat_EntryRef is cached, and a zeroed PgStat_TableStatus pending
buffer is linked onto pgStatPending. Each inserted tuple bumps
trans->tuples_inserted in a per-subxact record — plain arithmetic, no
locks, no shared memory. At COMMIT, AtEOXact_PgStat_Relations() folds
the transactional counts into the pending buffer’s counts and derives the
live/dead deltas. The next pgstat_report_stat() (forced at commit) calls
pgstat_relation_flush_cb(), which takes T’s entry LWLock, adds the
counts into the shared PgStat_StatTabEntry, and also rolls them up into
the pending database entry. A concurrent SELECT ... FROM pg_stat_user_tables runs pgstat_fetch_entry(), which in the default
cache mode copies T’s shared body once under a shared lock and memoizes
it. Eventually someone runs DROP TABLE T: pgstat_drop_relation() queues
a transactional drop; at that commit AtEOXact_PgStat_DroppedStats() calls
pgstat_drop_entry(), which tombstones the entry and removes the sentinel
refcount. If no other backend still holds a ref the body is freed
immediately; otherwise the GC epoch is bumped and the last backend to pass
through pgstat_get_entry_ref() reaps it — generation-checked against
reuse. Every mechanism in this document appears once in that walkthrough.
Position hints (as of 2026-06-05, REL_18 273fe94)
Section titled “Position hints (as of 2026-06-05, REL_18 273fe94)”| Symbol | File | Line |
|---|---|---|
PgStatShared_HashEntry (struct) | src/include/utils/pgstat_internal.h | 65 |
PgStat_EntryRef (struct) | src/include/utils/pgstat_internal.h | 135 |
PgStat_KindInfo (struct) | src/include/utils/pgstat_internal.h | 202 |
PgStat_ShmemControl (struct) | src/include/utils/pgstat_internal.h | 466 |
PgStat_Snapshot (struct) | src/include/utils/pgstat_internal.h | 510 |
PgStat_LocalState (struct) | src/include/utils/pgstat_internal.h | 548 |
pgstat_kind_builtin_infos[] | src/backend/utils/activity/pgstat.c | 282 |
pgstat_report_stat | src/backend/utils/activity/pgstat.c | 693 |
pgstat_fetch_entry | src/backend/utils/activity/pgstat.c | 933 |
pgstat_build_snapshot | src/backend/utils/activity/pgstat.c | 1122 |
pgstat_prep_pending_entry | src/backend/utils/activity/pgstat.c | 1267 |
pgstat_flush_pending_entries | src/backend/utils/activity/pgstat.c | 1341 |
StatsShmemInit | src/backend/utils/activity/pgstat_shmem.c | 155 |
pgstat_attach_shmem | src/backend/utils/activity/pgstat_shmem.c | 244 |
pgstat_init_entry | src/backend/utils/activity/pgstat_shmem.c | 301 |
pgstat_reinit_entry | src/backend/utils/activity/pgstat_shmem.c | 340 |
pgstat_acquire_entry_ref | src/backend/utils/activity/pgstat_shmem.c | 381 |
pgstat_get_entry_ref | src/backend/utils/activity/pgstat_shmem.c | 457 |
pgstat_release_entry_ref | src/backend/utils/activity/pgstat_shmem.c | 603 |
pgstat_gc_entry_refs | src/backend/utils/activity/pgstat_shmem.c | 758 |
pgstat_drop_entry_internal | src/backend/utils/activity/pgstat_shmem.c | 888 |
pgstat_drop_entry | src/backend/utils/activity/pgstat_shmem.c | 990 |
pgstat_count_heap_insert | src/backend/utils/activity/pgstat_relation.c | 374 |
AtEOXact_PgStat_Relations | src/backend/utils/activity/pgstat_relation.c | 551 |
pgstat_relation_flush_cb | src/backend/utils/activity/pgstat_relation.c | 816 |
pgstat_report_vacuum | src/backend/utils/activity/pgstat_relation.c | 210 |
AtEOXact_PgStat_DroppedStats | src/backend/utils/activity/pgstat_xact.c | 67 |
create_drop_transactional_internal | src/backend/utils/activity/pgstat_xact.c | 335 |
pgstat_drop_transactional | src/backend/utils/activity/pgstat_xact.c | 384 |
PGSTAT_KIND_RELATION (macro) | src/include/utils/pgstat_kind.h | 28 |
Source verification (as of 2026-06-05)
Section titled “Source verification (as of 2026-06-05)”Verified against the REL_18_STABLE working tree at commit 273fe94852b.
- Storage split confirmed.
pgstat.c’s file header (lines 18-39) states fixed-numbered stats live in plain shared memory and variable-numbered stats in DSA reachable via dshash, with the counters separate from the dshash entry (PgStatShared_HashEntry->body). ThePgStat_ShmemControlstruct embedsarchiver,bgwriter,checkpointer,io,slru,waldirectly and carrieshash_handlefor the dshash. Confirmed. - Kind table and the five focus kinds.
pgstat_kind_builtin_infos[]containsPGSTAT_KIND_RELATIONandPGSTAT_KIND_FUNCTION(fixed_amount = false, withflush_pending_cb) andPGSTAT_KIND_IO,PGSTAT_KIND_WAL,PGSTAT_KIND_SLRU(fixed_amount = true, withflush_static_cb/init_shmem_cb/snapshot_cb). Confirmed at the cited lines. - Refcount sentinel + generation race.
pgstat_init_entry()setsrefcount = 1/generation = 0;pgstat_reinit_entry()bumps generation;pgstat_release_entry_ref()comparesgeneration == entry_ref->generationbefore freeing. Confirmed. - Transactional drop into WAL.
pgstat_drop_transactional()→create_drop_transactional_internal()queues ontopending_drops;pgstat_get_transactional_drops()/pgstat_execute_transactional_drops()carry the items into the commit/abort record and recovery. Confirmed inpgstat_xact.c. - REL_18 sanity. No reference to a
XLOG2rmgr orB_DATACHECKSUMSWORKER_*backend type appears anywhere in these files; thePGSTAT_KIND_BACKENDkind and the per-backend IO/WAL flush flags (PGSTAT_BACKEND_FLUSH_IO/_WAL) are present, consistent with PG18. - Line numbers in the position-hint table were read directly from the tree on 2026-06-05; they are hints that decay — the symbol names are the durable anchors.
- Scope boundary. Wait events, backend status, and progress reporting
also live under
utils/activity/but are a different mechanism (the live view) and are deferred topostgres-wait-events-progress.md; autovacuum’s consumption of the dead-tuple counters is covered inpostgres-autovacuum.md. This document does not re-derive those.
Beyond PostgreSQL — Comparative Designs & Research Frontiers
Section titled “Beyond PostgreSQL — Comparative Designs & Research Frontiers”The UDP collector PostgreSQL left behind. The pre-15 stats collector is
worth remembering as a cautionary point in the design space. Funneling all
counter deltas to one process over loopback UDP gave perfect write
isolation (only the collector mutated the totals) but at three costs the
shared-memory design eliminates: lossiness (UDP datagrams could be dropped
silently under load, so the numbers were approximate by construction), a
single-process throughput ceiling, and read amplification (every
pg_stat query slurped a whole serialized file). The PG15 redesign is a
textbook migration from a message-passing shared-state model to a
shared-memory with local aggregation model — the §2 process-model
trade-off from Architecture of a Database System (Hellerstein et al.
2007) playing out in one subsystem’s history. The lesson generalizes:
when the shared state is a commutative aggregate (counters that only add),
local accumulation plus periodic conflict-free merge beats centralized
serialization, because addition is associative and the merge order does
not matter.
Per-thread instruments elsewhere. Oracle’s V$ fixed tables and SGA
statistics, SQL Server’s DMVs, and MySQL/InnoDB’s performance_schema
all converged on the same per-worker-buffer-plus-aggregate shape, differing
mainly in the threading model (threads sharing an address space can use
plain shared structs with atomics, where PostgreSQL’s process model forces
DSA and dshash). MySQL’s performance_schema makes the freshness/overhead
knob explicit and per-instrument (you can disable individual instruments),
whereas PostgreSQL’s knob is coarser: track_counts, track_functions,
track_io_timing gate whole categories, and stats_fetch_consistency
governs only the read side.
Reference-counted reclamation as a recurring pattern. The
refcount + tombstone + generation triad PostgreSQL uses for stats entries
is the same shape as epoch-based reclamation (EBR) and hazard pointers in
lock-free data-structure research (Michael 2004; Fraser 2004). The
gc_request_count epoch is essentially a coarse-grained, cooperative EBR:
instead of each reader publishing a precise epoch, a dropper bumps a
global counter and trusts that every backend will eventually pass through
pgstat_get_entry_ref() and notice. This is cheaper than true EBR (no
per-access epoch publication) at the cost of unbounded reclamation latency
— acceptable here because stale stats memory is small and bounded by the
number of dropped-but-still-referenced objects. The generation field is
exactly the ABA-problem guard that lock-free literature insists on: it
turns “is this the same pointer?” into “is this the same incarnation?”
Consistency models for monitoring reads. The stats_fetch_consistency
GUC is a small, pragmatic answer to a question the streaming/observability
research community treats at length: what isolation should a monitoring
read get? snapshot gives serializable-ish consistency across all counters
at one instant (at the cost of copying the whole table); cache gives
read-committed-per-object-with-memoization; none gives the freshest but
torn-est read. Modern time-series and observability systems (Prometheus,
the work around streaming aggregation) typically choose the equivalent of
none — eventually-consistent, per-series — because cross-series
consistency rarely matters for alerting. PostgreSQL’s default of cache
is a middle ground tuned for the common case of a pg_stat query touching
a handful of objects within one statement.
Frontier: pushing aggregation further down. A recurring research and
engineering theme is moving aggregation closer to the event source to cut
the merge cost — e.g. sketch-based approximate counters (Count-Min,
HyperLogLog) for high-cardinality cases, or hardware-assisted atomics. For
exact, low-cardinality activity counters PostgreSQL’s per-backend exact
tallies are the right call; the interesting open question is the
variable-numbered extreme — workloads with millions of short-lived
tables/partitions, where the dshash and the per-object DSA bodies become
the cost center and the reference-counting GC churns. Custom cumulative
stats kinds (the pgstat_register_kind() extension point, reserved IDs at
or above PGSTAT_KIND_CUSTOM_MIN) let extensions add their own kinds into
exactly this machinery, which is where experimentation with denser
representations would land.
Sources
Section titled “Sources”- Source tree: PostgreSQL REL_18_STABLE at commit
273fe94852b(2026-06-05):src/backend/utils/activity/pgstat.c— infrastructure: kind table,pgstat_report_stat(), pending list, snapshot build, statsfile I/O.src/backend/utils/activity/pgstat_shmem.c— shared-memory layer: DSA/dshash bootstrap,pgstat_get_entry_ref(), refcount/generation lifetime, drop and GC.src/backend/utils/activity/pgstat_relation.c— the relation kind: transactional tuple counts, vacuum/analyze reporting, flush callback.src/backend/utils/activity/pgstat_xact.c— transactional create/drop integration and WAL/2PC hand-off.src/include/utils/pgstat_internal.h— the shared/local struct definitions and the changecount inline helpers for single-writer fixed stats.src/include/pgstat.h,src/include/utils/pgstat_kind.h— public types, kind IDs,stats_fetch_consistencyenum, file format ID.
- Theory anchors:
- Database System Concepts (Silberschatz, Korth & Sudarshan, 7e) —
ch. 16 (planner statistics vs. activity counters), ch. 25 §25.3
(storage manager housekeeping / autovacuum role). Captured in
knowledge/research/dbms-general/database-system-concepts.md. - Architecture of a Database System (Hellerstein, Stonebraker &
Hamilton 2007) — §2 process models, §6 shared components: the
message-passing-vs-shared-memory trade-off the PG15 redesign embodies.
Anchored in
.omc/plans/postgres-paper-bibliography.md.
- Database System Concepts (Silberschatz, Korth & Sudarshan, 7e) —
ch. 16 (planner statistics vs. activity counters), ch. 25 §25.3
(storage manager housekeeping / autovacuum role). Captured in
- Lock-free reclamation context (comparative section): epoch-based reclamation and hazard-pointer literature (Michael 2004; Fraser 2004) — cited for design lineage of the refcount/generation/epoch triad, not as PostgreSQL sources.
- Cross-references within this code-analysis tree:
postgres-wait-events-progress.md— the live view (status, wait events, progress) sharing theutils/activity/home.postgres-autovacuum.md— consumer of the dead-tuple / change counters.postgres-shared-memory-ipc.md,postgres-lwlock-spinlock.md— the DSA, dshash, and LWLock substrate this subsystem is built on.postgres-overview-monitoring-stats.md— the subcategory router.