Skip to content

PostgreSQL Statistics — From a Collector Process to Extended Stats and Shared-Memory Cumulative Stats

Contents:

Why this subsystem had to evolve (the original limitation)

Section titled “Why this subsystem had to evolve (the original limitation)”

“Statistics” in PostgreSQL is not one subsystem but two, and the distinction is the key to reading this whole timeline. They are different in their data, their consumers, their update cadence, and their failure modes — and they evolved independently. Conflating them is the classic mistake that the module docs (postgres-cumulative-stats.md and postgres-extended-statistics.md) both open by warning against.

Planner statistics are sampled estimates of data shape. ANALYZE draws a fixed-size reservoir sample of a table’s rows and condenses each column into a pg_statistic row: the fraction of nulls, the average width, the number of distinct values (stadistinct), a most-common-value (MCV) list, and an equi-depth histogram of the remaining values. The planner reads these back to turn a WHERE clause into a selectivity, and selectivity into a row-count estimate — the single most consequential number in query planning. These statistics are refreshed occasionally, are approximate by design, and feed the cost-based optimizer.

The original limitation here is structural: per-column statistics assume column independence. When a query filters on two correlated columns — the textbook example is WHERE city = 'New York' AND state = 'NY', where the city already determines the state — the planner multiplies the two individual selectivities as if they were independent events. The product is far smaller than reality, the row estimate collapses toward one row, and the planner picks a nested-loop plan that is catastrophic when the real cardinality is millions. No amount of sampling more rows per column fixes this; the information that the columns are correlated simply is not present in a collection of single-column summaries. That gap is what the PG10–PG14 extended-statistics work set out to close.

Cumulative activity statistics are exact event tallies of what has happened over time: how many sequential scans this table has served, how many tuples were inserted or updated, how many dead tuples have piled up since the last vacuum, how many WAL bytes were written, how many blocks were read from disk versus found in the buffer cache. These are not sampled — they are incremented on every relevant operation — and they drive the entire pg_stat_* view family plus, critically, autovacuum’s decision of which table to vacuum next.

The original limitation here is operational. From the earliest versions through PG14, these counters lived in a single dedicated stats collector process. Every backend serialized its counter deltas into UDP datagrams and fired them at the collector over a loopback socket. The collector owned the only authoritative in-memory copy of every counter, periodically wrote the whole thing to a file, and answered every pg_stat query by having the asking backend read and deserialize that file. This design had three structural costs that grew worse with scale:

  1. Lossy. UDP datagrams could be silently dropped under load. Counters simply undercounted; there was no retransmission and no error.
  2. A throughput ceiling. One process received and merged the entire cluster’s stats traffic. On a busy many-core machine that single consumer became the bottleneck.
  3. Expensive to read. Every pg_stat query slurped and parsed an entire serialized stats file even to read one counter for one table.

Neither limitation was a bug — both designs were reasonable for their era. But as PostgreSQL pushed into larger schemas (where column correlation is the norm, not the exception) and larger, busier clusters (where the collector’s single-process model could not keep up), both subsystems had to be rebuilt. The two rebuilds happened on parallel tracks, summarized in the timeline below.

timeline
    title PostgreSQL Statistics Evolution
    section Planner statistics (data shape)
        pre-10 : Per-column pg_statistic only : nullfrac, stadistinct, MCV, histogram : assumes column independence
        PG10 2017 : CREATE STATISTICS : functional dependencies : multivariate n-distinct
        PG12 2019 : Multivariate MCV lists : correlated multi-clause selectivity
        PG14 2021 : Statistics on expressions : inheritance-aware ext stats (stxdinherit)
    section Cumulative statistics (activity counters)
        pre-15 : Single UDP stats-collector process : lossy, single bottleneck, whole-file reads
        PG15 2022 : Collector removed : DSA + dshash shared memory : per-backend pending buffers
        PG16 2023 : pg_stat_io : I/O as a fixed stat kind
        PG18 2025 : Per-backend I/O stats : WAL + byte-level columns in pg_stat_io

Era 0 — Per-column pg_statistic + the UDP stats-collector process

Section titled “Era 0 — Per-column pg_statistic + the UDP stats-collector process”

This is the baseline both tracks start from, and it is worth being precise about because every later release is a delta against it.

Planner side — pg_statistic, one row per column. ANALYZE (in analyze.c) drew a sample sized at 300 * statistics_target rows using a two-stage block sample plus the Vitter reservoir algorithm, then ran a type-specific compute_stats routine per column. The result was a single pg_statistic row holding stanullfrac, stawidth, stadistinct, and up to a handful of slots, each tagged with a STATISTIC_KIND_* code: an MCV list (STATISTIC_KIND_MCV), an equi-depth histogram of the non-MCV residue (STATISTIC_KIND_HISTOGRAM), and a physical/logical correlation (STATISTIC_KIND_CORRELATION). The slot design was deliberately open-ended — a type’s typanalyze could fill arbitrary statistic shapes — which is why later extended statistics could bolt new kinds on without a catalog redesign of the per-column path.

The planner consumed these in selfuncs.c: var_eq_const probed the MCV list for an exact frequency or fell back to (1 - sumcommon - nullfrac) / otherdistinct; scalarineqsel binary-searched the histogram and interpolated within the bracketing bin. All of this is per column. Multi-clause selectivity was the product of per-clause selectivities — the independence assumption baked into the very shape of the catalog. Mechanism detail lives in postgres-extended-statistics.md.

Cumulative side — the stats collector. A dedicated auxiliary process, the stats collector, was forked by the postmaster. Backends did not write shared counters directly; they accumulated deltas locally and periodically serialized them into UDP messages sent over a loopback datagram socket to the collector. The collector merged every message into its private hashtables — the only in-memory copy — and on a timer wrote the tables out to a stats file under pg_stat_tmp/. A backend running SELECT * FROM pg_stat_user_tables triggered a request that caused a fresh file to be written and then read and deserialized by the asking backend.

flowchart LR
    subgraph backends["Backends (pre-15)"]
        B1["backend 1<br/>local deltas"]
        B2["backend 2<br/>local deltas"]
        B3["backend N<br/>local deltas"]
    end
    COLL["stats collector process<br/>owns the ONLY copy<br/>of every counter"]
    FILE[("pgstat.stat file<br/>(whole-file serialize)")]
    READER["any backend reading<br/>a pg_stat_* view"]
    B1 -. "UDP datagram<br/>(may be dropped)" .-> COLL
    B2 -. "UDP datagram<br/>(may be dropped)" .-> COLL
    B3 -. "UDP datagram<br/>(may be dropped)" .-> COLL
    COLL -- "timer write" --> FILE
    FILE -- "read + deserialize<br/>entire file per query" --> READER

The three weaknesses named in the previous section — lossiness, the single-consumer ceiling, and whole-file reads — are all visible in this diagram. The collector is the only writer (so the totals are internally consistent) but it is also the only place the numbers exist, the only process draining the UDP firehose, and the source of a file that must be re-read in full on every query. PG15 attacked all three at once.

PG10 (2017) — CREATE STATISTICS: functional dependencies + multivariate n-distinct

Section titled “PG10 (2017) — CREATE STATISTICS: functional dependencies + multivariate n-distinct”

PG10 introduced the extended statistics feature and the CREATE STATISTICS command, with a new catalog pg_statistic_ext holding the definition of a declared statistics object (which columns, which kinds) and statistics data stored alongside. The first release shipped two kinds, both aimed squarely at the column-independence problem.

Functional dependencies (STATS_EXT_DEPENDENCIES, 'f'). A functional dependency a -> b says that knowing the value of column a determines the value of column b (the city/state example). PostgreSQL does not store a boolean “yes/no”; it stores a degree — a number in [0,1] measuring how often the dependency holds across the sample, computed in dependencies.c by sorting on a and checking, within each group of equal a values, whether b is constant. At planning time the dependency lets the estimator stop multiplying correlated clauses: if a functionally determines b, the clause on b adds little independent selectivity, so the combined estimate is pulled back up toward the selectivity of a alone.

Multivariate n-distinct (STATS_EXT_NDISTINCT, 'd'). A single stadistinct answers “how many distinct values in column a?”. The multivariate version answers “how many distinct combinations of (a, b, c)?” — computed with the same Haas–Stokes Duj1 estimator applied to column groups (mvdistinct.c). This directly fixes GROUP BY a, b cardinality and the denominator of grouped-equality selectivity, which the independence product gets badly wrong whenever the columns are correlated.

The structural shift is that the definition and the data are now separate catalog objects from pg_statistic, and they are opt-in: a DBA must run CREATE STATISTICS naming the column group. The per-column path was untouched; extended stats are a correction layer the planner consults in addition. ANALYZE learned to walk the pg_statistic_ext rows for a table and, for each, call into the extended-stats builder to fill the requested kinds. The estimator-side combination rule and the build entry point are documented in postgres-extended-statistics.md.

PG12 (2019) — Multivariate MCV lists in extended statistics

Section titled “PG12 (2019) — Multivariate MCV lists in extended statistics”

The PG10 kinds handle aggregate correlation — how many distinct combinations exist, and whether one column determines another — but they cannot estimate the selectivity of a specific combination of values. Functional dependencies assume the dependency holds uniformly; they say nothing about which (city, state) pairs are common. For skewed, correlated data that uniformity assumption itself fails.

PG12 added the third extended-stats kind: the multivariate MCV list (STATS_EXT_MCV, 'm', built in mcv.c). It is the multi-column analogue of the per-column MCV list — a list of the most common value combinations across the declared columns, each with its observed frequency, plus a base-frequency (the independence-assumption frequency) so the planner can tell how far each combination departs from independence. ANALYZE builds it by sampling combinations and keeping those whose frequency clears the same kind of statistical cutoff used for per-column MCVs.

The estimator change is the payoff. Instead of either fully multiplying clause selectivities (independence) or fully collapsing them (a strict functional dependency), the multivariate MCV path lets the planner read the actual joint frequency for matched combinations and apply a blended rule for the residue:

P(a, b) = f * Min(P(a), P(b)) + (1 - f) * P(a) * P(b)

where f reflects how much of the distribution the MCV list explains. The MCV list covers the common, correlated head of the distribution exactly; the independence product handles the long tail. This is strictly more expressive than the PG10 kinds, which is why a typical CREATE STATISTICS s (ndistinct, dependencies, mcv) ON a, b FROM t now requests all three — they are complementary, not redundant. The combination logic (statext_clauselist_selectivity, mcv_clauselist_selectivity) is detailed in postgres-extended-statistics.md.

The structural shape did not change in PG12 — no new catalog, no new top-level path. A third 'm' kind was simply added to the same pg_statistic_ext enabled-kinds set and the same ANALYZE build loop, which is exactly what the open-ended PG10 design was built to allow.

PG14 (2021) — Statistics on expressions, and inheritance-aware ext stats

Section titled “PG14 (2021) — Statistics on expressions, and inheritance-aware ext stats”

By PG13 the extended-statistics catalog could describe correlation among columns, but only columns. A predicate over an expressionWHERE lower(email) = '...', WHERE (a + b) > 100, WHERE date_trunc('day', ts) = ... — had no statistics at all; the planner fell back to hard-coded default selectivities, which are routinely wrong by orders of magnitude.

PG14 closed this with two changes.

Statistics on expressions (STATS_EXT_EXPRESSIONS, 'e'). CREATE STATISTICS learned to accept arbitrary expressions, not just bare column names. ANALYZE evaluates each declared expression over the sample and then runs the ordinary per-column statistics machinery on the resulting values — producing nullfrac, stadistinct, an MCV list, and a histogram for the expression, stored in the extended-stats data. In extended_stats.c this shows up as a dedicated expression path (the STATS_EXT_EXPRESSIONS branch in the build and serialize loops): the expressions carried by a statistics object are evaluated, folded through eval_const_expressions, and analyzed one at a time. The effect is that a single-expression CREATE STATISTICS object behaves like “give this expression the same statistics a real column would have,” which is the common case users actually want. A single-column functional-index workaround existed before, but it required an actual index and the maintenance cost that comes with it; expression stats decouple the statistics from the index.

Inheritance-aware extended statistics (stxdinherit). The pg_statistic_ext_data catalog gained a boolean stxdinherit column and a two-column primary key (stxoid, stxdinherit). This mirrors the long-standing stainherit flag on per-column pg_statistic: a partitioned or inherited table now stores two sets of extended statistics — one describing only the parent’s own rows (stxdinherit = false) and one describing the parent plus all inheritance/partition children (stxdinherit = true). The planner picks whichever matches the scope of the scan it is costing. Before this, extended statistics on a partitioned root were effectively unusable because there was no place to record the all-children distribution. The shape change is visible directly in the catalog header:

// pg_statistic_ext_data — src/include/catalog/pg_statistic_ext_data.h
Oid stxoid; /* statistics object this data is for */
bool stxdinherit; /* true if inheritance children are included */
/* ... pg_ndistinct, pg_dependencies, pg_mcv_list payload columns ... */

With PG14 the planner-statistics track reaches essentially its current shape: per-column pg_statistic for the independence baseline, plus opt-in extended statistics over column groups and expressions, each in own-rows and with-children variants. Everything from here is refinement inside that frame — the mechanism as it stands on REL_18 is in postgres-extended-statistics.md.

PG15 (2022) — The collector dies: shared-memory cumulative statistics

Section titled “PG15 (2022) — The collector dies: shared-memory cumulative statistics”

This is the largest single change on either track, and it switched tracks: nothing about planner statistics changed, but the cumulative statistics subsystem was completely rebuilt. PG15 (commit 5891c7a8e, Andres Freund) removed the stats collector process entirely and replaced the UDP-message / single-owner / whole-file model with a shared-memory subsystem under src/backend/utils/activity/.

The redesign attacks each of Era 0’s three weaknesses directly:

  • Lossy → exact. There is no more UDP. Backends update counters in shared memory (via buffered local pending entries), so a delta is never silently dropped.
  • Single bottleneck → no central consumer. No process drains a firehose; every backend writes its own slice of shared memory under a fine-grained lock.
  • Whole-file reads → targeted lookups. A pg_stat read finds the one entry it needs in a hashtable instead of deserializing the whole file.

The data structure split. Stats are partitioned into two populations. Variable-numbered kinds — per-relation, per-function, per-replication-slot, per-subscription, per-backend — live in a DSA-backed dshash keyed by PgStat_HashKey {kind, dboid, objid}. The dshash entry holds only a refcount, a generation counter, a “dropped” flag, and a dsa_pointer to the separately allocated, variable-size stats body. Fixed-numbered kinds — checkpointer, bgwriter, archiver, WAL, SLRU, and (from PG16) IO — live in a plain shared-memory control block, one slot apiece, because there is exactly one of each per cluster.

The hot path stays local. A backend does not touch shared memory on every tuple. It buffers updates in process-local pending entries reached through a local hashtable (pgStatEntryRefHash) of reference-counted PgStat_EntryRef handles. pgstat_report_stat() flushes the pending list into shared memory under each entry’s dedicated LWLock at most once per second (forced at commit). Lifetime is reference-counted, drops are transactional (deferred to commit and written into the commit/abort WAL record so replicas and crash recovery stay consistent), and the checkpointer serializes the whole thing to pg_stat/pgstat.stat at shutdown for the startup process to reload — or discard after a crash. All of this mechanism (the dshash entry layout, the refcount/generation race handling, the flush callbacks) is documented in postgres-cumulative-stats.md; it is not re-derived here.

The before/after structural shift:

flowchart TB
    subgraph before["Before — PG14 and earlier"]
        bB1["backend"] -. "UDP delta" .-> bC["stats collector<br/>(sole owner)"]
        bC --> bF[("pgstat.stat<br/>whole file")]
        bF -- "full deserialize" --> bR["reader backend"]
    end
    subgraph after["After — PG15+"]
        aB["backend<br/>local pending entries<br/>(PgStat_EntryRef)"]
        aB -- "pgstat_report_stat()<br/>flush <= 1/sec under per-entry LWLock" --> aSHM
        subgraph aSHM["shared memory"]
            aDSH["dshash: variable kinds<br/>key {kind, dboid, objid}<br/>-> dsa_pointer to body"]
            aFIX["fixed control block<br/>checkpointer, bgwriter,<br/>archiver, WAL, SLRU, IO"]
        end
        aSHM -- "targeted lookup" --> aR["reader backend"]
        aSHM -. "checkpointer writes at shutdown" .-> aFILE[("pg_stat/pgstat.stat")]
        aFILE -. "startup reads<br/>discard after crash" .-> aSHM
    end
    before --> after

The lesson PostgreSQL drew is the classic message-passing-versus-shared-memory trade-off: the collector’s UDP model gave perfect write isolation (only one process ever mutated the totals) but paid for it in loss, a serialization bottleneck, and read cost. Shared memory with per-entry locking gives up the “single writer” simplicity in exchange for exactness, scalability, and cheap reads — and recovers consistency through reference counting and transactional drops rather than through a single owner.

PG16 (2023) — pg_stat_io: I/O as a first-class cumulative stat kind

Section titled “PG16 (2023) — pg_stat_io: I/O as a first-class cumulative stat kind”

Before PG16, I/O accounting was scattered: block-read and block-hit counters lived on per-relation stats, and there was no cluster-wide, broken-down view of who was doing I/O for what reason. You could not easily answer “how much of my read traffic is the bulk-read strategy used by sequential scans versus normal buffer accesses, and how much is being read by vacuum versus the checkpointer?”

PG16 added a brand-new fixed cumulative stat kind, PGSTAT_KIND_IO, and the pg_stat_io view over it. Because the cumulative subsystem had already been rebuilt in PG15 around pluggable stat kinds, adding I/O was a matter of registering a new fixed kind with its own flush callback — not a redesign. This is the PG15 architecture paying off: a new kind slots into the same shared-memory control block and the same pgstat_report_stat() flush discipline.

pg_stat_io reports counts dimensioned by backend type (client backend, autovacuum worker, checkpointer, background writer, …), I/O object (relation, temp relation), and I/O context (normal, vacuum, bulkread, bulkwrite). Each cell tallies reads, writes, extends, hits, evictions, reuses, fsyncs, and so on. Suddenly the operator can attribute physical I/O to a specific (backend type, context) pair — the single most requested I/O observability feature for years. The new kind is visible in the stat-kind table that survives on REL_18:

// pgstat_kind.h — fixed-numbered kinds, REL_18
#define PGSTAT_KIND_ARCHIVER 7
#define PGSTAT_KIND_BGWRITER 8
#define PGSTAT_KIND_CHECKPOINTER 9
#define PGSTAT_KIND_IO 10 /* added PG16 */
#define PGSTAT_KIND_SLRU 11
#define PGSTAT_KIND_WAL 12

PG18 (2025) — Per-backend I/O, WAL and byte-level columns in pg_stat_io

Section titled “PG18 (2025) — Per-backend I/O, WAL and byte-level columns in pg_stat_io”

PG18 refined the I/O accounting along two axes, again without touching the PG15 architecture.

Per-backend I/O statistics. PG18 added PGSTAT_KIND_BACKEND (kind 6) — a variable-numbered kind keyed per backend — and the pg_stat_get_backend_io() function, so I/O can now be attributed to an individual session, not only to a backend type in aggregate. A long-running session hammering the disk can be identified directly rather than inferred from the type-level totals. As a variable-numbered kind it lives in the same dshash machinery that holds per-relation and per-function stats; the entry is created and dropped over the session’s lifetime through the same reference-counted PgStat_EntryRef path.

WAL and byte-level columns folded into pg_stat_io. The PG18 pg_stat_io view gained byte-granular columns (read_bytes, write_bytes, and an op_bytes describing the unit size of an operation) and an object column that distinguishes the WAL I/O object from relation I/O — so WAL reads and writes are now visible in the same view and the same byte units as relation I/O. The view definition on REL_18 reflects this directly:

-- pg_stat_io — src/backend/catalog/system_views.sql (REL_18)
CREATE VIEW pg_stat_io AS
SELECT
b.backend_type,
b.object, -- relation vs WAL (PG18)
b.context,
...
b.read_bytes, -- byte-level accounting (PG18)
...

Counting in bytes rather than only in fixed-size blocks matters because WAL and some I/O paths operate in units other than the 8 KB page; op_bytes records the unit so a count can be turned into an accurate byte figure.

These PG16/PG18 changes are deliberately additive: each is a new stat kind or new columns on an existing kind, riding the PG15 shared-memory infrastructure. None of them reintroduces a central process or a whole-file read.

On REL_18 the two tracks have each reached a stable shape that the module docs describe in mechanism-level detail:

Planner statistics. ANALYZE (src/backend/commands/analyze.c) still produces per-column pg_statistic rows as the independence baseline: nullfrac, width, stadistinct, MCV, histogram, correlation. Layered on top, extended statistics (src/backend/statistics/extended_stats.c and the dependencies.c / mvdistinct.c / mcv.c builders) provide opt-in correlation-aware statistics over declared column groups and expressions, with four kinds — functional dependencies ('f'), multivariate n-distinct ('d'), multivariate MCV ('m'), and expression stats ('e') — each stored in own-rows and with-children (stxdinherit) variants in pg_statistic_ext_data. The estimator combines correlated clauses through the MCV-plus-residue blend rather than the independence product. The full mechanism is in postgres-extended-statistics.md.

Cumulative statistics. There is no stats collector process. The shared-memory subsystem in src/backend/utils/activity/pgstat.c (and its pgstat_*.c siblings) holds variable-numbered kinds in a DSA/dshash and fixed-numbered kinds in a control block. On REL_18 the built-in kinds run PGSTAT_KIND_DATABASE (1) through PGSTAT_KIND_WAL (12) — including PGSTAT_KIND_BACKEND (6, per-backend, PG18) and PGSTAT_KIND_IO (10, PG16) — with the IDs at and above 24 reserved for custom (extension-defined) stat kinds. Backends buffer pending deltas locally and flush under per-entry LWLocks at most once per second; the checkpointer serializes to pg_stat/pgstat.stat at shutdown and the startup process reloads or discards after a crash. The full mechanism is in postgres-cumulative-stats.md. The live-state counterpart — wait events and progress reporting, which is the other half of monitoring and is snapshot-not-cumulative — is covered in postgres-wait-events-progress.md.

Next step. PostgreSQL 19 (the next major release, in development) is expected to continue the additive pattern — new and extension-defined cumulative stat kinds slotting onto the PG15 shared-memory frame, and incremental planner-statistics refinement — rather than another structural rebuild of either track. The PG15 collector removal and the PG10–PG14 extended-statistics frame are the two architectural decisions everything since has built on.

Module docs (current-state mechanism — do not re-derive):

Release notes (release attributions):

  • PostgreSQL 10 release notes — CREATE STATISTICS (functional dependencies, multivariate n-distinct).
  • PostgreSQL 12 release notes — multivariate MCV lists in extended statistics.
  • PostgreSQL 14 release notes — statistics on expressions; inheritance-aware extended statistics.
  • PostgreSQL 15 release notes — stats collector removed; cumulative statistics moved to shared memory (commit 5891c7a8e).
  • PostgreSQL 16 release notes — pg_stat_io.
  • PostgreSQL 18 release notes — per-backend I/O statistics; WAL and byte-level columns in pg_stat_io.

Key source files (observable on REL_18, commit 273fe94):

  • src/backend/utils/activity/pgstat.c — cumulative stats core.
  • src/include/utils/pgstat_kind.h — the stat-kind ID registry (PGSTAT_KIND_*).
  • src/backend/statistics/extended_stats.c — extended-statistics build and estimate orchestration.
  • src/backend/commands/analyze.cANALYZE sampling and per-column compute_stats.
  • src/include/catalog/pg_statistic_ext.h, src/include/catalog/pg_statistic_ext_data.h — extended-stats catalogs (definition and data, including stxdinherit).
  • src/backend/catalog/system_views.sqlpg_stat_io and the pg_stat_* view definitions.