PostgreSQL Statistics — ANALYZE, pg_statistic, and Extended Statistics
Contents:
- Theoretical Background
- Common DBMS Design
- PostgreSQL’s Approach
- Source Walkthrough
- Source verification (as of 2026-06-05)
- Beyond PostgreSQL — Comparative Designs & Research Frontiers
- Sources
Theoretical Background
Section titled “Theoretical Background”A cost-based optimizer cannot choose between an index scan and a
sequential scan, or between a nested loop and a hash join, without first
guessing how many rows each operator will emit. That guess — the
cardinality estimate — is the single most consequential number in
query planning: cost is roughly linear in row counts, and an estimate
that is wrong by an order of magnitude routinely produces a plan that is
wrong by an order of magnitude. The estimate is itself derived from a
prior quantity, the selectivity of a predicate: the fraction of rows
a WHERE clause is expected to pass.
Database System Concepts (Silberschatz, Korth & Sudarshan, 7e), Ch. 16 “Query Optimization”, §16.3 “Statistical Information for Cost Estimation”, names the catalog quantities an optimizer keeps per relation and per attribute. Among them (§16.3.1):
- n_r — “the number of tuples in the relation r.”
- V(A, r) — “the number of distinct values that appear in the
relation r for attribute A.” This is the pivot for equality
selectivity: under the uniform-distribution assumption, “the number of
tuples that would satisfy a selection of the form A = a … is estimated
to be n_r / V(A, r)” (§16.3.2). Equivalently the selectivity is
1 / V(A, r).
But uniformity is a fiction for most real columns, and the textbook says so directly: a histogram is the standard repair. §16.3.1 defines a histogram as a structure in which “the values for the attribute are divided into a number of ranges, and with each range the histogram associates the number of tuples whose attribute value lies in that range.” It distinguishes two shapes:
- An equi-width histogram “divides the range of values into equal-sized ranges.”
- An equi-depth histogram “adjusts the boundaries of the ranges such that each range has the same number of values”; the text notes that “equi-depth histograms are preferred to equi-width histograms since they provide better estimates.”
The textbook also flags the practical refinement PostgreSQL leans on hardest — splitting off the frequent values: a histogram “could store the most common values along with their counts” so that the residual ranges describe only the non-frequent tail (§16.3.1). And on how the catalog numbers are obtained, §16.3.4 is explicit that exhaustive scanning is unnecessary: “a fairly accurate histogram can be computed from a sample of a few thousand tuples.” Sampling, not full enumeration, is the expected method.
Two further theoretical facts frame everything below. First, estimating
V(A, r) — the number of distinct values — from a sample is genuinely
hard: a value that appears once in the sample might be unique in the
table or might be one of millions. The statistics literature (Haas &
Stokes, “Estimating the number of classes in a finite population”, and
the IBM Research Report RJ 10025 cited verbatim in the PostgreSQL source)
gives bias-corrected estimators that PostgreSQL adopts wholesale. Second,
when a predicate touches several columns, the joint selectivity is
P(a AND b). The cheap and universal assumption is independence,
P(a,b) = P(a) * P(b). When the columns are correlated (city implies
state; model implies manufacturer) the independence product is wildly
too small, and no amount of per-column accuracy fixes it — the error is
structural. That structural error is exactly what extended statistics
exist to address.
Common DBMS Design
Section titled “Common DBMS Design”Across systems the statistics machinery decomposes into the same parts, and the design choices cluster around a handful of axes.
Sample, don’t scan
Section titled “Sample, don’t scan”Full-table statistics are exact but cost a full scan per ANALYZE; on a terabyte table that is prohibitive. Every production optimizer samples. The design questions are how to sample (uniform row sample vs. block sample vs. two-stage), how much (a fixed budget vs. a fraction of the table), and whether the sample is biased. Block-level sampling reads far fewer pages but clusters correlated rows; pure row sampling is unbiased but I/O-heavy. The size question has a clean theoretical answer for histograms: Chaudhuri, Motwani & Narasayya (SIGMOD 1998, “Random sampling for histogram construction: how much is enough?”) show the required sample size grows with histogram resolution but only logarithmically with table size — so a fixed per-target budget suffices.
One condensed per-column record
Section titled “One condensed per-column record”The canonical output is a small fixed-shape record per column holding: the null fraction, the average width, a distinct-value count, a most-common-value (MCV) list with frequencies, and a histogram of the remaining values. The MCV-plus-histogram split is near-universal because it captures both the spiky head of a skewed distribution (the MCVs) and the smooth tail (the histogram) without storing the whole frequency table. A physical/logical correlation number is a PostgreSQL-flavored extra that lets the cost model judge how sequential an index scan’s heap fetches will be.
Distinct-value estimation is a special headache
Section titled “Distinct-value estimation is a special headache”Because V(A, r) drives equality selectivity, and because it is the
hardest quantity to recover from a sample, systems either (a) use a
sampling estimator (Haas-Stokes and relatives), (b) maintain it exactly
via sketches (HyperLogLog), or (c) let the DBA override it. PostgreSQL
does (a) with a manual-override escape hatch.
Independence, and the escape from it
Section titled “Independence, and the escape from it”The default joint-selectivity model is independence — it needs no extra storage and composes trivially. Its failure on correlated columns is well known, and the modern answer is opt-in multi-column statistics: the DBA declares which column groups are correlated, and the system stores joint information (multi-column distinct counts, multi-column MCV lists, or compact functional-dependency summaries) for just those groups. This is the SQL Server “multi-column statistics” / Oracle “extended statistics” / PostgreSQL “CREATE STATISTICS” family. It is opt-in because the combinatorics of all column groups is exponential.
Theory ↔ PostgreSQL mapping
Section titled “Theory ↔ PostgreSQL mapping”| Textbook concept (DSC §16.3) | PostgreSQL realization |
|---|---|
| n_r (relation cardinality) | pg_class.reltuples (set by VACUUM, read as rel->tuples) |
| V(A, r) (distinct values) | pg_statistic.stadistinct; Haas-Stokes Duj1 estimator in compute_scalar_stats |
| selectivity 1/V(A,r) | var_eq_const fallback (1 - sumcommon - nullfrac)/otherdistinct |
| most-common values | STATISTIC_KIND_MCV slot; analyze_mcv_list cutoff |
| equi-depth histogram | STATISTIC_KIND_HISTOGRAM slot; evenly-spaced quantiles over sorted residue |
| sampling a few thousand tuples | acquire_sample_rows, targrows = 300 * statistics_target |
| multi-column correlation repair | extended stats: pg_statistic_ext / statistics/ (ndistinct, dependencies, mcv) |
PostgreSQL’s Approach
Section titled “PostgreSQL’s Approach”PostgreSQL splits the statistics world into two halves that meet only in the catalog.
The producing half is ANALYZE (also run inside autovacuum and as a
phase of VACUUM ANALYZE). analyze_rel acquires a relation lock and
dispatches to do_analyze_rel, which: (1) decides which columns to
process and builds a VacAttrStats per column via examine_attribute;
(2) computes a sample-size budget targrows as the max of each column’s
minrows; (3) draws one shared sample with acquire_sample_rows; (4)
runs each column’s compute_stats callback over that sample; (5) writes
the results into pg_statistic with update_attstats; and (6) calls
BuildRelationExtStatistics to compute any declared extended-statistics
objects from the same sample.
The consuming half is the selectivity functions in
utils/adt/selfuncs.c, invoked by the planner’s clauselist_selectivity
machinery (owned by postgres-cost-model.md). eqsel / scalarltsel
and friends pull the pg_statistic row through get_attstatsslot and
turn the stored MCV list, histogram, and stadistinct into a number in
[0,1].
Three design commitments are worth naming up front.
A fixed sample budget, not a table fraction
Section titled “A fixed sample budget, not a table fraction”std_typanalyze sets stats->minrows = 300 * stats->attstattarget,
citing Chaudhuri-Motwani-Narasayya Corollary 1 to Theorem 5 directly in
the source. With the default default_statistics_target = 100 that is
30,000 rows — independent of whether the table has a thousand rows or a
trillion. The “300×” constant absorbs the logarithmic table-size term so
the budget never has to scale with n.
Five stat-kind slots per pg_statistic row
Section titled “Five stat-kind slots per pg_statistic row”A pg_statistic tuple carries five generic (stakind, staop, stacoll, stanumbers[], stavalues[]) slots (STATISTIC_NUM_SLOTS = 5). Slot kinds
are small integers: STATISTIC_KIND_MCV = 1, STATISTIC_KIND_HISTOGRAM = 2, STATISTIC_KIND_CORRELATION = 3 (plus MCELEM/RANGE kinds for arrays
and range types). The slot scheme is what lets a type-specific
typanalyze store arbitrary statistic shapes — a scalar column fills MCV
- histogram + correlation; an array column fills a most-common-elements slot instead — without schema changes.
stadistinct: a sign-overloaded field
Section titled “stadistinct: a sign-overloaded field”stadistinct packs two meanings into one float4. A positive value
is an absolute distinct-value count. A negative value is a fraction
of the row count: stadistinct = -1 means “unique” (distinct count
scales 1:1 with rows), -0.5 means “half as many distinct values as
rows.” compute_scalar_stats switches an absolute estimate to the
relative form when it exceeds 10% of the table, so that the count tracks
the row count as the table grows between ANALYZE runs.
The four scalar compute_stats routines
Section titled “The four scalar compute_stats routines”std_typanalyze selects one of three standard algorithms based on which
operators the column’s type offers:
compute_scalar_stats— when both<and=exist (the common case: any orderable type). Produces null-frac, width,stadistinct, MCV, histogram, and correlation.compute_distinct_stats— when only=exists (hashable but not orderable). Produces everything except the histogram and correlation.compute_trivial_stats— when neither exists. Only null-frac and width.
(A type can supply its own typanalyze — e.g. ts_typanalyze for
tsvector, array_typanalyze for arrays — but the three above cover the
default path.)
Extended statistics: opt-in joint information
Section titled “Extended statistics: opt-in joint information”CREATE STATISTICS s (ndistinct, dependencies, mcv) ON a, b FROM t
registers a pg_statistic_ext catalog row. During ANALYZE,
BuildRelationExtStatistics walks the registered objects and, for each
requested kind character (STATS_EXT_NDISTINCT = 'd',
STATS_EXT_DEPENDENCIES = 'f', STATS_EXT_MCV = 'm',
STATS_EXT_EXPRESSIONS = 'e'), calls the corresponding builder and
serializes the result into pg_statistic_ext_data. At planning time
statext_clauselist_selectivity and dependencies_clauselist_selectivity
consult these to estimate groups of clauses jointly, before the
independence-based per-column path runs on whatever clauses remain.
Source Walkthrough
Section titled “Source Walkthrough”Anchor on symbol names, not line numbers. A function or struct name is the stable handle across releases; the line numbers in the position-hint table are quick hints observed at commit
273fe94(REL_18_STABLE, 2026-06-05). Usegit grep -n '<symbol>' src/backend/to relocate.
The walkthrough follows the data: first the producing path (ANALYZE →
pg_statistic), then the consuming path (selfuncs.c), then the
extended-statistics builders and estimators.
flowchart TD
AR["analyze_rel()<br/>lock relation, pick acquirefunc"] --> DAR["do_analyze_rel()"]
DAR --> EA["examine_attribute()<br/>build VacAttrStats per column<br/>std_typanalyze picks compute_stats"]
DAR --> CER["ComputeExtStatisticsRows()<br/>bump targrows for ext stats"]
EA --> ASR["acquire_sample_rows()<br/>two-stage block sample + Vitter<br/>targrows = max(300*stattarget)"]
CER --> ASR
ASR --> CS["per-column compute_stats():<br/>compute_scalar_stats /<br/>compute_distinct_stats /<br/>compute_trivial_stats"]
CS --> UA["update_attstats()<br/>write pg_statistic rows"]
ASR --> BRE["BuildRelationExtStatistics()<br/>ndistinct / dependencies / mcv<br/>into pg_statistic_ext_data"]
UA --> CAT[("pg_statistic")]
BRE --> CATX[("pg_statistic_ext_data")]
Entry and sampling (analyze.c)
Section titled “Entry and sampling (analyze.c)”analyze_rel is the command-level entry; it takes the lock, rejects
pg_statistic itself, and chooses the acquire function (the heap default
is acquire_sample_rows). do_analyze_rel then computes the sample
budget. The budget is the maximum minrows over all analyzed columns,
floored at 100, and further raised by extended statistics:
// do_analyze_rel — analyze.c (sample-budget computation)targrows = 100;for (i = 0; i < attr_cnt; i++){ if (targrows < vacattrstats[i]->minrows) targrows = vacattrstats[i]->minrows;}/* ... index expression columns considered too ... */minrows = ComputeExtStatisticsRows(onerel, attr_cnt, vacattrstats);if (targrows < minrows) targrows = minrows;Sampling is the two-stage method documented in the header comment: stage
one (BlockSampler_Init / BlockSampler_Next, driven through a
ReadStream) picks up to targrows random blocks; stage two applies
Vitter’s reservoir algorithm to the rows of those blocks, so every row
has equal selection probability with a single pass:
// acquire_sample_rows — analyze.c (Vitter reservoir core)if (numrows < targrows) rows[numrows++] = ExecCopySlotHeapTuple(slot);else{ if (rowstoskip < 0) rowstoskip = reservoir_get_next_S(&rstate, samplerows, targrows);
if (rowstoskip <= 0) { /* Found a suitable tuple, replacing one old tuple at random */ int k = (int) (targrows * sampler_random_fract(&rstate.randstate)); Assert(k >= 0 && k < targrows); heap_freetuple(rows[k]); rows[k] = ExecCopySlotHeapTuple(slot); } rowstoskip -= 1;}samplerows += 1;The sampled rows are returned in physical (item-pointer) order — the
final qsort_interruptible(rows, ..., compare_rows, ...) enforces this —
because compute_scalar_stats relies on physical order to measure
correlation.
Choosing the algorithm (std_typanalyze)
Section titled “Choosing the algorithm (std_typanalyze)”examine_attribute builds the VacAttrStats and, unless the type has a
custom typanalyze, calls std_typanalyze, which both selects the
compute_stats callback and sets minrows from the Chaudhuri-Motwani-
Narasayya bound:
// std_typanalyze — analyze.cif (OidIsValid(eqopr) && OidIsValid(ltopr)){ /* Seems to be a scalar datatype */ stats->compute_stats = compute_scalar_stats; /* r = 4 * k * ln(2n/gamma) / f^2 ; with f=0.5,gamma=0.01 => ~300*k */ stats->minrows = 300 * stats->attstattarget;}else if (OidIsValid(eqopr)){ /* We can still recognize distinct values */ stats->compute_stats = compute_distinct_stats; stats->minrows = 300 * stats->attstattarget;}else{ /* Can't do much but the trivial stuff */ stats->compute_stats = compute_trivial_stats; stats->minrows = 300 * stats->attstattarget;}attstattarget here has already been resolved: examine_attribute reads
the per-column pg_attribute.attstattarget, treating null as -1
(“use default”), and a value of 0 disables the column entirely
(examine_attribute returns NULL). std_typanalyze then substitutes
default_statistics_target when the target is still negative.
The distinct-value estimator (compute_scalar_stats)
Section titled “The distinct-value estimator (compute_scalar_stats)”After sorting the sample, compute_scalar_stats counts ndistinct,
nmultiple (values seen more than once), and toowide_cnt. The
distinct-value count is the Haas-Stokes Duj1 estimator, quoted with
its IBM Research Report citation in the source:
// compute_scalar_stats — analyze.c (Haas-Stokes Duj1 distinct estimator)int f1 = ndistinct - nmultiple + toowide_cnt; /* values seen once */int d = f1 + nmultiple; /* distinct in sample */double n = samplerows - null_cnt; /* sample non-nulls */double N = totalrows * (1.0 - stats->stanullfrac); /* table non-nulls */double stadistinct;
if (N > 0) stadistinct = (n * d) / ((n - f1) + f1 * n / N);else stadistinct = 0;
/* Clamp to sane range in case of roundoff error */if (stadistinct < d) stadistinct = d;if (stadistinct > N) stadistinct = N;stats->stadistinct = floor(stadistinct + 0.5);Two special cases bracket this: if no value repeats (nmultiple == 0) the
column is treated as unique, stadistinct = -1.0 * (1.0 - stanullfrac);
if every sampled value repeats and the track list is complete, the column
is assumed to have exactly that many distinct values. And the >10% rule
converts an absolute count to the negative (relative) form:
// compute_scalar_stats — analyze.c (relative-distinct conversion)if (stats->stadistinct > 0.1 * totalrows) stats->stadistinct = -(stats->stadistinct / totalrows);The MCV cutoff and the equi-depth histogram
Section titled “The MCV cutoff and the equi-depth histogram”The MCV list is not simply filled to the target. analyze_mcv_list
decides how many of the tracked values are “significantly more common”
than the non-MCV average; only a complete list (all sample values fit and
are believed to be all the table’s values) bypasses that test. The
histogram is then built over the residue — the sample minus the MCV
entries — as evenly spaced quantiles, i.e. an equi-depth histogram:
// compute_scalar_stats — analyze.c (equi-depth histogram quantiles)num_hist = ndistinct - num_mcv;if (num_hist > num_bins) num_hist = num_bins + 1;if (num_hist >= 2){ /* i'th value is values[(i*(nvals-1))/(num_hist-1)], computed by * carrying integer + fractional parts to avoid overflow */ delta = (nvals - 1) / (num_hist - 1); deltafrac = (nvals - 1) % (num_hist - 1); pos = posfrac = 0; for (i = 0; i < num_hist; i++) { hist_values[i] = datumCopy(values[pos].value, ...); pos += delta; posfrac += deltafrac; if (posfrac >= (num_hist - 1)) { pos++; posfrac -= (num_hist - 1); } } stats->stakind[slot_idx] = STATISTIC_KIND_HISTOGRAM; stats->staop[slot_idx] = mystats->ltopr; /* ... */}The correlation slot reuses the fact that, in physical order, the sorted
x (logical) and y (physical) value sets are both 0..values_cnt-1,
so Pearson’s coefficient collapses to a closed form driven by a single
running cross-sum corr_xysum accumulated during the sort comparator:
// compute_scalar_stats — analyze.c (physical/logical correlation)corr_xsum = ((double)(values_cnt - 1)) * ((double) values_cnt) / 2.0;corr_x2sum = ((double)(values_cnt - 1)) * ((double) values_cnt) * (double)(2 * values_cnt - 1) / 6.0;corrs[0] = (values_cnt * corr_xysum - corr_xsum * corr_xsum) / (values_cnt * corr_x2sum - corr_xsum * corr_xsum);stats->stakind[slot_idx] = STATISTIC_KIND_CORRELATION;Persisting to the catalog (update_attstats)
Section titled “Persisting to the catalog (update_attstats)”update_attstats opens pg_statistic with RowExclusiveLock and writes
one tuple per analyzed column, marshalling the per-slot arrays into the
five stakindN / staopN / stacollN / stanumbersN / stavaluesN
catalog columns:
// update_attstats — analyze.c (marshalling the scalar header fields)values[Anum_pg_statistic_stanullfrac - 1] = Float4GetDatum(stats->stanullfrac);values[Anum_pg_statistic_stawidth - 1] = Int32GetDatum(stats->stawidth);values[Anum_pg_statistic_stadistinct - 1] = Float4GetDatum(stats->stadistinct);i = Anum_pg_statistic_stakind1 - 1;for (k = 0; k < STATISTIC_NUM_SLOTS; k++) values[i++] = Int16GetDatum(stats->stakind[k]); /* stakindN */The consuming path — equality selectivity (var_eq_const)
Section titled “The consuming path — equality selectivity (var_eq_const)”eqsel (the selectivity estimator registered for =) delegates to
var_eq_const for a constant comparand. It first probes the MCV list: if
the constant equals an MCV, the stored frequency is the selectivity,
exactly. Otherwise it spreads the non-common, non-null mass over the
remaining distinct values:
// var_eq_const — selfuncs.c (non-MCV equality fallback)double sumcommon = 0.0;double otherdistinct;
for (i = 0; i < sslot.nnumbers; i++) sumcommon += sslot.numbers[i];selec = 1.0 - sumcommon - nullfrac;CLAMP_PROBABILITY(selec);
/* all not-common values share the remainder equally */otherdistinct = get_variable_numdistinct(vardata, &isdefault) - sslot.nnumbers;if (otherdistinct > 1) selec /= otherdistinct;
/* selectivity can't exceed the least common MCV */if (sslot.nnumbers > 0 && selec > sslot.numbers[sslot.nnumbers - 1]) selec = sslot.numbers[sslot.nnumbers - 1];get_variable_numdistinct is where the sign convention of stadistinct
is decoded into a concrete count: a positive value is used as-is; a
negative value is multiplied by rel->tuples; zero (unknown) falls back
to DEFAULT_NUM_DISTINCT = 200:
// get_variable_numdistinct — selfuncs.c (decoding the stadistinct sign)if (stadistinct > 0.0) return clamp_row_est(stadistinct); /* absolute count */if (vardata->rel == NULL) { *isdefault = true; return DEFAULT_NUM_DISTINCT; }ntuples = vardata->rel->tuples;if (ntuples <= 0.0) { *isdefault = true; return DEFAULT_NUM_DISTINCT; }if (stadistinct < 0.0) return clamp_row_est(-stadistinct * ntuples); /* relative -> absolute */The consuming path — range selectivity (ineq_histogram_selectivity)
Section titled “The consuming path — range selectivity (ineq_histogram_selectivity)”For <, <=, >, >=, scalarineqsel combines an MCV scan
(mcv_selectivity) with a histogram lookup. The histogram lookup binary-
searches for the bin that brackets the constant, then linearly
interpolates the fraction within that bin — the equi-depth structure
means each whole bin contributes 1/(nvalues-1) of the population:
// ineq_histogram_selectivity — selfuncs.c (binary search for the bracketing bin)int lobound = 0; /* first possible slot */int hibound = sslot.nvalues; /* last+1 slot */while (lobound < hibound){ int probe = (lobound + hibound) / 2; bool ltcmp; /* (endpoints may be refreshed via get_actual_variable_range) */ ltcmp = DatumGetBool(FunctionCall2Coll(opproc, collation, sslot.values[probe], constval)); if (isgt) ltcmp = !ltcmp; if (ltcmp) lobound = probe + 1; else hibound = probe;}When the search lands inside the array (values[i-1] <= const <= values[i]), the code converts the in-bin position to a fraction
binfrac and forms histfrac = (i - 1 + binfrac) / (nvalues - 1). The
two endpoint bins are optionally refreshed against the live min/max via
get_actual_variable_range so a moving max (since the last ANALYZE)
doesn’t poison the estimate — an important detail for monotonically
increasing keys such as serial PKs and timestamps.
flowchart TD
Q["WHERE a = $1 AND b = $2<br/>(correlated columns)"] --> CL["clauselist_selectivity<br/>(cost model)"]
CL --> EXT{"extended stats<br/>cover these columns?"}
EXT -->|"MCV / dependencies present"| SCS["statext_clauselist_selectivity<br/>+ dependencies_clauselist_selectivity"]
SCS --> COMBINE["combine: P(a,b) =<br/>f*Min(P(a),P(b)) +<br/>(1-f)*P(a)*P(b)"]
EXT -->|"no / remaining clauses"| INDEP["per-column path:<br/>eqsel / scalarineqsel"]
INDEP --> PROD["independence product<br/>P(a)*P(b)"]
COMBINE --> OUT["joint selectivity"]
PROD --> OUT
SCS -.->|"clauses it could not handle"| INDEP
Extended statistics — building (extended_stats.c)
Section titled “Extended statistics — building (extended_stats.c)”BuildRelationExtStatistics is the ANALYZE-time driver. For each
registered object it resolves the per-object statistics target
(statext_compute_stattarget), evaluates any expression columns, then
dispatches per requested kind:
// BuildRelationExtStatistics — extended_stats.c (per-kind dispatch)foreach(lc2, stat->types){ char t = (char) lfirst_int(lc2);
if (t == STATS_EXT_NDISTINCT) ndistinct = statext_ndistinct_build(totalrows, data); else if (t == STATS_EXT_DEPENDENCIES) dependencies = statext_dependencies_build(data); else if (t == STATS_EXT_MCV) mcv = statext_mcv_build(data, totalrows, stattarget); else if (t == STATS_EXT_EXPRESSIONS) /* build_expr_data + compute_expr_stats + serialize */ ;}statext_store(stat->statOid, inh, ndistinct, dependencies, mcv, exprstats, stats);Extended statistics — functional dependencies (dependencies.c)
Section titled “Extended statistics — functional dependencies (dependencies.c)”A functional dependency (a,...) -> z is built by measuring its degree
of validity: sort the sample lexicographically, group by the first
k-1 columns, and count how many rows fall in groups whose last column is
constant. The degree is supporting-rows / total-rows:
// dependency_degree — dependencies.c (validity by sort-and-group)for (i = 1; i <= nitems; i++){ if (i == nitems || multi_sort_compare_dims(0, k - 2, &items[i - 1], &items[i], mss) != 0) { if (n_violations == 0) n_supporting_rows += group_size; /* group is consistent */ n_violations = 0; group_size = 1; continue; } else if (multi_sort_compare_dim(k - 1, &items[i - 1], &items[i], mss) != 0) n_violations++; /* same prefix, different z */ group_size++;}return (n_supporting_rows * 1.0 / data->numrows);At planning time, dependencies_clauselist_selectivity finds the
strongest fully-matched dependency (find_strongest_dependency, ordered
by attribute count then degree) and combines the per-column selectivities
with the degree-weighted blend, never letting the joint estimate exceed
either component:
// clauselist_apply_dependencies — dependencies.c (the combining formula, from the comment) * P(a,b) = f * Min(P(a), P(b)) + (1-f) * P(a) * P(b) * to ensure that the combined selectivity is never greater than either * individual selectivity.At f = 1 (a perfect dependency) the joint selectivity is exactly
Min(P(a),P(b)) — knowing a tells you b, so the second predicate
adds nothing. At f = 0 it degenerates to the independence product.
Extended statistics — multivariate n-distinct (mvdistinct.c)
Section titled “Extended statistics — multivariate n-distinct (mvdistinct.c)”statext_ndistinct_build enumerates every column combination of size
2..k and, for each, counts distinct tuples and singletons, then reuses
the same Duj1 estimator as the per-column path:
// estimate_ndistinct — mvdistinct.c (Duj1, identical to analyze.c's formula)numer = (double) numrows * (double) d;denom = (double) (numrows - f1) + (double) f1 * (double) numrows / totalrows;ndistinct = numer / denom;if (ndistinct < (double) d) ndistinct = (double) d;if (ndistinct > totalrows) ndistinct = totalrows;return floor(ndistinct + 0.5);The distinct tuple count for a combination is obtained by multi-column
sort and a single adjacent-difference pass (ndistinct_for_combination),
exactly mirroring how the single-column path counts distinct values after
its sort. The planner uses these combination counts in
estimate_num_groups (GROUP BY / DISTINCT cardinality) to avoid the
classic over-count where n_distinct(a) * n_distinct(b) vastly exceeds
the real number of (a,b) groups.
Position hints (as of 2026-06-05, REL_18 273fe94)
Section titled “Position hints (as of 2026-06-05, REL_18 273fe94)”| Symbol | File | Line |
|---|---|---|
analyze_rel | commands/analyze.c | 109 |
do_analyze_rel | commands/analyze.c | 278 |
examine_attribute | commands/analyze.c | 1036 |
acquire_sample_rows | commands/analyze.c | 1199 |
update_attstats | commands/analyze.c | 1655 |
std_typanalyze | commands/analyze.c | 1891 |
compute_trivial_stats | commands/analyze.c | 1969 |
compute_distinct_stats | commands/analyze.c | 2059 |
compute_scalar_stats | commands/analyze.c | 2402 |
eqsel | utils/adt/selfuncs.c | 235 |
var_eq_const | utils/adt/selfuncs.c | 303 |
var_eq_non_const | utils/adt/selfuncs.c | 474 |
scalarineqsel | utils/adt/selfuncs.c | 588 |
mcv_selectivity | utils/adt/selfuncs.c | 740 |
histogram_selectivity | utils/adt/selfuncs.c | 831 |
ineq_histogram_selectivity | utils/adt/selfuncs.c | 1049 |
scalarltsel | utils/adt/selfuncs.c | 1479 |
get_variable_numdistinct | utils/adt/selfuncs.c | 6274 |
BuildRelationExtStatistics | statistics/extended_stats.c | 111 |
statext_compute_stattarget | statistics/extended_stats.c | 344 |
choose_best_statistics | statistics/extended_stats.c | 1216 |
statext_clauselist_selectivity | statistics/extended_stats.c | 1991 |
dependency_degree | statistics/dependencies.c | 221 |
statext_dependencies_build | statistics/dependencies.c | 348 |
find_strongest_dependency | statistics/dependencies.c | 929 |
clauselist_apply_dependencies | statistics/dependencies.c | 1014 |
dependencies_clauselist_selectivity | statistics/dependencies.c | 1370 |
statext_ndistinct_build | statistics/mvdistinct.c | 88 |
ndistinct_for_combination | statistics/mvdistinct.c | 425 |
estimate_ndistinct | statistics/mvdistinct.c | 521 |
statext_mcv_build | statistics/mcv.c | 180 |
mcv_clauselist_selectivity | statistics/mcv.c | 2048 |
STATISTIC_KIND_MCV (= 1) | catalog/pg_statistic.h | 190 |
STATISTIC_KIND_HISTOGRAM (= 2) | catalog/pg_statistic.h | 210 |
STATISTIC_KIND_CORRELATION (= 3) | catalog/pg_statistic.h | 222 |
STATISTIC_NUM_SLOTS (= 5) | catalog/pg_statistic.h | 127 |
STATS_EXT_NDISTINCT (= ‘d’) | catalog/pg_statistic_ext.h | 84 |
STATS_EXT_DEPENDENCIES (= ‘f’) | catalog/pg_statistic_ext.h | 85 |
STATS_EXT_MCV (= ‘m’) | catalog/pg_statistic_ext.h | 86 |
DEFAULT_NUM_DISTINCT (= 200) | utils/selfuncs.h | 52 |
DEFAULT_EQ_SEL (= 0.005) | utils/selfuncs.h | 34 |
Source verification (as of 2026-06-05)
Section titled “Source verification (as of 2026-06-05)”Each entry leads with a fact about the current source at commit
273fe94(REL_18_STABLE), readable on its own. The trailing sentence shows how it was checked. Open questions follow as recorded gaps.
Verified facts
Section titled “Verified facts”-
The sample budget is
300 * statistics_target, floored at 100, and bounded below by extended-statistics needs.std_typanalyzesetsstats->minrows = 300 * stats->attstattarget(with the Chaudhuri-Motwani-Narasayya derivation in a comment), anddo_analyze_reltakestargrowsas the max over columns starting from 100, then raises it withComputeExtStatisticsRows. Verified by readingstd_typanalyzeand thetargrowsloop indo_analyze_relon 2026-06-05. Withdefault_statistics_target = 100, the per-column budget is 30,000 rows. -
Sampling is two-stage: random blocks via
BlockSampler, then a Vitter reservoir over their rows.acquire_sample_rowsinitializesBlockSampler_Init, drives aReadStreamover the chosen blocks, copies the firsttargrowsrows directly intorows[], then usesreservoir_get_next_S/sampler_random_fractto replace random reservoir slots. Verified by readingacquire_sample_rows. The result is re-sorted into item-pointer order bycompare_rowsso correlation can be measured. -
stadistinctis the Haas-Stokes Duj1 estimator, and the source cites IBM Research Report RJ 10025 by name.compute_scalar_stats(andcompute_distinct_stats) compute(n*d)/((n-f1)+f1*n/N)withf1= values seen once,d= distinct in sample,n/N= sample/table non-nulls, then clamp to[d, N]. Verified by reading the estimator block and its comment incompute_scalar_stats. The identical formula is factored intoestimate_ndistinctinmvdistinct.c. -
stadistinct’s sign is meaningful: positive = absolute, negative = fraction of rows, zero = unknown.compute_scalar_statsflips an absolute estimate above0.1 * totalrowsto-(stadistinct/totalrows);get_variable_numdistinctdecodes a negative value as-stadistinct * ntuplesand zero asDEFAULT_NUM_DISTINCT(200). Verified by reading both.compute_*setstadistinct = -1.0 * (1.0 - stanullfrac)for apparently-unique columns. -
The histogram is equi-depth over the non-MCV residue, stored under
STATISTIC_KIND_HISTOGRAM.compute_scalar_statsfirst collapses MCV entries out ofvalues[], then copiesnum_histevenly spaced quantiles (carrying integer + fractional position to avoid overflow). Verified by reading the histogram block. It is only built whennum_hist >= 2. -
Equality selectivity exactly reads an MCV frequency when the constant is an MCV, else spreads the residual mass over
otherdistinctvalues.var_eq_constwalks the MCV slot calling the equality function; on a matchselec = sslot.numbers[i]; otherwiseselec = (1 - sumcommon - nullfrac) / otherdistinct, capped at the least common MCV. Verified by readingvar_eq_const. -
Range selectivity binary-searches the histogram and refreshes endpoints against the live min/max.
ineq_histogram_selectivitybisects for the bracketing bin, interpolatesbinfracwithin it, and callsget_actual_variable_rangeto replace the first/last histogram entries with the current column extremes. Verified by reading the function. This is the mechanism that keeps estimates sane for ever-increasing keys between ANALYZE runs. -
Extended statistics are built from the same ANALYZE sample and dispatched per kind character.
BuildRelationExtStatisticsfetchespg_statistic_extentries, computes a per-object target withstatext_compute_stattarget, and routes'd'/'f'/'m'/'e'tostatext_ndistinct_build/statext_dependencies_build/statext_mcv_build/ the expression path, thenstatext_stores the serialized result. Verified by readingBuildRelationExtStatistics. -
A functional dependency’s degree is supporting-rows / total-rows, measured by sort + group.
dependency_degreesorts the sample on thekcolumns, walks groups by the firstk-1, and accumulatesgroup_sizeinton_supporting_rowsonly when the group had no last-column violation. Verified by readingdependency_degree. The return isn_supporting_rows * 1.0 / data->numrows. -
Dependency-aware selectivity uses
f*Min(P(a),P(b)) + (1-f)*P(a)*P(b), never exceeding either component. The combining formula is stated in theclauselist_apply_dependenciesheader comment and applied afterfind_strongest_dependencyselects the dependency (by attribute count, then degree). Verified by reading both functions. Atf=1it reduces toMin, atf=0to the independence product.
Open questions
Section titled “Open questions”-
How exactly does
statext_mcv_builddecide list length and serialize the multivariate MCV list? This document reads the build dispatch and the per-columnanalyze_mcv_listcutoff, but the multivariate MCV construction (mcv.c, ~2175 lines incl.mcv_clauselist_selectivityand the base-selectivity correction) was not traced statement by statement. Investigation path: readstatext_mcv_buildandmcv_clauselist_selectivityinstatistics/mcv.cand thestatistics/README.mcvdesign note. -
What is the precedence and clause-consumption order among the three extended-stat kinds at planning time?
statext_clauselist_selectivity(the MCV path) anddependencies_clauselist_selectivityboth mark*estimatedclausesso the independence path skips them, but the exact ordering — and howchoose_best_statisticspicks among overlapping objects — was only partially read. Investigation path: tracestatext_clauselist_selectivityand its callers inclausesel.c(owned bypostgres-cost-model.md). -
How do inheritance/partitioning interact with the
stainheritflag?update_attstatsandstatext_storeboth take aninhparameter and store separate inherited vs. non-inherited rows, anddo_analyze_relruns twice for inheritance trees, but the planner-side choice of which row to read for a partitioned query was not verified here. Investigation path: read thestainherithandling inexamine_variable/get_relation_stats_hookconsumers.
Beyond PostgreSQL — Comparative Designs & Research Frontiers
Section titled “Beyond PostgreSQL — Comparative Designs & Research Frontiers”Pointers, not analysis. Each bullet is a starting handle for a follow-up doc; depth here is intentionally shallow.
-
Selinger and the origin of catalog-driven selectivity. Selinger et al., “Access Path Selection in a Relational Database Management System” (SIGMOD 1979, the System R optimizer; captured as
dbms-papers/systemr-optimizer.md) introduced the practice of driving join and access-path choice from catalog cardinalities and fixed “selectivity factors” for predicates lacking statistics. PostgreSQL’sDEFAULT_EQ_SEL = 0.005andDEFAULT_INEQ_SEL = 1/3are the direct descendants of Selinger’s magic constants;pg_statisticis the modern, histogram-bearing version of System R’s per-column catalog numbers. -
CUBRID’s per-attribute statistics vs. PostgreSQL’s MCV+histogram. CUBRID keeps
db_stat-style per-attribute min/max, cardinality, and a B-tree-derived value distribution, but historically without a separate most-common-value list or opt-in multi-column statistics. A side-by-side of how each engine handles a skewed-plus-correlated workload (e.g.WHERE city = ? AND state = ?) would sharpen what PostgreSQL’s MCV list andCREATE STATISTICSbuy in misestimate avoidance. -
HyperLogLog and exact-ish distinct counts. PostgreSQL’s sample-based
stadistinctis the classic weak spot — the source itself warns the Duj1 estimator “usually results in a drastic overestimate.” Sketch-based approaches (Flajolet et al.’s HyperLogLog; thepostgres_hllextension; Redshift’s and BigQuery’sAPPROX_COUNT_DISTINCT) maintain distinct counts incrementally with bounded error and no full re-scan. Whether a sketch could replace or augmentstadistinctin core is a recurring -hackers discussion. -
Learned and self-correcting cardinality estimation. The research frontier replaces fixed histograms with models trained on data or query feedback: DeepDB and Naru (deep autoregressive density models), and the older DB2 LEO (“LEarning Optimizer”, Stillger et al., VLDB 2001) that feeds actual row counts from executed plans back into the estimator. PostgreSQL has no execution-feedback loop in core — every estimate is derived purely from the last ANALYZE — so a comparison would quantify what a feedback channel could recover on repeatedly-misestimated queries.
-
Multidimensional histograms vs. PostgreSQL’s three extended-stat kinds. PostgreSQL deliberately avoids true multidimensional histograms (the storage and build cost are exponential in dimensions), choosing instead three cheaper joint summaries: functional dependencies, multivariate n-distinct, and multivariate MCV lists. The classic alternative is the GENHIST / multidimensional-histogram line (Gunopulos et al.); the trade-off is build cost and storage vs. estimation accuracy on arbitrary multi-column predicates. Mapping which query shapes each PostgreSQL kind actually helps (dependencies for equality chains, n-distinct for GROUP BY, MCV for skewed combinations) against what a full 2-D histogram would cover is the natural follow-up.
Sources
Section titled “Sources”In-tree README
Section titled “In-tree README”src/backend/statistics/README(andREADME.dependencies,README.mcv) — the design notes for extended statistics: why per-column independence fails, the functional-dependency degree model, the multivariate MCV representation, and the clause-combining formulas.
Textbook chapters (under knowledge/research/dbms-general/)
Section titled “Textbook chapters (under knowledge/research/dbms-general/)”- Database System Concepts (Silberschatz, Korth & Sudarshan, 7e),
Ch. 16 “Query Optimization”, §16.3 “Statistical Information for Cost
Estimation” — §16.3.1 catalog quantities (n_r, V(A,r), histograms;
equi-width vs. equi-depth; storing most-common values), §16.3.2
selection-size estimation (
n_r / V(A,r)), §16.3.4 “a fairly accurate histogram can be computed from a sample of a few thousand tuples.”
Papers (under knowledge/research/dbms-papers/)
Section titled “Papers (under knowledge/research/dbms-papers/)”- Access Path Selection in a Relational Database Management System
(Selinger et al., SIGMOD 1979) —
dbms-papers/systemr-optimizer.md: catalog-driven selectivity factors, the ancestor ofDEFAULT_EQ_SEL. - (Comparative, not yet captured) Chaudhuri, Motwani & Narasayya 1998
“Random sampling for histogram construction” (the
300*kbound); Haas & Stokes / IBM RR RJ 10025 (the Duj1 distinct estimator); Stillger et al. 2001 “LEO – DB2’s LEarning Optimizer” — see §“Beyond PostgreSQL”.
PostgreSQL source (under /data/hgryoo/references/postgres/, REL_18 273fe94)
Section titled “PostgreSQL source (under /data/hgryoo/references/postgres/, REL_18 273fe94)”src/backend/commands/analyze.c—analyze_rel/do_analyze_rel,acquire_sample_rows(two-stage sample),examine_attribute,std_typanalyze, the threecompute_*_statsroutines,update_attstats.src/backend/statistics/extended_stats.c—BuildRelationExtStatistics,statext_compute_stattarget,statext_clauselist_selectivity,choose_best_statistics.src/backend/statistics/dependencies.c—dependency_degree,statext_dependencies_build,find_strongest_dependency,clauselist_apply_dependencies,dependencies_clauselist_selectivity.src/backend/statistics/mvdistinct.c—statext_ndistinct_build,ndistinct_for_combination,estimate_ndistinct(Duj1).src/backend/statistics/mcv.c—statext_mcv_build,mcv_clauselist_selectivity.src/backend/utils/adt/selfuncs.c—eqsel/var_eq_const,scalarineqsel/ineq_histogram_selectivity,mcv_selectivity,histogram_selectivity,get_variable_numdistinct.src/include/catalog/pg_statistic.h— the five-slot tuple layout andSTATISTIC_KIND_*codes;src/include/catalog/pg_statistic_ext.h—STATS_EXT_*kind characters;src/include/utils/selfuncs.h—DEFAULT_*_SEL,DEFAULT_NUM_DISTINCT,CLAMP_PROBABILITY.
Cross-references (sibling docs that own adjacent mechanism)
Section titled “Cross-references (sibling docs that own adjacent mechanism)”postgres-cost-model.md—clauselist_selectivityand how selectivity feeds path cost; owns the clause-combining entry points that call into the extended-stats estimators.postgres-system-catalogs.md—pg_statistic,pg_statistic_ext,pg_statistic_ext_datacatalog layout and thepg_stats/pg_stats_extviews.postgres-planner-overview.md— where cardinality estimates are consumed during path generation.postgres-vacuum.md— autovacuum’s analyze threshold and thepg_class.reltuples/relpagesit maintains (then_rinput).postgres-executor.md—EXPLAIN ANALYZEactual-row counts, the ground truth against which these estimates are judged.