Skip to content

PostgreSQL Statistics — ANALYZE, pg_statistic, and Extended Statistics

Contents:

A cost-based optimizer cannot choose between an index scan and a sequential scan, or between a nested loop and a hash join, without first guessing how many rows each operator will emit. That guess — the cardinality estimate — is the single most consequential number in query planning: cost is roughly linear in row counts, and an estimate that is wrong by an order of magnitude routinely produces a plan that is wrong by an order of magnitude. The estimate is itself derived from a prior quantity, the selectivity of a predicate: the fraction of rows a WHERE clause is expected to pass.

Database System Concepts (Silberschatz, Korth & Sudarshan, 7e), Ch. 16 “Query Optimization”, §16.3 “Statistical Information for Cost Estimation”, names the catalog quantities an optimizer keeps per relation and per attribute. Among them (§16.3.1):

  • n_r — “the number of tuples in the relation r.”
  • V(A, r) — “the number of distinct values that appear in the relation r for attribute A.” This is the pivot for equality selectivity: under the uniform-distribution assumption, “the number of tuples that would satisfy a selection of the form A = a … is estimated to be n_r / V(A, r)” (§16.3.2). Equivalently the selectivity is 1 / V(A, r).

But uniformity is a fiction for most real columns, and the textbook says so directly: a histogram is the standard repair. §16.3.1 defines a histogram as a structure in which “the values for the attribute are divided into a number of ranges, and with each range the histogram associates the number of tuples whose attribute value lies in that range.” It distinguishes two shapes:

  • An equi-width histogram “divides the range of values into equal-sized ranges.”
  • An equi-depth histogram “adjusts the boundaries of the ranges such that each range has the same number of values”; the text notes that “equi-depth histograms are preferred to equi-width histograms since they provide better estimates.”

The textbook also flags the practical refinement PostgreSQL leans on hardest — splitting off the frequent values: a histogram “could store the most common values along with their counts” so that the residual ranges describe only the non-frequent tail (§16.3.1). And on how the catalog numbers are obtained, §16.3.4 is explicit that exhaustive scanning is unnecessary: “a fairly accurate histogram can be computed from a sample of a few thousand tuples.” Sampling, not full enumeration, is the expected method.

Two further theoretical facts frame everything below. First, estimating V(A, r) — the number of distinct values — from a sample is genuinely hard: a value that appears once in the sample might be unique in the table or might be one of millions. The statistics literature (Haas & Stokes, “Estimating the number of classes in a finite population”, and the IBM Research Report RJ 10025 cited verbatim in the PostgreSQL source) gives bias-corrected estimators that PostgreSQL adopts wholesale. Second, when a predicate touches several columns, the joint selectivity is P(a AND b). The cheap and universal assumption is independence, P(a,b) = P(a) * P(b). When the columns are correlated (city implies state; model implies manufacturer) the independence product is wildly too small, and no amount of per-column accuracy fixes it — the error is structural. That structural error is exactly what extended statistics exist to address.

Across systems the statistics machinery decomposes into the same parts, and the design choices cluster around a handful of axes.

Full-table statistics are exact but cost a full scan per ANALYZE; on a terabyte table that is prohibitive. Every production optimizer samples. The design questions are how to sample (uniform row sample vs. block sample vs. two-stage), how much (a fixed budget vs. a fraction of the table), and whether the sample is biased. Block-level sampling reads far fewer pages but clusters correlated rows; pure row sampling is unbiased but I/O-heavy. The size question has a clean theoretical answer for histograms: Chaudhuri, Motwani & Narasayya (SIGMOD 1998, “Random sampling for histogram construction: how much is enough?”) show the required sample size grows with histogram resolution but only logarithmically with table size — so a fixed per-target budget suffices.

The canonical output is a small fixed-shape record per column holding: the null fraction, the average width, a distinct-value count, a most-common-value (MCV) list with frequencies, and a histogram of the remaining values. The MCV-plus-histogram split is near-universal because it captures both the spiky head of a skewed distribution (the MCVs) and the smooth tail (the histogram) without storing the whole frequency table. A physical/logical correlation number is a PostgreSQL-flavored extra that lets the cost model judge how sequential an index scan’s heap fetches will be.

Distinct-value estimation is a special headache

Section titled “Distinct-value estimation is a special headache”

Because V(A, r) drives equality selectivity, and because it is the hardest quantity to recover from a sample, systems either (a) use a sampling estimator (Haas-Stokes and relatives), (b) maintain it exactly via sketches (HyperLogLog), or (c) let the DBA override it. PostgreSQL does (a) with a manual-override escape hatch.

The default joint-selectivity model is independence — it needs no extra storage and composes trivially. Its failure on correlated columns is well known, and the modern answer is opt-in multi-column statistics: the DBA declares which column groups are correlated, and the system stores joint information (multi-column distinct counts, multi-column MCV lists, or compact functional-dependency summaries) for just those groups. This is the SQL Server “multi-column statistics” / Oracle “extended statistics” / PostgreSQL “CREATE STATISTICS” family. It is opt-in because the combinatorics of all column groups is exponential.

Textbook concept (DSC §16.3)PostgreSQL realization
n_r (relation cardinality)pg_class.reltuples (set by VACUUM, read as rel->tuples)
V(A, r) (distinct values)pg_statistic.stadistinct; Haas-Stokes Duj1 estimator in compute_scalar_stats
selectivity 1/V(A,r)var_eq_const fallback (1 - sumcommon - nullfrac)/otherdistinct
most-common valuesSTATISTIC_KIND_MCV slot; analyze_mcv_list cutoff
equi-depth histogramSTATISTIC_KIND_HISTOGRAM slot; evenly-spaced quantiles over sorted residue
sampling a few thousand tuplesacquire_sample_rows, targrows = 300 * statistics_target
multi-column correlation repairextended stats: pg_statistic_ext / statistics/ (ndistinct, dependencies, mcv)

PostgreSQL splits the statistics world into two halves that meet only in the catalog.

The producing half is ANALYZE (also run inside autovacuum and as a phase of VACUUM ANALYZE). analyze_rel acquires a relation lock and dispatches to do_analyze_rel, which: (1) decides which columns to process and builds a VacAttrStats per column via examine_attribute; (2) computes a sample-size budget targrows as the max of each column’s minrows; (3) draws one shared sample with acquire_sample_rows; (4) runs each column’s compute_stats callback over that sample; (5) writes the results into pg_statistic with update_attstats; and (6) calls BuildRelationExtStatistics to compute any declared extended-statistics objects from the same sample.

The consuming half is the selectivity functions in utils/adt/selfuncs.c, invoked by the planner’s clauselist_selectivity machinery (owned by postgres-cost-model.md). eqsel / scalarltsel and friends pull the pg_statistic row through get_attstatsslot and turn the stored MCV list, histogram, and stadistinct into a number in [0,1].

Three design commitments are worth naming up front.

A fixed sample budget, not a table fraction

Section titled “A fixed sample budget, not a table fraction”

std_typanalyze sets stats->minrows = 300 * stats->attstattarget, citing Chaudhuri-Motwani-Narasayya Corollary 1 to Theorem 5 directly in the source. With the default default_statistics_target = 100 that is 30,000 rows — independent of whether the table has a thousand rows or a trillion. The “300×” constant absorbs the logarithmic table-size term so the budget never has to scale with n.

A pg_statistic tuple carries five generic (stakind, staop, stacoll, stanumbers[], stavalues[]) slots (STATISTIC_NUM_SLOTS = 5). Slot kinds are small integers: STATISTIC_KIND_MCV = 1, STATISTIC_KIND_HISTOGRAM = 2, STATISTIC_KIND_CORRELATION = 3 (plus MCELEM/RANGE kinds for arrays and range types). The slot scheme is what lets a type-specific typanalyze store arbitrary statistic shapes — a scalar column fills MCV

  • histogram + correlation; an array column fills a most-common-elements slot instead — without schema changes.

stadistinct packs two meanings into one float4. A positive value is an absolute distinct-value count. A negative value is a fraction of the row count: stadistinct = -1 means “unique” (distinct count scales 1:1 with rows), -0.5 means “half as many distinct values as rows.” compute_scalar_stats switches an absolute estimate to the relative form when it exceeds 10% of the table, so that the count tracks the row count as the table grows between ANALYZE runs.

std_typanalyze selects one of three standard algorithms based on which operators the column’s type offers:

  • compute_scalar_stats — when both < and = exist (the common case: any orderable type). Produces null-frac, width, stadistinct, MCV, histogram, and correlation.
  • compute_distinct_stats — when only = exists (hashable but not orderable). Produces everything except the histogram and correlation.
  • compute_trivial_stats — when neither exists. Only null-frac and width.

(A type can supply its own typanalyze — e.g. ts_typanalyze for tsvector, array_typanalyze for arrays — but the three above cover the default path.)

Extended statistics: opt-in joint information

Section titled “Extended statistics: opt-in joint information”

CREATE STATISTICS s (ndistinct, dependencies, mcv) ON a, b FROM t registers a pg_statistic_ext catalog row. During ANALYZE, BuildRelationExtStatistics walks the registered objects and, for each requested kind character (STATS_EXT_NDISTINCT = 'd', STATS_EXT_DEPENDENCIES = 'f', STATS_EXT_MCV = 'm', STATS_EXT_EXPRESSIONS = 'e'), calls the corresponding builder and serializes the result into pg_statistic_ext_data. At planning time statext_clauselist_selectivity and dependencies_clauselist_selectivity consult these to estimate groups of clauses jointly, before the independence-based per-column path runs on whatever clauses remain.

Anchor on symbol names, not line numbers. A function or struct name is the stable handle across releases; the line numbers in the position-hint table are quick hints observed at commit 273fe94 (REL_18_STABLE, 2026-06-05). Use git grep -n '<symbol>' src/backend/ to relocate.

The walkthrough follows the data: first the producing path (ANALYZE → pg_statistic), then the consuming path (selfuncs.c), then the extended-statistics builders and estimators.

flowchart TD
    AR["analyze_rel()<br/>lock relation, pick acquirefunc"] --> DAR["do_analyze_rel()"]
    DAR --> EA["examine_attribute()<br/>build VacAttrStats per column<br/>std_typanalyze picks compute_stats"]
    DAR --> CER["ComputeExtStatisticsRows()<br/>bump targrows for ext stats"]
    EA --> ASR["acquire_sample_rows()<br/>two-stage block sample + Vitter<br/>targrows = max(300*stattarget)"]
    CER --> ASR
    ASR --> CS["per-column compute_stats():<br/>compute_scalar_stats /<br/>compute_distinct_stats /<br/>compute_trivial_stats"]
    CS --> UA["update_attstats()<br/>write pg_statistic rows"]
    ASR --> BRE["BuildRelationExtStatistics()<br/>ndistinct / dependencies / mcv<br/>into pg_statistic_ext_data"]
    UA --> CAT[("pg_statistic")]
    BRE --> CATX[("pg_statistic_ext_data")]

analyze_rel is the command-level entry; it takes the lock, rejects pg_statistic itself, and chooses the acquire function (the heap default is acquire_sample_rows). do_analyze_rel then computes the sample budget. The budget is the maximum minrows over all analyzed columns, floored at 100, and further raised by extended statistics:

// do_analyze_rel — analyze.c (sample-budget computation)
targrows = 100;
for (i = 0; i < attr_cnt; i++)
{
if (targrows < vacattrstats[i]->minrows)
targrows = vacattrstats[i]->minrows;
}
/* ... index expression columns considered too ... */
minrows = ComputeExtStatisticsRows(onerel, attr_cnt, vacattrstats);
if (targrows < minrows)
targrows = minrows;

Sampling is the two-stage method documented in the header comment: stage one (BlockSampler_Init / BlockSampler_Next, driven through a ReadStream) picks up to targrows random blocks; stage two applies Vitter’s reservoir algorithm to the rows of those blocks, so every row has equal selection probability with a single pass:

// acquire_sample_rows — analyze.c (Vitter reservoir core)
if (numrows < targrows)
rows[numrows++] = ExecCopySlotHeapTuple(slot);
else
{
if (rowstoskip < 0)
rowstoskip = reservoir_get_next_S(&rstate, samplerows, targrows);
if (rowstoskip <= 0)
{
/* Found a suitable tuple, replacing one old tuple at random */
int k = (int) (targrows * sampler_random_fract(&rstate.randstate));
Assert(k >= 0 && k < targrows);
heap_freetuple(rows[k]);
rows[k] = ExecCopySlotHeapTuple(slot);
}
rowstoskip -= 1;
}
samplerows += 1;

The sampled rows are returned in physical (item-pointer) order — the final qsort_interruptible(rows, ..., compare_rows, ...) enforces this — because compute_scalar_stats relies on physical order to measure correlation.

examine_attribute builds the VacAttrStats and, unless the type has a custom typanalyze, calls std_typanalyze, which both selects the compute_stats callback and sets minrows from the Chaudhuri-Motwani- Narasayya bound:

// std_typanalyze — analyze.c
if (OidIsValid(eqopr) && OidIsValid(ltopr))
{
/* Seems to be a scalar datatype */
stats->compute_stats = compute_scalar_stats;
/* r = 4 * k * ln(2n/gamma) / f^2 ; with f=0.5,gamma=0.01 => ~300*k */
stats->minrows = 300 * stats->attstattarget;
}
else if (OidIsValid(eqopr))
{
/* We can still recognize distinct values */
stats->compute_stats = compute_distinct_stats;
stats->minrows = 300 * stats->attstattarget;
}
else
{
/* Can't do much but the trivial stuff */
stats->compute_stats = compute_trivial_stats;
stats->minrows = 300 * stats->attstattarget;
}

attstattarget here has already been resolved: examine_attribute reads the per-column pg_attribute.attstattarget, treating null as -1 (“use default”), and a value of 0 disables the column entirely (examine_attribute returns NULL). std_typanalyze then substitutes default_statistics_target when the target is still negative.

The distinct-value estimator (compute_scalar_stats)

Section titled “The distinct-value estimator (compute_scalar_stats)”

After sorting the sample, compute_scalar_stats counts ndistinct, nmultiple (values seen more than once), and toowide_cnt. The distinct-value count is the Haas-Stokes Duj1 estimator, quoted with its IBM Research Report citation in the source:

// compute_scalar_stats — analyze.c (Haas-Stokes Duj1 distinct estimator)
int f1 = ndistinct - nmultiple + toowide_cnt; /* values seen once */
int d = f1 + nmultiple; /* distinct in sample */
double n = samplerows - null_cnt; /* sample non-nulls */
double N = totalrows * (1.0 - stats->stanullfrac); /* table non-nulls */
double stadistinct;
if (N > 0)
stadistinct = (n * d) / ((n - f1) + f1 * n / N);
else
stadistinct = 0;
/* Clamp to sane range in case of roundoff error */
if (stadistinct < d) stadistinct = d;
if (stadistinct > N) stadistinct = N;
stats->stadistinct = floor(stadistinct + 0.5);

Two special cases bracket this: if no value repeats (nmultiple == 0) the column is treated as unique, stadistinct = -1.0 * (1.0 - stanullfrac); if every sampled value repeats and the track list is complete, the column is assumed to have exactly that many distinct values. And the >10% rule converts an absolute count to the negative (relative) form:

// compute_scalar_stats — analyze.c (relative-distinct conversion)
if (stats->stadistinct > 0.1 * totalrows)
stats->stadistinct = -(stats->stadistinct / totalrows);

The MCV cutoff and the equi-depth histogram

Section titled “The MCV cutoff and the equi-depth histogram”

The MCV list is not simply filled to the target. analyze_mcv_list decides how many of the tracked values are “significantly more common” than the non-MCV average; only a complete list (all sample values fit and are believed to be all the table’s values) bypasses that test. The histogram is then built over the residue — the sample minus the MCV entries — as evenly spaced quantiles, i.e. an equi-depth histogram:

// compute_scalar_stats — analyze.c (equi-depth histogram quantiles)
num_hist = ndistinct - num_mcv;
if (num_hist > num_bins)
num_hist = num_bins + 1;
if (num_hist >= 2)
{
/* i'th value is values[(i*(nvals-1))/(num_hist-1)], computed by
* carrying integer + fractional parts to avoid overflow */
delta = (nvals - 1) / (num_hist - 1);
deltafrac = (nvals - 1) % (num_hist - 1);
pos = posfrac = 0;
for (i = 0; i < num_hist; i++)
{
hist_values[i] = datumCopy(values[pos].value, ...);
pos += delta;
posfrac += deltafrac;
if (posfrac >= (num_hist - 1)) { pos++; posfrac -= (num_hist - 1); }
}
stats->stakind[slot_idx] = STATISTIC_KIND_HISTOGRAM;
stats->staop[slot_idx] = mystats->ltopr;
/* ... */
}

The correlation slot reuses the fact that, in physical order, the sorted x (logical) and y (physical) value sets are both 0..values_cnt-1, so Pearson’s coefficient collapses to a closed form driven by a single running cross-sum corr_xysum accumulated during the sort comparator:

// compute_scalar_stats — analyze.c (physical/logical correlation)
corr_xsum = ((double)(values_cnt - 1)) * ((double) values_cnt) / 2.0;
corr_x2sum = ((double)(values_cnt - 1)) * ((double) values_cnt) *
(double)(2 * values_cnt - 1) / 6.0;
corrs[0] = (values_cnt * corr_xysum - corr_xsum * corr_xsum) /
(values_cnt * corr_x2sum - corr_xsum * corr_xsum);
stats->stakind[slot_idx] = STATISTIC_KIND_CORRELATION;

Persisting to the catalog (update_attstats)

Section titled “Persisting to the catalog (update_attstats)”

update_attstats opens pg_statistic with RowExclusiveLock and writes one tuple per analyzed column, marshalling the per-slot arrays into the five stakindN / staopN / stacollN / stanumbersN / stavaluesN catalog columns:

// update_attstats — analyze.c (marshalling the scalar header fields)
values[Anum_pg_statistic_stanullfrac - 1] = Float4GetDatum(stats->stanullfrac);
values[Anum_pg_statistic_stawidth - 1] = Int32GetDatum(stats->stawidth);
values[Anum_pg_statistic_stadistinct - 1] = Float4GetDatum(stats->stadistinct);
i = Anum_pg_statistic_stakind1 - 1;
for (k = 0; k < STATISTIC_NUM_SLOTS; k++)
values[i++] = Int16GetDatum(stats->stakind[k]); /* stakindN */

The consuming path — equality selectivity (var_eq_const)

Section titled “The consuming path — equality selectivity (var_eq_const)”

eqsel (the selectivity estimator registered for =) delegates to var_eq_const for a constant comparand. It first probes the MCV list: if the constant equals an MCV, the stored frequency is the selectivity, exactly. Otherwise it spreads the non-common, non-null mass over the remaining distinct values:

// var_eq_const — selfuncs.c (non-MCV equality fallback)
double sumcommon = 0.0;
double otherdistinct;
for (i = 0; i < sslot.nnumbers; i++)
sumcommon += sslot.numbers[i];
selec = 1.0 - sumcommon - nullfrac;
CLAMP_PROBABILITY(selec);
/* all not-common values share the remainder equally */
otherdistinct = get_variable_numdistinct(vardata, &isdefault) - sslot.nnumbers;
if (otherdistinct > 1)
selec /= otherdistinct;
/* selectivity can't exceed the least common MCV */
if (sslot.nnumbers > 0 && selec > sslot.numbers[sslot.nnumbers - 1])
selec = sslot.numbers[sslot.nnumbers - 1];

get_variable_numdistinct is where the sign convention of stadistinct is decoded into a concrete count: a positive value is used as-is; a negative value is multiplied by rel->tuples; zero (unknown) falls back to DEFAULT_NUM_DISTINCT = 200:

// get_variable_numdistinct — selfuncs.c (decoding the stadistinct sign)
if (stadistinct > 0.0)
return clamp_row_est(stadistinct); /* absolute count */
if (vardata->rel == NULL) { *isdefault = true; return DEFAULT_NUM_DISTINCT; }
ntuples = vardata->rel->tuples;
if (ntuples <= 0.0) { *isdefault = true; return DEFAULT_NUM_DISTINCT; }
if (stadistinct < 0.0)
return clamp_row_est(-stadistinct * ntuples); /* relative -> absolute */

The consuming path — range selectivity (ineq_histogram_selectivity)

Section titled “The consuming path — range selectivity (ineq_histogram_selectivity)”

For <, <=, >, >=, scalarineqsel combines an MCV scan (mcv_selectivity) with a histogram lookup. The histogram lookup binary- searches for the bin that brackets the constant, then linearly interpolates the fraction within that bin — the equi-depth structure means each whole bin contributes 1/(nvalues-1) of the population:

// ineq_histogram_selectivity — selfuncs.c (binary search for the bracketing bin)
int lobound = 0; /* first possible slot */
int hibound = sslot.nvalues; /* last+1 slot */
while (lobound < hibound)
{
int probe = (lobound + hibound) / 2;
bool ltcmp;
/* (endpoints may be refreshed via get_actual_variable_range) */
ltcmp = DatumGetBool(FunctionCall2Coll(opproc, collation,
sslot.values[probe], constval));
if (isgt) ltcmp = !ltcmp;
if (ltcmp) lobound = probe + 1;
else hibound = probe;
}

When the search lands inside the array (values[i-1] <= const <= values[i]), the code converts the in-bin position to a fraction binfrac and forms histfrac = (i - 1 + binfrac) / (nvalues - 1). The two endpoint bins are optionally refreshed against the live min/max via get_actual_variable_range so a moving max (since the last ANALYZE) doesn’t poison the estimate — an important detail for monotonically increasing keys such as serial PKs and timestamps.

flowchart TD
    Q["WHERE a = $1 AND b = $2<br/>(correlated columns)"] --> CL["clauselist_selectivity<br/>(cost model)"]
    CL --> EXT{"extended stats<br/>cover these columns?"}
    EXT -->|"MCV / dependencies present"| SCS["statext_clauselist_selectivity<br/>+ dependencies_clauselist_selectivity"]
    SCS --> COMBINE["combine: P(a,b) =<br/>f*Min(P(a),P(b)) +<br/>(1-f)*P(a)*P(b)"]
    EXT -->|"no / remaining clauses"| INDEP["per-column path:<br/>eqsel / scalarineqsel"]
    INDEP --> PROD["independence product<br/>P(a)*P(b)"]
    COMBINE --> OUT["joint selectivity"]
    PROD --> OUT
    SCS -.->|"clauses it could not handle"| INDEP

Extended statistics — building (extended_stats.c)

Section titled “Extended statistics — building (extended_stats.c)”

BuildRelationExtStatistics is the ANALYZE-time driver. For each registered object it resolves the per-object statistics target (statext_compute_stattarget), evaluates any expression columns, then dispatches per requested kind:

// BuildRelationExtStatistics — extended_stats.c (per-kind dispatch)
foreach(lc2, stat->types)
{
char t = (char) lfirst_int(lc2);
if (t == STATS_EXT_NDISTINCT)
ndistinct = statext_ndistinct_build(totalrows, data);
else if (t == STATS_EXT_DEPENDENCIES)
dependencies = statext_dependencies_build(data);
else if (t == STATS_EXT_MCV)
mcv = statext_mcv_build(data, totalrows, stattarget);
else if (t == STATS_EXT_EXPRESSIONS)
/* build_expr_data + compute_expr_stats + serialize */ ;
}
statext_store(stat->statOid, inh, ndistinct, dependencies, mcv, exprstats, stats);

Extended statistics — functional dependencies (dependencies.c)

Section titled “Extended statistics — functional dependencies (dependencies.c)”

A functional dependency (a,...) -> z is built by measuring its degree of validity: sort the sample lexicographically, group by the first k-1 columns, and count how many rows fall in groups whose last column is constant. The degree is supporting-rows / total-rows:

// dependency_degree — dependencies.c (validity by sort-and-group)
for (i = 1; i <= nitems; i++)
{
if (i == nitems ||
multi_sort_compare_dims(0, k - 2, &items[i - 1], &items[i], mss) != 0)
{
if (n_violations == 0)
n_supporting_rows += group_size; /* group is consistent */
n_violations = 0;
group_size = 1;
continue;
}
else if (multi_sort_compare_dim(k - 1, &items[i - 1], &items[i], mss) != 0)
n_violations++; /* same prefix, different z */
group_size++;
}
return (n_supporting_rows * 1.0 / data->numrows);

At planning time, dependencies_clauselist_selectivity finds the strongest fully-matched dependency (find_strongest_dependency, ordered by attribute count then degree) and combines the per-column selectivities with the degree-weighted blend, never letting the joint estimate exceed either component:

// clauselist_apply_dependencies — dependencies.c (the combining formula, from the comment)
* P(a,b) = f * Min(P(a), P(b)) + (1-f) * P(a) * P(b)
* to ensure that the combined selectivity is never greater than either
* individual selectivity.

At f = 1 (a perfect dependency) the joint selectivity is exactly Min(P(a),P(b)) — knowing a tells you b, so the second predicate adds nothing. At f = 0 it degenerates to the independence product.

Extended statistics — multivariate n-distinct (mvdistinct.c)

Section titled “Extended statistics — multivariate n-distinct (mvdistinct.c)”

statext_ndistinct_build enumerates every column combination of size 2..k and, for each, counts distinct tuples and singletons, then reuses the same Duj1 estimator as the per-column path:

// estimate_ndistinct — mvdistinct.c (Duj1, identical to analyze.c's formula)
numer = (double) numrows * (double) d;
denom = (double) (numrows - f1) + (double) f1 * (double) numrows / totalrows;
ndistinct = numer / denom;
if (ndistinct < (double) d) ndistinct = (double) d;
if (ndistinct > totalrows) ndistinct = totalrows;
return floor(ndistinct + 0.5);

The distinct tuple count for a combination is obtained by multi-column sort and a single adjacent-difference pass (ndistinct_for_combination), exactly mirroring how the single-column path counts distinct values after its sort. The planner uses these combination counts in estimate_num_groups (GROUP BY / DISTINCT cardinality) to avoid the classic over-count where n_distinct(a) * n_distinct(b) vastly exceeds the real number of (a,b) groups.

Position hints (as of 2026-06-05, REL_18 273fe94)

Section titled “Position hints (as of 2026-06-05, REL_18 273fe94)”
SymbolFileLine
analyze_relcommands/analyze.c109
do_analyze_relcommands/analyze.c278
examine_attributecommands/analyze.c1036
acquire_sample_rowscommands/analyze.c1199
update_attstatscommands/analyze.c1655
std_typanalyzecommands/analyze.c1891
compute_trivial_statscommands/analyze.c1969
compute_distinct_statscommands/analyze.c2059
compute_scalar_statscommands/analyze.c2402
eqselutils/adt/selfuncs.c235
var_eq_constutils/adt/selfuncs.c303
var_eq_non_constutils/adt/selfuncs.c474
scalarineqselutils/adt/selfuncs.c588
mcv_selectivityutils/adt/selfuncs.c740
histogram_selectivityutils/adt/selfuncs.c831
ineq_histogram_selectivityutils/adt/selfuncs.c1049
scalarltselutils/adt/selfuncs.c1479
get_variable_numdistinctutils/adt/selfuncs.c6274
BuildRelationExtStatisticsstatistics/extended_stats.c111
statext_compute_stattargetstatistics/extended_stats.c344
choose_best_statisticsstatistics/extended_stats.c1216
statext_clauselist_selectivitystatistics/extended_stats.c1991
dependency_degreestatistics/dependencies.c221
statext_dependencies_buildstatistics/dependencies.c348
find_strongest_dependencystatistics/dependencies.c929
clauselist_apply_dependenciesstatistics/dependencies.c1014
dependencies_clauselist_selectivitystatistics/dependencies.c1370
statext_ndistinct_buildstatistics/mvdistinct.c88
ndistinct_for_combinationstatistics/mvdistinct.c425
estimate_ndistinctstatistics/mvdistinct.c521
statext_mcv_buildstatistics/mcv.c180
mcv_clauselist_selectivitystatistics/mcv.c2048
STATISTIC_KIND_MCV (= 1)catalog/pg_statistic.h190
STATISTIC_KIND_HISTOGRAM (= 2)catalog/pg_statistic.h210
STATISTIC_KIND_CORRELATION (= 3)catalog/pg_statistic.h222
STATISTIC_NUM_SLOTS (= 5)catalog/pg_statistic.h127
STATS_EXT_NDISTINCT (= ‘d’)catalog/pg_statistic_ext.h84
STATS_EXT_DEPENDENCIES (= ‘f’)catalog/pg_statistic_ext.h85
STATS_EXT_MCV (= ‘m’)catalog/pg_statistic_ext.h86
DEFAULT_NUM_DISTINCT (= 200)utils/selfuncs.h52
DEFAULT_EQ_SEL (= 0.005)utils/selfuncs.h34

Each entry leads with a fact about the current source at commit 273fe94 (REL_18_STABLE), readable on its own. The trailing sentence shows how it was checked. Open questions follow as recorded gaps.

  • The sample budget is 300 * statistics_target, floored at 100, and bounded below by extended-statistics needs. std_typanalyze sets stats->minrows = 300 * stats->attstattarget (with the Chaudhuri-Motwani-Narasayya derivation in a comment), and do_analyze_rel takes targrows as the max over columns starting from 100, then raises it with ComputeExtStatisticsRows. Verified by reading std_typanalyze and the targrows loop in do_analyze_rel on 2026-06-05. With default_statistics_target = 100, the per-column budget is 30,000 rows.

  • Sampling is two-stage: random blocks via BlockSampler, then a Vitter reservoir over their rows. acquire_sample_rows initializes BlockSampler_Init, drives a ReadStream over the chosen blocks, copies the first targrows rows directly into rows[], then uses reservoir_get_next_S / sampler_random_fract to replace random reservoir slots. Verified by reading acquire_sample_rows. The result is re-sorted into item-pointer order by compare_rows so correlation can be measured.

  • stadistinct is the Haas-Stokes Duj1 estimator, and the source cites IBM Research Report RJ 10025 by name. compute_scalar_stats (and compute_distinct_stats) compute (n*d)/((n-f1)+f1*n/N) with f1 = values seen once, d = distinct in sample, n/N = sample/table non-nulls, then clamp to [d, N]. Verified by reading the estimator block and its comment in compute_scalar_stats. The identical formula is factored into estimate_ndistinct in mvdistinct.c.

  • stadistinct’s sign is meaningful: positive = absolute, negative = fraction of rows, zero = unknown. compute_scalar_stats flips an absolute estimate above 0.1 * totalrows to -(stadistinct/totalrows); get_variable_numdistinct decodes a negative value as -stadistinct * ntuples and zero as DEFAULT_NUM_DISTINCT (200). Verified by reading both. compute_* set stadistinct = -1.0 * (1.0 - stanullfrac) for apparently-unique columns.

  • The histogram is equi-depth over the non-MCV residue, stored under STATISTIC_KIND_HISTOGRAM. compute_scalar_stats first collapses MCV entries out of values[], then copies num_hist evenly spaced quantiles (carrying integer + fractional position to avoid overflow). Verified by reading the histogram block. It is only built when num_hist >= 2.

  • Equality selectivity exactly reads an MCV frequency when the constant is an MCV, else spreads the residual mass over otherdistinct values. var_eq_const walks the MCV slot calling the equality function; on a match selec = sslot.numbers[i]; otherwise selec = (1 - sumcommon - nullfrac) / otherdistinct, capped at the least common MCV. Verified by reading var_eq_const.

  • Range selectivity binary-searches the histogram and refreshes endpoints against the live min/max. ineq_histogram_selectivity bisects for the bracketing bin, interpolates binfrac within it, and calls get_actual_variable_range to replace the first/last histogram entries with the current column extremes. Verified by reading the function. This is the mechanism that keeps estimates sane for ever-increasing keys between ANALYZE runs.

  • Extended statistics are built from the same ANALYZE sample and dispatched per kind character. BuildRelationExtStatistics fetches pg_statistic_ext entries, computes a per-object target with statext_compute_stattarget, and routes 'd'/'f'/'m'/'e' to statext_ndistinct_build / statext_dependencies_build / statext_mcv_build / the expression path, then statext_stores the serialized result. Verified by reading BuildRelationExtStatistics.

  • A functional dependency’s degree is supporting-rows / total-rows, measured by sort + group. dependency_degree sorts the sample on the k columns, walks groups by the first k-1, and accumulates group_size into n_supporting_rows only when the group had no last-column violation. Verified by reading dependency_degree. The return is n_supporting_rows * 1.0 / data->numrows.

  • Dependency-aware selectivity uses f*Min(P(a),P(b)) + (1-f)*P(a)*P(b), never exceeding either component. The combining formula is stated in the clauselist_apply_dependencies header comment and applied after find_strongest_dependency selects the dependency (by attribute count, then degree). Verified by reading both functions. At f=1 it reduces to Min, at f=0 to the independence product.

  1. How exactly does statext_mcv_build decide list length and serialize the multivariate MCV list? This document reads the build dispatch and the per-column analyze_mcv_list cutoff, but the multivariate MCV construction (mcv.c, ~2175 lines incl. mcv_clauselist_selectivity and the base-selectivity correction) was not traced statement by statement. Investigation path: read statext_mcv_build and mcv_clauselist_selectivity in statistics/mcv.c and the statistics/README.mcv design note.

  2. What is the precedence and clause-consumption order among the three extended-stat kinds at planning time? statext_clauselist_selectivity (the MCV path) and dependencies_clauselist_selectivity both mark *estimatedclauses so the independence path skips them, but the exact ordering — and how choose_best_statistics picks among overlapping objects — was only partially read. Investigation path: trace statext_clauselist_selectivity and its callers in clausesel.c (owned by postgres-cost-model.md).

  3. How do inheritance/partitioning interact with the stainherit flag? update_attstats and statext_store both take an inh parameter and store separate inherited vs. non-inherited rows, and do_analyze_rel runs twice for inheritance trees, but the planner-side choice of which row to read for a partitioned query was not verified here. Investigation path: read the stainherit handling in examine_variable / get_relation_stats_hook consumers.

Beyond PostgreSQL — Comparative Designs & Research Frontiers

Section titled “Beyond PostgreSQL — Comparative Designs & Research Frontiers”

Pointers, not analysis. Each bullet is a starting handle for a follow-up doc; depth here is intentionally shallow.

  • Selinger and the origin of catalog-driven selectivity. Selinger et al., “Access Path Selection in a Relational Database Management System” (SIGMOD 1979, the System R optimizer; captured as dbms-papers/systemr-optimizer.md) introduced the practice of driving join and access-path choice from catalog cardinalities and fixed “selectivity factors” for predicates lacking statistics. PostgreSQL’s DEFAULT_EQ_SEL = 0.005 and DEFAULT_INEQ_SEL = 1/3 are the direct descendants of Selinger’s magic constants; pg_statistic is the modern, histogram-bearing version of System R’s per-column catalog numbers.

  • CUBRID’s per-attribute statistics vs. PostgreSQL’s MCV+histogram. CUBRID keeps db_stat-style per-attribute min/max, cardinality, and a B-tree-derived value distribution, but historically without a separate most-common-value list or opt-in multi-column statistics. A side-by-side of how each engine handles a skewed-plus-correlated workload (e.g. WHERE city = ? AND state = ?) would sharpen what PostgreSQL’s MCV list and CREATE STATISTICS buy in misestimate avoidance.

  • HyperLogLog and exact-ish distinct counts. PostgreSQL’s sample-based stadistinct is the classic weak spot — the source itself warns the Duj1 estimator “usually results in a drastic overestimate.” Sketch-based approaches (Flajolet et al.’s HyperLogLog; the postgres_hll extension; Redshift’s and BigQuery’s APPROX_COUNT_DISTINCT) maintain distinct counts incrementally with bounded error and no full re-scan. Whether a sketch could replace or augment stadistinct in core is a recurring -hackers discussion.

  • Learned and self-correcting cardinality estimation. The research frontier replaces fixed histograms with models trained on data or query feedback: DeepDB and Naru (deep autoregressive density models), and the older DB2 LEO (“LEarning Optimizer”, Stillger et al., VLDB 2001) that feeds actual row counts from executed plans back into the estimator. PostgreSQL has no execution-feedback loop in core — every estimate is derived purely from the last ANALYZE — so a comparison would quantify what a feedback channel could recover on repeatedly-misestimated queries.

  • Multidimensional histograms vs. PostgreSQL’s three extended-stat kinds. PostgreSQL deliberately avoids true multidimensional histograms (the storage and build cost are exponential in dimensions), choosing instead three cheaper joint summaries: functional dependencies, multivariate n-distinct, and multivariate MCV lists. The classic alternative is the GENHIST / multidimensional-histogram line (Gunopulos et al.); the trade-off is build cost and storage vs. estimation accuracy on arbitrary multi-column predicates. Mapping which query shapes each PostgreSQL kind actually helps (dependencies for equality chains, n-distinct for GROUP BY, MCV for skewed combinations) against what a full 2-D histogram would cover is the natural follow-up.

  • src/backend/statistics/README (and README.dependencies, README.mcv) — the design notes for extended statistics: why per-column independence fails, the functional-dependency degree model, the multivariate MCV representation, and the clause-combining formulas.

Textbook chapters (under knowledge/research/dbms-general/)

Section titled “Textbook chapters (under knowledge/research/dbms-general/)”
  • Database System Concepts (Silberschatz, Korth & Sudarshan, 7e), Ch. 16 “Query Optimization”, §16.3 “Statistical Information for Cost Estimation” — §16.3.1 catalog quantities (n_r, V(A,r), histograms; equi-width vs. equi-depth; storing most-common values), §16.3.2 selection-size estimation (n_r / V(A,r)), §16.3.4 “a fairly accurate histogram can be computed from a sample of a few thousand tuples.”

Papers (under knowledge/research/dbms-papers/)

Section titled “Papers (under knowledge/research/dbms-papers/)”
  • Access Path Selection in a Relational Database Management System (Selinger et al., SIGMOD 1979) — dbms-papers/systemr-optimizer.md: catalog-driven selectivity factors, the ancestor of DEFAULT_EQ_SEL.
  • (Comparative, not yet captured) Chaudhuri, Motwani & Narasayya 1998 “Random sampling for histogram construction” (the 300*k bound); Haas & Stokes / IBM RR RJ 10025 (the Duj1 distinct estimator); Stillger et al. 2001 “LEO – DB2’s LEarning Optimizer” — see §“Beyond PostgreSQL”.

PostgreSQL source (under /data/hgryoo/references/postgres/, REL_18 273fe94)

Section titled “PostgreSQL source (under /data/hgryoo/references/postgres/, REL_18 273fe94)”
  • src/backend/commands/analyze.canalyze_rel / do_analyze_rel, acquire_sample_rows (two-stage sample), examine_attribute, std_typanalyze, the three compute_*_stats routines, update_attstats.
  • src/backend/statistics/extended_stats.cBuildRelationExtStatistics, statext_compute_stattarget, statext_clauselist_selectivity, choose_best_statistics.
  • src/backend/statistics/dependencies.cdependency_degree, statext_dependencies_build, find_strongest_dependency, clauselist_apply_dependencies, dependencies_clauselist_selectivity.
  • src/backend/statistics/mvdistinct.cstatext_ndistinct_build, ndistinct_for_combination, estimate_ndistinct (Duj1).
  • src/backend/statistics/mcv.cstatext_mcv_build, mcv_clauselist_selectivity.
  • src/backend/utils/adt/selfuncs.ceqsel / var_eq_const, scalarineqsel / ineq_histogram_selectivity, mcv_selectivity, histogram_selectivity, get_variable_numdistinct.
  • src/include/catalog/pg_statistic.h — the five-slot tuple layout and STATISTIC_KIND_* codes; src/include/catalog/pg_statistic_ext.hSTATS_EXT_* kind characters; src/include/utils/selfuncs.hDEFAULT_*_SEL, DEFAULT_NUM_DISTINCT, CLAMP_PROBABILITY.

Cross-references (sibling docs that own adjacent mechanism)

Section titled “Cross-references (sibling docs that own adjacent mechanism)”
  • postgres-cost-model.mdclauselist_selectivity and how selectivity feeds path cost; owns the clause-combining entry points that call into the extended-stats estimators.
  • postgres-system-catalogs.mdpg_statistic, pg_statistic_ext, pg_statistic_ext_data catalog layout and the pg_stats / pg_stats_ext views.
  • postgres-planner-overview.md — where cardinality estimates are consumed during path generation.
  • postgres-vacuum.md — autovacuum’s analyze threshold and the pg_class.reltuples / relpages it maintains (the n_r input).
  • postgres-executor.mdEXPLAIN ANALYZE actual-row counts, the ground truth against which these estimates are judged.