CUBRID Statistics — Cardinality, NDV, Min/Max, and Histograms for the Cost Optimizer


A cost-based optimizer has to predict, before the query runs, how many tuples each operator will produce. Database Internals (Petrov, ch. “Query Processing and Optimization” in the print edition; equivalently Hellerstein–Stonebraker–Hamilton, Architecture of a Database System, §6.3 “Plan Space” and §6.4 “Selectivity Estimation”) frames the problem as three nested questions:

  1. What numbers does the cost model want? The two universal inputs are cardinality (rows that survive a predicate or join) and I/O footprint (pages those rows live on). Selinger’s System R paper, Access Path Selection in a Relational Database Management System (Selinger et al., SIGMOD 1979) §3 fixes the still-current cost equation: cost = page_fetches + W * RSI_calls where W weights CPU against I/O. Cardinality is the product selectivity * |R| for a base table and a recursively-propagated product for join inputs.
  2. How is selectivity derived from the data? Three approximations have been deployed in production for forty years and all three are visible in CUBRID’s code:
    • Uniformity assumption. For an equality predicate col = const on column col with NDV D, assume each distinct value appears |R|/D times, giving selectivity 1/D (Selinger §3.2). Range predicates get (hi-lo)/(MAX-MIN) if min/max are known.
    • Independence assumption. For p AND q, multiply: sel(p) * sel(q). For p OR q, inclusion-exclusion: sel(p) + sel(q) - sel(p)*sel(q). Known to be wrong when columns correlate, but cheap.
    • Histograms. Equi-width or equi-depth bucketing of the column’s value space, with optional MCV (most common values) lists for skew. The classic survey is Poosala et al., Improved Histograms for Selectivity Estimation of Range Predicates (SIGMOD 1996); the idea goes back to Kooi’s PhD thesis (1980) and Piatetsky-Shapiro & Connell, Accurate Estimation of the Number of Tuples Satisfying a Condition (SIGMOD 1984).
  3. How do we collect those numbers without reading the whole table every time? Three sampling regimes:
    • Full scan. Exact, but proportional to table size.
    • Page sampling. Pick k random heap pages, count distinct values, scale up. Variance-bounded; the mathematical foundation is Haas & Stokes, Estimating the Number of Classes in a Finite Population (JASA 1998). NDV estimation from a sample is provably hard (Charikar–Chaudhuri–Motwani–Narasayya, Towards Estimation Error Guarantees for Distinct Values, PODS 2000) — most engines accept the bias and live with it.
    • Streaming sketches. HyperLogLog and friends; not yet present in CUBRID.

Two implementation choices the model leaves open shape every real engine and frame the rest of this document:

  1. What is the granularity of “the histogram”? Per-column full equi-depth bucket array (PostgreSQL, pg_statistic.stahistogram)? Per-index partial-key distinct counts (CUBRID, BTREE_STATS::pkeys)? Both (Oracle)? CUBRID picks per-index partial-key counts plus a per-column NDV scalar — no per-column bucket array.
  2. Where does the optimizer get the numbers? Cached on the in-memory schema object (CUBRID, MySQL), or fetched on demand via a system view (PostgreSQL’s pg_statistic)? CUBRID uses the cached schema object path — the statistics travel with the SM_CLASS workspace cache and are refreshed on UPDATE STATISTICS.

Once those choices are named, every CUBRID-specific structure in this document either implements one of them or makes the access faster.

Every relational engine reaches for a similar set of patterns around statistics collection.

Some kind of ANALYZE walks the heap (full or sampled), computes per-column summaries (NDV, min, max, MCV, histogram), and writes them to a system catalog. The optimizer reads the catalog at parse/plan time, never at runtime. The summaries decay and need to be refreshed — either on a background schedule (Oracle’s auto stats job), on a row-count threshold (PostgreSQL autovacuum_analyze), or only when the user issues the DDL (CUBRID’s UPDATE STATISTICS).

Every leaf-walked B+Tree already knows its key count, leaf page count, height, and (with key concatenation) the distinct count for any prefix of a multi-column key. Using the index as the source of truth is cheaper than re-deriving from the heap and is what CUBRID, MySQL InnoDB (mysql.innodb_index_stats), and Oracle’s DBMS_STATS.GATHER_INDEX_STATS all do for indexed columns.

NDV for non-indexed columns cannot be read off a B+Tree — there isn’t one. The cheapest implementation is to issue a SELECT count(distinct col1), count(distinct col2), …, count(*) FROM tbl and let the existing query executor do the bookkeeping, optionally with a sampling hint. CUBRID takes exactly this path — stats_make_select_list_for_ndv builds the column list and stats_get_ndv_by_query runs the SQL, with /*+ SAMPLING_SCAN */ for the non-fullscan case.

Cached schema objects vs. on-demand catalog reads


PostgreSQL re-reads pg_statistic on every plan. CUBRID caches CLASS_STATS on the SM_CLASS workspace object and refreshes it after each UPDATE STATISTICS; subsequent queries on the same connection read the cached structure without a server round-trip. The trade-off is staleness in exchange for plan-time latency, mitigated by the timestamp that xstats_get_statistics_from_server checks.

CUBRID’s design is a hybrid of the three classical engines:

| Aspect | PostgreSQL | MySQL InnoDB | Oracle | CUBRID |
|---|---|---|---|---|
| Trigger | autovacuum + manual ANALYZE | persistent on DDL + manual | auto job + manual | manual UPDATE STATISTICS only |
| Heap walk | random page sample | random index sample | block sample | full scan or AR sample on B+Tree |
| Histogram type | equi-depth + MCV | none, only NDV per prefix | equi-depth or hybrid | none, only NDV per prefix |
| NDV source | sample + adjustment formula | index leaf walk | sample + adjustment | SQL count(distinct) (sampled or full) |
| Storage | system table pg_statistic | mysql.innodb_*_stats tables | SYS.WRI$_OPTSTAT_* | catalog CLS_INFO + DISK_REPR + system class _db_class row |
| Optimizer access | catalog lookup at plan time | schema cache | catalog lookup | SM_CLASS::stats workspace cache |

The most consequential CUBRID-specific choice is the absence of per-column histograms — there is no equi-depth bucket array on disk. This forces every range predicate cost computation back to the DEFAULT_*_SELECTIVITY constants in query_planner.c whenever the predicate cannot be answered by partial-key NDV alone.

The statistics subsystem is the bridge between the storage layer (which knows row layout and can walk B+Trees) and the optimizer (which knows query shape and needs row-count estimates). The bridge is built once per UPDATE STATISTICS DDL — there is no autovacuum, no time-based refresh, and no sampling on the read path.

For each class CUBRID maintains exactly three families of numbers, all anchored on the on-disk class representation record (DISK_REPR) and the class info record (CLS_INFO):

// btree_stats — src/storage/statistics.h
struct btree_stats
{
  BTID btid;
  int leafs;			/* number of leaf pages including overflow pages */
  int pages;			/* number of total pages */
  int height;			/* the height of the B+tree */
  int keys;			/* number of keys */
  int has_function;		/* is a function index */
  TP_DOMAIN *key_type;
  int pkeys_size;		/* pkeys array size */
  int *pkeys;			/* partial keys info: pkeys[0] -> # of {a},
				 * pkeys[1] -> # of {a,b}, ... */
  int dedup_idx;		/* support for SUPPORT_DEDUPLICATE_KEY_MODE */
};

struct attr_stats
{
  int id;
  DB_TYPE type;
  int n_btstats;
  BTREE_STATS *bt_stats;
  INT64 ndv;			/* Number of Distinct Values of column */
};

struct class_stats
{
  unsigned int time_stamp;
  int heap_num_objects;		/* cardinality of the class */
  int heap_num_pages;		/* heap-file footprint */
  int n_attrs;
  ATTR_STATS *attr_stats;
};

The class-level numbers are heap_num_objects (the total row count, denoted |R| in the cost equations) and heap_num_pages (the heap-file footprint in pages, used to size sequential and full-table I/O). The attribute-level number is the per-column NDV attr_stats::ndv. The per-index numbers are leafs, pages, height, keys and the partial-key array pkeys[], where pkeys[i] is the distinct count of the first i+1 columns of a multi-column index.

There are no histograms — no MCV array, no equi-depth boundary list, no per-bucket frequency. There are no per-column min/max values either (STATS_MIN_MAX_SIZE is still defined, but the surrounding fields were retired); the only durable per-column scalar shipped to the optimizer is ndv. This is a deliberate simplification: the optimizer falls back to fixed selectivity constants when partial-key NDV cannot answer a predicate.

UPDATE STATISTICS dispatches through the network layer to the server, where xstats_update_statistics runs end-to-end with a single SCH_S_LOCK on the class — readers can scan, schema cannot change, and other UPDATE STATISTICS runs on the same class are serialized.

flowchart TD
  A["UPDATE STATISTICS<br/>parser PT_UPDATE_STATS"] --> B["execute_statement.c<br/>do_update_stats()"]
  B --> C["sm_update_statistics<br/>schema_manager.c"]
  C --> D["network_interface_cl<br/>stats_update_statistics"]
  D --> E["network_interface_sr<br/>handler 4278"]
  E --> F["xstats_update_statistics<br/>(server, SCH_S_LOCK)"]
  F --> G["catalog_get_class_info<br/>cls_info_p"]
  F --> H["catalog_get_representation<br/>disk_repr_p"]
  F --> I["partition_get_partition_oids"]
  I -->|count > 0| J["stats_update_partitioned_statistics<br/>(per-partition + accumulator)"]
  I -->|count == 0| K["btree_get_stats per index"]
  K --> L["btree_get_stats_with_fullscan<br/>OR btree_get_stats_with_AR_sampling"]
  C --> M["NDV via SQL:<br/>SELECT count(distinct ...), count(*)"]
  M --> N["stats_get_ndv_by_query<br/>statistics_cl.c"]
  N --> F
  F --> O["catalog_add_representation<br/>(persist disk_repr_p)"]
  F --> P["catalog_add_class_info<br/>(persist time_stamp, tot_*)"]
  F --> Q["catcls_update_class_stats<br/>(write _db_class row)"]

There are two distinct walks running for every UPDATE STATISTICS:

The NDV walk runs on the client before the server RPC. stats_get_ndv_by_query (in statistics_cl.c) builds a SELECT count(distinct c1), count(distinct c2), …, count(*) FROM [classname] and sends it to the server as a normal query. The SAMPLING_SCAN hint is added unless WITH FULLSCAN was specified. Char/varchar columns wider than STATS_MAX_PRECISION = 4000 and all LOB and JSON columns get a sentinel cast(-1 as BIGINT) — they are opted out of distinct-counting because the comparison cost on a multi-thousand-character column is unbounded. The result tuple becomes the CLASS_ATTR_NDV array shipped into xstats_update_statistics.

The B+Tree walk runs on the server inside xstats_update_statistics. For each index attribute, btree_get_stats decides between full scan and acceptance-rejection sampling based on page count:

// btree_get_stats — src/storage/btree.c
if (with_fullscan || npages <= STATS_SAMPLING_THRESHOLD)
  {
    /* do fullscan at small table */
    ret = btree_get_stats_with_fullscan (thread_p, env);
  }
else
  {
    ret = btree_get_stats_with_AR_sampling (thread_p, env);
  }

STATS_SAMPLING_THRESHOLD is 5000 pages — below that, full scan is cheaper than sampling. Above that, the AR sampler picks random leaves until either 5000 distinct leaves have been visited (STATS_SAMPLING_LEAFS_MAX) or 5000 trials have elapsed (STATS_SAMPLING_THRESHOLD):

// btree_get_stats_with_AR_sampling — src/storage/btree.c
for (n = 0; n < STATS_SAMPLING_THRESHOLD; n++)
  {
    if (env->stat_info->leafs >= STATS_SAMPLING_LEAFS_MAX)
      break;
    BTS->C_page = btree_find_AR_sampling_leaf (thread_p, ...);
    /* ... walk every key on this leaf, increment keys, pkeys[] ... */
  }

/* apply distributed expansion */
exp_ratio = (double) env->stat_info->pages / (double) env->stat_info->leafs;
env->stat_info->leafs *= exp_ratio;
env->stat_info->keys *= exp_ratio;
for (i = 0; i < env->pkeys_val_num; i++)
  env->stat_info->pkeys[i] *= exp_ratio;

The sampler relies on the assumption that keys are spread roughly uniformly across the B+Tree’s leaf pages — under skew, the expansion overshoots or undershoots and the optimizer gets a biased estimate. For typical OLTP workloads the bias is absorbed by the order-of-magnitude granularity of the selectivity model.

The full-scan path is straightforward: a left-most leaf descent (btree_find_lower_bound_leaf) followed by leaf-level next-key traversal (btree_find_next_index_record), with an MVCC visibility filter applied at each key:

// btree_get_stats_with_fullscan — src/storage/btree.c
mvcc_snapshot = logtb_get_mvcc_snapshot (thread_p);
ret = btree_find_lower_bound_leaf (thread_p, BTS, env->stat_info);
while (!BTREE_END_OF_SCAN (BTS))
  {
    if (!VPID_EQ (&(BTS->C_vpid), &C_vpid))
      env->stat_info->leafs++;
    ret = btree_get_stats_key (thread_p, env, mvcc_snapshot);
    ret = btree_find_next_index_record (thread_p, BTS);
  }

Each call to btree_get_stats_key increments keys by 1 and walks the multi-column key into btree_get_stats_midxkey to update each pkeys[i] — the distinct count of the first i+1 index columns. The MVCC filter (btree_get_num_visible_from_leaf_and_ovf) ensures keys whose only visible OIDs were deleted by other transactions are not counted, matching the row count visible to readers.

The total row count cls_info_p->ci_tot_objects is not derived from the heap walk — it is taken straight from the NDV query’s count(*):

// xstats_update_statistics — src/storage/statistics_sr.c
/* use value from "select --+ sampling count(*) ..." */
cls_info_p->ci_tot_objects =
  class_attr_ndv->attr_ndv[class_attr_ndv->attr_cnt].ndv;

The slot at index attr_cnt is the trailing count(*) appended by stats_make_select_list_for_ndv. The page count ci_tot_pages comes from the file manager (file_get_num_user_pages) — the heap-page count is a purely physical measurement, independent of MVCC visibility.

Statistics live in two parallel places, mirroring the catalog manager’s split design (see cubrid-catalog-manager.md §“Disk representation vs. logical schema”):

The internal catalog stores the canonical, untyped representation — xstats_update_statistics calls catalog_add_representation to persist disk_repr_p (per-attribute and per-index numbers) and catalog_add_class_info to persist cls_info_p (time_stamp, tot_objects, tot_pages, heap file id). These records are stored in the catalog manager’s CTID-keyed overflow file and can be re-read by the server without parsing SQL.

The user-visible system class _db_class carries a mirror of the same fields so that SELECT … FROM _db_class sees up-to-date statistics. catcls_update_class_stats locates the _db_class row by class name, reads it via the heap manager, mutates the stats fields with catcls_update_or_value_class_stats_fields, and writes it back through the standard heap update path (with index maintenance via locator_update_index).

The wire format that ships statistics from server to client is purpose-built — not a memcpy of the structs (the structs contain pointers and 64-bit values whose layout is host-specific) but an OR-packed buffer assembled by xstats_get_statistics_from_server:

// xstats_get_statistics_from_server — src/storage/statistics_sr.c
size = (OR_INT_SIZE		/* time_stamp of CLS_INFO */
        + OR_INT_SIZE		/* tot_objects of CLS_INFO */
        + OR_INT_SIZE		/* tot_pages of CLS_INFO */
        + OR_INT_SIZE		/* n_attrs from DISK_REPR */
        + (OR_INT_SIZE		/* id of DISK_ATTR */
           + OR_INT_SIZE	/* type of DISK_ATTR */
           + OR_INT_SIZE	/* n_btstats of DISK_ATTR */
           + OR_INT64_SIZE	/* Number of Distinct Values */
          ) * n_attrs);
size += ((OR_BTID_ALIGNED_SIZE	/* btid of BTREE_STATS */
          + OR_INT_SIZE /* leafs */ + OR_INT_SIZE /* pages */
          + OR_INT_SIZE /* height */ + OR_INT_SIZE /* keys */
          + OR_INT_SIZE		/* dedup_idx */
          + OR_INT_SIZE		/* has_function */
         ) * tot_n_btstats);
size += tot_key_info_size;	/* key_type, pkeys[] */

Important wire-level observations: NDV is shipped as OR_INT64_SIZE (the catalog field is INT64) but every B+Tree stat is OR_INT_SIZE (signed 32-bit). On a billion-row table the per-index keys field can saturate; the sampling expansion code defends with if (… < 0) … = INT_MAX overflow guards, but a true full-scan past 2^31 keys would silently truncate. The class-level ci_tot_objects is also 32-bit on the wire, despite the underlying source being the 64-bit NDV count(*) value.

The client side (stats_client_unpack_statistics in statistics_cl.c) inverts the packing into an in-memory CLASS_STATS allocated from the workspace allocator (db_ws_alloc), and crucially clamps the per-key counts back into a sane range:

// stats_client_unpack_statistics — src/storage/statistics_cl.c
/* validate key stats info */
for (...)
  {
    btree_stats_p->keys =
      MIN (btree_stats_p->keys, class_stats_p->heap_num_objects);
    for (k = 0; k < btree_stats_p->pkeys_size; k++)
      btree_stats_p->pkeys[k] =
        MIN (btree_stats_p->pkeys[k], btree_stats_p->keys);
  }

The MIN clamp matters because the B+Tree’s keys count can transiently exceed the class’s heap_num_objects — either because the B+Tree saw uncommitted inserts the heap NDV query missed, or because of MVCC visibility differences between the two passes. Letting the optimizer see keys > heap_num_objects would corrupt the cardinality math, so the client trims defensively.

The optimizer never reads the catalog directly. At plan time, qo_get_attr_info (in query_graph.c) is called for each QO_SEGMENT (a column reference in the query graph) and pulls ATTR_STATS out of the cached CLASS_STATS hanging off the workspace SM_CLASS:

// qo_get_attr_info — src/optimizer/query_graph.c
stats = QO_GET_CLASS_STATS (class_info_entryp);
QO_ASSERT (env, stats != NULL);
/* search the attribute from the class information */
attr_statsp = stats->attr_stats;
n_attrs = stats->n_attrs;
for (j = 0; j < n_attrs; j++, attr_statsp++)
  {
    if (attr_statsp->id == attr_id)
      break;
  }

If the segment has B+Tree statistics (i.e., the column is indexed), the per-index BTREE_STATS records are folded into a single QO_ATTR_CUM_STATS summary on the segment. The fold rule keeps the maximum height and the index with the maximum keys count — this is the “best-cardinality index” heuristic:

// qo_get_attr_info — src/optimizer/query_graph.c (loop body)
cum_statsp->leafs += bstatsp->leafs;
cum_statsp->pages += bstatsp->pages;
cum_statsp->height = MAX (cum_statsp->height, bstatsp->height);
if (cum_statsp->pkeys_size == 0 || cum_statsp->keys < bstatsp->keys)
  {
    cum_statsp->keys = bstatsp->keys;
    cum_statsp->key_type = bstatsp->key_type;
    cum_statsp->pkeys_size = bstatsp->pkeys_size;
    /* … reallocate pkeys array … */
    for (i = 0; i < cum_statsp->pkeys_size; i++)
      cum_statsp->pkeys[i] = bstatsp->pkeys[i];
  }

The aggregate is cached on QO_SEG_INFO(seg) and read by both the cost functions and the selectivity estimators.

qo_iscan_cost and qo_sscan_cost compute the four-tuple (fixed_cpu_cost, fixed_io_cost, variable_cpu_cost, variable_io_cost) that drives qo_plan_compute_cost’s plan ordering:

// qo_iscan_cost — src/optimizer/query_planner.c
sel = 1.0;
for (... iterate index terms ...)
  sel *= QO_TERM_SELECTIVITY (termp);
sel = MIN (sel, 1.0);

/* selectivity floor */
if (i <= pkeys_num && cum_statsp->pkeys[index] >= 1)
  sel_limit = 1.0 / (double) cum_statsp->pkeys[index];
else if (cum_statsp->keys >= 1)
  sel_limit = 1.0 / (double) cum_statsp->keys;
else if (QO_NODE_NCARD (nodep) >= 1)
  sel_limit = 1.0 / (double) QO_NODE_NCARD (nodep);
sel = MAX (sel, sel_limit);

leaf_access = sel * (double) QO_NODE_NCARD (nodep);
height = (double) cum_statsp->height - 1;
leaves = ceil (sel * (double) cum_statsp->leafs);
opages = (double) QO_NODE_TCARD (nodep);
index_IO = (n * height) + leaves;

planp->fixed_io_cost = index_IO;
planp->variable_cpu_cost = (leaf_access + heap_access) * QO_CPU_WEIGHT;
planp->variable_io_cost = object_IO;

The B+Tree statistics map directly onto cost components: height is the random-access depth charged once per scan; leafs becomes the leaf-page I/O charged per matching key; pkeys[] provides the selectivity floor: 1/pkeys[i] is the smallest selectivity the optimizer is willing to believe for a predicate matched to i+1 prefix columns, so the cost cannot collapse to zero on a predicate the index cannot resolve.

The selectivity floor is the load-bearing use of partial-key NDV. Without it, the chain of qo_*_selectivity multiplications (each at most 1.0, often DEFAULT_*_SELECTIVITY = 0.001 to 0.1) underestimates row counts on multi-AND predicates and the optimizer would over-prefer low-cost-but-low-cardinality plans.

qo_sscan_cost is trivial by comparison — sequential scan costs QO_NODE_NCARD * QO_CPU_WEIGHT CPU and QO_NODE_TCARD I/O, where NCARD is heap_num_objects and TCARD is heap_num_pages:

// qo_sscan_cost — src/optimizer/query_planner.c
planp->variable_cpu_cost = (double) QO_NODE_NCARD (nodep) * QO_CPU_WEIGHT;
planp->variable_io_cost = (double) QO_NODE_TCARD (nodep);

qo_expr_selectivity is the dispatcher: walk the predicate tree, recurse on AND/OR/NOT, and call the appropriate leaf estimator:

// qo_expr_selectivity — src/optimizer/query_planner.c
case PT_OR: selectivity = qo_or_selectivity (env, lhs, rhs); break;
case PT_AND: selectivity = qo_and_selectivity (env, lhs, rhs); break;
case PT_EQ:
case PT_NULLSAFE_EQ:
selectivity = qo_equal_selectivity (env, node); break;
case PT_GE: case PT_GT: case PT_LT: case PT_LE:
selectivity = qo_comp_selectivity (env, node); break;
case PT_BETWEEN:
selectivity = qo_between_selectivity (env, node); break;
case PT_RANGE: selectivity = qo_range_selectivity (env, node); break;
case PT_LIKE: selectivity = prm_get_float_value (PRM_ID_LIKE_TERM_SELECTIVITY); break;
case PT_IS_NULL: selectivity = DEFAULT_NULL_SELECTIVITY; break;
case PT_IS_IN: selectivity = qo_all_some_in_selectivity (env, node); break;

qo_and_selectivity and qo_or_selectivity are pure independence-assumption arithmetic (p*q and p+q-p*q), a direct transcription of the Selinger §3.2 rules.

qo_equal_selectivity is the only estimator that actually uses the statistics, and it does so through the index cardinality, not the column NDV directly:

// qo_equal_selectivity — src/optimizer/query_planner.c
case PC_ATTR:
  /* attr = const */
  lhs_icard = qo_index_cardinality (env, lhs);
  if (lhs_icard != 0)
    selectivity = (1.0 / lhs_icard);
  else
    selectivity = DEFAULT_EQUAL_SELECTIVITY;

qo_index_cardinality blends the per-column NDV (attr_stats::ndv) and the per-index partial-key distinct count (pkeys[0]):

// qo_index_cardinality — src/optimizer/query_planner.c
if (info->ndv > 0)
  {
    int ndv = (info->ndv > INT_MAX) ? INT_MAX : info->ndv;
    if (info->cum_stats.is_indexed == true && info->cum_stats.pkeys[0] > 0)
      return MIN (ndv, info->cum_stats.pkeys[0]);
    return ndv;
  }
if (info->cum_stats.is_indexed != true)
  return 0;
/* return number of the first partial-key */
return info->cum_stats.pkeys[0];

The MIN between the two sources is the safety net for the case where the NDV query and the B+Tree walk disagree — e.g., the column is indexed but the SQL count(distinct) ran with sampling and underestimated, or the B+Tree includes deleted-but-not-vacuumed keys the heap NDV missed. Choosing MIN biases toward the smaller distinct count, i.e., the larger estimated row count per equality predicate, which is the more conservative direction for optimizer mistakes (overestimating wastes cycles; underestimating picks bad plans).

The non-equality estimators are constants. qo_comp_selectivity, qo_between_selectivity, and qo_range_selectivity ignore the actual range bounds of the predicate and return the fixed DEFAULT_*_SELECTIVITY macros:

// query_planner.c — selectivity defaults
#define DEFAULT_NULL_SELECTIVITY (double) 0.01
#define DEFAULT_EXISTS_SELECTIVITY (double) 0.1
#define DEFAULT_SELECTIVITY (double) 0.1
#define DEFAULT_EQUAL_SELECTIVITY (double) 0.001
#define DEFAULT_EQUIJOIN_SELECTIVITY (double) 0.001
#define DEFAULT_COMP_SELECTIVITY (double) 0.1
#define DEFAULT_BETWEEN_SELECTIVITY (double) 0.01
#define DEFAULT_IN_SELECTIVITY (double) 0.01
#define DEFAULT_RANGE_SELECTIVITY (double) 0.1

qo_comp_selectivity is one line — return DEFAULT_COMP_SELECTIVITY — and qo_between_selectivity similarly. The structural reason is the absence of min/max or histogram boundaries on disk: there is nothing to compare the predicate constant against, so the estimator falls back to a hard-coded “10%” guess for </>/<=/>= and “1%” for BETWEEN. The cost of this simplification is paid in plan quality on range-heavy workloads.

The only range-shaped predicate that gets a data-driven estimate is IN/= SOME (qo_all_some_in_selectivity), which multiplies the equality selectivity by the IN-list cardinality. The IN-list cardinality is either the literal length of the value list or, if the right-hand side is a subquery, ((XASL_NODE *) arg2->info.query.xasl)->cardinality — which itself was set by an earlier optimizer pass off the subquery’s own statistics. The result is capped at 0.5 to prevent runaway estimates.

The header file ships a helper for adjusting sampled NDV upward when the sample exhibits very low distinct counts:

// stats_adjust_sampling_weight — src/storage/statistics.h
STATIC_INLINE int
stats_adjust_sampling_weight (INT64 sampling_ndv, int sampling_weight)
{
  /* Differential weight is applied to NDV within 1% of all rows of sample data. */
  if (sampling_weight <= 1)
    return sampling_weight;

  int min_NDV = NUMBER_OF_SAMPLING_PAGES * EXPECTED_ROWS_PER_PAGE / 100;
  if (sampling_ndv < min_NDV)
    return MAX (sampling_weight * sampling_ndv / min_NDV, 1);

  return sampling_weight;
}

The constants NUMBER_OF_SAMPLING_PAGES = 5000 and EXPECTED_ROWS_PER_PAGE = 20 give a 1%-of-sample threshold of 5000 * 20 / 100 = 1000 distinct values — below that, the sampled NDV is treated as a strong signal that the column has heavy duplication, and the up-scaling weight is reduced proportionally rather than letting the sample-to-population extrapolation overshoot. The formula at a population level is

weight' = weight * (sampling_ndv / 1000) if sampling_ndv < 1000

with a floor of 1. There is no Haas-Stokes coverage estimator and no Charikar-style worst-case bound — CUBRID trusts the sample if the NDV is healthy and dampens extrapolation when it isn’t.

A second formula in the same family is qo_estimate_ndv, used for group-by cardinality:

// qo_estimate_ndv — src/optimizer/query_planner.c
/* formula: n * (1 - ((N - p) / N)^(N / n))
 * N: total_nrows
 * p: expected_nrows
 * n: NDV of group columns
 */
double
qo_estimate_ndv (double N, double p, double n)
{
  if (N <= 0.0 || n <= 0.0)
    return 0.0;

  double ratio = (N - p) / N;
  double exponent = N / n;
  return n * (1.0 - pow (ratio, exponent));
}

This is the classic balls-in-bins distinct-count formula (Yao 1977; Cardenas 1975) — given N items uniformly distributed across n distinct groups, after observing p items the expected number of non-empty groups is n * (1 - ((N-p)/N)^(N/n)). It powers qo_estimate_ngroups which sets QO_PLAN::info::ngroups for GROUP BY plans, but it never reaches back into BTREE_STATS::pkeys itself — it consumes whatever NDV the caller passes in.

UPDATE STATISTICS is parsed into a PT_NODE whose info union holds pt_update_stats_info:

// pt_update_stats_info — src/parser/parse_tree.h
struct pt_update_stats_info
{
  PT_NODE *class_list;		/* PT_NAME */
  int all_classes;		/* 1 iff ALL CLASSES, -1 iff ON CATALOG CLASSES */
  int with_fullscan;		/* 1 iff WITH FULLSCAN */
};

The semantic checker pt_check_update_stats validates permissions (each class needs ALTER authority) and expands ALL CLASSES into the actual class set. The executor lives in execute_statement.c:

// do_update_stats — src/query/execute_statement.c
for (cls = statement->info.update_stats.class_list;
     cls != NULL && error == NO_ERROR; cls = cls->next)
  {
    class_mop = cls->info.name.db_object;
    class_type = ((SM_CLASS *) class_mop->object)->class_type;
    if (class_type == SM_CLASS_CT)
      error = sm_update_statistics (class_mop,
                                    statement->info.update_stats.with_fullscan);
  }

sm_update_statistics (in schema_manager.c) flushes any dirty workspace state for the class, then calls stats_update_statistics which ultimately reaches the server-side xstats_update_statistics. After the server returns success, sm_update_statistics invalidates the cached class_->stats, then implicitly re-fetches it on the next sm_get_class_with_statistics call:

// sm_update_statistics — src/object/schema_manager.c
error = stats_update_statistics (classop, with_fullscan);
if (error == NO_ERROR)
  {
    if (classop->object != NULL)
      {
        error = au_fetch_class_force (classop, &class_, AU_FETCH_READ);
        if (error == NO_ERROR)
          {
            if (class_->stats != NULL)
              {
                stats_free_statistics (class_->stats);
                class_->stats = NULL;
              }
            /* … flush class so subsequent fetch picks up new stats … */
          }
      }
  }

There is also a parallel auto-trigger path: bulk loaders (load_sa_loader.cpp), CREATE INDEX and ALTER TABLE (execute_schema.c), and the standalone-mode boot (util_sa.c) all call sm_update_statistics with STATS_WITH_SAMPLING to refresh statistics after large row-shape changes. There is no time-based or threshold-based auto-trigger — purely event-driven on schema/data DDL.

Histogram bucket layout (or absence thereof)


The CUBRID equivalent of an “equi-depth histogram” is the partial-key distinct-count vector pkeys[]. For a multi-column index (a, b, c) it carries:

graph LR
  subgraph BTREE_STATS_pkeys["BTREE_STATS::pkeys[] for index(a,b,c)"]
    P0["pkeys[0] = NDV of {a}<br/>e.g., 100"]
    P1["pkeys[1] = NDV of {a,b}<br/>e.g., 1000"]
    P2["pkeys[2] = NDV of {a,b,c}<br/>e.g., 100000<br/>== keys"]
    P0 --> P1 --> P2
  end
  subgraph cost["used by qo_iscan_cost"]
    SF["sel_limit = 1.0 / pkeys[i]<br/>where i = number of leading equal-bound segments"]
  end
  P0 -.->|i=0, predicate on a| SF
  P1 -.->|i=1, predicate on a AND b| SF
  P2 -.->|i=2, predicate on a AND b AND c| SF

This is not a histogram: there is no value-axis quantization, no per-bucket frequency, no upper/lower bounds. The single number pkeys[i] answers exactly one question: “given equality bindings on the first i+1 columns of this index, how many distinct keys remain?” That suffices to bound equality selectivity from below (1/pkeys[i]) but tells the optimizer nothing about the shape of the value distribution. The maximum prefix length is BTREE_STATS_PKEYS_NUM = 8 (defined in statistics.h); indexes with more than 8 columns truncate, and the trailing pkeys[pkeys_size-1] is forced equal to keys by the client unpacker as a sanity invariant.

Symbols grouped by phase. Source line numbers are approximate hints, valid as of this document’s last update.

  • pt_update_stats_info — parser node
  • pt_check_update_stats — semantic check, authorization
  • do_update_stats — executor entry (in execute_statement.c, near line 4483)
  • sm_update_statistics — schema-manager dispatcher
  • stats_update_statistics — client-side network call
  • network_interface_sr.cpp (server handler near line 4278) — server entry
  • xstats_update_statistics — top-level server routine
  • stats_update_partitioned_statistics — partition aggregator
  • btree_get_stats — per-index dispatcher (full vs. AR sampling)
  • btree_get_stats_with_fullscan — left-most descent + leaf walk
  • btree_get_stats_with_AR_sampling — random-leaf sampler with STATS_SAMPLING_THRESHOLD / STATS_SAMPLING_LEAFS_MAX
  • btree_find_AR_sampling_leaf — random leaf chooser
  • btree_get_stats_key — per-key counter + MVCC visibility filter
  • btree_get_stats_midxkey — multi-column partial-key updater
  • btree_get_num_visible_from_leaf_and_ovf — MVCC visibility for leaf+overflow
  • stats_find_inherited_index_stats — per-partition index lookup
  • stats_get_time_stamp — time() wrapper
  • stats_dump_class_statistics — debug dumper (DEBUG-only)
  • stats_get_ndv_by_query — SELECT count(distinct …), count(*) runner
  • stats_make_select_list_for_ndv — column list builder, opts out LOB/JSON/large-char
  • stats_ndv_dump — debug dumper
  • STATS_MAX_PRECISION = 4000 — char-precision cutoff
  • STATS_WITH_FULLSCAN, STATS_WITH_SAMPLING — boolean modes
  • stats_adjust_sampling_weight — inline NDV up-scaling helper
  • catalog_get_class_info, catalog_add_class_info — CLS_INFO read/write
  • catalog_get_representation, catalog_add_representation — DISK_REPR read/write
  • catalog_start_access_with_dir_oid — directory-OID lock
  • catcls_update_class_stats — _db_class row mutator
  • catcls_update_or_value_class_stats_fields — field-level stat updater
  • xstats_get_statistics_from_server — server-side packer
  • OR_PUT_INT, OR_PUT_INT64, OR_PUT_BTID, or_pack_domain — packing primitives
  • stats_client_unpack_statistics — client-side unpacker with MIN clamping
  • stats_get_statistics — client wrapper, calls stats_get_statistics_from_server
  • stats_free_statistics — workspace-allocator deallocator
  • qo_get_attr_info — fold BTREE_STATS array into QO_ATTR_CUM_STATS
  • qo_get_attr_info_func_index — function-index variant
  • QO_ATTR_CUM_STATS — folded cumulative struct (in optimizer.h)
  • QO_GET_CLASS_STATS — pointer to CLASS_STATS from class info entry
  • qo_index_cardinality — MIN(ndv, pkeys[0]) blender
  • qo_index_cardinality_with_dedup — same with seg-bitset dedup
  • qo_classify — predicate operand classifier (PC_ATTR / PC_CONST / …)
  • qo_iscan_cost — index-scan cost equation
  • qo_sscan_cost — sequential-scan cost equation
  • qo_plan_compute_cost — plan cost dispatcher
  • qo_expr_selectivity — predicate-tree walker
  • qo_or_selectivity, qo_and_selectivity, qo_not_selectivity — independence-assumption arithmetic
  • qo_equal_selectivity — the only data-driven leaf estimator (uses qo_index_cardinality)
  • qo_comp_selectivity — constant DEFAULT_COMP_SELECTIVITY
  • qo_between_selectivity — constant DEFAULT_BETWEEN_SELECTIVITY
  • qo_range_selectivity — uses index cardinality only for PT_BETWEEN_EQ_NA
  • qo_all_some_in_selectivity — IN-list / IN-subquery selectivity
  • qo_estimate_ndv — Yao-Cardenas balls-in-bins NDV under sampling
  • qo_estimate_ngroups — group-by cardinality
  • CLASS_STATS, ATTR_STATS, BTREE_STATS — in statistics.h
  • CLASS_ATTR_NDV, ATTR_NDV — in statistics.h
  • BTREE_STATS_PKEYS_NUM = 8 — partial-key fanout cap
  • STATS_SAMPLING_THRESHOLD = 5000, STATS_SAMPLING_LEAFS_MAX = 5000 — sampler bounds
  • NUMBER_OF_SAMPLING_PAGES = 5000, EXPECTED_ROWS_PER_PAGE = 20 — NDV-adjust constants
  • DEFAULT_*_SELECTIVITY — fallback selectivity table in query_planner.c
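The balls-in-bins idea behind qo_estimate_ndv can be sketched as follows, assuming the estimator inverts Cardenas' expected-distinct formula; the helper names and the bisection inversion are illustrative, not CUBRID's exact code:

```python
def cardenas_expected_distinct(D, n):
    # Expected number of distinct values observed in n uniform draws over D
    # equally likely values (Cardenas 1975; Yao 1977 gives the block variant).
    return D * (1.0 - (1.0 - 1.0 / D) ** n)

def estimate_ndv(d_sample, n_sample, n_total):
    """Scale a sampled distinct count d_sample (seen in n_sample rows) up to
    a full-table NDV estimate by inverting the formula with bisection.
    Illustrative sketch only."""
    lo, hi = float(d_sample), float(n_total)
    for _ in range(60):
        mid = (lo + hi) / 2.0
        if cardenas_expected_distinct(mid, n_sample) < d_sample:
            lo = mid   # mid is too few distinct values to explain the sample
        else:
            hi = mid
    return min(hi, float(n_total))
```

With D = 500 true distinct values, a 1,000-row sample is expected to see about 432 of them; feeding 432 back through `estimate_ndv` recovers an estimate near 500 rather than reporting the raw sampled count.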
| Symbol | File | Line |
| --- | --- | --- |
| xstats_update_statistics | src/storage/statistics_sr.c | 109 |
| stats_update_partitioned_statistics | src/storage/statistics_sr.c | 1048 |
| xstats_get_statistics_from_server | src/storage/statistics_sr.c | 376 |
| stats_find_inherited_index_stats | src/storage/statistics_sr.c | 1443 |
| stats_get_time_stamp | src/storage/statistics_sr.c | 831 |
| stats_get_statistics | src/storage/statistics_cl.c | 56 |
| stats_client_unpack_statistics | src/storage/statistics_cl.c | 85 |
| stats_free_statistics | src/storage/statistics_cl.c | 252 |
| stats_dump | src/storage/statistics_cl.c | 293 |
| stats_get_ndv_by_query | src/storage/statistics_cl.c | 435 |
| stats_make_select_list_for_ndv | src/storage/statistics_cl.c | 569 |
| BTREE_STATS | src/storage/statistics.h | 60 |
| ATTR_STATS | src/storage/statistics.h | 82 |
| CLASS_STATS | src/storage/statistics.h | 93 |
| stats_adjust_sampling_weight | src/storage/statistics.h | 135 |
| BTREE_STATS_PKEYS_NUM | src/storage/statistics.h | 43 |
| STATS_SAMPLING_THRESHOLD | src/storage/statistics.h | 37 |
| btree_get_stats | src/storage/btree.c | 7397 |
| btree_get_stats_with_fullscan | src/storage/btree.c | 7264 |
| btree_get_stats_with_AR_sampling | src/storage/btree.c | 7131 |
| btree_get_stats_key | src/storage/btree.c | 6959 |
| btree_get_stats_midxkey | src/storage/btree.c | 6854 |
| qo_iscan_cost | src/optimizer/query_planner.c | 2078 |
| qo_sscan_cost | src/optimizer/query_planner.c | 1711 |
| qo_expr_selectivity | src/optimizer/query_planner.c | 9711 |
| qo_equal_selectivity | src/optimizer/query_planner.c | 9916 |
| qo_range_selectivity | src/optimizer/query_planner.c | 10207 |
| qo_comp_selectivity | src/optimizer/query_planner.c | 10174 |
| qo_between_selectivity | src/optimizer/query_planner.c | 10188 |
| qo_all_some_in_selectivity | src/optimizer/query_planner.c | 10338 |
| qo_index_cardinality | src/optimizer/query_planner.c | 10504 |
| qo_estimate_ndv | src/optimizer/query_planner.c | 684 |
| DEFAULT_*_SELECTIVITY macros | src/optimizer/query_planner.c | 427-435 |
| qo_get_attr_info | src/optimizer/query_graph.c | 5168 |
| qo_get_attr_info_func_index | src/optimizer/query_graph.c | 5003 |
| QO_ATTR_CUM_STATS | src/optimizer/optimizer.h | 119 |
| catcls_update_class_stats | src/storage/catalog_class.c | 4453 |
| sm_update_statistics | src/object/schema_manager.c | 4201 |
| do_update_stats | src/query/execute_statement.c | ~4483 |
| pt_update_stats_info | src/parser/parse_tree.h | 2977 |

Cost-function arguments vs. fields actually filled

Cross-checking what qo_iscan_cost reads against what xstats_update_statistics writes shows four points worth flagging.

cum_statsp->keys is filled but not used for I/O cost. The field is touched in qo_iscan_cost only as the secondary selectivity-floor source (sel_limit = 1.0 / cum_statsp->keys when the partial-key fallback fails). The actual leaf I/O cost (leaves = ceil (sel * (double) cum_statsp->leafs)) does not depend on keys. This means an index whose keys count is wrong (because of an old stats sample) but whose leafs is right will produce a correct I/O cost but a possibly-wrong selectivity floor — a subtle, common real-world skew.
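A minimal sketch of the asymmetry, with hypothetical names and shapes rather than the real qo_iscan_cost signature:

```python
import math

def iscan_io_sketch(sel, leafs, keys):
    """Sketch of the two places the index-scan cost consults BTREE_STATS:
    leaf I/O depends only on `leafs`; `keys` only supplies a selectivity
    floor. Illustrative shape only, not CUBRID's code."""
    sel_limit = 1.0 / keys           # secondary selectivity floor from keys
    leaves = math.ceil(sel * leafs)  # leaf-page I/O: depends on leafs only
    return max(sel, sel_limit), leaves

# A stale `keys` (1_000 vs. the true 100_000) leaves leaf I/O unchanged
# but raises the selectivity floor a hundredfold.
sel_a, leaves_a = iscan_io_sketch(1e-6, leafs=2_000, keys=100_000)
sel_b, leaves_b = iscan_io_sketch(1e-6, leafs=2_000, keys=1_000)
assert leaves_a == leaves_b == 1
assert sel_b == 100 * sel_a
```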

heap_num_objects reaches the optimizer twice. It arrives as QO_NODE_NCARD (from CLASS_STATS::heap_num_objects, which the server set from the NDV count(*)) and again implicitly as pkeys[pkeys_size-1] == keys for the leading unique index. The two should agree to within MVCC visibility skew, but the client-side MIN clamp in stats_client_unpack_statistics only clamps down (pkeys <= keys <= heap_num_objects). If heap_num_objects is too small (NDV count(*) ran with sampling, or the heap was pruned between the two passes) the index keys is silently truncated, masking the discrepancy.
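A sketch of the downward clamp, assuming the ordering described above (the function and its shapes are illustrative, not the real unpacker):

```python
def clamp_stats(heap_num_objects, keys, pkeys):
    """Client-side downward clamp: pkeys[i] <= keys <= heap_num_objects.
    Illustrative shape only."""
    keys = min(keys, heap_num_objects)          # index keys capped by heap rows
    pkeys = [min(p, keys) for p in pkeys]       # prefix counts capped by keys
    return heap_num_objects, keys, pkeys

# If the NDV-pass count(*) undercounted the heap (9_000 vs. a true 10_000
# keys in a unique index), the index key count is silently truncated too,
# and the discrepancy between the two passes disappears from view.
heap, keys, pkeys = clamp_stats(9_000, 10_000, [120, 10_000])
assert keys == 9_000 and pkeys == [120, 9_000]
```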

The dedup_idx field is shipped but its semantics depend on a feature flag. The SUPPORT_DEDUPLICATE_KEY_MODE codepath in btree_get_stats_key and the btree_is_same_key_for_stats filter rewrite the per-key counter when adjacent keys differ only in the dedup suffix. The dump code asserts dedup_idx != 0 but the feature is not enabled in all builds — when disabled, the field is shipped as 0 and the optimizer ignores it. There is no documented handshake in the wire protocol confirming which side enabled deduplication, so a server-client mismatch could in principle leak biased pkeys[] counts into the optimizer.

qo_index_cardinality returns MIN(ndv, pkeys[0]) when both are positive — the comment (“Choose the better NDV of the two”) asserts this is the better estimate, which is true under the assumption that NDV cannot exceed the index’s distinct-key count. But the MIN is only safe when the index spans the column being estimated; for a column indexed by a different column’s leading-prefix index, pkeys[0] is the wrong dimension. The function does verify info->cum_stats.is_indexed == true first, but qo_get_attr_info’s aggregation rule (keep the index with the maximum keys count) means a distinct column’s index can dominate. In practice this manifests as optimistic equality selectivity for a column that shares a table with a high-cardinality unrelated index.
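The hazard can be sketched as follows; the helper is a hypothetical reduction of the blend described above, not CUBRID's exact control flow:

```python
def index_cardinality(ndv, pkeys0, is_indexed):
    """Sketch of the MIN(ndv, pkeys[0]) blend. The MIN is only sound when
    pkeys[0] counts distinct values of the column being estimated."""
    if not is_indexed or pkeys0 <= 0:
        return ndv
    if ndv <= 0:
        return pkeys0          # no column NDV: trust the index alone
    return min(ndv, pkeys0)

# If the column's own NDV is unavailable and the dominating index (picked by
# maximum keys) leads on a high-cardinality unrelated column, pkeys[0] alone
# sets the estimate, and equality selectivity 1/card becomes optimistic.
assert index_cardinality(0, 1_000_000, True) == 1_000_000
# When pkeys[0] does describe the column, the MIN is the safe choice:
assert index_cardinality(50_000, 40_000, True) == 40_000
```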

Range selectivity is not cardinality-driven

qo_comp_selectivity and qo_between_selectivity are literally one-liners returning a constant. There is no path from BTREE_STATS::keys or attr_stats::ndv to the result. This is consistent with the absence of min/max records on disk — the engine has nothing to measure the predicate constant against — but it means the cost model effectively treats WHERE c >= 5 and WHERE c >= 5000000 as having the same selectivity. On large analytic ranges this leads to systematically miscosted plans; the user-visible mitigation is the /*+ INDEX(t, idx) */ hint, which short-circuits the cost comparison.
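The shape of the fallback is easy to show; the constant values below are illustrative placeholders, not CUBRID's actual DEFAULT_*_SELECTIVITY numbers:

```python
# Placeholder fallback constants of the DEFAULT_*_SELECTIVITY kind.
DEFAULT_COMP_SELECTIVITY = 0.1
DEFAULT_BETWEEN_SELECTIVITY = 0.05

def comp_selectivity(op, lhs, rhs):
    # With no min/max records on disk, the predicate constant is never
    # inspected: every comparison gets the same constant estimate.
    return DEFAULT_COMP_SELECTIVITY

# A narrow range and a huge range are costed identically:
assert comp_selectivity(">=", "c", 5) == comp_selectivity(">=", "c", 5_000_000)
```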

Partitioned-class statistics are an arithmetic aggregate, not a probabilistic blend

stats_update_partitioned_statistics accumulates each partition’s leafs, pages, keys, and pkeys[] into a PARTITION_STATS_ACUMULATOR, and the parent class’s per-index stats are then sums over partitions (btree_stats_p->leafs = sum[btree_iter].leafs), not means — the one exception is height, which is a per-partition mean rounded up. The header comment explains the rationale (“only one partition is pruned”), but the consequence is that a SELECT against the partitioned class without partition pruning sees inflated keys and leafs counts that may be far larger than the actual leaf-page count of any single partition. This biases the optimizer toward sequential scans of partitioned classes whenever partition pruning is not visible at plan time.
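The aggregation rule can be sketched as follows, with hypothetical dict-shaped stats standing in for BTREE_STATS (illustrative only):

```python
import math

def aggregate_partition_btree_stats(parts):
    """Sketch of the rule described above: sum leafs/pages/keys across
    partitions, but take a per-partition mean (rounded up) for height."""
    n = len(parts)
    return {
        "leafs":  sum(p["leafs"] for p in parts),
        "pages":  sum(p["pages"] for p in parts),
        "keys":   sum(p["keys"] for p in parts),
        "height": math.ceil(sum(p["height"] for p in parts) / n),
    }

# Ten identical partitions: without pruning, the parent index looks 10x
# larger than any single partition, while its height stays realistic.
parts = [{"leafs": 100, "pages": 120, "keys": 5_000, "height": 3}] * 10
agg = aggregate_partition_btree_stats(parts)
assert agg["leafs"] == 1_000 and agg["keys"] == 50_000 and agg["height"] == 3
```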

  • Auto-refresh. Statistics are refreshed only by explicit UPDATE STATISTICS (or its sm_update_statistics callers in DDL paths). Is there a threshold or trigger that should kick off automatic refresh — and if so, who measures the staleness? The catalog stamp ci_time_stamp is checked by xstats_get_statistics_from_server, but only to short-circuit redundant fetches, not to trigger collection.
  • Concurrency during UPDATE STATISTICS. xstats_update_statistics holds SCH_S_LOCK and writes the catalog under X_LOCK on the directory OID. Reads during the update either wait (xstats_get_statistics_from_server takes SCH_S_LOCK unconditionally) or see the pre-update representation. Is there a window where a concurrent query catches the workspace cache between stats_free_statistics and the next sm_get_class_with_statistics? The fix-up in sm_update_statistics flushes class state before re-fetch, but the workspace class_->stats = NULL assignment looks racy.
  • 64-bit row counts on the wire. OR_PUT_INT64 is used for attr_stats::ndv but OR_PUT_INT for cls_info_p->ci_tot_objects and every BTREE_STATS field. Is this an oversight (ci_tot_objects is set from the 64-bit count(*) value) or is the implicit cap to ~2 billion rows intentional?
  • Histograms as a future extension. The structures reserve space (BTREE_STATS_RESERVED_NUM = 4) and the DEFAULT_*_SELECTIVITY constants would be the natural thing to displace. Is there a roadmap for shipping per-column equi-depth histograms or MCV lists?
  • Sampling and MVCC. Both the AR sampler and the SQL count(distinct) query run under their own MVCC snapshots, which need not be the same one. A long-running transaction concurrently inserting rows can produce BTREE_STATS::keys < ATTR_STATS::ndv for a unique index; the client MIN clamp papers over this but does not attribute the cause. Should the two passes share a snapshot?
  • Function-index NDV. qo_get_attr_info_func_index builds a synthetic QO_ATTR_INFO for function-index expressions, but the underlying B+Tree statistics are collected against the function’s evaluated value, not the source column. The optimizer’s qo_index_cardinality blends NDV using attr_stats::ndv, which is a per-column quantity — for function indexes there is no equivalent attr_stats::ndv for the synthetic column. Is the function-index path silently using pkeys[0] only?
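The 32-bit packing question above is easy to make concrete. This sketch uses Python's struct in place of the C OR_PUT_INT / OR_PUT_INT64 macros, assuming they store signed 32- and 64-bit big-endian slots:

```python
import struct

# A count(*) beyond 2^31 - 1 fits a 64-bit slot but not a signed 32-bit one.
big = 3_000_000_000
assert struct.unpack(">q", struct.pack(">q", big))[0] == big   # 64-bit: fine

try:
    struct.pack(">i", big)         # 32-bit signed slot: out of range
    overflowed = False
except struct.error:
    overflowed = True
assert overflowed

# If the value were masked into 32 bits instead, the reader would see a
# negative row count:
wrapped = struct.unpack(">i", struct.pack(">I", big & 0xFFFFFFFF))[0]
assert wrapped < 0
```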

Code paths read while writing this document (lines as of the source revision corresponding to updated:):

  • src/storage/statistics.h (153 lines)
  • src/storage/statistics_sr.h (45 lines)
  • src/storage/statistics_sr.c (1496 lines)
  • src/storage/statistics_cl.c (651 lines)
  • src/storage/btree.c (lines around 6847-7560 — stats collection)
  • src/optimizer/query_planner.c (lines 427-435 selectivity defaults; 670-695 qo_estimate_ndv; 1700-1730 qo_sscan_cost; 2073-2272 qo_iscan_cost; 9700-10650 selectivity family)
  • src/optimizer/query_graph.c (lines 5003-5160 qo_get_attr_info_func_index; 5161-5300 qo_get_attr_info)
  • src/optimizer/optimizer.h (lines 95-132 QO_ATTR_CUM_STATS)
  • src/storage/catalog_class.c (lines 4453-4559 catcls_update_class_stats)
  • src/object/schema_manager.c (lines 4189-4250 sm_update_statistics)
  • src/query/execute_statement.c (around line 4483 do_update_stats)
  • src/parser/parse_tree.h (line 2977 pt_update_stats_info; line 1233 PT_HINT_SAMPLING_SCAN)
  • src/parser/semantic_check.c (line 10419 pt_check_update_stats)
  • src/communication/network_interface_sr.cpp (line 4278 server handler)
  • src/communication/network_interface_cl.c (line 5772 stats_update_statistics client wrapper)

Textbook references:

  • Selinger, P. G. et al. Access Path Selection in a Relational Database Management System. SIGMOD 1979. §3.2 selectivity estimation, §3.3 cost equation decomposition.
  • Hellerstein, J. M.; Stonebraker, M.; Hamilton, J. Architecture of a Database System. Foundations and Trends in Databases, 2007. §6.3 plan space, §6.4 selectivity estimation.
  • Petrov, A. Database Internals. O’Reilly, 2019. Ch. “Query Processing and Optimization”.
  • Poosala, V.; Ioannidis, Y. E.; Haas, P. J.; Shekita, E. Improved Histograms for Selectivity Estimation of Range Predicates. SIGMOD 1996. Equi-depth + MCV histogram framework that CUBRID does not implement.
  • Charikar, M.; Chaudhuri, S.; Motwani, R.; Narasayya, V. Towards Estimation Error Guarantees for Distinct Values. PODS 2000. Sample-based NDV lower bounds.
  • Haas, P. J.; Stokes, L. Estimating the Number of Classes via Sample Coverage. Journal of the American Statistical Association, 1998. Foundation for sampled-NDV adjustment.
  • Yao, S. B. Approximating Block Accesses in Database Organizations. CACM 20(4), 1977. Origin of qo_estimate_ndv’s balls-in-bins formula.
  • Cardenas, A. F. Analysis and Performance of Inverted Database Structures. CACM 18(5), 1975. Companion to Yao, also referenced for the same formula.

Cross-references inside this knowledge base:

  • cubrid-query-optimizer.md — plan-space, QO_ENV / QO_NODE / QO_PLAN data flow, where qo_iscan_cost fits in the dynamic-program join enumerator
  • cubrid-catalog-manager.md — CLS_INFO, DISK_REPR, _db_class, catalog_get_representation, catalog_add_class_info
  • cubrid-btree.md — leaf-page layout, latch-coupling discipline, BTID, the descent path that btree_get_stats_with_fullscan reuses