PostgreSQL Cost Model — Page/CPU Costs and Path Comparison
Contents:
- Theoretical Background
- Common DBMS Design
- PostgreSQL’s Approach
- Source Walkthrough
- Source verification (as of 2026-06-05)
- Beyond PostgreSQL — Comparative Designs & Research Frontiers
- Sources
Theoretical Background
Section titled “Theoretical Background”A relational query has, in general, an astronomically large space of equivalent physical execution plans: each table can be reached by a sequential scan or one of several index scans, each pair of relations can be joined by nested loop, merge, or hash, and the join order itself is a permutation problem. The optimizer cannot run them to find out which is fastest. It needs a cost model — a function that, given a candidate physical operator tree, returns a number estimating how expensive it is to execute, so that “is plan A better than plan B?” reduces to “is cost(A) < cost(B)?”. The cost model is what turns optimization from search over an unmeasurable space into search over a scalar-valued one.
The canonical formulation is Selinger et al., Access Path Selection in a
Relational Database Management System (SIGMOD 1979 — the System R
optimizer paper, captured at dbms-papers/systemr-optimizer.md). System R
introduced the three ideas every cost-based optimizer since has inherited:
-
A cost metric combining I/O and CPU. Selinger’s cost is
COST = PAGE FETCHES + W * (RSI calls)— page fetches (the I/O term) plus a weighted count of “RSI calls” (the tuple-processing CPU term), withWa tunable weight balancing the two. The insight is that a plan’s cost is not one-dimensional: a plan that touches fewer pages but evaluates more predicates per tuple may or may not win, and the weightWis the dial that decides. PostgreSQL’s five cost GUCs are exactly this idea generalized — separate weights for sequential vs. random page fetches and for tuple, index-tuple, and operator CPU. -
Statistics-driven selectivity. To cost a plan you must know how many rows flow between operators, and that depends on how selective the predicates are. System R kept catalog statistics (cardinalities, index value counts) and derived selectivity factors — the fraction of rows a predicate passes — from them, with hard-coded guesses (e.g. 1/10 for an equality on an unindexed column) where statistics were absent. Row-count estimation and cost estimation are therefore two coupled subsystems: the selectivity machinery feeds row counts into the cost machinery.
-
Cost as the pruning key for dynamic-programming join enumeration. System R built join orders bottom-up, keeping for each subset of relations only the cheapest plan (plus “interesting order” exceptions for plans whose sort order might save a later sort). Cost is the comparison key that makes this pruning sound: a more expensive plan over the same relations producing the same output order can be discarded.
The cost model itself is the focus of this document — the function that
maps a physical operator to a number. The selectivity machinery that feeds
it row counts is owned by postgres-extended-statistics.md; the
join-enumeration search that consumes costs as its pruning key is owned by
postgres-join-ordering.md; the construction of the candidate Path
objects that get costed is owned by postgres-path-generation.md. This
document covers the scoring function at the center of all three.
One subtlety the textbook formulation under-emphasizes but every real
engine confronts: cost is not a single number but a (startup, total)
pair. The price to produce the first row differs from the price to
produce all rows, and the difference matters because of LIMIT and
because pipelined parents care about when their child first yields. A sort
must consume its entire input before it can emit anything (high startup,
the run cost is then cheap); a sequential scan emits its first tuple almost
immediately (near-zero startup). An optimizer that scored only total cost
would systematically misjudge LIMIT 1 queries. PostgreSQL carries both
numbers on every path.
Common DBMS Design
Section titled “Common DBMS Design”Selinger gives the skeleton — a cost metric, statistics-driven row counts,
cost-as-pruning-key. This section names the engineering conventions
that production cost models layer on top, the patterns the 1979 paper
leaves implicit. PostgreSQL’s specific dial settings in ## PostgreSQL's Approach are one point in this shared design space.
Abstract cost units, not milliseconds
Section titled “Abstract cost units, not milliseconds”A cost model does not predict wall-clock time; it predicts a unit-free score that is monotonic with time. The convention is to pick one operation as the unit — almost always a single sequential page fetch = 1.0 — and express everything else as a ratio to it. A random page fetch costs several sequential fetches; processing one tuple in the executor costs a small fraction of a page fetch. These ratios are configuration, not constants of nature: they depend on the hardware (a random fetch on an SSD is far cheaper relative to a sequential one than on a spinning disk), so they are exposed as tunable parameters. The model’s job is only to get the relative ordering of plans right.
Two cost terms: page I/O and CPU
Section titled “Two cost terms: page I/O and CPU”Following Selinger, the cost of any operator decomposes into a page-access term (how many pages it reads/writes, times the per-page cost) and a CPU term (how many tuples it processes, times the per-tuple cost, plus how many predicate operators it evaluates, times the per-operator cost). A sequential scan is page-dominated; a sort with everything in memory is CPU-dominated; an external sort pays both. Keeping the terms separate is what lets a single set of GUCs re-weight the whole model for different hardware.
Startup cost vs. total cost
Section titled “Startup cost vs. total cost”Every operator is scored with a startup cost (work done before the
first output tuple can be produced) and a total cost (startup plus the
run cost of producing all tuples). The split is load-bearing: a path that
loses on total cost can still be the best choice under a LIMIT, because
only a fraction of the run cost will actually be paid. The path-keeping
logic must therefore treat the two as a partial order — a path is
dominated only if it is worse on both axes (and on output order, and on
parameterization). This is why an optimizer keeps multiple paths per
relation rather than a single cheapest one.
Caching effects in the I/O term
Section titled “Caching effects in the I/O term”The naive page count for an index scan that fetches k tuples is k heap page fetches, but if the table is larger than memory the same page may be re-fetched, and if it is smaller than memory pages stay cached. Real cost models incorporate a buffering model — the classic one being Mackert and Lohman’s finite-LRU formula — to estimate actual distinct pages fetched given the working-set size relative to available cache. Ignoring this systematically over-costs selective index scans on cached tables.
Correlation between index order and heap order
Section titled “Correlation between index order and heap order”An index scan that returns TIDs in index-key order touches heap pages in
that order. If the heap is physically sorted the same way (high
correlation — e.g. right after CLUSTER), those heap fetches are mostly
sequential and cheap; if the index and heap orders are unrelated (zero
correlation) every heap fetch is a random access. A cost model that
charged the same per-page cost in both cases would misprice index scans by
the random/sequential ratio (4x by default). The convention is to compute
both bounds and interpolate by a function of the correlation.
Two-phase join costing for search pruning
Section titled “Two-phase join costing for search pruning”Join cost estimation is expensive — it must inspect the join quals, consult
selectivity statistics, and sometimes cost a sub-sort. But the
join-enumeration search proposes many candidate joins, most of which will
be pruned. The convention is a two-phase split: a cheap
initial_cost_* pass computes a lower bound on the path’s cost from the
input paths alone (no qual inspection), and only paths that survive
comparison against the lower bound get the expensive final_cost_* pass.
This keeps the optimizer’s own runtime bounded.
Theory ↔ PostgreSQL mapping
Section titled “Theory ↔ PostgreSQL mapping”By the time you reach a named PostgreSQL symbol in the next section, you should already know what kind of thing it is:
| Theory / convention | PostgreSQL name |
|---|---|
| Page-fetch unit = 1.0 | seq_page_cost GUC (DEFAULT_SEQ_PAGE_COST = 1.0) |
| Random fetch as a multiple of sequential | random_page_cost (default 4.0) |
Per-tuple CPU weight (W) | cpu_tuple_cost (0.01) |
| Per-index-tuple CPU weight | cpu_index_tuple_cost (0.005) |
| Per-operator CPU weight | cpu_operator_cost (0.0025) |
| (startup, total) cost pair | Path.startup_cost, Path.total_cost |
| Cost of a sequential scan | cost_seqscan |
| Cost of an index scan | cost_index (+ AM’s amcostestimate) |
| Mackert-Lohman buffering model | index_pages_fetched |
| Correlation interpolation | csquared term in cost_index |
| Cost of a sort | cost_sort → cost_tuplesort |
| Two-phase nestloop cost | initial_cost_nestloop / final_cost_nestloop |
| Two-phase hashjoin cost | initial_cost_hashjoin / final_cost_hashjoin |
Predicate CPU charge from pg_proc.procost | cost_qual_eval → add_function_cost |
| Row-count seam (selectivity) | set_baserel_size_estimates → clauselist_selectivity |
| Row-estimate sanity clamp | clamp_row_est |
PostgreSQL’s Approach
Section titled “PostgreSQL’s Approach”PostgreSQL’s cost model lives almost entirely in one file,
src/backend/optimizer/path/costsize.c, and is the Selinger model rendered
with five design decisions that give it its shape:
- Five tunable cost parameters, all global
doubles with documented defaults, expressing the I/O and CPU terms as ratios to one sequential page fetch. - Every
Pathcarries(startup_cost, total_cost), plus adisabled_nodescount (the replacement, since PG 17, for the olddisable_costadd-on) and arowsestimate. Thecost_*functions fill these three fields in. - A
cost_<node>function per physical operator, each taking the already-costed child paths and producing this node’s costs by adding its own page and CPU terms. - A two-phase split for joins (
initial_cost_*/final_cost_*) so the join search can prune on a cheap lower bound before paying for qual inspection. - A clean seam to the selectivity machinery: cost functions take
rowsas already-estimated input (fromset_baserel_size_estimatesand friends, which callclauselist_selectivity), and callcost_qual_evalto turn predicate complexity into a per-tuple CPU charge.
The five cost parameters
Section titled “The five cost parameters”The file header documents the parameters and their meaning; the variables
are plain globals initialized to the DEFAULT_* macros from cost.h:
// cost parameter globals — src/backend/optimizer/path/costsize.cdouble seq_page_cost = DEFAULT_SEQ_PAGE_COST; /* 1.0 */double random_page_cost = DEFAULT_RANDOM_PAGE_COST; /* 4.0 */double cpu_tuple_cost = DEFAULT_CPU_TUPLE_COST; /* 0.01 */double cpu_index_tuple_cost = DEFAULT_CPU_INDEX_TUPLE_COST; /* 0.005 */double cpu_operator_cost = DEFAULT_CPU_OPERATOR_COST; /* 0.0025 */double parallel_tuple_cost = DEFAULT_PARALLEL_TUPLE_COST; /* 0.1 */double parallel_setup_cost = DEFAULT_PARALLEL_SETUP_COST; /* 1000 */
int effective_cache_size = DEFAULT_EFFECTIVE_CACHE_SIZE; /* 524288 pages */The defaults encode a model of spinning-disk hardware with a substantial
buffer cache. The header comment is explicit about the relationship:
“seq_page_cost is normally considerably less than random_page_cost” because
sequential reads benefit from readahead and avoid seeks. The CPU weights are
two orders of magnitude smaller than a page fetch — processing a tuple is
cheap compared to bringing a page off disk — which is why I/O dominates the
cost of most scans. effective_cache_size is not a memory allocation; it
is the cost model’s estimate of how much OS+shared-buffer cache is available
to a query, used only by index_pages_fetched. All seven are settable
per-session and seq_page_cost/random_page_cost are settable
per-tablespace (fetched via get_tablespace_page_costs).
The enable_* GUCs (enable_seqscan, enable_indexscan, …) are not
parameters in this sense — flipping one to off does not remove the path,
it adds 1 to the path’s disabled_nodes count, so a disabled path is
chosen only if no enabled alternative exists. This replaced the old trick of
adding a giant disable_cost (1e10) to the total.
flowchart TB
subgraph PARAMS["Five cost parameters (ratios to 1 seq page fetch)"]
SPC["seq_page_cost = 1.0<br/>(the unit)"]
RPC["random_page_cost = 4.0<br/>(seek penalty)"]
CTC["cpu_tuple_cost = 0.01"]
CITC["cpu_index_tuple_cost = 0.005"]
COC["cpu_operator_cost = 0.0025"]
end
subgraph TERMS["Cost of any operator"]
IO["page-access term<br/>pages x {seq|random}_page_cost"]
CPU["CPU term<br/>tuples x cpu_tuple_cost<br/>+ operators x cpu_operator_cost"]
end
SPC --> IO
RPC --> IO
CTC --> CPU
CITC --> CPU
COC --> CPU
IO --> PAIR["Path.total_cost = startup_cost + run_cost"]
CPU --> PAIR
Figure 1 — The five parameters and the two cost terms. Every cost function decomposes its work into a page-access term (weighted by the page-cost parameters, possibly per-tablespace) and a CPU term (weighted by the per-tuple and per-operator parameters), and sums them into the path’s startup/total pair.
startup_cost vs. total_cost and the row estimate
Section titled “startup_cost vs. total_cost and the row estimate”Three fields on every Path are the cost model’s output:
startup_cost (cost before the first tuple), total_cost (cost to drain
the path), and rows (estimated output cardinality). The convention each
cost_* function follows is to accumulate a startup_cost and a
run_cost, then set total_cost = startup_cost + run_cost. Costs that must
be paid before any tuple emerges (building a hash table, sorting all input,
evaluating one-time startup predicates) go into startup_cost; costs paid
per output tuple go into run_cost. The pathkeeping logic in
add_path (owned by postgres-path-generation.md) uses this pair as a
partial order: a path dominates another only if it is no worse on startup,
total, row count’s relationship to parameterization, and pathkeys — which
is precisely why a high-startup sorted path and a low-startup unsorted path
can coexist.
The rows field is set into the path from the relation’s pre-computed
estimate (baserel->rows) or, for parameterized paths, from
param_info->ppi_rows. The cost functions do not themselves run
selectivity estimation for the base row count — that already happened in
set_baserel_size_estimates.
cost_seqscan — the simplest case
Section titled “cost_seqscan — the simplest case”cost_seqscan is the cleanest illustration of the page + CPU decomposition.
Its entire body:
// cost_seqscan — src/backend/optimizer/path/costsize.c (condensed)if (param_info) path->rows = param_info->ppi_rows;else path->rows = baserel->rows;
/* fetch estimated page cost for tablespace containing table */get_tablespace_page_costs(baserel->reltablespace, NULL, &spc_seq_page_cost);
/* disk costs */disk_run_cost = spc_seq_page_cost * baserel->pages;
/* CPU costs */get_restriction_qual_cost(root, baserel, param_info, &qpqual_cost);startup_cost += qpqual_cost.startup;cpu_per_tuple = cpu_tuple_cost + qpqual_cost.per_tuple;cpu_run_cost = cpu_per_tuple * baserel->tuples;
/* tlist eval costs are paid per output row, not per tuple scanned */startup_cost += path->pathtarget->cost.startup;cpu_run_cost += path->pathtarget->cost.per_tuple * path->rows;
path->disabled_nodes = enable_seqscan ? 0 : 1;path->startup_cost = startup_cost;path->total_cost = startup_cost + cpu_run_cost + disk_run_cost;Read it as the Selinger formula made concrete. The page term is
spc_seq_page_cost * baserel->pages — every page is read, regardless of
selectivity, because a seqscan cannot skip pages. The CPU term is
(cpu_tuple_cost + qual_per_tuple) * baserel->tuples — every tuple is
examined (note tuples, not the post-filter rows), and each costs the
base tuple cost plus the per-tuple cost of evaluating the WHERE clause
(from get_restriction_qual_cost, which returns baserel->baserestrictcost
that cost_qual_eval filled in earlier). Two subtleties: (1) the qual’s
startup component (e.g. a one-time array hash build) goes into
startup_cost, not the per-tuple loop; (2) target-list evaluation cost is
charged per output row (path->rows), since projection happens only on
surviving tuples. A seqscan’s startup cost is essentially zero — it can emit
its first tuple as soon as the first page is read.
cost_index — deferral, Mackert-Lohman, and correlation
Section titled “cost_index — deferral, Mackert-Lohman, and correlation”cost_index is the richest of the scan cost functions and the one that
exercises every convention from ## Common DBMS Design. It does not
compute the index-internal cost itself — it calls the access method’s
amcostestimate callback (for B-tree, btcostestimate; covered in
postgres-index-am.md / postgres-nbtree.md), which returns the index
descent cost, the selectivity (fraction of heap tuples matched), and the
correlation of index order to heap order:
// cost_index — src/backend/optimizer/path/costsize.c (condensed)amcostestimate = (amcostestimate_function) index->amcostestimate;amcostestimate(root, path, loop_count, &indexStartupCost, &indexTotalCost, &indexSelectivity, &indexCorrelation, &index_pages);
/* all costs for touching index itself included here */startup_cost += indexStartupCost;run_cost += indexTotalCost - indexStartupCost;
/* estimate number of main-table tuples fetched */tuples_fetched = clamp_row_est(indexSelectivity * baserel->tuples);The hard part is the heap-access I/O cost, which depends on both the
buffering model and the correlation. PostgreSQL computes two bounds and
interpolates. The uncorrelated bound (max_IO_cost) assumes every heap page
fetch is random; the correlated bound (min_IO_cost) assumes the fetches
walk the heap in order, so all but the first are sequential:
// cost_index — heap-cost bounds (normal loop_count <= 1 case), costsize.cpages_fetched = index_pages_fetched(tuples_fetched, baserel->pages, (double) index->pages, root);/* max_IO_cost is for the perfectly uncorrelated case (csquared=0) */max_IO_cost = pages_fetched * spc_random_page_cost;
/* min_IO_cost is for the perfectly correlated case (csquared=1) */pages_fetched = ceil(indexSelectivity * (double) baserel->pages);if (pages_fetched > 0){ min_IO_cost = spc_random_page_cost; /* first fetch is random */ if (pages_fetched > 1) min_IO_cost += (pages_fetched - 1) * spc_seq_page_cost; /* rest sequential */}else min_IO_cost = 0;The interpolation is by the square of the correlation:
// cost_index — correlation interpolation, costsize.ccsquared = indexCorrelation * indexCorrelation;run_cost += max_IO_cost + csquared * (min_IO_cost - max_IO_cost);When csquared == 0 (uncorrelated) the run cost is max_IO_cost; when
csquared == 1 (perfectly correlated, e.g. just after CLUSTER) it is
min_IO_cost; in between it slides linearly in csquared. The squaring is
a deliberate, admittedly heuristic choice (the comment says “XXX is that
appropriate?”) — it makes the model favor the random-access assumption
unless correlation is high. This single term is why a freshly-clustered
table makes index scans dramatically cheaper in the planner’s eyes.
The CPU side mirrors cost_seqscan but charges
cpu_tuple_cost + qpqual_cost.per_tuple only over tuples_fetched (the
rows the index actually returns to the heap-fetch step), not the whole
table — the selectivity win shows up here as well as in the page term.
index_pages_fetched is the Mackert-Lohman finite-LRU buffer model.
Given the number of tuples to fetch, the table’s page count, and a
pro-rated share of effective_cache_size, it estimates how many distinct
pages will actually be read, accounting for the fact that fetching many
tuples from a small table re-hits cached pages:
// index_pages_fetched — Mackert & Lohman 1989, costsize.c (core)T = (pages > 1) ? (double) pages : 1.0; /* # pages in table */total_pages = root->total_table_pages + index_pages;b = (double) effective_cache_size * T / total_pages; /* pro-rated cache share *//* ... force b positive and integral ... */if (T <= b) /* table fits in cache share */{ pages_fetched = (2.0 * T * tuples_fetched) / (2.0 * T + tuples_fetched); if (pages_fetched >= T) pages_fetched = T; /* never more than all pages */}/* ... T > b cases use the piecewise formula ... */The 2TN/(2T+N) shape is the asymptote: for few tuples it is ≈ N (each
tuple a fresh page); as tuples grow it saturates at T (you have read every
page). This is the buffering-effects convention made concrete, and the
single reason a selective index scan is not charged one random page per row.
flowchart TB START["cost_index(path, loop_count)"] AM["amcostestimate (btcostestimate)<br/>=> indexStartupCost, indexTotalCost,<br/>indexSelectivity, indexCorrelation"] TF["tuples_fetched = selectivity x baserel.tuples"] ML["index_pages_fetched<br/>(Mackert-Lohman: distinct heap pages<br/>given effective_cache_size share)"] MAX["max_IO_cost = pages x random_page_cost<br/>(uncorrelated bound)"] MIN["min_IO_cost = random + (pages-1) x seq<br/>(correlated bound)"] INTERP["csquared = correlation^2<br/>run_cost += max + csquared x (min - max)"] CPU["+ (cpu_tuple_cost + qual) x tuples_fetched<br/>+ tlist cost x rows"] OUT["path.startup_cost / path.total_cost"] START --> AM --> TF --> ML ML --> MAX ML --> MIN MAX --> INTERP MIN --> INTERP INTERP --> CPU --> OUT
Figure 2 — cost_index data flow. The AM callback supplies selectivity
and correlation; Mackert-Lohman converts tuples-fetched into distinct heap
pages; two I/O bounds are computed and interpolated by correlation-squared;
the CPU term is added over the fetched tuples. Index-internal cost is a
black box delegated to the AM.
cost_qual_eval — the predicate CPU seam
Section titled “cost_qual_eval — the predicate CPU seam”Where do the qual_per_tuple charges in the scan costs come from?
cost_qual_eval walks a qual expression tree and sums a (startup, per_tuple) QualCost. The driver loops over the implicitly-ANDed clause
list and walks each:
// cost_qual_eval — src/backend/optimizer/path/costsize.cvoidcost_qual_eval(QualCost *cost, List *quals, PlannerInfo *root){ cost_qual_eval_context context; context.root = root; context.total.startup = 0; context.total.per_tuple = 0; foreach(l, quals) cost_qual_eval_walker((Node *) lfirst(l), &context); *cost = context.total;}The walker caches its result on RestrictInfo nodes (eval_cost, sentinel
startup < 0 = not yet computed) so the same clause is never costed twice
across the many paths that reference it. For each function/operator node it
defers to add_function_cost, which is the bridge from the planner to
pg_proc: it charges procost * cpu_operator_cost, or asks a support
function (prosupport) for a smarter estimate. (add_function_cost lives
in plancat.c, the catalog-access layer, because it must read the
pg_proc row.)
// add_function_cost — src/backend/optimizer/util/plancat.c (tail)/* No support function, or it failed, so rely on procost */cost->per_tuple += procform->procost * cpu_operator_cost;This is the System R W * (RSI calls) term, refined: instead of a flat
per-tuple charge, each predicate’s CPU cost is the declared cost of its
functions (default procost = 1, so one cpu_operator_cost per operator),
with expensive functions (e.g. a PL/pgSQL function declared COST 100)
charged proportionally. This is why predicate ordering and
expensive-function placement can change a plan’s cost.
The same machinery has a second consumer in clauses.c: when the planner
considers inlining a SQL function whose parameter appears more than once,
it uses cost_qual_eval as an “is this expensive?” gate — defining
expensive as “more than 10 operators” — to avoid duplicating costly
sub-expressions:
// inline_function expensiveness gate — src/backend/optimizer/util/clauses.ccost_qual_eval(&eval_cost, list_make1(param), NULL);if (eval_cost.startup + eval_cost.per_tuple > 10 * cpu_operator_cost) goto fail; /* don't inline; would duplicate an expensive expression */So cost_qual_eval is not only a cost-model component but also a
decision oracle reused by expression-tree transformations.
cost_sort — the canonical high-startup operator
Section titled “cost_sort — the canonical high-startup operator”A sort is the textbook example of a high-startup_cost, low-run_cost
operator: it must read and order its entire input before it can emit the
first tuple, after which emitting is cheap. cost_sort is a thin wrapper —
it adds the input path’s cost to the startup, then delegates the model to
cost_tuplesort:
// cost_sort — src/backend/optimizer/path/costsize.cvoidcost_sort(Path *path, PlannerInfo *root, List *pathkeys, int input_disabled_nodes, Cost input_cost, double tuples, int width, Cost comparison_cost, int sort_mem, double limit_tuples){ Cost startup_cost, run_cost; cost_tuplesort(&startup_cost, &run_cost, tuples, width, comparison_cost, sort_mem, limit_tuples); startup_cost += input_cost; /* must consume all input first */ path->rows = tuples; path->disabled_nodes = input_disabled_nodes + (enable_sort ? 0 : 1); path->startup_cost = startup_cost; path->total_cost = startup_cost + run_cost;}cost_tuplesort branches on whether the data fits in sort_mem (work_mem).
The in-memory case is the classic comparison-sort cost,
comparison_cost * N * log2(N), all charged to startup; the external case
adds page I/O for the merge passes:
// cost_tuplesort — src/backend/optimizer/path/costsize.c (condensed)comparison_cost += 2.0 * cpu_operator_cost; /* default per-comparison cost */
if (output_bytes > sort_mem_bytes){ /* disk-based merge sort */ double npages = ceil(input_bytes / BLCKSZ); double nruns = input_bytes / sort_mem_bytes; double mergeorder = tuplesort_merge_order(sort_mem_bytes); double log_runs, npageaccesses;
*startup_cost = comparison_cost * tuples * LOG2(tuples); /* N log2 N comparisons */
if (nruns > mergeorder) log_runs = ceil(log(nruns) / log(mergeorder)); else log_runs = 1.0; npageaccesses = 2.0 * npages * log_runs; /* Assume 3/4ths of accesses are sequential, 1/4th are not */ *startup_cost += npageaccesses * (seq_page_cost * 0.75 + random_page_cost * 0.25);}else{ /* plain in-memory quicksort */ *startup_cost = comparison_cost * tuples * LOG2(tuples);}
/* small per-extracted-tuple charge; NOT cpu_tuple_cost — Sort does no qual/projection */*run_cost = cpu_operator_cost * tuples;Three modeling choices to notice. (1) The whole sorting cost is
startup — the run cost is just a token cpu_operator_cost per emitted
tuple, because a Sort node does no qual checking or projection, so the
planner deliberately charges it less than a normal node’s cpu_tuple_cost.
(2) The external case charges page I/O for 2 * npages * log_runs page
accesses across the merge passes, mixing 75% sequential / 25% random — an
explicit caching/locality assumption baked into the constants. (3) A useful
LIMIT (passed as limit_tuples) switches the model to a bounded heap
sort of N * log2(K) comparisons, which is why ORDER BY ... LIMIT 10
over a huge input is costed far below a full sort. This is the
startup_cost machinery earning its keep: the sort’s high startup is
exactly what makes the planner prefer an already-sorted index path under a
small LIMIT.
The two-phase join split
Section titled “The two-phase join split”Join costing is where the initial_cost_* / final_cost_* convention
appears. initial_cost_nestloop produces a lower bound from the input
paths’ costs alone — it does not inspect the join quals (the comment
says so explicitly) — and stashes intermediate values in a
JoinCostWorkspace. The join search compares this lower bound against the
best path found so far; only survivors get final_cost_nestloop, which
does the expensive qual-eval and final row estimate.
// initial_cost_nestloop — src/backend/optimizer/path/costsize.c (condensed)disabled_nodes = enable_nestloop ? 0 : 1;disabled_nodes += inner_path->disabled_nodes + outer_path->disabled_nodes;
cost_rescan(root, inner_path, &inner_rescan_start_cost, &inner_rescan_total_cost);
/* must pay both children's startup before first tuple */startup_cost += outer_path->startup_cost + inner_path->startup_cost;run_cost += outer_path->total_cost - outer_path->startup_cost;if (outer_path_rows > 1) run_cost += (outer_path_rows - 1) * inner_rescan_start_cost;
inner_run_cost = inner_path->total_cost - inner_path->startup_cost;/* ... for SEMI/ANTI/unique, defer to final; else scan whole inner per outer row ... */else{ run_cost += inner_run_cost; if (outer_path_rows > 1) run_cost += (outer_path_rows - 1) * inner_rescan_run_cost;}workspace->run_cost = run_cost; /* private data for final_cost_nestloop */The nested-loop shape is the textbook one: the inner path is (re)scanned
once per outer row, so the dominant run-cost term is
(outer_rows - 1) * inner_rescan_run_cost on top of the first
inner_run_cost. The startup is the sum of both children’s startups —
neither child can produce a tuple until it has paid its own startup. The
SEMI/ANTI/inner-unique cases (early-out after the first match) are deferred
to final because they need the join quals.
final_cost_nestloop then computes the actual tuple count and the CPU
charge over it:
// final_cost_nestloop — src/backend/optimizer/path/costsize.c (normal-case tail)ntuples = outer_path_rows * inner_path_rows; /* tuples *processed* */
cost_qual_eval(&restrict_qual_cost, path->jpath.joinrestrictinfo, root);startup_cost += restrict_qual_cost.startup;cpu_per_tuple = cpu_tuple_cost + restrict_qual_cost.per_tuple;run_cost += cpu_per_tuple * ntuples;
/* tlist eval costs are paid per output row, not per tuple scanned */startup_cost += path->jpath.path.pathtarget->cost.startup;run_cost += path->jpath.path.pathtarget->cost.per_tuple * path->jpath.path.rows;Note ntuples is the number of candidate pairs examined
(outer_rows * inner_rows), not the number of output rows — the CPU charge
is for every qual evaluation, while target-list cost is over the much
smaller path.rows output. This is the System R cost metric applied to a
join: page costs were inherited from the children’s run costs in the
initial phase; this phase adds the per-comparison CPU term.
cost_hashjoin — the other join shape
Section titled “cost_hashjoin — the other join shape”The hash join’s two-phase split sits in initial_cost_hashjoin /
final_cost_hashjoin. The initial phase charges the hash-build cost as
startup (one hash function per column per inner row, plus a cpu_tuple_cost
for inserting into the table) and, crucially, models batching: if the
inner relation does not fit in hash_mem, the executor spills to disk, so
the cost adds seq_page_cost per spilled page:
// initial_cost_hashjoin — src/backend/optimizer/path/costsize.c (condensed)startup_cost += outer_path->startup_cost;run_cost += outer_path->total_cost - outer_path->startup_cost;startup_cost += inner_path->total_cost; /* must build hash from whole inner */
/* hash build: one cpu_operator_cost per column hash + cpu_tuple_cost insert */startup_cost += (cpu_operator_cost * num_hashclauses + cpu_tuple_cost) * inner_path_rows;run_cost += cpu_operator_cost * num_hashclauses * outer_path_rows; /* probe hashing */
ExecChooseHashTableSize(inner_path_rows_total, inner_path->pathtarget->width, true, parallel_hash, outer_path->parallel_workers, &space_allowed, &numbuckets, &numbatches, &num_skew_mcvs);if (numbatches > 1) /* inner doesn't fit -> spill both sides to disk */{ double outerpages = page_size(outer_path_rows, outer_path->pathtarget->width); double innerpages = page_size(inner_path_rows, inner_path->pathtarget->width); startup_cost += seq_page_cost * innerpages; /* write inner */ run_cost += seq_page_cost * (innerpages + 2 * outerpages); /* re-read */}final_cost_hashjoin then charges the probe cost using the estimated
inner bucket size (how many inner rows land in the bucket each outer row
probes) and halves the qual-eval charge to account for hash-code mismatches
short-circuiting the comparison:
// final_cost_hashjoin — src/backend/optimizer/path/costsize.c (normal-case core)/* comparisons = outer_rows x (inner_rows x bucketsize); halve for hash-mismatch skip */startup_cost += hash_qual_cost.startup;run_cost += hash_qual_cost.per_tuple * outer_path_rows * clamp_row_est(inner_path_rows * innerbucketsize) * 0.5;
hashjointuples = approx_tuple_count(root, &path->jpath, hashclauses);
/* per surviving join tuple: cpu_tuple_cost + other (qp) quals */startup_cost += qp_qual_cost.startup;cpu_per_tuple = cpu_tuple_cost + qp_qual_cost.per_tuple;run_cost += cpu_per_tuple * hashjointuples;The contrast with nested loop is the whole point of having a cost model:
the hash join front-loads cost into startup (building the table) and then
probes cheaply, so it wins on large equijoins where nested loop’s
outer * inner comparison count explodes — but loses under a tight LIMIT
where the build cost is never amortized. The bucketsize and mcvfreq
estimates that drive the probe cost come from the selectivity machinery
(estimate_hash_bucket_stats, estimate_multivariate_bucketsize), the same
seam discussed next; extended-statistics handling of bucket size is owned by
postgres-extended-statistics.md.
The row-estimation seam
Section titled “The row-estimation seam”The cost functions take row counts as input — they do not estimate them.
The estimation happens earlier, in set_baserel_size_estimates, which is
the seam to the selectivity subsystem (clauselist_selectivity, ultimately
the selfuncs.c per-operator estimators):
// set_baserel_size_estimates — src/backend/optimizer/path/costsize.cnrows = rel->tuples * clauselist_selectivity(root, rel->baserestrictinfo, 0, JOIN_INNER, NULL);rel->rows = clamp_row_est(nrows);cost_qual_eval(&rel->baserestrictcost, rel->baserestrictinfo, root);set_rel_width(root, rel);Two outputs feed the cost model: rel->rows (the post-filter cardinality
the cost functions read) and rel->baserestrictcost (the qual CPU cost
get_restriction_qual_cost later hands to cost_seqscan). Every row
estimate passes through clamp_row_est, which forces it to at least 1.0,
rounds to an integer, and caps it to avoid double overflow — a small but
load-bearing guard, since a 0-row estimate would cause divide-by-zero in
downstream cost interpolation and an infinite estimate would break
add_path. The selectivity functions in selfuncs.c (histogram/MCV
lookups, the System R fallback constants) are owned by
postgres-extended-statistics.md; this document treats rel->rows as a
given.
flowchart LR STATS["pg_statistic<br/>(histograms, MCVs, ndistinct, correlation)"] CLS["clauselist_selectivity<br/>(selfuncs.c)"] SBSE["set_baserel_size_estimates<br/>rel.rows = tuples x selectivity<br/>rel.baserestrictcost = cost_qual_eval(...)"] COST["cost_seqscan / cost_index / cost_*<br/>read rel.rows + baserestrictcost"] ADD["add_path<br/>(keep on startup/total/order partial order)"] STATS --> CLS --> SBSE --> COST --> ADD
Figure 3 — The row-estimation seam. Statistics drive selectivity, which
sets rel->rows; the cost functions consume that cardinality plus the
pre-computed qual cost. The cost model and the selectivity model are
coupled but cleanly separated: cost functions never call
clauselist_selectivity directly for base rows.
Source Walkthrough
Section titled “Source Walkthrough”Anchor on symbol names, not line numbers. The PostgreSQL source moves between releases; a function or struct name is the stable handle. Use
git grep -n '<symbol>' src/backend/optimizer/to locate the current position. The line numbers in the position-hint table were observed at commit273fe94(REL_18_STABLE) and are quick hints only.
Cost parameters and clamps (costsize.c top)
Section titled “Cost parameters and clamps (costsize.c top)”seq_page_cost/random_page_cost/cpu_tuple_cost/cpu_index_tuple_cost/cpu_operator_cost— the five tunable cost GUCs, initialized from theDEFAULT_*macros incost.h.parallel_tuple_cost/parallel_setup_cost/effective_cache_size— parallel-communication weights and the buffering-model cache size.disable_cost(1e10) — legacy add-on; mostly superseded by the per-pathdisabled_nodescounter (enable_*GUCs increment it).clamp_row_est— force a row estimate to[1, 1e100], integral; the universal sanity guard before any cost interpolation.clamp_width_est/clamp_cardinality_to_long— companion clamps for tuple width and cardinality-to-longconversions.
Scan cost functions
Section titled “Scan cost functions”cost_seqscan— page termspc_seq_page_cost * pages+ CPU term(cpu_tuple_cost + qual) * tuples; near-zero startup.cost_index— delegates index-internal cost toamcostestimate; computes heap I/O viaindex_pages_fetched, interpolatesmin_IO_cost/max_IO_costbycsquared(correlation²).index_pages_fetched— Mackert & Lohman (1989) finite-LRU buffer model; distinct heap pages fetched given a pro-ratedeffective_cache_size.cost_samplescan/cost_bitmap_heap_scan/cost_tidscan— sibling scan costers (bitmap and TID scans cross-refpostgres-scan-nodes.md).extract_nonindex_conditions— which restriction clauses become qpquals (CPU-charged) vs. handled inside the index.
Sort and materialization
Section titled “Sort and materialization”cost_sort— wrapper: input cost into startup, thencost_tuplesort.cost_tuplesort— in-memorycomparison_cost * N * log2(N); external merge adds2 * npages * log_runspage accesses at 75% seq / 25% random; boundedN * log2(K)under aLIMIT.cost_incremental_sort— per-groupcost_tuplesortover estimated groups for a prefix-sorted input.cost_material/cost_memoize_rescan— materialization and the Memoize rescan cache.
Join cost functions (two-phase)
Section titled “Join cost functions (two-phase)”initial_cost_nestloop/final_cost_nestloop— lower-bound (no qual inspection) then final; inner rescanned once per outer row; SEMI/ANTI/unique early-out deferred to final.initial_cost_mergejoin/final_cost_mergejoin— merge join, costs the required sorts viacost_sort;cached_scanselcaches merge scan selectivity.initial_cost_hashjoin/final_cost_hashjoin— build cost as startup, batching spill viaseq_page_cost, probe cost via innerbucketsize.cost_gather/cost_gather_merge— parallel leader/worker gather, addsparallel_setup_cost+parallel_tuple_cost.cost_rescan— cost to re-execute the inner side of a join.approx_tuple_count/has_indexed_join_quals— join-tuple count and index-qual detection helpers.
Predicate CPU cost
Section titled “Predicate CPU cost”cost_qual_eval/cost_qual_eval_node— sum aQualCostover a clause list / single node; caches onRestrictInfo.eval_cost.cost_qual_eval_walker— the recursive walker; per node-type charging (FuncExpr,OpExpr,ScalarArrayOpExpr,Aggref, …).add_function_cost(inplancat.c) — bridge topg_proc.procostandprosupportsupport functions; chargesprocost * cpu_operator_cost.get_restriction_qual_cost— returnsbaserel->baserestrictcost(+ pushed-down clause cost for parameterized paths).
Row-estimation seam
Section titled “Row-estimation seam”set_baserel_size_estimates—rel->rows = tuples * selectivity;rel->baserestrictcost = cost_qual_eval(baserestrictinfo);set_rel_width.get_parameterized_baserel_size/calc_joinrel_size_estimate/set_joinrel_size_estimates— parameterized and join cardinalities.clauselist_selectivity(inselfuncs.c) — the selectivity entry point the seam calls; owned bypostgres-extended-statistics.md.
Size/parallel helpers
Section titled “Size/parallel helpers”relation_byte_size—tuples * (MAXALIGN(width) + MAXALIGN(header)).page_size—ceil(relation_byte_size / BLCKSZ).get_parallel_divisor— fraction of work per worker, modeling leader participation (leader contributes1 - 0.3 * workers).
Position hints (as of 2026-06-05, REL_18 273fe94)
Section titled “Position hints (as of 2026-06-05, REL_18 273fe94)”| Symbol | File | Line |
|---|---|---|
clamp_row_est | src/backend/optimizer/path/costsize.c | 213 |
cost_seqscan | src/backend/optimizer/path/costsize.c | 295 |
cost_gather | src/backend/optimizer/path/costsize.c | 446 |
cost_index | src/backend/optimizer/path/costsize.c | 560 |
index_pages_fetched | src/backend/optimizer/path/costsize.c | 908 |
cost_tuplesort | src/backend/optimizer/path/costsize.c | 1898 |
cost_sort | src/backend/optimizer/path/costsize.c | 2144 |
initial_cost_nestloop | src/backend/optimizer/path/costsize.c | 3267 |
final_cost_nestloop | src/backend/optimizer/path/costsize.c | 3349 |
initial_cost_hashjoin | src/backend/optimizer/path/costsize.c | 4160 |
final_cost_hashjoin | src/backend/optimizer/path/costsize.c | 4275 |
cost_qual_eval | src/backend/optimizer/path/costsize.c | 4756 |
cost_qual_eval_walker | src/backend/optimizer/path/costsize.c | 4796 |
get_restriction_qual_cost | src/backend/optimizer/path/costsize.c | 5072 |
set_baserel_size_estimates | src/backend/optimizer/path/costsize.c | 5349 |
relation_byte_size | src/backend/optimizer/path/costsize.c | 6453 |
get_parallel_divisor | src/backend/optimizer/path/costsize.c | 6474 |
add_function_cost | src/backend/optimizer/util/plancat.c | 2125 |
cost_qual_eval (inline gate) | src/backend/optimizer/util/clauses.c | 4831 |
DEFAULT_SEQ_PAGE_COST … | src/include/optimizer/cost.h | 24 |
Source verification (as of 2026-06-05)
Section titled “Source verification (as of 2026-06-05)”Verified against the working tree at /data/hgryoo/references/postgres,
REL_18_STABLE, commit 273fe94852b3a7e34fd171e8abdf1481beb302fa.
- Five cost parameters and defaults.
cost.hdefinesDEFAULT_SEQ_PAGE_COST 1.0,DEFAULT_RANDOM_PAGE_COST 4.0,DEFAULT_CPU_TUPLE_COST 0.01,DEFAULT_CPU_INDEX_TUPLE_COST 0.005,DEFAULT_CPU_OPERATOR_COST 0.0025,DEFAULT_PARALLEL_TUPLE_COST 0.1,DEFAULT_PARALLEL_SETUP_COST 1000.0,DEFAULT_EFFECTIVE_CACHE_SIZE 524288. The globals incostsize.care initialized to these. Verified bygrep -n DEFAULT_ src/include/optimizer/cost.h. cost_seqscanpage + CPU decomposition. Confirmeddisk_run_cost = spc_seq_page_cost * baserel->pagesandcpu_run_cost = (cpu_tuple_cost + qpqual_cost.per_tuple) * baserel->tuples, withtotal_cost = startup_cost + cpu_run_cost + disk_run_cost. CPU term usesbaserel->tuples(pre-filter), tlist cost usespath->rows(post-filter). Verified by reading the full function body.cost_indexdelegation and interpolation. Confirmed theamcostestimatecallback suppliesindexSelectivityandindexCorrelation;csquared = indexCorrelation * indexCorrelation;run_cost += max_IO_cost + csquared * (min_IO_cost - max_IO_cost). Theloop_count > 1branch (repeated indexscans) and the index-only-scanallvisfracreduction are present as described.- Mackert-Lohman.
index_pages_fetchedcites “Mackert and Lohman, Index Scans Using a Finite LRU Buffer, ACM TODS Vol. 14 No. 3, 1989” in its header comment and implements the2TNs/(2T+Ns)piecewise formula withba pro-ratedeffective_cache_sizeshare. Verified. cost_tuplesortmodel. Confirmed in-memory quicksortcomparison_cost * tuples * LOG2(tuples)charged to startup, external merge addsnpageaccesses * (seq_page_cost * 0.75 + random_page_cost * 0.25), run cost iscpu_operator_cost * tuples(notcpu_tuple_cost— comment explains the Sort node does no qual/projection).comparison_cost += 2.0 * cpu_operator_costdefault. Verified.- Two-phase joins.
initial_cost_nestloopandinitial_cost_hashjoinfill aJoinCostWorkspaceand explicitly defer qual inspection (“CPU costs left for later”);final_cost_nestloop/final_cost_hashjoincallcost_qual_evalonjoinrestrictinfo/hashclauses. The hashjoin batching branch (numbatches > 1) chargesseq_page_costper spilled page. Verified. cost_qual_eval→add_function_cost. Confirmed the walker caches onRestrictInfo.eval_cost(sentinelstartup < 0) and thatadd_function_cost(inplancat.c, notclauses.c) chargesprocform->procost * cpu_operator_costwhen noprosupportfunction intervenes. Theclauses.cuse at line 4831 is theinline_function“more than 10 operators” expensiveness gate, not a definition — verified bygrep -n cost_qual_eval src/backend/optimizer/util/clauses.c.- Row-estimation seam.
set_baserel_size_estimatescomputesrel->rows = clamp_row_est(rel->tuples * clauselist_selectivity(...))and setsrel->baserestrictcostviacost_qual_eval.clauselist_selectivityis declared inselfuncsand is the boundary to the statistics subsystem. Verified. - REL_18 caveats honored.
disabled_nodes(the per-path counter thatenable_*GUCs increment, present since PG 17) is used throughout rather than only the legacydisable_costadd-on;estimate_multivariate_bucketsize(extended-stats bucket sizing, PG 17+) is called fromfinal_cost_hashjoin. No pre-17 constructs were asserted.
Beyond PostgreSQL — Comparative Designs & Research Frontiers
Section titled “Beyond PostgreSQL — Comparative Designs & Research Frontiers”PostgreSQL’s cost model is a faithful, conservative descendant of the 1979 System R design. Placing it against other engines and the research literature clarifies which of its choices are universal and which are contingent.
The disk-era constant problem. PostgreSQL’s defaults
(random_page_cost = 4.0) encode spinning-disk physics: a random seek
costs roughly four sequential reads. On NVMe SSDs the ratio is closer to
1.1–1.5, and on a fully-cached working set effectively 1.0. Because the
model is unit-free and the parameters are exposed, the standard remedy is to
re-tune — lowering random_page_cost makes index scans more attractive,
which is the single most common production cost-model adjustment. The
deeper critique is that a single random_page_cost cannot express a
storage hierarchy where the first few pages are cached (cheap) and the tail
is on disk (expensive); effective_cache_size and the Mackert-Lohman model
are PostgreSQL’s partial answer, but they assume a uniform cache share per
table rather than a learned per-relation hit rate.
Calibrated vs. abstract cost units. Some engines calibrate their cost units to real hardware. DB2 and the original System R successors ran hardware-characterization to set the I/O/CPU weight; Microsoft SQL Server’s cost units are nominally seconds on a reference machine. PostgreSQL deliberately stays abstract — its costs are dimensionless and only need to be monotonic with time. The tradeoff: abstract units are simpler and robust to model error in absolute terms, but they cannot answer “will this query take 2 seconds?” and they make cross-plan comparison the only meaningful use of the number.
Cardinality estimation is the dominant error source. A celebrated line
of work — Leis et al., How Good Are Query Optimizers, Really? (VLDB 2015)
— showed empirically that for join-heavy queries, errors in the cost model
itself are dwarfed by errors in cardinality estimation: once row counts
are wrong (and they compound multiplicatively through a join tree), the
cost function’s precision is irrelevant. The paper’s blunt conclusion is
that a good optimizer needs good cardinality estimates far more than a
sophisticated cost function — PostgreSQL’s relatively simple cost arithmetic
is, in this light, the right level of investment, with the leverage living
in selfuncs.c and extended statistics (postgres-extended-statistics.md).
Learned and adaptive cost models. The research frontier moves the cost
function from hand-tuned constants to learned models. Neo (Marcus et al.,
VLDB 2019) and Bao learn a cost/latency predictor from execution feedback;
learned cardinality estimators (MSCN, DeepDB, NeuroCard) replace the
histogram pipeline. Adaptive query processing (e.g. the classic
eddies, and runtime re-optimization in systems like SCOPE and Spark’s
AQE) sidesteps estimation error by measuring intermediate cardinalities
mid-execution and re-planning. PostgreSQL has none of this in core — its
cost model is fixed at plan time — though the prosupport mechanism
(add_function_cost consulting a support function) is a small extensibility
hook, and extensions like pg_hint_plan and adaptive-statistics patches
explore the space. The conservative argument for the static model is
plan stability and explainability: a fixed cost function produces
reproducible plans an administrator can reason about with EXPLAIN.
Vectorized and columnar engines re-weight the CPU term. PostgreSQL’s
cpu_tuple_cost/cpu_operator_cost assume a row-at-a-time Volcano
executor (postgres-executor.md). Columnar/vectorized engines (DuckDB,
ClickHouse, Vectorwise) process batches and SIMD-vectorize operators, so
the per-tuple CPU constant is far lower and the relative cost of a scan vs.
a hash join shifts. A cost model tuned for a tuple-at-a-time interpreter
systematically over-costs CPU-bound operators on a vectorized engine; this
is one reason such engines ship their own cost models rather than porting
PostgreSQL’s. The model is thus coupled to the executor architecture it
scores, not just the storage hardware.
What PostgreSQL gets right by staying simple. The two-phase join split
(initial_cost_* / final_cost_*) is a genuinely good engineering pattern
that keeps optimizer runtime bounded as the join search explodes
combinatorially — many learned/adaptive systems still need an equivalent
cheap-lower-bound mechanism to make their search tractable. And the clean
(startup, total) pair, treated as a partial order by add_path, is the
right primitive for LIMIT-aware planning that flat-cost models get wrong.
Sources
Section titled “Sources”- Code (REL_18_STABLE, commit
273fe94, as of 2026-06-05):src/backend/optimizer/path/costsize.c— the cost model:cost_seqscan,cost_index,index_pages_fetched,cost_sort/cost_tuplesort,initial_cost_nestloop/final_cost_nestloop,initial_cost_hashjoin/final_cost_hashjoin,cost_gather,cost_qual_eval,get_restriction_qual_cost,set_baserel_size_estimates,clamp_row_est,relation_byte_size,get_parallel_divisor, and the five cost-parameter globals.src/backend/optimizer/util/clauses.c—inline_function’scost_qual_eval“10 operators” expensiveness gate (the clauses.c seam).src/backend/optimizer/util/plancat.c—add_function_cost, the bridge topg_proc.procost/prosupport.src/include/optimizer/cost.h— theDEFAULT_*cost-parameter macros.src/include/nodes/pathnodes.h—Path(startup_cost,total_cost,rows,disabled_nodes),QualCost,JoinCostWorkspace.
- Theory: Selinger, Astrahan, Chamberlin, Lorie, Price, Access Path
Selection in a Relational Database Management System, SIGMOD 1979 —
captured at
knowledge/research/dbms-papers/systemr-optimizer.md(cost metric, selectivity factors, cost-as-pruning-key). Mackert & Lohman, Index Scans Using a Finite LRU Buffer: A Validated I/O Model, ACM TODS 14(3), 1989 (theindex_pages_fetchedformula). - Comparative / frontier: Leis et al., How Good Are Query Optimizers,
Really?, VLDB 2015 (cardinality vs. cost error); Marcus et al., Neo: A
Learned Query Optimizer, VLDB 2019 (learned cost). Hellerstein,
Stonebraker & Hamilton, Architecture of a Database System, 2007 —
knowledge/research/dbms-papers/fntdb07-architecture.md. - Related KB docs:
postgres-path-generation.md(howPathobjects are built and kept byadd_path),postgres-join-ordering.md(the enumeration search that consumes costs as its pruning key),postgres-extended-statistics.md(the selectivity/cardinality subsystem behindclauselist_selectivity),postgres-planner-overview.md(where costing sits in the planner),postgres-executor.md(the Volcano executor the CPU weights model),postgres-scan-nodes.md,postgres-index-am.md,postgres-nbtree.md(theamcostestimatecallees).