Skip to content

PostgreSQL Cost Model — Page/CPU Costs and Path Comparison

Contents:

A relational query has, in general, an astronomically large space of equivalent physical execution plans: each table can be reached by a sequential scan or one of several index scans, each pair of relations can be joined by nested loop, merge, or hash, and the join order itself is a permutation problem. The optimizer cannot run them to find out which is fastest. It needs a cost model — a function that, given a candidate physical operator tree, returns a number estimating how expensive it is to execute, so that “is plan A better than plan B?” reduces to “is cost(A) < cost(B)?”. The cost model is what turns optimization from search over an unmeasurable space into search over a scalar-valued one.

The canonical formulation is Selinger et al., Access Path Selection in a Relational Database Management System (SIGMOD 1979 — the System R optimizer paper, captured at dbms-papers/systemr-optimizer.md). System R introduced the three ideas every cost-based optimizer since has inherited:

  1. A cost metric combining I/O and CPU. Selinger’s cost is COST = PAGE FETCHES + W * (RSI calls) — page fetches (the I/O term) plus a weighted count of “RSI calls” (the tuple-processing CPU term), with W a tunable weight balancing the two. The insight is that a plan’s cost is not one-dimensional: a plan that touches fewer pages but evaluates more predicates per tuple may or may not win, and the weight W is the dial that decides. PostgreSQL’s five cost GUCs are exactly this idea generalized — separate weights for sequential vs. random page fetches and for tuple, index-tuple, and operator CPU.

  2. Statistics-driven selectivity. To cost a plan you must know how many rows flow between operators, and that depends on how selective the predicates are. System R kept catalog statistics (cardinalities, index value counts) and derived selectivity factors — the fraction of rows a predicate passes — from them, with hard-coded guesses (e.g. 1/10 for an equality on an unindexed column) where statistics were absent. Row-count estimation and cost estimation are therefore two coupled subsystems: the selectivity machinery feeds row counts into the cost machinery.

  3. Cost as the pruning key for dynamic-programming join enumeration. System R built join orders bottom-up, keeping for each subset of relations only the cheapest plan (plus “interesting order” exceptions for plans whose sort order might save a later sort). Cost is the comparison key that makes this pruning sound: a more expensive plan over the same relations producing the same output order can be discarded.

The cost model itself is the focus of this document — the function that maps a physical operator to a number. The selectivity machinery that feeds it row counts is owned by postgres-extended-statistics.md; the join-enumeration search that consumes costs as its pruning key is owned by postgres-join-ordering.md; the construction of the candidate Path objects that get costed is owned by postgres-path-generation.md. This document covers the scoring function at the center of all three.

One subtlety the textbook formulation under-emphasizes but every real engine confronts: cost is not a single number but a (startup, total) pair. The price to produce the first row differs from the price to produce all rows, and the difference matters because of LIMIT and because pipelined parents care about when their child first yields. A sort must consume its entire input before it can emit anything (high startup, the run cost is then cheap); a sequential scan emits its first tuple almost immediately (near-zero startup). An optimizer that scored only total cost would systematically misjudge LIMIT 1 queries. PostgreSQL carries both numbers on every path.

Selinger gives the skeleton — a cost metric, statistics-driven row counts, cost-as-pruning-key. This section names the engineering conventions that production cost models layer on top, the patterns the 1979 paper leaves implicit. PostgreSQL’s specific dial settings in ## PostgreSQL's Approach are one point in this shared design space.

A cost model does not predict wall-clock time; it predicts a unit-free score that is monotonic with time. The convention is to pick one operation as the unit — almost always a single sequential page fetch = 1.0 — and express everything else as a ratio to it. A random page fetch costs several sequential fetches; processing one tuple in the executor costs a small fraction of a page fetch. These ratios are configuration, not constants of nature: they depend on the hardware (a random fetch on an SSD is far cheaper relative to a sequential one than on a spinning disk), so they are exposed as tunable parameters. The model’s job is only to get the relative ordering of plans right.

Following Selinger, the cost of any operator decomposes into a page-access term (how many pages it reads/writes, times the per-page cost) and a CPU term (how many tuples it processes, times the per-tuple cost, plus how many predicate operators it evaluates, times the per-operator cost). A sequential scan is page-dominated; a sort with everything in memory is CPU-dominated; an external sort pays both. Keeping the terms separate is what lets a single set of GUCs re-weight the whole model for different hardware.

Every operator is scored with a startup cost (work done before the first output tuple can be produced) and a total cost (startup plus the run cost of producing all tuples). The split is load-bearing: a path that loses on total cost can still be the best choice under a LIMIT, because only a fraction of the run cost will actually be paid. The path-keeping logic must therefore treat the two as a partial order — a path is dominated only if it is worse on both axes (and on output order, and on parameterization). This is why an optimizer keeps multiple paths per relation rather than a single cheapest one.

The naive page count for an index scan that fetches k tuples is k heap page fetches, but if the table is larger than memory the same page may be re-fetched, and if it is smaller than memory pages stay cached. Real cost models incorporate a buffering model — the classic one being Mackert and Lohman’s finite-LRU formula — to estimate actual distinct pages fetched given the working-set size relative to available cache. Ignoring this systematically over-costs selective index scans on cached tables.

Correlation between index order and heap order

Section titled “Correlation between index order and heap order”

An index scan that returns TIDs in index-key order touches heap pages in that order. If the heap is physically sorted the same way (high correlation — e.g. right after CLUSTER), those heap fetches are mostly sequential and cheap; if the index and heap orders are unrelated (zero correlation) every heap fetch is a random access. A cost model that charged the same per-page cost in both cases would misprice index scans by the random/sequential ratio (4x by default). The convention is to compute both bounds and interpolate by a function of the correlation.

Join cost estimation is expensive — it must inspect the join quals, consult selectivity statistics, and sometimes cost a sub-sort. But the join-enumeration search proposes many candidate joins, most of which will be pruned. The convention is a two-phase split: a cheap initial_cost_* pass computes a lower bound on the path’s cost from the input paths alone (no qual inspection), and only paths that survive comparison against the lower bound get the expensive final_cost_* pass. This keeps the optimizer’s own runtime bounded.

By the time you reach a named PostgreSQL symbol in the next section, you should already know what kind of thing it is:

Theory / conventionPostgreSQL name
Page-fetch unit = 1.0seq_page_cost GUC (DEFAULT_SEQ_PAGE_COST = 1.0)
Random fetch as a multiple of sequentialrandom_page_cost (default 4.0)
Per-tuple CPU weight (W)cpu_tuple_cost (0.01)
Per-index-tuple CPU weightcpu_index_tuple_cost (0.005)
Per-operator CPU weightcpu_operator_cost (0.0025)
(startup, total) cost pairPath.startup_cost, Path.total_cost
Cost of a sequential scancost_seqscan
Cost of an index scancost_index (+ AM’s amcostestimate)
Mackert-Lohman buffering modelindex_pages_fetched
Correlation interpolationcsquared term in cost_index
Cost of a sortcost_sortcost_tuplesort
Two-phase nestloop costinitial_cost_nestloop / final_cost_nestloop
Two-phase hashjoin costinitial_cost_hashjoin / final_cost_hashjoin
Predicate CPU charge from pg_proc.procostcost_qual_evaladd_function_cost
Row-count seam (selectivity)set_baserel_size_estimatesclauselist_selectivity
Row-estimate sanity clampclamp_row_est

PostgreSQL’s cost model lives almost entirely in one file, src/backend/optimizer/path/costsize.c, and is the Selinger model rendered with five design decisions that give it its shape:

  1. Five tunable cost parameters, all global doubles with documented defaults, expressing the I/O and CPU terms as ratios to one sequential page fetch.
  2. Every Path carries (startup_cost, total_cost), plus a disabled_nodes count (the replacement, since PG 17, for the old disable_cost add-on) and a rows estimate. The cost_* functions fill these three fields in.
  3. A cost_<node> function per physical operator, each taking the already-costed child paths and producing this node’s costs by adding its own page and CPU terms.
  4. A two-phase split for joins (initial_cost_* / final_cost_*) so the join search can prune on a cheap lower bound before paying for qual inspection.
  5. A clean seam to the selectivity machinery: cost functions take rows as already-estimated input (from set_baserel_size_estimates and friends, which call clauselist_selectivity), and call cost_qual_eval to turn predicate complexity into a per-tuple CPU charge.

The file header documents the parameters and their meaning; the variables are plain globals initialized to the DEFAULT_* macros from cost.h:

// cost parameter globals — src/backend/optimizer/path/costsize.c
double seq_page_cost = DEFAULT_SEQ_PAGE_COST; /* 1.0 */
double random_page_cost = DEFAULT_RANDOM_PAGE_COST; /* 4.0 */
double cpu_tuple_cost = DEFAULT_CPU_TUPLE_COST; /* 0.01 */
double cpu_index_tuple_cost = DEFAULT_CPU_INDEX_TUPLE_COST; /* 0.005 */
double cpu_operator_cost = DEFAULT_CPU_OPERATOR_COST; /* 0.0025 */
double parallel_tuple_cost = DEFAULT_PARALLEL_TUPLE_COST; /* 0.1 */
double parallel_setup_cost = DEFAULT_PARALLEL_SETUP_COST; /* 1000 */
int effective_cache_size = DEFAULT_EFFECTIVE_CACHE_SIZE; /* 524288 pages */

The defaults encode a model of spinning-disk hardware with a substantial buffer cache. The header comment is explicit about the relationship: “seq_page_cost is normally considerably less than random_page_cost” because sequential reads benefit from readahead and avoid seeks. The CPU weights are two orders of magnitude smaller than a page fetch — processing a tuple is cheap compared to bringing a page off disk — which is why I/O dominates the cost of most scans. effective_cache_size is not a memory allocation; it is the cost model’s estimate of how much OS+shared-buffer cache is available to a query, used only by index_pages_fetched. All seven are settable per-session and seq_page_cost/random_page_cost are settable per-tablespace (fetched via get_tablespace_page_costs).

The enable_* GUCs (enable_seqscan, enable_indexscan, …) are not parameters in this sense — flipping one to off does not remove the path, it adds 1 to the path’s disabled_nodes count, so a disabled path is chosen only if no enabled alternative exists. This replaced the old trick of adding a giant disable_cost (1e10) to the total.

flowchart TB
  subgraph PARAMS["Five cost parameters (ratios to 1 seq page fetch)"]
    SPC["seq_page_cost = 1.0<br/>(the unit)"]
    RPC["random_page_cost = 4.0<br/>(seek penalty)"]
    CTC["cpu_tuple_cost = 0.01"]
    CITC["cpu_index_tuple_cost = 0.005"]
    COC["cpu_operator_cost = 0.0025"]
  end
  subgraph TERMS["Cost of any operator"]
    IO["page-access term<br/>pages x {seq|random}_page_cost"]
    CPU["CPU term<br/>tuples x cpu_tuple_cost<br/>+ operators x cpu_operator_cost"]
  end
  SPC --> IO
  RPC --> IO
  CTC --> CPU
  CITC --> CPU
  COC --> CPU
  IO --> PAIR["Path.total_cost = startup_cost + run_cost"]
  CPU --> PAIR

Figure 1 — The five parameters and the two cost terms. Every cost function decomposes its work into a page-access term (weighted by the page-cost parameters, possibly per-tablespace) and a CPU term (weighted by the per-tuple and per-operator parameters), and sums them into the path’s startup/total pair.

startup_cost vs. total_cost and the row estimate

Section titled “startup_cost vs. total_cost and the row estimate”

Three fields on every Path are the cost model’s output: startup_cost (cost before the first tuple), total_cost (cost to drain the path), and rows (estimated output cardinality). The convention each cost_* function follows is to accumulate a startup_cost and a run_cost, then set total_cost = startup_cost + run_cost. Costs that must be paid before any tuple emerges (building a hash table, sorting all input, evaluating one-time startup predicates) go into startup_cost; costs paid per output tuple go into run_cost. The pathkeeping logic in add_path (owned by postgres-path-generation.md) uses this pair as a partial order: a path dominates another only if it is no worse on startup, total, row count’s relationship to parameterization, and pathkeys — which is precisely why a high-startup sorted path and a low-startup unsorted path can coexist.

The rows field is set into the path from the relation’s pre-computed estimate (baserel->rows) or, for parameterized paths, from param_info->ppi_rows. The cost functions do not themselves run selectivity estimation for the base row count — that already happened in set_baserel_size_estimates.

cost_seqscan is the cleanest illustration of the page + CPU decomposition. Its entire body:

// cost_seqscan — src/backend/optimizer/path/costsize.c (condensed)
if (param_info)
path->rows = param_info->ppi_rows;
else
path->rows = baserel->rows;
/* fetch estimated page cost for tablespace containing table */
get_tablespace_page_costs(baserel->reltablespace, NULL, &spc_seq_page_cost);
/* disk costs */
disk_run_cost = spc_seq_page_cost * baserel->pages;
/* CPU costs */
get_restriction_qual_cost(root, baserel, param_info, &qpqual_cost);
startup_cost += qpqual_cost.startup;
cpu_per_tuple = cpu_tuple_cost + qpqual_cost.per_tuple;
cpu_run_cost = cpu_per_tuple * baserel->tuples;
/* tlist eval costs are paid per output row, not per tuple scanned */
startup_cost += path->pathtarget->cost.startup;
cpu_run_cost += path->pathtarget->cost.per_tuple * path->rows;
path->disabled_nodes = enable_seqscan ? 0 : 1;
path->startup_cost = startup_cost;
path->total_cost = startup_cost + cpu_run_cost + disk_run_cost;

Read it as the Selinger formula made concrete. The page term is spc_seq_page_cost * baserel->pagesevery page is read, regardless of selectivity, because a seqscan cannot skip pages. The CPU term is (cpu_tuple_cost + qual_per_tuple) * baserel->tuples — every tuple is examined (note tuples, not the post-filter rows), and each costs the base tuple cost plus the per-tuple cost of evaluating the WHERE clause (from get_restriction_qual_cost, which returns baserel->baserestrictcost that cost_qual_eval filled in earlier). Two subtleties: (1) the qual’s startup component (e.g. a one-time array hash build) goes into startup_cost, not the per-tuple loop; (2) target-list evaluation cost is charged per output row (path->rows), since projection happens only on surviving tuples. A seqscan’s startup cost is essentially zero — it can emit its first tuple as soon as the first page is read.

cost_index — deferral, Mackert-Lohman, and correlation

Section titled “cost_index — deferral, Mackert-Lohman, and correlation”

cost_index is the richest of the scan cost functions and the one that exercises every convention from ## Common DBMS Design. It does not compute the index-internal cost itself — it calls the access method’s amcostestimate callback (for B-tree, btcostestimate; covered in postgres-index-am.md / postgres-nbtree.md), which returns the index descent cost, the selectivity (fraction of heap tuples matched), and the correlation of index order to heap order:

// cost_index — src/backend/optimizer/path/costsize.c (condensed)
amcostestimate = (amcostestimate_function) index->amcostestimate;
amcostestimate(root, path, loop_count,
&indexStartupCost, &indexTotalCost,
&indexSelectivity, &indexCorrelation, &index_pages);
/* all costs for touching index itself included here */
startup_cost += indexStartupCost;
run_cost += indexTotalCost - indexStartupCost;
/* estimate number of main-table tuples fetched */
tuples_fetched = clamp_row_est(indexSelectivity * baserel->tuples);

The hard part is the heap-access I/O cost, which depends on both the buffering model and the correlation. PostgreSQL computes two bounds and interpolates. The uncorrelated bound (max_IO_cost) assumes every heap page fetch is random; the correlated bound (min_IO_cost) assumes the fetches walk the heap in order, so all but the first are sequential:

// cost_index — heap-cost bounds (normal loop_count <= 1 case), costsize.c
pages_fetched = index_pages_fetched(tuples_fetched, baserel->pages,
(double) index->pages, root);
/* max_IO_cost is for the perfectly uncorrelated case (csquared=0) */
max_IO_cost = pages_fetched * spc_random_page_cost;
/* min_IO_cost is for the perfectly correlated case (csquared=1) */
pages_fetched = ceil(indexSelectivity * (double) baserel->pages);
if (pages_fetched > 0)
{
min_IO_cost = spc_random_page_cost; /* first fetch is random */
if (pages_fetched > 1)
min_IO_cost += (pages_fetched - 1) * spc_seq_page_cost; /* rest sequential */
}
else
min_IO_cost = 0;

The interpolation is by the square of the correlation:

// cost_index — correlation interpolation, costsize.c
csquared = indexCorrelation * indexCorrelation;
run_cost += max_IO_cost + csquared * (min_IO_cost - max_IO_cost);

When csquared == 0 (uncorrelated) the run cost is max_IO_cost; when csquared == 1 (perfectly correlated, e.g. just after CLUSTER) it is min_IO_cost; in between it slides linearly in csquared. The squaring is a deliberate, admittedly heuristic choice (the comment says “XXX is that appropriate?”) — it makes the model favor the random-access assumption unless correlation is high. This single term is why a freshly-clustered table makes index scans dramatically cheaper in the planner’s eyes.

The CPU side mirrors cost_seqscan but charges cpu_tuple_cost + qpqual_cost.per_tuple only over tuples_fetched (the rows the index actually returns to the heap-fetch step), not the whole table — the selectivity win shows up here as well as in the page term.

index_pages_fetched is the Mackert-Lohman finite-LRU buffer model. Given the number of tuples to fetch, the table’s page count, and a pro-rated share of effective_cache_size, it estimates how many distinct pages will actually be read, accounting for the fact that fetching many tuples from a small table re-hits cached pages:

// index_pages_fetched — Mackert & Lohman 1989, costsize.c (core)
T = (pages > 1) ? (double) pages : 1.0; /* # pages in table */
total_pages = root->total_table_pages + index_pages;
b = (double) effective_cache_size * T / total_pages; /* pro-rated cache share */
/* ... force b positive and integral ... */
if (T <= b) /* table fits in cache share */
{
pages_fetched = (2.0 * T * tuples_fetched) / (2.0 * T + tuples_fetched);
if (pages_fetched >= T)
pages_fetched = T; /* never more than all pages */
}
/* ... T > b cases use the piecewise formula ... */

The 2TN/(2T+N) shape is the asymptote: for few tuples it is ≈ N (each tuple a fresh page); as tuples grow it saturates at T (you have read every page). This is the buffering-effects convention made concrete, and the single reason a selective index scan is not charged one random page per row.

flowchart TB
  START["cost_index(path, loop_count)"]
  AM["amcostestimate (btcostestimate)<br/>=> indexStartupCost, indexTotalCost,<br/>indexSelectivity, indexCorrelation"]
  TF["tuples_fetched = selectivity x baserel.tuples"]
  ML["index_pages_fetched<br/>(Mackert-Lohman: distinct heap pages<br/>given effective_cache_size share)"]
  MAX["max_IO_cost = pages x random_page_cost<br/>(uncorrelated bound)"]
  MIN["min_IO_cost = random + (pages-1) x seq<br/>(correlated bound)"]
  INTERP["csquared = correlation^2<br/>run_cost += max + csquared x (min - max)"]
  CPU["+ (cpu_tuple_cost + qual) x tuples_fetched<br/>+ tlist cost x rows"]
  OUT["path.startup_cost / path.total_cost"]
  START --> AM --> TF --> ML
  ML --> MAX
  ML --> MIN
  MAX --> INTERP
  MIN --> INTERP
  INTERP --> CPU --> OUT

Figure 2 — cost_index data flow. The AM callback supplies selectivity and correlation; Mackert-Lohman converts tuples-fetched into distinct heap pages; two I/O bounds are computed and interpolated by correlation-squared; the CPU term is added over the fetched tuples. Index-internal cost is a black box delegated to the AM.

Where do the qual_per_tuple charges in the scan costs come from? cost_qual_eval walks a qual expression tree and sums a (startup, per_tuple) QualCost. The driver loops over the implicitly-ANDed clause list and walks each:

// cost_qual_eval — src/backend/optimizer/path/costsize.c
void
cost_qual_eval(QualCost *cost, List *quals, PlannerInfo *root)
{
cost_qual_eval_context context;
context.root = root;
context.total.startup = 0;
context.total.per_tuple = 0;
foreach(l, quals)
cost_qual_eval_walker((Node *) lfirst(l), &context);
*cost = context.total;
}

The walker caches its result on RestrictInfo nodes (eval_cost, sentinel startup < 0 = not yet computed) so the same clause is never costed twice across the many paths that reference it. For each function/operator node it defers to add_function_cost, which is the bridge from the planner to pg_proc: it charges procost * cpu_operator_cost, or asks a support function (prosupport) for a smarter estimate. (add_function_cost lives in plancat.c, the catalog-access layer, because it must read the pg_proc row.)

// add_function_cost — src/backend/optimizer/util/plancat.c (tail)
/* No support function, or it failed, so rely on procost */
cost->per_tuple += procform->procost * cpu_operator_cost;

This is the System R W * (RSI calls) term, refined: instead of a flat per-tuple charge, each predicate’s CPU cost is the declared cost of its functions (default procost = 1, so one cpu_operator_cost per operator), with expensive functions (e.g. a PL/pgSQL function declared COST 100) charged proportionally. This is why predicate ordering and expensive-function placement can change a plan’s cost.

The same machinery has a second consumer in clauses.c: when the planner considers inlining a SQL function whose parameter appears more than once, it uses cost_qual_eval as an “is this expensive?” gate — defining expensive as “more than 10 operators” — to avoid duplicating costly sub-expressions:

// inline_function expensiveness gate — src/backend/optimizer/util/clauses.c
cost_qual_eval(&eval_cost, list_make1(param), NULL);
if (eval_cost.startup + eval_cost.per_tuple > 10 * cpu_operator_cost)
goto fail; /* don't inline; would duplicate an expensive expression */

So cost_qual_eval is not only a cost-model component but also a decision oracle reused by expression-tree transformations.

cost_sort — the canonical high-startup operator

Section titled “cost_sort — the canonical high-startup operator”

A sort is the textbook example of a high-startup_cost, low-run_cost operator: it must read and order its entire input before it can emit the first tuple, after which emitting is cheap. cost_sort is a thin wrapper — it adds the input path’s cost to the startup, then delegates the model to cost_tuplesort:

// cost_sort — src/backend/optimizer/path/costsize.c
void
cost_sort(Path *path, PlannerInfo *root, List *pathkeys, int input_disabled_nodes,
Cost input_cost, double tuples, int width, Cost comparison_cost,
int sort_mem, double limit_tuples)
{
Cost startup_cost, run_cost;
cost_tuplesort(&startup_cost, &run_cost,
tuples, width, comparison_cost, sort_mem, limit_tuples);
startup_cost += input_cost; /* must consume all input first */
path->rows = tuples;
path->disabled_nodes = input_disabled_nodes + (enable_sort ? 0 : 1);
path->startup_cost = startup_cost;
path->total_cost = startup_cost + run_cost;
}

cost_tuplesort branches on whether the data fits in sort_mem (work_mem). The in-memory case is the classic comparison-sort cost, comparison_cost * N * log2(N), all charged to startup; the external case adds page I/O for the merge passes:

// cost_tuplesort — src/backend/optimizer/path/costsize.c (condensed)
comparison_cost += 2.0 * cpu_operator_cost; /* default per-comparison cost */
if (output_bytes > sort_mem_bytes)
{
/* disk-based merge sort */
double npages = ceil(input_bytes / BLCKSZ);
double nruns = input_bytes / sort_mem_bytes;
double mergeorder = tuplesort_merge_order(sort_mem_bytes);
double log_runs, npageaccesses;
*startup_cost = comparison_cost * tuples * LOG2(tuples); /* N log2 N comparisons */
if (nruns > mergeorder)
log_runs = ceil(log(nruns) / log(mergeorder));
else
log_runs = 1.0;
npageaccesses = 2.0 * npages * log_runs;
/* Assume 3/4ths of accesses are sequential, 1/4th are not */
*startup_cost += npageaccesses * (seq_page_cost * 0.75 + random_page_cost * 0.25);
}
else
{
/* plain in-memory quicksort */
*startup_cost = comparison_cost * tuples * LOG2(tuples);
}
/* small per-extracted-tuple charge; NOT cpu_tuple_cost — Sort does no qual/projection */
*run_cost = cpu_operator_cost * tuples;

Three modeling choices to notice. (1) The whole sorting cost is startup — the run cost is just a token cpu_operator_cost per emitted tuple, because a Sort node does no qual checking or projection, so the planner deliberately charges it less than a normal node’s cpu_tuple_cost. (2) The external case charges page I/O for 2 * npages * log_runs page accesses across the merge passes, mixing 75% sequential / 25% random — an explicit caching/locality assumption baked into the constants. (3) A useful LIMIT (passed as limit_tuples) switches the model to a bounded heap sort of N * log2(K) comparisons, which is why ORDER BY ... LIMIT 10 over a huge input is costed far below a full sort. This is the startup_cost machinery earning its keep: the sort’s high startup is exactly what makes the planner prefer an already-sorted index path under a small LIMIT.

Join costing is where the initial_cost_* / final_cost_* convention appears. initial_cost_nestloop produces a lower bound from the input paths’ costs alone — it does not inspect the join quals (the comment says so explicitly) — and stashes intermediate values in a JoinCostWorkspace. The join search compares this lower bound against the best path found so far; only survivors get final_cost_nestloop, which does the expensive qual-eval and final row estimate.

// initial_cost_nestloop — src/backend/optimizer/path/costsize.c (condensed)
disabled_nodes = enable_nestloop ? 0 : 1;
disabled_nodes += inner_path->disabled_nodes + outer_path->disabled_nodes;
cost_rescan(root, inner_path, &inner_rescan_start_cost, &inner_rescan_total_cost);
/* must pay both children's startup before first tuple */
startup_cost += outer_path->startup_cost + inner_path->startup_cost;
run_cost += outer_path->total_cost - outer_path->startup_cost;
if (outer_path_rows > 1)
run_cost += (outer_path_rows - 1) * inner_rescan_start_cost;
inner_run_cost = inner_path->total_cost - inner_path->startup_cost;
/* ... for SEMI/ANTI/unique, defer to final; else scan whole inner per outer row ... */
else
{
run_cost += inner_run_cost;
if (outer_path_rows > 1)
run_cost += (outer_path_rows - 1) * inner_rescan_run_cost;
}
workspace->run_cost = run_cost; /* private data for final_cost_nestloop */

The nested-loop shape is the textbook one: the inner path is (re)scanned once per outer row, so the dominant run-cost term is (outer_rows - 1) * inner_rescan_run_cost on top of the first inner_run_cost. The startup is the sum of both children’s startups — neither child can produce a tuple until it has paid its own startup. The SEMI/ANTI/inner-unique cases (early-out after the first match) are deferred to final because they need the join quals.

final_cost_nestloop then computes the actual tuple count and the CPU charge over it:

// final_cost_nestloop — src/backend/optimizer/path/costsize.c (normal-case tail)
ntuples = outer_path_rows * inner_path_rows; /* tuples *processed* */
cost_qual_eval(&restrict_qual_cost, path->jpath.joinrestrictinfo, root);
startup_cost += restrict_qual_cost.startup;
cpu_per_tuple = cpu_tuple_cost + restrict_qual_cost.per_tuple;
run_cost += cpu_per_tuple * ntuples;
/* tlist eval costs are paid per output row, not per tuple scanned */
startup_cost += path->jpath.path.pathtarget->cost.startup;
run_cost += path->jpath.path.pathtarget->cost.per_tuple * path->jpath.path.rows;

Note ntuples is the number of candidate pairs examined (outer_rows * inner_rows), not the number of output rows — the CPU charge is for every qual evaluation, while target-list cost is over the much smaller path.rows output. This is the System R cost metric applied to a join: page costs were inherited from the children’s run costs in the initial phase; this phase adds the per-comparison CPU term.

The hash join’s two-phase split sits in initial_cost_hashjoin / final_cost_hashjoin. The initial phase charges the hash-build cost as startup (one hash function per column per inner row, plus a cpu_tuple_cost for inserting into the table) and, crucially, models batching: if the inner relation does not fit in hash_mem, the executor spills to disk, so the cost adds seq_page_cost per spilled page:

// initial_cost_hashjoin — src/backend/optimizer/path/costsize.c (condensed)
startup_cost += outer_path->startup_cost;
run_cost += outer_path->total_cost - outer_path->startup_cost;
startup_cost += inner_path->total_cost; /* must build hash from whole inner */
/* hash build: one cpu_operator_cost per column hash + cpu_tuple_cost insert */
startup_cost += (cpu_operator_cost * num_hashclauses + cpu_tuple_cost) * inner_path_rows;
run_cost += cpu_operator_cost * num_hashclauses * outer_path_rows; /* probe hashing */
ExecChooseHashTableSize(inner_path_rows_total, inner_path->pathtarget->width,
true, parallel_hash, outer_path->parallel_workers,
&space_allowed, &numbuckets, &numbatches, &num_skew_mcvs);
if (numbatches > 1) /* inner doesn't fit -> spill both sides to disk */
{
double outerpages = page_size(outer_path_rows, outer_path->pathtarget->width);
double innerpages = page_size(inner_path_rows, inner_path->pathtarget->width);
startup_cost += seq_page_cost * innerpages; /* write inner */
run_cost += seq_page_cost * (innerpages + 2 * outerpages); /* re-read */
}

final_cost_hashjoin then charges the probe cost using the estimated inner bucket size (how many inner rows land in the bucket each outer row probes) and halves the qual-eval charge to account for hash-code mismatches short-circuiting the comparison:

// final_cost_hashjoin — src/backend/optimizer/path/costsize.c (normal-case core)
/* comparisons = outer_rows x (inner_rows x bucketsize); halve for hash-mismatch skip */
startup_cost += hash_qual_cost.startup;
run_cost += hash_qual_cost.per_tuple * outer_path_rows *
clamp_row_est(inner_path_rows * innerbucketsize) * 0.5;
hashjointuples = approx_tuple_count(root, &path->jpath, hashclauses);
/* per surviving join tuple: cpu_tuple_cost + other (qp) quals */
startup_cost += qp_qual_cost.startup;
cpu_per_tuple = cpu_tuple_cost + qp_qual_cost.per_tuple;
run_cost += cpu_per_tuple * hashjointuples;

The contrast with nested loop is the whole point of having a cost model: the hash join front-loads cost into startup (building the table) and then probes cheaply, so it wins on large equijoins where nested loop’s outer * inner comparison count explodes — but loses under a tight LIMIT where the build cost is never amortized. The bucketsize and mcvfreq estimates that drive the probe cost come from the selectivity machinery (estimate_hash_bucket_stats, estimate_multivariate_bucketsize), the same seam discussed next; extended-statistics handling of bucket size is owned by postgres-extended-statistics.md.

The cost functions take row counts as input — they do not estimate them. The estimation happens earlier, in set_baserel_size_estimates, which is the seam to the selectivity subsystem (clauselist_selectivity, ultimately the selfuncs.c per-operator estimators):

// set_baserel_size_estimates — src/backend/optimizer/path/costsize.c
nrows = rel->tuples *
clauselist_selectivity(root, rel->baserestrictinfo, 0, JOIN_INNER, NULL);
rel->rows = clamp_row_est(nrows);
cost_qual_eval(&rel->baserestrictcost, rel->baserestrictinfo, root);
set_rel_width(root, rel);

Two outputs feed the cost model: rel->rows (the post-filter cardinality the cost functions read) and rel->baserestrictcost (the qual CPU cost get_restriction_qual_cost later hands to cost_seqscan). Every row estimate passes through clamp_row_est, which forces it to at least 1.0, rounds to an integer, and caps it to avoid double overflow — a small but load-bearing guard, since a 0-row estimate would cause divide-by-zero in downstream cost interpolation and an infinite estimate would break add_path. The selectivity functions in selfuncs.c (histogram/MCV lookups, the System R fallback constants) are owned by postgres-extended-statistics.md; this document treats rel->rows as a given.

flowchart LR
  STATS["pg_statistic<br/>(histograms, MCVs, ndistinct, correlation)"]
  CLS["clauselist_selectivity<br/>(selfuncs.c)"]
  SBSE["set_baserel_size_estimates<br/>rel.rows = tuples x selectivity<br/>rel.baserestrictcost = cost_qual_eval(...)"]
  COST["cost_seqscan / cost_index / cost_*<br/>read rel.rows + baserestrictcost"]
  ADD["add_path<br/>(keep on startup/total/order partial order)"]
  STATS --> CLS --> SBSE --> COST --> ADD

Figure 3 — The row-estimation seam. Statistics drive selectivity, which sets rel->rows; the cost functions consume that cardinality plus the pre-computed qual cost. The cost model and the selectivity model are coupled but cleanly separated: cost functions never call clauselist_selectivity directly for base rows.

Anchor on symbol names, not line numbers. The PostgreSQL source moves between releases; a function or struct name is the stable handle. Use git grep -n '<symbol>' src/backend/optimizer/ to locate the current position. The line numbers in the position-hint table were observed at commit 273fe94 (REL_18_STABLE) and are quick hints only.

Cost parameters and clamps (costsize.c top)

Section titled “Cost parameters and clamps (costsize.c top)”
  • seq_page_cost / random_page_cost / cpu_tuple_cost / cpu_index_tuple_cost / cpu_operator_cost — the five tunable cost GUCs, initialized from the DEFAULT_* macros in cost.h.
  • parallel_tuple_cost / parallel_setup_cost / effective_cache_size — parallel-communication weights and the buffering-model cache size.
  • disable_cost (1e10) — legacy add-on; mostly superseded by the per-path disabled_nodes counter (enable_* GUCs increment it).
  • clamp_row_est — force a row estimate to [1, 1e100], integral; the universal sanity guard before any cost interpolation.
  • clamp_width_est / clamp_cardinality_to_long — companion clamps for tuple width and cardinality-to-long conversions.
  • cost_seqscan — page term spc_seq_page_cost * pages + CPU term (cpu_tuple_cost + qual) * tuples; near-zero startup.
  • cost_index — delegates index-internal cost to amcostestimate; computes heap I/O via index_pages_fetched, interpolates min_IO_cost / max_IO_cost by csquared (correlation²).
  • index_pages_fetched — Mackert & Lohman (1989) finite-LRU buffer model; distinct heap pages fetched given a pro-rated effective_cache_size.
  • cost_samplescan / cost_bitmap_heap_scan / cost_tidscan — sibling scan costers (bitmap and TID scans cross-ref postgres-scan-nodes.md).
  • extract_nonindex_conditions — which restriction clauses become qpquals (CPU-charged) vs. handled inside the index.
  • cost_sort — wrapper: input cost into startup, then cost_tuplesort.
  • cost_tuplesort — in-memory comparison_cost * N * log2(N); external merge adds 2 * npages * log_runs page accesses at 75% seq / 25% random; bounded N * log2(K) under a LIMIT.
  • cost_incremental_sort — per-group cost_tuplesort over estimated groups for a prefix-sorted input.
  • cost_material / cost_memoize_rescan — materialization and the Memoize rescan cache.
  • initial_cost_nestloop / final_cost_nestloop — lower-bound (no qual inspection) then final; inner rescanned once per outer row; SEMI/ANTI/unique early-out deferred to final.
  • initial_cost_mergejoin / final_cost_mergejoin — merge join, costs the required sorts via cost_sort; cached_scansel caches merge scan selectivity.
  • initial_cost_hashjoin / final_cost_hashjoin — build cost as startup, batching spill via seq_page_cost, probe cost via inner bucketsize.
  • cost_gather / cost_gather_merge — parallel leader/worker gather, adds parallel_setup_cost + parallel_tuple_cost.
  • cost_rescan — cost to re-execute the inner side of a join.
  • approx_tuple_count / has_indexed_join_quals — join-tuple count and index-qual detection helpers.
  • cost_qual_eval / cost_qual_eval_node — sum a QualCost over a clause list / single node; caches on RestrictInfo.eval_cost.
  • cost_qual_eval_walker — the recursive walker; per node-type charging (FuncExpr, OpExpr, ScalarArrayOpExpr, Aggref, …).
  • add_function_cost (in plancat.c) — bridge to pg_proc.procost and prosupport support functions; charges procost * cpu_operator_cost.
  • get_restriction_qual_cost — returns baserel->baserestrictcost (+ pushed-down clause cost for parameterized paths).
  • set_baserel_size_estimatesrel->rows = tuples * selectivity; rel->baserestrictcost = cost_qual_eval(baserestrictinfo); set_rel_width.
  • get_parameterized_baserel_size / calc_joinrel_size_estimate / set_joinrel_size_estimates — parameterized and join cardinalities.
  • clauselist_selectivity (in selfuncs.c) — the selectivity entry point the seam calls; owned by postgres-extended-statistics.md.
  • relation_byte_sizetuples * (MAXALIGN(width) + MAXALIGN(header)).
  • page_sizeceil(relation_byte_size / BLCKSZ).
  • get_parallel_divisor — fraction of work per worker, modeling leader participation (leader contributes 1 - 0.3 * workers).

Position hints (as of 2026-06-05, REL_18 273fe94)

Section titled “Position hints (as of 2026-06-05, REL_18 273fe94)”
SymbolFileLine
clamp_row_estsrc/backend/optimizer/path/costsize.c213
cost_seqscansrc/backend/optimizer/path/costsize.c295
cost_gathersrc/backend/optimizer/path/costsize.c446
cost_indexsrc/backend/optimizer/path/costsize.c560
index_pages_fetchedsrc/backend/optimizer/path/costsize.c908
cost_tuplesortsrc/backend/optimizer/path/costsize.c1898
cost_sortsrc/backend/optimizer/path/costsize.c2144
initial_cost_nestloopsrc/backend/optimizer/path/costsize.c3267
final_cost_nestloopsrc/backend/optimizer/path/costsize.c3349
initial_cost_hashjoinsrc/backend/optimizer/path/costsize.c4160
final_cost_hashjoinsrc/backend/optimizer/path/costsize.c4275
cost_qual_evalsrc/backend/optimizer/path/costsize.c4756
cost_qual_eval_walkersrc/backend/optimizer/path/costsize.c4796
get_restriction_qual_costsrc/backend/optimizer/path/costsize.c5072
set_baserel_size_estimatessrc/backend/optimizer/path/costsize.c5349
relation_byte_sizesrc/backend/optimizer/path/costsize.c6453
get_parallel_divisorsrc/backend/optimizer/path/costsize.c6474
add_function_costsrc/backend/optimizer/util/plancat.c2125
cost_qual_eval (inline gate)src/backend/optimizer/util/clauses.c4831
DEFAULT_SEQ_PAGE_COSTsrc/include/optimizer/cost.h24

Verified against the working tree at /data/hgryoo/references/postgres, REL_18_STABLE, commit 273fe94852b3a7e34fd171e8abdf1481beb302fa.

  • Five cost parameters and defaults. cost.h defines DEFAULT_SEQ_PAGE_COST 1.0, DEFAULT_RANDOM_PAGE_COST 4.0, DEFAULT_CPU_TUPLE_COST 0.01, DEFAULT_CPU_INDEX_TUPLE_COST 0.005, DEFAULT_CPU_OPERATOR_COST 0.0025, DEFAULT_PARALLEL_TUPLE_COST 0.1, DEFAULT_PARALLEL_SETUP_COST 1000.0, DEFAULT_EFFECTIVE_CACHE_SIZE 524288. The globals in costsize.c are initialized to these. Verified by grep -n DEFAULT_ src/include/optimizer/cost.h.
  • cost_seqscan page + CPU decomposition. Confirmed disk_run_cost = spc_seq_page_cost * baserel->pages and cpu_run_cost = (cpu_tuple_cost + qpqual_cost.per_tuple) * baserel->tuples, with total_cost = startup_cost + cpu_run_cost + disk_run_cost. CPU term uses baserel->tuples (pre-filter), tlist cost uses path->rows (post-filter). Verified by reading the full function body.
  • cost_index delegation and interpolation. Confirmed the amcostestimate callback supplies indexSelectivity and indexCorrelation; csquared = indexCorrelation * indexCorrelation; run_cost += max_IO_cost + csquared * (min_IO_cost - max_IO_cost). The loop_count > 1 branch (repeated indexscans) and the index-only-scan allvisfrac reduction are present as described.
  • Mackert-Lohman. index_pages_fetched cites “Mackert and Lohman, Index Scans Using a Finite LRU Buffer, ACM TODS Vol. 14 No. 3, 1989” in its header comment and implements the 2TNs/(2T+Ns) piecewise formula with b a pro-rated effective_cache_size share. Verified.
  • cost_tuplesort model. Confirmed in-memory quicksort comparison_cost * tuples * LOG2(tuples) charged to startup, external merge adds npageaccesses * (seq_page_cost * 0.75 + random_page_cost * 0.25), run cost is cpu_operator_cost * tuples (not cpu_tuple_cost — comment explains the Sort node does no qual/projection). comparison_cost += 2.0 * cpu_operator_cost default. Verified.
  • Two-phase joins. initial_cost_nestloop and initial_cost_hashjoin fill a JoinCostWorkspace and explicitly defer qual inspection (“CPU costs left for later”); final_cost_nestloop / final_cost_hashjoin call cost_qual_eval on joinrestrictinfo / hashclauses. The hashjoin batching branch (numbatches > 1) charges seq_page_cost per spilled page. Verified.
  • cost_qual_evaladd_function_cost. Confirmed the walker caches on RestrictInfo.eval_cost (sentinel startup < 0) and that add_function_cost (in plancat.c, not clauses.c) charges procform->procost * cpu_operator_cost when no prosupport function intervenes. The clauses.c use at line 4831 is the inline_function “more than 10 operators” expensiveness gate, not a definition — verified by grep -n cost_qual_eval src/backend/optimizer/util/clauses.c.
  • Row-estimation seam. set_baserel_size_estimates computes rel->rows = clamp_row_est(rel->tuples * clauselist_selectivity(...)) and sets rel->baserestrictcost via cost_qual_eval. clauselist_selectivity is declared in selfuncs and is the boundary to the statistics subsystem. Verified.
  • REL_18 caveats honored. disabled_nodes (the per-path counter that enable_* GUCs increment, present since PG 17) is used throughout rather than only the legacy disable_cost add-on; estimate_multivariate_bucketsize (extended-stats bucket sizing, PG 17+) is called from final_cost_hashjoin. No pre-17 constructs were asserted.

Beyond PostgreSQL — Comparative Designs & Research Frontiers

Section titled “Beyond PostgreSQL — Comparative Designs & Research Frontiers”

PostgreSQL’s cost model is a faithful, conservative descendant of the 1979 System R design. Placing it against other engines and the research literature clarifies which of its choices are universal and which are contingent.

The disk-era constant problem. PostgreSQL’s defaults (random_page_cost = 4.0) encode spinning-disk physics: a random seek costs roughly four sequential reads. On NVMe SSDs the ratio is closer to 1.1–1.5, and on a fully-cached working set effectively 1.0. Because the model is unit-free and the parameters are exposed, the standard remedy is to re-tune — lowering random_page_cost makes index scans more attractive, which is the single most common production cost-model adjustment. The deeper critique is that a single random_page_cost cannot express a storage hierarchy where the first few pages are cached (cheap) and the tail is on disk (expensive); effective_cache_size and the Mackert-Lohman model are PostgreSQL’s partial answer, but they assume a uniform cache share per table rather than a learned per-relation hit rate.

Calibrated vs. abstract cost units. Some engines calibrate their cost units to real hardware. DB2 and the original System R successors ran hardware-characterization to set the I/O/CPU weight; Microsoft SQL Server’s cost units are nominally seconds on a reference machine. PostgreSQL deliberately stays abstract — its costs are dimensionless and only need to be monotonic with time. The tradeoff: abstract units are simpler and robust to model error in absolute terms, but they cannot answer “will this query take 2 seconds?” and they make cross-plan comparison the only meaningful use of the number.

Cardinality estimation is the dominant error source. A celebrated line of work — Leis et al., How Good Are Query Optimizers, Really? (VLDB 2015) — showed empirically that for join-heavy queries, errors in the cost model itself are dwarfed by errors in cardinality estimation: once row counts are wrong (and they compound multiplicatively through a join tree), the cost function’s precision is irrelevant. The paper’s blunt conclusion is that a good optimizer needs good cardinality estimates far more than a sophisticated cost function — PostgreSQL’s relatively simple cost arithmetic is, in this light, the right level of investment, with the leverage living in selfuncs.c and extended statistics (postgres-extended-statistics.md).

Learned and adaptive cost models. The research frontier moves the cost function from hand-tuned constants to learned models. Neo (Marcus et al., VLDB 2019) and Bao learn a cost/latency predictor from execution feedback; learned cardinality estimators (MSCN, DeepDB, NeuroCard) replace the histogram pipeline. Adaptive query processing (e.g. the classic eddies, and runtime re-optimization in systems like SCOPE and Spark’s AQE) sidesteps estimation error by measuring intermediate cardinalities mid-execution and re-planning. PostgreSQL has none of this in core — its cost model is fixed at plan time — though the prosupport mechanism (add_function_cost consulting a support function) is a small extensibility hook, and extensions like pg_hint_plan and adaptive-statistics patches explore the space. The conservative argument for the static model is plan stability and explainability: a fixed cost function produces reproducible plans an administrator can reason about with EXPLAIN.

Vectorized and columnar engines re-weight the CPU term. PostgreSQL’s cpu_tuple_cost/cpu_operator_cost assume a row-at-a-time Volcano executor (postgres-executor.md). Columnar/vectorized engines (DuckDB, ClickHouse, Vectorwise) process batches and SIMD-vectorize operators, so the per-tuple CPU constant is far lower and the relative cost of a scan vs. a hash join shifts. A cost model tuned for a tuple-at-a-time interpreter systematically over-costs CPU-bound operators on a vectorized engine; this is one reason such engines ship their own cost models rather than porting PostgreSQL’s. The model is thus coupled to the executor architecture it scores, not just the storage hardware.

What PostgreSQL gets right by staying simple. The two-phase join split (initial_cost_* / final_cost_*) is a genuinely good engineering pattern that keeps optimizer runtime bounded as the join search explodes combinatorially — many learned/adaptive systems still need an equivalent cheap-lower-bound mechanism to make their search tractable. And the clean (startup, total) pair, treated as a partial order by add_path, is the right primitive for LIMIT-aware planning that flat-cost models get wrong.

  • Code (REL_18_STABLE, commit 273fe94, as of 2026-06-05):
    • src/backend/optimizer/path/costsize.c — the cost model: cost_seqscan, cost_index, index_pages_fetched, cost_sort / cost_tuplesort, initial_cost_nestloop / final_cost_nestloop, initial_cost_hashjoin / final_cost_hashjoin, cost_gather, cost_qual_eval, get_restriction_qual_cost, set_baserel_size_estimates, clamp_row_est, relation_byte_size, get_parallel_divisor, and the five cost-parameter globals.
    • src/backend/optimizer/util/clauses.cinline_function’s cost_qual_eval “10 operators” expensiveness gate (the clauses.c seam).
    • src/backend/optimizer/util/plancat.cadd_function_cost, the bridge to pg_proc.procost / prosupport.
    • src/include/optimizer/cost.h — the DEFAULT_* cost-parameter macros.
    • src/include/nodes/pathnodes.hPath (startup_cost, total_cost, rows, disabled_nodes), QualCost, JoinCostWorkspace.
  • Theory: Selinger, Astrahan, Chamberlin, Lorie, Price, Access Path Selection in a Relational Database Management System, SIGMOD 1979 — captured at knowledge/research/dbms-papers/systemr-optimizer.md (cost metric, selectivity factors, cost-as-pruning-key). Mackert & Lohman, Index Scans Using a Finite LRU Buffer: A Validated I/O Model, ACM TODS 14(3), 1989 (the index_pages_fetched formula).
  • Comparative / frontier: Leis et al., How Good Are Query Optimizers, Really?, VLDB 2015 (cardinality vs. cost error); Marcus et al., Neo: A Learned Query Optimizer, VLDB 2019 (learned cost). Hellerstein, Stonebraker & Hamilton, Architecture of a Database System, 2007 — knowledge/research/dbms-papers/fntdb07-architecture.md.
  • Related KB docs: postgres-path-generation.md (how Path objects are built and kept by add_path), postgres-join-ordering.md (the enumeration search that consumes costs as its pruning key), postgres-extended-statistics.md (the selectivity/cardinality subsystem behind clauselist_selectivity), postgres-planner-overview.md (where costing sits in the planner), postgres-executor.md (the Volcano executor the CPU weights model), postgres-scan-nodes.md, postgres-index-am.md, postgres-nbtree.md (the amcostestimate callees).