PostgreSQL Table Sampling — The TSM API, SYSTEM and BERNOULLI
Contents:
- Theoretical Background
- Common DBMS Design
- PostgreSQL’s Approach
- Source Walkthrough
- Source verification (as of 2026-06-05)
- Beyond PostgreSQL — Comparative Designs & Research Frontiers
- Sources
Theoretical Background
Section titled “Theoretical Background”Sampling is the act of selecting a subset of a population so that a
computation over the subset approximates a computation over the whole at a
fraction of the cost. In a database, the “population” is the set of rows in a
table and the “computation” is some query — count(*), an average, a
GROUP BY aggregate, an exploratory SELECT *. If a billion-row table can
answer “roughly what fraction of orders shipped late?” by examining one
percent of its rows, the query runs a hundred times faster, and for many
analytic and data-exploration purposes the loss of precision is acceptable.
Database System Concepts (Silberschatz 7e, ch. 15 “Query Processing” and
ch. 25 on advanced query processing / approximate query answering) frames
sampling both as a query-evaluation technique and as the engine behind
statistics gathering: the optimizer’s own row-count and histogram estimates
come from a sample of the table collected by ANALYZE. The theory that
makes sampling useful is classical survey statistics: the precision of an
estimate derived from a sample of size n improves as 1/sqrt(n),
independent of the population size N, provided the sample is drawn without
systematic bias. Bias — not sample size — is the enemy.
That proviso forces a taxonomy. The cleanest statistical object is a simple
random sample: every row is included or excluded by an independent coin
flip with the same probability p, so any two rows are equally likely to
appear together and there is no correlation between membership and any
attribute. This is what statisticians call Bernoulli sampling — N
independent Bernoulli(p) trials, one per row, yielding a sample whose size
is itself random with expectation pN. A Bernoulli sample is unbiased for
any quantity, including correlations between columns, because the selection
process ignores the data entirely. Its cost, though, is that the engine must
consider every row — you cannot decide a coin flip for a row without
locating that row — so a Bernoulli scan reads the whole table even when it
returns one percent of it.
The cheaper alternative trades statistical purity for I/O. A database stores
rows in pages (blocks), and the dominant cost of a scan is reading pages,
not examining the rows already in memory on a page. Block sampling (also
called cluster sampling in survey theory) selects whole pages with
probability p and returns all rows on each selected page. Now the engine
reads only pN_pages pages, a genuine 1/p I/O saving. The catch is the
sample is no longer simple-random: rows on the same page are perfectly
correlated in their membership (either all in or all out), so the sample is
clustered. If the physical layout correlates with the data — orders sorted
by date, so each page holds a narrow date range — a block sample badly
misrepresents the date distribution. The statistical term for this loss is
the design effect: clustering inflates the variance of an estimate
relative to a simple random sample of the same size, by a factor that grows
with the intra-page homogeneity of the sampled attribute.
A third axis, orthogonal to block-vs-row, is repeatability. A sample is
repeatable if re-running the same sampling query returns the same rows.
This sounds trivial but is subtle in a live database: between two runs the
table may have been extended, vacuumed, or scanned by a synchronized seq-scan
that started at a different block. A naive implementation that flips a fresh
coin per row each time, or that piggybacks on whatever physical scan order
the storage engine happens to use, is not repeatable. To make sampling
repeatable you need the membership decision for a given row to be a pure
function of the row’s stable identity and a fixed seed — history
independent, as the PostgreSQL source comments put it. Hashing (identity, seed) and thresholding the hash achieves exactly this: the same identity and
seed always hash to the same value, regardless of when or in what order the
row is encountered, and regardless of what other rows exist.
Common DBMS Design
Section titled “Common DBMS Design”The SQL:2003 standard introduced the TABLESAMPLE clause to expose sampling
declaratively, and it standardized exactly the two methods that PostgreSQL
ships:
SELECT * FROM orders TABLESAMPLE SYSTEM (1.5); -- 1.5% of pagesSELECT * FROM orders TABLESAMPLE BERNOULLI (1.5); -- 1.5% of rowsSELECT * FROM orders TABLESAMPLE SYSTEM (1.5) REPEATABLE (42);SYSTEM is the standard’s name for block sampling and BERNOULLI for
row sampling; the percentage is a probability in [0,100], and the optional
REPEATABLE(seed) clause pins the seed. This is the surface that most major
engines converged on. SQL Server’s TABLESAMPLE (n PERCENT) is block-level
by default (it documents that it reads pages, not rows, so small samples can
return zero rows) with a REPEATABLE(seed) option; DB2 offers both
TABLESAMPLE SYSTEM and TABLESAMPLE BERNOULLI with a REPEATABLE seed,
matching the standard almost verbatim; Oracle’s older SAMPLE(n) /
SAMPLE BLOCK(n) syntax predates SQL:2003 but expresses the same row-vs-block
choice and a SEED clause for repeatability. The remarkable thing is the
near-universal agreement: two methods, a percentage, an optional seed.
Underneath the syntax, engines make three recurring design decisions:
-
Where does the sampling decision live — in the access path, or as a filter on top of a full scan? A block-sampling method is an access path: it must influence which pages the storage engine fetches, so it has to reach down into the scan. A row-sampling method could in principle be a post-filter on an ordinary scan, but engines usually fold it into the scan too, so the same code path serves both and the optimizer can cost them uniformly. PostgreSQL’s choice — a dedicated
SampleScanplan node that delegates block and tuple selection to a method — is the folded design. -
How is the method made pluggable? Some engines hardwire SYSTEM and BERNOULLI. PostgreSQL instead defines a tiny callback interface (the TSM API) so the two built-in methods are just two implementations and an extension can ship a third — for example, the
tsm_system_rowsandtsm_system_timecontrib modules sample a fixed number of rows or sample for a fixed time budget rather than a percentage. The interface is deliberately small: estimate size, begin, pick a block, pick a tuple. -
How is repeatability and history-independence achieved? The robust answer, used by PostgreSQL, is the hash-and-threshold trick: compute a cutoff from the requested probability once, then for each candidate identifier hash
(identifier, seed)and accept it if the hash falls below the cutoff. Because the hash is a deterministic function of identity and seed, the sample is repeatable (fix the seed) and history-independent (membership does not depend on scan order, table growth, or concurrent maintenance). This is strictly better than the alternative of consuming a PRNG stream in scan order, which is neither.
A subtlety that every implementation must handle is the interaction between sampling and visibility / concurrency. A sampling method picks a tuple slot (a block number and offset) without knowing whether the tuple there is visible to the current snapshot, or even whether a live tuple exists at that offset. The method’s job is purely to decide membership; the storage engine then applies the normal MVCC visibility test and silently drops slots that are dead or invisible. Crucially, for an unbiased row sample this ordering is correct only because the coin is flipped for every offset independently — extra coin flips spent on invisible tuples do not bias the survivors, since all tuples share the same acceptance probability.
PostgreSQL’s Approach
Section titled “PostgreSQL’s Approach”PostgreSQL realizes TABLESAMPLE as a four-layer stack. At the top, the
parser (transformRangeTableSample in parse_clause.c) resolves the
method name to a handler function — SYSTEM and BERNOULLI are not
keywords baked into the grammar but ordinary functions
tsm_system_handler / tsm_bernoulli_handler looked up in pg_proc,
returning the pseudo-type tsm_handler. The handler, invoked once via
GetTsmRoutine (tablesample.c), returns a TsmRoutine node: a vtable of
function pointers plus two flags advertising whether the method produces
repeatable samples across queries and across scans. This is the same
“handler returns a routine struct” pattern PostgreSQL uses for index AMs
(IndexAmRoutine, see postgres-index-am.md), table AMs
(TableAmRoutine, see postgres-table-am.md) and FDWs.
In the planner, set_tablesample_rel_size (allpaths.c) calls the
method’s SampleScanGetSampleSize to learn how many pages it will read and
how many tuples it will return, overwriting the relation’s page/tuple
estimates accordingly; cost_samplescan (costsize.c) then prices the scan,
charging random-page cost if the method supplies a NextSampleBlock
(non-sequential block access) and sequential-page cost otherwise. The
planner also consults repeatable_across_scans: if the query might scan the
sampled rel more than once (e.g. the inner side of a nestloop) and the method
cannot reproduce its sample, the SampleScan is wrapped in a Materialize
node so the sample is computed once and replayed.
At execution time, nodeSamplescan.c defines the SampleScan plan
node. ExecInitSampleScan opens the relation, initializes the TABLESAMPLE
argument expressions and the optional REPEATABLE expression, draws a random
seed if no REPEATABLE clause was given, runs the handler to get the routine,
and calls the method’s InitSampleScan. On the first tuple request,
tablesample_init evaluates the percentage argument and the seed, then calls
BeginSampleScan; the driver loop tablesample_getnext repeatedly asks the
table AM for the next sampled block and the next sampled tuple. Note that
syncscan is disabled exactly when the method supplies a NextSampleBlock
— a block-choosing method dictates block order, so it must not be perturbed
by a synchronized scan starting elsewhere.
At the storage layer, the heap table AM
(heapam_scan_sample_next_block / heapam_scan_sample_next_tuple in
heapam_handler.c) bridges the abstract method to physical pages: it calls
NextSampleBlock to choose a page (or walks pages sequentially when the
method has no NextSampleBlock), reads that page, then repeatedly calls
NextSampleTuple to choose offsets, applying the MVCC visibility test to
each before storing it in the result slot. Because the bridge lives in the
table AM, the TSM API is storage-agnostic.
flowchart TD SQL["SELECT ... TABLESAMPLE SYSTEM(1.5) REPEATABLE(42)"] PARSE["transformRangeTableSample<br/>parse_clause.c<br/>resolve method name to handler Oid"] HANDLER["tsm_system_handler / tsm_bernoulli_handler<br/>returns TsmRoutine vtable"] PLAN["set_tablesample_rel_size + cost_samplescan<br/>SampleScanGetSampleSize -> pages, tuples"] INIT["ExecInitSampleScan<br/>seed = REPEATABLE? hashfloat8(seed) : random<br/>InitSampleScan (alloc tsm_state)"] BEGIN["tablesample_init -> BeginSampleScan<br/>validate percent, compute cutoff"] DRIVE["tablesample_getnext loop"] NB["NextSampleBlock<br/>(SYSTEM only; BERNOULLI = NULL -> seq pages)"] NT["NextSampleTuple<br/>pick offsets on page"] VIS["MVCC visibility test<br/>SampleHeapTupleVisible"] OUT["result slot -> ExecScan"] SQL --> PARSE --> HANDLER --> PLAN --> INIT --> BEGIN --> DRIVE DRIVE --> NB --> NT --> VIS --> OUT VIS -. invisible/dead .-> NT NT -. page exhausted .-> NB
Source Walkthrough
Section titled “Source Walkthrough”This walkthrough follows the call flow top-down: the API contract
(TsmRoutine), the parser/planner glue that resolves and prices a method,
the executor driver loop, the two built-in methods (SYSTEM and BERNOULLI),
and finally the heap-AM bridge that applies visibility. All excerpts are
condensed from REL_18 (273fe94); each begins with a // symbol — path
comment for grepping.
1. The API contract — TsmRoutine and GetTsmRoutine
Section titled “1. The API contract — TsmRoutine and GetTsmRoutine”A tablesample method is described entirely by one struct of callbacks plus two capability flags. The header documents which callbacks may be NULL:
// TsmRoutine — src/include/access/tsmapi.htypedef struct TsmRoutine{ NodeTag type; List *parameterTypes; /* datatype OIDs of TABLESAMPLE args */ bool repeatable_across_queries; bool repeatable_across_scans; SampleScanGetSampleSize_function SampleScanGetSampleSize; /* planner */ InitSampleScan_function InitSampleScan; /* can be NULL */ BeginSampleScan_function BeginSampleScan; NextSampleBlock_function NextSampleBlock; /* can be NULL */ NextSampleTuple_function NextSampleTuple; EndSampleScan_function EndSampleScan; /* can be NULL */} TsmRoutine;The two flags are policy hints the planner reads: repeatable_across_queries
gates whether a REPEATABLE clause is even allowed (the parser rejects
REPEATABLE on a method that clears it), and repeatable_across_scans decides
whether a multiply-scanned SampleScan must be wrapped in a Materialize. The
NULLability of NextSampleBlock is load-bearing: a method that does not pick
blocks (BERNOULLI) leaves it NULL, and that single fact propagates outward —
the executor then enables syncscan, the heap AM walks pages sequentially, and
the cost model charges sequential-page cost.
GetTsmRoutine is the only public entry point of tablesample.c. It invokes
the handler function and sanity-checks the result:
// GetTsmRoutine — src/backend/access/tablesample/tablesample.cTsmRoutine *GetTsmRoutine(Oid tsmhandler){ Datum datum; TsmRoutine *routine;
datum = OidFunctionCall1(tsmhandler, PointerGetDatum(NULL)); routine = (TsmRoutine *) DatumGetPointer(datum);
if (routine == NULL || !IsA(routine, TsmRoutine)) elog(ERROR, "tablesample handler function %u did not return a TsmRoutine struct", tsmhandler); return routine;}The handler is called with a NULL argument — methods ignore the argument and
exist only to return the vtable. The handlers themselves are trivial: they
makeNode(TsmRoutine) (so unset fields are NULL) and assign the callbacks.
SYSTEM advertises both repeatability flags and supplies all five executor
callbacks except EndSampleScan:
// tsm_system_handler — src/backend/access/tablesample/system.cDatumtsm_system_handler(PG_FUNCTION_ARGS){ TsmRoutine *tsm = makeNode(TsmRoutine);
tsm->parameterTypes = list_make1_oid(FLOAT4OID); tsm->repeatable_across_queries = true; tsm->repeatable_across_scans = true; tsm->SampleScanGetSampleSize = system_samplescangetsamplesize; tsm->InitSampleScan = system_initsamplescan; tsm->BeginSampleScan = system_beginsamplescan; tsm->NextSampleBlock = system_nextsampleblock; /* SYSTEM picks blocks */ tsm->NextSampleTuple = system_nextsampletuple; tsm->EndSampleScan = NULL; PG_RETURN_POINTER(tsm);}BERNOULLI is identical except NextSampleBlock = NULL — it never chooses
blocks, only tuples within whatever block the AM hands it.
2. Parser and planner glue
Section titled “2. Parser and planner glue”transformRangeTableSample turns TABLESAMPLE method(args) REPEATABLE(seed)
into a TableSampleClause. The method name is resolved as an ordinary
function returning the tsm_handler pseudo-type; the handler is run once to
learn the argument types and to validate the REPEATABLE flag:
// transformRangeTableSample — src/backend/parser/parse_clause.chandlerOid = LookupFuncName(rts->method, 1, funcargtypes, true);/* ... error if !OidIsValid or wrong return type ... */tsm = GetTsmRoutine(handlerOid);
tablesample = makeNode(TableSampleClause);tablesample->tsmhandler = handlerOid;
if (list_length(rts->args) != list_length(tsm->parameterTypes)) ereport(ERROR, (errcode(ERRCODE_INVALID_TABLESAMPLE_ARGUMENT), ...));/* ... coerce each arg to the method's parameterTypes ... */
if (rts->repeatable != NULL){ if (!tsm->repeatable_across_queries) ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("tablesample method %s does not support REPEATABLE", ...))); /* REPEATABLE expr coerced to float8 */}Coercing REPEATABLE to float8 (not integer) is deliberate: the source
comments in nodeSamplescan.c note that float8 yields unsurprising behavior
both for users who expect integer seeds and for those who expect a
setseed()-like value in [-1, 1].
In the planner, set_tablesample_rel_size asks the method how big the sample
will be and overwrites the relation estimates with the result, so the rest
of the planner sees the sampled rel as if it were a smaller table:
// set_tablesample_rel_size — src/backend/optimizer/path/allpaths.ctsm = GetTsmRoutine(tsc->tsmhandler);tsm->SampleScanGetSampleSize(root, rel, tsc->args, &pages, &tuples);rel->pages = pages;rel->tuples = tuples;set_baserel_size_estimates(root, rel);set_tablesample_rel_pathlist builds the lone SampleScan path and applies
the repeatability guard. The comment explains the design trade-off — rather
than teach the join planner to avoid re-scanning a non-repeatable sample, it
simply materializes:
// set_tablesample_rel_pathlist — src/backend/optimizer/path/allpaths.cpath = create_samplescan_path(root, rel, required_outer);
if ((root->query_level > 1 || bms_membership(root->all_query_rels) != BMS_SINGLETON) && !(GetTsmRoutine(rte->tablesample->tsmhandler)->repeatable_across_scans)){ path = (Path *) create_material_path(rel, path);}add_path(rel, path);cost_samplescan prices the path. The key line is the page-cost choice keyed
on NextSampleBlock: a block-picking method jumps around the heap (random
access), a sequential method reads pages in order:
// cost_samplescan — src/backend/optimizer/path/costsize.c/* if NextSampleBlock is used, assume random access, else sequential */spc_page_cost = (tsm->NextSampleBlock != NULL) ? spc_random_page_cost : spc_seq_page_cost;run_cost += spc_page_cost * baserel->pages; /* pages already set *//* ... */cpu_per_tuple = cpu_tuple_cost + qpqual_cost.per_tuple;run_cost += cpu_per_tuple * baserel->tuples; /* tuples already set */Because baserel->pages and baserel->tuples were already overwritten by
SampleScanGetSampleSize, the cost model needs no special sampling math — it
multiplies the method’s own estimates by the standard per-page and per-tuple
costs.
3. The executor driver — nodeSamplescan.c
Section titled “3. The executor driver — nodeSamplescan.c”ExecInitSampleScan sets up the node and, critically, fixes the seed once
so it survives rescans. If there is no REPEATABLE clause, a per-scan random
seed is drawn from the global PRNG:
// ExecInitSampleScan — src/backend/executor/nodeSamplescan.cscanstate->args = ExecInitExprList(tsc->args, (PlanState *) scanstate);scanstate->repeatable = ExecInitExpr(tsc->repeatable, (PlanState *) scanstate);
/* If we don't have a REPEATABLE clause, select a random seed (just once). */if (tsc->repeatable == NULL) scanstate->seed = pg_prng_uint32(&pg_global_prng_state);
tsm = GetTsmRoutine(tsc->tsmhandler);scanstate->tsmroutine = tsm;scanstate->tsm_state = NULL;if (tsm->InitSampleScan) tsm->InitSampleScan(scanstate, eflags);scanstate->begun = false; /* defer BeginSampleScan to first tuple */The actual parameter values cannot be evaluated at init time (they may
reference outer LATERAL columns), so BeginSampleScan is deferred to the
first SampleNext, handled by tablesample_init. This is where the seed
becomes deterministic for REPEATABLE: the float8 value is hashed by
hashfloat8, which the comment notes makes REPEATABLE(0) machine
independent for regression testing:
// tablesample_init — src/backend/executor/nodeSamplescan.cif (scanstate->repeatable){ datum = ExecEvalExprSwitchContext(scanstate->repeatable, econtext, &isnull); if (isnull) ereport(ERROR, (errcode(ERRCODE_INVALID_TABLESAMPLE_REPEAT), ...)); /* hashfloat8 turns the float8 seed into a uint32; REPEATABLE(0) is portable */ seed = DatumGetUInt32(DirectFunctionCall1(hashfloat8, datum));}else seed = scanstate->seed; /* the per-scan random seed from init */
scanstate->use_bulkread = true; /* defaults BeginSampleScan may override */scanstate->use_pagemode = true;tsm->BeginSampleScan(scanstate, params, list_length(scanstate->args), seed);
/* We'll use syncscan if there's no NextSampleBlock function */allow_sync = (tsm->NextSampleBlock == NULL);That last line is the whole reason history-independence matters in practice:
a method that picks its own blocks must not allow a synchronized seqscan to
choose the starting block, because that would make the sample depend on
concurrent scan activity. A method without NextSampleBlock walks pages
sequentially and so can safely ride syncscan.
The driver loop is a straightforward two-level iteration over blocks and tuples, delegating both to the table AM:
// tablesample_getnext — src/backend/executor/nodeSamplescan.cfor (;;){ if (!scanstate->haveblock) { if (!table_scan_sample_next_block(scan, scanstate)) { scanstate->done = true; return NULL; /* exhausted relation */ } scanstate->haveblock = true; } if (!table_scan_sample_next_tuple(scan, scanstate, slot)) { scanstate->haveblock = false; /* page exhausted, advance block */ continue; } break; /* found a visible sampled tuple */}scanstate->donetuples++;return slot;ExecReScanSampleScan resets begun/done/haveblock but not the seed,
so a rescan of a repeatable method reproduces the identical sample — that is
exactly what repeatable_across_scans promises and what lets the planner skip
the Materialize wrapper.
4. SYSTEM — block-level sampling
Section titled “4. SYSTEM — block-level sampling”SYSTEM’s private state holds the integer cutoff, the seed, the next block to consider, and the last offset returned:
// SystemSamplerData — src/backend/access/tablesample/system.ctypedef struct{ uint64 cutoff; /* select blocks with hash less than this */ uint32 seed; BlockNumber nextblock; /* next block to consider sampling */ OffsetNumber lt; /* last tuple returned from current block */} SystemSamplerData;BeginSampleScan validates the percentage and converts it into a 32-bit
acceptance cutoff. The cutoff is round((2^32) * percent/100) stored in a
uint64; comparing a uint32 hash against it gives acceptance probability
percent/100, and the formula is exact at the 0% and 100% limits:
// system_beginsamplescan — src/backend/access/tablesample/system.cdouble percent = DatumGetFloat4(params[0]);if (percent < 0 || percent > 100 || isnan(percent)) ereport(ERROR, (errcode(ERRCODE_INVALID_TABLESAMPLE_ARGUMENT), errmsg("sample percentage must be between 0 and 100")));
dcutoff = rint(((double) PG_UINT32_MAX + 1) * percent / 100);sampler->cutoff = (uint64) dcutoff;sampler->seed = seed;sampler->nextblock = 0;sampler->lt = InvalidOffsetNumber;
/* Bulkread unless scanning a tiny fraction; pagemode always (scan whole page) */node->use_bulkread = (percent >= 1);node->use_pagemode = true;NextSampleBlock is the heart of SYSTEM. It scans block numbers in order,
hashing each (blockno, seed) pair and accepting the first block whose hash
falls below the cutoff. Hashing block-then-seed (rather than consuming a PRNG)
is what gives a machine-independent, history-independent decision per block:
// system_nextsampleblock — src/backend/access/tablesample/system.cuint32 hashinput[2];hashinput[1] = sampler->seed; /* same throughout the scan */
for (; nextblock < nblocks; nextblock++){ uint32 hash; hashinput[0] = nextblock; hash = DatumGetUInt32(hash_any((const unsigned char *) hashinput, (int) sizeof(hashinput))); if (hash < sampler->cutoff) break; /* accept this block */}if (nextblock < nblocks){ sampler->nextblock = nextblock + 1; /* resume past it next call */ return nextblock;}sampler->nextblock = 0;return InvalidBlockNumber; /* done */Once a block is accepted, SYSTEM returns every tuple on it — that is the
defining behavior of block (cluster) sampling. NextSampleTuple therefore
just walks offsets 1..maxoffset with no per-tuple coin flip:
// system_nextsampletuple — src/backend/access/tablesample/system.cOffsetNumber tupoffset = sampler->lt;if (tupoffset == InvalidOffsetNumber) tupoffset = FirstOffsetNumber;else tupoffset++;if (tupoffset > maxoffset) tupoffset = InvalidOffsetNumber; /* page done -> advance block */sampler->lt = tupoffset;return tupoffset;SampleScanGetSampleSize for SYSTEM estimates both pages and tuples as
fraction * baserel, because it reads only a fraction of pages:
// system_samplescangetsamplesize — src/backend/access/tablesample/system.c*pages = clamp_row_est(baserel->pages * samplefract); /* fraction of pages */*tuples = clamp_row_est(baserel->tuples * samplefract);5. BERNOULLI — row-level sampling
Section titled “5. BERNOULLI — row-level sampling”BERNOULLI’s state drops nextblock — it never chooses blocks — keeping only
the cutoff, seed, and last offset:
// BernoulliSamplerData — src/backend/access/tablesample/bernoulli.ctypedef struct{ uint64 cutoff; /* select tuples with hash less than this */ uint32 seed; OffsetNumber lt; /* last tuple returned from current block */} BernoulliSamplerData;BeginSampleScan computes the identical cutoff formula, but sets different
buffer-strategy hints: it always uses bulkread (it scans every page) and only
switches to pagemode at large fractions:
// bernoulli_beginsamplescan — src/backend/access/tablesample/bernoulli.cdcutoff = rint(((double) PG_UINT32_MAX + 1) * percent / 100);sampler->cutoff = (uint64) dcutoff;sampler->seed = seed;sampler->lt = InvalidOffsetNumber;
/* Scan all pages -> always bulkread; pagemode only pays off at >=25% */node->use_bulkread = true;node->use_pagemode = (percent >= 25);There is no NextSampleBlock, so the heap AM walks pages sequentially.
NextSampleTuple is where the per-tuple coin lives: for each offset it hashes
(blockno, offset, seed) and accepts if the hash is below cutoff. The
independent per-offset hash is what makes this a true simple-random sample:
// bernoulli_nextsampletuple — src/backend/access/tablesample/bernoulli.cOffsetNumber tupoffset = sampler->lt;uint32 hashinput[3];
if (tupoffset == InvalidOffsetNumber) tupoffset = FirstOffsetNumber;else tupoffset++;
hashinput[0] = blockno; /* same throughout the block */hashinput[2] = sampler->seed; /* same throughout the scan */
for (; tupoffset <= maxoffset; tupoffset++){ uint32 hash; hashinput[1] = tupoffset; hash = DatumGetUInt32(hash_any((const unsigned char *) hashinput, (int) sizeof(hashinput))); if (hash < sampler->cutoff) break; /* accept this tuple */}if (tupoffset > maxoffset) tupoffset = InvalidOffsetNumber; /* page done */sampler->lt = tupoffset;return tupoffset;The comment above this function is the statistical justification for the
slot-then-visibility ordering: “we do the coinflip for every tuple offset in
the table. Since all tuples have the same probability of being returned, it
doesn’t matter if we do extra coinflips for invisible tuples.” Critically,
BERNOULLI’s SampleScanGetSampleSize estimates all pages — it reads the
whole table — but only fraction * tuples:
// bernoulli_samplescangetsamplesize — src/backend/access/tablesample/bernoulli.c*pages = baserel->pages; /* visits every page */*tuples = clamp_row_est(baserel->tuples * samplefract);This *pages = baserel->pages vs SYSTEM’s *pages = pages * fract is the one
estimate that captures the entire cost difference between the two methods.
6. The heap-AM bridge — applying visibility
Section titled “6. The heap-AM bridge — applying visibility”The driver’s table_scan_sample_next_block / _next_tuple dispatch into the
heap AM. heapam_scan_sample_next_block either calls the method’s
NextSampleBlock or, when it is NULL, walks pages sequentially with syncscan
position reporting, then reads the chosen page (and prunes it in pagemode):
// heapam_scan_sample_next_block — src/backend/access/heap/heapam_handler.cif (tsm->NextSampleBlock) blockno = tsm->NextSampleBlock(scanstate, hscan->rs_nblocks);else{ /* scanning table sequentially (BERNOULLI) */ if (hscan->rs_cblock == InvalidBlockNumber) blockno = hscan->rs_startblock; else { blockno = hscan->rs_cblock + 1; if (blockno >= hscan->rs_nblocks) blockno = 0; /* wrap */ if (scan->rs_flags & SO_ALLOW_SYNC) ss_report_location(scan->rs_rd, blockno); if (blockno == hscan->rs_startblock) blockno = InvalidBlockNumber; /* full circle -> done */ }}/* ... read page; if pagemode, heap_prepare_pagescan(scan) prunes + computes visibility ... */heapam_scan_sample_next_tuple repeatedly asks NextSampleTuple for offsets
and applies the MVCC test before returning a tuple — skipping invalid line
pointers and invisible tuples, exactly the “extra coinflips don’t bias”
contract BERNOULLI relies on:
// heapam_scan_sample_next_tuple — src/backend/access/heap/heapam_handler.cfor (;;){ tupoffset = tsm->NextSampleTuple(scanstate, blockno, maxoffset); if (OffsetNumberIsValid(tupoffset)) { itemid = PageGetItemId(page, tupoffset); if (!ItemIdIsNormal(itemid)) continue; /* skip dead/redirect slots */ /* build HeapTuple at (blockno, tupoffset) ... */ if (all_visible) visible = true; else visible = SampleHeapTupleVisible(scan, hscan->rs_cbuf, tuple, tupoffset); if (!visible) continue; /* invisible -> next offset */ ExecStoreBufferHeapTuple(tuple, slot, hscan->rs_cbuf); return true; } else { /* NextSampleTuple returned Invalid -> page exhausted */ ExecClearTuple(slot); return false; }}The all_visible fast path (PageIsAllVisible(page) && !takenDuringRecovery) skips the per-tuple visibility check when the page is
all-visible per the visibility map (see postgres-visibility-map.md), a real
win for SYSTEM where every offset on an accepted page is returned. Note also
that in non-pagemode the buffer is content-locked around the visibility test
and HeapCheckForSerializableConflictOut is invoked — table sampling
participates in SSI predicate locking like any other scan
(see postgres-ssi-predicate-locking.md).
Position hints (as of 2026-06-05, REL_18 273fe94)
Section titled “Position hints (as of 2026-06-05, REL_18 273fe94)”| Symbol | File | Line |
|---|---|---|
TsmRoutine (struct) | src/include/access/tsmapi.h | 56 |
GetTsmRoutine | src/backend/access/tablesample/tablesample.c | 27 |
tsm_system_handler | src/backend/access/tablesample/system.c | 67 |
SystemSamplerData (struct) | src/backend/access/tablesample/system.c | 37 |
system_samplescangetsamplesize | src/backend/access/tablesample/system.c | 88 |
system_beginsamplescan | src/backend/access/tablesample/system.c | 139 |
system_nextsampleblock | src/backend/access/tablesample/system.c | 178 |
system_nextsampletuple | src/backend/access/tablesample/system.c | 236 |
tsm_bernoulli_handler | src/backend/access/tablesample/bernoulli.c | 65 |
BernoulliSamplerData (struct) | src/backend/access/tablesample/bernoulli.c | 37 |
bernoulli_samplescangetsamplesize | src/backend/access/tablesample/bernoulli.c | 86 |
bernoulli_beginsamplescan | src/backend/access/tablesample/bernoulli.c | 136 |
bernoulli_nextsampletuple | src/backend/access/tablesample/bernoulli.c | 181 |
SampleScanState (struct) | src/include/nodes/execnodes.h | 1631 |
ExecInitSampleScan | src/backend/executor/nodeSamplescan.c | 92 |
tablesample_init | src/backend/executor/nodeSamplescan.c | 217 |
tablesample_getnext | src/backend/executor/nodeSamplescan.c | 319 |
ExecReScanSampleScan | src/backend/executor/nodeSamplescan.c | 201 |
transformRangeTableSample | src/backend/parser/parse_clause.c | 907 |
set_tablesample_rel_size | src/backend/optimizer/path/allpaths.c | 825 |
set_tablesample_rel_pathlist | src/backend/optimizer/path/allpaths.c | 865 |
cost_samplescan | src/backend/optimizer/path/costsize.c | 370 |
heapam_scan_sample_next_block | src/backend/access/heap/heapam_handler.c | 2174 |
heapam_scan_sample_next_tuple | src/backend/access/heap/heapam_handler.c | 2264 |
Source verification (as of 2026-06-05)
Section titled “Source verification (as of 2026-06-05)”All claims below were checked against the REL_18_STABLE working tree at commit 273fe94. The following facts are asserted as verified:
-
Two core methods, two source files.
system.candbernoulli.care the only files undersrc/backend/access/tablesample/besides the sharedtablesample.c. Each defines exactly one handler (tsm_system_handler/tsm_bernoulli_handler), registered inpg_proc.datreturning thetsm_handlerpseudo-type (prosrc => 'tsm_system_handler'/'tsm_bernoulli_handler'). -
The
TsmRoutinevtable has exactly the seven callbacks plus two flags shown — verified againstsrc/include/access/tsmapi.h.InitSampleScan,NextSampleBlock, andEndSampleScanare documented/* can be NULL */;SampleScanGetSampleSize,BeginSampleScan, andNextSampleTupleare always supplied by both built-in methods. -
SYSTEM supplies
NextSampleBlock; BERNOULLI sets it NULL. Confirmed in both handlers. This NULLness is read in three places, all verified:tablesample_init(allow_sync = (tsm->NextSampleBlock == NULL)),cost_samplescan(spc_page_cost = (tsm->NextSampleBlock != NULL) ? random : seq), andheapam_scan_sample_next_block(calls the method vs. walks sequentially). -
The cutoff formula is identical in both methods. Both compute
dcutoff = rint(((double) PG_UINT32_MAX + 1) * percent / 100)and store it asuint64. Verified byte-for-byte insystem_beginsamplescanandbernoulli_beginsamplescan. The argument is a singleFLOAT4OID(parameterTypes = list_make1_oid(FLOAT4OID)), validated to[0,100]with error codeERRCODE_INVALID_TABLESAMPLE_ARGUMENT. -
Hash inputs differ in arity. SYSTEM hashes a 2-element
uint32array{blockno, seed}; BERNOULLI hashes a 3-element array{blockno, offset, seed}. Both usehash_anyoversizeof(hashinput)and accept whenhash < sampler->cutoff. Verified insystem_nextsampleblockandbernoulli_nextsampletuple. -
Seed handling.
ExecInitSampleScandrawspg_prng_uint32(&pg_global_prng_state)once whentsc->repeatable == NULL;tablesample_initderives the deterministic seed viaDirectFunctionCall1(hashfloat8, datum)when REPEATABLE is present. The REPEATABLE argument is coerced toFLOAT8OIDin the parser. All verified. -
Size-estimate asymmetry. SYSTEM sets
*pages = clamp_row_est(pages * fract); BERNOULLI sets*pages = baserel->pages. Both set*tuples = clamp_row_est(tuples * fract). Verified — this single difference encodes the block-vs-row I/O cost distinction. -
Visibility is applied by the heap AM, not the method.
NextSampleTuplereturns an offset without a visibility guarantee; the comments in bothsystem.candbernoulli.cexplicitly say nodeSamplescan / the AM “will deal with that,” andheapam_scan_sample_next_tuplecallsSampleHeapTupleVisible(with anall_visibleVM fast path) andHeapCheckForSerializableConflictOut. Verified.
One nuance worth flagging for drift: the SQL-visible defaults (the bulkread 1% cutoff for SYSTEM, the pagemode 25% cutoff for BERNOULLI) are described in the source comments as guesses “based on very limited experimentation,” so they are tuning constants, not invariants — treat the exact thresholds as revision-specific.
Beyond PostgreSQL — Comparative Designs & Research Frontiers
Section titled “Beyond PostgreSQL — Comparative Designs & Research Frontiers”PostgreSQL’s TSM API is a faithful, minimal implementation of SQL:2003 sampling, and its design choices read clearly against the wider landscape.
Extension methods: sampling by count and by time. The percentage-based
interface is not the only useful contract, and the pluggability of the TSM
API is what lets core stay small while two contrib modules add other
policies. tsm_system_rows samples approximately a fixed number of rows
(not a percentage) using block-level selection, and tsm_system_time samples
for a fixed time budget in milliseconds. Both are out of scope here (this
doc is core-only) but they are the canonical demonstration that the four
callbacks suffice for genuinely different sampling semantics: each ships its
own BeginSampleScan to interpret its argument and its own NextSampleBlock
to stop on the right condition. The lesson is that the API’s narrow waist
(estimate / begin / next-block / next-tuple) was drawn at the right place.
flowchart LR
subgraph SYSTEM["SYSTEM — block sampling"]
SB["hash(blockno, seed) < cutoff?"]
SACC["accept block<br/>return ALL offsets"]
SREJ["reject block"]
SB -- yes --> SACC
SB -- no --> SREJ
end
subgraph BERN["BERNOULLI — row sampling"]
BB["hash(blockno, offset, seed) < cutoff?"]
BACC["accept this tuple"]
BREJ["reject this tuple"]
BB -- yes --> BACC
BB -- no --> BREJ
end
COST["cost: SYSTEM reads fract*pages (random)<br/>BERNOULLI reads ALL pages (seq)"]
STAT["statistics: SYSTEM clustered (design effect)<br/>BERNOULLI true simple-random"]
SYSTEM --> COST
BERN --> COST
SYSTEM --> STAT
BERN --> STAT
Other engines. SQL Server’s TABLESAMPLE is block-only (it has no
BERNOULLI equivalent in T-SQL; the percentage is approximate and small
samples may return zero rows) but adds a ROWS count form and a
REPEATABLE(seed) clause that maps onto the same deterministic-seed idea.
DB2 LUW implements both SYSTEM and BERNOULLI with REPEATABLE, the
closest match to PostgreSQL. Oracle predates the standard with
SAMPLE (n) (row) and SAMPLE BLOCK (n) (block) plus a SEED clause, and
notably integrates sampling into its optimizer’s dynamic-sampling and
approximate-query (APPROX_COUNT_DISTINCT) machinery — a direction
PostgreSQL has not taken in core. The convergence on “block vs row,
percentage, optional seed” across four independent products is strong
evidence the abstraction is the natural one.
Sampling for statistics, not just queries. PostgreSQL’s ANALYZE does
not use the TSM API. It runs its own two-stage reservoir sampler
(Vitter’s algorithm; see postgres-analyze-transform.md) over a fixed target
number of rows (default_statistics_target * 300) drawn from random blocks.
The two sampling subsystems are deliberately separate: ANALYZE needs a
fixed-size sample for histogram construction with bounded memory, whereas
TABLESAMPLE needs a fixed-probability sample exposed to SQL. Reservoir
sampling (Vitter 1985) is the apt theory anchor for the former; the
hash-threshold Bernoulli/cluster sampling here is the apt anchor for the
latter. A research-frontier observation: there is no core mechanism to feed a
TABLESAMPLE result back into the optimizer’s estimates (true
online/adaptive sampling, as in approximate-query systems like BlinkDB or
the AQUA project), so PostgreSQL’s sampling remains a user-facing query
operator rather than an internal estimation loop.
Block sampling’s statistical hazard, revisited. The design effect from clustering is not academic. Because SYSTEM returns every row on an accepted page, any attribute that correlates with physical row placement — a serial ID, an insertion-time-ordered timestamp, a clustering index — is systematically misrepresented by a SYSTEM sample. The honest guidance, matching the standard’s intent, is: use SYSTEM when you want a fast, rough look and the table is not physically ordered on the attribute you care about; use BERNOULLI when you need an unbiased row sample and can afford to read every page. PostgreSQL exposes both and prices them differently precisely so the planner and the user can make that trade-off explicitly.
Parallelism. A SampleScan is not itself parallel-aware in core; the
planner checks func_parallel(rte->tablesample->tsmhandler) and the
parallel-safety of the arguments before allowing a sampled rel under a
parallel plan, but the sampling itself is performed by a single worker’s
scan. Parallelizing block sampling (partitioning the block space across
workers while preserving the hash-threshold decision) is a clean extension
the current API would accommodate, since the block decision is already a pure
function of (blockno, seed) — any worker can independently decide any block.
Sources
Section titled “Sources”- PostgreSQL REL_18_STABLE source (commit 273fe94), files:
src/backend/access/tablesample/tablesample.c—GetTsmRoutinesrc/backend/access/tablesample/system.c— SYSTEM methodsrc/backend/access/tablesample/bernoulli.c— BERNOULLI methodsrc/include/access/tsmapi.h— theTsmRoutinevtable and signaturessrc/backend/executor/nodeSamplescan.c— the SampleScan executor nodesrc/backend/access/heap/heapam_handler.c— heap-AM sample bridge + visibilitysrc/backend/optimizer/path/allpaths.c—set_tablesample_rel_size/_pathlistsrc/backend/optimizer/path/costsize.c—cost_samplescansrc/backend/parser/parse_clause.c—transformRangeTableSamplesrc/include/catalog/pg_proc.dat— handler function registrations
- Database System Concepts, Silberschatz, Korth & Sudarshan, 7th ed.,
ch. 15 (Query Processing) and the advanced query-processing material on
sampling and approximate query answering —
knowledge/research/dbms-general/database-system-concepts.md. - SQL:2003 standard,
<table sample clause>(SYSTEM / BERNOULLI / REPEATABLE). - Cross-references within this tree: postgres-scan-nodes.md (SeqScan /
IndexScan node machinery), postgres-index-am.md (index AM handler surface),
postgres-table-am.md (
TableAmRoutine), postgres-heap-am.md (heap scan internals), postgres-analyze-transform.md (ANALYZE’s separate reservoir sampler), postgres-visibility-map.md (the all-visible fast path), postgres-ssi-predicate-locking.md (serializable conflict checks during scans), postgres-cost-model.md (page/tuple cost constants).