Skip to content

PostgreSQL Table Sampling — The TSM API, SYSTEM and BERNOULLI

Contents:

Sampling is the act of selecting a subset of a population so that a computation over the subset approximates a computation over the whole at a fraction of the cost. In a database, the “population” is the set of rows in a table and the “computation” is some query — count(*), an average, a GROUP BY aggregate, an exploratory SELECT *. If a billion-row table can answer “roughly what fraction of orders shipped late?” by examining one percent of its rows, the query runs a hundred times faster, and for many analytic and data-exploration purposes the loss of precision is acceptable. Database System Concepts (Silberschatz 7e, ch. 15 “Query Processing” and ch. 25 on advanced query processing / approximate query answering) frames sampling both as a query-evaluation technique and as the engine behind statistics gathering: the optimizer’s own row-count and histogram estimates come from a sample of the table collected by ANALYZE. The theory that makes sampling useful is classical survey statistics: the precision of an estimate derived from a sample of size n improves as 1/sqrt(n), independent of the population size N, provided the sample is drawn without systematic bias. Bias — not sample size — is the enemy.

That proviso forces a taxonomy. The cleanest statistical object is a simple random sample: every row is included or excluded by an independent coin flip with the same probability p, so any two rows are equally likely to appear together and there is no correlation between membership and any attribute. This is what statisticians call Bernoulli samplingN independent Bernoulli(p) trials, one per row, yielding a sample whose size is itself random with expectation pN. A Bernoulli sample is unbiased for any quantity, including correlations between columns, because the selection process ignores the data entirely. Its cost, though, is that the engine must consider every row — you cannot decide a coin flip for a row without locating that row — so a Bernoulli scan reads the whole table even when it returns one percent of it.

The cheaper alternative trades statistical purity for I/O. A database stores rows in pages (blocks), and the dominant cost of a scan is reading pages, not examining the rows already in memory on a page. Block sampling (also called cluster sampling in survey theory) selects whole pages with probability p and returns all rows on each selected page. Now the engine reads only pN_pages pages, a genuine 1/p I/O saving. The catch is the sample is no longer simple-random: rows on the same page are perfectly correlated in their membership (either all in or all out), so the sample is clustered. If the physical layout correlates with the data — orders sorted by date, so each page holds a narrow date range — a block sample badly misrepresents the date distribution. The statistical term for this loss is the design effect: clustering inflates the variance of an estimate relative to a simple random sample of the same size, by a factor that grows with the intra-page homogeneity of the sampled attribute.

A third axis, orthogonal to block-vs-row, is repeatability. A sample is repeatable if re-running the same sampling query returns the same rows. This sounds trivial but is subtle in a live database: between two runs the table may have been extended, vacuumed, or scanned by a synchronized seq-scan that started at a different block. A naive implementation that flips a fresh coin per row each time, or that piggybacks on whatever physical scan order the storage engine happens to use, is not repeatable. To make sampling repeatable you need the membership decision for a given row to be a pure function of the row’s stable identity and a fixed seed — history independent, as the PostgreSQL source comments put it. Hashing (identity, seed) and thresholding the hash achieves exactly this: the same identity and seed always hash to the same value, regardless of when or in what order the row is encountered, and regardless of what other rows exist.

The SQL:2003 standard introduced the TABLESAMPLE clause to expose sampling declaratively, and it standardized exactly the two methods that PostgreSQL ships:

SELECT * FROM orders TABLESAMPLE SYSTEM (1.5); -- 1.5% of pages
SELECT * FROM orders TABLESAMPLE BERNOULLI (1.5); -- 1.5% of rows
SELECT * FROM orders TABLESAMPLE SYSTEM (1.5) REPEATABLE (42);

SYSTEM is the standard’s name for block sampling and BERNOULLI for row sampling; the percentage is a probability in [0,100], and the optional REPEATABLE(seed) clause pins the seed. This is the surface that most major engines converged on. SQL Server’s TABLESAMPLE (n PERCENT) is block-level by default (it documents that it reads pages, not rows, so small samples can return zero rows) with a REPEATABLE(seed) option; DB2 offers both TABLESAMPLE SYSTEM and TABLESAMPLE BERNOULLI with a REPEATABLE seed, matching the standard almost verbatim; Oracle’s older SAMPLE(n) / SAMPLE BLOCK(n) syntax predates SQL:2003 but expresses the same row-vs-block choice and a SEED clause for repeatability. The remarkable thing is the near-universal agreement: two methods, a percentage, an optional seed.

Underneath the syntax, engines make three recurring design decisions:

  1. Where does the sampling decision live — in the access path, or as a filter on top of a full scan? A block-sampling method is an access path: it must influence which pages the storage engine fetches, so it has to reach down into the scan. A row-sampling method could in principle be a post-filter on an ordinary scan, but engines usually fold it into the scan too, so the same code path serves both and the optimizer can cost them uniformly. PostgreSQL’s choice — a dedicated SampleScan plan node that delegates block and tuple selection to a method — is the folded design.

  2. How is the method made pluggable? Some engines hardwire SYSTEM and BERNOULLI. PostgreSQL instead defines a tiny callback interface (the TSM API) so the two built-in methods are just two implementations and an extension can ship a third — for example, the tsm_system_rows and tsm_system_time contrib modules sample a fixed number of rows or sample for a fixed time budget rather than a percentage. The interface is deliberately small: estimate size, begin, pick a block, pick a tuple.

  3. How is repeatability and history-independence achieved? The robust answer, used by PostgreSQL, is the hash-and-threshold trick: compute a cutoff from the requested probability once, then for each candidate identifier hash (identifier, seed) and accept it if the hash falls below the cutoff. Because the hash is a deterministic function of identity and seed, the sample is repeatable (fix the seed) and history-independent (membership does not depend on scan order, table growth, or concurrent maintenance). This is strictly better than the alternative of consuming a PRNG stream in scan order, which is neither.

A subtlety that every implementation must handle is the interaction between sampling and visibility / concurrency. A sampling method picks a tuple slot (a block number and offset) without knowing whether the tuple there is visible to the current snapshot, or even whether a live tuple exists at that offset. The method’s job is purely to decide membership; the storage engine then applies the normal MVCC visibility test and silently drops slots that are dead or invisible. Crucially, for an unbiased row sample this ordering is correct only because the coin is flipped for every offset independently — extra coin flips spent on invisible tuples do not bias the survivors, since all tuples share the same acceptance probability.

PostgreSQL realizes TABLESAMPLE as a four-layer stack. At the top, the parser (transformRangeTableSample in parse_clause.c) resolves the method name to a handler functionSYSTEM and BERNOULLI are not keywords baked into the grammar but ordinary functions tsm_system_handler / tsm_bernoulli_handler looked up in pg_proc, returning the pseudo-type tsm_handler. The handler, invoked once via GetTsmRoutine (tablesample.c), returns a TsmRoutine node: a vtable of function pointers plus two flags advertising whether the method produces repeatable samples across queries and across scans. This is the same “handler returns a routine struct” pattern PostgreSQL uses for index AMs (IndexAmRoutine, see postgres-index-am.md), table AMs (TableAmRoutine, see postgres-table-am.md) and FDWs.

In the planner, set_tablesample_rel_size (allpaths.c) calls the method’s SampleScanGetSampleSize to learn how many pages it will read and how many tuples it will return, overwriting the relation’s page/tuple estimates accordingly; cost_samplescan (costsize.c) then prices the scan, charging random-page cost if the method supplies a NextSampleBlock (non-sequential block access) and sequential-page cost otherwise. The planner also consults repeatable_across_scans: if the query might scan the sampled rel more than once (e.g. the inner side of a nestloop) and the method cannot reproduce its sample, the SampleScan is wrapped in a Materialize node so the sample is computed once and replayed.

At execution time, nodeSamplescan.c defines the SampleScan plan node. ExecInitSampleScan opens the relation, initializes the TABLESAMPLE argument expressions and the optional REPEATABLE expression, draws a random seed if no REPEATABLE clause was given, runs the handler to get the routine, and calls the method’s InitSampleScan. On the first tuple request, tablesample_init evaluates the percentage argument and the seed, then calls BeginSampleScan; the driver loop tablesample_getnext repeatedly asks the table AM for the next sampled block and the next sampled tuple. Note that syncscan is disabled exactly when the method supplies a NextSampleBlock — a block-choosing method dictates block order, so it must not be perturbed by a synchronized scan starting elsewhere.

At the storage layer, the heap table AM (heapam_scan_sample_next_block / heapam_scan_sample_next_tuple in heapam_handler.c) bridges the abstract method to physical pages: it calls NextSampleBlock to choose a page (or walks pages sequentially when the method has no NextSampleBlock), reads that page, then repeatedly calls NextSampleTuple to choose offsets, applying the MVCC visibility test to each before storing it in the result slot. Because the bridge lives in the table AM, the TSM API is storage-agnostic.

flowchart TD
  SQL["SELECT ... TABLESAMPLE SYSTEM(1.5) REPEATABLE(42)"]
  PARSE["transformRangeTableSample<br/>parse_clause.c<br/>resolve method name to handler Oid"]
  HANDLER["tsm_system_handler / tsm_bernoulli_handler<br/>returns TsmRoutine vtable"]
  PLAN["set_tablesample_rel_size + cost_samplescan<br/>SampleScanGetSampleSize -> pages, tuples"]
  INIT["ExecInitSampleScan<br/>seed = REPEATABLE? hashfloat8(seed) : random<br/>InitSampleScan (alloc tsm_state)"]
  BEGIN["tablesample_init -> BeginSampleScan<br/>validate percent, compute cutoff"]
  DRIVE["tablesample_getnext loop"]
  NB["NextSampleBlock<br/>(SYSTEM only; BERNOULLI = NULL -> seq pages)"]
  NT["NextSampleTuple<br/>pick offsets on page"]
  VIS["MVCC visibility test<br/>SampleHeapTupleVisible"]
  OUT["result slot -> ExecScan"]

  SQL --> PARSE --> HANDLER --> PLAN --> INIT --> BEGIN --> DRIVE
  DRIVE --> NB --> NT --> VIS --> OUT
  VIS -. invisible/dead .-> NT
  NT -. page exhausted .-> NB

This walkthrough follows the call flow top-down: the API contract (TsmRoutine), the parser/planner glue that resolves and prices a method, the executor driver loop, the two built-in methods (SYSTEM and BERNOULLI), and finally the heap-AM bridge that applies visibility. All excerpts are condensed from REL_18 (273fe94); each begins with a // symbol — path comment for grepping.

1. The API contract — TsmRoutine and GetTsmRoutine

Section titled “1. The API contract — TsmRoutine and GetTsmRoutine”

A tablesample method is described entirely by one struct of callbacks plus two capability flags. The header documents which callbacks may be NULL:

// TsmRoutine — src/include/access/tsmapi.h
typedef struct TsmRoutine
{
NodeTag type;
List *parameterTypes; /* datatype OIDs of TABLESAMPLE args */
bool repeatable_across_queries;
bool repeatable_across_scans;
SampleScanGetSampleSize_function SampleScanGetSampleSize; /* planner */
InitSampleScan_function InitSampleScan; /* can be NULL */
BeginSampleScan_function BeginSampleScan;
NextSampleBlock_function NextSampleBlock; /* can be NULL */
NextSampleTuple_function NextSampleTuple;
EndSampleScan_function EndSampleScan; /* can be NULL */
} TsmRoutine;

The two flags are policy hints the planner reads: repeatable_across_queries gates whether a REPEATABLE clause is even allowed (the parser rejects REPEATABLE on a method that clears it), and repeatable_across_scans decides whether a multiply-scanned SampleScan must be wrapped in a Materialize. The NULLability of NextSampleBlock is load-bearing: a method that does not pick blocks (BERNOULLI) leaves it NULL, and that single fact propagates outward — the executor then enables syncscan, the heap AM walks pages sequentially, and the cost model charges sequential-page cost.

GetTsmRoutine is the only public entry point of tablesample.c. It invokes the handler function and sanity-checks the result:

// GetTsmRoutine — src/backend/access/tablesample/tablesample.c
TsmRoutine *
GetTsmRoutine(Oid tsmhandler)
{
Datum datum;
TsmRoutine *routine;
datum = OidFunctionCall1(tsmhandler, PointerGetDatum(NULL));
routine = (TsmRoutine *) DatumGetPointer(datum);
if (routine == NULL || !IsA(routine, TsmRoutine))
elog(ERROR, "tablesample handler function %u did not return a TsmRoutine struct",
tsmhandler);
return routine;
}

The handler is called with a NULL argument — methods ignore the argument and exist only to return the vtable. The handlers themselves are trivial: they makeNode(TsmRoutine) (so unset fields are NULL) and assign the callbacks. SYSTEM advertises both repeatability flags and supplies all five executor callbacks except EndSampleScan:

// tsm_system_handler — src/backend/access/tablesample/system.c
Datum
tsm_system_handler(PG_FUNCTION_ARGS)
{
TsmRoutine *tsm = makeNode(TsmRoutine);
tsm->parameterTypes = list_make1_oid(FLOAT4OID);
tsm->repeatable_across_queries = true;
tsm->repeatable_across_scans = true;
tsm->SampleScanGetSampleSize = system_samplescangetsamplesize;
tsm->InitSampleScan = system_initsamplescan;
tsm->BeginSampleScan = system_beginsamplescan;
tsm->NextSampleBlock = system_nextsampleblock; /* SYSTEM picks blocks */
tsm->NextSampleTuple = system_nextsampletuple;
tsm->EndSampleScan = NULL;
PG_RETURN_POINTER(tsm);
}

BERNOULLI is identical except NextSampleBlock = NULL — it never chooses blocks, only tuples within whatever block the AM hands it.

transformRangeTableSample turns TABLESAMPLE method(args) REPEATABLE(seed) into a TableSampleClause. The method name is resolved as an ordinary function returning the tsm_handler pseudo-type; the handler is run once to learn the argument types and to validate the REPEATABLE flag:

// transformRangeTableSample — src/backend/parser/parse_clause.c
handlerOid = LookupFuncName(rts->method, 1, funcargtypes, true);
/* ... error if !OidIsValid or wrong return type ... */
tsm = GetTsmRoutine(handlerOid);
tablesample = makeNode(TableSampleClause);
tablesample->tsmhandler = handlerOid;
if (list_length(rts->args) != list_length(tsm->parameterTypes))
ereport(ERROR, (errcode(ERRCODE_INVALID_TABLESAMPLE_ARGUMENT), ...));
/* ... coerce each arg to the method's parameterTypes ... */
if (rts->repeatable != NULL)
{
if (!tsm->repeatable_across_queries)
ereport(ERROR, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
errmsg("tablesample method %s does not support REPEATABLE", ...)));
/* REPEATABLE expr coerced to float8 */
}

Coercing REPEATABLE to float8 (not integer) is deliberate: the source comments in nodeSamplescan.c note that float8 yields unsurprising behavior both for users who expect integer seeds and for those who expect a setseed()-like value in [-1, 1].

In the planner, set_tablesample_rel_size asks the method how big the sample will be and overwrites the relation estimates with the result, so the rest of the planner sees the sampled rel as if it were a smaller table:

// set_tablesample_rel_size — src/backend/optimizer/path/allpaths.c
tsm = GetTsmRoutine(tsc->tsmhandler);
tsm->SampleScanGetSampleSize(root, rel, tsc->args, &pages, &tuples);
rel->pages = pages;
rel->tuples = tuples;
set_baserel_size_estimates(root, rel);

set_tablesample_rel_pathlist builds the lone SampleScan path and applies the repeatability guard. The comment explains the design trade-off — rather than teach the join planner to avoid re-scanning a non-repeatable sample, it simply materializes:

// set_tablesample_rel_pathlist — src/backend/optimizer/path/allpaths.c
path = create_samplescan_path(root, rel, required_outer);
if ((root->query_level > 1 ||
bms_membership(root->all_query_rels) != BMS_SINGLETON) &&
!(GetTsmRoutine(rte->tablesample->tsmhandler)->repeatable_across_scans))
{
path = (Path *) create_material_path(rel, path);
}
add_path(rel, path);

cost_samplescan prices the path. The key line is the page-cost choice keyed on NextSampleBlock: a block-picking method jumps around the heap (random access), a sequential method reads pages in order:

// cost_samplescan — src/backend/optimizer/path/costsize.c
/* if NextSampleBlock is used, assume random access, else sequential */
spc_page_cost = (tsm->NextSampleBlock != NULL) ?
spc_random_page_cost : spc_seq_page_cost;
run_cost += spc_page_cost * baserel->pages; /* pages already set */
/* ... */
cpu_per_tuple = cpu_tuple_cost + qpqual_cost.per_tuple;
run_cost += cpu_per_tuple * baserel->tuples; /* tuples already set */

Because baserel->pages and baserel->tuples were already overwritten by SampleScanGetSampleSize, the cost model needs no special sampling math — it multiplies the method’s own estimates by the standard per-page and per-tuple costs.

3. The executor driver — nodeSamplescan.c

Section titled “3. The executor driver — nodeSamplescan.c”

ExecInitSampleScan sets up the node and, critically, fixes the seed once so it survives rescans. If there is no REPEATABLE clause, a per-scan random seed is drawn from the global PRNG:

// ExecInitSampleScan — src/backend/executor/nodeSamplescan.c
scanstate->args = ExecInitExprList(tsc->args, (PlanState *) scanstate);
scanstate->repeatable = ExecInitExpr(tsc->repeatable, (PlanState *) scanstate);
/* If we don't have a REPEATABLE clause, select a random seed (just once). */
if (tsc->repeatable == NULL)
scanstate->seed = pg_prng_uint32(&pg_global_prng_state);
tsm = GetTsmRoutine(tsc->tsmhandler);
scanstate->tsmroutine = tsm;
scanstate->tsm_state = NULL;
if (tsm->InitSampleScan)
tsm->InitSampleScan(scanstate, eflags);
scanstate->begun = false; /* defer BeginSampleScan to first tuple */

The actual parameter values cannot be evaluated at init time (they may reference outer LATERAL columns), so BeginSampleScan is deferred to the first SampleNext, handled by tablesample_init. This is where the seed becomes deterministic for REPEATABLE: the float8 value is hashed by hashfloat8, which the comment notes makes REPEATABLE(0) machine independent for regression testing:

// tablesample_init — src/backend/executor/nodeSamplescan.c
if (scanstate->repeatable)
{
datum = ExecEvalExprSwitchContext(scanstate->repeatable, econtext, &isnull);
if (isnull)
ereport(ERROR, (errcode(ERRCODE_INVALID_TABLESAMPLE_REPEAT), ...));
/* hashfloat8 turns the float8 seed into a uint32; REPEATABLE(0) is portable */
seed = DatumGetUInt32(DirectFunctionCall1(hashfloat8, datum));
}
else
seed = scanstate->seed; /* the per-scan random seed from init */
scanstate->use_bulkread = true; /* defaults BeginSampleScan may override */
scanstate->use_pagemode = true;
tsm->BeginSampleScan(scanstate, params, list_length(scanstate->args), seed);
/* We'll use syncscan if there's no NextSampleBlock function */
allow_sync = (tsm->NextSampleBlock == NULL);

That last line is the whole reason history-independence matters in practice: a method that picks its own blocks must not allow a synchronized seqscan to choose the starting block, because that would make the sample depend on concurrent scan activity. A method without NextSampleBlock walks pages sequentially and so can safely ride syncscan.

The driver loop is a straightforward two-level iteration over blocks and tuples, delegating both to the table AM:

// tablesample_getnext — src/backend/executor/nodeSamplescan.c
for (;;)
{
if (!scanstate->haveblock)
{
if (!table_scan_sample_next_block(scan, scanstate))
{
scanstate->done = true;
return NULL; /* exhausted relation */
}
scanstate->haveblock = true;
}
if (!table_scan_sample_next_tuple(scan, scanstate, slot))
{
scanstate->haveblock = false; /* page exhausted, advance block */
continue;
}
break; /* found a visible sampled tuple */
}
scanstate->donetuples++;
return slot;

ExecReScanSampleScan resets begun/done/haveblock but not the seed, so a rescan of a repeatable method reproduces the identical sample — that is exactly what repeatable_across_scans promises and what lets the planner skip the Materialize wrapper.

SYSTEM’s private state holds the integer cutoff, the seed, the next block to consider, and the last offset returned:

// SystemSamplerData — src/backend/access/tablesample/system.c
typedef struct
{
uint64 cutoff; /* select blocks with hash less than this */
uint32 seed;
BlockNumber nextblock; /* next block to consider sampling */
OffsetNumber lt; /* last tuple returned from current block */
} SystemSamplerData;

BeginSampleScan validates the percentage and converts it into a 32-bit acceptance cutoff. The cutoff is round((2^32) * percent/100) stored in a uint64; comparing a uint32 hash against it gives acceptance probability percent/100, and the formula is exact at the 0% and 100% limits:

// system_beginsamplescan — src/backend/access/tablesample/system.c
double percent = DatumGetFloat4(params[0]);
if (percent < 0 || percent > 100 || isnan(percent))
ereport(ERROR, (errcode(ERRCODE_INVALID_TABLESAMPLE_ARGUMENT),
errmsg("sample percentage must be between 0 and 100")));
dcutoff = rint(((double) PG_UINT32_MAX + 1) * percent / 100);
sampler->cutoff = (uint64) dcutoff;
sampler->seed = seed;
sampler->nextblock = 0;
sampler->lt = InvalidOffsetNumber;
/* Bulkread unless scanning a tiny fraction; pagemode always (scan whole page) */
node->use_bulkread = (percent >= 1);
node->use_pagemode = true;

NextSampleBlock is the heart of SYSTEM. It scans block numbers in order, hashing each (blockno, seed) pair and accepting the first block whose hash falls below the cutoff. Hashing block-then-seed (rather than consuming a PRNG) is what gives a machine-independent, history-independent decision per block:

// system_nextsampleblock — src/backend/access/tablesample/system.c
uint32 hashinput[2];
hashinput[1] = sampler->seed; /* same throughout the scan */
for (; nextblock < nblocks; nextblock++)
{
uint32 hash;
hashinput[0] = nextblock;
hash = DatumGetUInt32(hash_any((const unsigned char *) hashinput,
(int) sizeof(hashinput)));
if (hash < sampler->cutoff)
break; /* accept this block */
}
if (nextblock < nblocks)
{
sampler->nextblock = nextblock + 1; /* resume past it next call */
return nextblock;
}
sampler->nextblock = 0;
return InvalidBlockNumber; /* done */

Once a block is accepted, SYSTEM returns every tuple on it — that is the defining behavior of block (cluster) sampling. NextSampleTuple therefore just walks offsets 1..maxoffset with no per-tuple coin flip:

// system_nextsampletuple — src/backend/access/tablesample/system.c
OffsetNumber tupoffset = sampler->lt;
if (tupoffset == InvalidOffsetNumber)
tupoffset = FirstOffsetNumber;
else
tupoffset++;
if (tupoffset > maxoffset)
tupoffset = InvalidOffsetNumber; /* page done -> advance block */
sampler->lt = tupoffset;
return tupoffset;

SampleScanGetSampleSize for SYSTEM estimates both pages and tuples as fraction * baserel, because it reads only a fraction of pages:

// system_samplescangetsamplesize — src/backend/access/tablesample/system.c
*pages = clamp_row_est(baserel->pages * samplefract); /* fraction of pages */
*tuples = clamp_row_est(baserel->tuples * samplefract);

BERNOULLI’s state drops nextblock — it never chooses blocks — keeping only the cutoff, seed, and last offset:

// BernoulliSamplerData — src/backend/access/tablesample/bernoulli.c
typedef struct
{
uint64 cutoff; /* select tuples with hash less than this */
uint32 seed;
OffsetNumber lt; /* last tuple returned from current block */
} BernoulliSamplerData;

BeginSampleScan computes the identical cutoff formula, but sets different buffer-strategy hints: it always uses bulkread (it scans every page) and only switches to pagemode at large fractions:

// bernoulli_beginsamplescan — src/backend/access/tablesample/bernoulli.c
dcutoff = rint(((double) PG_UINT32_MAX + 1) * percent / 100);
sampler->cutoff = (uint64) dcutoff;
sampler->seed = seed;
sampler->lt = InvalidOffsetNumber;
/* Scan all pages -> always bulkread; pagemode only pays off at >=25% */
node->use_bulkread = true;
node->use_pagemode = (percent >= 25);

There is no NextSampleBlock, so the heap AM walks pages sequentially. NextSampleTuple is where the per-tuple coin lives: for each offset it hashes (blockno, offset, seed) and accepts if the hash is below cutoff. The independent per-offset hash is what makes this a true simple-random sample:

// bernoulli_nextsampletuple — src/backend/access/tablesample/bernoulli.c
OffsetNumber tupoffset = sampler->lt;
uint32 hashinput[3];
if (tupoffset == InvalidOffsetNumber)
tupoffset = FirstOffsetNumber;
else
tupoffset++;
hashinput[0] = blockno; /* same throughout the block */
hashinput[2] = sampler->seed; /* same throughout the scan */
for (; tupoffset <= maxoffset; tupoffset++)
{
uint32 hash;
hashinput[1] = tupoffset;
hash = DatumGetUInt32(hash_any((const unsigned char *) hashinput,
(int) sizeof(hashinput)));
if (hash < sampler->cutoff)
break; /* accept this tuple */
}
if (tupoffset > maxoffset)
tupoffset = InvalidOffsetNumber; /* page done */
sampler->lt = tupoffset;
return tupoffset;

The comment above this function is the statistical justification for the slot-then-visibility ordering: “we do the coinflip for every tuple offset in the table. Since all tuples have the same probability of being returned, it doesn’t matter if we do extra coinflips for invisible tuples.” Critically, BERNOULLI’s SampleScanGetSampleSize estimates all pages — it reads the whole table — but only fraction * tuples:

// bernoulli_samplescangetsamplesize — src/backend/access/tablesample/bernoulli.c
*pages = baserel->pages; /* visits every page */
*tuples = clamp_row_est(baserel->tuples * samplefract);

This *pages = baserel->pages vs SYSTEM’s *pages = pages * fract is the one estimate that captures the entire cost difference between the two methods.

6. The heap-AM bridge — applying visibility

Section titled “6. The heap-AM bridge — applying visibility”

The driver’s table_scan_sample_next_block / _next_tuple dispatch into the heap AM. heapam_scan_sample_next_block either calls the method’s NextSampleBlock or, when it is NULL, walks pages sequentially with syncscan position reporting, then reads the chosen page (and prunes it in pagemode):

// heapam_scan_sample_next_block — src/backend/access/heap/heapam_handler.c
if (tsm->NextSampleBlock)
blockno = tsm->NextSampleBlock(scanstate, hscan->rs_nblocks);
else
{ /* scanning table sequentially (BERNOULLI) */
if (hscan->rs_cblock == InvalidBlockNumber)
blockno = hscan->rs_startblock;
else
{
blockno = hscan->rs_cblock + 1;
if (blockno >= hscan->rs_nblocks)
blockno = 0; /* wrap */
if (scan->rs_flags & SO_ALLOW_SYNC)
ss_report_location(scan->rs_rd, blockno);
if (blockno == hscan->rs_startblock)
blockno = InvalidBlockNumber; /* full circle -> done */
}
}
/* ... read page; if pagemode, heap_prepare_pagescan(scan) prunes + computes visibility ... */

heapam_scan_sample_next_tuple repeatedly asks NextSampleTuple for offsets and applies the MVCC test before returning a tuple — skipping invalid line pointers and invisible tuples, exactly the “extra coinflips don’t bias” contract BERNOULLI relies on:

// heapam_scan_sample_next_tuple — src/backend/access/heap/heapam_handler.c
for (;;)
{
tupoffset = tsm->NextSampleTuple(scanstate, blockno, maxoffset);
if (OffsetNumberIsValid(tupoffset))
{
itemid = PageGetItemId(page, tupoffset);
if (!ItemIdIsNormal(itemid))
continue; /* skip dead/redirect slots */
/* build HeapTuple at (blockno, tupoffset) ... */
if (all_visible)
visible = true;
else
visible = SampleHeapTupleVisible(scan, hscan->rs_cbuf, tuple, tupoffset);
if (!visible)
continue; /* invisible -> next offset */
ExecStoreBufferHeapTuple(tuple, slot, hscan->rs_cbuf);
return true;
}
else
{ /* NextSampleTuple returned Invalid -> page exhausted */
ExecClearTuple(slot);
return false;
}
}

The all_visible fast path (PageIsAllVisible(page) && !takenDuringRecovery) skips the per-tuple visibility check when the page is all-visible per the visibility map (see postgres-visibility-map.md), a real win for SYSTEM where every offset on an accepted page is returned. Note also that in non-pagemode the buffer is content-locked around the visibility test and HeapCheckForSerializableConflictOut is invoked — table sampling participates in SSI predicate locking like any other scan (see postgres-ssi-predicate-locking.md).

Position hints (as of 2026-06-05, REL_18 273fe94)

Section titled “Position hints (as of 2026-06-05, REL_18 273fe94)”
SymbolFileLine
TsmRoutine (struct)src/include/access/tsmapi.h56
GetTsmRoutinesrc/backend/access/tablesample/tablesample.c27
tsm_system_handlersrc/backend/access/tablesample/system.c67
SystemSamplerData (struct)src/backend/access/tablesample/system.c37
system_samplescangetsamplesizesrc/backend/access/tablesample/system.c88
system_beginsamplescansrc/backend/access/tablesample/system.c139
system_nextsampleblocksrc/backend/access/tablesample/system.c178
system_nextsampletuplesrc/backend/access/tablesample/system.c236
tsm_bernoulli_handlersrc/backend/access/tablesample/bernoulli.c65
BernoulliSamplerData (struct)src/backend/access/tablesample/bernoulli.c37
bernoulli_samplescangetsamplesizesrc/backend/access/tablesample/bernoulli.c86
bernoulli_beginsamplescansrc/backend/access/tablesample/bernoulli.c136
bernoulli_nextsampletuplesrc/backend/access/tablesample/bernoulli.c181
SampleScanState (struct)src/include/nodes/execnodes.h1631
ExecInitSampleScansrc/backend/executor/nodeSamplescan.c92
tablesample_initsrc/backend/executor/nodeSamplescan.c217
tablesample_getnextsrc/backend/executor/nodeSamplescan.c319
ExecReScanSampleScansrc/backend/executor/nodeSamplescan.c201
transformRangeTableSamplesrc/backend/parser/parse_clause.c907
set_tablesample_rel_sizesrc/backend/optimizer/path/allpaths.c825
set_tablesample_rel_pathlistsrc/backend/optimizer/path/allpaths.c865
cost_samplescansrc/backend/optimizer/path/costsize.c370
heapam_scan_sample_next_blocksrc/backend/access/heap/heapam_handler.c2174
heapam_scan_sample_next_tuplesrc/backend/access/heap/heapam_handler.c2264

All claims below were checked against the REL_18_STABLE working tree at commit 273fe94. The following facts are asserted as verified:

  • Two core methods, two source files. system.c and bernoulli.c are the only files under src/backend/access/tablesample/ besides the shared tablesample.c. Each defines exactly one handler (tsm_system_handler / tsm_bernoulli_handler), registered in pg_proc.dat returning the tsm_handler pseudo-type (prosrc => 'tsm_system_handler' / 'tsm_bernoulli_handler').

  • The TsmRoutine vtable has exactly the seven callbacks plus two flags shown — verified against src/include/access/tsmapi.h. InitSampleScan, NextSampleBlock, and EndSampleScan are documented /* can be NULL */; SampleScanGetSampleSize, BeginSampleScan, and NextSampleTuple are always supplied by both built-in methods.

  • SYSTEM supplies NextSampleBlock; BERNOULLI sets it NULL. Confirmed in both handlers. This NULLness is read in three places, all verified: tablesample_init (allow_sync = (tsm->NextSampleBlock == NULL)), cost_samplescan (spc_page_cost = (tsm->NextSampleBlock != NULL) ? random : seq), and heapam_scan_sample_next_block (calls the method vs. walks sequentially).

  • The cutoff formula is identical in both methods. Both compute dcutoff = rint(((double) PG_UINT32_MAX + 1) * percent / 100) and store it as uint64. Verified byte-for-byte in system_beginsamplescan and bernoulli_beginsamplescan. The argument is a single FLOAT4OID (parameterTypes = list_make1_oid(FLOAT4OID)), validated to [0,100] with error code ERRCODE_INVALID_TABLESAMPLE_ARGUMENT.

  • Hash inputs differ in arity. SYSTEM hashes a 2-element uint32 array {blockno, seed}; BERNOULLI hashes a 3-element array {blockno, offset, seed}. Both use hash_any over sizeof(hashinput) and accept when hash < sampler->cutoff. Verified in system_nextsampleblock and bernoulli_nextsampletuple.

  • Seed handling. ExecInitSampleScan draws pg_prng_uint32(&pg_global_prng_state) once when tsc->repeatable == NULL; tablesample_init derives the deterministic seed via DirectFunctionCall1(hashfloat8, datum) when REPEATABLE is present. The REPEATABLE argument is coerced to FLOAT8OID in the parser. All verified.

  • Size-estimate asymmetry. SYSTEM sets *pages = clamp_row_est(pages * fract); BERNOULLI sets *pages = baserel->pages. Both set *tuples = clamp_row_est(tuples * fract). Verified — this single difference encodes the block-vs-row I/O cost distinction.

  • Visibility is applied by the heap AM, not the method. NextSampleTuple returns an offset without a visibility guarantee; the comments in both system.c and bernoulli.c explicitly say nodeSamplescan / the AM “will deal with that,” and heapam_scan_sample_next_tuple calls SampleHeapTupleVisible (with an all_visible VM fast path) and HeapCheckForSerializableConflictOut. Verified.

One nuance worth flagging for drift: the SQL-visible defaults (the bulkread 1% cutoff for SYSTEM, the pagemode 25% cutoff for BERNOULLI) are described in the source comments as guesses “based on very limited experimentation,” so they are tuning constants, not invariants — treat the exact thresholds as revision-specific.

Beyond PostgreSQL — Comparative Designs & Research Frontiers

Section titled “Beyond PostgreSQL — Comparative Designs & Research Frontiers”

PostgreSQL’s TSM API is a faithful, minimal implementation of SQL:2003 sampling, and its design choices read clearly against the wider landscape.

Extension methods: sampling by count and by time. The percentage-based interface is not the only useful contract, and the pluggability of the TSM API is what lets core stay small while two contrib modules add other policies. tsm_system_rows samples approximately a fixed number of rows (not a percentage) using block-level selection, and tsm_system_time samples for a fixed time budget in milliseconds. Both are out of scope here (this doc is core-only) but they are the canonical demonstration that the four callbacks suffice for genuinely different sampling semantics: each ships its own BeginSampleScan to interpret its argument and its own NextSampleBlock to stop on the right condition. The lesson is that the API’s narrow waist (estimate / begin / next-block / next-tuple) was drawn at the right place.

flowchart LR
  subgraph SYSTEM["SYSTEM — block sampling"]
    SB["hash(blockno, seed) < cutoff?"]
    SACC["accept block<br/>return ALL offsets"]
    SREJ["reject block"]
    SB -- yes --> SACC
    SB -- no --> SREJ
  end
  subgraph BERN["BERNOULLI — row sampling"]
    BB["hash(blockno, offset, seed) < cutoff?"]
    BACC["accept this tuple"]
    BREJ["reject this tuple"]
    BB -- yes --> BACC
    BB -- no --> BREJ
  end
  COST["cost: SYSTEM reads fract*pages (random)<br/>BERNOULLI reads ALL pages (seq)"]
  STAT["statistics: SYSTEM clustered (design effect)<br/>BERNOULLI true simple-random"]
  SYSTEM --> COST
  BERN --> COST
  SYSTEM --> STAT
  BERN --> STAT

Other engines. SQL Server’s TABLESAMPLE is block-only (it has no BERNOULLI equivalent in T-SQL; the percentage is approximate and small samples may return zero rows) but adds a ROWS count form and a REPEATABLE(seed) clause that maps onto the same deterministic-seed idea. DB2 LUW implements both SYSTEM and BERNOULLI with REPEATABLE, the closest match to PostgreSQL. Oracle predates the standard with SAMPLE (n) (row) and SAMPLE BLOCK (n) (block) plus a SEED clause, and notably integrates sampling into its optimizer’s dynamic-sampling and approximate-query (APPROX_COUNT_DISTINCT) machinery — a direction PostgreSQL has not taken in core. The convergence on “block vs row, percentage, optional seed” across four independent products is strong evidence the abstraction is the natural one.

Sampling for statistics, not just queries. PostgreSQL’s ANALYZE does not use the TSM API. It runs its own two-stage reservoir sampler (Vitter’s algorithm; see postgres-analyze-transform.md) over a fixed target number of rows (default_statistics_target * 300) drawn from random blocks. The two sampling subsystems are deliberately separate: ANALYZE needs a fixed-size sample for histogram construction with bounded memory, whereas TABLESAMPLE needs a fixed-probability sample exposed to SQL. Reservoir sampling (Vitter 1985) is the apt theory anchor for the former; the hash-threshold Bernoulli/cluster sampling here is the apt anchor for the latter. A research-frontier observation: there is no core mechanism to feed a TABLESAMPLE result back into the optimizer’s estimates (true online/adaptive sampling, as in approximate-query systems like BlinkDB or the AQUA project), so PostgreSQL’s sampling remains a user-facing query operator rather than an internal estimation loop.

Block sampling’s statistical hazard, revisited. The design effect from clustering is not academic. Because SYSTEM returns every row on an accepted page, any attribute that correlates with physical row placement — a serial ID, an insertion-time-ordered timestamp, a clustering index — is systematically misrepresented by a SYSTEM sample. The honest guidance, matching the standard’s intent, is: use SYSTEM when you want a fast, rough look and the table is not physically ordered on the attribute you care about; use BERNOULLI when you need an unbiased row sample and can afford to read every page. PostgreSQL exposes both and prices them differently precisely so the planner and the user can make that trade-off explicitly.

Parallelism. A SampleScan is not itself parallel-aware in core; the planner checks func_parallel(rte->tablesample->tsmhandler) and the parallel-safety of the arguments before allowing a sampled rel under a parallel plan, but the sampling itself is performed by a single worker’s scan. Parallelizing block sampling (partitioning the block space across workers while preserving the hash-threshold decision) is a clean extension the current API would accommodate, since the block decision is already a pure function of (blockno, seed) — any worker can independently decide any block.

  • PostgreSQL REL_18_STABLE source (commit 273fe94), files:
    • src/backend/access/tablesample/tablesample.cGetTsmRoutine
    • src/backend/access/tablesample/system.c — SYSTEM method
    • src/backend/access/tablesample/bernoulli.c — BERNOULLI method
    • src/include/access/tsmapi.h — the TsmRoutine vtable and signatures
    • src/backend/executor/nodeSamplescan.c — the SampleScan executor node
    • src/backend/access/heap/heapam_handler.c — heap-AM sample bridge + visibility
    • src/backend/optimizer/path/allpaths.cset_tablesample_rel_size / _pathlist
    • src/backend/optimizer/path/costsize.ccost_samplescan
    • src/backend/parser/parse_clause.ctransformRangeTableSample
    • src/include/catalog/pg_proc.dat — handler function registrations
  • Database System Concepts, Silberschatz, Korth & Sudarshan, 7th ed., ch. 15 (Query Processing) and the advanced query-processing material on sampling and approximate query answering — knowledge/research/dbms-general/database-system-concepts.md.
  • SQL:2003 standard, <table sample clause> (SYSTEM / BERNOULLI / REPEATABLE).
  • Cross-references within this tree: postgres-scan-nodes.md (SeqScan / IndexScan node machinery), postgres-index-am.md (index AM handler surface), postgres-table-am.md (TableAmRoutine), postgres-heap-am.md (heap scan internals), postgres-analyze-transform.md (ANALYZE’s separate reservoir sampler), postgres-visibility-map.md (the all-visible fast path), postgres-ssi-predicate-locking.md (serializable conflict checks during scans), postgres-cost-model.md (page/tuple cost constants).