PostgreSQL BRIN — Block Range Indexes and the Range Map
Contents:
- Theoretical Background
- Common DBMS Design
- PostgreSQL’s Approach
- Source Walkthrough
- Source verification (as of 2026-06-05)
- Beyond PostgreSQL — Comparative Designs & Research Frontiers
- Sources
Theoretical Background
Section titled “Theoretical Background”Every index family discussed so far in this code-analysis/postgres tree —
the B-tree (postgres-nbtree.md), GiST, hash — shares one structural
commitment: one index entry per heap tuple. That is what makes them
precise (a probe returns exactly the matching TIDs) and what makes them
expensive: the index of a billion-row table is itself hundreds of millions
of entries, often a sizable fraction of the table’s own size, and every
insert pays to thread a new entry into a balanced tree. For an analytic
warehouse table that is append-mostly and physically clustered on the
column you filter by — a fact table ordered by created_at, a sensor log
ordered by timestamp, an order table ordered by order_id — that precision
is mostly wasted. The query WHERE created_at BETWEEN x AND y does not need
to know which rows match; it needs to know which disk pages cannot
possibly contain a match, so the executor can skip reading them. The index
that answers that cheaper question can itself be thousands of times smaller.
This is the idea of the block-oriented summary index, known in the
literature and in other engines under several names: zone maps (Netezza,
Oracle Exadata storage indexes), Small Materialized Aggregates (the
foundational notion from Moerkotte’s 1998 VLDB paper “Small Materialized
Aggregates: A Light Weight Index Structure for Data Warehousing”), MinMax
indexes, and in PostgreSQL BRIN — Block Range INdex. The textbook
framing lives in Database Internals (Petzold/Petrov; captured in
research/dbms-general/database-internals.md) under the distinction between
precise and approximate secondary structures, and in Database System
Concepts (Silberschatz 7e, ch. 14 on indexing) under the cost model that
makes a sequential scan with page-skipping competitive with a B-tree probe
once the selectivity is low and the data is clustered.
The defining trade is size versus precision, and BRIN sits at the extreme small-and-imprecise corner:
-
It indexes ranges of pages, not rows. Heap blocks are partitioned into fixed-size groups — page ranges — and the index stores one summary per range. With 128 pages per range (PostgreSQL’s default), the index has one entry per ~128 × ~226 ≈ 29,000 tuples. A multi-gigabyte table gets an index measured in kilobytes.
-
The summary is a bound, not a value set. For an ordered type the summary is
(min, max)of every value in the range; for a geometric type it is a bounding box. A query qual is checked against the bound: if the qual cannot be satisfied by any value inside[min, max], the whole range is skipped. If it might be, the range is returned for rechecking. This is the same one-sided logic as a Bloom filter: no false negatives, many false positives. A range is returned whenever it might match. -
Effectiveness depends entirely on physical correlation. If the indexed column is well correlated with heap block order, each range’s
[min, max]is tight and disjoint from its neighbours, so a selective qual skips almost every range. If the column is shuffled, every range’s[min, max]spans nearly the whole domain, every range matches every qual, and the scan degenerates to a sequential scan plus useless index overhead. BRIN is therefore a specialized tool, not a default —pg_stats.correlationis the deciding statistic.
The second structural idea, orthogonal to summarization, is the translation layer. To skip pages, the scan must, for each page range, find that range’s summary in O(1) without descending a tree (a tree would reintroduce the per-entry cost we are trying to avoid). BRIN’s answer is the range map (revmap): a flat array, addressed by simple arithmetic on the heap block number, where slot i holds the TID of the summary tuple for page range i. The README states it directly: “Since the map entries are fixed size, it is possible to compute the address of the range map entry for any given heap page by simple arithmetic.” The revmap is the part of BRIN that has no analogue in a B-tree, and it organizes the rest of this document.
Two consequences follow immediately and shape everything downstream:
-
BRIN can only produce a lossy bitmap, never a tuple stream. Because a range’s summary identifies pages that might match, the only honest output is “scan these pages and recheck.” PostgreSQL’s AM layer formalizes this: BRIN supports
amgetbitmapbut notamgettuple, and the bitmap it builds is intentionally lossy at page granularity. -
Maintenance is asymmetric and lazy. An insert can only ever widen a range’s
[min, max]— and only if the new value falls outside it — so most inserts touch nothing. A new range at the end of a growing table is simply left unsummarized (and unsummarized ranges are always returned by scans, so correctness holds), with summarization deferred to VACUUM or an explicit/auto call. Deletes never shrink a summary; the README notes this is an “optimization opportunity only, not a correctness issue.”
Common DBMS Design
Section titled “Common DBMS Design”Engineers building a block-summary index converge on a recurring set of choices. Naming them first lets PostgreSQL’s specific symbols read as positions in a shared design space.
Fixed-size page ranges and a computed translation map
Section titled “Fixed-size page ranges and a computed translation map”Almost every zone-map-style structure groups the base table into fixed-size
runs of storage blocks and keeps one summary per run. Fixed size is the key
simplification: the summary for the block containing tuple t is found by
integer division t_block / blocks_per_run, with no search. Variable-size
runs (self-tuning the granularity to data density) are perennially proposed
and almost never shipped, because they reintroduce a lookup structure. The
PostgreSQL README’s “Future improvements” section explicitly lists
different-size page ranges as unimplemented.
One-sided summaries: bounds, not membership
Section titled “One-sided summaries: bounds, not membership”The summary must be conservative: it may claim a range might match when it does not (false positive, costing a wasted page read and recheck), but it must never claim a range cannot match when it does (false negative, a correctness bug). The two dominant summary shapes are:
- MinMax — store
minandmaxof the ordered type. Answers range and equality predicates (<,<=,=,>=,>). - Bounding box / inclusion — store a geometric or set-theoretic envelope (a box for points, a union for ranges). Answers containment and overlap.
Both are unions: merging two ranges’ summaries means widening the bound to cover both, an operation that is associative and commutative — which is what lets summaries be built in parallel and merged.
Lossy, recheck-mandatory scans
Section titled “Lossy, recheck-mandatory scans”Because the structure is approximate, the scan operator must always pair with a recheck: the engine reads every page the index nominated, then re-applies the original predicate to each tuple. The index’s job is purely to prune the set of pages read. This is identical in spirit to a Bloom-filtered scan or a lossy bitmap index, and it is why these indexes only ever feed a bitmap-heap-scan-like node, never an index-only or ordered scan.
Deferred, batch summarization
Section titled “Deferred, batch summarization”Maintaining a per-row index on insert is exactly the cost a summary index exists to avoid, so summarization is batched: a freshly written run of blocks is left unsummarized until a maintenance pass (a VACUUM-equivalent, a bulk-load hook, or a background trigger) scans the run once and computes its bound. Until then the run is treated as “might match everything” so queries stay correct. A subtle concurrency hazard appears here: while the maintenance pass scans a run, new tuples may land in it; the implementation must capture those, classically via a placeholder/two-phase protocol.
Crash safety via single-page atomic actions
Section titled “Crash safety via single-page atomic actions”Like any disk index, every mutation — summarize, update-in-place, move, desummarize, extend the map — must be a WAL-logged action that leaves the on-disk structure valid at any interruption point. The recurring technique is to decompose multi-page mutations so each WAL record covers the pages whose combined state must be atomic (e.g. the new summary page plus the map slot that points at it), and to make any orphaned intermediate state self-correcting under a later scan or VACUUM.
PostgreSQL’s Approach
Section titled “PostgreSQL’s Approach”PostgreSQL implements all five conventions, in src/backend/access/brin/.
The access method is registered by brinhandler, which advertises the
defining capability profile: no amgettuple, yes amgetbitmap, lossy,
multi-column, and summarizing.
// brinhandler — src/backend/access/brin/brin.camroutine->amcanorder = false;amroutine->amcanmulticol = true;amroutine->amoptionalkey = true;amroutine->amsearchnulls = true;amroutine->amstorage = true; /* opclass picks the stored type */amroutine->amcanparallel = false; /* scans are not parallel ... */amroutine->amcanbuildparallel = true; /* ... but builds are */amroutine->amsummarizing = true; /* inserts may trigger autosummarize */amroutine->amgettuple = NULL; /* no precise probe — lossy only */amroutine->amgetbitmap = bringetbitmap;The on-disk index has exactly three page types, tagged in the page’s special
area so pg_filedump and pageinspect can recognize them:
// brin_page.h — page-type tags live in the last half-word of every page#define BRIN_PAGETYPE_META 0xF091 /* block 0 */#define BRIN_PAGETYPE_REVMAP 0xF092 /* the range map, right after meta */#define BRIN_PAGETYPE_REGULAR 0xF093 /* summary (and placeholder) tuples */Block 0 is the metapage; the revmap occupies a contiguous run of blocks starting at block 1 and growing as the table grows; everything else is a regular page holding summary tuples. The layout and the scan/insert flows are summarized below.
flowchart TD
subgraph IDX["BRIN index relation (main fork)"]
META["blk 0: METAPAGE<br/>pagesPerRange, lastRevmapPage, version"]
RM["blks 1..lastRevmapPage: REVMAP pages<br/>rm_tids[]: one ItemPointer per page range"]
REG["blks > lastRevmapPage: REGULAR pages<br/>summary tuples (min/max, null flags)"]
end
HEAP["Heap: page ranges of pagesPerRange blocks each<br/>range i = heap blocks [i*ppr .. i*ppr+ppr-1]"]
RM -- "slot i -> TID" --> REG
REG -- "summarizes" --> HEAP
HEAP -- "heapBlk / ppr -> revmap slot" --> RM
The per-range summary tuple
Section titled “The per-range summary tuple”The opclass defines what is stored. For the default minmax opclass each
indexed column contributes two stored Datums — the min and the max —
described by a BrinOpcInfo with oi_nstored = 2:
// brin_minmax_opcinfo — src/backend/access/brin/brin_minmax.cresult = palloc0(MAXALIGN(SizeofBrinOpcInfo(2)) + sizeof(MinmaxOpaque));result->oi_nstored = 2; /* min and max */result->oi_regular_nulls = true; /* AM handles NULL bookkeeping generically */result->oi_typcache[0] = result->oi_typcache[1] = lookup_type_cache(typoid, 0);In memory a summary is a BrinMemTuple carrying one BrinValues per column;
the two generic NULL flags (bv_hasnulls, bv_allnulls) plus the
opclass-specific bv_values[] capture the whole summary, and tuple-level
flags mark placeholder and empty-range tuples:
// brin_tuple.h — the deformed (in-memory) summarytypedef struct BrinValues { AttrNumber bv_attno; bool bv_hasnulls; /* any NULL in the range? */ bool bv_allnulls; /* all values NULL in the range? */ Datum *bv_values; /* opclass-stored values (min,max for minmax) */ void *bv_mem_value;} BrinValues;
typedef struct BrinMemTuple { bool bt_placeholder; /* summarization in progress */ bool bt_empty_range; /* range provably has no tuples */ BlockNumber bt_blkno; /* first heap block of the range */ /* ... */ BrinValues bt_columns[FLEXIBLE_ARRAY_MEMBER];} BrinMemTuple;add_values_to_range is the per-tuple accumulator shared by build and
insert: for each column it either records a NULL locally or calls the
opclass BRIN_PROCNUM_ADDVALUE (e.g. brin_minmax_add_value) to widen the
bound, returning whether the summary changed.
// add_values_to_range — src/backend/access/brin/brin.c (condensed)bool modified = dtup->bt_empty_range;for (keyno = 0; keyno < bdesc->bd_tupdesc->natts; keyno++) { BrinValues *bval = &dtup->bt_columns[keyno]; if (bdesc->bd_info[keyno]->oi_regular_nulls && nulls[keyno]) { if (!bval->bv_hasnulls) { bval->bv_hasnulls = true; modified = true; } continue; } result = FunctionCall4Coll(addValue, /* BRIN_PROCNUM_ADDVALUE */ idxRel->rd_indcollation[keyno], PointerGetDatum(bdesc), PointerGetDatum(bval), values[keyno], nulls[keyno]); modified |= DatumGetBool(result);}dtup->bt_empty_range = false;return modified;The minmax addValue is the heart of the “bound, not value” design: a value
that already lies inside [min, max] changes nothing (and returns false, so
no page write); only an out-of-bounds value widens the summary.
// brin_minmax_add_value — brin_minmax.c (condensed)if (column->bv_allnulls) { /* first non-null seen: min = max = newval */ column->bv_values[0] = datumCopy(newval, attr->attbyval, attr->attlen); column->bv_values[1] = datumCopy(newval, attr->attbyval, attr->attlen); column->bv_allnulls = false; PG_RETURN_BOOL(true);}cmpFn = minmax_get_strategy_procinfo(bdesc, attno, attr->atttypid, BTLessStrategyNumber);if (DatumGetBool(FunctionCall2Coll(cmpFn, colloid, newval, column->bv_values[0]))) { column->bv_values[0] = datumCopy(newval, ...); updated = true; /* new min */}cmpFn = minmax_get_strategy_procinfo(bdesc, attno, attr->atttypid, BTGreaterStrategyNumber);if (DatumGetBool(FunctionCall2Coll(cmpFn, colloid, newval, column->bv_values[1]))) { column->bv_values[1] = datumCopy(newval, ...); updated = true; /* new max */}PG_RETURN_BOOL(updated);The range map (revmap)
Section titled “The range map (revmap)”The revmap is a flat array of ItemPointerData spread across consecutive
revmap pages. Two macros do all the addressing — there is no search:
// brin_revmap.c — heap block -> (revmap page, slot) by pure arithmetic#define HEAPBLK_TO_REVMAP_BLK(pagesPerRange, heapBlk) \ ((heapBlk / pagesPerRange) / REVMAP_PAGE_MAXITEMS)#define HEAPBLK_TO_REVMAP_INDEX(pagesPerRange, heapBlk) \ ((heapBlk / pagesPerRange) % REVMAP_PAGE_MAXITEMS)heapBlk / pagesPerRange is the range number; dividing by
REVMAP_PAGE_MAXITEMS picks the revmap page (then +1 to skip the
metapage), and the remainder is the slot within it. brinGetTupleForHeapBlock
is the workhorse that turns a heap block into the summary BrinTuple: read
the revmap slot, follow the TID to a regular page, and — crucially — verify
tup->bt_blkno == heapBlk, because the revmap could have been repointed
concurrently. A mismatch means retry; a self-referential loop is detected and
raised as corruption.
// brinGetTupleForHeapBlock — brin_revmap.c (condensed)heapBlk = (heapBlk / revmap->rm_pagesPerRange) * revmap->rm_pagesPerRange;mapBlk = revmap_get_blkno(revmap, heapBlk);if (mapBlk == InvalidBlockNumber) return NULL; /* range not summarized */for (;;) { /* read revmap slot under SHARE lock */ iptr = contents->rm_tids + HEAPBLK_TO_REVMAP_INDEX(revmap->rm_pagesPerRange, heapBlk); if (!ItemPointerIsValid(iptr)) return NULL; /* unsummarized */ if (ItemPointerIsValid(&previptr) && ItemPointerEquals(&previptr, iptr)) ereport(ERROR, (errcode(ERRCODE_INDEX_CORRUPTED), ...)); /* loop guard */ previptr = *iptr; blk = ItemPointerGetBlockNumber(iptr); *off = ItemPointerGetOffsetNumber(iptr); /* fetch the regular page, lock in requested mode */ if (BRIN_IS_REGULAR_PAGE(page)) { tup = (BrinTuple *) PageGetItem(page, lp); if (tup->bt_blkno == heapBlk) return tup; /* found it */ } LockBuffer(*buf, BUFFER_LOCK_UNLOCK); /* repointed concurrently; retry */}Setting a slot is symmetric and trivial — brinSetHeapBlockItemptr writes
(or invalidates) the ItemPointerData at the computed slot; it is the same
routine used in WAL replay, which is why the revmap update is replay-safe.
Lossy bitmap scan
Section titled “Lossy bitmap scan”bringetbitmap is the only scan entry point. It does not descend anything:
it learns the heap size, then walks the revmap linearly from block 0 in
steps of pagesPerRange, and for each range either emits all its pages
(unsummarized, placeholder, or matching) or skips them (consistent function
says no). Every emitted page goes into a TIDBitmap at page granularity —
inherently lossy — for the BitmapHeapScan recheck to filter.
flowchart TD
S["bringetbitmap: nblocks = RelationGetNumberOfBlocks(heap)"] --> L{"heapBlk < nblocks?"}
L -- no --> DONE["return totalpages * 10"]
L -- yes --> G["brinGetTupleForHeapBlock(heapBlk)"]
G --> T{"got a summary tuple?"}
T -- "no (unsummarized)" --> ADD["addrange = true"]
T -- "yes" --> PH{"placeholder?"}
PH -- yes --> ADD
PH -- no --> CK["for each attr with keys:<br/>check null keys, then<br/>opclass consistent()"]
CK --> M{"all keys consistent?"}
M -- no --> SKIP["addrange = false"]
M -- yes --> ADD
ADD --> E["tbm_add_page for every page in range"]
SKIP --> N["heapBlk += pagesPerRange"]
E --> N
N --> L
The matching logic lives inline in the scan loop; the key calls are
check_null_keys (handles IS [NOT] NULL and the all-nulls short-circuit)
and the opclass BRIN_PROCNUM_CONSISTENT:
// bringetbitmap — brin.c (condensed): per-range decisionif (!gottuple) addrange = true; /* unsummarized -> keep */else if (dtup->bt_placeholder) addrange = true; /* in-progress -> keep */else { addrange = true; /* default keep; no keys => match */ for (attno = 1; attno <= bdesc->bd_tupdesc->natts; attno++) { if (dtup->bt_empty_range) { addrange = false; break; } /* provably empty */ if (bdesc->bd_info[attno-1]->oi_regular_nulls && !check_null_keys(bval, nullkeys[attno-1], nnullkeys[attno-1])) { addrange = false; break; } if (bval->bv_allnulls) { addrange = false; break; } /* strict ops */ add = FunctionCall3Coll(&consistentFn[attno-1], collation, PointerGetDatum(bdesc), PointerGetDatum(bval), PointerGetDatum(keys[attno-1][keyno])); addrange = DatumGetBool(add); if (!addrange) break; }}if (addrange) for (pageno = heapBlk; pageno <= Min(nblocks, heapBlk+ppr)-1; pageno++) tbm_add_page(tbm, pageno); /* page-granular, lossy */The minmax consistent function is the dual of addValue: it translates a
scan-key strategy into a comparison against the stored bound — </<= test
the min, >/>= test the max, and = requires min <= key <= max:
// brin_minmax_consistent — brin_minmax.c (condensed)switch (key->sk_strategy) { case BTLessStrategyNumber: case BTLessEqualStrategyNumber: matches = FunctionCall2Coll(finfo, colloid, column->bv_values[0] /*min*/, value); break; case BTEqualStrategyNumber: /* min <= key AND max >= key */ matches = FunctionCall2Coll(/*<=*/finfo, colloid, column->bv_values[0], value); if (!DatumGetBool(matches)) break; matches = FunctionCall2Coll(/*>=*/finfo2, colloid, column->bv_values[1], value); break; case BTGreaterEqualStrategyNumber: case BTGreaterStrategyNumber: matches = FunctionCall2Coll(finfo, colloid, column->bv_values[1] /*max*/, value); break;}PG_RETURN_DATUM(matches);Maintenance: insert narrows, VACUUM summarizes
Section titled “Maintenance: insert narrows, VACUUM summarizes”brininsert does not create entries. It finds the summary covering the
inserted tuple’s range; if the range is unsummarized it returns immediately,
otherwise it asks add_values_to_range whether the bound must widen and, if
so, applies brin_doupdate. New ranges are summarized lazily by VACUUM,
brin_summarize_new_values(), or the autosummarize reloption. The
opclass-defined union merges two summaries (used both to fold a
worker-built range into the leader’s during parallel build and to re-merge
the placeholder during concurrent summarization). The next section traces
these flows symbol by symbol.
Source Walkthrough
Section titled “Source Walkthrough”AM surface and descriptors
Section titled “AM surface and descriptors”brinhandler (brin.c) fills the IndexAmRoutine and is the entry point
for the generic AM dispatch (see postgres-index-am.md). brin_build_desc
constructs the BrinDesc — the per-index runtime descriptor — by calling
each column’s BRIN_PROCNUM_OPCINFO to learn oi_nstored and the stored
types. The four generic support procedures every opclass must provide are
numbered in brin_internal.h:
// brin_internal.h — generic opclass support procedure numbers#define BRIN_PROCNUM_OPCINFO 1 /* describe the stored layout */#define BRIN_PROCNUM_ADDVALUE 2 /* widen a summary with a heap value */#define BRIN_PROCNUM_CONSISTENT 3 /* does a summary match a scan key? */#define BRIN_PROCNUM_UNION 4 /* merge two summaries */#define BRIN_PROCNUM_OPTIONS 5 /* optional */#define BRIN_LAST_OPTIONAL_PROCNUM 15brin_build_desc walks the index tuple descriptor and accumulates
bd_totalstored; the resulting bdesc->bd_info[] array of BrinOpcInfo *
is threaded through every add/consistent/union call.
Revmap: addressing, fetch, and physical extension
Section titled “Revmap: addressing, fetch, and physical extension”The revmap access object BrinRevmap is created by brinRevmapInitialize,
which reads pagesPerRange and lastRevmapPage from the metapage and caches
them. revmap_get_blkno converts a heap block to a physical revmap block (or
InvalidBlockNumber if that revmap page does not exist yet — meaning the
range is unsummarized):
// revmap_get_blkno — brin_revmap.ctargetblk = HEAPBLK_TO_REVMAP_BLK(revmap->rm_pagesPerRange, heapBlk) + 1; /* +1 skips meta */if (targetblk <= revmap->rm_lastRevmapPage) return targetblk;return InvalidBlockNumber;When a summary must be inserted for a range whose revmap page does not exist,
revmap_extend_and_get_blkno loops calling revmap_physical_extend. This is
the most intricate routine in the revmap: it locks the metapage (serializing
revmap growth), grabs the next block — reading it if it already exists as a
regular page, else extending the relation — and if that block already holds
summary tuples, evacuates them first. This is the “Whenever the revmap needs
to be extended by another page, existing tuples in that page are moved to some
other page” behavior from the README.
// revmap_physical_extend — brin_revmap.c (condensed)LockBuffer(revmap->rm_metaBuf, BUFFER_LOCK_EXCLUSIVE); /* serialize extension */if (metadata->lastRevmapPage != revmap->rm_lastRevmapPage) { revmap->rm_lastRevmapPage = metadata->lastRevmapPage; /* stale cache: caller retries */ LockBuffer(revmap->rm_metaBuf, BUFFER_LOCK_UNLOCK); return;}mapBlk = metadata->lastRevmapPage + 1;/* ... obtain buffer for mapBlk (read or ExtendBufferedRel) ... */if (brin_start_evacuating_page(irel, buf)) { /* page holds tuples? */ LockBuffer(revmap->rm_metaBuf, BUFFER_LOCK_UNLOCK); brin_evacuate_page(irel, revmap->rm_pagesPerRange, revmap, buf); return; /* caller starts over */}START_CRIT_SECTION();brin_page_init(page, BRIN_PAGETYPE_REVMAP); /* retype as revmap */metadata->lastRevmapPage = mapBlk;/* set metapage pd_lower, MarkBufferDirty, WAL: XLOG_BRIN_REVMAP_EXTEND */END_CRIT_SECTION();brinGetTupleForHeapBlock (quoted in section 3) is the read path;
brinSetHeapBlockItemptr is the write path and is shared with WAL replay.
brinRevmapDesummarizeRange is the reverse of summarization: it locks the
revmap slot and the regular page, deletes the summary tuple with
PageIndexTupleDeleteNoCompact, invalidates the slot, and WAL-logs
XLOG_BRIN_DESUMMARIZE. It returns false (retry) if the regular page turned
into a revmap page underneath it.
// brinRevmapDesummarizeRange — brin_revmap.c (condensed)START_CRIT_SECTION();ItemPointerSetInvalid(&invalidIptr);brinSetHeapBlockItemptr(revmapBuf, revmap->rm_pagesPerRange, heapBlk, invalidIptr);PageIndexTupleDeleteNoCompact(regPg, regOffset); /* drop the summary tuple */MarkBufferDirty(regBuf); MarkBufferDirty(revmapBuf);/* WAL: xl_brin_desummarize {pagesPerRange, heapBlk, regOffset} */END_CRIT_SECTION();Page operations: insert, same-page update, move, evacuate
Section titled “Page operations: insert, same-page update, move, evacuate”brin_doinsert (brin_pageops.c) places a new summary (or placeholder, or
empty-range) tuple: it ensures the revmap is long enough
(brinRevmapExtend), finds an insert buffer (brin_getinsertbuffer, which
consults the FSM and may extend the relation), PageAddItems the tuple,
points the revmap slot at it with brinSetHeapBlockItemptr, and WAL-logs
XLOG_BRIN_INSERT — all under one critical section so the tuple and the
slot pointing at it become durable together.
brin_doupdate is the in-place narrowing path. The decision between a
cheap same-page overwrite and an expensive move hinges on
brin_can_do_samepage_update:
// brin_can_do_samepage_update — brin_pageops.creturn ((newsz <= origsz) || PageGetExactFreeSpace(BufferGetPage(buffer)) >= (newsz - origsz));If the new (wider) summary fits, brin_doupdate does a
PageIndexTupleOverwrite and logs XLOG_BRIN_SAMEPAGE_UPDATE — the revmap
slot does not change, because the tuple stays at the same TID. If it does not
fit, the new tuple is written to a different page, the old one is deleted,
and the revmap slot is repointed to the new TID, logged as
XLOG_BRIN_UPDATE covering new page + revmap + old page:
// brin_doupdate — brin_pageops.c (the move branch, condensed)revmapbuf = brinLockRevmapPageForUpdate(revmap, heapBlk);START_CRIT_SECTION();PageIndexTupleDeleteNoCompact(oldpage, oldoff); /* remove old summary */newoff = PageAddItem(newpage, newtup, newsz, ...); /* write new one */ItemPointerSet(&newtid, newblk, newoff);brinSetHeapBlockItemptr(revmapbuf, pagesPerRange, heapBlk, newtid); /* repoint */MarkBufferDirty(oldbuf); MarkBufferDirty(newbuf); MarkBufferDirty(revmapbuf);/* WAL: XLOG_BRIN_UPDATE (+ INIT_PAGE if extended) over all three buffers */END_CRIT_SECTION();brin_doupdate is careful about concurrency: before committing it re-checks
that the old tuple is still a normal item on a still-regular page and that
its bytes match the origtup snapshot (brin_tuples_equal); if a concurrent
inserter changed it, brin_doupdate returns false and brininsert restarts
its loop, re-reading the now-current summary so neither inserter’s value is
lost.
The evacuation pair supports revmap growth. brin_start_evacuating_page sets
the non-WAL-logged BRIN_EVACUATE_PAGE flag so br_page_get_freespace
reports zero (no new tuples land there), and brin_evacuate_page moves every
tuple off the page by re-running brin_doupdate for each, then releases it
for retyping as a revmap page.
Scan: bringetbitmap
Section titled “Scan: bringetbitmap”brinbeginscan builds a BrinOpaque (revmap access + BrinDesc).
bringetbitmap (section 3) preprocesses scan keys into per-attribute arrays,
splitting IS [NOT] NULL keys (nullkeys) from regular keys (keys), then
runs the revmap walk. It lazily looks up each attribute’s
BRIN_PROCNUM_CONSISTENT once. Note the opclass may take either signature:
if consistentFn.fn_nargs >= 4 the AM passes all keys at once
(FunctionCall4Coll), otherwise one at a time (FunctionCall3Coll) — minmax
uses the 3-arg form. brinrescan copies new scan keys; brinendscan frees
the opaque. The return value totalpages * 10 is a deliberately crude row
estimate (the AM only knows pages, not tuples).
Build: serial, parallel, and empty ranges
Section titled “Build: serial, parallel, and empty ranges”brinbuild initializes the metapage (brin_metapage_init, WAL
XLOG_BRIN_CREATE_INDEX) and a BrinBuildState, then either launches
parallel workers (_brin_begin_parallel / _brin_parallel_merge) or scans
the heap serially in physical order with table_index_build_scan feeding
brinbuildCallback. The callback accumulates tuples into the running range,
and whenever a heap block crosses into the next range it flushes via
form_and_insert_tuple:
// brinbuildCallback — brin.c (condensed)while (thisblock > state->bs_currRangeStart + state->bs_pagesPerRange - 1) { form_and_insert_tuple(state); /* flush completed range */ state->bs_currRangeStart += state->bs_pagesPerRange; brin_memtuple_initialize(state->bs_dtuple, state->bs_bdesc);}(void) add_values_to_range(index, state->bs_bdesc, state->bs_dtuple, values, isnull);After the scan brinbuild flushes the final partial range and calls
brin_fill_empty_ranges to backfill empty-range summary tuples for any
trailing ranges the scan skipped (heap blocks with no live tuples). Empty
ranges matter for correctness-cheapness: a predicate like
WHERE nonnull_col IS NULL can prune an empty range instead of treating the
gap as unsummarized (and therefore always-matching). The parallel callback
brinbuildCallbackParallel instead spills each range to a shared tuplesort
(form_and_spill_tuple); the leader merges per-worker results and fills empty
ranges itself.
Summarization and the placeholder protocol
Section titled “Summarization and the placeholder protocol”brinsummarize scans the revmap and, for each unsummarized range (NULL
slot), calls summarize_range. This is the concurrency-critical routine. It
cannot simply scan the heap range and insert a summary, because concurrent
inserters could add tuples to the range during the scan and be missed. The
solution is a placeholder tuple:
flowchart TD
A["summarize_range(heapBlk)"] --> B["brin_form_placeholder_tuple<br/>brin_doinsert: revmap now points to placeholder"]
B --> C["concurrent inserters update<br/>the placeholder via brininsert"]
B --> D["table_index_build_range_scan<br/>'any visible' over the heap range"]
D --> E["loop: brin_form_tuple from scan result"]
E --> F{"brin_doupdate placeholder -> summary?"}
F -- success --> G["done: revmap points to real summary"]
F -- "failed (placeholder changed)" --> H["re-read placeholder<br/>union_tuples into scan result"]
H --> E
// summarize_range — brin.c (condensed)phtup = brin_form_placeholder_tuple(state->bs_bdesc, heapBlk, &phsz);offset = brin_doinsert(state->bs_irel, ppr, state->bs_rmAccess, &phbuf, heapBlk, phtup, phsz);/* scan the heap range; 'any visible' so in-progress inserts are included */table_index_build_range_scan(heapRel, state->bs_irel, indexInfo, false, true, false, heapBlk, scanNumBlks, brinbuildCallback, state, NULL);for (;;) { newtup = brin_form_tuple(state->bs_bdesc, heapBlk, state->bs_dtuple, &newsize); didupdate = brin_doupdate(..., phbuf, offset, phtup, phsz, newtup, newsize, samepage); if (didupdate) break; /* placeholder -> summary committed */ /* someone updated the placeholder concurrently: merge and retry */ phtup = brinGetTupleForHeapBlock(state->bs_rmAccess, heapBlk, &phbuf, &offset, &phsz, ...); union_tuples(state->bs_bdesc, state->bs_dtuple, phtup);}union_tuples (and the opclass brin_minmax_union) merges the placeholder’s
accumulated values into the scan result, handling the empty/all-null cases
before invoking BRIN_PROCNUM_UNION per column. The SQL entry points are
brin_summarize_range / brin_summarize_new_values (the latter passing
BRIN_ALL_BLOCKRANGES), both taking ShareUpdateExclusiveLock and refusing
to run during recovery. autosummarize is triggered from brininsert: when
the first tuple lands in the first block of a new range, it requests
AVW_BRINSummarizeRange work from autovacuum for the previous range.
// brininsert — brin.c (autosummarize trigger, condensed)if (autosummarize && heapBlk > 0 && heapBlk == origHeapBlk && ItemPointerGetOffsetNumber(heaptid) == FirstOffsetNumber) { lastPageTuple = brinGetTupleForHeapBlock(revmap, heapBlk - 1, &buf, &off, NULL, BUFFER_LOCK_SHARE); if (!lastPageTuple) AutoVacuumRequestWork(AVW_BRINSummarizeRange, RelationGetRelid(idxRel), heapBlk - 1);}VACUUM and delete
Section titled “VACUUM and delete”brinbulkdelete is essentially a no-op — BRIN stores no per-tuple TIDs, so
heap tuple removal needs no index cleanup (the README: tightening a summary
after deletes is “an optimization opportunity only, not a correctness
issue”). brinvacuumcleanup does the real VACUUM work: brin_vacuum_scan
walks the index in physical order repairing any uninitialized pages via
brin_page_cleanup (recovering FSM space lost to a crash mid-extension), then
brinsummarize summarizes all still-unsummarized ranges (excluding the
partial tail, since include_partial = false for VACUUM).
The inclusion opclass (contrast)
Section titled “The inclusion opclass (contrast)”The inclusion opclass generalizes minmax from a scalar [min,max] to a
union envelope and stores three values: the union, an unmergeable flag,
and a contains_empty flag (oi_nstored = 3). It is what backs
box/range/inet BRIN, and it uses extra support procedures
PROCNUM_MERGE (11, required) and PROCNUM_MERGEABLE (12) beyond the four
generic ones. Its geometric internals belong to postgres-gist.md; what
matters here is that the framework — opcinfo/addValue/consistent/union, the
revmap, the lossy scan — is opclass-agnostic, and a new summary kind is just
a new set of these five callbacks.
Position hints (as of 2026-06-05, REL_18 273fe94)
Section titled “Position hints (as of 2026-06-05, REL_18 273fe94)”| Symbol | File | Line |
|---|---|---|
brinhandler | src/backend/access/brin/brin.c | 250 |
brininsert | src/backend/access/brin/brin.c | 344 |
bringetbitmap | src/backend/access/brin/brin.c | 567 |
brinbuildCallback | src/backend/access/brin/brin.c | 995 |
brinbuild | src/backend/access/brin/brin.c | 1105 |
brinbulkdelete | src/backend/access/brin/brin.c | 1302 |
brinvacuumcleanup | src/backend/access/brin/brin.c | 1317 |
brin_summarize_range | src/backend/access/brin/brin.c | 1381 |
brin_desummarize_range | src/backend/access/brin/brin.c | 1491 |
brin_build_desc | src/backend/access/brin/brin.c | 1581 |
summarize_range | src/backend/access/brin/brin.c | 1761 |
brinsummarize | src/backend/access/brin/brin.c | 1887 |
union_tuples | src/backend/access/brin/brin.c | 2031 |
brin_vacuum_scan | src/backend/access/brin/brin.c | 2172 |
add_values_to_range | src/backend/access/brin/brin.c | 2205 |
check_null_keys | src/backend/access/brin/brin.c | 2298 |
brin_fill_empty_ranges | src/backend/access/brin/brin.c | 2993 |
brinRevmapInitialize | src/backend/access/brin/brin_revmap.c | 70 |
brinSetHeapBlockItemptr | src/backend/access/brin/brin_revmap.c | 155 |
brinGetTupleForHeapBlock | src/backend/access/brin/brin_revmap.c | 194 |
brinRevmapDesummarizeRange | src/backend/access/brin/brin_revmap.c | 323 |
revmap_get_blkno | src/backend/access/brin/brin_revmap.c | 442 |
revmap_physical_extend | src/backend/access/brin/brin_revmap.c | 522 |
HEAPBLK_TO_REVMAP_BLK / _INDEX | src/backend/access/brin/brin_revmap.c | 40 |
brin_doupdate | src/backend/access/brin/brin_pageops.c | 53 |
brin_can_do_samepage_update | src/backend/access/brin/brin_pageops.c | 323 |
brin_doinsert | src/backend/access/brin/brin_pageops.c | 342 |
brin_metapage_init | src/backend/access/brin/brin_pageops.c | 486 |
brin_start_evacuating_page | src/backend/access/brin/brin_pageops.c | 524 |
brin_evacuate_page | src/backend/access/brin/brin_pageops.c | 564 |
brin_getinsertbuffer | src/backend/access/brin/brin_pageops.c | 689 |
brin_minmax_opcinfo | src/backend/access/brin/brin_minmax.c | 34 |
brin_minmax_add_value | src/backend/access/brin/brin_minmax.c | 64 |
brin_minmax_consistent | src/backend/access/brin/brin_minmax.c | 137 |
brin_minmax_union | src/backend/access/brin/brin_minmax.c | 208 |
BrinMetaPageData / RevmapContents | src/include/access/brin_page.h | 64 / 78 |
REVMAP_PAGE_MAXITEMS | src/include/access/brin_page.h | 93 |
BrinValues / BrinMemTuple | src/include/access/brin_tuple.h | 29 / 44 |
BRIN_DEFAULT_PAGES_PER_RANGE | src/include/access/brin.h | 39 |
Source verification (as of 2026-06-05)
Section titled “Source verification (as of 2026-06-05)”Verified against the REL_18_STABLE checkout at commit 273fe94 (PG 18.x),
source tree /data/hgryoo/references/postgres.
- Page types and metapage layout —
brin_page.hdefines exactly three page types (BRIN_PAGETYPE_META0xF091,_REVMAP0xF092,_REGULAR0xF093),BRIN_METAPAGE_BLKNO0,BRIN_CURRENT_VERSION1, andBrinMetaPageData { brinMagic, brinVersion, pagesPerRange, lastRevmapPage }. Confirmed. - Revmap arithmetic —
HEAPBLK_TO_REVMAP_BLKandHEAPBLK_TO_REVMAP_INDEXinbrin_revmap.ccompute page and slot purely fromheapBlk,pagesPerRange, andREVMAP_PAGE_MAXITEMS;+1skips the metapage.revmap_get_blknoreturnsInvalidBlockNumberfor not-yet- allocated revmap pages (= unsummarized). Confirmed verbatim. - No amgettuple —
brinhandlersetsamgettuple = NULLandamgetbitmap = bringetbitmap;amcanparallel = falsebutamcanbuildparallel = true;amsummarizing = true. Confirmed. - Default granularity —
BRIN_DEFAULT_PAGES_PER_RANGEis128inbrin.h. Confirmed. - minmax stored layout —
brin_minmax_opcinfosetsoi_nstored = 2(bv_values[0]= min,[1]= max) andoi_regular_nulls = true;inclusionsetsoi_nstored = 3. Confirmed. - Generic procnums —
BRIN_PROCNUM_OPCINFO=1,ADDVALUE=2,CONSISTENT=3,UNION=4 inbrin_internal.h; inclusion addsPROCNUM_MERGE=11,PROCNUM_MERGEABLE=12. Confirmed. - Placeholder protocol —
summarize_rangeinsertsbrin_form_placeholder_tuple, scans withtable_index_build_range_scan(any-visible mode), then loopsbrin_doupdate+union_tupleson conflict. Confirmed. - WAL record kinds named —
XLOG_BRIN_CREATE_INDEX,_INSERT,_SAMEPAGE_UPDATE,_UPDATE,_REVMAP_EXTEND,_DESUMMARIZEare emitted by the routines as described. Confirmed againstbrin.c,brin_pageops.c,brin_revmap.c. (WAL replay handlers live inbrin_xlog.c, not walked here.) - Concurrency loop-guard —
brinGetTupleForHeapBlockdetects an unchanged revmap TID across iterations and raisesERRCODE_INDEX_CORRUPTED. Confirmed. - Scope deferrals honored — generic
IndexAmRoutinedispatch is left topostgres-index-am.md; inclusion-opclass geometric internals and the bounding-box machinery are left topostgres-gist.md; buffer/WAL primitives are named but not re-derived. - Caveat — line numbers in the position-hint table are hints scoped to
273fe94; symbol names are the durable anchors. The inclusion opclass is summarized for contrast only and not exhaustively walked.
Beyond PostgreSQL — Comparative Designs & Research Frontiers
Section titled “Beyond PostgreSQL — Comparative Designs & Research Frontiers”-
Moerkotte’s SMA vs. BRIN — the same idea, two granularities. The foundational paper, Moerkotte 1998 (“Small Materialized Aggregates”), proposed keeping per-partition aggregates (min, max, sum, count) so the optimizer could prune partitions and even answer aggregate queries without touching the data. BRIN is the pruning half of SMA, specialized to a storage-block partition and a fixed
(min,max)summary, and deliberately not the aggregate-answering half: a BRIN scan still rechecks every nominated page, it never returns the summary as a query answer. A note comparing what PostgreSQL gives up by not materializingsum/count(and whatpg_stats-driven planning recovers instead) would sharpen the lineage. Paper anchor:dbms-papers/(SMA / data-warehouse indexing). -
Zone maps and storage indexes (Netezza, Oracle Exadata, Amazon Redshift). Commercial column/appliance engines ship the same structure under different names — zone maps keep per-zone min/max maintained automatically by the storage layer, and Exadata storage indexes live in RAM on the storage cells. The architectural contrast is who owns the summary: in those systems it is an opaque, non-user-visible artifact of the storage engine, refreshed transparently; in PostgreSQL BRIN is a first-class,
CREATE INDEX-built,VACUUM-maintained relation with a user-tunablepages_per_rangeand explicitbrin_summarize_*controls. PostgreSQL trades the appliance’s invisibility for inspectability (pageinspect’sbrin_page_items,brin_revmap_data) and DBA leverage. -
In-core BRIN opclasses beyond minmax/inclusion. REL_18 ships two further summary kinds in the same framework, both worth a comparative note:
brin_minmax_multi(brin_minmax_multi.c,oi_nstored = 1) stores a serialized set of value ranges instead of a single[min,max], so a column with a few outliers or mild clustering does not collapse every range’s summary to the full domain — it is the practical answer to BRIN’s Achilles heel (low correlation).brin_bloom(brin_bloom.c) summarizes a range into a bloom filter, trading the ordered-type requirement for equality-only pruning on uncorrelated but low-cardinality columns. Both reuse the revmap, the lossy scan, and the opcinfo/addValue/consistent/union contract unchanged — concrete proof that the framework dissected here is opclass-agnostic. -
The correlation cliff and self-tuning ranges. BRIN’s effectiveness is a step function of
pg_stats.correlation; the README’s own “Future improvements” section floats different-size page ranges that self-tune granularity to data density (finer where summaries would otherwise be loose, coarser where they are already tight), explicitly noting this “can probably be made to work by using the range map as an index in itself.” That is an unrealized research frontier: it reintroduces a search into the revmap and so trades the O(1) translation that defines BRIN today for adaptivity. A prototype measuring pruning gain vs. revmap-lookup cost would directly test the README’s conjecture. -
Lossy-bitmap representation cost. The second README “Future improvements” bullet observes that
TIDBitmap’s lossy page-range representation (aBitmapsetper range) is suboptimal for BRIN, which always emits whole contiguous page ranges. A range-aware bitmap encoding is a self-contained optimization that would shrink the bridge betweenbringetbitmapand the BitmapHeapScan recheck; it belongs topostgres-executor.md’sTIDBitmaprather than to BRIN proper, but BRIN is its motivating workload. -
Column imprints and learned summaries. Research structures such as column imprints (Sidirourgos & Kersten, SIGMOD 2013) and learned/ML-based zone maps push the summary toward a cache-line-granular bit-vector or a learned predicate, attacking the same false-positive rate BRIN’s fixed
[min,max]accepts. They frame the open question BRIN answers most conservatively: how much summary precision is worth how much index size and maintenance cost — the size-vs-precision trade named in Section 1, now a live design axis rather than a single fixed point.
Sources
Section titled “Sources”In-tree READMEs and source files (REL_18_STABLE, commit 273fe94)
Section titled “In-tree READMEs and source files (REL_18_STABLE, commit 273fe94)”src/backend/access/brin/README— the design document: the per-range summary model, the range map and its fixed-arithmetic addressing, the evacuate-on-extend rule, summarization and the placeholder/concurrent-insert protocol, the “deletes don’t shrink summaries is an optimization opportunity only” note, and the “Future improvements” list (different-size page ranges, TIDBitmap representation).src/backend/access/brin/brin.c—brinhandler,brininsert(with theautosummarizetrigger),bringetbitmap,brinbuild/brinbuildCallback/brinbuildCallbackParallel,brin_build_desc,summarize_range/brinsummarize,union_tuples,add_values_to_range,check_null_keys,brin_fill_empty_ranges,brinbulkdelete/brinvacuumcleanup/brin_vacuum_scan, and the SQL entry pointsbrin_summarize_range/brin_summarize_new_values/brin_desummarize_range.src/backend/access/brin/brin_revmap.c—brinRevmapInitialize,HEAPBLK_TO_REVMAP_BLK/HEAPBLK_TO_REVMAP_INDEX,revmap_get_blkno,revmap_physical_extend,brinGetTupleForHeapBlock(the loop-guarded read path),brinSetHeapBlockItemptr(shared with WAL replay), andbrinRevmapDesummarizeRange.src/backend/access/brin/brin_pageops.c—brin_doinsert,brin_doupdate(same-page-overwrite vs. move-and-repoint branches),brin_can_do_samepage_update,brin_metapage_init,brin_start_evacuating_page/brin_evacuate_page,brin_getinsertbuffer.src/backend/access/brin/brin_minmax.c— the default opclass:brin_minmax_opcinfo(oi_nstored = 2),brin_minmax_add_value,brin_minmax_consistent,brin_minmax_union.src/backend/access/brin/brin_inclusion.c— the inclusion opclass (oi_nstored = 3,PROCNUM_MERGE/MERGEABLE), referenced for contrast.src/backend/access/brin/brin_minmax_multi.c,brin_bloom.c— additional in-core opclasses cited in Section 6.src/include/access/brin_page.h—BRIN_PAGETYPE_{META,REVMAP,REGULAR},BrinMetaPageData,RevmapContents,REVMAP_PAGE_MAXITEMS,BRIN_METAPAGE_BLKNO,BRIN_CURRENT_VERSION.src/include/access/brin_tuple.h—BrinValues,BrinMemTuple,BrinTuple, the placeholder/empty-range flags.src/include/access/brin_internal.h—BrinDesc,BrinOpcInfo, the genericBRIN_PROCNUM_*numbers.src/include/access/brin.h—BrinStatsData, reloptions, andBRIN_DEFAULT_PAGES_PER_RANGE(128).
Papers and textbook chapters
Section titled “Papers and textbook chapters”- Moerkotte, G. (1998). “Small Materialized Aggregates: A Light Weight
Index Structure for Data Warehousing.” VLDB 1998, 476-487. The conceptual
ancestor of BRIN; summary-per-partition pruning. Anchor in
knowledge/research/dbms-papers/. - Sidirourgos, L. & Kersten, M. (2013). “Column Imprints: A Secondary Index Structure.” SIGMOD 2013, 893-904. A finer-grained, bit-vector summary in the same false-positive-pruning family; the research frontier cited in Section 6.
- Database Internals (Petrov 2019) — the precise vs. approximate
secondary-structure distinction and bitmap/recheck framing
(
knowledge/research/dbms-general/database-internals.md). - Database System Concepts (Silberschatz, Korth, Sudarshan, 7e), ch. 14
“Indexing” — the selectivity/clustering cost model under which a
page-skipping scan beats a B-tree probe
(
knowledge/research/dbms-general/).
Sibling docs (cross-references — mechanism owned there, not duplicated here)
Section titled “Sibling docs (cross-references — mechanism owned there, not duplicated here)”postgres-index-am.md— the genericIndexAmRoutinedispatch contract thatbrinhandlerfills in (amgetbitmap, amsummarizing, parallel-build flags).postgres-gist.md— the inclusion-style geometric opclass internals and bounding-box machinery that BRIN’s inclusion opclass reuses, deferred there.postgres-buffer-manager.md— the buffer pins and content locks that the revmap and regular-page mutations latch on.postgres-vacuum.md/postgres-autovacuum.md— the VACUUM pass that callsbrinvacuumcleanup/brinsummarizeand the autovacuum worker that servicesAVW_BRINSummarizeRangeautosummarize requests.postgres-wal-records-rmgr.md— theRM_BRIN_IDresource manager and theXLOG_BRIN_*record kinds whose replay handlers live inbrin_xlog.c.postgres-architecture-overview.md— Axis 4 (pluggable access methods), where BRIN sits as the summarizing, lossy-bitmap-only index AM.