Skip to content

PostgreSQL BRIN — Block Range Indexes and the Range Map

Contents:

Every index family discussed so far in this code-analysis/postgres tree — the B-tree (postgres-nbtree.md), GiST, hash — shares one structural commitment: one index entry per heap tuple. That is what makes them precise (a probe returns exactly the matching TIDs) and what makes them expensive: the index of a billion-row table is itself hundreds of millions of entries, often a sizable fraction of the table’s own size, and every insert pays to thread a new entry into a balanced tree. For an analytic warehouse table that is append-mostly and physically clustered on the column you filter by — a fact table ordered by created_at, a sensor log ordered by timestamp, an order table ordered by order_id — that precision is mostly wasted. The query WHERE created_at BETWEEN x AND y does not need to know which rows match; it needs to know which disk pages cannot possibly contain a match, so the executor can skip reading them. The index that answers that cheaper question can itself be thousands of times smaller.

This is the idea of the block-oriented summary index, known in the literature and in other engines under several names: zone maps (Netezza, Oracle Exadata storage indexes), Small Materialized Aggregates (the foundational notion from Moerkotte’s 1998 VLDB paper “Small Materialized Aggregates: A Light Weight Index Structure for Data Warehousing”), MinMax indexes, and in PostgreSQL BRIN — Block Range INdex. The textbook framing lives in Database Internals (Petzold/Petrov; captured in research/dbms-general/database-internals.md) under the distinction between precise and approximate secondary structures, and in Database System Concepts (Silberschatz 7e, ch. 14 on indexing) under the cost model that makes a sequential scan with page-skipping competitive with a B-tree probe once the selectivity is low and the data is clustered.

The defining trade is size versus precision, and BRIN sits at the extreme small-and-imprecise corner:

  1. It indexes ranges of pages, not rows. Heap blocks are partitioned into fixed-size groups — page ranges — and the index stores one summary per range. With 128 pages per range (PostgreSQL’s default), the index has one entry per ~128 × ~226 ≈ 29,000 tuples. A multi-gigabyte table gets an index measured in kilobytes.

  2. The summary is a bound, not a value set. For an ordered type the summary is (min, max) of every value in the range; for a geometric type it is a bounding box. A query qual is checked against the bound: if the qual cannot be satisfied by any value inside [min, max], the whole range is skipped. If it might be, the range is returned for rechecking. This is the same one-sided logic as a Bloom filter: no false negatives, many false positives. A range is returned whenever it might match.

  3. Effectiveness depends entirely on physical correlation. If the indexed column is well correlated with heap block order, each range’s [min, max] is tight and disjoint from its neighbours, so a selective qual skips almost every range. If the column is shuffled, every range’s [min, max] spans nearly the whole domain, every range matches every qual, and the scan degenerates to a sequential scan plus useless index overhead. BRIN is therefore a specialized tool, not a default — pg_stats.correlation is the deciding statistic.

The second structural idea, orthogonal to summarization, is the translation layer. To skip pages, the scan must, for each page range, find that range’s summary in O(1) without descending a tree (a tree would reintroduce the per-entry cost we are trying to avoid). BRIN’s answer is the range map (revmap): a flat array, addressed by simple arithmetic on the heap block number, where slot i holds the TID of the summary tuple for page range i. The README states it directly: “Since the map entries are fixed size, it is possible to compute the address of the range map entry for any given heap page by simple arithmetic.” The revmap is the part of BRIN that has no analogue in a B-tree, and it organizes the rest of this document.

Two consequences follow immediately and shape everything downstream:

  1. BRIN can only produce a lossy bitmap, never a tuple stream. Because a range’s summary identifies pages that might match, the only honest output is “scan these pages and recheck.” PostgreSQL’s AM layer formalizes this: BRIN supports amgetbitmap but not amgettuple, and the bitmap it builds is intentionally lossy at page granularity.

  2. Maintenance is asymmetric and lazy. An insert can only ever widen a range’s [min, max] — and only if the new value falls outside it — so most inserts touch nothing. A new range at the end of a growing table is simply left unsummarized (and unsummarized ranges are always returned by scans, so correctness holds), with summarization deferred to VACUUM or an explicit/auto call. Deletes never shrink a summary; the README notes this is an “optimization opportunity only, not a correctness issue.”

Engineers building a block-summary index converge on a recurring set of choices. Naming them first lets PostgreSQL’s specific symbols read as positions in a shared design space.

Fixed-size page ranges and a computed translation map

Section titled “Fixed-size page ranges and a computed translation map”

Almost every zone-map-style structure groups the base table into fixed-size runs of storage blocks and keeps one summary per run. Fixed size is the key simplification: the summary for the block containing tuple t is found by integer division t_block / blocks_per_run, with no search. Variable-size runs (self-tuning the granularity to data density) are perennially proposed and almost never shipped, because they reintroduce a lookup structure. The PostgreSQL README’s “Future improvements” section explicitly lists different-size page ranges as unimplemented.

One-sided summaries: bounds, not membership

Section titled “One-sided summaries: bounds, not membership”

The summary must be conservative: it may claim a range might match when it does not (false positive, costing a wasted page read and recheck), but it must never claim a range cannot match when it does (false negative, a correctness bug). The two dominant summary shapes are:

  • MinMax — store min and max of the ordered type. Answers range and equality predicates (<, <=, =, >=, >).
  • Bounding box / inclusion — store a geometric or set-theoretic envelope (a box for points, a union for ranges). Answers containment and overlap.

Both are unions: merging two ranges’ summaries means widening the bound to cover both, an operation that is associative and commutative — which is what lets summaries be built in parallel and merged.

Because the structure is approximate, the scan operator must always pair with a recheck: the engine reads every page the index nominated, then re-applies the original predicate to each tuple. The index’s job is purely to prune the set of pages read. This is identical in spirit to a Bloom-filtered scan or a lossy bitmap index, and it is why these indexes only ever feed a bitmap-heap-scan-like node, never an index-only or ordered scan.

Maintaining a per-row index on insert is exactly the cost a summary index exists to avoid, so summarization is batched: a freshly written run of blocks is left unsummarized until a maintenance pass (a VACUUM-equivalent, a bulk-load hook, or a background trigger) scans the run once and computes its bound. Until then the run is treated as “might match everything” so queries stay correct. A subtle concurrency hazard appears here: while the maintenance pass scans a run, new tuples may land in it; the implementation must capture those, classically via a placeholder/two-phase protocol.

Crash safety via single-page atomic actions

Section titled “Crash safety via single-page atomic actions”

Like any disk index, every mutation — summarize, update-in-place, move, desummarize, extend the map — must be a WAL-logged action that leaves the on-disk structure valid at any interruption point. The recurring technique is to decompose multi-page mutations so each WAL record covers the pages whose combined state must be atomic (e.g. the new summary page plus the map slot that points at it), and to make any orphaned intermediate state self-correcting under a later scan or VACUUM.

PostgreSQL implements all five conventions, in src/backend/access/brin/. The access method is registered by brinhandler, which advertises the defining capability profile: no amgettuple, yes amgetbitmap, lossy, multi-column, and summarizing.

// brinhandler — src/backend/access/brin/brin.c
amroutine->amcanorder = false;
amroutine->amcanmulticol = true;
amroutine->amoptionalkey = true;
amroutine->amsearchnulls = true;
amroutine->amstorage = true; /* opclass picks the stored type */
amroutine->amcanparallel = false; /* scans are not parallel ... */
amroutine->amcanbuildparallel = true; /* ... but builds are */
amroutine->amsummarizing = true; /* inserts may trigger autosummarize */
amroutine->amgettuple = NULL; /* no precise probe — lossy only */
amroutine->amgetbitmap = bringetbitmap;

The on-disk index has exactly three page types, tagged in the page’s special area so pg_filedump and pageinspect can recognize them:

// brin_page.h — page-type tags live in the last half-word of every page
#define BRIN_PAGETYPE_META 0xF091 /* block 0 */
#define BRIN_PAGETYPE_REVMAP 0xF092 /* the range map, right after meta */
#define BRIN_PAGETYPE_REGULAR 0xF093 /* summary (and placeholder) tuples */

Block 0 is the metapage; the revmap occupies a contiguous run of blocks starting at block 1 and growing as the table grows; everything else is a regular page holding summary tuples. The layout and the scan/insert flows are summarized below.

flowchart TD
  subgraph IDX["BRIN index relation (main fork)"]
    META["blk 0: METAPAGE<br/>pagesPerRange, lastRevmapPage, version"]
    RM["blks 1..lastRevmapPage: REVMAP pages<br/>rm_tids[]: one ItemPointer per page range"]
    REG["blks &gt; lastRevmapPage: REGULAR pages<br/>summary tuples (min/max, null flags)"]
  end
  HEAP["Heap: page ranges of pagesPerRange blocks each<br/>range i = heap blocks [i*ppr .. i*ppr+ppr-1]"]
  RM -- "slot i -&gt; TID" --> REG
  REG -- "summarizes" --> HEAP
  HEAP -- "heapBlk / ppr -&gt; revmap slot" --> RM

The opclass defines what is stored. For the default minmax opclass each indexed column contributes two stored Datums — the min and the max — described by a BrinOpcInfo with oi_nstored = 2:

// brin_minmax_opcinfo — src/backend/access/brin/brin_minmax.c
result = palloc0(MAXALIGN(SizeofBrinOpcInfo(2)) + sizeof(MinmaxOpaque));
result->oi_nstored = 2; /* min and max */
result->oi_regular_nulls = true; /* AM handles NULL bookkeeping generically */
result->oi_typcache[0] = result->oi_typcache[1] = lookup_type_cache(typoid, 0);

In memory a summary is a BrinMemTuple carrying one BrinValues per column; the two generic NULL flags (bv_hasnulls, bv_allnulls) plus the opclass-specific bv_values[] capture the whole summary, and tuple-level flags mark placeholder and empty-range tuples:

// brin_tuple.h — the deformed (in-memory) summary
typedef struct BrinValues {
AttrNumber bv_attno;
bool bv_hasnulls; /* any NULL in the range? */
bool bv_allnulls; /* all values NULL in the range? */
Datum *bv_values; /* opclass-stored values (min,max for minmax) */
void *bv_mem_value;
} BrinValues;
typedef struct BrinMemTuple {
bool bt_placeholder; /* summarization in progress */
bool bt_empty_range; /* range provably has no tuples */
BlockNumber bt_blkno; /* first heap block of the range */
/* ... */
BrinValues bt_columns[FLEXIBLE_ARRAY_MEMBER];
} BrinMemTuple;

add_values_to_range is the per-tuple accumulator shared by build and insert: for each column it either records a NULL locally or calls the opclass BRIN_PROCNUM_ADDVALUE (e.g. brin_minmax_add_value) to widen the bound, returning whether the summary changed.

// add_values_to_range — src/backend/access/brin/brin.c (condensed)
bool modified = dtup->bt_empty_range;
for (keyno = 0; keyno < bdesc->bd_tupdesc->natts; keyno++) {
BrinValues *bval = &dtup->bt_columns[keyno];
if (bdesc->bd_info[keyno]->oi_regular_nulls && nulls[keyno]) {
if (!bval->bv_hasnulls) { bval->bv_hasnulls = true; modified = true; }
continue;
}
result = FunctionCall4Coll(addValue, /* BRIN_PROCNUM_ADDVALUE */
idxRel->rd_indcollation[keyno],
PointerGetDatum(bdesc), PointerGetDatum(bval),
values[keyno], nulls[keyno]);
modified |= DatumGetBool(result);
}
dtup->bt_empty_range = false;
return modified;

The minmax addValue is the heart of the “bound, not value” design: a value that already lies inside [min, max] changes nothing (and returns false, so no page write); only an out-of-bounds value widens the summary.

// brin_minmax_add_value — brin_minmax.c (condensed)
if (column->bv_allnulls) { /* first non-null seen: min = max = newval */
column->bv_values[0] = datumCopy(newval, attr->attbyval, attr->attlen);
column->bv_values[1] = datumCopy(newval, attr->attbyval, attr->attlen);
column->bv_allnulls = false;
PG_RETURN_BOOL(true);
}
cmpFn = minmax_get_strategy_procinfo(bdesc, attno, attr->atttypid, BTLessStrategyNumber);
if (DatumGetBool(FunctionCall2Coll(cmpFn, colloid, newval, column->bv_values[0]))) {
column->bv_values[0] = datumCopy(newval, ...); updated = true; /* new min */
}
cmpFn = minmax_get_strategy_procinfo(bdesc, attno, attr->atttypid, BTGreaterStrategyNumber);
if (DatumGetBool(FunctionCall2Coll(cmpFn, colloid, newval, column->bv_values[1]))) {
column->bv_values[1] = datumCopy(newval, ...); updated = true; /* new max */
}
PG_RETURN_BOOL(updated);

The revmap is a flat array of ItemPointerData spread across consecutive revmap pages. Two macros do all the addressing — there is no search:

// brin_revmap.c — heap block -> (revmap page, slot) by pure arithmetic
#define HEAPBLK_TO_REVMAP_BLK(pagesPerRange, heapBlk) \
((heapBlk / pagesPerRange) / REVMAP_PAGE_MAXITEMS)
#define HEAPBLK_TO_REVMAP_INDEX(pagesPerRange, heapBlk) \
((heapBlk / pagesPerRange) % REVMAP_PAGE_MAXITEMS)

heapBlk / pagesPerRange is the range number; dividing by REVMAP_PAGE_MAXITEMS picks the revmap page (then +1 to skip the metapage), and the remainder is the slot within it. brinGetTupleForHeapBlock is the workhorse that turns a heap block into the summary BrinTuple: read the revmap slot, follow the TID to a regular page, and — crucially — verify tup->bt_blkno == heapBlk, because the revmap could have been repointed concurrently. A mismatch means retry; a self-referential loop is detected and raised as corruption.

// brinGetTupleForHeapBlock — brin_revmap.c (condensed)
heapBlk = (heapBlk / revmap->rm_pagesPerRange) * revmap->rm_pagesPerRange;
mapBlk = revmap_get_blkno(revmap, heapBlk);
if (mapBlk == InvalidBlockNumber) return NULL; /* range not summarized */
for (;;) {
/* read revmap slot under SHARE lock */
iptr = contents->rm_tids + HEAPBLK_TO_REVMAP_INDEX(revmap->rm_pagesPerRange, heapBlk);
if (!ItemPointerIsValid(iptr)) return NULL; /* unsummarized */
if (ItemPointerIsValid(&previptr) && ItemPointerEquals(&previptr, iptr))
ereport(ERROR, (errcode(ERRCODE_INDEX_CORRUPTED), ...)); /* loop guard */
previptr = *iptr;
blk = ItemPointerGetBlockNumber(iptr); *off = ItemPointerGetOffsetNumber(iptr);
/* fetch the regular page, lock in requested mode */
if (BRIN_IS_REGULAR_PAGE(page)) {
tup = (BrinTuple *) PageGetItem(page, lp);
if (tup->bt_blkno == heapBlk) return tup; /* found it */
}
LockBuffer(*buf, BUFFER_LOCK_UNLOCK); /* repointed concurrently; retry */
}

Setting a slot is symmetric and trivial — brinSetHeapBlockItemptr writes (or invalidates) the ItemPointerData at the computed slot; it is the same routine used in WAL replay, which is why the revmap update is replay-safe.

bringetbitmap is the only scan entry point. It does not descend anything: it learns the heap size, then walks the revmap linearly from block 0 in steps of pagesPerRange, and for each range either emits all its pages (unsummarized, placeholder, or matching) or skips them (consistent function says no). Every emitted page goes into a TIDBitmap at page granularity — inherently lossy — for the BitmapHeapScan recheck to filter.

flowchart TD
  S["bringetbitmap: nblocks = RelationGetNumberOfBlocks(heap)"] --> L{"heapBlk &lt; nblocks?"}
  L -- no --> DONE["return totalpages * 10"]
  L -- yes --> G["brinGetTupleForHeapBlock(heapBlk)"]
  G --> T{"got a summary tuple?"}
  T -- "no (unsummarized)" --> ADD["addrange = true"]
  T -- "yes" --> PH{"placeholder?"}
  PH -- yes --> ADD
  PH -- no --> CK["for each attr with keys:<br/>check null keys, then<br/>opclass consistent()"]
  CK --> M{"all keys consistent?"}
  M -- no --> SKIP["addrange = false"]
  M -- yes --> ADD
  ADD --> E["tbm_add_page for every page in range"]
  SKIP --> N["heapBlk += pagesPerRange"]
  E --> N
  N --> L

The matching logic lives inline in the scan loop; the key calls are check_null_keys (handles IS [NOT] NULL and the all-nulls short-circuit) and the opclass BRIN_PROCNUM_CONSISTENT:

// bringetbitmap — brin.c (condensed): per-range decision
if (!gottuple) addrange = true; /* unsummarized -> keep */
else if (dtup->bt_placeholder) addrange = true; /* in-progress -> keep */
else {
addrange = true; /* default keep; no keys => match */
for (attno = 1; attno <= bdesc->bd_tupdesc->natts; attno++) {
if (dtup->bt_empty_range) { addrange = false; break; } /* provably empty */
if (bdesc->bd_info[attno-1]->oi_regular_nulls &&
!check_null_keys(bval, nullkeys[attno-1], nnullkeys[attno-1]))
{ addrange = false; break; }
if (bval->bv_allnulls) { addrange = false; break; } /* strict ops */
add = FunctionCall3Coll(&consistentFn[attno-1], collation,
PointerGetDatum(bdesc), PointerGetDatum(bval),
PointerGetDatum(keys[attno-1][keyno]));
addrange = DatumGetBool(add);
if (!addrange) break;
}
}
if (addrange)
for (pageno = heapBlk; pageno <= Min(nblocks, heapBlk+ppr)-1; pageno++)
tbm_add_page(tbm, pageno); /* page-granular, lossy */

The minmax consistent function is the dual of addValue: it translates a scan-key strategy into a comparison against the stored bound — </<= test the min, >/>= test the max, and = requires min <= key <= max:

// brin_minmax_consistent — brin_minmax.c (condensed)
switch (key->sk_strategy) {
case BTLessStrategyNumber: case BTLessEqualStrategyNumber:
matches = FunctionCall2Coll(finfo, colloid, column->bv_values[0] /*min*/, value);
break;
case BTEqualStrategyNumber: /* min <= key AND max >= key */
matches = FunctionCall2Coll(/*<=*/finfo, colloid, column->bv_values[0], value);
if (!DatumGetBool(matches)) break;
matches = FunctionCall2Coll(/*>=*/finfo2, colloid, column->bv_values[1], value);
break;
case BTGreaterEqualStrategyNumber: case BTGreaterStrategyNumber:
matches = FunctionCall2Coll(finfo, colloid, column->bv_values[1] /*max*/, value);
break;
}
PG_RETURN_DATUM(matches);

Maintenance: insert narrows, VACUUM summarizes

Section titled “Maintenance: insert narrows, VACUUM summarizes”

brininsert does not create entries. It finds the summary covering the inserted tuple’s range; if the range is unsummarized it returns immediately, otherwise it asks add_values_to_range whether the bound must widen and, if so, applies brin_doupdate. New ranges are summarized lazily by VACUUM, brin_summarize_new_values(), or the autosummarize reloption. The opclass-defined union merges two summaries (used both to fold a worker-built range into the leader’s during parallel build and to re-merge the placeholder during concurrent summarization). The next section traces these flows symbol by symbol.

brinhandler (brin.c) fills the IndexAmRoutine and is the entry point for the generic AM dispatch (see postgres-index-am.md). brin_build_desc constructs the BrinDesc — the per-index runtime descriptor — by calling each column’s BRIN_PROCNUM_OPCINFO to learn oi_nstored and the stored types. The four generic support procedures every opclass must provide are numbered in brin_internal.h:

// brin_internal.h — generic opclass support procedure numbers
#define BRIN_PROCNUM_OPCINFO 1 /* describe the stored layout */
#define BRIN_PROCNUM_ADDVALUE 2 /* widen a summary with a heap value */
#define BRIN_PROCNUM_CONSISTENT 3 /* does a summary match a scan key? */
#define BRIN_PROCNUM_UNION 4 /* merge two summaries */
#define BRIN_PROCNUM_OPTIONS 5 /* optional */
#define BRIN_LAST_OPTIONAL_PROCNUM 15

brin_build_desc walks the index tuple descriptor and accumulates bd_totalstored; the resulting bdesc->bd_info[] array of BrinOpcInfo * is threaded through every add/consistent/union call.

Revmap: addressing, fetch, and physical extension

Section titled “Revmap: addressing, fetch, and physical extension”

The revmap access object BrinRevmap is created by brinRevmapInitialize, which reads pagesPerRange and lastRevmapPage from the metapage and caches them. revmap_get_blkno converts a heap block to a physical revmap block (or InvalidBlockNumber if that revmap page does not exist yet — meaning the range is unsummarized):

// revmap_get_blkno — brin_revmap.c
targetblk = HEAPBLK_TO_REVMAP_BLK(revmap->rm_pagesPerRange, heapBlk) + 1; /* +1 skips meta */
if (targetblk <= revmap->rm_lastRevmapPage)
return targetblk;
return InvalidBlockNumber;

When a summary must be inserted for a range whose revmap page does not exist, revmap_extend_and_get_blkno loops calling revmap_physical_extend. This is the most intricate routine in the revmap: it locks the metapage (serializing revmap growth), grabs the next block — reading it if it already exists as a regular page, else extending the relation — and if that block already holds summary tuples, evacuates them first. This is the “Whenever the revmap needs to be extended by another page, existing tuples in that page are moved to some other page” behavior from the README.

// revmap_physical_extend — brin_revmap.c (condensed)
LockBuffer(revmap->rm_metaBuf, BUFFER_LOCK_EXCLUSIVE); /* serialize extension */
if (metadata->lastRevmapPage != revmap->rm_lastRevmapPage) {
revmap->rm_lastRevmapPage = metadata->lastRevmapPage; /* stale cache: caller retries */
LockBuffer(revmap->rm_metaBuf, BUFFER_LOCK_UNLOCK); return;
}
mapBlk = metadata->lastRevmapPage + 1;
/* ... obtain buffer for mapBlk (read or ExtendBufferedRel) ... */
if (brin_start_evacuating_page(irel, buf)) { /* page holds tuples? */
LockBuffer(revmap->rm_metaBuf, BUFFER_LOCK_UNLOCK);
brin_evacuate_page(irel, revmap->rm_pagesPerRange, revmap, buf);
return; /* caller starts over */
}
START_CRIT_SECTION();
brin_page_init(page, BRIN_PAGETYPE_REVMAP); /* retype as revmap */
metadata->lastRevmapPage = mapBlk;
/* set metapage pd_lower, MarkBufferDirty, WAL: XLOG_BRIN_REVMAP_EXTEND */
END_CRIT_SECTION();

brinGetTupleForHeapBlock (quoted in section 3) is the read path; brinSetHeapBlockItemptr is the write path and is shared with WAL replay. brinRevmapDesummarizeRange is the reverse of summarization: it locks the revmap slot and the regular page, deletes the summary tuple with PageIndexTupleDeleteNoCompact, invalidates the slot, and WAL-logs XLOG_BRIN_DESUMMARIZE. It returns false (retry) if the regular page turned into a revmap page underneath it.

// brinRevmapDesummarizeRange — brin_revmap.c (condensed)
START_CRIT_SECTION();
ItemPointerSetInvalid(&invalidIptr);
brinSetHeapBlockItemptr(revmapBuf, revmap->rm_pagesPerRange, heapBlk, invalidIptr);
PageIndexTupleDeleteNoCompact(regPg, regOffset); /* drop the summary tuple */
MarkBufferDirty(regBuf); MarkBufferDirty(revmapBuf);
/* WAL: xl_brin_desummarize {pagesPerRange, heapBlk, regOffset} */
END_CRIT_SECTION();

Page operations: insert, same-page update, move, evacuate

Section titled “Page operations: insert, same-page update, move, evacuate”

brin_doinsert (brin_pageops.c) places a new summary (or placeholder, or empty-range) tuple: it ensures the revmap is long enough (brinRevmapExtend), finds an insert buffer (brin_getinsertbuffer, which consults the FSM and may extend the relation), PageAddItems the tuple, points the revmap slot at it with brinSetHeapBlockItemptr, and WAL-logs XLOG_BRIN_INSERTall under one critical section so the tuple and the slot pointing at it become durable together.

brin_doupdate is the in-place narrowing path. The decision between a cheap same-page overwrite and an expensive move hinges on brin_can_do_samepage_update:

// brin_can_do_samepage_update — brin_pageops.c
return ((newsz <= origsz) ||
PageGetExactFreeSpace(BufferGetPage(buffer)) >= (newsz - origsz));

If the new (wider) summary fits, brin_doupdate does a PageIndexTupleOverwrite and logs XLOG_BRIN_SAMEPAGE_UPDATE — the revmap slot does not change, because the tuple stays at the same TID. If it does not fit, the new tuple is written to a different page, the old one is deleted, and the revmap slot is repointed to the new TID, logged as XLOG_BRIN_UPDATE covering new page + revmap + old page:

// brin_doupdate — brin_pageops.c (the move branch, condensed)
revmapbuf = brinLockRevmapPageForUpdate(revmap, heapBlk);
START_CRIT_SECTION();
PageIndexTupleDeleteNoCompact(oldpage, oldoff); /* remove old summary */
newoff = PageAddItem(newpage, newtup, newsz, ...); /* write new one */
ItemPointerSet(&newtid, newblk, newoff);
brinSetHeapBlockItemptr(revmapbuf, pagesPerRange, heapBlk, newtid); /* repoint */
MarkBufferDirty(oldbuf); MarkBufferDirty(newbuf); MarkBufferDirty(revmapbuf);
/* WAL: XLOG_BRIN_UPDATE (+ INIT_PAGE if extended) over all three buffers */
END_CRIT_SECTION();

brin_doupdate is careful about concurrency: before committing it re-checks that the old tuple is still a normal item on a still-regular page and that its bytes match the origtup snapshot (brin_tuples_equal); if a concurrent inserter changed it, brin_doupdate returns false and brininsert restarts its loop, re-reading the now-current summary so neither inserter’s value is lost.

The evacuation pair supports revmap growth. brin_start_evacuating_page sets the non-WAL-logged BRIN_EVACUATE_PAGE flag so br_page_get_freespace reports zero (no new tuples land there), and brin_evacuate_page moves every tuple off the page by re-running brin_doupdate for each, then releases it for retyping as a revmap page.

brinbeginscan builds a BrinOpaque (revmap access + BrinDesc). bringetbitmap (section 3) preprocesses scan keys into per-attribute arrays, splitting IS [NOT] NULL keys (nullkeys) from regular keys (keys), then runs the revmap walk. It lazily looks up each attribute’s BRIN_PROCNUM_CONSISTENT once. Note the opclass may take either signature: if consistentFn.fn_nargs >= 4 the AM passes all keys at once (FunctionCall4Coll), otherwise one at a time (FunctionCall3Coll) — minmax uses the 3-arg form. brinrescan copies new scan keys; brinendscan frees the opaque. The return value totalpages * 10 is a deliberately crude row estimate (the AM only knows pages, not tuples).

brinbuild initializes the metapage (brin_metapage_init, WAL XLOG_BRIN_CREATE_INDEX) and a BrinBuildState, then either launches parallel workers (_brin_begin_parallel / _brin_parallel_merge) or scans the heap serially in physical order with table_index_build_scan feeding brinbuildCallback. The callback accumulates tuples into the running range, and whenever a heap block crosses into the next range it flushes via form_and_insert_tuple:

// brinbuildCallback — brin.c (condensed)
while (thisblock > state->bs_currRangeStart + state->bs_pagesPerRange - 1) {
form_and_insert_tuple(state); /* flush completed range */
state->bs_currRangeStart += state->bs_pagesPerRange;
brin_memtuple_initialize(state->bs_dtuple, state->bs_bdesc);
}
(void) add_values_to_range(index, state->bs_bdesc, state->bs_dtuple, values, isnull);

After the scan brinbuild flushes the final partial range and calls brin_fill_empty_ranges to backfill empty-range summary tuples for any trailing ranges the scan skipped (heap blocks with no live tuples). Empty ranges matter for correctness-cheapness: a predicate like WHERE nonnull_col IS NULL can prune an empty range instead of treating the gap as unsummarized (and therefore always-matching). The parallel callback brinbuildCallbackParallel instead spills each range to a shared tuplesort (form_and_spill_tuple); the leader merges per-worker results and fills empty ranges itself.

Summarization and the placeholder protocol

Section titled “Summarization and the placeholder protocol”

brinsummarize scans the revmap and, for each unsummarized range (NULL slot), calls summarize_range. This is the concurrency-critical routine. It cannot simply scan the heap range and insert a summary, because concurrent inserters could add tuples to the range during the scan and be missed. The solution is a placeholder tuple:

flowchart TD
  A["summarize_range(heapBlk)"] --> B["brin_form_placeholder_tuple<br/>brin_doinsert: revmap now points to placeholder"]
  B --> C["concurrent inserters update<br/>the placeholder via brininsert"]
  B --> D["table_index_build_range_scan<br/>'any visible' over the heap range"]
  D --> E["loop: brin_form_tuple from scan result"]
  E --> F{"brin_doupdate placeholder -&gt; summary?"}
  F -- success --> G["done: revmap points to real summary"]
  F -- "failed (placeholder changed)" --> H["re-read placeholder<br/>union_tuples into scan result"]
  H --> E
// summarize_range — brin.c (condensed)
phtup = brin_form_placeholder_tuple(state->bs_bdesc, heapBlk, &phsz);
offset = brin_doinsert(state->bs_irel, ppr, state->bs_rmAccess, &phbuf, heapBlk, phtup, phsz);
/* scan the heap range; 'any visible' so in-progress inserts are included */
table_index_build_range_scan(heapRel, state->bs_irel, indexInfo, false, true, false,
heapBlk, scanNumBlks, brinbuildCallback, state, NULL);
for (;;) {
newtup = brin_form_tuple(state->bs_bdesc, heapBlk, state->bs_dtuple, &newsize);
didupdate = brin_doupdate(..., phbuf, offset, phtup, phsz, newtup, newsize, samepage);
if (didupdate) break; /* placeholder -> summary committed */
/* someone updated the placeholder concurrently: merge and retry */
phtup = brinGetTupleForHeapBlock(state->bs_rmAccess, heapBlk, &phbuf, &offset, &phsz, ...);
union_tuples(state->bs_bdesc, state->bs_dtuple, phtup);
}

union_tuples (and the opclass brin_minmax_union) merges the placeholder’s accumulated values into the scan result, handling the empty/all-null cases before invoking BRIN_PROCNUM_UNION per column. The SQL entry points are brin_summarize_range / brin_summarize_new_values (the latter passing BRIN_ALL_BLOCKRANGES), both taking ShareUpdateExclusiveLock and refusing to run during recovery. autosummarize is triggered from brininsert: when the first tuple lands in the first block of a new range, it requests AVW_BRINSummarizeRange work from autovacuum for the previous range.

// brininsert — brin.c (autosummarize trigger, condensed)
if (autosummarize && heapBlk > 0 && heapBlk == origHeapBlk &&
ItemPointerGetOffsetNumber(heaptid) == FirstOffsetNumber) {
lastPageTuple = brinGetTupleForHeapBlock(revmap, heapBlk - 1, &buf, &off, NULL, BUFFER_LOCK_SHARE);
if (!lastPageTuple)
AutoVacuumRequestWork(AVW_BRINSummarizeRange, RelationGetRelid(idxRel), heapBlk - 1);
}

brinbulkdelete is essentially a no-op — BRIN stores no per-tuple TIDs, so heap tuple removal needs no index cleanup (the README: tightening a summary after deletes is “an optimization opportunity only, not a correctness issue”). brinvacuumcleanup does the real VACUUM work: brin_vacuum_scan walks the index in physical order repairing any uninitialized pages via brin_page_cleanup (recovering FSM space lost to a crash mid-extension), then brinsummarize summarizes all still-unsummarized ranges (excluding the partial tail, since include_partial = false for VACUUM).

The inclusion opclass generalizes minmax from a scalar [min,max] to a union envelope and stores three values: the union, an unmergeable flag, and a contains_empty flag (oi_nstored = 3). It is what backs box/range/inet BRIN, and it uses extra support procedures PROCNUM_MERGE (11, required) and PROCNUM_MERGEABLE (12) beyond the four generic ones. Its geometric internals belong to postgres-gist.md; what matters here is that the framework — opcinfo/addValue/consistent/union, the revmap, the lossy scan — is opclass-agnostic, and a new summary kind is just a new set of these five callbacks.

Position hints (as of 2026-06-05, REL_18 273fe94)

Section titled “Position hints (as of 2026-06-05, REL_18 273fe94)”
SymbolFileLine
brinhandlersrc/backend/access/brin/brin.c250
brininsertsrc/backend/access/brin/brin.c344
bringetbitmapsrc/backend/access/brin/brin.c567
brinbuildCallbacksrc/backend/access/brin/brin.c995
brinbuildsrc/backend/access/brin/brin.c1105
brinbulkdeletesrc/backend/access/brin/brin.c1302
brinvacuumcleanupsrc/backend/access/brin/brin.c1317
brin_summarize_rangesrc/backend/access/brin/brin.c1381
brin_desummarize_rangesrc/backend/access/brin/brin.c1491
brin_build_descsrc/backend/access/brin/brin.c1581
summarize_rangesrc/backend/access/brin/brin.c1761
brinsummarizesrc/backend/access/brin/brin.c1887
union_tuplessrc/backend/access/brin/brin.c2031
brin_vacuum_scansrc/backend/access/brin/brin.c2172
add_values_to_rangesrc/backend/access/brin/brin.c2205
check_null_keyssrc/backend/access/brin/brin.c2298
brin_fill_empty_rangessrc/backend/access/brin/brin.c2993
brinRevmapInitializesrc/backend/access/brin/brin_revmap.c70
brinSetHeapBlockItemptrsrc/backend/access/brin/brin_revmap.c155
brinGetTupleForHeapBlocksrc/backend/access/brin/brin_revmap.c194
brinRevmapDesummarizeRangesrc/backend/access/brin/brin_revmap.c323
revmap_get_blknosrc/backend/access/brin/brin_revmap.c442
revmap_physical_extendsrc/backend/access/brin/brin_revmap.c522
HEAPBLK_TO_REVMAP_BLK / _INDEXsrc/backend/access/brin/brin_revmap.c40
brin_doupdatesrc/backend/access/brin/brin_pageops.c53
brin_can_do_samepage_updatesrc/backend/access/brin/brin_pageops.c323
brin_doinsertsrc/backend/access/brin/brin_pageops.c342
brin_metapage_initsrc/backend/access/brin/brin_pageops.c486
brin_start_evacuating_pagesrc/backend/access/brin/brin_pageops.c524
brin_evacuate_pagesrc/backend/access/brin/brin_pageops.c564
brin_getinsertbuffersrc/backend/access/brin/brin_pageops.c689
brin_minmax_opcinfosrc/backend/access/brin/brin_minmax.c34
brin_minmax_add_valuesrc/backend/access/brin/brin_minmax.c64
brin_minmax_consistentsrc/backend/access/brin/brin_minmax.c137
brin_minmax_unionsrc/backend/access/brin/brin_minmax.c208
BrinMetaPageData / RevmapContentssrc/include/access/brin_page.h64 / 78
REVMAP_PAGE_MAXITEMSsrc/include/access/brin_page.h93
BrinValues / BrinMemTuplesrc/include/access/brin_tuple.h29 / 44
BRIN_DEFAULT_PAGES_PER_RANGEsrc/include/access/brin.h39

Verified against the REL_18_STABLE checkout at commit 273fe94 (PG 18.x), source tree /data/hgryoo/references/postgres.

  • Page types and metapage layoutbrin_page.h defines exactly three page types (BRIN_PAGETYPE_META 0xF091, _REVMAP 0xF092, _REGULAR 0xF093), BRIN_METAPAGE_BLKNO 0, BRIN_CURRENT_VERSION 1, and BrinMetaPageData { brinMagic, brinVersion, pagesPerRange, lastRevmapPage }. Confirmed.
  • Revmap arithmeticHEAPBLK_TO_REVMAP_BLK and HEAPBLK_TO_REVMAP_INDEX in brin_revmap.c compute page and slot purely from heapBlk, pagesPerRange, and REVMAP_PAGE_MAXITEMS; +1 skips the metapage. revmap_get_blkno returns InvalidBlockNumber for not-yet- allocated revmap pages (= unsummarized). Confirmed verbatim.
  • No amgettuplebrinhandler sets amgettuple = NULL and amgetbitmap = bringetbitmap; amcanparallel = false but amcanbuildparallel = true; amsummarizing = true. Confirmed.
  • Default granularityBRIN_DEFAULT_PAGES_PER_RANGE is 128 in brin.h. Confirmed.
  • minmax stored layoutbrin_minmax_opcinfo sets oi_nstored = 2 (bv_values[0] = min, [1] = max) and oi_regular_nulls = true; inclusion sets oi_nstored = 3. Confirmed.
  • Generic procnumsBRIN_PROCNUM_OPCINFO=1, ADDVALUE=2, CONSISTENT=3, UNION=4 in brin_internal.h; inclusion adds PROCNUM_MERGE=11, PROCNUM_MERGEABLE=12. Confirmed.
  • Placeholder protocolsummarize_range inserts brin_form_placeholder_tuple, scans with table_index_build_range_scan (any-visible mode), then loops brin_doupdate + union_tuples on conflict. Confirmed.
  • WAL record kinds namedXLOG_BRIN_CREATE_INDEX, _INSERT, _SAMEPAGE_UPDATE, _UPDATE, _REVMAP_EXTEND, _DESUMMARIZE are emitted by the routines as described. Confirmed against brin.c, brin_pageops.c, brin_revmap.c. (WAL replay handlers live in brin_xlog.c, not walked here.)
  • Concurrency loop-guardbrinGetTupleForHeapBlock detects an unchanged revmap TID across iterations and raises ERRCODE_INDEX_CORRUPTED. Confirmed.
  • Scope deferrals honored — generic IndexAmRoutine dispatch is left to postgres-index-am.md; inclusion-opclass geometric internals and the bounding-box machinery are left to postgres-gist.md; buffer/WAL primitives are named but not re-derived.
  • Caveat — line numbers in the position-hint table are hints scoped to 273fe94; symbol names are the durable anchors. The inclusion opclass is summarized for contrast only and not exhaustively walked.

Beyond PostgreSQL — Comparative Designs & Research Frontiers

Section titled “Beyond PostgreSQL — Comparative Designs & Research Frontiers”
  • Moerkotte’s SMA vs. BRIN — the same idea, two granularities. The foundational paper, Moerkotte 1998 (“Small Materialized Aggregates”), proposed keeping per-partition aggregates (min, max, sum, count) so the optimizer could prune partitions and even answer aggregate queries without touching the data. BRIN is the pruning half of SMA, specialized to a storage-block partition and a fixed (min,max) summary, and deliberately not the aggregate-answering half: a BRIN scan still rechecks every nominated page, it never returns the summary as a query answer. A note comparing what PostgreSQL gives up by not materializing sum/count (and what pg_stats-driven planning recovers instead) would sharpen the lineage. Paper anchor: dbms-papers/ (SMA / data-warehouse indexing).

  • Zone maps and storage indexes (Netezza, Oracle Exadata, Amazon Redshift). Commercial column/appliance engines ship the same structure under different names — zone maps keep per-zone min/max maintained automatically by the storage layer, and Exadata storage indexes live in RAM on the storage cells. The architectural contrast is who owns the summary: in those systems it is an opaque, non-user-visible artifact of the storage engine, refreshed transparently; in PostgreSQL BRIN is a first-class, CREATE INDEX-built, VACUUM-maintained relation with a user-tunable pages_per_range and explicit brin_summarize_* controls. PostgreSQL trades the appliance’s invisibility for inspectability (pageinspect’s brin_page_items, brin_revmap_data) and DBA leverage.

  • In-core BRIN opclasses beyond minmax/inclusion. REL_18 ships two further summary kinds in the same framework, both worth a comparative note: brin_minmax_multi (brin_minmax_multi.c, oi_nstored = 1) stores a serialized set of value ranges instead of a single [min,max], so a column with a few outliers or mild clustering does not collapse every range’s summary to the full domain — it is the practical answer to BRIN’s Achilles heel (low correlation). brin_bloom (brin_bloom.c) summarizes a range into a bloom filter, trading the ordered-type requirement for equality-only pruning on uncorrelated but low-cardinality columns. Both reuse the revmap, the lossy scan, and the opcinfo/addValue/consistent/union contract unchanged — concrete proof that the framework dissected here is opclass-agnostic.

  • The correlation cliff and self-tuning ranges. BRIN’s effectiveness is a step function of pg_stats.correlation; the README’s own “Future improvements” section floats different-size page ranges that self-tune granularity to data density (finer where summaries would otherwise be loose, coarser where they are already tight), explicitly noting this “can probably be made to work by using the range map as an index in itself.” That is an unrealized research frontier: it reintroduces a search into the revmap and so trades the O(1) translation that defines BRIN today for adaptivity. A prototype measuring pruning gain vs. revmap-lookup cost would directly test the README’s conjecture.

  • Lossy-bitmap representation cost. The second README “Future improvements” bullet observes that TIDBitmap’s lossy page-range representation (a Bitmapset per range) is suboptimal for BRIN, which always emits whole contiguous page ranges. A range-aware bitmap encoding is a self-contained optimization that would shrink the bridge between bringetbitmap and the BitmapHeapScan recheck; it belongs to postgres-executor.md’s TIDBitmap rather than to BRIN proper, but BRIN is its motivating workload.

  • Column imprints and learned summaries. Research structures such as column imprints (Sidirourgos & Kersten, SIGMOD 2013) and learned/ML-based zone maps push the summary toward a cache-line-granular bit-vector or a learned predicate, attacking the same false-positive rate BRIN’s fixed [min,max] accepts. They frame the open question BRIN answers most conservatively: how much summary precision is worth how much index size and maintenance cost — the size-vs-precision trade named in Section 1, now a live design axis rather than a single fixed point.

In-tree READMEs and source files (REL_18_STABLE, commit 273fe94)

Section titled “In-tree READMEs and source files (REL_18_STABLE, commit 273fe94)”
  • src/backend/access/brin/README — the design document: the per-range summary model, the range map and its fixed-arithmetic addressing, the evacuate-on-extend rule, summarization and the placeholder/concurrent-insert protocol, the “deletes don’t shrink summaries is an optimization opportunity only” note, and the “Future improvements” list (different-size page ranges, TIDBitmap representation).
  • src/backend/access/brin/brin.cbrinhandler, brininsert (with the autosummarize trigger), bringetbitmap, brinbuild / brinbuildCallback / brinbuildCallbackParallel, brin_build_desc, summarize_range / brinsummarize, union_tuples, add_values_to_range, check_null_keys, brin_fill_empty_ranges, brinbulkdelete / brinvacuumcleanup / brin_vacuum_scan, and the SQL entry points brin_summarize_range / brin_summarize_new_values / brin_desummarize_range.
  • src/backend/access/brin/brin_revmap.cbrinRevmapInitialize, HEAPBLK_TO_REVMAP_BLK / HEAPBLK_TO_REVMAP_INDEX, revmap_get_blkno, revmap_physical_extend, brinGetTupleForHeapBlock (the loop-guarded read path), brinSetHeapBlockItemptr (shared with WAL replay), and brinRevmapDesummarizeRange.
  • src/backend/access/brin/brin_pageops.cbrin_doinsert, brin_doupdate (same-page-overwrite vs. move-and-repoint branches), brin_can_do_samepage_update, brin_metapage_init, brin_start_evacuating_page / brin_evacuate_page, brin_getinsertbuffer.
  • src/backend/access/brin/brin_minmax.c — the default opclass: brin_minmax_opcinfo (oi_nstored = 2), brin_minmax_add_value, brin_minmax_consistent, brin_minmax_union.
  • src/backend/access/brin/brin_inclusion.c — the inclusion opclass (oi_nstored = 3, PROCNUM_MERGE/MERGEABLE), referenced for contrast.
  • src/backend/access/brin/brin_minmax_multi.c, brin_bloom.c — additional in-core opclasses cited in Section 6.
  • src/include/access/brin_page.hBRIN_PAGETYPE_{META,REVMAP,REGULAR}, BrinMetaPageData, RevmapContents, REVMAP_PAGE_MAXITEMS, BRIN_METAPAGE_BLKNO, BRIN_CURRENT_VERSION.
  • src/include/access/brin_tuple.hBrinValues, BrinMemTuple, BrinTuple, the placeholder/empty-range flags.
  • src/include/access/brin_internal.hBrinDesc, BrinOpcInfo, the generic BRIN_PROCNUM_* numbers.
  • src/include/access/brin.hBrinStatsData, reloptions, and BRIN_DEFAULT_PAGES_PER_RANGE (128).
  • Moerkotte, G. (1998). “Small Materialized Aggregates: A Light Weight Index Structure for Data Warehousing.” VLDB 1998, 476-487. The conceptual ancestor of BRIN; summary-per-partition pruning. Anchor in knowledge/research/dbms-papers/.
  • Sidirourgos, L. & Kersten, M. (2013). “Column Imprints: A Secondary Index Structure.” SIGMOD 2013, 893-904. A finer-grained, bit-vector summary in the same false-positive-pruning family; the research frontier cited in Section 6.
  • Database Internals (Petrov 2019) — the precise vs. approximate secondary-structure distinction and bitmap/recheck framing (knowledge/research/dbms-general/database-internals.md).
  • Database System Concepts (Silberschatz, Korth, Sudarshan, 7e), ch. 14 “Indexing” — the selectivity/clustering cost model under which a page-skipping scan beats a B-tree probe (knowledge/research/dbms-general/).

Sibling docs (cross-references — mechanism owned there, not duplicated here)

Section titled “Sibling docs (cross-references — mechanism owned there, not duplicated here)”
  • postgres-index-am.md — the generic IndexAmRoutine dispatch contract that brinhandler fills in (amgetbitmap, amsummarizing, parallel-build flags).
  • postgres-gist.md — the inclusion-style geometric opclass internals and bounding-box machinery that BRIN’s inclusion opclass reuses, deferred there.
  • postgres-buffer-manager.md — the buffer pins and content locks that the revmap and regular-page mutations latch on.
  • postgres-vacuum.md / postgres-autovacuum.md — the VACUUM pass that calls brinvacuumcleanup/brinsummarize and the autovacuum worker that services AVW_BRINSummarizeRange autosummarize requests.
  • postgres-wal-records-rmgr.md — the RM_BRIN_ID resource manager and the XLOG_BRIN_* record kinds whose replay handlers live in brin_xlog.c.
  • postgres-architecture-overview.md — Axis 4 (pluggable access methods), where BRIN sits as the summarizing, lossy-bitmap-only index AM.