PostgreSQL GIN — Generalized Inverted Index, Posting Trees, and the Fast-Update List
Contents:
- Theoretical Background
- Common DBMS Design
- PostgreSQL’s Approach
- Source Walkthrough
- Source verification (as of 2026-06-05)
- Beyond PostgreSQL — Comparative Designs & Research Frontiers
- Sources
Theoretical Background
Section titled “Theoretical Background”A secondary index in the textbook sense maps one indexed value to the set
of rows that carry it. For an int column the mapping is one-to-one-ish: a
B-tree leaf entry holds a key and a single (or small) set of row pointers,
and the access method (AM) treats the column value as opaque and orders it by
a single comparison operator. That model breaks the moment the indexed column
is composite — an array {4, 7, 12}, a tsvector of a hundred lexemes, a
jsonb document with dozens of paths — and the queries you want to accelerate
are not “find rows where the array equals {4,7,12}” but “find rows whose
array contains 7”, “whose document matches the full-text query
fox & quick”, “whose jsonb @> '{"a":1}'”. The value as a whole is the
wrong indexing granularity; the elements inside it are what queries probe.
The data structure that fits this shape is the inverted index, the
workhorse of information retrieval. Database System Concepts (Silberschatz
7e, ch. 21 on information retrieval) describes it directly: “An inverted index
stores for each word a list of all document [identifiers] in which the word
occurs.” Each distinct term (word, lexeme, array element — generically a
key) maps to a posting list: the sorted set of document identifiers
(here, heap tuple IDs) that contain that key. A conjunctive query fox & quick is answered by intersecting the posting lists of fox and quick; a
disjunctive query unions them. The index is “inverted” because it inverts the
natural document-to-terms relationship into a terms-to-documents one. This is
exactly the structure a search engine builds over a web corpus, and it is
Database System Concepts’ “inverted index” of ch. 21 — the same idea
relational engines borrow to index array/document/text columns.
Two properties of real corpora make the inverted index a non-trivial engineering problem rather than a hash table of lists:
-
Skew. Term frequencies follow a Zipf distribution: a handful of keys (
the,and) occur in nearly every document, while most keys are rare. A posting list can therefore range from one TID to millions. A single fixed container — “store the list inline next to the key” — is wrong for both ends: it wastes space on rare keys and overflows on common ones. A good inverted index needs an adaptive container that is compact when the list is short and degrades gracefully into a paged structure when it is long. -
Write amplification on insertion. Inserting one document with
kterms means touchingkposting lists, scattered across the key space. Each touch is a B-tree descent and a page rewrite. Indexing a stream of documents this way is brutally slow because the same hot keys are re-written over and over. The classic IR answer is batching: accumulate postings in memory (or a side structure), sort them by key, and merge them into the main index in bulk so each hot key’s list is rewritten once per batch rather than once per document.
The third property is generality. A relational engine cannot bake in “this
is text” — the same machinery must serve integer[], text[], tsvector,
jsonb, hstore, trigram indexes, and user-defined types. So the index must
be parameterized by a small set of strategy functions supplied per data
type: one to extract the keys from a stored value, one to extract the
keys from a query (plus a search mode: must-match-all, match-any,
everything), and one — the consistent function — that, given which of the
query keys were found for a candidate row, decides whether the row actually
satisfies the operator. This is the same “the index does not know which
operator it accelerates” philosophy as GiST, expressed for inverted indexes.
The GIN README states it plainly: “Generalized means that the index does not
know which operation it accelerates. It instead works with custom strategies,
defined for specific data types.”
Common DBMS Design
Section titled “Common DBMS Design”Search engines (Lucene/Elasticsearch, Solr) and full-text features bolted
onto relational engines (Oracle Text, SQL Server Full-Text, MySQL FULLTEXT)
all build on the inverted-index skeleton, and they converge on a recognizable
set of design decisions:
-
Dictionary + postings split. A term dictionary (often itself a B-tree, a sorted array, or an FST as in Lucene) maps each key to the location of its posting list. The postings themselves live separately, because their size distribution is so different from the keys’. Lucene stores the dictionary in a
.tim/.tipfile and postings in a.docfile; a relational engine typically stores the dictionary as a B-tree whose leaves point at posting storage. -
Delta + variable-byte compression of postings. Because a posting list is a sorted list of monotonically increasing document IDs, you store the differences (gaps) between consecutive IDs rather than the IDs themselves, then encode the small gaps with a variable-length integer scheme (varbyte, Simple-9, PForDelta, Roaring bitmaps for dense lists). Sorted-gap + varbyte is the canonical baseline and shrinks dense posting lists several-fold.
-
Segment / batch architecture with background merge. Lucene is the archetype: writes go into a small in-memory segment, which is flushed to an immutable on-disk segment; a background merge policy periodically combines small segments into larger ones. Deletes are tombstones applied at merge time, not in place. This makes ingestion append-mostly and pushes the expensive sorted merge off the write path. The “never modify a posting list in place; rebuild it in bulk” instinct is universal.
-
No in-place delete of dictionary terms. Vocabulary churns slowly: the set of distinct words in a corpus is nearly stable after the corpus grows past a threshold. Engines exploit this by never bothering to remove a term from the dictionary even when its posting list empties — the cost of the empty entry is trivial compared to the concurrency-control machinery a dictionary delete would require.
-
Query as multi-stream merge + a decision function. A boolean query over several terms opens one posting cursor per term, advances them in document-ID order, and at each candidate document evaluates the boolean expression (
AND/OR/NOT, possibly with proximity or scoring). The cursors are advanced with a “skip to document ≥ X” primitive so that intersecting a rare term with a common one skips most of the common term’s list. Skip lists / skip pointers inside the postings make this cheap.
flowchart TD
doc["Indexable item<br/>array / tsvector / jsonb"] -->|extract keys| keys["keys: k1, k2, k3 ..."]
keys --> dict["Term dictionary<br/>(B-tree of keys)"]
dict -->|per key| pl["Posting list<br/>sorted, delta+varbyte compressed<br/>doc IDs / heap TIDs"]
q["Query operator<br/>e.g. arr @> '{7}'"] -->|extract query keys + searchMode| qkeys["query keys + which-must-match"]
qkeys --> cursors["Per-key posting cursors"]
cursors -->|merge in doc-ID order| cand["Candidate doc"]
cand -->|consistent / boolean fn| match{"matches?"}
match -->|yes| out["emit doc ID"]
match -->|no| cursors
Where engines diverge is how aggressively they batch and where the latency lands. A dedicated search engine accepts eventual consistency: a freshly indexed document is searchable only after a refresh/commit. A relational engine indexing a column must keep the index transactionally consistent with the heap — a row inserted in a transaction must be found by that transaction’s later scans — which constrains how far it can push batching off the write path. The compromise, as we’ll see, is to batch into a durable side structure that scans also read, rather than into a volatile buffer.
PostgreSQL’s Approach
Section titled “PostgreSQL’s Approach”GIN — Generalized Inverted index, which the README insists “should be
considered as a genie, not a drink” — is PostgreSQL’s inverted index AM,
originally by Teodor Sigaev and Oleg Bartunov. It is an ordinary entry in the
pg_am catalog dispatched through the IndexAmRoutine table (see
postgres-index-am.md); ginhandler (in ginutil.c) fills that table. The two
fields that define GIN’s personality are amgettuple = NULL and
amgetbitmap = gingetbitmap:
// ginhandler — src/backend/access/gin/ginutil.camroutine->amoptionalkey = true;amroutine->amcanreturn = NULL;// ... condensed ...amroutine->amgettuple = NULL; /* GIN cannot do one-tuple-at-a-time */amroutine->amgetbitmap = gingetbitmap; /* only bitmap scans */GIN supports only bitmap scans, never amgettuple. The README explains
the deep reason: because of the pending list (below), the same heap TID can be
visited twice during one scan, so GIN cannot promise the duplicate-free,
ordered, resumable stream that amgettuple requires. It re-sets the same bit
in a TIDBitmap instead, which is idempotent. This is why EXPLAIN of a GIN
query always shows a Bitmap Index Scan.
GIN’s structure has four kinds of pages, all sharing the standard
PageHeader plus a GIN-specific opaque footer (GinPageOpaque, holding
rightlink and flags):
- Metapage (block 0) —
GinMetaPageData: head/tail of the pending list, page/tuple counts, and statistics. - Entry tree — a Lehman-Yao B-tree (right-links, no left-links: GIN never scans backward) whose keys are the extracted key data and whose leaf tuples carry either an inline posting list or a posting-tree downlink.
- Posting trees — secondary B-trees keyed by
ItemPointer, created when a key’s posting list outgrows an entry-tree page. - Pending list pages — the fast-update side structure.
flowchart TD meta["Metapage (blk 0)<br/>head/tail of pending list<br/>nPendingPages, stats"] meta -.->|GinGetUseFastUpdate| pl0["pending page<br/>(unsorted, 1 TID per entry)"] pl0 -->|rightlink| pl1["pending page"] pl1 -->|rightlink| pl2["pending tail page"] meta --> root["Entry-tree root<br/>(L&Y B-tree of keys)"] root --> e1["entry leaf: key 'fox'<br/>inline posting list (compressed TIDs)"] root --> e2["entry leaf: key 'the'<br/>GinIsPostingTree -> downlink"] e2 --> pt["Posting-tree root<br/>(B-tree keyed by ItemPointer)"] pt --> ptl1["posting leaf<br/>segments of compressed TIDs"] pt --> ptl2["posting leaf<br/>segments of compressed TIDs"]
Key entries, not items. GIN never stores the indexed value as a whole. On
insert it calls the opclass extractValueFn to break the value into a sorted,
de-duplicated array of keys, each with a null category; ginExtractEntries
(in ginutil.c) wraps that and synthesizes a placeholder key for the
otherwise-keyless cases (a NULL value, or an empty array) so that even those
rows have some index entry — without it, full-index scans and IS NULL
style queries would silently miss rows:
// ginExtractEntries — src/backend/access/gin/ginutil.cif (isNull) /* NULL item -> placeholder */{ *nentries = 1; (*categories)[0] = GIN_CAT_NULL_ITEM; return entries;}entries = DatumGetPointer(FunctionCall3Coll(&ginstate->extractValueFn[attnum - 1], ginstate->supportCollation[attnum - 1], value, PointerGetDatum(nentries), PointerGetDatum(&nullFlags)));if (entries == NULL || *nentries <= 0) /* empty array etc. -> placeholder */{ *nentries = 1; (*categories)[0] = GIN_CAT_EMPTY_ITEM; return entries;}/* if there's more than one key, sort and unique-ify */The three null categories — GIN_CAT_NORM_KEY (an ordinary key),
GIN_CAT_NULL_KEY (a genuine NULL element inside an item),
GIN_CAT_EMPTY_ITEM/GIN_CAT_NULL_ITEM (placeholders for keyless/null
items) — are treated as distinct keys by the entry-tree comparator, so they
sort and merge just like real keys.
Posting list → posting tree, the adaptive container. A leaf entry’s TID
list lives inline in the index tuple as long as it fits within GinMaxItemSize
(roughly a third of a page). The TIDs are stored compressed: the README
describes the scheme — combine each (block, offset) into a 43-bit integer,
delta-encode against the previous TID, then varbyte-encode the gaps. When a
merge would push the inline list over the size limit, GIN converts the entry to
a posting tree: a separate B-tree, keyed by ItemPointer, whose leaves hold
compressed posting-list segments. The leaf entry then stores only the
posting-tree root block number, flagged by the magic offset GIN_TREE_POSTING
(0xffff) so GinIsPostingTree(itup) can tell the two cases apart by reading
the tuple header. This is GIN’s answer to Zipf skew: rare keys pay one inline
tuple, the word the gets a whole B-tree.
Fast update: batch the writes. Retail insertion of one heap row touches
one entry per extracted key, each a full entry-tree descent — exactly the
write-amplification problem from the theory section. GIN’s opt-in remedy
(fastupdate, on by default) is the pending list: gininsert appends the
new entries, unsorted, to a singly-linked list of pages anchored on the
metapage, and returns. The expensive merge into the entry tree is deferred to
ginInsertCleanup, which runs when the list exceeds gin_pending_list_limit,
at the next VACUUM, or on an explicit gin_clean_pending_list(). The win is
exactly IR’s batching win: thousands of new entries are sorted once and merged
into each hot key’s list once, instead of thousands of separate descents. The
cost is that every scan must read the pending list linearly first — so the
list must be kept small. With fastupdate=off, gininsert calls
ginHeapTupleInsert → ginEntryInsert directly for each key.
flowchart TD
ins["gininsert(row)"] --> fu{"GinGetUseFastUpdate?"}
fu -->|yes| collect["ginHeapTupleFastCollect<br/>extract keys -> collector"]
collect --> fast["ginHeapTupleFastInsert<br/>append to pending list (metapage)"]
fast --> chk{"pending list ><br/>gin_pending_list_limit?"}
chk -->|yes| clean["ginInsertCleanup<br/>(merge into entry tree)"]
chk -->|no| done1["return"]
fu -->|no| retail["per key: ginHeapTupleInsert<br/>-> ginEntryInsert (descend + merge)"]
retail --> done2["return"]
clean --> done1
Scan = pending list, then main index, merged through consistent(). A GIN
scan (gingetbitmap) first calls scanPendingInsert to scan the pending list
and then scanGetItem/keyGetItem to walk the main index, both feeding the
same TIDBitmap. The query is decomposed by extractQueryFn into one
GinScanEntry per query key (with a searchMode and per-entry partial-match
flag), grouped under GinScanKeys (one per qual). Each entry is a cursor over
its key’s TIDs; keyGetItem merges the cursors in TID order and, at each
candidate TID, invokes the opclass consistent function with a boolean
vector of which entries were present — the consistent function encodes the
operator semantics (@> needs all query keys; && needs any). Entries are
classified required vs additional: at least one required entry must be
present, which lets the merge skip TIDs cheaply.
Source Walkthrough
Section titled “Source Walkthrough”The GIN code splits cleanly along the four files this doc targets:
gininsert.c (insertion + build), ginfast.c (the pending list and its
cleanup), ginbtree.c (the shared tree-navigation engine for both the entry
tree and posting trees), and ginget.c (scanning). ginvacuum.c rounds out
the lifecycle. We follow the call flows.
Retail insertion: gininsert → ginEntryInsert → the adaptive container
Section titled “Retail insertion: gininsert → ginEntryInsert → the adaptive container”gininsert is the AM entry point. Its only branch is fast-update vs retail:
// gininsert — src/backend/access/gin/gininsert.cif (GinGetUseFastUpdate(index)){ GinTupleCollector collector; memset(&collector, 0, sizeof(GinTupleCollector)); for (i = 0; i < ginstate->origTupdesc->natts; i++) ginHeapTupleFastCollect(ginstate, &collector, (OffsetNumber) (i + 1), values[i], isnull[i], ht_ctid); ginHeapTupleFastInsert(ginstate, &collector); /* -> pending list */}else{ for (i = 0; i < ginstate->origTupdesc->natts; i++) ginHeapTupleInsert(ginstate, (OffsetNumber) (i + 1), values[i], isnull[i], ht_ctid); /* -> entry tree */}The retail path lands in ginEntryInsert, the heart of GIN’s
posting-list-or-tree logic. It descends the entry tree with ginFindLeafPage,
then takes one of three actions depending on what it finds at the key:
// ginEntryInsert — src/backend/access/gin/gininsert.cstack = ginFindLeafPage(&btree, false, false);page = BufferGetPage(stack->buffer);
if (btree.findItem(&btree, stack)) /* key already present */{ itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, stack->off)); if (GinIsPostingTree(itup)) /* (1) already a posting tree */ { BlockNumber rootPostingTree = GinGetPostingTree(itup); LockBuffer(stack->buffer, GIN_UNLOCK); freeGinBtreeStack(stack); ginInsertItemPointers(ginstate->index, rootPostingTree, items, nitem, buildStats); /* into posting tree */ return; } CheckForSerializableConflictIn(ginstate->index, NULL, BufferGetBlockNumber(stack->buffer)); itup = addItemPointersToLeafTuple(ginstate, itup, /* (2) grow inline list */ items, nitem, buildStats, stack->buffer); insertdata.isDelete = true;}else{ CheckForSerializableConflictIn(ginstate->index, NULL, BufferGetBlockNumber(stack->buffer)); itup = buildFreshLeafTuple(ginstate, attnum, key, category, /* (3) new key */ items, nitem, buildStats, stack->buffer); if (buildStats) buildStats->nEntries++;}insertdata.entry = itup;ginInsertValue(&btree, stack, &insertdata, buildStats);Case (2), addItemPointersToLeafTuple, is where the container transition
happens. It merges the old and new TID lists (ginMergeItemPointers),
re-compresses, and tries to re-form an inline tuple. If the compressed list no
longer fits (compressedList == NULL from ginCompressPostingList with the
GinMaxItemSize ceiling), it promotes to a posting tree:
// addItemPointersToLeafTuple — src/backend/access/gin/gininsert.cnewItems = ginMergeItemPointers(items, nitem, oldItems, oldNPosting, &newNPosting);compressedList = ginCompressPostingList(newItems, newNPosting, GinMaxItemSize, NULL);if (compressedList) res = GinFormTuple(ginstate, attnum, key, category, (char *) compressedList, SizeOfGinPostingList(compressedList), newNPosting, false);if (!res){ /* posting list would be too big, convert to posting tree */ BlockNumber postingRoot; postingRoot = createPostingTree(ginstate->index, oldItems, oldNPosting, buildStats, buffer); ginInsertItemPointers(ginstate->index, postingRoot, items, nitem, buildStats); res = GinFormTuple(ginstate, attnum, key, category, NULL, 0, 0, true); GinSetPostingTree(res, postingRoot); /* leaf tuple now holds a downlink */}buildFreshLeafTuple (case 3) is the same logic for a brand-new key:
try inline first, fall back to a fresh posting tree if the initial list is
already too large.
The shared B-tree engine: ginbtree.c
Section titled “The shared B-tree engine: ginbtree.c”Both the entry tree and every posting tree are driven by one generic engine
parameterized by a GinBtreeData vtable (findChildPage, isMoveRight,
beginPlaceToPage, execPlaceToPage, prepareDownlink, fillRoot).
ginFindLeafPage is the descent. It latch-couples one page at a time and
handles two L&Y conditions on the way down — finishing any incomplete split
it lands on, and stepping right when a concurrent split has moved the keyspace:
// ginFindLeafPage — src/backend/access/gin/ginbtree.caccess = ginTraverseLock(stack->buffer, searchMode);/* If we're going to modify the tree, finish any incomplete splits we meet */if (!searchMode && GinPageIsIncompleteSplit(page)) ginFinishOldSplit(btree, stack, NULL, access);/* root never has a right link, so small optimization */while (btree->fullScan == false && stack->blkno != btree->rootBlkno && btree->isMoveRight(btree, page)){ BlockNumber rightlink = GinPageGetOpaque(page)->rightlink; if (rightlink == InvalidBlockNumber) break; /* rightmost page */ stack->buffer = ginStepRight(stack->buffer, btree->index, access); stack->blkno = rightlink; page = BufferGetPage(stack->buffer);}ginStepRight is the L&Y move-right primitive and the linchpin of GIN’s
concurrency-vs-VACUUM interlock: it locks the next page before releasing the
current one, so a concurrent page deletion cannot unlink a page out from under
a reader that is about to follow the right-link:
// ginStepRight — src/backend/access/gin/ginbtree.cnextbuffer = ReadBuffer(index, blkno);LockBuffer(nextbuffer, lockmode);UnlockReleaseBuffer(buffer); /* release only AFTER taking next lock */page = BufferGetPage(nextbuffer);if (isLeaf != GinPageIsLeaf(page) || isData != GinPageIsData(page)) elog(ERROR, "right sibling of GIN page is of different type");ginPlaceToPage performs the actual insertion or split. Its three-state
protocol — GPTP_NO_WORK / GPTP_INSERT / GPTP_SPLIT — mirrors nbtree’s
atomic-split design: when an insert to an internal page completes the
posting of a downlink, it clears the child’s GIN_INCOMPLETE_SPLIT flag in the
same WAL action, so a crash mid-split leaves a tree that searches can still
traverse via right-links until the next writer finishes the split.
// ginPlaceToPage — src/backend/access/gin/ginbtree.crc = btree->beginPlaceToPage(btree, stack->buffer, stack, insertdata, updateblkno, &ptp_workspace, &newlpage, &newrpage);if (rc == GPTP_INSERT){ START_CRIT_SECTION(); btree->execPlaceToPage(btree, stack->buffer, stack, insertdata, updateblkno, ptp_workspace); if (BufferIsValid(childbuf)) /* downlink insert finishes child's split */ { GinPageGetOpaque(childpage)->flags &= ~GIN_INCOMPLETE_SPLIT; MarkBufferDirty(childbuf); } /* ... WAL ... */}Fast update: ginfast.c
Section titled “Fast update: ginfast.c”ginHeapTupleFastCollect is the collect half: it extracts keys (same
ginExtractEntries as retail) and forms pending-format tuples, which differ
from leaf tuples in that there is no posting list — the single heap TID is
stuffed straight into t_tid:
// ginHeapTupleFastCollect — src/backend/access/gin/ginfast.centries = ginExtractEntries(ginstate, attnum, value, isNull, &nentries, &categories);for (i = 0; i < nentries; i++){ IndexTuple itup = GinFormTuple(ginstate, attnum, entries[i], categories[i], NULL, 0, 0, true); itup->t_tid = *ht_ctid; /* one TID per pending entry */ collector->tuples[collector->ntuples++] = itup; collector->sumsize += IndexTupleSize(itup);}ginHeapTupleFastInsert appends the collected tuples to the metapage-anchored
list. If the batch is bigger than a page (or the tail page lacks room), it
builds a separate sublist and splices it onto the tail; otherwise it appends to
the tail page. The README’s GIN_LIST_FULLROW flag bit records whether a page
holds complete heap-tuple rows or a fragment, which the scan side relies on.
Note the metapage locking dance that keeps concurrency high — it drops the
metapage lock while building the sublist, then re-acquires it:
// ginHeapTupleFastInsert — src/backend/access/gin/ginfast.cif (collector->sumsize + collector->ntuples * sizeof(ItemIdData) > GinListPageSize) separateList = true;else{ LockBuffer(metabuffer, GIN_EXCLUSIVE); metadata = GinPageGetMeta(metapage); if (metadata->head == InvalidBlockNumber || collector->sumsize + collector->ntuples * sizeof(ItemIdData) > metadata->tailFreeSize) { separateList = true; LockBuffer(metabuffer, GIN_UNLOCK); /* unlock to keep concurrency */ }}ginInsertCleanup is the merge half — the function that pays back the deferred
write cost. It guards against concurrent cleanup with a LockPage on the
metapage (a heavyweight page lock, not a buffer lock, so concurrent
inserts to the pending list are still allowed), then walks the pending pages,
accumulating entries into a BuildAccumulator (a red-black tree keyed by
(attnum, key)), and flushes to the entry tree via ginEntryInsert once
memory fills or it reaches the list end:
// ginInsertCleanup — src/backend/access/gin/ginfast.cif (forceCleanup) /* from VACUUM / gin_clean_pending_list */ LockPage(index, GIN_METAPAGE_BLKNO, ExclusiveLock); /* wait out rivals */else if (!ConditionalLockPage(index, GIN_METAPAGE_BLKNO, ExclusiveLock)) return; /* a rival is already cleaning; bail *//* ... */processPendingPage(&accum, &datums, page, FirstOffsetNumber);/* when memory is full or list end reached: */ginBeginBAScan(&accum);while ((list = ginGetBAEntry(&accum, &attnum, &key, &category, &nlist)) != NULL) ginEntryInsert(ginstate, attnum, key, category, list, nlist, NULL);shiftList(index, metabuffer, blkno, fill_fsm, stats); /* unlink merged pages */The BuildAccumulator is the batching payoff: every pending entry for the
same key is gathered before a single ginEntryInsert, so a hot key’s posting
list is rewritten once per cleanup rather than once per row. shiftList
then unlinks the now-merged head pages from the metapage list (in batches of
GIN_NDELETE_AT_ONCE) and frees them to the FSM.
Scanning: ginget.c
Section titled “Scanning: ginget.c”gingetbitmap is the AM scan entry. The ordering is load-bearing — pending
list before main index — and the comment explains why a TID may be visited
twice (and why that is harmless for a bitmap, but fatal for amgettuple):
// gingetbitmap — src/backend/access/gin/ginget.cginNewScanKey(scan); /* extractQuery -> GinScanKey/GinScanEntry */if (GinIsVoidRes(scan)) return 0;scanPendingInsert(scan, tbm, &ntids); /* pending list FIRST */startScan(scan);ItemPointerSetMin(&iptr);for (;;){ if (!scanGetItem(scan, iptr, &iptr, &recheck)) break; if (ItemPointerIsLossyPage(&iptr)) tbm_add_page(tbm, ItemPointerGetBlockNumber(&iptr)); /* whole page */ else tbm_add_tuples(tbm, &iptr, 1, recheck); ntids++;}scanGetItem advances all the scan keys in lock-step to find the next TID that
satisfies every key; it uses the “nothing < this matches, so skip ahead”
optimization to leapfrog. Per key, keyGetItem finds the minimum TID
advancePastamong the key’srequiredEntries, then loads theadditionalEntriesfor that TID and calls the consistent function:
// keyGetItem — src/backend/access/gin/ginget.cItemPointerSetMax(&minItem);for (i = 0; i < key->nrequired; i++){ entry = key->requiredEntries[i]; if (entry->isFinished) continue; if (ginCompareItemPointers(&entry->curItem, &advancePast) <= 0) { entryGetItem(ginstate, entry, advancePast); /* advance this cursor */ if (entry->isFinished) continue; } allFinished = false; if (ginCompareItemPointers(&entry->curItem, &minItem) < 0) minItem = entry->curItem; /* smallest required TID = next candidate */}Each entry’s cursor is fed by entryGetItem, which pulls TIDs from either the
inline posting list or the entry’s posting tree (advancing page by page), or
from a pre-collected matchBitmap for partial-match / full-scan entries.
Partial-match and full scans: collectMatchBitmap. For a partial-match
entry (e.g. a LIKE 'foo%' accelerated by pg_trgm, or a range), GIN cannot
point at a single key — it must scan a range of the entry tree and union all
their posting lists into one bitmap. collectMatchBitmap walks entry-tree
leaves left-to-right via moveRightIfItNeeded, calling the opclass
comparePartialFn to decide where the range ends, and folds each matching
entry’s TIDs (inline or via scanPostingTree) into scanEntry->matchBitmap:
// collectMatchBitmap — src/backend/access/gin/ginget.cif (scanEntry->isPartialMatch){ if (icategory != GIN_CAT_NORM_KEY) /* partial match never matches nulls */ return true; cmp = DatumGetInt32(FunctionCall4Coll(&btree->ginstate->comparePartialFn[attnum - 1], btree->ginstate->supportCollation[attnum - 1], scanEntry->queryKey, idatum, UInt16GetDatum(scanEntry->strategy), PointerGetDatum(scanEntry->extra_data))); if (cmp > 0) return true; /* past the range: stop */ else if (cmp < 0) { stack->off++; continue; } /* before the range: skip */ /* cmp == 0: this entry is in range -> fold its TIDs into matchBitmap */}moveRightIfItNeeded also acquires a predicate lock on each entry leaf it
steps onto, which is how GIN supports serializable isolation (the README’s
predicate-locking notes: lock the leaf “gaps”, lock posting-tree roots for
equality on a tree-backed key, and always lock the metapage to interlock with
the pending list). SSI details belong to postgres-ssi-predicate-locking.md.
Pending-list scan: scanPendingInsert. Before the main index, GIN scans the
pending list row-by-row. Because pending entries are unsorted and one row’s
entries may span pages, it reads a whole heap row’s entries
(collectMatchesForHeapRow), fills the entryRes/hasMatchKey arrays, and
calls each key’s boolConsistentFn directly:
// scanPendingInsert — src/backend/access/gin/ginget.cPredicateLockPage(scan->indexRelation, GIN_METAPAGE_BLKNO, scan->xs_snapshot);LockBuffer(metabuffer, GIN_SHARE);blkno = GinPageGetMeta(page)->head;/* ... */while (scanGetCandidate(scan, &pos)) /* one heap row at a time */{ if (!collectMatchesForHeapRow(scan, &pos)) continue; match = true; for (i = 0; i < so->nkeys; i++) { GinScanKey key = so->keys + i; if (!key->boolConsistentFn(key)) { match = false; break; } recheck |= key->recheckCurItem; } if (match) { tbm_add_tuples(tbm, &pos.item, 1, recheck); (*ntids)++; }}Build: ginbuild
Section titled “Build: ginbuild”ginbuild initializes the metapage and a leaf root, then scans the heap.
Rather than retail-insert each row, it accumulates postings in a
BuildAccumulator in maintenance_work_mem and flushes batches via
ginEntryInsert — the same bulk-merge mechanism ginInsertCleanup uses, which
is precisely why the README calls bulk insertion “much faster than retail
insertion.” PG 18 can also drive the build with parallel workers
(_gin_begin_parallel), where workers extract and tuplesort GinTuples and
the leader merges the sorted stream.
// ginbuild — src/backend/access/gin/gininsert.cMetaBuffer = GinNewBuffer(index);RootBuffer = GinNewBuffer(index);GinInitMetabuffer(MetaBuffer);GinInitBuffer(RootBuffer, GIN_LEAF);/* ... */buildstate.accum.ginstate = &buildstate.ginstate;ginInitBA(&buildstate.accum);if (indexInfo->ii_ParallelWorkers > 0) _gin_begin_parallel(state, heap, index, indexInfo->ii_Concurrent, indexInfo->ii_ParallelWorkers);VACUUM: ginvacuum.c
Section titled “VACUUM: ginvacuum.c”GIN’s amvacuumcleanup/ambulkdelete are ginvacuumcleanup and
ginbulkdelete. The README’s rule is “Vacuum never deletes tuples or pages
from the entry tree” — the key vocabulary is assumed stable. So ginbulkdelete
first flushes the pending list (ginInsertCleanup with forceCleanup), then
walks the entry-tree leaves left-to-right by right-link, stripping dead TIDs
from each leaf’s inline posting list (ginVacuumEntryPage) and recursing into
each posting tree:
// ginbulkdelete — src/backend/access/gin/ginvacuum.cif (stats == NULL){ stats = (IndexBulkDeleteResult *) palloc0(sizeof(IndexBulkDeleteResult)); ginInsertCleanup(&gvs.ginstate, !AmAutoVacuumWorkerProcess(), false, true, stats); /* drain pending list first */}/* ... walk entry leaves by rightlink ... */resPage = ginVacuumEntryPage(&gvs, buffer, rootOfPostingTree, &nRoot);blkno = GinPageGetOpaque(page)->rightlink;/* ... */for (i = 0; i < nRoot; i++) ginVacuumPostingTree(&gvs, rootOfPostingTree[i]); /* recurse into trees */Posting trees are shrunk. ginVacuumPostingTree runs in two stages
(matching the README): first ginVacuumPostingTreeLeaves strips dead TIDs from
every leaf (and notes whether any leaf went empty); only if an empty leaf
appeared does it take a cleanup lock on the posting-tree root
(LockBufferForCleanup, which waits out all pinholders, i.e. in-flight
readers) and run ginScanToDelete, a depth-first traversal that physically
unlinks empty pages:
// ginVacuumPostingTree — src/backend/access/gin/ginvacuum.cif (ginVacuumPostingTreeLeaves(gvs, rootBlkno)) /* stage 1: strip TIDs */{ /* stage 2: at least one empty page, so rescan and delete empties */ buffer = ReadBufferExtended(gvs->index, MAIN_FORKNUM, rootBlkno, RBM_NORMAL, gvs->strategy); LockBufferForCleanup(buffer); /* blocks concurrent inserts + readers */ /* ... */ ginScanToDelete(gvs, rootBlkno, true, &root, InvalidOffsetNumber);}ginScanToDelete keeps exclusive locks on the left siblings of pages on the
path under investigation, so that when a page is removed both its downlink and
its right-link can be fixed without ever taking a right-to-left lock order
(which would deadlock against concurrent ginStepRight). A deleted page is
marked with a “newest xid that might visit it” so it cannot be recycled until
every reader that may have already read a pointer to it has finished — the same
deferred-reclaim discipline as nbtree. ginvacuumcleanup finally returns the
emptied posting-tree pages to the FSM and refreshes the metapage statistics.
Position hints (as of 2026-06-05, REL_18 273fe94)
Section titled “Position hints (as of 2026-06-05, REL_18 273fe94)”| Symbol | File | Line |
|---|---|---|
ginhandler | src/backend/access/gin/ginutil.c | 38 |
ginExtractEntries | src/backend/access/gin/ginutil.c | 488 |
gininsert | src/backend/access/gin/gininsert.c | 851 |
ginEntryInsert | src/backend/access/gin/gininsert.c | 341 |
addItemPointersToLeafTuple | src/backend/access/gin/gininsert.c | 211 |
buildFreshLeafTuple | src/backend/access/gin/gininsert.c | 291 |
ginbuild | src/backend/access/gin/gininsert.c | 608 |
_gin_begin_parallel | src/backend/access/gin/gininsert.c | 922 |
ginHeapTupleFastCollect | src/backend/access/gin/ginfast.c | 483 |
ginHeapTupleFastInsert | src/backend/access/gin/ginfast.c | 219 |
ginInsertCleanup | src/backend/access/gin/ginfast.c | 780 |
shiftList | src/backend/access/gin/ginfast.c | 554 |
ginFindLeafPage | src/backend/access/gin/ginbtree.c | 83 |
ginStepRight | src/backend/access/gin/ginbtree.c | 177 |
ginTraverseLock | src/backend/access/gin/ginbtree.c | 39 |
ginPlaceToPage | src/backend/access/gin/ginbtree.c | 337 |
gingetbitmap | src/backend/access/gin/ginget.c | 1931 |
scanGetItem | src/backend/access/gin/ginget.c | 1300 |
keyGetItem | src/backend/access/gin/ginget.c | 1005 |
entryGetItem | src/backend/access/gin/ginget.c | 812 |
collectMatchBitmap | src/backend/access/gin/ginget.c | 121 |
moveRightIfItNeeded | src/backend/access/gin/ginget.c | 43 |
scanPostingTree | src/backend/access/gin/ginget.c | 69 |
scanPendingInsert | src/backend/access/gin/ginget.c | 1837 |
ginbulkdelete | src/backend/access/gin/ginvacuum.c | 564 |
ginvacuumcleanup | src/backend/access/gin/ginvacuum.c | 687 |
ginVacuumPostingTree | src/backend/access/gin/ginvacuum.c | 408 |
ginVacuumPostingTreeLeaves | src/backend/access/gin/ginvacuum.c | 345 |
ginScanToDelete | src/backend/access/gin/ginvacuum.c | 246 |
GinIsPostingTree / GIN_TREE_POSTING | src/include/access/ginblock.h | 232 / 231 |
GinPageIsIncompleteSplit | src/include/access/ginblock.h | 128 |
GinScanKeyData | src/include/access/gin_private.h | 269 |
GinScanEntryData | src/include/access/gin_private.h | 337 |
GinBtreeData | src/include/access/gin_private.h | 151 |
Source verification (as of 2026-06-05)
Section titled “Source verification (as of 2026-06-05)”Verification was done against the working tree at
/data/hgryoo/references/postgres, REL_18_STABLE, commit 273fe94852b
(“Fix off-by-one with NFC recomposition for Hangul U+11A7 (TBASE)”). Every
code excerpt above was copied from the file named in its leading
// symbol — path comment and condensed only by eliding lines (marked
/* ... */); no logic was paraphrased. The following facts were checked
directly:
- AM surface.
ginhandler(ginutil.c) setsamgettuple = NULLandamgetbitmap = gingetbitmap; confirmedgrep "amget" ginutil.c. GIN is therefore bitmap-only, as the prose states. - Posting-tree discriminator.
ginblock.hdefinesGIN_TREE_POSTING ((OffsetNumber)0xffff)andGinIsPostingTree(itup) (GinGetNPosting(itup) == GIN_TREE_POSTING), withGinGetNPostingreadingt_tid’s offset field — confirming the header-field abuse the README documents. - Fast-update toggle.
gininsertbranches onGinGetUseFastUpdate(index);fastupdatedefaults on (GinOptions/ginutil.c’s reloptions). The pending list is anchored onGinMetaPageData(head,tail,tailFreeSize,nPendingPages) inginblock.h. - Cleanup interlock.
ginInsertCleanupusesLockPage/ConditionalLockPageonGIN_METAPAGE_BLKNO(a heavyweight page lock,storage/lmgr), distinct from the buffer content lock — verified by reading the two lock calls and their comments. - VACUUM asymmetry.
ginbulkdeletecallsginInsertCleanup(..., forceCleanup = true)before walking entry leaves; there is no entry-tree page deletion (noginDeletePagecall from the entry-tree walk —ginDeletePageis reached only viaginScanToDeleteon posting trees). Confirmedgrep -n ginDeletePage ginvacuum.c. - Cleanup lock for posting-tree deletion.
ginVacuumPostingTreecallsLockBufferForCleanup(buffer)on the posting-tree root only when stage 1 reported an empty page; matches the README’s “full cleanup lock on the tree root.” - Predicate locking.
collectMatchBitmap/moveRightIfItNeededandscanPendingInsertcallPredicateLockPageon entry leaves, posting-tree roots, and the metapage respectively — the three lock targets the README’s Predicate Locking section enumerates.
One nuance worth flagging for a future reader: the README’s concurrency
pictures describe pin/lock coupling during descent and step-right, but
ginFindLeafPage in search mode (searchMode = true) does not keep parent
pins — it uses ReleaseAndReadBuffer and “may forget path to leaf,” whereas
insert mode keeps the pins. The code comment at the top of ginFindLeafPage
is the authority; the README’s “Insert” picture is the insert-mode case.
Limitations confirmed from the README and code: GIN ignores
scan->kill_prior_tuple/ignore_killed_tuples (no LP_DEAD hinting like
nbtree), and it matches entries only by equality or the partial-match range
mechanism — there is no general inequality search.
A last cross-check on the encoding claim. The README’s “Posting List
Compression” section and ginpostinglist.c agree that posting lists — both
the inline lists in entry-tree leaves and the leaf pages of a posting tree —
are stored delta-encoded with varbyte, not raw ItemPointerData arrays:
// ginCompressPostingList — src/backend/access/gin/ginpostinglist.c// item pointers arrive sorted; we store the *difference* from the previous oneresult->first = ipd[0]; /* first TID stored verbatim */prev = itemptr_to_uint64(&result->first);for (totalpacked = 1; totalpacked < nipd; totalpacked++){ uint64 val = itemptr_to_uint64(&ipd[totalpacked]); uint64 delta = val - prev; Assert(val > prev); /* ... bounds check elided ... */ encode_varbyte(delta, &ptr); /* small deltas => fewer bytes */ prev = val;}encode_varbyte emits 7 payload bits per byte with the high bit as a
continuation flag, so a TID one block away from its predecessor costs a byte or
two instead of six. This is the on-disk reason a hot key’s million-TID posting
list stays compact, and it is why GinItupIsCompressed(itup) exists: an entry
leaf may carry either the compressed form or (for backward compatibility) a
raw array, and the reader branches on the flag.
Beyond PostgreSQL — Comparative Designs & Research Frontiers
Section titled “Beyond PostgreSQL — Comparative Designs & Research Frontiers”-
Lucene / classic IR inverted files. GIN’s entry-tree-plus-posting-list shape is the relational-engine reflection of a Lucene segment: a term dictionary (Lucene’s FST; GIN’s entry B-tree) over distinct keys, each pointing at a postings stream (Lucene’s
.doc/.posfiles; GIN’s inline list or posting tree). The varbyte delta encoding here is the same family as Lucene’sPForDelta/FORblock codecs — both exploit sorted doc-IDs and small gaps. The sharpest divergence is the fast-update pending list: Lucene is segment-immutable and merges whole segments, whereas GIN mutates one main index in place and keeps a metapage-anchored side log. A note mapping GIN’sginInsertCleanupagainst Lucene’s segment-merge policy would frame the in-place-vs-immutable trade directly. -
Bitmap and variant indexes (O’Neil & Quass 1997). The “one key occurs in many rows” workload GIN targets is exactly the low-cardinality case bitmap indexes were built for, and GIN’s consistent-function conjunction is a posting-list intersection where a bitmap index would AND two bitmaps. The Variant Indexes paper (
dbms-papers/variant-index.md) — bit-sliced and projection indexes, and the bitmap-vs-list space crossover — is the right lens on why GIN compresses long lists into varbyte runs rather than materializing a bitmap: GIN’s TID space is sparse and unbounded (heap can grow), so a gap-coded list dominates a dense bitmap until the key is nearly universal. -
RUM (the GIN successor extension). The
rumaccess method (a third-party extension, out of core scope here, named only as an example) extends GIN’s posting entries to carry additional attributes — lexeme positions, a timestamp, a distance — so that ordering operators (tsvector <=> tsquery, ranking) can be answered from the index without a heap fetch. GIN deliberately stores only TIDs in its postings; comparing RUM’s fatter-posting design against GIN’s TID-only list quantifies what GIN’s bitmap-onlyamgetbitmapinterface gives up versus an index that can return ordered, scored results. -
The pending-list / LSM analogy. GIN’s fast-update path — buffer unordered inserts in a side log, merge them in bulk later — is structurally a one-level log-structured merge: the pending list is the in-place “L0” and
ginInsertCleanupis the compaction into the sorted main index. Unlike a real LSM, GIN has exactly one merge target and reads the pending list linearly on every scan, so it pays an unbounded read tax if cleanup falls behind (thegin_pending_list_limit/ autovacuum trigger bounds this). Positioning GIN against RocksDB-style multi-level LSMs would clarify why GIN keeps a single unsorted buffer rather than a tiered, sorted run hierarchy. -
Partial match and prefix search vs. trigram / suffix structures. GIN’s
extractQuerypartial-match mode (collectMatchBitmapwalking an entry-key range) gives prefix and range matching essentially for free because the entry tree is ordered, which is howpg_trgmandLIKE-acceleration opclasses ride on GIN. A comparison against dedicated suffix-array / FM-index text structures would frame what GIN’s general-purpose ordered entry tree gives up in exchange for serving arrays, jsonb, and tsvector from one mechanism. -
Concurrency: L&Y entry tree without page deletion. GIN’s entry tree is a Lehman-Yao B-tree (right-links, step-right recovery) but — unlike nbtree — it never deletes entry pages, because the key vocabulary is assumed stable (
postgres-nbtree.mdowns the full L&Y deletion protocol). The research question this raises is bloat: a GIN index over a column whose key set churns (keys appear then permanently vanish) accumulates dead entry leaves that only a fullREINDEXreclaims. Contrasting this with nbtree’s_bt_pagedelhalf-dead/unlink machinery isolates the cost of GIN’s “vocabulary is forever” assumption.
Sources
Section titled “Sources”In-tree READMEs and source files (REL_18_STABLE, commit 273fe94)
Section titled “In-tree READMEs and source files (REL_18_STABLE, commit 273fe94)”src/backend/access/gin/README— the design document: inverted-index basics, the generalized strategy-function contract, entry-tree vs. posting list vs. posting tree layout, the fast-update pending list and cleanup, posting-list (varbyte) compression, concurrency and predicate locking, and the “no entry-page deletion” limitation.src/include/access/ginblock.h—GinMetaPageData(pending-list anchor:head,tail,tailFreeSize,nPendingPages),GinPageOpaqueData, theGIN_TREE_POSTINGsentinel andGinIsPostingTree/GinGetNPostingheader-field accessors,GinItupIsCompressed.src/include/access/gin_private.h—GinState,GinBtreeData, scan-key and scan-entry structs (GinScanKey,GinScanEntry),GinBtreeStack.src/backend/access/gin/ginutil.c—ginhandler(the AM surface:amgetbitmap = gingetbitmap,amgettuple = NULL), reloptions (fastupdate,gin_pending_list_limit),GinInitMetabuffer.src/backend/access/gin/gininsert.c—gininsert,ginEntryInsert,ginBuildCallback, the build-time accumulator, the fast-update branch onGinGetUseFastUpdate.src/backend/access/gin/ginget.c—gingetbitmap,startScan,collectMatchBitmap,keyGetItem,entryGetItem,scanPendingInsert, the consistent-function driver.src/backend/access/gin/ginfast.c—ginHeapTupleFastInsert,ginInsertCleanup,makeSublist,shiftList,writeListPage, the metapageLockPage/ConditionalLockPageinterlock.src/backend/access/gin/ginbtree.c—ginFindLeafPage,ginStepRight,ginFinishSplit,ginInsertValue— the shared L&Y descent/step-right used by both the entry tree and posting trees.src/backend/access/gin/ginvacuum.c—ginbulkdelete,ginVacuumPostingTree,ginVacuumEntryPage,ginScanToDelete/ginDeletePage(posting-tree page reclamation under a cleanup lock).src/backend/access/gin/ginpostinglist.c—ginCompressPostingList,encode_varbyte/decode_varbyte, the delta+varbyte posting codec.
Papers and textbook chapters
Section titled “Papers and textbook chapters”- Database System Concepts (Silberschatz, Korth, Sudarshan, 7e), ch. 21
“Information Retrieval” — the inverted-index / posting-list model and
conjunctive-query intersection that GIN implements
(
knowledge/research/dbms-general/database-system-concepts.md). - O’Neil, P. & Quass, D. (1997). “Improved Query Performance with Variant
Indexes.” SIGMOD 1997. Bitmap / bit-sliced / projection indexes and the
bitmap-vs-list space trade-off (
dbms-papers/variant-index.md). - Lehman, P. & Yao, S. B. (1981). “Efficient Locking for Concurrent
Operations on B-Trees.” ACM TODS 6(4) — the right-link/step-right algorithm
GIN’s entry tree and posting trees reuse (text in
dbms-papers/btree.md; full deviation analysis inpostgres-nbtree.md). - Database Internals (Petrov 2019), ch. 2-6 — B-tree concurrency, page
layout, and WAL framing background (
knowledge/research/dbms-general/).
Sibling docs (cross-references — mechanism owned there, not duplicated here)
Section titled “Sibling docs (cross-references — mechanism owned there, not duplicated here)”postgres-index-am.md— theIndexAmRoutinedispatch contract thatginhandlerfills in (build, insert,amgetbitmap, bulkdelete, vacuum).postgres-nbtree.md— the canonical Lehman-Yao B-tree: the descent, step-right recovery, and the page-deletion protocol GIN’s entry tree deliberately omits.postgres-full-text-search.md— thetsvector/tsqueryopclasses and theextractValue/extractQuery/consistentstrategy functions GIN drives; the text-search semantics live there.postgres-buffer-manager.md— buffer pins and content-lock (LWLock) latching underneath GIN’s page locks and the metapageLockPageinterlock.postgres-ssi-predicate-locking.md— the predicate-lock machinerycollectMatchBitmap/scanPendingInsertfeed viaPredicateLockPage.postgres-vacuum.md/postgres-autovacuum.md— the VACUUM driver that invokesginbulkdeleteand the autovacuum trigger that forces pending-list cleanup.