Skip to content

PostgreSQL GIN — Generalized Inverted Index, Posting Trees, and the Fast-Update List

Contents:

A secondary index in the textbook sense maps one indexed value to the set of rows that carry it. For an int column the mapping is one-to-one-ish: a B-tree leaf entry holds a key and a single (or small) set of row pointers, and the access method (AM) treats the column value as opaque and orders it by a single comparison operator. That model breaks the moment the indexed column is composite — an array {4, 7, 12}, a tsvector of a hundred lexemes, a jsonb document with dozens of paths — and the queries you want to accelerate are not “find rows where the array equals {4,7,12}” but “find rows whose array contains 7”, “whose document matches the full-text query fox & quick”, “whose jsonb @> '{"a":1}'”. The value as a whole is the wrong indexing granularity; the elements inside it are what queries probe.

The data structure that fits this shape is the inverted index, the workhorse of information retrieval. Database System Concepts (Silberschatz 7e, ch. 21 on information retrieval) describes it directly: “An inverted index stores for each word a list of all document [identifiers] in which the word occurs.” Each distinct term (word, lexeme, array element — generically a key) maps to a posting list: the sorted set of document identifiers (here, heap tuple IDs) that contain that key. A conjunctive query fox & quick is answered by intersecting the posting lists of fox and quick; a disjunctive query unions them. The index is “inverted” because it inverts the natural document-to-terms relationship into a terms-to-documents one. This is exactly the structure a search engine builds over a web corpus, and it is Database System Concepts’ “inverted index” of ch. 21 — the same idea relational engines borrow to index array/document/text columns.

Two properties of real corpora make the inverted index a non-trivial engineering problem rather than a hash table of lists:

  1. Skew. Term frequencies follow a Zipf distribution: a handful of keys (the, and) occur in nearly every document, while most keys are rare. A posting list can therefore range from one TID to millions. A single fixed container — “store the list inline next to the key” — is wrong for both ends: it wastes space on rare keys and overflows on common ones. A good inverted index needs an adaptive container that is compact when the list is short and degrades gracefully into a paged structure when it is long.

  2. Write amplification on insertion. Inserting one document with k terms means touching k posting lists, scattered across the key space. Each touch is a B-tree descent and a page rewrite. Indexing a stream of documents this way is brutally slow because the same hot keys are re-written over and over. The classic IR answer is batching: accumulate postings in memory (or a side structure), sort them by key, and merge them into the main index in bulk so each hot key’s list is rewritten once per batch rather than once per document.

The third property is generality. A relational engine cannot bake in “this is text” — the same machinery must serve integer[], text[], tsvector, jsonb, hstore, trigram indexes, and user-defined types. So the index must be parameterized by a small set of strategy functions supplied per data type: one to extract the keys from a stored value, one to extract the keys from a query (plus a search mode: must-match-all, match-any, everything), and one — the consistent function — that, given which of the query keys were found for a candidate row, decides whether the row actually satisfies the operator. This is the same “the index does not know which operator it accelerates” philosophy as GiST, expressed for inverted indexes. The GIN README states it plainly: “Generalized means that the index does not know which operation it accelerates. It instead works with custom strategies, defined for specific data types.”

Search engines (Lucene/Elasticsearch, Solr) and full-text features bolted onto relational engines (Oracle Text, SQL Server Full-Text, MySQL FULLTEXT) all build on the inverted-index skeleton, and they converge on a recognizable set of design decisions:

  • Dictionary + postings split. A term dictionary (often itself a B-tree, a sorted array, or an FST as in Lucene) maps each key to the location of its posting list. The postings themselves live separately, because their size distribution is so different from the keys’. Lucene stores the dictionary in a .tim/.tip file and postings in a .doc file; a relational engine typically stores the dictionary as a B-tree whose leaves point at posting storage.

  • Delta + variable-byte compression of postings. Because a posting list is a sorted list of monotonically increasing document IDs, you store the differences (gaps) between consecutive IDs rather than the IDs themselves, then encode the small gaps with a variable-length integer scheme (varbyte, Simple-9, PForDelta, Roaring bitmaps for dense lists). Sorted-gap + varbyte is the canonical baseline and shrinks dense posting lists several-fold.

  • Segment / batch architecture with background merge. Lucene is the archetype: writes go into a small in-memory segment, which is flushed to an immutable on-disk segment; a background merge policy periodically combines small segments into larger ones. Deletes are tombstones applied at merge time, not in place. This makes ingestion append-mostly and pushes the expensive sorted merge off the write path. The “never modify a posting list in place; rebuild it in bulk” instinct is universal.

  • No in-place delete of dictionary terms. Vocabulary churns slowly: the set of distinct words in a corpus is nearly stable after the corpus grows past a threshold. Engines exploit this by never bothering to remove a term from the dictionary even when its posting list empties — the cost of the empty entry is trivial compared to the concurrency-control machinery a dictionary delete would require.

  • Query as multi-stream merge + a decision function. A boolean query over several terms opens one posting cursor per term, advances them in document-ID order, and at each candidate document evaluates the boolean expression (AND/OR/NOT, possibly with proximity or scoring). The cursors are advanced with a “skip to document ≥ X” primitive so that intersecting a rare term with a common one skips most of the common term’s list. Skip lists / skip pointers inside the postings make this cheap.

flowchart TD
  doc["Indexable item<br/>array / tsvector / jsonb"] -->|extract keys| keys["keys: k1, k2, k3 ..."]
  keys --> dict["Term dictionary<br/>(B-tree of keys)"]
  dict -->|per key| pl["Posting list<br/>sorted, delta+varbyte compressed<br/>doc IDs / heap TIDs"]
  q["Query operator<br/>e.g. arr @> '{7}'"] -->|extract query keys + searchMode| qkeys["query keys + which-must-match"]
  qkeys --> cursors["Per-key posting cursors"]
  cursors -->|merge in doc-ID order| cand["Candidate doc"]
  cand -->|consistent / boolean fn| match{"matches?"}
  match -->|yes| out["emit doc ID"]
  match -->|no| cursors

Where engines diverge is how aggressively they batch and where the latency lands. A dedicated search engine accepts eventual consistency: a freshly indexed document is searchable only after a refresh/commit. A relational engine indexing a column must keep the index transactionally consistent with the heap — a row inserted in a transaction must be found by that transaction’s later scans — which constrains how far it can push batching off the write path. The compromise, as we’ll see, is to batch into a durable side structure that scans also read, rather than into a volatile buffer.

GIN — Generalized Inverted index, which the README insists “should be considered as a genie, not a drink” — is PostgreSQL’s inverted index AM, originally by Teodor Sigaev and Oleg Bartunov. It is an ordinary entry in the pg_am catalog dispatched through the IndexAmRoutine table (see postgres-index-am.md); ginhandler (in ginutil.c) fills that table. The two fields that define GIN’s personality are amgettuple = NULL and amgetbitmap = gingetbitmap:

// ginhandler — src/backend/access/gin/ginutil.c
amroutine->amoptionalkey = true;
amroutine->amcanreturn = NULL;
// ... condensed ...
amroutine->amgettuple = NULL; /* GIN cannot do one-tuple-at-a-time */
amroutine->amgetbitmap = gingetbitmap; /* only bitmap scans */

GIN supports only bitmap scans, never amgettuple. The README explains the deep reason: because of the pending list (below), the same heap TID can be visited twice during one scan, so GIN cannot promise the duplicate-free, ordered, resumable stream that amgettuple requires. It re-sets the same bit in a TIDBitmap instead, which is idempotent. This is why EXPLAIN of a GIN query always shows a Bitmap Index Scan.

GIN’s structure has four kinds of pages, all sharing the standard PageHeader plus a GIN-specific opaque footer (GinPageOpaque, holding rightlink and flags):

  1. Metapage (block 0) — GinMetaPageData: head/tail of the pending list, page/tuple counts, and statistics.
  2. Entry tree — a Lehman-Yao B-tree (right-links, no left-links: GIN never scans backward) whose keys are the extracted key data and whose leaf tuples carry either an inline posting list or a posting-tree downlink.
  3. Posting trees — secondary B-trees keyed by ItemPointer, created when a key’s posting list outgrows an entry-tree page.
  4. Pending list pages — the fast-update side structure.
flowchart TD
  meta["Metapage (blk 0)<br/>head/tail of pending list<br/>nPendingPages, stats"]
  meta -.->|GinGetUseFastUpdate| pl0["pending page<br/>(unsorted, 1 TID per entry)"]
  pl0 -->|rightlink| pl1["pending page"]
  pl1 -->|rightlink| pl2["pending tail page"]
  meta --> root["Entry-tree root<br/>(L&Y B-tree of keys)"]
  root --> e1["entry leaf: key 'fox'<br/>inline posting list (compressed TIDs)"]
  root --> e2["entry leaf: key 'the'<br/>GinIsPostingTree -> downlink"]
  e2 --> pt["Posting-tree root<br/>(B-tree keyed by ItemPointer)"]
  pt --> ptl1["posting leaf<br/>segments of compressed TIDs"]
  pt --> ptl2["posting leaf<br/>segments of compressed TIDs"]

Key entries, not items. GIN never stores the indexed value as a whole. On insert it calls the opclass extractValueFn to break the value into a sorted, de-duplicated array of keys, each with a null category; ginExtractEntries (in ginutil.c) wraps that and synthesizes a placeholder key for the otherwise-keyless cases (a NULL value, or an empty array) so that even those rows have some index entry — without it, full-index scans and IS NULL style queries would silently miss rows:

// ginExtractEntries — src/backend/access/gin/ginutil.c
if (isNull) /* NULL item -> placeholder */
{
*nentries = 1;
(*categories)[0] = GIN_CAT_NULL_ITEM;
return entries;
}
entries = DatumGetPointer(FunctionCall3Coll(&ginstate->extractValueFn[attnum - 1],
ginstate->supportCollation[attnum - 1],
value, PointerGetDatum(nentries),
PointerGetDatum(&nullFlags)));
if (entries == NULL || *nentries <= 0) /* empty array etc. -> placeholder */
{
*nentries = 1;
(*categories)[0] = GIN_CAT_EMPTY_ITEM;
return entries;
}
/* if there's more than one key, sort and unique-ify */

The three null categories — GIN_CAT_NORM_KEY (an ordinary key), GIN_CAT_NULL_KEY (a genuine NULL element inside an item), GIN_CAT_EMPTY_ITEM/GIN_CAT_NULL_ITEM (placeholders for keyless/null items) — are treated as distinct keys by the entry-tree comparator, so they sort and merge just like real keys.

Posting list → posting tree, the adaptive container. A leaf entry’s TID list lives inline in the index tuple as long as it fits within GinMaxItemSize (roughly a third of a page). The TIDs are stored compressed: the README describes the scheme — combine each (block, offset) into a 43-bit integer, delta-encode against the previous TID, then varbyte-encode the gaps. When a merge would push the inline list over the size limit, GIN converts the entry to a posting tree: a separate B-tree, keyed by ItemPointer, whose leaves hold compressed posting-list segments. The leaf entry then stores only the posting-tree root block number, flagged by the magic offset GIN_TREE_POSTING (0xffff) so GinIsPostingTree(itup) can tell the two cases apart by reading the tuple header. This is GIN’s answer to Zipf skew: rare keys pay one inline tuple, the word the gets a whole B-tree.

Fast update: batch the writes. Retail insertion of one heap row touches one entry per extracted key, each a full entry-tree descent — exactly the write-amplification problem from the theory section. GIN’s opt-in remedy (fastupdate, on by default) is the pending list: gininsert appends the new entries, unsorted, to a singly-linked list of pages anchored on the metapage, and returns. The expensive merge into the entry tree is deferred to ginInsertCleanup, which runs when the list exceeds gin_pending_list_limit, at the next VACUUM, or on an explicit gin_clean_pending_list(). The win is exactly IR’s batching win: thousands of new entries are sorted once and merged into each hot key’s list once, instead of thousands of separate descents. The cost is that every scan must read the pending list linearly first — so the list must be kept small. With fastupdate=off, gininsert calls ginHeapTupleInsertginEntryInsert directly for each key.

flowchart TD
  ins["gininsert(row)"] --> fu{"GinGetUseFastUpdate?"}
  fu -->|yes| collect["ginHeapTupleFastCollect<br/>extract keys -> collector"]
  collect --> fast["ginHeapTupleFastInsert<br/>append to pending list (metapage)"]
  fast --> chk{"pending list ><br/>gin_pending_list_limit?"}
  chk -->|yes| clean["ginInsertCleanup<br/>(merge into entry tree)"]
  chk -->|no| done1["return"]
  fu -->|no| retail["per key: ginHeapTupleInsert<br/>-> ginEntryInsert (descend + merge)"]
  retail --> done2["return"]
  clean --> done1

Scan = pending list, then main index, merged through consistent(). A GIN scan (gingetbitmap) first calls scanPendingInsert to scan the pending list and then scanGetItem/keyGetItem to walk the main index, both feeding the same TIDBitmap. The query is decomposed by extractQueryFn into one GinScanEntry per query key (with a searchMode and per-entry partial-match flag), grouped under GinScanKeys (one per qual). Each entry is a cursor over its key’s TIDs; keyGetItem merges the cursors in TID order and, at each candidate TID, invokes the opclass consistent function with a boolean vector of which entries were present — the consistent function encodes the operator semantics (@> needs all query keys; && needs any). Entries are classified required vs additional: at least one required entry must be present, which lets the merge skip TIDs cheaply.

The GIN code splits cleanly along the four files this doc targets: gininsert.c (insertion + build), ginfast.c (the pending list and its cleanup), ginbtree.c (the shared tree-navigation engine for both the entry tree and posting trees), and ginget.c (scanning). ginvacuum.c rounds out the lifecycle. We follow the call flows.

Retail insertion: gininsert → ginEntryInsert → the adaptive container

Section titled “Retail insertion: gininsert → ginEntryInsert → the adaptive container”

gininsert is the AM entry point. Its only branch is fast-update vs retail:

// gininsert — src/backend/access/gin/gininsert.c
if (GinGetUseFastUpdate(index))
{
GinTupleCollector collector;
memset(&collector, 0, sizeof(GinTupleCollector));
for (i = 0; i < ginstate->origTupdesc->natts; i++)
ginHeapTupleFastCollect(ginstate, &collector,
(OffsetNumber) (i + 1),
values[i], isnull[i], ht_ctid);
ginHeapTupleFastInsert(ginstate, &collector); /* -> pending list */
}
else
{
for (i = 0; i < ginstate->origTupdesc->natts; i++)
ginHeapTupleInsert(ginstate, (OffsetNumber) (i + 1),
values[i], isnull[i], ht_ctid); /* -> entry tree */
}

The retail path lands in ginEntryInsert, the heart of GIN’s posting-list-or-tree logic. It descends the entry tree with ginFindLeafPage, then takes one of three actions depending on what it finds at the key:

// ginEntryInsert — src/backend/access/gin/gininsert.c
stack = ginFindLeafPage(&btree, false, false);
page = BufferGetPage(stack->buffer);
if (btree.findItem(&btree, stack)) /* key already present */
{
itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, stack->off));
if (GinIsPostingTree(itup)) /* (1) already a posting tree */
{
BlockNumber rootPostingTree = GinGetPostingTree(itup);
LockBuffer(stack->buffer, GIN_UNLOCK);
freeGinBtreeStack(stack);
ginInsertItemPointers(ginstate->index, rootPostingTree,
items, nitem, buildStats); /* into posting tree */
return;
}
CheckForSerializableConflictIn(ginstate->index, NULL,
BufferGetBlockNumber(stack->buffer));
itup = addItemPointersToLeafTuple(ginstate, itup, /* (2) grow inline list */
items, nitem, buildStats, stack->buffer);
insertdata.isDelete = true;
}
else
{
CheckForSerializableConflictIn(ginstate->index, NULL,
BufferGetBlockNumber(stack->buffer));
itup = buildFreshLeafTuple(ginstate, attnum, key, category, /* (3) new key */
items, nitem, buildStats, stack->buffer);
if (buildStats)
buildStats->nEntries++;
}
insertdata.entry = itup;
ginInsertValue(&btree, stack, &insertdata, buildStats);

Case (2), addItemPointersToLeafTuple, is where the container transition happens. It merges the old and new TID lists (ginMergeItemPointers), re-compresses, and tries to re-form an inline tuple. If the compressed list no longer fits (compressedList == NULL from ginCompressPostingList with the GinMaxItemSize ceiling), it promotes to a posting tree:

// addItemPointersToLeafTuple — src/backend/access/gin/gininsert.c
newItems = ginMergeItemPointers(items, nitem, oldItems, oldNPosting, &newNPosting);
compressedList = ginCompressPostingList(newItems, newNPosting, GinMaxItemSize, NULL);
if (compressedList)
res = GinFormTuple(ginstate, attnum, key, category,
(char *) compressedList,
SizeOfGinPostingList(compressedList), newNPosting, false);
if (!res)
{
/* posting list would be too big, convert to posting tree */
BlockNumber postingRoot;
postingRoot = createPostingTree(ginstate->index, oldItems, oldNPosting,
buildStats, buffer);
ginInsertItemPointers(ginstate->index, postingRoot, items, nitem, buildStats);
res = GinFormTuple(ginstate, attnum, key, category, NULL, 0, 0, true);
GinSetPostingTree(res, postingRoot); /* leaf tuple now holds a downlink */
}

buildFreshLeafTuple (case 3) is the same logic for a brand-new key: try inline first, fall back to a fresh posting tree if the initial list is already too large.

Both the entry tree and every posting tree are driven by one generic engine parameterized by a GinBtreeData vtable (findChildPage, isMoveRight, beginPlaceToPage, execPlaceToPage, prepareDownlink, fillRoot). ginFindLeafPage is the descent. It latch-couples one page at a time and handles two L&Y conditions on the way down — finishing any incomplete split it lands on, and stepping right when a concurrent split has moved the keyspace:

// ginFindLeafPage — src/backend/access/gin/ginbtree.c
access = ginTraverseLock(stack->buffer, searchMode);
/* If we're going to modify the tree, finish any incomplete splits we meet */
if (!searchMode && GinPageIsIncompleteSplit(page))
ginFinishOldSplit(btree, stack, NULL, access);
/* root never has a right link, so small optimization */
while (btree->fullScan == false && stack->blkno != btree->rootBlkno &&
btree->isMoveRight(btree, page))
{
BlockNumber rightlink = GinPageGetOpaque(page)->rightlink;
if (rightlink == InvalidBlockNumber)
break; /* rightmost page */
stack->buffer = ginStepRight(stack->buffer, btree->index, access);
stack->blkno = rightlink;
page = BufferGetPage(stack->buffer);
}

ginStepRight is the L&Y move-right primitive and the linchpin of GIN’s concurrency-vs-VACUUM interlock: it locks the next page before releasing the current one, so a concurrent page deletion cannot unlink a page out from under a reader that is about to follow the right-link:

// ginStepRight — src/backend/access/gin/ginbtree.c
nextbuffer = ReadBuffer(index, blkno);
LockBuffer(nextbuffer, lockmode);
UnlockReleaseBuffer(buffer); /* release only AFTER taking next lock */
page = BufferGetPage(nextbuffer);
if (isLeaf != GinPageIsLeaf(page) || isData != GinPageIsData(page))
elog(ERROR, "right sibling of GIN page is of different type");

ginPlaceToPage performs the actual insertion or split. Its three-state protocol — GPTP_NO_WORK / GPTP_INSERT / GPTP_SPLIT — mirrors nbtree’s atomic-split design: when an insert to an internal page completes the posting of a downlink, it clears the child’s GIN_INCOMPLETE_SPLIT flag in the same WAL action, so a crash mid-split leaves a tree that searches can still traverse via right-links until the next writer finishes the split.

// ginPlaceToPage — src/backend/access/gin/ginbtree.c
rc = btree->beginPlaceToPage(btree, stack->buffer, stack, insertdata,
updateblkno, &ptp_workspace, &newlpage, &newrpage);
if (rc == GPTP_INSERT)
{
START_CRIT_SECTION();
btree->execPlaceToPage(btree, stack->buffer, stack, insertdata,
updateblkno, ptp_workspace);
if (BufferIsValid(childbuf)) /* downlink insert finishes child's split */
{
GinPageGetOpaque(childpage)->flags &= ~GIN_INCOMPLETE_SPLIT;
MarkBufferDirty(childbuf);
}
/* ... WAL ... */
}

ginHeapTupleFastCollect is the collect half: it extracts keys (same ginExtractEntries as retail) and forms pending-format tuples, which differ from leaf tuples in that there is no posting list — the single heap TID is stuffed straight into t_tid:

// ginHeapTupleFastCollect — src/backend/access/gin/ginfast.c
entries = ginExtractEntries(ginstate, attnum, value, isNull, &nentries, &categories);
for (i = 0; i < nentries; i++)
{
IndexTuple itup = GinFormTuple(ginstate, attnum, entries[i], categories[i],
NULL, 0, 0, true);
itup->t_tid = *ht_ctid; /* one TID per pending entry */
collector->tuples[collector->ntuples++] = itup;
collector->sumsize += IndexTupleSize(itup);
}

ginHeapTupleFastInsert appends the collected tuples to the metapage-anchored list. If the batch is bigger than a page (or the tail page lacks room), it builds a separate sublist and splices it onto the tail; otherwise it appends to the tail page. The README’s GIN_LIST_FULLROW flag bit records whether a page holds complete heap-tuple rows or a fragment, which the scan side relies on. Note the metapage locking dance that keeps concurrency high — it drops the metapage lock while building the sublist, then re-acquires it:

// ginHeapTupleFastInsert — src/backend/access/gin/ginfast.c
if (collector->sumsize + collector->ntuples * sizeof(ItemIdData) > GinListPageSize)
separateList = true;
else
{
LockBuffer(metabuffer, GIN_EXCLUSIVE);
metadata = GinPageGetMeta(metapage);
if (metadata->head == InvalidBlockNumber ||
collector->sumsize + collector->ntuples * sizeof(ItemIdData) > metadata->tailFreeSize)
{
separateList = true;
LockBuffer(metabuffer, GIN_UNLOCK); /* unlock to keep concurrency */
}
}

ginInsertCleanup is the merge half — the function that pays back the deferred write cost. It guards against concurrent cleanup with a LockPage on the metapage (a heavyweight page lock, not a buffer lock, so concurrent inserts to the pending list are still allowed), then walks the pending pages, accumulating entries into a BuildAccumulator (a red-black tree keyed by (attnum, key)), and flushes to the entry tree via ginEntryInsert once memory fills or it reaches the list end:

// ginInsertCleanup — src/backend/access/gin/ginfast.c
if (forceCleanup) /* from VACUUM / gin_clean_pending_list */
LockPage(index, GIN_METAPAGE_BLKNO, ExclusiveLock); /* wait out rivals */
else if (!ConditionalLockPage(index, GIN_METAPAGE_BLKNO, ExclusiveLock))
return; /* a rival is already cleaning; bail */
/* ... */
processPendingPage(&accum, &datums, page, FirstOffsetNumber);
/* when memory is full or list end reached: */
ginBeginBAScan(&accum);
while ((list = ginGetBAEntry(&accum, &attnum, &key, &category, &nlist)) != NULL)
ginEntryInsert(ginstate, attnum, key, category, list, nlist, NULL);
shiftList(index, metabuffer, blkno, fill_fsm, stats); /* unlink merged pages */

The BuildAccumulator is the batching payoff: every pending entry for the same key is gathered before a single ginEntryInsert, so a hot key’s posting list is rewritten once per cleanup rather than once per row. shiftList then unlinks the now-merged head pages from the metapage list (in batches of GIN_NDELETE_AT_ONCE) and frees them to the FSM.

gingetbitmap is the AM scan entry. The ordering is load-bearing — pending list before main index — and the comment explains why a TID may be visited twice (and why that is harmless for a bitmap, but fatal for amgettuple):

// gingetbitmap — src/backend/access/gin/ginget.c
ginNewScanKey(scan); /* extractQuery -> GinScanKey/GinScanEntry */
if (GinIsVoidRes(scan)) return 0;
scanPendingInsert(scan, tbm, &ntids); /* pending list FIRST */
startScan(scan);
ItemPointerSetMin(&iptr);
for (;;)
{
if (!scanGetItem(scan, iptr, &iptr, &recheck))
break;
if (ItemPointerIsLossyPage(&iptr))
tbm_add_page(tbm, ItemPointerGetBlockNumber(&iptr)); /* whole page */
else
tbm_add_tuples(tbm, &iptr, 1, recheck);
ntids++;
}

scanGetItem advances all the scan keys in lock-step to find the next TID that satisfies every key; it uses the “nothing < this matches, so skip ahead” optimization to leapfrog. Per key, keyGetItem finds the minimum TID

advancePast among the key’s requiredEntries, then loads the additionalEntries for that TID and calls the consistent function:

// keyGetItem — src/backend/access/gin/ginget.c
ItemPointerSetMax(&minItem);
for (i = 0; i < key->nrequired; i++)
{
entry = key->requiredEntries[i];
if (entry->isFinished) continue;
if (ginCompareItemPointers(&entry->curItem, &advancePast) <= 0)
{
entryGetItem(ginstate, entry, advancePast); /* advance this cursor */
if (entry->isFinished) continue;
}
allFinished = false;
if (ginCompareItemPointers(&entry->curItem, &minItem) < 0)
minItem = entry->curItem; /* smallest required TID = next candidate */
}

Each entry’s cursor is fed by entryGetItem, which pulls TIDs from either the inline posting list or the entry’s posting tree (advancing page by page), or from a pre-collected matchBitmap for partial-match / full-scan entries.

Partial-match and full scans: collectMatchBitmap. For a partial-match entry (e.g. a LIKE 'foo%' accelerated by pg_trgm, or a range), GIN cannot point at a single key — it must scan a range of the entry tree and union all their posting lists into one bitmap. collectMatchBitmap walks entry-tree leaves left-to-right via moveRightIfItNeeded, calling the opclass comparePartialFn to decide where the range ends, and folds each matching entry’s TIDs (inline or via scanPostingTree) into scanEntry->matchBitmap:

// collectMatchBitmap — src/backend/access/gin/ginget.c
if (scanEntry->isPartialMatch)
{
if (icategory != GIN_CAT_NORM_KEY) /* partial match never matches nulls */
return true;
cmp = DatumGetInt32(FunctionCall4Coll(&btree->ginstate->comparePartialFn[attnum - 1],
btree->ginstate->supportCollation[attnum - 1],
scanEntry->queryKey, idatum,
UInt16GetDatum(scanEntry->strategy),
PointerGetDatum(scanEntry->extra_data)));
if (cmp > 0) return true; /* past the range: stop */
else if (cmp < 0) { stack->off++; continue; } /* before the range: skip */
/* cmp == 0: this entry is in range -> fold its TIDs into matchBitmap */
}

moveRightIfItNeeded also acquires a predicate lock on each entry leaf it steps onto, which is how GIN supports serializable isolation (the README’s predicate-locking notes: lock the leaf “gaps”, lock posting-tree roots for equality on a tree-backed key, and always lock the metapage to interlock with the pending list). SSI details belong to postgres-ssi-predicate-locking.md.

Pending-list scan: scanPendingInsert. Before the main index, GIN scans the pending list row-by-row. Because pending entries are unsorted and one row’s entries may span pages, it reads a whole heap row’s entries (collectMatchesForHeapRow), fills the entryRes/hasMatchKey arrays, and calls each key’s boolConsistentFn directly:

// scanPendingInsert — src/backend/access/gin/ginget.c
PredicateLockPage(scan->indexRelation, GIN_METAPAGE_BLKNO, scan->xs_snapshot);
LockBuffer(metabuffer, GIN_SHARE);
blkno = GinPageGetMeta(page)->head;
/* ... */
while (scanGetCandidate(scan, &pos)) /* one heap row at a time */
{
if (!collectMatchesForHeapRow(scan, &pos)) continue;
match = true;
for (i = 0; i < so->nkeys; i++)
{
GinScanKey key = so->keys + i;
if (!key->boolConsistentFn(key)) { match = false; break; }
recheck |= key->recheckCurItem;
}
if (match) { tbm_add_tuples(tbm, &pos.item, 1, recheck); (*ntids)++; }
}

ginbuild initializes the metapage and a leaf root, then scans the heap. Rather than retail-insert each row, it accumulates postings in a BuildAccumulator in maintenance_work_mem and flushes batches via ginEntryInsert — the same bulk-merge mechanism ginInsertCleanup uses, which is precisely why the README calls bulk insertion “much faster than retail insertion.” PG 18 can also drive the build with parallel workers (_gin_begin_parallel), where workers extract and tuplesort GinTuples and the leader merges the sorted stream.

// ginbuild — src/backend/access/gin/gininsert.c
MetaBuffer = GinNewBuffer(index);
RootBuffer = GinNewBuffer(index);
GinInitMetabuffer(MetaBuffer);
GinInitBuffer(RootBuffer, GIN_LEAF);
/* ... */
buildstate.accum.ginstate = &buildstate.ginstate;
ginInitBA(&buildstate.accum);
if (indexInfo->ii_ParallelWorkers > 0)
_gin_begin_parallel(state, heap, index, indexInfo->ii_Concurrent,
indexInfo->ii_ParallelWorkers);

GIN’s amvacuumcleanup/ambulkdelete are ginvacuumcleanup and ginbulkdelete. The README’s rule is “Vacuum never deletes tuples or pages from the entry tree” — the key vocabulary is assumed stable. So ginbulkdelete first flushes the pending list (ginInsertCleanup with forceCleanup), then walks the entry-tree leaves left-to-right by right-link, stripping dead TIDs from each leaf’s inline posting list (ginVacuumEntryPage) and recursing into each posting tree:

// ginbulkdelete — src/backend/access/gin/ginvacuum.c
if (stats == NULL)
{
stats = (IndexBulkDeleteResult *) palloc0(sizeof(IndexBulkDeleteResult));
ginInsertCleanup(&gvs.ginstate, !AmAutoVacuumWorkerProcess(),
false, true, stats); /* drain pending list first */
}
/* ... walk entry leaves by rightlink ... */
resPage = ginVacuumEntryPage(&gvs, buffer, rootOfPostingTree, &nRoot);
blkno = GinPageGetOpaque(page)->rightlink;
/* ... */
for (i = 0; i < nRoot; i++)
ginVacuumPostingTree(&gvs, rootOfPostingTree[i]); /* recurse into trees */

Posting trees are shrunk. ginVacuumPostingTree runs in two stages (matching the README): first ginVacuumPostingTreeLeaves strips dead TIDs from every leaf (and notes whether any leaf went empty); only if an empty leaf appeared does it take a cleanup lock on the posting-tree root (LockBufferForCleanup, which waits out all pinholders, i.e. in-flight readers) and run ginScanToDelete, a depth-first traversal that physically unlinks empty pages:

// ginVacuumPostingTree — src/backend/access/gin/ginvacuum.c
if (ginVacuumPostingTreeLeaves(gvs, rootBlkno)) /* stage 1: strip TIDs */
{
/* stage 2: at least one empty page, so rescan and delete empties */
buffer = ReadBufferExtended(gvs->index, MAIN_FORKNUM, rootBlkno, RBM_NORMAL,
gvs->strategy);
LockBufferForCleanup(buffer); /* blocks concurrent inserts + readers */
/* ... */
ginScanToDelete(gvs, rootBlkno, true, &root, InvalidOffsetNumber);
}

ginScanToDelete keeps exclusive locks on the left siblings of pages on the path under investigation, so that when a page is removed both its downlink and its right-link can be fixed without ever taking a right-to-left lock order (which would deadlock against concurrent ginStepRight). A deleted page is marked with a “newest xid that might visit it” so it cannot be recycled until every reader that may have already read a pointer to it has finished — the same deferred-reclaim discipline as nbtree. ginvacuumcleanup finally returns the emptied posting-tree pages to the FSM and refreshes the metapage statistics.

Position hints (as of 2026-06-05, REL_18 273fe94)

Section titled “Position hints (as of 2026-06-05, REL_18 273fe94)”
SymbolFileLine
ginhandlersrc/backend/access/gin/ginutil.c38
ginExtractEntriessrc/backend/access/gin/ginutil.c488
gininsertsrc/backend/access/gin/gininsert.c851
ginEntryInsertsrc/backend/access/gin/gininsert.c341
addItemPointersToLeafTuplesrc/backend/access/gin/gininsert.c211
buildFreshLeafTuplesrc/backend/access/gin/gininsert.c291
ginbuildsrc/backend/access/gin/gininsert.c608
_gin_begin_parallelsrc/backend/access/gin/gininsert.c922
ginHeapTupleFastCollectsrc/backend/access/gin/ginfast.c483
ginHeapTupleFastInsertsrc/backend/access/gin/ginfast.c219
ginInsertCleanupsrc/backend/access/gin/ginfast.c780
shiftListsrc/backend/access/gin/ginfast.c554
ginFindLeafPagesrc/backend/access/gin/ginbtree.c83
ginStepRightsrc/backend/access/gin/ginbtree.c177
ginTraverseLocksrc/backend/access/gin/ginbtree.c39
ginPlaceToPagesrc/backend/access/gin/ginbtree.c337
gingetbitmapsrc/backend/access/gin/ginget.c1931
scanGetItemsrc/backend/access/gin/ginget.c1300
keyGetItemsrc/backend/access/gin/ginget.c1005
entryGetItemsrc/backend/access/gin/ginget.c812
collectMatchBitmapsrc/backend/access/gin/ginget.c121
moveRightIfItNeededsrc/backend/access/gin/ginget.c43
scanPostingTreesrc/backend/access/gin/ginget.c69
scanPendingInsertsrc/backend/access/gin/ginget.c1837
ginbulkdeletesrc/backend/access/gin/ginvacuum.c564
ginvacuumcleanupsrc/backend/access/gin/ginvacuum.c687
ginVacuumPostingTreesrc/backend/access/gin/ginvacuum.c408
ginVacuumPostingTreeLeavessrc/backend/access/gin/ginvacuum.c345
ginScanToDeletesrc/backend/access/gin/ginvacuum.c246
GinIsPostingTree / GIN_TREE_POSTINGsrc/include/access/ginblock.h232 / 231
GinPageIsIncompleteSplitsrc/include/access/ginblock.h128
GinScanKeyDatasrc/include/access/gin_private.h269
GinScanEntryDatasrc/include/access/gin_private.h337
GinBtreeDatasrc/include/access/gin_private.h151

Verification was done against the working tree at /data/hgryoo/references/postgres, REL_18_STABLE, commit 273fe94852b (“Fix off-by-one with NFC recomposition for Hangul U+11A7 (TBASE)”). Every code excerpt above was copied from the file named in its leading // symbol — path comment and condensed only by eliding lines (marked /* ... */); no logic was paraphrased. The following facts were checked directly:

  • AM surface. ginhandler (ginutil.c) sets amgettuple = NULL and amgetbitmap = gingetbitmap; confirmed grep "amget" ginutil.c. GIN is therefore bitmap-only, as the prose states.
  • Posting-tree discriminator. ginblock.h defines GIN_TREE_POSTING ((OffsetNumber)0xffff) and GinIsPostingTree(itup) (GinGetNPosting(itup) == GIN_TREE_POSTING), with GinGetNPosting reading t_tid’s offset field — confirming the header-field abuse the README documents.
  • Fast-update toggle. gininsert branches on GinGetUseFastUpdate(index); fastupdate defaults on (GinOptions / ginutil.c’s reloptions). The pending list is anchored on GinMetaPageData (head, tail, tailFreeSize, nPendingPages) in ginblock.h.
  • Cleanup interlock. ginInsertCleanup uses LockPage / ConditionalLockPage on GIN_METAPAGE_BLKNO (a heavyweight page lock, storage/lmgr), distinct from the buffer content lock — verified by reading the two lock calls and their comments.
  • VACUUM asymmetry. ginbulkdelete calls ginInsertCleanup(..., forceCleanup = true) before walking entry leaves; there is no entry-tree page deletion (no ginDeletePage call from the entry-tree walk — ginDeletePage is reached only via ginScanToDelete on posting trees). Confirmed grep -n ginDeletePage ginvacuum.c.
  • Cleanup lock for posting-tree deletion. ginVacuumPostingTree calls LockBufferForCleanup(buffer) on the posting-tree root only when stage 1 reported an empty page; matches the README’s “full cleanup lock on the tree root.”
  • Predicate locking. collectMatchBitmap/moveRightIfItNeeded and scanPendingInsert call PredicateLockPage on entry leaves, posting-tree roots, and the metapage respectively — the three lock targets the README’s Predicate Locking section enumerates.

One nuance worth flagging for a future reader: the README’s concurrency pictures describe pin/lock coupling during descent and step-right, but ginFindLeafPage in search mode (searchMode = true) does not keep parent pins — it uses ReleaseAndReadBuffer and “may forget path to leaf,” whereas insert mode keeps the pins. The code comment at the top of ginFindLeafPage is the authority; the README’s “Insert” picture is the insert-mode case.

Limitations confirmed from the README and code: GIN ignores scan->kill_prior_tuple/ignore_killed_tuples (no LP_DEAD hinting like nbtree), and it matches entries only by equality or the partial-match range mechanism — there is no general inequality search.

A last cross-check on the encoding claim. The README’s “Posting List Compression” section and ginpostinglist.c agree that posting lists — both the inline lists in entry-tree leaves and the leaf pages of a posting tree — are stored delta-encoded with varbyte, not raw ItemPointerData arrays:

// ginCompressPostingList — src/backend/access/gin/ginpostinglist.c
// item pointers arrive sorted; we store the *difference* from the previous one
result->first = ipd[0]; /* first TID stored verbatim */
prev = itemptr_to_uint64(&result->first);
for (totalpacked = 1; totalpacked < nipd; totalpacked++)
{
uint64 val = itemptr_to_uint64(&ipd[totalpacked]);
uint64 delta = val - prev;
Assert(val > prev);
/* ... bounds check elided ... */
encode_varbyte(delta, &ptr); /* small deltas => fewer bytes */
prev = val;
}

encode_varbyte emits 7 payload bits per byte with the high bit as a continuation flag, so a TID one block away from its predecessor costs a byte or two instead of six. This is the on-disk reason a hot key’s million-TID posting list stays compact, and it is why GinItupIsCompressed(itup) exists: an entry leaf may carry either the compressed form or (for backward compatibility) a raw array, and the reader branches on the flag.

Beyond PostgreSQL — Comparative Designs & Research Frontiers

Section titled “Beyond PostgreSQL — Comparative Designs & Research Frontiers”
  • Lucene / classic IR inverted files. GIN’s entry-tree-plus-posting-list shape is the relational-engine reflection of a Lucene segment: a term dictionary (Lucene’s FST; GIN’s entry B-tree) over distinct keys, each pointing at a postings stream (Lucene’s .doc/.pos files; GIN’s inline list or posting tree). The varbyte delta encoding here is the same family as Lucene’s PForDelta/FOR block codecs — both exploit sorted doc-IDs and small gaps. The sharpest divergence is the fast-update pending list: Lucene is segment-immutable and merges whole segments, whereas GIN mutates one main index in place and keeps a metapage-anchored side log. A note mapping GIN’s ginInsertCleanup against Lucene’s segment-merge policy would frame the in-place-vs-immutable trade directly.

  • Bitmap and variant indexes (O’Neil & Quass 1997). The “one key occurs in many rows” workload GIN targets is exactly the low-cardinality case bitmap indexes were built for, and GIN’s consistent-function conjunction is a posting-list intersection where a bitmap index would AND two bitmaps. The Variant Indexes paper (dbms-papers/variant-index.md) — bit-sliced and projection indexes, and the bitmap-vs-list space crossover — is the right lens on why GIN compresses long lists into varbyte runs rather than materializing a bitmap: GIN’s TID space is sparse and unbounded (heap can grow), so a gap-coded list dominates a dense bitmap until the key is nearly universal.

  • RUM (the GIN successor extension). The rum access method (a third-party extension, out of core scope here, named only as an example) extends GIN’s posting entries to carry additional attributes — lexeme positions, a timestamp, a distance — so that ordering operators (tsvector <=> tsquery, ranking) can be answered from the index without a heap fetch. GIN deliberately stores only TIDs in its postings; comparing RUM’s fatter-posting design against GIN’s TID-only list quantifies what GIN’s bitmap-only amgetbitmap interface gives up versus an index that can return ordered, scored results.

  • The pending-list / LSM analogy. GIN’s fast-update path — buffer unordered inserts in a side log, merge them in bulk later — is structurally a one-level log-structured merge: the pending list is the in-place “L0” and ginInsertCleanup is the compaction into the sorted main index. Unlike a real LSM, GIN has exactly one merge target and reads the pending list linearly on every scan, so it pays an unbounded read tax if cleanup falls behind (the gin_pending_list_limit / autovacuum trigger bounds this). Positioning GIN against RocksDB-style multi-level LSMs would clarify why GIN keeps a single unsorted buffer rather than a tiered, sorted run hierarchy.

  • Partial match and prefix search vs. trigram / suffix structures. GIN’s extractQuery partial-match mode (collectMatchBitmap walking an entry-key range) gives prefix and range matching essentially for free because the entry tree is ordered, which is how pg_trgm and LIKE-acceleration opclasses ride on GIN. A comparison against dedicated suffix-array / FM-index text structures would frame what GIN’s general-purpose ordered entry tree gives up in exchange for serving arrays, jsonb, and tsvector from one mechanism.

  • Concurrency: L&Y entry tree without page deletion. GIN’s entry tree is a Lehman-Yao B-tree (right-links, step-right recovery) but — unlike nbtree — it never deletes entry pages, because the key vocabulary is assumed stable (postgres-nbtree.md owns the full L&Y deletion protocol). The research question this raises is bloat: a GIN index over a column whose key set churns (keys appear then permanently vanish) accumulates dead entry leaves that only a full REINDEX reclaims. Contrasting this with nbtree’s _bt_pagedel half-dead/unlink machinery isolates the cost of GIN’s “vocabulary is forever” assumption.

In-tree READMEs and source files (REL_18_STABLE, commit 273fe94)

Section titled “In-tree READMEs and source files (REL_18_STABLE, commit 273fe94)”
  • src/backend/access/gin/README — the design document: inverted-index basics, the generalized strategy-function contract, entry-tree vs. posting list vs. posting tree layout, the fast-update pending list and cleanup, posting-list (varbyte) compression, concurrency and predicate locking, and the “no entry-page deletion” limitation.
  • src/include/access/ginblock.hGinMetaPageData (pending-list anchor: head, tail, tailFreeSize, nPendingPages), GinPageOpaqueData, the GIN_TREE_POSTING sentinel and GinIsPostingTree / GinGetNPosting header-field accessors, GinItupIsCompressed.
  • src/include/access/gin_private.hGinState, GinBtreeData, scan-key and scan-entry structs (GinScanKey, GinScanEntry), GinBtreeStack.
  • src/backend/access/gin/ginutil.cginhandler (the AM surface: amgetbitmap = gingetbitmap, amgettuple = NULL), reloptions (fastupdate, gin_pending_list_limit), GinInitMetabuffer.
  • src/backend/access/gin/gininsert.cgininsert, ginEntryInsert, ginBuildCallback, the build-time accumulator, the fast-update branch on GinGetUseFastUpdate.
  • src/backend/access/gin/ginget.cgingetbitmap, startScan, collectMatchBitmap, keyGetItem, entryGetItem, scanPendingInsert, the consistent-function driver.
  • src/backend/access/gin/ginfast.cginHeapTupleFastInsert, ginInsertCleanup, makeSublist, shiftList, writeListPage, the metapage LockPage / ConditionalLockPage interlock.
  • src/backend/access/gin/ginbtree.cginFindLeafPage, ginStepRight, ginFinishSplit, ginInsertValue — the shared L&Y descent/step-right used by both the entry tree and posting trees.
  • src/backend/access/gin/ginvacuum.cginbulkdelete, ginVacuumPostingTree, ginVacuumEntryPage, ginScanToDelete / ginDeletePage (posting-tree page reclamation under a cleanup lock).
  • src/backend/access/gin/ginpostinglist.cginCompressPostingList, encode_varbyte / decode_varbyte, the delta+varbyte posting codec.
  • Database System Concepts (Silberschatz, Korth, Sudarshan, 7e), ch. 21 “Information Retrieval” — the inverted-index / posting-list model and conjunctive-query intersection that GIN implements (knowledge/research/dbms-general/database-system-concepts.md).
  • O’Neil, P. & Quass, D. (1997). “Improved Query Performance with Variant Indexes.” SIGMOD 1997. Bitmap / bit-sliced / projection indexes and the bitmap-vs-list space trade-off (dbms-papers/variant-index.md).
  • Lehman, P. & Yao, S. B. (1981). “Efficient Locking for Concurrent Operations on B-Trees.” ACM TODS 6(4) — the right-link/step-right algorithm GIN’s entry tree and posting trees reuse (text in dbms-papers/btree.md; full deviation analysis in postgres-nbtree.md).
  • Database Internals (Petrov 2019), ch. 2-6 — B-tree concurrency, page layout, and WAL framing background (knowledge/research/dbms-general/).

Sibling docs (cross-references — mechanism owned there, not duplicated here)

Section titled “Sibling docs (cross-references — mechanism owned there, not duplicated here)”
  • postgres-index-am.md — the IndexAmRoutine dispatch contract that ginhandler fills in (build, insert, amgetbitmap, bulkdelete, vacuum).
  • postgres-nbtree.md — the canonical Lehman-Yao B-tree: the descent, step-right recovery, and the page-deletion protocol GIN’s entry tree deliberately omits.
  • postgres-full-text-search.md — the tsvector / tsquery opclasses and the extractValue / extractQuery / consistent strategy functions GIN drives; the text-search semantics live there.
  • postgres-buffer-manager.md — buffer pins and content-lock (LWLock) latching underneath GIN’s page locks and the metapage LockPage interlock.
  • postgres-ssi-predicate-locking.md — the predicate-lock machinery collectMatchBitmap / scanPendingInsert feed via PredicateLockPage.
  • postgres-vacuum.md / postgres-autovacuum.md — the VACUUM driver that invokes ginbulkdelete and the autovacuum trigger that forces pending-list cleanup.