Skip to content

PostgreSQL SP-GiST — Space-Partitioned Trees on Disk

Contents:

The B-tree (see postgres-nbtree.md) is the right index when the keys are totally ordered: every node splits the key space into contiguous ranges, and balance keeps every leaf at the same depth. But a great many data types have no useful total order. A 2-D point, an IP range, a polygon’s bounding box, a string viewed character-by-character — for these the natural index is not a range tree but a space-partitioning tree: a structure that recursively cuts the domain (the plane, the address space, the alphabet) into disjoint regions and routes each value to the region that contains it. Database System Concepts (Silberschatz 7e) develops exactly this family in §24.4 “Indexing of Spatial Data”, naming the three canonical members:

  • The quadtree. A 2-D quadtree node stores a centroid point and has four children, one per quadrant (NE, SE, SW, NW) relative to the centroid. A point descends into the quadrant it falls in; the partition is by region, and the four regions exactly tile the parent’s region.
  • The k-d tree. A k-d tree alternates the splitting dimension by level: at even levels it splits on x (a vertical cutting line), at odd levels on y (a horizontal cutting line). Each node is binary — “left of the line” vs. “right of it” — so a k-d tree is a quadtree’s space cut into two halves per level instead of four quadrants per level.
  • The radix tree (trie). For strings, a trie partitions by prefix: a node represents a common prefix, and its children are labeled by the next character. The depth of a leaf is the length of the discriminating prefix, not log N, so a trie’s shape follows the data, not a balance invariant.

Three structural facts unite these and separate them sharply from the B-tree. First, they are unbalanced — the depth of a branch depends on how clustered the data is, with no rebalancing machinery. Second, the partition is disjoint and exhaustive by construction: a value belongs to exactly one child region, so insertion and search descend a single path (unlike R-trees, whose bounding boxes overlap and force multi-path search). Third, in their textbook form they are pointer-chasing in-memory structures: each node is a small heap-allocated record holding child pointers. A quadtree over a million points might be twenty levels deep, and naively each level is one pointer dereference — which on disk would be one random I/O per level. That is fatal for a disk index, where the whole game is to keep the number of page reads near log_F N with F (the fanout) large.

So the theoretical problem SP-GiST solves is a mapping problem, stated at the very top of src/backend/access/spgist/README: “The challenge is to map tree nodes to disk pages in such a way that the search algorithm accesses only a few disk pages, even if it traverses many nodes.” A pure node-per-page layout wastes a page on a four-pointer quadtree node; a pure node-per-pointer layout is the in-memory structure that bleeds I/O. SP-GiST’s answer is to pack many logical nodes onto a page and to require that structurally-linked groups (one inner tuple, one leaf-tuple chain) never straddle a page boundary, so descending one logical level usually stays on the same page and crossing to a child page is the exception, not the rule.

The “GiST” in the name is the second half of the story. GiST (Generalized Search Tree, Hellerstein-Naughton-Pfeffer 1995; see postgres-gist.md) showed that a balanced, overlapping search tree could be made datatype-agnostic by factoring all the type-specific logic into a handful of opclass methods (a consistent predicate, a union, a penalty, a picksplit) while the core owns page layout, WAL, and concurrency. SP-GiST is the same idea applied to the space-partitioning family: the core owns the disk mapping, the redirect/placeholder bookkeeping, the latching, and the WAL, while five opclass methods — config, choose, picksplit, inner_consistent, leaf_consistent — encode whatever it means to be a quadtree, or a k-d tree, or a radix trie. PostgreSQL ships point_ops (quadtree) and kd_point_ops (k-d tree) over point, plus text_ops (radix tree) over text, and inet_ops, range_ops, box_ops/poly via the same five-method contract.

Two design knobs organize the rest of this document, exactly the way the L&Y invariants organized the nbtree doc:

  1. How is a pointer-linked tree mapped to pages without losing the single-path-descent property — what an inner tuple, a node, and a leaf chain physically are, and the rules that keep a chain and an inner tuple each on one page.
  2. How do the five opclass methods carve the responsibility between the datatype (which knows what a “quadrant” or a “common prefix” is) and the core (which knows nothing about geometry or strings but everything about pages, locks, and crash recovery).

Before SP-GiST’s specific symbols, it helps to name the recurring techniques that any engine mapping a space-partitioning tree to disk converges on. The PostgreSQL choices in the next section then read as points in this space.

The first move every disk-resident trie/quadtree makes is to stop equating “tree node” with “disk page.” A 4-way quadtree node is ~40 bytes; an 8 KB page can hold ~200 of them. So the page is a bucket of logical nodes, and the on-disk pointer becomes a (page, offset) pair rather than a raw memory address. The crucial refinement — the one that makes the I/O bound work — is to co-locate nodes that are likely to be visited together. SP-GiST’s rule is blunt and effective: an inner tuple’s whole node array lives in one physical tuple on one page, and the set of leaf tuples reachable from a single node forms a chain on one page, linked by 2-byte offsets rather than full TIDs. Descending one logical edge that stays within a page is free; only edges that cross to a child page cost an I/O.

Lazy, offset-stable deletion via tombstones

Section titled “Lazy, offset-stable deletion via tombstones”

A space-partitioning tree has an awkward property: a child’s address is an inner tuple’s downlink, and inner tuples elsewhere on the page must keep their offset numbers stable, or every downlink pointing at them breaks. So an engine cannot simply compact a page. The universal answer is tombstones: when a tuple must logically go away but its slot can’t move, leave a placeholder occupying the offset. SP-GiST has a four-state lattice — LIVE, REDIRECT, DEAD, PLACEHOLDER — that distinguishes “moved, here’s where to” (redirect) from “gone but still referenced” (dead) from “gone and safe to reuse” (placeholder).

Section titled “Redirects for lock-light concurrent search”

If a search releases the parent’s lock before locking the child (to avoid holding a chain of locks), a concurrent insert can move the child’s contents elsewhere in the window between unlock and lock. The classic fix, shared with many B-link-style designs, is to leave a forwarding pointer at the old location: a scan that arrives at a moved tuple finds a redirect telling it the new address. The redirect is needed only as long as some snapshot-old scan could still be in flight, after which VACUUM can recycle it.

Conditional latch-coupling to bound deadlocks

Section titled “Conditional latch-coupling to bound deadlocks”

An insert into a non-balanced tree must hold the parent locked while it locks the child (to update the parent’s downlink atomically), which risks deadlock when two inserts descend cross-linked branches in opposite order. Engines bound this either with a global lock ordering (impractical here, since the tree shape is data-dependent) or with conditional locking plus restart: try to lock the child without blocking; if that fails, drop everything and retry the whole insert. SP-GiST adds a placement heuristic — “triple parity” — that makes the cross-link pattern statistically rare.

Sequential-scan VACUUM with a pending list

Section titled “Sequential-scan VACUUM with a pending list”

Reclaiming dead entries in a tree that inserts can reshape underneath you is hard if you walk the tree logically. The simpler approach many engines take is a physical sequential scan of every page, which visits each page exactly once and is immune to concurrent shape changes — except for the one race where an insert moves a yet-to-be-deleted entry from an unvisited page onto an already-visited page. SP-GiST handles that with a pending list: when the scan sees a recently-created redirect, it records the target and re-checks it between pages, so no target TID can escape deletion.

flowchart TB
  subgraph Logical["Logical space-partitioning tree"]
    R["inner: centroid c0<br/>nodes: NE SE SW NW"]
    A["inner: centroid c1<br/>nodes ..."]
    L1["leaf chain:<br/>pt, pt, pt"]
    L2["leaf chain:<br/>pt, pt"]
    R -->|NE| A
    R -->|SE| L1
    A -->|SW| L2
  end
  subgraph Physical["Physical page mapping"]
    P1["inner page 1<br/>tuple R (4 nodes)<br/>tuple A (4 nodes)"]
    P2["leaf page 5<br/>chain L1: off3->off7->0"]
    P3["leaf page 9<br/>chain L2: off2->off4->0"]
    P1 -.->|"downlink TID (5,3)"| P2
    P1 -.->|"downlink TID (9,2)"| P3
  end

PostgreSQL’s SP-GiST is a faithful, careful implementation of the mapping described in its README, authored by Teodor Sigaev and Oleg Bartunov. The access-method handler spghandler wires the five core entry points into the generic index-AM dispatch table (postgres-index-am.md covers that contract); note amcanorderbyop = true (SP-GiST supports nearest-neighbor ordered scans), amsearchnulls = true, amcaninclude = true, and amsupport = SPGISTNProc (seven support procs).

// spghandler — src/backend/access/spgist/spgutils.c
amroutine->amsupport = SPGISTNProc; /* 7 support procs */
amroutine->amcanorderbyop = true; /* kNN ordered scan */
amroutine->amsearchnulls = true; /* IS NULL via nulls tree */
amroutine->amcaninclude = true; /* INCLUDE columns */
amroutine->ambuild = spgbuild;
amroutine->aminsert = spginsert;
amroutine->ambulkdelete = spgbulkdelete;
amroutine->amgettuple = spggettuple;
amroutine->amgetbitmap = spggetbitmap;
amroutine->amcanreturn = spgcanreturn; /* index-only scans */

Everything on disk is one of two real tuple shapes (plus the dead-tuple overlay). An inner tuple is a fixed header, an optional prefix datum, then a packed array of node tuples; each node tuple is an IndexTupleData header carrying a (label, downlink-TID) pair. The bit-packed header is the heart of the format:

// SpGistInnerTupleData — src/include/access/spgist_private.h
typedef struct SpGistInnerTupleData
{
unsigned int tupstate:2, /* LIVE/REDIRECT/DEAD/PLACEHOLDER */
allTheSame:1, /* all nodes in tuple are equivalent */
nNodes:13, /* number of nodes within inner tuple */
prefixSize:16; /* size of prefix, or 0 if none */
uint16 size; /* total size of inner tuple */
/* prefix datum follows, then nodes */
} SpGistInnerTupleData;

A leaf tuple holds a leaf datum, the heap TID it indexes, and — crucially — a nextOffset link (packed into t_info) chaining it to the next leaf under the same parent node on the same page:

// SpGistLeafTupleData — src/include/access/spgist_private.h
typedef struct SpGistLeafTupleData
{
unsigned int tupstate:2, /* LIVE/REDIRECT/DEAD/PLACEHOLDER */
size:30;
uint16 t_info; /* nextOffset (14 bits) + flag bits */
ItemPointerData heapPtr; /* TID of represented heap tuple */
/* nulls bitmap (if flagged), then leaf datum + INCLUDE datums */
} SpGistLeafTupleData;

The nextOffset is a 2-byte OffsetNumber, not a full 6-byte TID, precisely because the README requires the whole chain to live on one page — that is the invariant that makes leaf chains cheap. A REDIRECT/DEAD/PLACEHOLDER slot is reinterpreted through SpGistDeadTupleData, whose tupstate aliases the same first two bits and whose pointer/xid carry the forwarding TID and the inserting transaction.

An SP-GiST relation has three fixed blocks: the metapage (block 0, carrying a magic number and the per-backend last-used-page cache), the root of the main tree (block 1), and the root of the nulls tree (block 2). The README’s NULLS-handling rule is implemented directly: indexed operators are assumed strict, so nulls live in a separate tree on a disjoint page set, searched with hardwired logic rather than opclass methods. The insert path picks its starting block from the key’s null-ness:

// spgdoinsert — src/backend/access/spgist/spgdoinsert.c
current.blkno = isnull ? SPGIST_NULL_BLKNO : SPGIST_ROOT_BLKNO;
// constants — src/include/access/spgist_private.h
#define SPGIST_METAPAGE_BLKNO (0) /* metapage */
#define SPGIST_ROOT_BLKNO (1) /* root for normal entries */
#define SPGIST_NULL_BLKNO (2) /* root for null-value entries */
#define SPGIST_LIVE 0 /* normal live tuple */
#define SPGIST_REDIRECT 1 /* temporary redirection placeholder */
#define SPGIST_DEAD 2 /* dead, cannot be removed (still linked) */
#define SPGIST_PLACEHOLDER 3 /* placeholder to preserve offsets */

The datatype is encapsulated in five (plus two optional) support procedures. spgGetCache calls config once per index and caches the result in rd_amcache; this declares the four types the core must marshal — attType (input), attLeafType (leaf storage), attPrefixType (inner prefix), and attLabelType (node labels) — plus longValuesOK (can the opclass shorten a too-long leaf datum by suffixing?). The radix-tree config is illustrative: text prefixes, int2 (single-character) labels, and suffixing on:

// spg_text_config — src/backend/access/spgist/spgtextproc.c
cfg->prefixType = TEXTOID;
cfg->labelType = INT2OID;
cfg->canReturnData = true;
cfg->longValuesOK = true; /* suffixing will shorten long values */

On insert, choose is the decision oracle at each inner tuple. It is handed the value, the current prefix, and the node labels, and returns one of three verdicts — spgMatchNode (descend through node N, adjusting the level and the “rest” of the datum), spgAddNode (this prefix matches but no node does; add a labeled node), or spgSplitTuple (the prefix itself is too specific; break it into a shorter prefix tuple plus a postfix child). The quad-tree choose only ever matches, because a centroid is never “inconsistent” with a point — it just routes to the right quadrant:

// spg_quad_choose — src/backend/access/spgist/spgquadtreeproc.c
out->resultType = spgMatchNode;
out->result.matchNode.nodeN = getQuadrant(centroid, inPoint) - 1;
out->result.matchNode.levelAdd = 0;
out->result.matchNode.restDatum = PointPGetDatum(inPoint);

When a leaf chain overflows its page, picksplit converts the whole list into one inner tuple whose nodes fan the leaves into sublists. The quad-tree version computes the centroid (the average of the points), declares four nodes, and maps each input point to its quadrant:

// spg_quad_picksplit — src/backend/access/spgist/spgquadtreeproc.c
out->hasPrefix = true;
out->prefixDatum = PointPGetDatum(centroid);
out->nNodes = 4;
out->nodeLabels = NULL; /* quadrants are positional, unlabeled */
for (i = 0; i < in->nTuples; i++)
{
Point *p = DatumGetPointP(in->datums[i]);
int quadrant = getQuadrant(centroid, p) - 1;
out->leafTupleDatums[i] = PointPGetDatum(p);
out->mapTuplesToNodes[i] = quadrant;
}

On search, the dual pair inner_consistent and leaf_consistent prune. inner_consistent is given the inner tuple’s prefix and node set and a list of scan keys, and returns the subset of nodes whose subtrees could contain a match — for a quadtree it intersects each query operator (left-of, contained- by, …) against the quadrant geometry and returns a bitmask of surviving quadrants. leaf_consistent makes the final per-tuple decision. The k-d tree’s choose shows the level-alternating dimension and the levelAdd = 1 that drives the alternation:

// spg_kd_choose — src/backend/access/spgist/spgkdtreeproc.c
coord = DatumGetFloat8(in->prefixDatum);
out->resultType = spgMatchNode;
out->result.matchNode.nodeN =
(getSide(coord, in->level % 2, inPoint) > 0) ? 0 : 1;
out->result.matchNode.levelAdd = 1; /* alternate x/y each level */
out->result.matchNode.restDatum = PointPGetDatum(inPoint);
flowchart TB
  Ins["spgdoinsert: leaf datum + heap TID"] --> Loop{"current page type?"}
  Loop -->|leaf| Fit{"leaf tuple fits<br/>on page?"}
  Fit -->|yes| AddLeaf["addLeafTuple: link into chain"]
  Fit -->|"no, small chain"| Move["moveLeafs: relocate whole chain"]
  Fit -->|"no, large"| Pick["doPickSplit: build inner tuple,<br/>fan leaves to sublists"]
  Loop -->|inner| Choose["call choose()"]
  Choose -->|spgMatchNode| Descend["descend node N,<br/>level += levelAdd"]
  Choose -->|spgAddNode| AddNode["spgAddNodeAction,<br/>retry choose"]
  Choose -->|spgSplitTuple| Split["spgSplitNodeAction:<br/>prefix + postfix, retry"]
  Descend --> Loop
  AddNode --> Choose
  Split --> Choose
  Pick --> Descend

The code splits cleanly along the index-AM surface: build and metapage in spginsert.c, the insertion engine in spgdoinsert.c, the search engine in spgscan.c, support and tuple formation in spgutils.c, the per-opclass methods in spg{quad,kd,text}treeproc.c, and reclamation in spgvacuum.c.

spghandler (in spgutils.c) returns the IndexAmRoutine. spgbuild (in spginsert.c) creates the three fixed pages — it initializes the metapage via SpGistInitMetapage, the main root (block 1), and the nulls root (block 2) — then scans the heap, calling spgdoinsert per tuple through a build callback. spgGetCache (in spgutils.c) lazily runs the opclass config method once and caches the four SpGistTypeDesc entries plus spgConfigOut in rd_amcache; it also reads the last-used-page cache off the metapage. initSpGistState populates the per-operation SpGistState.

spgdoinsert (in spgdoinsert.c) is the heart. It first forms the leaf datum (calling the optional compress method, else de-toasting the input), computes leafSize, and rejects oversize values up front unless longValuesOK lets the opclass suffix them. It seeds current.blkno to the main or nulls root and enters a for(;;) loop. Each iteration acquires the current buffer (conditionally locking child pages to avoid deadlock — on failure it returns false and the caller spginsert retries), then branches on page type:

  • Leaf page. It forms the leaf tuple and tries three things in order: addLeafTuple if it fits; moveLeafs if the chain is small enough to relocate wholesale (checkSplitConditions reports chain size, gated at < SPGIST_PAGE_CAPACITY/2, < 64 tuples); otherwise doPickSplit.
  • Inner page (process_inner_tuple: label). It marshals a spgChooseIn, calls the opclass choose (or forces spgMatchNode for the nulls tree), and dispatches on out.resultType.

The dispatch on the choose verdict is the structural core of insertion:

// spgdoinsert — src/backend/access/spgist/spgdoinsert.c
switch (out.resultType)
{
case spgMatchNode:
spgMatchNodeAction(index, state, innerTuple,
&current, &parent,
out.result.matchNode.nodeN);
level += out.result.matchNode.levelAdd;
if (!isnull)
{
leafDatums[spgKeyColumn] = out.result.matchNode.restDatum;
leafSize = SpGistGetLeafTupleSize(leafDescriptor,
leafDatums, isnulls);
leafSize += sizeof(ItemIdData);
}
break; /* loop: descend to child */
case spgAddNode:
spgAddNodeAction(index, state, innerTuple,
&current, &parent,
out.result.addNode.nodeN,
out.result.addNode.nodeLabel);
goto process_inner_tuple; /* retry choose on enlarged tuple */
case spgSplitTuple:
spgSplitNodeAction(index, state, innerTuple, &current, &out);
goto process_inner_tuple; /* retry choose on the prefix tuple */
}

addLeafTuple (in spgdoinsert.c) handles the two chain cases. A root or chain-head insert sets nextOffset = Invalid and, if there’s a parent, updates the parent’s downlink via saveNodeLink. A mid-chain insert exploits the README’s “insert second” trick — it links the new tuple where the head pointed, then points the head at the new tuple, so the chain head address never changes and the parent downlink stays valid without chasing the chain:

// addLeafTuple — src/backend/access/spgist/spgdoinsert.c
SGLT_SET_NEXTOFFSET(leafTuple, SGLT_GET_NEXTOFFSET(head));
offnum = SpGistPageAddNewItem(state, current->page,
(Item) leafTuple, leafTuple->size, NULL, false);
/* re-get head (it may have moved on page) and link it to the new tuple */
head = (SpGistLeafTuple) PageGetItem(current->page,
PageGetItemId(current->page, current->offnum));
SGLT_SET_NEXTOFFSET(head, offnum);

PickSplit: turning a leaf chain into an inner tuple

Section titled “PickSplit: turning a leaf chain into an inner tuple”

doPickSplit (in spgdoinsert.c) is invoked when a leaf chain can neither absorb the new tuple nor be relocated. It gathers all LIVE leaf datums from the chain (or, for a root-as-leaf split, all tuples on the page), calls the opclass picksplit to get a prefix and a mapTuplesToNodes assignment, then forms node tuples and the inner tuple with spgFormNodeTuple / spgFormInnerTuple. A subtle safety net is checkAllTheSame: if picksplit dumped every tuple into one node (a degenerate split that would loop forever), the core overrides it into allTheSame mode and distributes randomly. The new inner tuple replaces the chain head’s parent downlink; leaves are distributed to one or more leaf pages, and old slots become redirects or placeholders.

// doPickSplit — src/backend/access/spgist/spgdoinsert.c
innerTuple = spgFormInnerTuple(state,
out.hasPrefix, out.prefixDatum,
out.nNodes, nodes);
innerTuple->allTheSame = allTheSame;
/* re-point nodes[] into the formed tuple so downlinks can be patched */
SGITITERATE(innerTuple, i, node)
{
nodes[i] = node;
}

SplitTuple: shortening an over-specific prefix

Section titled “SplitTuple: shortening an over-specific prefix”

spgSplitNodeAction (in spgdoinsert.c) implements the radix-tree case where the inner tuple’s prefix only partially matches the new value. It builds a prefix tuple (the matching head, which must be no larger than the original so it can overwrite it in place) and a postfix tuple (the original node array under a new shorter prefix), placing the postfix on the same page or, if it won’t fit (or the page is the root), on a page with the next triple-parity:

// spgSplitNodeAction — src/backend/access/spgist/spgdoinsert.c
if (prefixTuple->size > innerTuple->size)
elog(ERROR, "SPGiST inner-tuple split must not produce longer prefix");
...
if (SpGistBlockIsRoot(current->blkno) ||
SpGistPageGetFreeSpace(current->page, 1) + innerTuple->size <
prefixTuple->size + postfixTuple->size + sizeof(ItemIdData))
{
/* postfix is a child of prefix -> next triple parity */
newBuffer = SpGistGetBuffer(index,
GBUF_INNER_PARITY(current->blkno + 1),
postfixTuple->size + sizeof(ItemIdData),
&xlrec.newPage);
}

Page allocation, triple parity, and the last-used cache

Section titled “Page allocation, triple parity, and the last-used cache”

SpGistGetBuffer (in spgutils.c) is the allocator. It consults the per-backend last-used-page cache (GET_LUP) for a page of the right flavor (leaf vs. inner-by-parity, main vs. nulls), conditionally locks it, rechecks the page type and free space (the cache can be stale), and falls back to allocNewBuffer otherwise. SpGistSetLastUsedPage writes the freshest free space back to the cache when a page is released. The triple-parity rule — GBUF_INNER_PARITY(x) = x % 3, with a child placed at parity (parent+1) % 3 — makes it unlikely that two inserts descending different branches will hold cross-referenced page locks, reducing deadlock restarts.

Search: spgWalk and the consistent methods

Section titled “Search: spgWalk and the consistent methods”

spgrescan/spgPrepareScanKeys (in spgscan.c) set up the scan, separating null-related quals. spgWalk is the traversal driver: it pops a SpGistSearchItem from a pairing heap (scanQueue) — a priority queue ordered by distance for kNN scans, FIFO-like otherwise — share-locks the target page, and dispatches. Critically it locks one page at a time and releases before visiting the next stack item, so searches never deadlock; the cost is that it must honor redirect tuples left by concurrent inserts:

// spgWalk — src/backend/access/spgist/spgscan.c
else /* page is inner */
{
SpGistInnerTuple innerTuple = (SpGistInnerTuple)
PageGetItem(page, PageGetItemId(page, offset));
if (innerTuple->tupstate != SPGIST_LIVE)
{
if (innerTuple->tupstate == SPGIST_REDIRECT)
{
/* transfer attention to redirect point */
item->heapPtr = ((SpGistDeadTuple) innerTuple)->pointer;
goto redirect;
}
elog(ERROR, "unexpected SPGiST tuple state: %d", innerTuple->tupstate);
}
spgInnerTest(so, item, innerTuple, isnull);
}

spgInnerTest (in spgscan.c) calls the opclass inner_consistent (or, for nulls, forces all children), then for each surviving node builds a child SpGistSearchItem and pushes it on the queue. spgTestLeafTuple walks a leaf chain, following redirects and skipping DEAD entries, calling spgLeafTestleaf_consistent per LIVE tuple; matches are either reported immediately (unordered scan) or queued as heap items (ordered/kNN scan). spggettuple, spggetbitmap, and spgcanreturn (index-only scan eligibility) sit on top.

spgbulkdelete (in spgvacuum.c) sets up a spgBulkDeleteState and calls spgvacuumscan, which reads every non-meta page in physical order through a read stream and calls spgvacuumpage on each. After each page it drains the pending list via spgprocesspending — the mechanism that catches tuples a concurrent PickSplit moved onto an already-scanned page. vacuumLeafPage (in spgvacuum.c) builds a predecessor[] reverse-map of the nextOffset links to find chain heads, then computes six work arrays (dead / placeholder / move-src / move-dest / chain-src / chain-dest) so the mutation and its WAL replay do exactly the same thing. vacuumRedirectAndPlaceholder converts expired redirects to placeholders using a global-visibility test and truncates trailing placeholders:

// vacuumRedirectAndPlaceholder — src/backend/access/spgist/spgvacuum.c
if (dt->tupstate == SPGIST_REDIRECT &&
(!TransactionIdIsValid(dt->xid) ||
GlobalVisTestIsRemovableXid(vistest, dt->xid)))
{
dt->tupstate = SPGIST_PLACEHOLDER;
opaque->nRedirection--;
opaque->nPlaceholder++;
ItemPointerSetInvalid(&dt->pointer);
itemToPlaceholder[xlrec.nToPlaceholder++] = i;
}

spgvacuumcleanup (in spgvacuum.c) does nothing if spgbulkdelete already ran; otherwise it runs an empty-target scan to recycle redirects/placeholders, refresh the FSM, and gather statistics.

Position hints (as of 2026-06-05, REL_18 273fe94)

Section titled “Position hints (as of 2026-06-05, REL_18 273fe94)”
SymbolFileLine
spghandlersrc/backend/access/spgist/spgutils.c44
spgGetCachesrc/backend/access/spgist/spgutils.c188
SpGistGetBuffersrc/backend/access/spgist/spgutils.c569
SpGistSetLastUsedPagesrc/backend/access/spgist/spgutils.c673
SpGistInitMetapagesrc/backend/access/spgist/spgutils.c732
spgFormLeafTuplesrc/backend/access/spgist/spgutils.c871
spgFormNodeTuplesrc/backend/access/spgist/spgutils.c960
spgFormInnerTuplesrc/backend/access/spgist/spgutils.c1002
spgExtractNodeLabelssrc/backend/access/spgist/spgutils.c1160
spgbuildsrc/backend/access/spgist/spginsert.c73
spginsertsrc/backend/access/spgist/spginsert.c183
addLeafTuplesrc/backend/access/spgist/spgdoinsert.c203
moveLeafssrc/backend/access/spgist/spgdoinsert.c387
checkAllTheSamesrc/backend/access/spgist/spgdoinsert.c599
doPickSplitsrc/backend/access/spgist/spgdoinsert.c677
spgMatchNodeActionsrc/backend/access/spgist/spgdoinsert.c1459
spgAddNodeActionsrc/backend/access/spgist/spgdoinsert.c1513
spgSplitNodeActionsrc/backend/access/spgist/spgdoinsert.c1715
spgdoinsertsrc/backend/access/spgist/spgdoinsert.c1914
spgPrepareScanKeyssrc/backend/access/spgist/spgscan.c208
spgLeafTestsrc/backend/access/spgist/spgscan.c516
spgInnerTestsrc/backend/access/spgist/spgscan.c667
spgTestLeafTuplesrc/backend/access/spgist/spgscan.c763
spgWalksrc/backend/access/spgist/spgscan.c817
spggetbitmapsrc/backend/access/spgist/spgscan.c942
spggettuplesrc/backend/access/spgist/spgscan.c1026
spgcanreturnsrc/backend/access/spgist/spgscan.c1083
vacuumLeafPagesrc/backend/access/spgist/spgvacuum.c126
vacuumRedirectAndPlaceholdersrc/backend/access/spgist/spgvacuum.c494
spgvacuumpagesrc/backend/access/spgist/spgvacuum.c622
spgprocesspendingsrc/backend/access/spgist/spgvacuum.c688
spgvacuumscansrc/backend/access/spgist/spgvacuum.c800
spgbulkdeletesrc/backend/access/spgist/spgvacuum.c950
spgvacuumcleanupsrc/backend/access/spgist/spgvacuum.c981
spg_quad_choosesrc/backend/access/spgist/spgquadtreeproc.c115
spg_quad_picksplitsrc/backend/access/spgist/spgquadtreeproc.c169
spg_quad_inner_consistentsrc/backend/access/spgist/spgquadtreeproc.c227
spg_kd_choosesrc/backend/access/spgist/spgkdtreeproc.c53
spg_kd_picksplitsrc/backend/access/spgist/spgkdtreeproc.c108
spg_text_configsrc/backend/access/spgist/spgtextproc.c96
SpGistInnerTupleDatasrc/include/access/spgist_private.h293
SpGistLeafTupleDatasrc/include/access/spgist_private.h383
SpGistDeadTupleDatasrc/include/access/spgist_private.h427

All code excerpts and symbols above were read directly from the REL_18_STABLE working tree at commit 273fe94. Verification notes:

  • Five-method opclass contract. src/include/access/spgist.h defines SPGIST_CONFIG_PROC (1), SPGIST_CHOOSE_PROC (2), SPGIST_PICKSPLIT_PROC (3), SPGIST_INNER_CONSISTENT_PROC (4), SPGIST_LEAF_CONSISTENT_PROC (5), with optional SPGIST_COMPRESS_PROC (6) and SPGIST_OPTIONS_PROC (7); SPGISTNProc = 7. spghandler sets amsupport = SPGISTNProc.
  • Tuple states. The four tupstate values SPGIST_LIVE/REDIRECT/DEAD/PLACEHOLDER (0..3) and the fixed blocks SPGIST_METAPAGE_BLKNO(0)/SPGIST_ROOT_BLKNO(1)/SPGIST_NULL_BLKNO(2) are confirmed in spgist_private.h.
  • Bit-packed headers. SpGistInnerTupleData packs tupstate:2, allTheSame:1, nNodes:13, prefixSize:16; SGITMAXNNODES = 0x1FFF and SGITMAXPREFIXSIZE = 0xFFFF bound them. SpGistLeafTupleData packs tupstate:2, size:30 with nextOffset in the low 14 bits of t_info (SGLT_GET_NEXTOFFSET masks 0x3FFF). These are exact, not paraphrased.
  • Single-page chain invariant. The nextOffset field is a 2-byte OffsetNumber (not a TID), which is only sound because the README requires a chain to stay on one page; confirmed in the SpGistLeafTupleData comment.
  • Triple parity. GBUF_INNER_PARITY(x) = (x) % 3 in spgist_private.h; spgSplitNodeAction and doPickSplit request GBUF_INNER_PARITY(blkno+1) for child pages, matching the README’s (N+1) mod 3 rule.
  • Conditional locking restart. spgdoinsert uses ConditionalLockBuffer on child pages and returns false (caller retries) on failure — verified in the descent block. spgWalk share-locks one page at a time and goto redirect on a SPGIST_REDIRECT tuple.
  • PG18-correct. SP-GiST is not parallel-aware (amcanparallel = false, amcanbuildparallel = false); it participates in parallel vacuum via VACUUM_OPTION_PARALLEL_BULKDEL | VACUUM_OPTION_PARALLEL_COND_CLEANUP. spgvacuumscan uses the read-stream API (read_stream_begin_relation with READ_STREAM_USE_BATCHING), which is the PG17+/PG18 form. No PG19-only symbols are referenced.
  • WAL. Redo handlers live in spgxlog.c under RM_SPGIST_ID; record types such as XLOG_SPGIST_ADD_LEAF and XLOG_SPGIST_VACUUM_REDIRECT are emitted from the functions quoted above. WAL replay detail is out of scope here (see postgres-xlog-wal.md and postgres-recovery-redo.md).
  • Handler flags read in full. spghandler sets amcanorder = false, amcanorderbyop = true, amcanbackward = false, amoptionalkey = true, amstorage = true, amclusterable = false, ampredlocks = false, amsummarizing = false, amkeytype = InvalidOid, and amoptsprocnum = SPGIST_OPTIONS_PROC. SP-GiST has no SSI predicate-lock support (ampredlocks = false), unlike nbtree/gin/gist/hash — a consequence of its lock-light, redirect-based concurrency, in which the set of pages a scan touches is not stable enough to lock for serializable conflict detection. These were read verbatim from spgutils.c.

Beyond PostgreSQL — Comparative Designs & Research Frontiers

Section titled “Beyond PostgreSQL — Comparative Designs & Research Frontiers”

SP-GiST occupies an unusual niche: it is not a single index but a generic host for an entire family of space-partitioning structures. Placing it in the broader landscape clarifies both what it buys and what it gives up.

Lineage — the original SP-GiST framework

Section titled “Lineage — the original SP-GiST framework”

PostgreSQL’s SP-GiST descends directly from the academic SP-GiST framework of Aoki and, later, the comprehensive treatment by Eltabakh, Eldawy, and Mokbel (“SP-GiST: An Extensible Database Index for Supporting Space Partitioning Trees”, which generalized Hellerstein-Naughton-Pfeffer’s GiST, 1995 — see postgres-gist.md). GiST factored balanced, overlapping trees (B-trees, R-trees) behind a consistent/union/penalty/picksplit contract; SP-GiST does the same for unbalanced, disjoint, recursive-partition trees. The two are deliberately complementary: a datatype whose natural decomposition tiles the domain without overlap (a quadtree over points, a trie over strings) maps to SP-GiST; one whose decomposition needs overlapping bounding regions (an R-tree over polygons) maps to GiST. PostgreSQL is one of very few production systems to ship both generalized frameworks, which is why the two READMEs read as siblings.

The disjoint-partition advantage, and its cost

Section titled “The disjoint-partition advantage, and its cost”

The defining property — disjoint, exhaustive child regions — is exactly what lets SP-GiST descend a single path on insert and (for a point query) a single path on search. An R-tree must sometimes follow several children because their bounding boxes overlap; SP-GiST never does for a containment query. The cost is paid elsewhere:

  • No balance guarantee. Depth follows data clustering. A trie over pathological strings (a long common prefix) can be deep; the prefixSize and SplitTuple/postfix machinery bound the damage but cannot make the tree balanced the way a B-tree’s split-and-propagate does.
  • Range and kNN queries still fan out. inner_consistent returns a set of surviving nodes, so a window query over a quadtree visits every quadrant the window touches. SP-GiST’s win is the single-page chain and high fanout, not single-path traversal for these.
SystemQuadtree / k-d / trie equivalentMapping strategy
PostgreSQL SP-GiSTquadtree, k-d, radix trie, inet, rangegeneric 5-method opclass; inner-tuple + leaf-chain on page
OracleR-tree (Spatial), no generic SP frameworkdedicated spatial cartridge; no extensible trie host
MySQL/InnoDBR-tree (spatial), B-tree only otherwiseno quadtree/trie AM; spatial uses R-tree
SQL Servergrid-based spatial (multi-level tessellation)B-tree over a linearized grid, not a true partition tree
SQLiteR*-tree modulevirtual-table R-tree; no trie/quadtree host
Specialized (Lucene, RocksDB)FST / radix / ARTin-memory or LSM tries, not a page-mapped generic host

The contrast that matters: most engines hard-code one spatial structure (usually an R-tree) and stop, or push tries into an in-memory layer (Lucene’s FST, an Adaptive Radix Tree). PostgreSQL instead exposes the partition-tree abstraction itself so a text trie, a point quadtree, an inet prefix tree (network_spgist.c), and a range 2-D quad-partition (rangetypes_spgist.c) all share one page-mapping, one concurrency protocol, and one VACUUM — differing only in five C functions.

  • Cache- and SIMD-friendly tries. The Adaptive Radix Tree (ART, Leis et al. 2013) and Masstree show that in-memory tries can rival B-trees by adapting node fanout to occupancy and packing keys for SIMD comparison. SP-GiST’s allTheSame and prefix-suffix machinery are a coarse, on-disk analogue of ART’s node-size adaptivity, but SP-GiST does not yet adapt node representation the way ART switches among Node4/16/48/256.
  • Learned spatial indexes. RMI-style and Flood/Tsunami learned multi-dim indexes replace the fixed quadrant/centroid split with a model fit to the data distribution. The picksplit hook is the natural seam where a learned splitter could be injected without touching the core page mapping.
  • Concurrency without redirects. SP-GiST’s redirect/placeholder lattice is its answer to lock-light search; Bw-tree-style lock-free deltas and epoch-based reclamation are the modern alternative. SP-GiST’s redirect-then-recycle-at-VACUUM is closer to RCU than to a true lock-free structure, and its single-page-at-a-time share-lock walk is its simplest, most robust guarantee.
  • No SSI integration. Because ampredlocks = false, SP-GiST indexes cannot contribute fine-grained predicate locks to serializable transactions (see postgres-ssi-predicate-locking.md); a serializable query that relies on an SP-GiST scan falls back to coarser conflict detection. Extending SP-GiST with page- or node-granular predicate locks is an open engineering item.
flowchart TB
  Domain["A datatype with a natural<br/>recursive decomposition"] --> Q{"Do child regions<br/>overlap?"}
  Q -->|"no — disjoint tiling"| SP["SP-GiST<br/>single-path descent<br/>quadtree / k-d / trie / inet"]
  Q -->|"yes — bounding boxes overlap"| G["GiST<br/>balanced, multi-path<br/>R-tree / range / fulltext"]
  SP --> M["5 methods:<br/>config choose picksplit<br/>inner_consistent leaf_consistent"]
  G --> N["7 methods:<br/>consistent union penalty<br/>picksplit same compress decompress"]
  M --> Core["Shared core: page mapping,<br/>WAL, VACUUM, concurrency"]
  N --> Core

PostgreSQL source tree (REL_18_STABLE, commit 273fe94, as of 2026-06-05; all paths relative to the repo root at /data/hgryoo/references/postgres):

  • src/backend/access/spgist/README — the authoritative description of the disk-mapping problem, the inner/leaf tuple structure, the single-page chain invariant, redirect/placeholder states, triple parity, and the conditional-locking concurrency protocol.
  • src/backend/access/spgist/spgdoinsert.c — the insertion engine: spgdoinsert, addLeafTuple, moveLeafs, doPickSplit, checkAllTheSame, spgMatchNodeAction, spgAddNodeAction, spgSplitNodeAction.
  • src/backend/access/spgist/spgscan.c — the search engine: spgWalk, spgInnerTest, spgLeafTest, spgTestLeafTuple, spgPrepareScanKeys, spggettuple, spggetbitmap, spgcanreturn.
  • src/backend/access/spgist/spgutils.cspghandler (the IndexAmRoutine), spgGetCache, SpGistGetBuffer, SpGistSetLastUsedPage, SpGistInitMetapage, and the tuple-formation helpers spgFormLeafTuple/spgFormNodeTuple/spgFormInnerTuple.
  • src/backend/access/spgist/spgvacuum.c — reclamation: spgbulkdelete, spgvacuumscan, spgvacuumpage, vacuumLeafPage, vacuumRedirectAndPlaceholder, spgprocesspending, spgvacuumcleanup.
  • src/backend/access/spgist/spginsert.cspgbuild, spgbuildempty, spginsert (build callback and the retry-on-conditional-lock-failure loop).
  • src/backend/access/spgist/spgquadtreeproc.c, src/backend/access/spgist/spgkdtreeproc.c, src/backend/access/spgist/spgtextproc.c — the shipped opclass methods for the quadtree, k-d tree, and radix-trie examples quoted above.
  • src/backend/utils/adt/network_spgist.c, src/backend/utils/adt/rangetypes_spgist.cinet/cidr prefix-tree and range quad-partition opclasses (named as further examples of the same five-method contract; not quoted in depth).
  • src/include/access/spgist.h, src/include/access/spgist_private.h — the opclass support-proc numbers (SPGISTNProc = 7), the tupstate lattice, the fixed block numbers, and the bit-packed SpGistInnerTupleData / SpGistLeafTupleData / SpGistDeadTupleData headers.

Theory anchors (cross-referenced from knowledge/research/dbms-general/ and the paper bibliography at .omc/plans/postgres-paper-bibliography.md):

  • Database System Concepts (Silberschatz, Korth, Sudarshan, 7e), §24.4 “Indexing of Spatial Data” — quadtrees, k-d trees, and the region-partitioning family.
  • Hellerstein, Naughton & Pfeffer, “Generalized Search Trees for Database Systems” (VLDB 1995) — the GiST framework SP-GiST generalizes for the space-partitioning case (dbms-papers/-tracked; see postgres-gist.md).
  • Samet, Foundations of Multidimensional and Metric Data Structures — canonical reference for quadtree/k-d-tree/trie theory.
  • Leis, Kemper & Neumann, “The Adaptive Radix Tree: ARTful Indexing for Main-Memory Databases” (ICDE 2013) — modern in-memory trie contrast cited in the research-frontiers discussion.

Related KB docs: postgres-gist.md (the balanced/overlapping sibling framework and the lossy-key model), postgres-index-am.md (the IndexAmRoutine dispatch contract and amhandler mechanics), postgres-nbtree.md (the totally-ordered B-tree contrast), postgres-buffer-manager.md (page latching), postgres-vacuum.md and postgres-page-layout.md (heap-side reclamation and item-pointer layout), postgres-ssi-predicate-locking.md (why ampredlocks = false matters), postgres-xlog-wal.md / postgres-recovery-redo.md (the WAL replay path for the RM_SPGIST_ID records this doc only names).