PostgreSQL SP-GiST — Space-Partitioned Trees on Disk
Contents:
- Theoretical Background
- Common DBMS Design
- PostgreSQL’s Approach
- Source Walkthrough
- Source verification (as of 2026-06-05)
- Beyond PostgreSQL — Comparative Designs & Research Frontiers
- Sources
Theoretical Background
Section titled “Theoretical Background”The B-tree (see postgres-nbtree.md) is the right index when the keys are
totally ordered: every node splits the key space into contiguous ranges,
and balance keeps every leaf at the same depth. But a great many data types
have no useful total order. A 2-D point, an IP range, a polygon’s bounding
box, a string viewed character-by-character — for these the natural index is
not a range tree but a space-partitioning tree: a structure that
recursively cuts the domain (the plane, the address space, the alphabet)
into disjoint regions and routes each value to the region that contains it.
Database System Concepts (Silberschatz 7e) develops exactly this family in
§24.4 “Indexing of Spatial Data”, naming the three canonical members:
- The quadtree. A 2-D quadtree node stores a centroid point and has four children, one per quadrant (NE, SE, SW, NW) relative to the centroid. A point descends into the quadrant it falls in; the partition is by region, and the four regions exactly tile the parent’s region.
- The k-d tree. A k-d tree alternates the splitting dimension by level:
at even levels it splits on
x(a vertical cutting line), at odd levels ony(a horizontal cutting line). Each node is binary — “left of the line” vs. “right of it” — so a k-d tree is a quadtree’s space cut into two halves per level instead of four quadrants per level. - The radix tree (trie). For strings, a trie partitions by prefix: a
node represents a common prefix, and its children are labeled by the next
character. The depth of a leaf is the length of the discriminating
prefix, not
log N, so a trie’s shape follows the data, not a balance invariant.
Three structural facts unite these and separate them sharply from the
B-tree. First, they are unbalanced — the depth of a branch depends on
how clustered the data is, with no rebalancing machinery. Second, the
partition is disjoint and exhaustive by construction: a value belongs to
exactly one child region, so insertion and search descend a single path
(unlike R-trees, whose bounding boxes overlap and force multi-path search).
Third, in their textbook form they are pointer-chasing in-memory
structures: each node is a small heap-allocated record holding child
pointers. A quadtree over a million points might be twenty levels deep, and
naively each level is one pointer dereference — which on disk would be one
random I/O per level. That is fatal for a disk index, where the whole game
is to keep the number of page reads near log_F N with F (the fanout)
large.
So the theoretical problem SP-GiST solves is a mapping problem, stated at
the very top of src/backend/access/spgist/README: “The challenge is to map
tree nodes to disk pages in such a way that the search algorithm accesses
only a few disk pages, even if it traverses many nodes.” A pure
node-per-page layout wastes a page on a four-pointer quadtree node; a pure
node-per-pointer layout is the in-memory structure that bleeds I/O. SP-GiST’s
answer is to pack many logical nodes onto a page and to require that
structurally-linked groups (one inner tuple, one leaf-tuple chain) never
straddle a page boundary, so descending one logical level usually stays on
the same page and crossing to a child page is the exception, not the rule.
The “GiST” in the name is the second half of the story. GiST
(Generalized Search Tree, Hellerstein-Naughton-Pfeffer 1995; see
postgres-gist.md) showed that a balanced, overlapping search tree could
be made datatype-agnostic by factoring all the type-specific logic into a
handful of opclass methods (a consistent predicate, a union, a
penalty, a picksplit) while the core owns page layout, WAL, and
concurrency. SP-GiST is the same idea applied to the space-partitioning
family: the core owns the disk mapping, the redirect/placeholder bookkeeping,
the latching, and the WAL, while five opclass methods — config, choose,
picksplit, inner_consistent, leaf_consistent — encode whatever it
means to be a quadtree, or a k-d tree, or a radix trie. PostgreSQL ships
point_ops (quadtree) and kd_point_ops (k-d tree) over point, plus
text_ops (radix tree) over text, and inet_ops, range_ops,
box_ops/poly via the same five-method contract.
Two design knobs organize the rest of this document, exactly the way the L&Y invariants organized the nbtree doc:
- How is a pointer-linked tree mapped to pages without losing the single-path-descent property — what an inner tuple, a node, and a leaf chain physically are, and the rules that keep a chain and an inner tuple each on one page.
- How do the five opclass methods carve the responsibility between the datatype (which knows what a “quadrant” or a “common prefix” is) and the core (which knows nothing about geometry or strings but everything about pages, locks, and crash recovery).
Common DBMS Design
Section titled “Common DBMS Design”Before SP-GiST’s specific symbols, it helps to name the recurring techniques that any engine mapping a space-partitioning tree to disk converges on. The PostgreSQL choices in the next section then read as points in this space.
Bucketing many logical nodes per page
Section titled “Bucketing many logical nodes per page”The first move every disk-resident trie/quadtree makes is to stop equating
“tree node” with “disk page.” A 4-way quadtree node is ~40 bytes; an 8 KB
page can hold ~200 of them. So the page is a bucket of logical nodes, and
the on-disk pointer becomes a (page, offset) pair rather than a raw memory
address. The crucial refinement — the one that makes the I/O bound work — is
to co-locate nodes that are likely to be visited together. SP-GiST’s rule
is blunt and effective: an inner tuple’s whole node array lives in one
physical tuple on one page, and the set of leaf tuples reachable from a
single node forms a chain on one page, linked by 2-byte offsets rather
than full TIDs. Descending one logical edge that stays within a page is free;
only edges that cross to a child page cost an I/O.
Lazy, offset-stable deletion via tombstones
Section titled “Lazy, offset-stable deletion via tombstones”A space-partitioning tree has an awkward property: a child’s address is an
inner tuple’s downlink, and inner tuples elsewhere on the page must keep
their offset numbers stable, or every downlink pointing at them breaks. So an
engine cannot simply compact a page. The universal answer is tombstones:
when a tuple must logically go away but its slot can’t move, leave a
placeholder occupying the offset. SP-GiST has a four-state lattice —
LIVE, REDIRECT, DEAD, PLACEHOLDER — that distinguishes “moved, here’s
where to” (redirect) from “gone but still referenced” (dead) from “gone and
safe to reuse” (placeholder).
Redirects for lock-light concurrent search
Section titled “Redirects for lock-light concurrent search”If a search releases the parent’s lock before locking the child (to avoid holding a chain of locks), a concurrent insert can move the child’s contents elsewhere in the window between unlock and lock. The classic fix, shared with many B-link-style designs, is to leave a forwarding pointer at the old location: a scan that arrives at a moved tuple finds a redirect telling it the new address. The redirect is needed only as long as some snapshot-old scan could still be in flight, after which VACUUM can recycle it.
Conditional latch-coupling to bound deadlocks
Section titled “Conditional latch-coupling to bound deadlocks”An insert into a non-balanced tree must hold the parent locked while it locks the child (to update the parent’s downlink atomically), which risks deadlock when two inserts descend cross-linked branches in opposite order. Engines bound this either with a global lock ordering (impractical here, since the tree shape is data-dependent) or with conditional locking plus restart: try to lock the child without blocking; if that fails, drop everything and retry the whole insert. SP-GiST adds a placement heuristic — “triple parity” — that makes the cross-link pattern statistically rare.
Sequential-scan VACUUM with a pending list
Section titled “Sequential-scan VACUUM with a pending list”Reclaiming dead entries in a tree that inserts can reshape underneath you is hard if you walk the tree logically. The simpler approach many engines take is a physical sequential scan of every page, which visits each page exactly once and is immune to concurrent shape changes — except for the one race where an insert moves a yet-to-be-deleted entry from an unvisited page onto an already-visited page. SP-GiST handles that with a pending list: when the scan sees a recently-created redirect, it records the target and re-checks it between pages, so no target TID can escape deletion.
flowchart TB
subgraph Logical["Logical space-partitioning tree"]
R["inner: centroid c0<br/>nodes: NE SE SW NW"]
A["inner: centroid c1<br/>nodes ..."]
L1["leaf chain:<br/>pt, pt, pt"]
L2["leaf chain:<br/>pt, pt"]
R -->|NE| A
R -->|SE| L1
A -->|SW| L2
end
subgraph Physical["Physical page mapping"]
P1["inner page 1<br/>tuple R (4 nodes)<br/>tuple A (4 nodes)"]
P2["leaf page 5<br/>chain L1: off3->off7->0"]
P3["leaf page 9<br/>chain L2: off2->off4->0"]
P1 -.->|"downlink TID (5,3)"| P2
P1 -.->|"downlink TID (9,2)"| P3
end
PostgreSQL’s Approach
Section titled “PostgreSQL’s Approach”PostgreSQL’s SP-GiST is a faithful, careful implementation of the mapping
described in its README, authored by Teodor Sigaev and Oleg Bartunov. The
access-method handler spghandler wires the five core entry points into the
generic index-AM dispatch table (postgres-index-am.md covers that contract);
note amcanorderbyop = true (SP-GiST supports nearest-neighbor ordered
scans), amsearchnulls = true, amcaninclude = true, and amsupport = SPGISTNProc (seven support procs).
// spghandler — src/backend/access/spgist/spgutils.camroutine->amsupport = SPGISTNProc; /* 7 support procs */amroutine->amcanorderbyop = true; /* kNN ordered scan */amroutine->amsearchnulls = true; /* IS NULL via nulls tree */amroutine->amcaninclude = true; /* INCLUDE columns */amroutine->ambuild = spgbuild;amroutine->aminsert = spginsert;amroutine->ambulkdelete = spgbulkdelete;amroutine->amgettuple = spggettuple;amroutine->amgetbitmap = spggetbitmap;amroutine->amcanreturn = spgcanreturn; /* index-only scans */The two physical tuple types
Section titled “The two physical tuple types”Everything on disk is one of two real tuple shapes (plus the dead-tuple
overlay). An inner tuple is a fixed header, an optional prefix datum,
then a packed array of node tuples; each node tuple is an
IndexTupleData header carrying a (label, downlink-TID) pair. The
bit-packed header is the heart of the format:
// SpGistInnerTupleData — src/include/access/spgist_private.htypedef struct SpGistInnerTupleData{ unsigned int tupstate:2, /* LIVE/REDIRECT/DEAD/PLACEHOLDER */ allTheSame:1, /* all nodes in tuple are equivalent */ nNodes:13, /* number of nodes within inner tuple */ prefixSize:16; /* size of prefix, or 0 if none */ uint16 size; /* total size of inner tuple */ /* prefix datum follows, then nodes */} SpGistInnerTupleData;A leaf tuple holds a leaf datum, the heap TID it indexes, and — crucially
— a nextOffset link (packed into t_info) chaining it to the next leaf
under the same parent node on the same page:
// SpGistLeafTupleData — src/include/access/spgist_private.htypedef struct SpGistLeafTupleData{ unsigned int tupstate:2, /* LIVE/REDIRECT/DEAD/PLACEHOLDER */ size:30; uint16 t_info; /* nextOffset (14 bits) + flag bits */ ItemPointerData heapPtr; /* TID of represented heap tuple */ /* nulls bitmap (if flagged), then leaf datum + INCLUDE datums */} SpGistLeafTupleData;The nextOffset is a 2-byte OffsetNumber, not a full 6-byte TID, precisely
because the README requires the whole chain to live on one page — that is
the invariant that makes leaf chains cheap. A REDIRECT/DEAD/PLACEHOLDER
slot is reinterpreted through SpGistDeadTupleData, whose tupstate aliases
the same first two bits and whose pointer/xid carry the forwarding TID and
the inserting transaction.
Fixed page geography and the nulls tree
Section titled “Fixed page geography and the nulls tree”An SP-GiST relation has three fixed blocks: the metapage (block 0, carrying a magic number and the per-backend last-used-page cache), the root of the main tree (block 1), and the root of the nulls tree (block 2). The README’s NULLS-handling rule is implemented directly: indexed operators are assumed strict, so nulls live in a separate tree on a disjoint page set, searched with hardwired logic rather than opclass methods. The insert path picks its starting block from the key’s null-ness:
// spgdoinsert — src/backend/access/spgist/spgdoinsert.ccurrent.blkno = isnull ? SPGIST_NULL_BLKNO : SPGIST_ROOT_BLKNO;// constants — src/include/access/spgist_private.h#define SPGIST_METAPAGE_BLKNO (0) /* metapage */#define SPGIST_ROOT_BLKNO (1) /* root for normal entries */#define SPGIST_NULL_BLKNO (2) /* root for null-value entries */#define SPGIST_LIVE 0 /* normal live tuple */#define SPGIST_REDIRECT 1 /* temporary redirection placeholder */#define SPGIST_DEAD 2 /* dead, cannot be removed (still linked) */#define SPGIST_PLACEHOLDER 3 /* placeholder to preserve offsets */The five opclass methods
Section titled “The five opclass methods”The datatype is encapsulated in five (plus two optional) support procedures.
spgGetCache calls config once per index and caches the result in
rd_amcache; this declares the four types the core must marshal — attType
(input), attLeafType (leaf storage), attPrefixType (inner prefix), and
attLabelType (node labels) — plus longValuesOK (can the opclass shorten a
too-long leaf datum by suffixing?). The radix-tree config is illustrative:
text prefixes, int2 (single-character) labels, and suffixing on:
// spg_text_config — src/backend/access/spgist/spgtextproc.ccfg->prefixType = TEXTOID;cfg->labelType = INT2OID;cfg->canReturnData = true;cfg->longValuesOK = true; /* suffixing will shorten long values */On insert, choose is the decision oracle at each inner tuple. It is
handed the value, the current prefix, and the node labels, and returns one of
three verdicts — spgMatchNode (descend through node N, adjusting the level
and the “rest” of the datum), spgAddNode (this prefix matches but no node
does; add a labeled node), or spgSplitTuple (the prefix itself is too
specific; break it into a shorter prefix tuple plus a postfix child). The
quad-tree choose only ever matches, because a centroid is never
“inconsistent” with a point — it just routes to the right quadrant:
// spg_quad_choose — src/backend/access/spgist/spgquadtreeproc.cout->resultType = spgMatchNode;out->result.matchNode.nodeN = getQuadrant(centroid, inPoint) - 1;out->result.matchNode.levelAdd = 0;out->result.matchNode.restDatum = PointPGetDatum(inPoint);When a leaf chain overflows its page, picksplit converts the whole list
into one inner tuple whose nodes fan the leaves into sublists. The quad-tree
version computes the centroid (the average of the points), declares four
nodes, and maps each input point to its quadrant:
// spg_quad_picksplit — src/backend/access/spgist/spgquadtreeproc.cout->hasPrefix = true;out->prefixDatum = PointPGetDatum(centroid);out->nNodes = 4;out->nodeLabels = NULL; /* quadrants are positional, unlabeled */for (i = 0; i < in->nTuples; i++){ Point *p = DatumGetPointP(in->datums[i]); int quadrant = getQuadrant(centroid, p) - 1; out->leafTupleDatums[i] = PointPGetDatum(p); out->mapTuplesToNodes[i] = quadrant;}On search, the dual pair inner_consistent and leaf_consistent prune.
inner_consistent is given the inner tuple’s prefix and node set and a list of
scan keys, and returns the subset of nodes whose subtrees could contain a
match — for a quadtree it intersects each query operator (left-of, contained-
by, …) against the quadrant geometry and returns a bitmask of surviving
quadrants. leaf_consistent makes the final per-tuple decision. The k-d
tree’s choose shows the level-alternating dimension and the levelAdd = 1
that drives the alternation:
// spg_kd_choose — src/backend/access/spgist/spgkdtreeproc.ccoord = DatumGetFloat8(in->prefixDatum);out->resultType = spgMatchNode;out->result.matchNode.nodeN = (getSide(coord, in->level % 2, inPoint) > 0) ? 0 : 1;out->result.matchNode.levelAdd = 1; /* alternate x/y each level */out->result.matchNode.restDatum = PointPGetDatum(inPoint);flowchart TB
Ins["spgdoinsert: leaf datum + heap TID"] --> Loop{"current page type?"}
Loop -->|leaf| Fit{"leaf tuple fits<br/>on page?"}
Fit -->|yes| AddLeaf["addLeafTuple: link into chain"]
Fit -->|"no, small chain"| Move["moveLeafs: relocate whole chain"]
Fit -->|"no, large"| Pick["doPickSplit: build inner tuple,<br/>fan leaves to sublists"]
Loop -->|inner| Choose["call choose()"]
Choose -->|spgMatchNode| Descend["descend node N,<br/>level += levelAdd"]
Choose -->|spgAddNode| AddNode["spgAddNodeAction,<br/>retry choose"]
Choose -->|spgSplitTuple| Split["spgSplitNodeAction:<br/>prefix + postfix, retry"]
Descend --> Loop
AddNode --> Choose
Split --> Choose
Pick --> Descend
Source Walkthrough
Section titled “Source Walkthrough”The code splits cleanly along the index-AM surface: build and metapage in
spginsert.c, the insertion engine in spgdoinsert.c, the search engine in
spgscan.c, support and tuple formation in spgutils.c, the per-opclass
methods in spg{quad,kd,text}treeproc.c, and reclamation in spgvacuum.c.
Build, metapage, and the AM handler
Section titled “Build, metapage, and the AM handler”spghandler (in spgutils.c) returns the IndexAmRoutine. spgbuild (in
spginsert.c) creates the three fixed pages — it initializes the metapage via
SpGistInitMetapage, the main root (block 1), and the nulls root (block 2) —
then scans the heap, calling spgdoinsert per tuple through a build callback.
spgGetCache (in spgutils.c) lazily runs the opclass config method once
and caches the four SpGistTypeDesc entries plus spgConfigOut in
rd_amcache; it also reads the last-used-page cache off the metapage.
initSpGistState populates the per-operation SpGistState.
Insertion: the spgdoinsert descent loop
Section titled “Insertion: the spgdoinsert descent loop”spgdoinsert (in spgdoinsert.c) is the heart. It first forms the leaf
datum (calling the optional compress method, else de-toasting the input),
computes leafSize, and rejects oversize values up front unless
longValuesOK lets the opclass suffix them. It seeds current.blkno to the
main or nulls root and enters a for(;;) loop. Each iteration acquires the
current buffer (conditionally locking child pages to avoid deadlock — on
failure it returns false and the caller spginsert retries), then branches
on page type:
- Leaf page. It forms the leaf tuple and tries three things in order:
addLeafTupleif it fits;moveLeafsif the chain is small enough to relocate wholesale (checkSplitConditionsreports chain size, gated at< SPGIST_PAGE_CAPACITY/2,< 64tuples); otherwisedoPickSplit. - Inner page (
process_inner_tuple:label). It marshals aspgChooseIn, calls the opclasschoose(or forcesspgMatchNodefor the nulls tree), and dispatches onout.resultType.
The dispatch on the choose verdict is the structural core of insertion:
// spgdoinsert — src/backend/access/spgist/spgdoinsert.cswitch (out.resultType){ case spgMatchNode: spgMatchNodeAction(index, state, innerTuple, ¤t, &parent, out.result.matchNode.nodeN); level += out.result.matchNode.levelAdd; if (!isnull) { leafDatums[spgKeyColumn] = out.result.matchNode.restDatum; leafSize = SpGistGetLeafTupleSize(leafDescriptor, leafDatums, isnulls); leafSize += sizeof(ItemIdData); } break; /* loop: descend to child */ case spgAddNode: spgAddNodeAction(index, state, innerTuple, ¤t, &parent, out.result.addNode.nodeN, out.result.addNode.nodeLabel); goto process_inner_tuple; /* retry choose on enlarged tuple */ case spgSplitTuple: spgSplitNodeAction(index, state, innerTuple, ¤t, &out); goto process_inner_tuple; /* retry choose on the prefix tuple */}addLeafTuple (in spgdoinsert.c) handles the two chain cases. A root or
chain-head insert sets nextOffset = Invalid and, if there’s a parent,
updates the parent’s downlink via saveNodeLink. A mid-chain insert exploits
the README’s “insert second” trick — it links the new tuple where the head
pointed, then points the head at the new tuple, so the chain head address
never changes and the parent downlink stays valid without chasing the chain:
// addLeafTuple — src/backend/access/spgist/spgdoinsert.cSGLT_SET_NEXTOFFSET(leafTuple, SGLT_GET_NEXTOFFSET(head));offnum = SpGistPageAddNewItem(state, current->page, (Item) leafTuple, leafTuple->size, NULL, false);/* re-get head (it may have moved on page) and link it to the new tuple */head = (SpGistLeafTuple) PageGetItem(current->page, PageGetItemId(current->page, current->offnum));SGLT_SET_NEXTOFFSET(head, offnum);PickSplit: turning a leaf chain into an inner tuple
Section titled “PickSplit: turning a leaf chain into an inner tuple”doPickSplit (in spgdoinsert.c) is invoked when a leaf chain can neither
absorb the new tuple nor be relocated. It gathers all LIVE leaf datums from
the chain (or, for a root-as-leaf split, all tuples on the page), calls the
opclass picksplit to get a prefix and a mapTuplesToNodes assignment, then
forms node tuples and the inner tuple with spgFormNodeTuple /
spgFormInnerTuple. A subtle safety net is checkAllTheSame: if picksplit
dumped every tuple into one node (a degenerate split that would loop forever),
the core overrides it into allTheSame mode and distributes randomly. The
new inner tuple replaces the chain head’s parent downlink; leaves are
distributed to one or more leaf pages, and old slots become redirects or
placeholders.
// doPickSplit — src/backend/access/spgist/spgdoinsert.cinnerTuple = spgFormInnerTuple(state, out.hasPrefix, out.prefixDatum, out.nNodes, nodes);innerTuple->allTheSame = allTheSame;/* re-point nodes[] into the formed tuple so downlinks can be patched */SGITITERATE(innerTuple, i, node){ nodes[i] = node;}SplitTuple: shortening an over-specific prefix
Section titled “SplitTuple: shortening an over-specific prefix”spgSplitNodeAction (in spgdoinsert.c) implements the radix-tree case where
the inner tuple’s prefix only partially matches the new value. It builds a
prefix tuple (the matching head, which must be no larger than the original
so it can overwrite it in place) and a postfix tuple (the original node
array under a new shorter prefix), placing the postfix on the same page or, if
it won’t fit (or the page is the root), on a page with the next triple-parity:
// spgSplitNodeAction — src/backend/access/spgist/spgdoinsert.cif (prefixTuple->size > innerTuple->size) elog(ERROR, "SPGiST inner-tuple split must not produce longer prefix");...if (SpGistBlockIsRoot(current->blkno) || SpGistPageGetFreeSpace(current->page, 1) + innerTuple->size < prefixTuple->size + postfixTuple->size + sizeof(ItemIdData)){ /* postfix is a child of prefix -> next triple parity */ newBuffer = SpGistGetBuffer(index, GBUF_INNER_PARITY(current->blkno + 1), postfixTuple->size + sizeof(ItemIdData), &xlrec.newPage);}Page allocation, triple parity, and the last-used cache
Section titled “Page allocation, triple parity, and the last-used cache”SpGistGetBuffer (in spgutils.c) is the allocator. It consults the
per-backend last-used-page cache (GET_LUP) for a page of the right flavor
(leaf vs. inner-by-parity, main vs. nulls), conditionally locks it, rechecks
the page type and free space (the cache can be stale), and falls back to
allocNewBuffer otherwise. SpGistSetLastUsedPage writes the freshest free
space back to the cache when a page is released. The triple-parity rule —
GBUF_INNER_PARITY(x) = x % 3, with a child placed at parity
(parent+1) % 3 — makes it unlikely that two inserts descending different
branches will hold cross-referenced page locks, reducing deadlock restarts.
Search: spgWalk and the consistent methods
Section titled “Search: spgWalk and the consistent methods”spgrescan/spgPrepareScanKeys (in spgscan.c) set up the scan, separating
null-related quals. spgWalk is the traversal driver: it pops a
SpGistSearchItem from a pairing heap (scanQueue) — a priority queue
ordered by distance for kNN scans, FIFO-like otherwise — share-locks the
target page, and dispatches. Critically it locks one page at a time and
releases before visiting the next stack item, so searches never deadlock; the
cost is that it must honor redirect tuples left by concurrent inserts:
// spgWalk — src/backend/access/spgist/spgscan.celse /* page is inner */{ SpGistInnerTuple innerTuple = (SpGistInnerTuple) PageGetItem(page, PageGetItemId(page, offset)); if (innerTuple->tupstate != SPGIST_LIVE) { if (innerTuple->tupstate == SPGIST_REDIRECT) { /* transfer attention to redirect point */ item->heapPtr = ((SpGistDeadTuple) innerTuple)->pointer; goto redirect; } elog(ERROR, "unexpected SPGiST tuple state: %d", innerTuple->tupstate); } spgInnerTest(so, item, innerTuple, isnull);}spgInnerTest (in spgscan.c) calls the opclass inner_consistent (or, for
nulls, forces all children), then for each surviving node builds a child
SpGistSearchItem and pushes it on the queue. spgTestLeafTuple walks a leaf
chain, following redirects and skipping DEAD entries, calling spgLeafTest →
leaf_consistent per LIVE tuple; matches are either reported immediately
(unordered scan) or queued as heap items (ordered/kNN scan). spggettuple,
spggetbitmap, and spgcanreturn (index-only scan eligibility) sit on top.
VACUUM: sequential scan plus pending list
Section titled “VACUUM: sequential scan plus pending list”spgbulkdelete (in spgvacuum.c) sets up a spgBulkDeleteState and calls
spgvacuumscan, which reads every non-meta page in physical order through a
read stream and calls spgvacuumpage on each. After each page it drains the
pending list via spgprocesspending — the mechanism that catches tuples a
concurrent PickSplit moved onto an already-scanned page. vacuumLeafPage (in
spgvacuum.c) builds a predecessor[] reverse-map of the nextOffset links
to find chain heads, then computes six work arrays (dead / placeholder /
move-src / move-dest / chain-src / chain-dest) so the mutation and its WAL
replay do exactly the same thing. vacuumRedirectAndPlaceholder converts
expired redirects to placeholders using a global-visibility test and truncates
trailing placeholders:
// vacuumRedirectAndPlaceholder — src/backend/access/spgist/spgvacuum.cif (dt->tupstate == SPGIST_REDIRECT && (!TransactionIdIsValid(dt->xid) || GlobalVisTestIsRemovableXid(vistest, dt->xid))){ dt->tupstate = SPGIST_PLACEHOLDER; opaque->nRedirection--; opaque->nPlaceholder++; ItemPointerSetInvalid(&dt->pointer); itemToPlaceholder[xlrec.nToPlaceholder++] = i;}spgvacuumcleanup (in spgvacuum.c) does nothing if spgbulkdelete already
ran; otherwise it runs an empty-target scan to recycle redirects/placeholders,
refresh the FSM, and gather statistics.
Position hints (as of 2026-06-05, REL_18 273fe94)
Section titled “Position hints (as of 2026-06-05, REL_18 273fe94)”| Symbol | File | Line |
|---|---|---|
spghandler | src/backend/access/spgist/spgutils.c | 44 |
spgGetCache | src/backend/access/spgist/spgutils.c | 188 |
SpGistGetBuffer | src/backend/access/spgist/spgutils.c | 569 |
SpGistSetLastUsedPage | src/backend/access/spgist/spgutils.c | 673 |
SpGistInitMetapage | src/backend/access/spgist/spgutils.c | 732 |
spgFormLeafTuple | src/backend/access/spgist/spgutils.c | 871 |
spgFormNodeTuple | src/backend/access/spgist/spgutils.c | 960 |
spgFormInnerTuple | src/backend/access/spgist/spgutils.c | 1002 |
spgExtractNodeLabels | src/backend/access/spgist/spgutils.c | 1160 |
spgbuild | src/backend/access/spgist/spginsert.c | 73 |
spginsert | src/backend/access/spgist/spginsert.c | 183 |
addLeafTuple | src/backend/access/spgist/spgdoinsert.c | 203 |
moveLeafs | src/backend/access/spgist/spgdoinsert.c | 387 |
checkAllTheSame | src/backend/access/spgist/spgdoinsert.c | 599 |
doPickSplit | src/backend/access/spgist/spgdoinsert.c | 677 |
spgMatchNodeAction | src/backend/access/spgist/spgdoinsert.c | 1459 |
spgAddNodeAction | src/backend/access/spgist/spgdoinsert.c | 1513 |
spgSplitNodeAction | src/backend/access/spgist/spgdoinsert.c | 1715 |
spgdoinsert | src/backend/access/spgist/spgdoinsert.c | 1914 |
spgPrepareScanKeys | src/backend/access/spgist/spgscan.c | 208 |
spgLeafTest | src/backend/access/spgist/spgscan.c | 516 |
spgInnerTest | src/backend/access/spgist/spgscan.c | 667 |
spgTestLeafTuple | src/backend/access/spgist/spgscan.c | 763 |
spgWalk | src/backend/access/spgist/spgscan.c | 817 |
spggetbitmap | src/backend/access/spgist/spgscan.c | 942 |
spggettuple | src/backend/access/spgist/spgscan.c | 1026 |
spgcanreturn | src/backend/access/spgist/spgscan.c | 1083 |
vacuumLeafPage | src/backend/access/spgist/spgvacuum.c | 126 |
vacuumRedirectAndPlaceholder | src/backend/access/spgist/spgvacuum.c | 494 |
spgvacuumpage | src/backend/access/spgist/spgvacuum.c | 622 |
spgprocesspending | src/backend/access/spgist/spgvacuum.c | 688 |
spgvacuumscan | src/backend/access/spgist/spgvacuum.c | 800 |
spgbulkdelete | src/backend/access/spgist/spgvacuum.c | 950 |
spgvacuumcleanup | src/backend/access/spgist/spgvacuum.c | 981 |
spg_quad_choose | src/backend/access/spgist/spgquadtreeproc.c | 115 |
spg_quad_picksplit | src/backend/access/spgist/spgquadtreeproc.c | 169 |
spg_quad_inner_consistent | src/backend/access/spgist/spgquadtreeproc.c | 227 |
spg_kd_choose | src/backend/access/spgist/spgkdtreeproc.c | 53 |
spg_kd_picksplit | src/backend/access/spgist/spgkdtreeproc.c | 108 |
spg_text_config | src/backend/access/spgist/spgtextproc.c | 96 |
SpGistInnerTupleData | src/include/access/spgist_private.h | 293 |
SpGistLeafTupleData | src/include/access/spgist_private.h | 383 |
SpGistDeadTupleData | src/include/access/spgist_private.h | 427 |
Source verification (as of 2026-06-05)
Section titled “Source verification (as of 2026-06-05)”All code excerpts and symbols above were read directly from the REL_18_STABLE working tree at commit 273fe94. Verification notes:
- Five-method opclass contract.
src/include/access/spgist.hdefinesSPGIST_CONFIG_PROC(1),SPGIST_CHOOSE_PROC(2),SPGIST_PICKSPLIT_PROC(3),SPGIST_INNER_CONSISTENT_PROC(4),SPGIST_LEAF_CONSISTENT_PROC(5), with optionalSPGIST_COMPRESS_PROC(6) andSPGIST_OPTIONS_PROC(7);SPGISTNProc = 7.spghandlersetsamsupport = SPGISTNProc. - Tuple states. The four
tupstatevaluesSPGIST_LIVE/REDIRECT/DEAD/PLACEHOLDER(0..3) and the fixed blocksSPGIST_METAPAGE_BLKNO(0)/SPGIST_ROOT_BLKNO(1)/SPGIST_NULL_BLKNO(2) are confirmed inspgist_private.h. - Bit-packed headers.
SpGistInnerTupleDatapackstupstate:2, allTheSame:1, nNodes:13, prefixSize:16;SGITMAXNNODES = 0x1FFFandSGITMAXPREFIXSIZE = 0xFFFFbound them.SpGistLeafTupleDatapackstupstate:2, size:30withnextOffsetin the low 14 bits oft_info(SGLT_GET_NEXTOFFSETmasks0x3FFF). These are exact, not paraphrased. - Single-page chain invariant. The
nextOffsetfield is a 2-byteOffsetNumber(not a TID), which is only sound because the README requires a chain to stay on one page; confirmed in theSpGistLeafTupleDatacomment. - Triple parity.
GBUF_INNER_PARITY(x) = (x) % 3inspgist_private.h;spgSplitNodeActionanddoPickSplitrequestGBUF_INNER_PARITY(blkno+1)for child pages, matching the README’s(N+1) mod 3rule. - Conditional locking restart.
spgdoinsertusesConditionalLockBufferon child pages and returnsfalse(caller retries) on failure — verified in the descent block.spgWalkshare-locks one page at a time andgoto redirecton aSPGIST_REDIRECTtuple. - PG18-correct. SP-GiST is not parallel-aware (
amcanparallel = false,amcanbuildparallel = false); it participates in parallel vacuum viaVACUUM_OPTION_PARALLEL_BULKDEL | VACUUM_OPTION_PARALLEL_COND_CLEANUP.spgvacuumscanuses the read-stream API (read_stream_begin_relationwithREAD_STREAM_USE_BATCHING), which is the PG17+/PG18 form. No PG19-only symbols are referenced. - WAL. Redo handlers live in
spgxlog.cunderRM_SPGIST_ID; record types such asXLOG_SPGIST_ADD_LEAFandXLOG_SPGIST_VACUUM_REDIRECTare emitted from the functions quoted above. WAL replay detail is out of scope here (seepostgres-xlog-wal.mdandpostgres-recovery-redo.md). - Handler flags read in full.
spghandlersetsamcanorder = false,amcanorderbyop = true,amcanbackward = false,amoptionalkey = true,amstorage = true,amclusterable = false,ampredlocks = false,amsummarizing = false,amkeytype = InvalidOid, andamoptsprocnum = SPGIST_OPTIONS_PROC. SP-GiST has no SSI predicate-lock support (ampredlocks = false), unlike nbtree/gin/gist/hash — a consequence of its lock-light, redirect-based concurrency, in which the set of pages a scan touches is not stable enough to lock for serializable conflict detection. These were read verbatim fromspgutils.c.
Beyond PostgreSQL — Comparative Designs & Research Frontiers
Section titled “Beyond PostgreSQL — Comparative Designs & Research Frontiers”SP-GiST occupies an unusual niche: it is not a single index but a generic host for an entire family of space-partitioning structures. Placing it in the broader landscape clarifies both what it buys and what it gives up.
Lineage — the original SP-GiST framework
Section titled “Lineage — the original SP-GiST framework”PostgreSQL’s SP-GiST descends directly from the academic SP-GiST framework
of Aoki and, later, the comprehensive treatment by Eltabakh, Eldawy, and
Mokbel (“SP-GiST: An Extensible Database Index for Supporting Space
Partitioning Trees”, which generalized Hellerstein-Naughton-Pfeffer’s GiST,
1995 — see postgres-gist.md). GiST factored balanced, overlapping trees
(B-trees, R-trees) behind a consistent/union/penalty/picksplit
contract; SP-GiST does the same for unbalanced, disjoint, recursive-partition
trees. The two are deliberately complementary: a datatype whose natural
decomposition tiles the domain without overlap (a quadtree over points, a trie
over strings) maps to SP-GiST; one whose decomposition needs overlapping
bounding regions (an R-tree over polygons) maps to GiST. PostgreSQL is one of
very few production systems to ship both generalized frameworks, which is
why the two READMEs read as siblings.
The disjoint-partition advantage, and its cost
Section titled “The disjoint-partition advantage, and its cost”The defining property — disjoint, exhaustive child regions — is exactly what lets SP-GiST descend a single path on insert and (for a point query) a single path on search. An R-tree must sometimes follow several children because their bounding boxes overlap; SP-GiST never does for a containment query. The cost is paid elsewhere:
- No balance guarantee. Depth follows data clustering. A trie over
pathological strings (a long common prefix) can be deep; the
prefixSizeandSplitTuple/postfix machinery bound the damage but cannot make the tree balanced the way a B-tree’s split-and-propagate does. - Range and kNN queries still fan out.
inner_consistentreturns a set of surviving nodes, so a window query over a quadtree visits every quadrant the window touches. SP-GiST’s win is the single-page chain and high fanout, not single-path traversal for these.
How other systems index the same shapes
Section titled “How other systems index the same shapes”| System | Quadtree / k-d / trie equivalent | Mapping strategy |
|---|---|---|
| PostgreSQL SP-GiST | quadtree, k-d, radix trie, inet, range | generic 5-method opclass; inner-tuple + leaf-chain on page |
| Oracle | R-tree (Spatial), no generic SP framework | dedicated spatial cartridge; no extensible trie host |
| MySQL/InnoDB | R-tree (spatial), B-tree only otherwise | no quadtree/trie AM; spatial uses R-tree |
| SQL Server | grid-based spatial (multi-level tessellation) | B-tree over a linearized grid, not a true partition tree |
| SQLite | R*-tree module | virtual-table R-tree; no trie/quadtree host |
| Specialized (Lucene, RocksDB) | FST / radix / ART | in-memory or LSM tries, not a page-mapped generic host |
The contrast that matters: most engines hard-code one spatial structure
(usually an R-tree) and stop, or push tries into an in-memory layer (Lucene’s
FST, an Adaptive Radix Tree). PostgreSQL instead exposes the partition-tree
abstraction itself so a text trie, a point quadtree, an inet prefix
tree (network_spgist.c), and a range 2-D quad-partition
(rangetypes_spgist.c) all share one page-mapping, one concurrency protocol,
and one VACUUM — differing only in five C functions.
Research frontiers
Section titled “Research frontiers”- Cache- and SIMD-friendly tries. The Adaptive Radix Tree (ART, Leis et
al. 2013) and Masstree show that in-memory tries can rival B-trees by
adapting node fanout to occupancy and packing keys for SIMD comparison.
SP-GiST’s
allTheSameand prefix-suffix machinery are a coarse, on-disk analogue of ART’s node-size adaptivity, but SP-GiST does not yet adapt node representation the way ART switches among Node4/16/48/256. - Learned spatial indexes. RMI-style and Flood/Tsunami learned multi-dim
indexes replace the fixed quadrant/centroid split with a model fit to the
data distribution. The
picksplithook is the natural seam where a learned splitter could be injected without touching the core page mapping. - Concurrency without redirects. SP-GiST’s redirect/placeholder lattice is its answer to lock-light search; Bw-tree-style lock-free deltas and epoch-based reclamation are the modern alternative. SP-GiST’s redirect-then-recycle-at-VACUUM is closer to RCU than to a true lock-free structure, and its single-page-at-a-time share-lock walk is its simplest, most robust guarantee.
- No SSI integration. Because
ampredlocks = false, SP-GiST indexes cannot contribute fine-grained predicate locks to serializable transactions (seepostgres-ssi-predicate-locking.md); a serializable query that relies on an SP-GiST scan falls back to coarser conflict detection. Extending SP-GiST with page- or node-granular predicate locks is an open engineering item.
flowchart TB
Domain["A datatype with a natural<br/>recursive decomposition"] --> Q{"Do child regions<br/>overlap?"}
Q -->|"no — disjoint tiling"| SP["SP-GiST<br/>single-path descent<br/>quadtree / k-d / trie / inet"]
Q -->|"yes — bounding boxes overlap"| G["GiST<br/>balanced, multi-path<br/>R-tree / range / fulltext"]
SP --> M["5 methods:<br/>config choose picksplit<br/>inner_consistent leaf_consistent"]
G --> N["7 methods:<br/>consistent union penalty<br/>picksplit same compress decompress"]
M --> Core["Shared core: page mapping,<br/>WAL, VACUUM, concurrency"]
N --> Core
Sources
Section titled “Sources”PostgreSQL source tree (REL_18_STABLE, commit 273fe94, as of 2026-06-05;
all paths relative to the repo root at /data/hgryoo/references/postgres):
src/backend/access/spgist/README— the authoritative description of the disk-mapping problem, the inner/leaf tuple structure, the single-page chain invariant, redirect/placeholder states, triple parity, and the conditional-locking concurrency protocol.src/backend/access/spgist/spgdoinsert.c— the insertion engine:spgdoinsert,addLeafTuple,moveLeafs,doPickSplit,checkAllTheSame,spgMatchNodeAction,spgAddNodeAction,spgSplitNodeAction.src/backend/access/spgist/spgscan.c— the search engine:spgWalk,spgInnerTest,spgLeafTest,spgTestLeafTuple,spgPrepareScanKeys,spggettuple,spggetbitmap,spgcanreturn.src/backend/access/spgist/spgutils.c—spghandler(theIndexAmRoutine),spgGetCache,SpGistGetBuffer,SpGistSetLastUsedPage,SpGistInitMetapage, and the tuple-formation helpersspgFormLeafTuple/spgFormNodeTuple/spgFormInnerTuple.src/backend/access/spgist/spgvacuum.c— reclamation:spgbulkdelete,spgvacuumscan,spgvacuumpage,vacuumLeafPage,vacuumRedirectAndPlaceholder,spgprocesspending,spgvacuumcleanup.src/backend/access/spgist/spginsert.c—spgbuild,spgbuildempty,spginsert(build callback and the retry-on-conditional-lock-failure loop).src/backend/access/spgist/spgquadtreeproc.c,src/backend/access/spgist/spgkdtreeproc.c,src/backend/access/spgist/spgtextproc.c— the shipped opclass methods for the quadtree, k-d tree, and radix-trie examples quoted above.src/backend/utils/adt/network_spgist.c,src/backend/utils/adt/rangetypes_spgist.c—inet/cidrprefix-tree andrangequad-partition opclasses (named as further examples of the same five-method contract; not quoted in depth).src/include/access/spgist.h,src/include/access/spgist_private.h— the opclass support-proc numbers (SPGISTNProc = 7), thetupstatelattice, the fixed block numbers, and the bit-packedSpGistInnerTupleData/SpGistLeafTupleData/SpGistDeadTupleDataheaders.
Theory anchors (cross-referenced from
knowledge/research/dbms-general/ and the paper bibliography at
.omc/plans/postgres-paper-bibliography.md):
- Database System Concepts (Silberschatz, Korth, Sudarshan, 7e), §24.4 “Indexing of Spatial Data” — quadtrees, k-d trees, and the region-partitioning family.
- Hellerstein, Naughton & Pfeffer, “Generalized Search Trees for Database
Systems” (VLDB 1995) — the GiST framework SP-GiST generalizes for the
space-partitioning case (
dbms-papers/-tracked; seepostgres-gist.md). - Samet, Foundations of Multidimensional and Metric Data Structures — canonical reference for quadtree/k-d-tree/trie theory.
- Leis, Kemper & Neumann, “The Adaptive Radix Tree: ARTful Indexing for Main-Memory Databases” (ICDE 2013) — modern in-memory trie contrast cited in the research-frontiers discussion.
Related KB docs: postgres-gist.md (the balanced/overlapping sibling
framework and the lossy-key model), postgres-index-am.md (the
IndexAmRoutine dispatch contract and amhandler mechanics),
postgres-nbtree.md (the totally-ordered B-tree contrast),
postgres-buffer-manager.md (page latching), postgres-vacuum.md and
postgres-page-layout.md (heap-side reclamation and item-pointer layout),
postgres-ssi-predicate-locking.md (why ampredlocks = false matters),
postgres-xlog-wal.md / postgres-recovery-redo.md (the WAL replay path for
the RM_SPGIST_ID records this doc only names).