PostgreSQL B-Tree (nbtree) — Lehman-Yao Concurrency, Splits, and Index-Tuple Deletion
Contents:
- Theoretical Background
- Common DBMS Design
- PostgreSQL’s Approach
- Source Walkthrough
- Source verification (as of 2026-06-05)
- Beyond PostgreSQL — Comparative Designs & Research Frontiers
- Sources
Theoretical Background
Section titled “Theoretical Background”The B-tree is the default disk index of nearly every relational engine, and
the reason is structural: it is a balanced, high-fan-out search tree whose
every leaf sits at the same depth, so a lookup, insertion, or range scan
costs O(log_F N) page reads where F (the fan-out) is large enough that a
billion-row table is three or four levels deep. Comer’s classic survey The
Ubiquitous B-Tree (Comer 1979; captured in dbms-papers/btree.md) and
Database System Concepts (Silberschatz 7e, ch. 14 “Indexing”) frame the
data structure: internal pages hold separator keys alternating with
pointers (downlinks) to children, leaf pages hold the actual keyed
entries, and the tree stays balanced by splitting a full page into two and
propagating a new separator up to the parent — recursively, up to a root
split that increases the tree’s height by one. PostgreSQL, like most
engines, uses the B+-tree variant: all data entries live at the leaf
level, internal pages hold only navigation keys, and the leaves are
chained into a linked list so a range scan never re-ascends the tree.
The textbook B-tree is a single-threaded data structure. The hard part — the part that the textbook chapter mostly waves at — is concurrency. Consider two design knobs that every concurrent B-tree implementation must turn:
-
How does a reader avoid being corrupted by a concurrent writer? The naive answer is lock-coupling (a.k.a. latch-crabbing): a descending search holds a read lock on the parent until it has locked the child, so a writer splitting the child cannot do so while a reader is in flight. This is correct but throttles concurrency — every reader serializes against writers along its whole path, and the root becomes a hot spot because every operation passes through it.
-
How does a split publish its result atomically? A split mutates three pages (the splitting page, its new right sibling, and the parent that gains a downlink). If a reader can observe the new right page before the parent’s downlink exists, a naive design either has to hold locks across all three pages or risk the reader missing keys that moved right.
The landmark answer is Lehman & Yao 1981, “Efficient Locking for
Concurrent Operations on B-Trees” (TODS; the operative reference cited at
the very top of src/backend/access/nbtree/README, and the paper text is
captured in dbms-papers/btree.md). Their insight is to make the tree
self-repairing under concurrency by adding two fields to every page:
- a right-link — a pointer to the page’s immediate right sibling on the same level (turning each level into a singly-linked list left-to-right), and
- a high key — an upper bound on every key that may legally live on the page.
With these two additions, a split becomes a local, left-to-right operation whose effect is visible the instant the new right page is linked in, even before the parent’s downlink is posted. A reader that descends to a page which has since split simply observes that its search key exceeds the page’s high key, concludes “the key I want moved right,” and follows the right-link — no lock on the parent, no lock on the root, no lock-coupling between levels. The tree is searchable with at most one page locked at a time. The “B-link tree” this produces is the foundation of essentially every modern concurrent B-tree.
The companion reference, Lanin & Shasha 1986, “A Symmetric Concurrent B-Tree Algorithm”, supplies the deletion side that Lehman & Yao left underspecified: how to remove an empty page from the tree without breaking in-flight searches. Its two relevant ideas — the “drain technique” (a deleted page lingers as a tombstone until no concurrent operation can still reference it) and the “fast root” (skip past degenerate single-child levels) — are both named in the PostgreSQL README as direct lineage.
Two implementation choices follow from adopting L&Y, and they organize the rest of this document:
- What exactly does “move right to recover” cost, and when is it triggered — for searches, for inserts that must re-find a parent, for backward scans, and for VACUUM scanning in physical order while pages split underneath it.
- How are the multi-page mutations (split, page deletion) made crash-safe — i.e., decomposed into single-page atomic WAL actions such that any intermediate state is still a valid, searchable L&Y tree.
Common DBMS Design
Section titled “Common DBMS Design”Engineers building a concurrent B-tree converge on a recurring playbook that the textbook does not spell out. Naming these conventions first lets the PostgreSQL-specific symbols in the next section read as choices within a shared space rather than inventions.
The B-link tree and “move right to recover”
Section titled “The B-link tree and “move right to recover””Almost every high-concurrency B-tree shipped since 1990 is a variant of the Lehman-Yao B-link tree, not a lock-coupled classic B-tree. The shared pattern: each level is a right-linked list; each non-rightmost page has a high key; a descending operation that lands on a stale page recovers by walking right until the high key tells it the key belongs here. The differences between engines are in the details of recovery (how many right-steps are tolerated, whether read locks are still taken at all) and in deletion (the hardest part), not in the core idea.
Latches vs. locks — two different things on a B-tree page
Section titled “Latches vs. locks — two different things on a B-tree page”A B-tree page is touched by two distinct concurrency mechanisms that implementers keep strictly separate:
- Latches (PostgreSQL calls these buffer content locks / LWLocks) are short-lived, held only while a backend physically reads or writes the bytes of one page, and never held across an I/O wait or a deadlock-detectable wait. They protect the physical consistency of the page image.
- Locks (the SQL-level heavyweight lock manager, plus predicate locks for SERIALIZABLE) protect logical database objects and are held to end-of-transaction. A B-tree search takes no heavyweight lock on the index pages it reads.
The L&Y design is what makes this separation pay off: because recovery-by-
moving-right means a search never needs to hold more than one page latch at
once, the index becomes a structure coordinated almost entirely by
short latches, with transaction-duration locking pushed out to the heap
tuples the index points at. (PostgreSQL’s page latching lives in
postgres-buffer-manager.md; the heavyweight lock table in
postgres-lock-manager.md.)
Splits decomposed into atomic, recoverable steps
Section titled “Splits decomposed into atomic, recoverable steps”Because a split spans multiple pages but a write-ahead log entry is atomic per record, implementers decompose a split into a sequence of single-record actions, each of which leaves a valid tree. The canonical decomposition: (1) atomically create-and-link the new right page and shrink the old page, leaving a flag that says “my right sibling has no parent downlink yet”; then (2) atomically post the downlink to the parent and clear the flag. A crash between (1) and (2) leaves a tree that is fully searchable (right-links carry readers across the gap) and that a later operation repairs lazily. This is the WAL-shaped expression of the L&Y “a split is locally complete the moment the right-link is set” property.
Lazy, multi-tier garbage collection of index entries
Section titled “Lazy, multi-tier garbage collection of index entries”A no-overwrite MVCC engine (PostgreSQL, and broadly Oracle’s undo model) generates dead index tuples whenever a row is updated or deleted, because the index still points at the now-dead heap tuple version. Reclaiming them is a multi-tier affair, ordered by cost:
- Hint-bit marking — a scan that visits the heap and finds a tuple dead-to-everyone marks the index entry with a cheap “known dead” bit (set under a shared latch, like a hint bit), so future scans skip it.
- In-page physical deletion — when a page is about to split, the engine first physically removes the known-dead entries (and opportunistically checks nearby entries), buying space without growing the tree.
- Bulk deletion by the vacuum process — a background sweep removes all index entries pointing at heap tuples being reclaimed.
- Empty-page reclamation — a page emptied of all entries can be unlinked from the tree and eventually recycled.
Each tier is a backstop for the one above it. The hard correctness constraint across all of them is the MVCC engine’s: a TID (heap tuple identifier) must not be recycled (reused for an unrelated new row) while any scan might still follow a stale index pointer to it — the “concurrent TID recycling” hazard.
Suffix truncation and deduplication for fan-out
Section titled “Suffix truncation and deduplication for fan-out”Two space optimizations recur. Suffix truncation shortens the separator key written to a parent during a split to the shortest prefix that still distinguishes left from right (Bayer & Unterauer’s prefix B-trees), which keeps internal pages dense and the tree shallow. Deduplication folds many equal leaf entries into one physical entry carrying a posting list (an array of TIDs), trading a small decode cost for large space savings on low- cardinality columns.
Theory ↔ PostgreSQL mapping
Section titled “Theory ↔ PostgreSQL mapping”| Theory / convention | PostgreSQL name (nbtree) |
|---|---|
| B-link tree right-link | BTPageOpaqueData.btpo_next (and btpo_prev left-link, a PG addition for backward scans) |
| High key (page keyspace upper bound) | P_HIKEY item (offset 1) on non-rightmost pages |
| ”Move right to recover” on search | _bt_moveright (compares search key to P_HIKEY) |
| Latch-coupling descent (lock child, drop parent) | _bt_search loop via _bt_relandgetbuf |
| Split step 1 (link right page, flag left) | _bt_split sets BTP_INCOMPLETE_SPLIT |
| Split step 2 (post downlink, clear flag) | _bt_insert_parent → _bt_insertonpg |
| Lazy parent-downlink repair | _bt_finish_split / _bt_getstackbuf |
| Root split (L&Y left unspecified) | _bt_newlevel + metapage btm_root update |
| Fast root (Lanin & Shasha) | BTMetaPageData.btm_fastroot / btm_fastlevel |
| Hint-bit “known dead” marking | LP_DEAD item flag, set in _bt_killitems / unique check |
| In-page deletion before split | _bt_delete_or_dedup_one_page, _bt_simpledel_pass, _bt_bottomupdel_pass |
| Bulk deletion by vacuum | btbulkdelete → btvacuumscan → btvacuumpage |
| Vacuum/split race detector | BTCycleId (btpo_cycleid) + BTP_SPLIT_END |
| Two-stage empty-page deletion | _bt_pagedel → _bt_mark_page_halfdead → _bt_unlink_halfdead_page (BTP_HALF_DEAD → BTP_DELETED) |
| “Drain technique” deferred recycle | BTPageIsRecyclable + BTDeletedPageData.safexid + _bt_pendingfsm_* |
| Suffix truncation | _bt_truncate (called from _bt_split) |
| Deduplication / posting lists | _bt_dedup_pass, BTreeTupleIsPosting, BT_IS_POSTING |
| AM dispatch table | bthandler returns IndexAmRoutine (see postgres-index-am.md) |
PostgreSQL’s Approach
Section titled “PostgreSQL’s Approach”PostgreSQL’s src/backend/access/nbtree/ is, in its own README’s words, “a
correct implementation of Lehman and Yao’s high-concurrency B-tree management
algorithm” with “a simplified version of the deletion logic described in
Lanin and Shasha.” It is a B+-tree at BTREE_VERSION 4 (the current
on-disk version): all data entries are on leaves, leaves are doubly chained
(L&Y require only a right-link; PostgreSQL adds a left-link to support
backward scans), and — crucially since version 4 — heap TID is a tiebreaker
key column, so every entry is physically unique on its level. That last
point is what lets the implementation rely on the strict-inequality form of
the L&Y subtree invariant.
Page layout and the L&Y invariants
Section titled “Page layout and the L&Y invariants”Every nbtree page reserves a “special area” at its end for a
BTPageOpaqueData, which is where the L&Y additions live:
// BTPageOpaqueData — src/include/access/nbtree.htypedef struct BTPageOpaqueData{ BlockNumber btpo_prev; /* left sibling, or P_NONE if leftmost */ BlockNumber btpo_next; /* right sibling, or P_NONE if rightmost */ uint32 btpo_level; /* tree level --- zero for leaf pages */ uint16 btpo_flags; /* flag bits, see below */ BTCycleId btpo_cycleid; /* vacuum cycle ID of latest split */} BTPageOpaqueData;btpo_next is the L&Y right-link; btpo_prev is PostgreSQL’s left-link
addition. btpo_level counts up from zero at the leaf level (so splitting
the root never renumbers existing pages). The flag bits encode page state:
// btpo_flags bits — src/include/access/nbtree.h#define BTP_LEAF (1 << 0) /* leaf page, i.e. not internal page */#define BTP_ROOT (1 << 1) /* root page (has no parent) */#define BTP_DELETED (1 << 2) /* page has been deleted from tree */#define BTP_META (1 << 3) /* meta-page */#define BTP_HALF_DEAD (1 << 4) /* empty, but still in tree */#define BTP_SPLIT_END (1 << 5) /* rightmost page of split group */#define BTP_HAS_GARBAGE (1 << 6) /* page has LP_DEAD tuples (deprecated) */#define BTP_INCOMPLETE_SPLIT (1 << 7) /* right sibling's downlink is missing */#define BTP_HAS_FULLXID (1 << 8) /* contains BTDeletedPageData */The high key is not a separate field — it is the first item on the
page. On a non-rightmost page item 1 (P_HIKEY) holds the high key and real
data starts at item 2 (P_FIRSTKEY); a rightmost page has no high key
(its keyspace upper bound is implicitly +∞) so data starts at item 1:
// high-key placement macros — src/include/access/nbtree.h#define P_HIKEY ((OffsetNumber) 1)#define P_FIRSTKEY ((OffsetNumber) 2)#define P_FIRSTDATAKEY(opaque) (P_RIGHTMOST(opaque) ? P_HIKEY : P_FIRSTKEY)flowchart LR
subgraph L0["Leaf level (btpo_level = 0)"]
direction LR
P1["Leaf A\nhigh key = k1\n[entries < k1]"]
P2["Leaf B\nhigh key = k2\n[k1 <= entries < k2]"]
P3["Leaf C (rightmost)\nno high key (+inf)\n[entries >= k2]"]
end
P1 -- "btpo_next (right-link)" --> P2
P2 -- "btpo_next" --> P3
P2 -- "btpo_prev (left-link)" --> P1
P3 -- "btpo_prev" --> P2
ROOT["Internal root (btpo_level = 1)\ndownlinks + separators"] --> P1
ROOT --> P2
ROOT --> P3
META["Meta page (block 0)\nbtm_root, btm_fastroot"] -.-> ROOT
Figure 1 — The L&Y B-link tree as PostgreSQL stores it. Each level is a
doubly-linked list (btpo_next right-link is the L&Y requirement; btpo_prev
left-link is PostgreSQL’s addition for backward scans). The high key on a
non-rightmost page (P_HIKEY, item 1) upper-bounds that page’s keyspace; the
rightmost page omits it (implicit +∞). The meta page at block 0 points at both
the true root and the “fast root.” A search that lands on a page split since
it read the parent recovers by following btpo_next, with no lock above the
current page.
The header comment for the high key states the invariant the whole algorithm rests on:
// nbtree.h — high key invariant (comment)// The high key on a page is required to be greater than or equal to any// other key that appears on the page. If we find ourselves trying to// insert a key that is strictly > high key, we know we need to move right// (this should only happen if the page was split since we examined the// parent page).Search descent: latch-coupling down, move-right to recover
Section titled “Search descent: latch-coupling down, move-right to recover”_bt_search descends from the root, latch-coupling down the tree: it holds
a read lock on the current page only long enough to find the downlink to
follow, then atomically trades that lock for one on the child via
_bt_relandgetbuf. At each level it first calls _bt_moveright to handle the
case where the page split after the parent’s downlink was read:
// _bt_search — src/backend/access/nbtree/nbtsearch.c (condensed)*bufP = _bt_getroot(rel, heaprel, access);/* ... */for (;;){ /* Race -- the page we grabbed may have split since we read its * downlink in its parent. If so, move right to its new sibling. */ *bufP = _bt_moveright(rel, heaprel, key, *bufP, (access == BT_WRITE), stack_in, page_access); page = BufferGetPage(*bufP); opaque = BTPageGetOpaque(page); if (P_ISLEAF(opaque)) break; /* found the leaf */
offnum = _bt_binsrch(rel, key, *bufP); /* pick the downlink */ itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum)); child = BTreeTupleGetDownLink(itup); /* ... push (blkno, offnum) on the descent stack ... */ /* drop the read lock on the page, then acquire one on its child */ *bufP = _bt_relandgetbuf(rel, *bufP, child, page_access); stack_in = new_stack;}The recovery step is _bt_moveright. The whole L&Y read protocol is in this
loop — compare the search key to the page’s high key (P_HIKEY); if the key
is not less than the high key, the page must have split, so step right:
// _bt_moveright — src/backend/access/nbtree/nbtsearch.c (condensed)cmpval = key->nextkey ? 0 : 1;for (;;){ page = BufferGetPage(buf); opaque = BTPageGetOpaque(page); if (P_RIGHTMOST(opaque)) break; /* no sibling; high key is +inf */
if (forupdate && P_INCOMPLETE_SPLIT(opaque)) /* finish a split we ran into while locking for write ... */;
if (P_IGNORE(opaque) || _bt_compare(rel, key, page, P_HIKEY) >= cmpval) { /* step right one page */ buf = _bt_relandgetbuf(rel, buf, opaque->btpo_next, access); continue; } else break; /* key belongs on this page */}Two things to read off this. First, the page could have split more than
once, so the loop steps right “as far as needed.” Second, P_IGNORE (a
deleted or half-dead page) also forces a right-step — a search that lands on a
page being deleted recovers exactly as if it had hit a concurrent split. That
unification is the reason page deletion can be made safe without ever blocking
searches.
The actual key comparison is _bt_compare, a 3-way comparator that walks the
scankey attributes, then uses the heap-TID tiebreaker. The two rules that make
the L&Y invariant work — that the first data item on an internal page is
treated as −∞, and that a truncated (missing) attribute is also −∞ — are both
here:
// _bt_compare — src/backend/access/nbtree/nbtsearch.c (condensed)/* Force result ">" if target item is first data item on an internal page */if (!P_ISLEAF(opaque) && offnum == P_FIRSTDATAKEY(opaque)) return 1;/* ... compare ncmpkey scankey attributes against the tuple ... *//* All non-truncated attributes equal: treat truncated attrs as -inf */if (key->keysz > ntupatts) return 1;/* ... then break ties on heap TID via key->scantid ... */flowchart TD
START["Enter page P (latched READ)"] --> RM{"P is rightmost?\n(btpo_next == P_NONE)"}
RM -- yes --> STAY["High key is +inf:\nkey belongs here"]
RM -- no --> IGN{"P deleted/half-dead?\n(P_IGNORE)"}
IGN -- yes --> STEP["Release P, latch P.btpo_next"]
IGN -- no --> CMP{"_bt_compare(key, P_HIKEY)\n>= cmpval ?"}
CMP -- "yes (key past high key)" --> STEP
CMP -- "no (key <= high key)" --> STAY
STEP --> START
STAY --> DONE["Use this page\n(descend or scan)"]
Figure 2 — _bt_moveright’s recovery decision, run on entry to every page
during descent and during a scan step. It compares the search key to the
page’s high key; if the key is at or past the high key (or the page is
deleted), the page split or was deleted concurrently and the search follows
btpo_next one step right, repeating until the key falls within the page’s
keyspace. At most one page is latched at any instant — the L&Y guarantee.
Insertion: descend, check, find location, insert-or-split
Section titled “Insertion: descend, check, find location, insert-or-split”btinsert (the AM entry point) forms an index tuple and calls _bt_doinsert,
which is the insertion driver. It builds an insertion scankey, descends to the
correct leaf with a write lock via _bt_search_insert, optionally runs the
unique-constraint check, then finds the exact in-page offset and inserts:
// _bt_doinsert — src/backend/access/nbtree/nbtinsert.c (condensed)itup_key = _bt_mkscankey(rel, itup);/* ... */search: stack = _bt_search_insert(rel, heapRel, &insertstate); if (checkingunique) { xwait = _bt_check_unique(rel, &insertstate, heapRel, checkUnique, &is_unique, &speculativeToken); if (TransactionIdIsValid(xwait)) { _bt_relbuf(rel, insertstate.buf); /* release, wait, retry */ /* ... XactLockTableWait(xwait, ...) ... */ goto search; } if (itup_key->heapkeyspace) itup_key->scantid = &itup->t_tid; /* restore TID tiebreaker */ } CheckForSerializableConflictIn(rel, NULL, BufferGetBlockNumber(insertstate.buf)); newitemoff = _bt_findinsertloc(rel, &insertstate, checkingunique, indexUnchanged, stack, heapRel); _bt_insertonpg(rel, heapRel, itup_key, insertstate.buf, InvalidBuffer, stack, itup, insertstate.itemsz, newitemoff, insertstate.postingoff, false);Several design points are visible here. The uniqueness check holds a write
lock on “the first page the value could be on” continuously from the check
through the insert — this is how PostgreSQL serializes two concurrent
inserters of the same key without a separate lock. _bt_check_unique uses a
SnapshotDirty to see uncommitted conflicting tuples and returns the xid to
wait on; on conflict with an in-progress speculative insertion it cooperates
with the INSERT … ON CONFLICT machinery. And note the scantid dance: while
checking uniqueness the heap-TID tiebreaker is omitted from the scankey (so
the search finds the leftmost page any duplicate could be on); once uniqueness
is established, the TID is restored so the insert lands at the physically
correct, unique position.
_bt_search_insert also implements the fastpath optimization: a backend
inserting monotonically increasing keys (a serial PK, a timestamp) caches the
rightmost leaf block in RelationGetTargetBlock and, on the next insert,
conditionally locks that block and checks it is still rightmost with room —
skipping the whole root-to-leaf descent in the common case.
The page split: one atomic action, then a deferred parent insert
Section titled “The page split: one atomic action, then a deferred parent insert”When _bt_insertonpg finds the target leaf lacks free space, it calls
_bt_split. This is the heart of the L&Y write protocol. The split builds the
new left page in a temporary buffer (so a failure before the new right page
is acquired leaves the original untouched), chooses a split point via
_bt_findsplitloc, computes the new high key (with suffix truncation on leaf
pages via _bt_truncate), allocates the right page, fixes the right sibling’s
left-link, and writes it all in one critical section / one WAL record —
with the left page flagged BTP_INCOMPLETE_SPLIT:
// _bt_split — src/backend/access/nbtree/nbtinsert.c (condensed)firstrightoff = _bt_findsplitloc(rel, origpage, newitemoff, newitemsz, newitem, &newitemonleft);leftpage = PageGetTempPage(origpage); /* build left half in temp */_bt_pageinit(leftpage, BufferGetPageSize(buf));lopaque = BTPageGetOpaque(leftpage);lopaque->btpo_flags = oopaque->btpo_flags;/* set flag in leftpage indicating that rightpage has no downlink yet */lopaque->btpo_flags |= BTP_INCOMPLETE_SPLIT;lopaque->btpo_prev = oopaque->btpo_prev;/* ... compute lefthighkey (suffix-truncated copy of firstright on leaves) ... lefthighkey = _bt_truncate(rel, lastleft, firstright, itup_key); ... */rbuf = _bt_allocbuf(rel, heaprel); /* the new right page */lopaque->btpo_next = rightpagenumber; /* left -> right link */ropaque->btpo_prev = origpagenumber; /* right -> left link */ropaque->btpo_next = oopaque->btpo_next; /* right -> old next *//* ... copy data items to left/right per split point; insert newitem ... */if (!isrightmost){ sbuf = _bt_getbuf(rel, oopaque->btpo_next, BT_WRITE); /* old right sib */ /* ... verify sopaque->btpo_prev == origpagenumber (corruption check) ... */}START_CRIT_SECTION();PageRestoreTempPage(leftpage, origpage); /* left never moves */MarkBufferDirty(buf); MarkBufferDirty(rbuf);if (!isrightmost) { sopaque->btpo_prev = rightpagenumber; MarkBufferDirty(sbuf); }/* XLOG: one xl_btree_split record over buf, rbuf, sbuf (+cbuf for internal) */END_CRIT_SECTION();Three invariants make this crash-safe and concurrency-safe:
- The left page never moves. The original block number stays the left half; the new block is always the right half. This is what lets every existing downlink and right-link that pointed at the original page remain valid — they now point at the (smaller) left page, and any key that moved right is reachable by following the new right-link.
- Locks couple strictly left-to-right. The old right sibling is locked
after the splitting page, so the lock order is always increasing toward
the right — deadlock-free by construction. (The
_bt_splitcomment: “We are guaranteed that this is deadlock-free, since we couple the locks in the standard order: left to right.”) - The split is logically complete the instant the right-link is set,
even though no parent downlink exists yet. A concurrent search reaching the
left page sees its new (lower) high key, moves right, and finds the moved
keys. The
BTP_INCOMPLETE_SPLITflag is a note to writers, not to readers.
After _bt_split returns the locked right buffer, _bt_insertonpg calls
_bt_insert_parent to post the downlink — a separate atomic action that
recursively inserts into the parent (splitting it in turn if needed), and on a
root split builds a brand-new root via _bt_newlevel:
// _bt_insert_parent — src/backend/access/nbtree/nbtinsert.c (condensed)if (isroot){ rootbuf = _bt_newlevel(rel, heaprel, buf, rbuf); /* new root one level up */ /* ... release buffers ... */}else{ ritem = (IndexTuple) PageGetItem(page, PageGetItemId(page, P_HIKEY)); new_item = CopyIndexTuple(ritem); /* parent separator = left's high key */ BTreeTupleSetDownLink(new_item, rbknum); /* ... points at new right page */ pbuf = _bt_getstackbuf(rel, heaprel, stack, bknum);/* re-find parent via descent stack */ _bt_relbuf(rel, rbuf); /* right child can be released now */ /* ... ereport if pbuf invalid (failed to re-find parent) ... */ _bt_insertonpg(rel, heaprel, NULL, pbuf, buf, stack->bts_parent, new_item, MAXALIGN(IndexTupleSize(new_item)), stack->bts_offset + 1, 0, isonly); /* recurse up */}The separator key posted to the parent is the left page’s high key — which
the comment in _bt_split notes is a (truncated) copy of the firstright tuple,
satisfying L&Y’s requirement that the parent separator be strictly less than
every key in the right subtree. _bt_getstackbuf re-finds the parent using the
descent stack recorded in _bt_search, matching on the child’s block number
(not on the separator key) — a deliberate PostgreSQL simplification over L&Y
that means it never has to couple locks across levels during the ascent, and
never cares if the separator key has changed.
sequenceDiagram
participant I as Inserter
participant L as Left page (orig block)
participant R as New right page
participant S as Old right sibling
participant P as Parent page
Note over I,S: Atomic action 1 — one xl_btree_split WAL record
I->>L: build left half in temp buffer, new (lower) high key
I->>R: _bt_allocbuf; copy moved items; set btpo_prev=L, btpo_next=S
I->>L: btpo_next = R; set BTP_INCOMPLETE_SPLIT
I->>S: btpo_prev = R (fix left-link; locked after L, left->right)
I->>L: PageRestoreTempPage (left never moves); MarkBufferDirty all
Note over I,P: tree is now searchable; readers move right across the gap
Note over I,P: Atomic action 2 — separate xl_btree_insert (UPPER) record
I->>P: _bt_getstackbuf re-finds parent by child blkno
I->>P: insert downlink (separator = L's high key) at stack offset+1
I->>L: clear BTP_INCOMPLETE_SPLIT atomically with the parent insert
Figure 3 — A page split as two independent atomic WAL actions. Action 1
links the new right page in and flags the left page incomplete; the tree is
fully searchable at that point because readers recover via the right-link.
Action 2 posts the downlink to the parent and clears the flag. A crash between
the two leaves a valid (if downlink-missing) tree; the next inserter to pass
through _bt_moveright or _bt_getstackbuf finishes the split lazily via
_bt_finish_split.
Choosing the split point
Section titled “Choosing the split point”_bt_findsplitloc (in nbtsplitloc.c) picks where to split. Its primary
goal is to equalize free space across the two halves after accounting for
the incoming tuple, but it also biases the choice to make suffix truncation
effective and to keep groups of duplicates from spanning pages. For a rightmost
page it leaves the left side fillfactor% full (default 90% for leaves) so
that monotonically increasing inserts produce densely packed pages instead of
the 50% utilization a naive midpoint split would give:
// fillfactor constants — src/include/access/nbtree.h#define BTREE_DEFAULT_FILLFACTOR 90#define BTREE_NONLEAF_FILLFACTOR 70#define BTREE_SINGLEVAL_FILLFACTOR 96It materializes a small list of candidate split points within an acceptable
free-space range and runs one of three strategies — SPLIT_DEFAULT,
SPLIT_MANY_DUPLICATES (find the minimally-distinguishing point), or
SPLIT_SINGLE_VALUE (a page entirely of one value: leave the left side nearly
full to absorb future duplicates) — choosing the one that maximizes suffix
truncation without unduly unbalancing the split.
Index-tuple deletion: four tiers before a split
Section titled “Index-tuple deletion: four tiers before a split”PostgreSQL removes dead index entries through the tiered scheme described
above. The cheapest tier is LP_DEAD hint marking: a plain index scan
that visits the heap and finds a tuple dead-to-all returns to the index and
sets the item’s LP_DEAD line-pointer flag (also done while checking a unique
index for conflicts, where an exclusive lock is already held). The README:
// nbtree/README — Simple deletion// Once an index tuple has been marked LP_DEAD it can actually be deleted// from the index immediately; since index scans only stop "between" pages,// no scan can lose its place from such a deletion.When an insert is about to split a page, _bt_delete_or_dedup_one_page first
tries to reclaim space without growing the tree, in this order: physically
delete any LP_DEAD-marked tuples; if that is not enough and the executor
hinted the incoming tuple is an unchanged duplicate (version churn from
UPDATEs), run a bottom-up deletion pass (_bt_bottomupdel_pass) that asks
the table AM which nearby duplicates are actually dead; and as a last resort
run a deduplication pass (_bt_dedup_pass). Both deletion passes share
the same WAL records (_bt_delitems_delete) and the same table-AM cooperation
for computing a snapshotConflictHorizon (used to generate recovery conflicts
on standbys).
The README frames bottom-up deletion as a backstop grounded in the “generational hypothesis” — most index tuples die young — that targets the pathological version-churn workload where a few leaf pages would otherwise split repeatedly purely to store dead UPDATE versions.
VACUUM: bulk deletion in physical order, and the split race
Section titled “VACUUM: bulk deletion in physical order, and the split race”btbulkdelete is the AM’s bulk-deletion callback. It acquires a vacuum cycle
ID and runs btvacuumscan, which sweeps the index in physical block order
(not key order) for read-ahead efficiency, calling btvacuumpage on each page.
For every leaf page it takes a cleanup lock (a super-exclusive lock requiring
no other pins) — even on pages with nothing to delete — which is the interlock
that prevents concurrent TID recycling from confusing in-flight scans.
The subtle correctness problem is that a page can split while VACUUM is
mid-scan, moving tuples from a not-yet-visited page to an already-visited
lower-numbered page. The BTCycleId mechanism detects this: a split stamps
both halves with the running VACUUM’s cycle ID, so when btvacuumpage
encounters a leaf whose btpo_cycleid equals the current cycle and whose
right-link points backward (to a lower block than the scan’s current
position), it knows the page split since the scan started and backtracks to
follow the right-link:
// btvacuumpage — src/backend/access/nbtree/nbtree.c (condensed)/* Check whether we need to backtrack to earlier pages. What we are * concerned about is a page split that happened since we started the * vacuum scan, moving right-half tuples to a block we already passed. */if (vstate->cycleid != 0 && opaque->btpo_cycleid == vstate->cycleid && !(opaque->btpo_flags & BTP_SPLIT_END) && !P_RIGHTMOST(opaque) && opaque->btpo_next < scanblkno) backtrack_to = opaque->btpo_next;BTP_SPLIT_END is the optimization that bounds the backtracking: when a split
detects its right sibling has a different cycle ID, it flags the right page
BTP_SPLIT_END, telling VACUUM it need not chase further right. The cycle-ID
test tolerates false positives (it may backtrack unnecessarily) but never false
negatives, which is why a small 16-bit counter (MAX_BT_CYCLE_ID) suffices.
Empty-page deletion: the two-stage half-dead protocol
Section titled “Empty-page deletion: the two-stage half-dead protocol”A leaf page emptied of all data items becomes a candidate for deletion —
removal from the tree. PostgreSQL never deletes the rightmost page of a level
and never merges partly-full pages; it only unlinks fully-empty leaves (and the
“skinny” chains of single-child internal pages above them). _bt_pagedel
drives a two-stage protocol:
- Stage 1 — mark half-dead (
_bt_mark_page_halfdead): re-find the parent, remove the leaf’s downlink, point that keyspace at the right sibling, and setBTP_HALF_DEADon the leaf. A half-dead page isP_IGNORE— searches skip it by moving right, exactly as for an incomplete split. The tree is left “similar to a page split: the page has no downlink pointing to it, but it’s still linked to its siblings.” - Stage 2 — unlink (
_bt_unlink_halfdead_page): lock left sibling, target, and right sibling (in that left-to-right order), splice the target out of the sibling chain, and setBTP_DELETED.
stateDiagram-v2
[*] --> Live
Live --> HalfDead: stage 1 _bt_mark_page_halfdead\nremove downlink from parent\nkeyspace moves to right sibling
HalfDead --> Deleted: stage 2 _bt_unlink_halfdead_page\nsplice out of sibling chain\nstamp safexid
Deleted --> Recyclable: BTPageIsRecyclable\nsafexid visible-to-everyone
Recyclable --> Reused: taken from FSM\noverwritten as a new right page
Reused --> [*]
Figure 4 — The empty-page deletion lifecycle. A page moves Live → HalfDead →
Deleted, where each arrow is a separate atomic WAL action so an interrupted
VACUUM always leaves a consistent tree (the next VACUUM finishes a stranded
half-dead page). A Deleted page is not immediately reusable: it lingers as a
tombstone, stamped with a safexid, until BTPageIsRecyclable confirms no
snapshot could still hold a reference — the Lanin & Shasha “drain technique.”
Each stage is its own WAL record, so an interrupted VACUUM leaves a consistent
tree and the next VACUUM finishes the job. The deleted page is not recycled
immediately. It is stamped with a safexid (the next-transaction counter) and
left as a tombstone, because a concurrent search may have already read a
downlink to it. BTPageIsRecyclable gates reuse on that xid becoming
visible-to-everyone:
// BTPageIsRecyclable — src/include/access/nbtree.h (condensed)opaque = BTPageGetOpaque(page);if (P_ISDELETED(opaque)){ FullTransactionId safexid = BTPageGetDeleteXid(page); /* Recycle only once the deletion XID is visible to everyone, so no * in-progress scan could still follow a downlink to this page. */ return GlobalVisCheckRemovableFullXid(heaprel, safexid);}return false;This deferral is the “drain technique.” VACUUM tracks newly deleted pages in a
BTPendingFSM array and, at the end of the scan (_bt_pendingfsm_finalize),
places any that have already become safe into the free-space map; the rest are
left for a future VACUUM (the metapage’s btm_last_cleanup_num_delpages carries
the count forward).
Deduplication and posting lists
Section titled “Deduplication and posting lists”When deletion cannot free enough space, _bt_dedup_pass folds runs of equal
non-pivot tuples into a single posting-list tuple — one key value plus an
array of heap TIDs — recognized by the BT_IS_POSTING status bit repurposing
the tuple’s t_tid:
// nbtree.h — posting-list tuple recognition (condensed)#define BT_IS_POSTING 0x2000static inline boolBTreeTupleIsPosting(IndexTuple itup){ if ((itup->t_info & INDEX_ALT_TID_MASK) == 0) return false; /* presence of BT_IS_POSTING in offset number indicates posting tuple */ return (ItemPointerGetOffsetNumberNoCheck(&itup->t_tid) & BT_IS_POSTING) != 0;}Deduplication is always lazy (only at the point a split looms) and only for
indexes whose opclass guarantees btm_allequalimage (equal datums are
byte-identical, so collapsing them loses nothing). In unique indexes it
serves a different purpose: not space saving (there are no real duplicates long
term) but buying time for garbage collection to run before a version-churn
split. When an incoming tuple’s heap TID falls inside an existing posting
list’s range, _bt_insertonpg performs a posting-list split in passing,
folded into the same atomic action as the insert.
Source Walkthrough
Section titled “Source Walkthrough”Symbols grouped by sub-system. Position-hint table follows.
AM interface (nbtree.c)
Section titled “AM interface (nbtree.c)”bthandler— returns theIndexAmRoutinefilled with nbtree’s capability flags (amcanorder,amcanbackward,amcanunique,amcanparallel,ampredlocks,amcaninclude, …) and callback pointers. The dispatch contract itself is owned bypostgres-index-am.md.btinsert— AM insert callback; forms theIndexTuple, sets its heap TID, delegates to_bt_doinsert.btgettuple/btgetbitmap— AM scan callbacks (the read path; ordered scan vs. bitmap).btbuildempty— initializes a metapage via_bt_initmetapage.
Search and scan (nbtsearch.c)
Section titled “Search and scan (nbtsearch.c)”_bt_search— root-to-leaf descent, latch-coupling down, calling_bt_moverightat each level. Returns the descent stack used later by splits to re-find parents._bt_moveright— the L&Y recovery primitive: step right while the search key is at/past the page’s high key, or the page isP_IGNORE. Finishes incomplete splits whenforupdate._bt_compare— 3-way key comparator; encodes “first internal data item is −∞”, “truncated attribute is −∞”, and the heap-TID tiebreaker._bt_binsrch/_bt_binsrch_insert— within-page binary search (navigation vs. insertion position, the latter withscantid)._bt_first/_bt_next/_bt_readpage/_bt_steppage— the ordered-scan state machine;_bt_firstpositions via_bt_search, then scans leaves along the sibling links, remembering the right-link at scan time so a concurrent split cannot make it re-scan moved items._bt_lock_and_validate_left— backward-scan move-left, handling the harder case of a left sibling that itself split (the README’s 4-step move-left algorithm).
Insertion and split (nbtinsert.c)
Section titled “Insertion and split (nbtinsert.c)”_bt_doinsert— insertion driver: build scankey, descend write-locked, unique-check, find location, insert._bt_search_insert—_bt_searchwrapper with the rightmost-leaf fastpath cache (RelationGetTargetBlock)._bt_check_unique— uniqueness enforcement underSnapshotDirty; returns the conflicting xid to wait on._bt_findinsertloc/_bt_stepright— find the exact in-page offset, stepping right if the value belongs on a sibling._bt_insertonpg— the insert/split fork: if the item fits, add it (logging one WAL record, clearing a child’s incomplete-split flag if this is the parent insert); else call_bt_splitthen_bt_insert_parent._bt_split— the atomic split action: temp left page, new high key (with_bt_truncate), right page allocation, right-sibling left-link fixup,BTP_INCOMPLETE_SPLITflag, single WAL record._bt_insert_parent— post the downlink to the parent (or_bt_newlevelfor a root split); re-finds the parent with_bt_getstackbuf._bt_finish_split/_bt_getstackbuf— lazy repair of an incomplete split, re-finding the parent by matching the child block number._bt_newlevel— build a new root one level up; update the metapage._bt_delete_or_dedup_one_page/_bt_simpledel_pass— pre-split space reclamation (LP_DEAD physical deletion, simple deletion).
Split-point choice (nbtsplitloc.c)
Section titled “Split-point choice (nbtsplitloc.c)”_bt_findsplitloc— choose the split point; equalize free space, bias for suffix truncation, special-case duplicates and rightmost pages._bt_strategy/_bt_bestsplitloc— strategy selection (SPLIT_DEFAULT,SPLIT_MANY_DUPLICATES,SPLIT_SINGLE_VALUE) and best-point search.
Pages, deletion, VACUUM (nbtpage.c, nbtree.c, nbtdedup.c)
Section titled “Pages, deletion, VACUUM (nbtpage.c, nbtree.c, nbtdedup.c)”_bt_getroot/_bt_gettrueroot— fetch the (fast) root, following the metapage and recovering if the cached root pointer is stale._bt_initmetapage/_bt_getmeta/_bt_upgrademetapage— metapage I/O.btbulkdelete→btvacuumscan→btvacuumpage— the VACUUM scan in physical order; cleanup-lock interlock; cycle-ID backtracking._bt_pagedel— page-deletion driver._bt_mark_page_halfdead/_bt_unlink_halfdead_page— the two deletion stages._bt_lock_subtree_parent— re-find the parent of a to-be-deleted subtree.BTPageIsRecyclable/_bt_pendingfsm_init/_bt_pendingfsm_finalize— deferred recycling (the drain technique) and FSM placement._bt_delitems_vacuum/_bt_delitems_delete/_bt_delitems_update— WAL-logged item removal/update on a page._bt_dedup_pass/_bt_bottomupdel_pass— deduplication and bottom-up deletion.
Position hints (as of 2026-06-05, REL_18 273fe94)
Section titled “Position hints (as of 2026-06-05, REL_18 273fe94)”| Symbol | File | Line |
|---|---|---|
bthandler | nbtree.c | 115 |
btinsert | nbtree.c | 202 |
btgettuple | nbtree.c | 226 |
btbulkdelete | nbtree.c | 1068 |
btvacuumscan | nbtree.c | 1186 |
btvacuumpage | nbtree.c | 1361 |
_bt_search | nbtsearch.c | 107 |
_bt_moveright | nbtsearch.c | 246 |
_bt_binsrch | nbtsearch.c | 348 |
_bt_compare | nbtsearch.c | 693 |
_bt_first | nbtsearch.c | 887 |
_bt_steppage | nbtsearch.c | 2169 |
_bt_doinsert | nbtinsert.c | 102 |
_bt_search_insert | nbtinsert.c | 317 |
_bt_check_unique | nbtinsert.c | 408 |
_bt_findinsertloc | nbtinsert.c | 815 |
_bt_stepright | nbtinsert.c | 1027 |
_bt_insertonpg | nbtinsert.c | 1105 |
_bt_split | nbtinsert.c | 1467 |
_bt_insert_parent | nbtinsert.c | 2099 |
_bt_finish_split | nbtinsert.c | 2241 |
_bt_getstackbuf | nbtinsert.c | 2319 |
_bt_newlevel | nbtinsert.c | 2444 |
_bt_delete_or_dedup_one_page | nbtinsert.c | 2683 |
_bt_simpledel_pass | nbtinsert.c | 2812 |
_bt_findsplitloc | nbtsplitloc.c | 129 |
_bt_getroot | nbtpage.c | 344 |
_bt_pagedel | nbtpage.c | 1802 |
_bt_mark_page_halfdead | nbtpage.c | 2088 |
_bt_unlink_halfdead_page | nbtpage.c | 2314 |
_bt_lock_subtree_parent | nbtpage.c | 2813 |
_bt_pendingfsm_finalize | nbtpage.c | 2996 |
_bt_dedup_pass | nbtdedup.c | 58 |
_bt_bottomupdel_pass | nbtdedup.c | 307 |
BTPageOpaqueData | nbtree.h | 63 |
BTPageIsRecyclable | nbtree.h | 292 |
P_HIKEY / P_FIRSTKEY | nbtree.h | 368 |
Source verification (as of 2026-06-05)
Section titled “Source verification (as of 2026-06-05)”Verified facts
Section titled “Verified facts”-
The right-link (
btpo_next), left-link (btpo_prev), level, flags, and vacuum cycle ID live inBTPageOpaqueDatain the page special area. Read fromnbtree.hon 2026-06-05. The left-link is PostgreSQL’s addition over L&Y (which require only a right-link); confirmed by the README’s “Differences to the Lehman & Yao algorithm” section. -
A search holds at most one page latch at a time and recovers from a concurrent split by following
btpo_next. Verified in_bt_search(drops the parent lock via_bt_relandgetbufbefore acquiring the child) and_bt_moveright(steps right while_bt_compare(key, P_HIKEY) >= cmpval). The README confirms PostgreSQL still takes a read latch on the page being examined — L&Y assume unshared in-memory copies, but PostgreSQL shares buffers, so it must latch to keep the bytes stable while reading. -
A page split is two independent atomic WAL actions; the first flags the left page
BTP_INCOMPLETE_SPLIT, the second posts the parent downlink and clears it. Verified in_bt_split(sets the flag, single critical section /xl_btree_splitrecord) and_bt_insertonpg/_bt_insert_parent(the separate parent insert that clears the flag). The README “WAL Considerations” section states a crash between the two leaves a searchable tree whose missing downlink is created on-the-fly by the next inserter via_bt_finish_split. -
Split locks couple strictly left-to-right and the left half keeps the original block number. Verified in
_bt_split: the old right sibling (oopaque->btpo_next) is locked after the splitting page, andPageRestoreTempPage(leftpage, origpage)writes the left half back onto the original block. The README explicitly calls this deadlock-free by lock ordering. -
The parent downlink is re-found by matching the child’s block number, not the separator key. Verified in
_bt_insert_parent(passesbknumto_bt_getstackbuf). The README notes this is a deliberate simplification over L&Y/Lanin-Shasha that avoids coupling locks across levels during the ascent. -
VACUUM scans in physical block order, takes a cleanup lock on every leaf, and backtracks on a concurrent split via the cycle-ID test. Verified in
btvacuumscan(read-stream over physical blocks),btvacuumpage(_bt_upgradelockbufcleanupon every leaf; thebacktrack_toassignment guarded bybtpo_cycleid == vstate->cycleid && opaque->btpo_next < scanblkno).BTP_SPLIT_ENDbounds the backtracking. -
Empty-page deletion is a two-stage half-dead → deleted protocol, and a deleted page is deferred into the FSM until its
safexidis visible-to- everyone. Verified in_bt_pagedel(drives both stages),_bt_mark_page_halfdead(BTP_HALF_DEAD),_bt_unlink_halfdead_page(BTP_DELETED+safexid), andBTPageIsRecyclable(GlobalVisCheckRemovableFullXid). The README names this the Lanin & Shasha “drain technique.” -
Deduplication is lazy, gated on
btm_allequalimage, and produces posting-list tuples flaggedBT_IS_POSTING. Verified in_bt_dedup_passand theBTreeTupleIsPostinginline innbtree.h. The on-disk version that enables it isBTREE_VERSION 4; v3 indexes must REINDEX to use it (per thenbtree.hversion comment). -
Heap TID is a tiebreaker key column as of
BTREE_VERSION 4, making every entry unique on its level. Verified in thenbtree.hversion comment and the_bt_compareheap-TID/scantidpath. This is what licenses the strict- inequality form of the L&Y subtree invariant (Ki < v <= Ki+1).
Open questions
Section titled “Open questions”-
The exact bound on backtracking work in
btvacuumpage.BTP_SPLIT_ENDstops VACUUM chasing right-links past a split group, but the worst-case number of backtrack hops under sustained concurrent splitting during a single VACUUM is not obvious from the code. Investigation path: trace the interaction between_bt_split’sBTP_SPLIT_END-setting (only when the right sibling has a different cycle ID) and thebtvacuumpagebacktrack loop under a synthetic split-heavy workload. -
How often newly deleted pages actually become recyclable within the same VACUUM. The README says PG14+ checks recycle-safety at the end of the scan via
_bt_pendingfsm_finalize, but the fraction that clearsGlobalVisCheckRemovableFullXidimmediately vs. waits for a future VACUUM is workload-dependent and unmeasured here. Investigation path: instrument_bt_pendingfsm_finalizeagainst a long-running-snapshot workload. -
Whether the fastpath rightmost-leaf cache (
RelationGetTargetBlock) interacts pathologically with bottom-up deletion on unique indexes. Both target the rightmost / churning leaf; the code paths are separate but the workloads overlap. Investigation path: review whether a fastpath insert can defeat a bottom-up deletion opportunity on the same page.
Beyond PostgreSQL — Comparative Designs & Research Frontiers
Section titled “Beyond PostgreSQL — Comparative Designs & Research Frontiers”-
CUBRID’s B+-tree. CUBRID also uses a B+-tree with sibling links, but its concurrency model leans more on the page-level latches and the lock manager than on a self-repairing L&Y right-link recovery; a side-by-side of CUBRID’s index-page latch protocol against
_bt_moverightwould sharpen what the L&Y “move right to recover” actually buys in lock-traffic terms. (See the CUBRID index analysis in the cubrid tree.) -
Lehman & Yao 1981 vs. the PostgreSQL implementation. PostgreSQL deviates from the paper in named ways (read latches because buffers are shared; a left-link for backward scans; re-finding parents by child block number rather than separator key; lazy parent-downlink creation). A focused note mapping each deviation to its README justification would be the natural companion to this doc. Paper: Lehman & Yao 1981 (
dbms-papers/btree.md). -
Lanin & Shasha 1986 deletion vs. PostgreSQL’s simplification. PostgreSQL moves keyspace right on deletion (Lanin-Shasha prefer left) because it has left-links anyway, and adopts their “drain technique” and “fast root” ideas. Comparing the symmetric Lanin-Shasha algorithm against
_bt_pagedel’s half-dead protocol is a research-grade exercise in deletion correctness. -
Blink-tree latching vs. lock-free / optimistic B-trees (OLFIT, Bw-tree). Modern in-memory engines (Microsoft’s Bw-tree, OLFIT) push past latched B-link trees toward latch-free or optimistic-version designs. Positioning PostgreSQL’s latch-coupled L&Y tree against these would frame what a disk-oriented, buffer-shared engine gives up for them.
-
Suffix truncation / prefix B-trees. PostgreSQL implements only the “simple prefix B-tree” form of Bayer & Unterauer 1977 (whole-attribute truncation). An opclass that manufactures a shorter separator for variable-length types (e.g.,
text) is an unrealized extension the README itself sketches.
Sources
Section titled “Sources”In-tree READMEs and source files (REL_18_STABLE, commit 273fe94)
Section titled “In-tree READMEs and source files (REL_18_STABLE, commit 273fe94)”src/backend/access/nbtree/README— the design document: L&Y basics, PostgreSQL’s differences, deletion during VACUUM, page deletion, the drain technique, fast root, WAL considerations, scans during recovery, suffix truncation, deduplication, posting-list splits, data representation.src/include/access/nbtree.h—BTPageOpaqueData,BTMetaPageData, flag bits,P_HIKEY/P_FIRSTKEY, pivot/posting tuple format,BTPageIsRecyclable,BTStackData, version/fillfactor constants.src/backend/access/nbtree/nbtree.c—bthandler, AM callbacks, the VACUUM scan (btbulkdelete/btvacuumscan/btvacuumpage).src/backend/access/nbtree/nbtinsert.c—_bt_doinsert,_bt_check_unique,_bt_insertonpg,_bt_split,_bt_insert_parent,_bt_finish_split.src/backend/access/nbtree/nbtsearch.c—_bt_search,_bt_moveright,_bt_compare,_bt_binsrch, the scan state machine.src/backend/access/nbtree/nbtpage.c— metapage,_bt_pagedel, the half-dead/unlink stages, deferred-FSM recycling.src/backend/access/nbtree/nbtsplitloc.c—_bt_findsplitlocand split strategies.src/backend/access/nbtree/nbtdedup.c—_bt_dedup_pass,_bt_bottomupdel_pass.
Papers and textbook chapters
Section titled “Papers and textbook chapters”- Lehman, P. & Yao, S. B. (1981). “Efficient Locking for Concurrent
Operations on B-Trees.” ACM TODS 6(4):650-670. The operative algorithm;
text captured in
knowledge/research/dbms-papers/btree.md. - Lanin, V. & Shasha, D. (1986). “A Symmetric Concurrent B-Tree Algorithm.” Proc. 1986 Fall Joint Computer Conf., 380-389. Deletion logic lineage (drain technique, fast root).
- Comer, D. (1979). “The Ubiquitous B-Tree.” ACM Computing Surveys
11(2). B-tree background (
knowledge/research/dbms-papers/btree.md). - Bayer, R. & Unterauer, K. (1977). “Prefix B-Trees.” ACM TODS 2(1):11-26. Suffix-truncation basis (cited in the nbtree README).
- Database System Concepts (Silberschatz, Korth, Sudarshan, 7e), ch. 14
“Indexing” — the single-threaded B+-tree foundation
(
knowledge/research/dbms-general/). - Database Internals (Petrov 2019), ch. 2-6 — B-tree concurrency, page layout, and WAL framing.
Sibling docs (cross-references — mechanism owned there, not duplicated here)
Section titled “Sibling docs (cross-references — mechanism owned there, not duplicated here)”postgres-index-am.md— theIndexAmRoutinedispatch contract thatbthandlerfills in.postgres-buffer-manager.md— buffer pins and content-lock (LWLock) latching that nbtree’s page locks are built on.postgres-lock-manager.md— the SQL-level heavyweight lock table and (viapostgres-ssi-predicate-locking.md) predicate locks thatCheckForSerializableConflictIn/PredicateLockPageSplitfeed.postgres-architecture-overview.md— Axis 4 (pluggable access methods), where nbtree sits as the default index AM.