Skip to content

PostgreSQL B-Tree (nbtree) — Lehman-Yao Concurrency, Splits, and Index-Tuple Deletion

Contents:

The B-tree is the default disk index of nearly every relational engine, and the reason is structural: it is a balanced, high-fan-out search tree whose every leaf sits at the same depth, so a lookup, insertion, or range scan costs O(log_F N) page reads where F (the fan-out) is large enough that a billion-row table is three or four levels deep. Comer’s classic survey The Ubiquitous B-Tree (Comer 1979; captured in dbms-papers/btree.md) and Database System Concepts (Silberschatz 7e, ch. 14 “Indexing”) frame the data structure: internal pages hold separator keys alternating with pointers (downlinks) to children, leaf pages hold the actual keyed entries, and the tree stays balanced by splitting a full page into two and propagating a new separator up to the parent — recursively, up to a root split that increases the tree’s height by one. PostgreSQL, like most engines, uses the B+-tree variant: all data entries live at the leaf level, internal pages hold only navigation keys, and the leaves are chained into a linked list so a range scan never re-ascends the tree.

The textbook B-tree is a single-threaded data structure. The hard part — the part that the textbook chapter mostly waves at — is concurrency. Consider two design knobs that every concurrent B-tree implementation must turn:

  1. How does a reader avoid being corrupted by a concurrent writer? The naive answer is lock-coupling (a.k.a. latch-crabbing): a descending search holds a read lock on the parent until it has locked the child, so a writer splitting the child cannot do so while a reader is in flight. This is correct but throttles concurrency — every reader serializes against writers along its whole path, and the root becomes a hot spot because every operation passes through it.

  2. How does a split publish its result atomically? A split mutates three pages (the splitting page, its new right sibling, and the parent that gains a downlink). If a reader can observe the new right page before the parent’s downlink exists, a naive design either has to hold locks across all three pages or risk the reader missing keys that moved right.

The landmark answer is Lehman & Yao 1981, “Efficient Locking for Concurrent Operations on B-Trees” (TODS; the operative reference cited at the very top of src/backend/access/nbtree/README, and the paper text is captured in dbms-papers/btree.md). Their insight is to make the tree self-repairing under concurrency by adding two fields to every page:

  • a right-link — a pointer to the page’s immediate right sibling on the same level (turning each level into a singly-linked list left-to-right), and
  • a high key — an upper bound on every key that may legally live on the page.

With these two additions, a split becomes a local, left-to-right operation whose effect is visible the instant the new right page is linked in, even before the parent’s downlink is posted. A reader that descends to a page which has since split simply observes that its search key exceeds the page’s high key, concludes “the key I want moved right,” and follows the right-link — no lock on the parent, no lock on the root, no lock-coupling between levels. The tree is searchable with at most one page locked at a time. The “B-link tree” this produces is the foundation of essentially every modern concurrent B-tree.

The companion reference, Lanin & Shasha 1986, “A Symmetric Concurrent B-Tree Algorithm”, supplies the deletion side that Lehman & Yao left underspecified: how to remove an empty page from the tree without breaking in-flight searches. Its two relevant ideas — the “drain technique” (a deleted page lingers as a tombstone until no concurrent operation can still reference it) and the “fast root” (skip past degenerate single-child levels) — are both named in the PostgreSQL README as direct lineage.

Two implementation choices follow from adopting L&Y, and they organize the rest of this document:

  1. What exactly does “move right to recover” cost, and when is it triggered — for searches, for inserts that must re-find a parent, for backward scans, and for VACUUM scanning in physical order while pages split underneath it.
  2. How are the multi-page mutations (split, page deletion) made crash-safe — i.e., decomposed into single-page atomic WAL actions such that any intermediate state is still a valid, searchable L&Y tree.

Engineers building a concurrent B-tree converge on a recurring playbook that the textbook does not spell out. Naming these conventions first lets the PostgreSQL-specific symbols in the next section read as choices within a shared space rather than inventions.

Section titled “The B-link tree and “move right to recover””

Almost every high-concurrency B-tree shipped since 1990 is a variant of the Lehman-Yao B-link tree, not a lock-coupled classic B-tree. The shared pattern: each level is a right-linked list; each non-rightmost page has a high key; a descending operation that lands on a stale page recovers by walking right until the high key tells it the key belongs here. The differences between engines are in the details of recovery (how many right-steps are tolerated, whether read locks are still taken at all) and in deletion (the hardest part), not in the core idea.

Latches vs. locks — two different things on a B-tree page

Section titled “Latches vs. locks — two different things on a B-tree page”

A B-tree page is touched by two distinct concurrency mechanisms that implementers keep strictly separate:

  • Latches (PostgreSQL calls these buffer content locks / LWLocks) are short-lived, held only while a backend physically reads or writes the bytes of one page, and never held across an I/O wait or a deadlock-detectable wait. They protect the physical consistency of the page image.
  • Locks (the SQL-level heavyweight lock manager, plus predicate locks for SERIALIZABLE) protect logical database objects and are held to end-of-transaction. A B-tree search takes no heavyweight lock on the index pages it reads.

The L&Y design is what makes this separation pay off: because recovery-by- moving-right means a search never needs to hold more than one page latch at once, the index becomes a structure coordinated almost entirely by short latches, with transaction-duration locking pushed out to the heap tuples the index points at. (PostgreSQL’s page latching lives in postgres-buffer-manager.md; the heavyweight lock table in postgres-lock-manager.md.)

Splits decomposed into atomic, recoverable steps

Section titled “Splits decomposed into atomic, recoverable steps”

Because a split spans multiple pages but a write-ahead log entry is atomic per record, implementers decompose a split into a sequence of single-record actions, each of which leaves a valid tree. The canonical decomposition: (1) atomically create-and-link the new right page and shrink the old page, leaving a flag that says “my right sibling has no parent downlink yet”; then (2) atomically post the downlink to the parent and clear the flag. A crash between (1) and (2) leaves a tree that is fully searchable (right-links carry readers across the gap) and that a later operation repairs lazily. This is the WAL-shaped expression of the L&Y “a split is locally complete the moment the right-link is set” property.

Lazy, multi-tier garbage collection of index entries

Section titled “Lazy, multi-tier garbage collection of index entries”

A no-overwrite MVCC engine (PostgreSQL, and broadly Oracle’s undo model) generates dead index tuples whenever a row is updated or deleted, because the index still points at the now-dead heap tuple version. Reclaiming them is a multi-tier affair, ordered by cost:

  1. Hint-bit marking — a scan that visits the heap and finds a tuple dead-to-everyone marks the index entry with a cheap “known dead” bit (set under a shared latch, like a hint bit), so future scans skip it.
  2. In-page physical deletion — when a page is about to split, the engine first physically removes the known-dead entries (and opportunistically checks nearby entries), buying space without growing the tree.
  3. Bulk deletion by the vacuum process — a background sweep removes all index entries pointing at heap tuples being reclaimed.
  4. Empty-page reclamation — a page emptied of all entries can be unlinked from the tree and eventually recycled.

Each tier is a backstop for the one above it. The hard correctness constraint across all of them is the MVCC engine’s: a TID (heap tuple identifier) must not be recycled (reused for an unrelated new row) while any scan might still follow a stale index pointer to it — the “concurrent TID recycling” hazard.

Suffix truncation and deduplication for fan-out

Section titled “Suffix truncation and deduplication for fan-out”

Two space optimizations recur. Suffix truncation shortens the separator key written to a parent during a split to the shortest prefix that still distinguishes left from right (Bayer & Unterauer’s prefix B-trees), which keeps internal pages dense and the tree shallow. Deduplication folds many equal leaf entries into one physical entry carrying a posting list (an array of TIDs), trading a small decode cost for large space savings on low- cardinality columns.

Theory / conventionPostgreSQL name (nbtree)
B-link tree right-linkBTPageOpaqueData.btpo_next (and btpo_prev left-link, a PG addition for backward scans)
High key (page keyspace upper bound)P_HIKEY item (offset 1) on non-rightmost pages
”Move right to recover” on search_bt_moveright (compares search key to P_HIKEY)
Latch-coupling descent (lock child, drop parent)_bt_search loop via _bt_relandgetbuf
Split step 1 (link right page, flag left)_bt_split sets BTP_INCOMPLETE_SPLIT
Split step 2 (post downlink, clear flag)_bt_insert_parent_bt_insertonpg
Lazy parent-downlink repair_bt_finish_split / _bt_getstackbuf
Root split (L&Y left unspecified)_bt_newlevel + metapage btm_root update
Fast root (Lanin & Shasha)BTMetaPageData.btm_fastroot / btm_fastlevel
Hint-bit “known dead” markingLP_DEAD item flag, set in _bt_killitems / unique check
In-page deletion before split_bt_delete_or_dedup_one_page, _bt_simpledel_pass, _bt_bottomupdel_pass
Bulk deletion by vacuumbtbulkdeletebtvacuumscanbtvacuumpage
Vacuum/split race detectorBTCycleId (btpo_cycleid) + BTP_SPLIT_END
Two-stage empty-page deletion_bt_pagedel_bt_mark_page_halfdead_bt_unlink_halfdead_page (BTP_HALF_DEADBTP_DELETED)
“Drain technique” deferred recycleBTPageIsRecyclable + BTDeletedPageData.safexid + _bt_pendingfsm_*
Suffix truncation_bt_truncate (called from _bt_split)
Deduplication / posting lists_bt_dedup_pass, BTreeTupleIsPosting, BT_IS_POSTING
AM dispatch tablebthandler returns IndexAmRoutine (see postgres-index-am.md)

PostgreSQL’s src/backend/access/nbtree/ is, in its own README’s words, “a correct implementation of Lehman and Yao’s high-concurrency B-tree management algorithm” with “a simplified version of the deletion logic described in Lanin and Shasha.” It is a B+-tree at BTREE_VERSION 4 (the current on-disk version): all data entries are on leaves, leaves are doubly chained (L&Y require only a right-link; PostgreSQL adds a left-link to support backward scans), and — crucially since version 4 — heap TID is a tiebreaker key column, so every entry is physically unique on its level. That last point is what lets the implementation rely on the strict-inequality form of the L&Y subtree invariant.

Every nbtree page reserves a “special area” at its end for a BTPageOpaqueData, which is where the L&Y additions live:

// BTPageOpaqueData — src/include/access/nbtree.h
typedef struct BTPageOpaqueData
{
BlockNumber btpo_prev; /* left sibling, or P_NONE if leftmost */
BlockNumber btpo_next; /* right sibling, or P_NONE if rightmost */
uint32 btpo_level; /* tree level --- zero for leaf pages */
uint16 btpo_flags; /* flag bits, see below */
BTCycleId btpo_cycleid; /* vacuum cycle ID of latest split */
} BTPageOpaqueData;

btpo_next is the L&Y right-link; btpo_prev is PostgreSQL’s left-link addition. btpo_level counts up from zero at the leaf level (so splitting the root never renumbers existing pages). The flag bits encode page state:

// btpo_flags bits — src/include/access/nbtree.h
#define BTP_LEAF (1 << 0) /* leaf page, i.e. not internal page */
#define BTP_ROOT (1 << 1) /* root page (has no parent) */
#define BTP_DELETED (1 << 2) /* page has been deleted from tree */
#define BTP_META (1 << 3) /* meta-page */
#define BTP_HALF_DEAD (1 << 4) /* empty, but still in tree */
#define BTP_SPLIT_END (1 << 5) /* rightmost page of split group */
#define BTP_HAS_GARBAGE (1 << 6) /* page has LP_DEAD tuples (deprecated) */
#define BTP_INCOMPLETE_SPLIT (1 << 7) /* right sibling's downlink is missing */
#define BTP_HAS_FULLXID (1 << 8) /* contains BTDeletedPageData */

The high key is not a separate field — it is the first item on the page. On a non-rightmost page item 1 (P_HIKEY) holds the high key and real data starts at item 2 (P_FIRSTKEY); a rightmost page has no high key (its keyspace upper bound is implicitly +∞) so data starts at item 1:

// high-key placement macros — src/include/access/nbtree.h
#define P_HIKEY ((OffsetNumber) 1)
#define P_FIRSTKEY ((OffsetNumber) 2)
#define P_FIRSTDATAKEY(opaque) (P_RIGHTMOST(opaque) ? P_HIKEY : P_FIRSTKEY)
flowchart LR
  subgraph L0["Leaf level (btpo_level = 0)"]
    direction LR
    P1["Leaf A\nhigh key = k1\n[entries < k1]"]
    P2["Leaf B\nhigh key = k2\n[k1 <= entries < k2]"]
    P3["Leaf C (rightmost)\nno high key (+inf)\n[entries >= k2]"]
  end
  P1 -- "btpo_next (right-link)" --> P2
  P2 -- "btpo_next" --> P3
  P2 -- "btpo_prev (left-link)" --> P1
  P3 -- "btpo_prev" --> P2
  ROOT["Internal root (btpo_level = 1)\ndownlinks + separators"] --> P1
  ROOT --> P2
  ROOT --> P3
  META["Meta page (block 0)\nbtm_root, btm_fastroot"] -.-> ROOT

Figure 1 — The L&Y B-link tree as PostgreSQL stores it. Each level is a doubly-linked list (btpo_next right-link is the L&Y requirement; btpo_prev left-link is PostgreSQL’s addition for backward scans). The high key on a non-rightmost page (P_HIKEY, item 1) upper-bounds that page’s keyspace; the rightmost page omits it (implicit +∞). The meta page at block 0 points at both the true root and the “fast root.” A search that lands on a page split since it read the parent recovers by following btpo_next, with no lock above the current page.

The header comment for the high key states the invariant the whole algorithm rests on:

// nbtree.h — high key invariant (comment)
// The high key on a page is required to be greater than or equal to any
// other key that appears on the page. If we find ourselves trying to
// insert a key that is strictly > high key, we know we need to move right
// (this should only happen if the page was split since we examined the
// parent page).

Search descent: latch-coupling down, move-right to recover

Section titled “Search descent: latch-coupling down, move-right to recover”

_bt_search descends from the root, latch-coupling down the tree: it holds a read lock on the current page only long enough to find the downlink to follow, then atomically trades that lock for one on the child via _bt_relandgetbuf. At each level it first calls _bt_moveright to handle the case where the page split after the parent’s downlink was read:

// _bt_search — src/backend/access/nbtree/nbtsearch.c (condensed)
*bufP = _bt_getroot(rel, heaprel, access);
/* ... */
for (;;)
{
/* Race -- the page we grabbed may have split since we read its
* downlink in its parent. If so, move right to its new sibling. */
*bufP = _bt_moveright(rel, heaprel, key, *bufP, (access == BT_WRITE),
stack_in, page_access);
page = BufferGetPage(*bufP);
opaque = BTPageGetOpaque(page);
if (P_ISLEAF(opaque))
break; /* found the leaf */
offnum = _bt_binsrch(rel, key, *bufP); /* pick the downlink */
itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
child = BTreeTupleGetDownLink(itup);
/* ... push (blkno, offnum) on the descent stack ... */
/* drop the read lock on the page, then acquire one on its child */
*bufP = _bt_relandgetbuf(rel, *bufP, child, page_access);
stack_in = new_stack;
}

The recovery step is _bt_moveright. The whole L&Y read protocol is in this loop — compare the search key to the page’s high key (P_HIKEY); if the key is not less than the high key, the page must have split, so step right:

// _bt_moveright — src/backend/access/nbtree/nbtsearch.c (condensed)
cmpval = key->nextkey ? 0 : 1;
for (;;)
{
page = BufferGetPage(buf);
opaque = BTPageGetOpaque(page);
if (P_RIGHTMOST(opaque))
break; /* no sibling; high key is +inf */
if (forupdate && P_INCOMPLETE_SPLIT(opaque))
/* finish a split we ran into while locking for write ... */;
if (P_IGNORE(opaque) || _bt_compare(rel, key, page, P_HIKEY) >= cmpval)
{
/* step right one page */
buf = _bt_relandgetbuf(rel, buf, opaque->btpo_next, access);
continue;
}
else
break; /* key belongs on this page */
}

Two things to read off this. First, the page could have split more than once, so the loop steps right “as far as needed.” Second, P_IGNORE (a deleted or half-dead page) also forces a right-step — a search that lands on a page being deleted recovers exactly as if it had hit a concurrent split. That unification is the reason page deletion can be made safe without ever blocking searches.

The actual key comparison is _bt_compare, a 3-way comparator that walks the scankey attributes, then uses the heap-TID tiebreaker. The two rules that make the L&Y invariant work — that the first data item on an internal page is treated as −∞, and that a truncated (missing) attribute is also −∞ — are both here:

// _bt_compare — src/backend/access/nbtree/nbtsearch.c (condensed)
/* Force result ">" if target item is first data item on an internal page */
if (!P_ISLEAF(opaque) && offnum == P_FIRSTDATAKEY(opaque))
return 1;
/* ... compare ncmpkey scankey attributes against the tuple ... */
/* All non-truncated attributes equal: treat truncated attrs as -inf */
if (key->keysz > ntupatts)
return 1;
/* ... then break ties on heap TID via key->scantid ... */
flowchart TD
  START["Enter page P (latched READ)"] --> RM{"P is rightmost?\n(btpo_next == P_NONE)"}
  RM -- yes --> STAY["High key is +inf:\nkey belongs here"]
  RM -- no --> IGN{"P deleted/half-dead?\n(P_IGNORE)"}
  IGN -- yes --> STEP["Release P, latch P.btpo_next"]
  IGN -- no --> CMP{"_bt_compare(key, P_HIKEY)\n>= cmpval ?"}
  CMP -- "yes (key past high key)" --> STEP
  CMP -- "no (key <= high key)" --> STAY
  STEP --> START
  STAY --> DONE["Use this page\n(descend or scan)"]

Figure 2 — _bt_moveright’s recovery decision, run on entry to every page during descent and during a scan step. It compares the search key to the page’s high key; if the key is at or past the high key (or the page is deleted), the page split or was deleted concurrently and the search follows btpo_next one step right, repeating until the key falls within the page’s keyspace. At most one page is latched at any instant — the L&Y guarantee.

Insertion: descend, check, find location, insert-or-split

Section titled “Insertion: descend, check, find location, insert-or-split”

btinsert (the AM entry point) forms an index tuple and calls _bt_doinsert, which is the insertion driver. It builds an insertion scankey, descends to the correct leaf with a write lock via _bt_search_insert, optionally runs the unique-constraint check, then finds the exact in-page offset and inserts:

// _bt_doinsert — src/backend/access/nbtree/nbtinsert.c (condensed)
itup_key = _bt_mkscankey(rel, itup);
/* ... */
search:
stack = _bt_search_insert(rel, heapRel, &insertstate);
if (checkingunique)
{
xwait = _bt_check_unique(rel, &insertstate, heapRel, checkUnique,
&is_unique, &speculativeToken);
if (TransactionIdIsValid(xwait))
{
_bt_relbuf(rel, insertstate.buf); /* release, wait, retry */
/* ... XactLockTableWait(xwait, ...) ... */
goto search;
}
if (itup_key->heapkeyspace)
itup_key->scantid = &itup->t_tid; /* restore TID tiebreaker */
}
CheckForSerializableConflictIn(rel, NULL, BufferGetBlockNumber(insertstate.buf));
newitemoff = _bt_findinsertloc(rel, &insertstate, checkingunique,
indexUnchanged, stack, heapRel);
_bt_insertonpg(rel, heapRel, itup_key, insertstate.buf, InvalidBuffer,
stack, itup, insertstate.itemsz, newitemoff,
insertstate.postingoff, false);

Several design points are visible here. The uniqueness check holds a write lock on “the first page the value could be on” continuously from the check through the insert — this is how PostgreSQL serializes two concurrent inserters of the same key without a separate lock. _bt_check_unique uses a SnapshotDirty to see uncommitted conflicting tuples and returns the xid to wait on; on conflict with an in-progress speculative insertion it cooperates with the INSERT … ON CONFLICT machinery. And note the scantid dance: while checking uniqueness the heap-TID tiebreaker is omitted from the scankey (so the search finds the leftmost page any duplicate could be on); once uniqueness is established, the TID is restored so the insert lands at the physically correct, unique position.

_bt_search_insert also implements the fastpath optimization: a backend inserting monotonically increasing keys (a serial PK, a timestamp) caches the rightmost leaf block in RelationGetTargetBlock and, on the next insert, conditionally locks that block and checks it is still rightmost with room — skipping the whole root-to-leaf descent in the common case.

The page split: one atomic action, then a deferred parent insert

Section titled “The page split: one atomic action, then a deferred parent insert”

When _bt_insertonpg finds the target leaf lacks free space, it calls _bt_split. This is the heart of the L&Y write protocol. The split builds the new left page in a temporary buffer (so a failure before the new right page is acquired leaves the original untouched), chooses a split point via _bt_findsplitloc, computes the new high key (with suffix truncation on leaf pages via _bt_truncate), allocates the right page, fixes the right sibling’s left-link, and writes it all in one critical section / one WAL record — with the left page flagged BTP_INCOMPLETE_SPLIT:

// _bt_split — src/backend/access/nbtree/nbtinsert.c (condensed)
firstrightoff = _bt_findsplitloc(rel, origpage, newitemoff, newitemsz,
newitem, &newitemonleft);
leftpage = PageGetTempPage(origpage); /* build left half in temp */
_bt_pageinit(leftpage, BufferGetPageSize(buf));
lopaque = BTPageGetOpaque(leftpage);
lopaque->btpo_flags = oopaque->btpo_flags;
/* set flag in leftpage indicating that rightpage has no downlink yet */
lopaque->btpo_flags |= BTP_INCOMPLETE_SPLIT;
lopaque->btpo_prev = oopaque->btpo_prev;
/* ... compute lefthighkey (suffix-truncated copy of firstright on leaves) ...
lefthighkey = _bt_truncate(rel, lastleft, firstright, itup_key); ... */
rbuf = _bt_allocbuf(rel, heaprel); /* the new right page */
lopaque->btpo_next = rightpagenumber; /* left -> right link */
ropaque->btpo_prev = origpagenumber; /* right -> left link */
ropaque->btpo_next = oopaque->btpo_next; /* right -> old next */
/* ... copy data items to left/right per split point; insert newitem ... */
if (!isrightmost)
{
sbuf = _bt_getbuf(rel, oopaque->btpo_next, BT_WRITE); /* old right sib */
/* ... verify sopaque->btpo_prev == origpagenumber (corruption check) ... */
}
START_CRIT_SECTION();
PageRestoreTempPage(leftpage, origpage); /* left never moves */
MarkBufferDirty(buf); MarkBufferDirty(rbuf);
if (!isrightmost) { sopaque->btpo_prev = rightpagenumber; MarkBufferDirty(sbuf); }
/* XLOG: one xl_btree_split record over buf, rbuf, sbuf (+cbuf for internal) */
END_CRIT_SECTION();

Three invariants make this crash-safe and concurrency-safe:

  1. The left page never moves. The original block number stays the left half; the new block is always the right half. This is what lets every existing downlink and right-link that pointed at the original page remain valid — they now point at the (smaller) left page, and any key that moved right is reachable by following the new right-link.
  2. Locks couple strictly left-to-right. The old right sibling is locked after the splitting page, so the lock order is always increasing toward the right — deadlock-free by construction. (The _bt_split comment: “We are guaranteed that this is deadlock-free, since we couple the locks in the standard order: left to right.”)
  3. The split is logically complete the instant the right-link is set, even though no parent downlink exists yet. A concurrent search reaching the left page sees its new (lower) high key, moves right, and finds the moved keys. The BTP_INCOMPLETE_SPLIT flag is a note to writers, not to readers.

After _bt_split returns the locked right buffer, _bt_insertonpg calls _bt_insert_parent to post the downlink — a separate atomic action that recursively inserts into the parent (splitting it in turn if needed), and on a root split builds a brand-new root via _bt_newlevel:

// _bt_insert_parent — src/backend/access/nbtree/nbtinsert.c (condensed)
if (isroot)
{
rootbuf = _bt_newlevel(rel, heaprel, buf, rbuf); /* new root one level up */
/* ... release buffers ... */
}
else
{
ritem = (IndexTuple) PageGetItem(page, PageGetItemId(page, P_HIKEY));
new_item = CopyIndexTuple(ritem); /* parent separator = left's high key */
BTreeTupleSetDownLink(new_item, rbknum); /* ... points at new right page */
pbuf = _bt_getstackbuf(rel, heaprel, stack, bknum);/* re-find parent via descent stack */
_bt_relbuf(rel, rbuf); /* right child can be released now */
/* ... ereport if pbuf invalid (failed to re-find parent) ... */
_bt_insertonpg(rel, heaprel, NULL, pbuf, buf, stack->bts_parent,
new_item, MAXALIGN(IndexTupleSize(new_item)),
stack->bts_offset + 1, 0, isonly); /* recurse up */
}

The separator key posted to the parent is the left page’s high key — which the comment in _bt_split notes is a (truncated) copy of the firstright tuple, satisfying L&Y’s requirement that the parent separator be strictly less than every key in the right subtree. _bt_getstackbuf re-finds the parent using the descent stack recorded in _bt_search, matching on the child’s block number (not on the separator key) — a deliberate PostgreSQL simplification over L&Y that means it never has to couple locks across levels during the ascent, and never cares if the separator key has changed.

sequenceDiagram
    participant I as Inserter
    participant L as Left page (orig block)
    participant R as New right page
    participant S as Old right sibling
    participant P as Parent page
    Note over I,S: Atomic action 1 — one xl_btree_split WAL record
    I->>L: build left half in temp buffer, new (lower) high key
    I->>R: _bt_allocbuf; copy moved items; set btpo_prev=L, btpo_next=S
    I->>L: btpo_next = R; set BTP_INCOMPLETE_SPLIT
    I->>S: btpo_prev = R (fix left-link; locked after L, left->right)
    I->>L: PageRestoreTempPage (left never moves); MarkBufferDirty all
    Note over I,P: tree is now searchable; readers move right across the gap
    Note over I,P: Atomic action 2 — separate xl_btree_insert (UPPER) record
    I->>P: _bt_getstackbuf re-finds parent by child blkno
    I->>P: insert downlink (separator = L's high key) at stack offset+1
    I->>L: clear BTP_INCOMPLETE_SPLIT atomically with the parent insert

Figure 3 — A page split as two independent atomic WAL actions. Action 1 links the new right page in and flags the left page incomplete; the tree is fully searchable at that point because readers recover via the right-link. Action 2 posts the downlink to the parent and clears the flag. A crash between the two leaves a valid (if downlink-missing) tree; the next inserter to pass through _bt_moveright or _bt_getstackbuf finishes the split lazily via _bt_finish_split.

_bt_findsplitloc (in nbtsplitloc.c) picks where to split. Its primary goal is to equalize free space across the two halves after accounting for the incoming tuple, but it also biases the choice to make suffix truncation effective and to keep groups of duplicates from spanning pages. For a rightmost page it leaves the left side fillfactor% full (default 90% for leaves) so that monotonically increasing inserts produce densely packed pages instead of the 50% utilization a naive midpoint split would give:

// fillfactor constants — src/include/access/nbtree.h
#define BTREE_DEFAULT_FILLFACTOR 90
#define BTREE_NONLEAF_FILLFACTOR 70
#define BTREE_SINGLEVAL_FILLFACTOR 96

It materializes a small list of candidate split points within an acceptable free-space range and runs one of three strategies — SPLIT_DEFAULT, SPLIT_MANY_DUPLICATES (find the minimally-distinguishing point), or SPLIT_SINGLE_VALUE (a page entirely of one value: leave the left side nearly full to absorb future duplicates) — choosing the one that maximizes suffix truncation without unduly unbalancing the split.

Index-tuple deletion: four tiers before a split

Section titled “Index-tuple deletion: four tiers before a split”

PostgreSQL removes dead index entries through the tiered scheme described above. The cheapest tier is LP_DEAD hint marking: a plain index scan that visits the heap and finds a tuple dead-to-all returns to the index and sets the item’s LP_DEAD line-pointer flag (also done while checking a unique index for conflicts, where an exclusive lock is already held). The README:

// nbtree/README — Simple deletion
// Once an index tuple has been marked LP_DEAD it can actually be deleted
// from the index immediately; since index scans only stop "between" pages,
// no scan can lose its place from such a deletion.

When an insert is about to split a page, _bt_delete_or_dedup_one_page first tries to reclaim space without growing the tree, in this order: physically delete any LP_DEAD-marked tuples; if that is not enough and the executor hinted the incoming tuple is an unchanged duplicate (version churn from UPDATEs), run a bottom-up deletion pass (_bt_bottomupdel_pass) that asks the table AM which nearby duplicates are actually dead; and as a last resort run a deduplication pass (_bt_dedup_pass). Both deletion passes share the same WAL records (_bt_delitems_delete) and the same table-AM cooperation for computing a snapshotConflictHorizon (used to generate recovery conflicts on standbys).

The README frames bottom-up deletion as a backstop grounded in the “generational hypothesis” — most index tuples die young — that targets the pathological version-churn workload where a few leaf pages would otherwise split repeatedly purely to store dead UPDATE versions.

VACUUM: bulk deletion in physical order, and the split race

Section titled “VACUUM: bulk deletion in physical order, and the split race”

btbulkdelete is the AM’s bulk-deletion callback. It acquires a vacuum cycle ID and runs btvacuumscan, which sweeps the index in physical block order (not key order) for read-ahead efficiency, calling btvacuumpage on each page. For every leaf page it takes a cleanup lock (a super-exclusive lock requiring no other pins) — even on pages with nothing to delete — which is the interlock that prevents concurrent TID recycling from confusing in-flight scans.

The subtle correctness problem is that a page can split while VACUUM is mid-scan, moving tuples from a not-yet-visited page to an already-visited lower-numbered page. The BTCycleId mechanism detects this: a split stamps both halves with the running VACUUM’s cycle ID, so when btvacuumpage encounters a leaf whose btpo_cycleid equals the current cycle and whose right-link points backward (to a lower block than the scan’s current position), it knows the page split since the scan started and backtracks to follow the right-link:

// btvacuumpage — src/backend/access/nbtree/nbtree.c (condensed)
/* Check whether we need to backtrack to earlier pages. What we are
* concerned about is a page split that happened since we started the
* vacuum scan, moving right-half tuples to a block we already passed. */
if (vstate->cycleid != 0 &&
opaque->btpo_cycleid == vstate->cycleid &&
!(opaque->btpo_flags & BTP_SPLIT_END) &&
!P_RIGHTMOST(opaque) &&
opaque->btpo_next < scanblkno)
backtrack_to = opaque->btpo_next;

BTP_SPLIT_END is the optimization that bounds the backtracking: when a split detects its right sibling has a different cycle ID, it flags the right page BTP_SPLIT_END, telling VACUUM it need not chase further right. The cycle-ID test tolerates false positives (it may backtrack unnecessarily) but never false negatives, which is why a small 16-bit counter (MAX_BT_CYCLE_ID) suffices.

Empty-page deletion: the two-stage half-dead protocol

Section titled “Empty-page deletion: the two-stage half-dead protocol”

A leaf page emptied of all data items becomes a candidate for deletion — removal from the tree. PostgreSQL never deletes the rightmost page of a level and never merges partly-full pages; it only unlinks fully-empty leaves (and the “skinny” chains of single-child internal pages above them). _bt_pagedel drives a two-stage protocol:

  • Stage 1 — mark half-dead (_bt_mark_page_halfdead): re-find the parent, remove the leaf’s downlink, point that keyspace at the right sibling, and set BTP_HALF_DEAD on the leaf. A half-dead page is P_IGNORE — searches skip it by moving right, exactly as for an incomplete split. The tree is left “similar to a page split: the page has no downlink pointing to it, but it’s still linked to its siblings.”
  • Stage 2 — unlink (_bt_unlink_halfdead_page): lock left sibling, target, and right sibling (in that left-to-right order), splice the target out of the sibling chain, and set BTP_DELETED.
stateDiagram-v2
    [*] --> Live
    Live --> HalfDead: stage 1 _bt_mark_page_halfdead\nremove downlink from parent\nkeyspace moves to right sibling
    HalfDead --> Deleted: stage 2 _bt_unlink_halfdead_page\nsplice out of sibling chain\nstamp safexid
    Deleted --> Recyclable: BTPageIsRecyclable\nsafexid visible-to-everyone
    Recyclable --> Reused: taken from FSM\noverwritten as a new right page
    Reused --> [*]

Figure 4 — The empty-page deletion lifecycle. A page moves Live → HalfDead → Deleted, where each arrow is a separate atomic WAL action so an interrupted VACUUM always leaves a consistent tree (the next VACUUM finishes a stranded half-dead page). A Deleted page is not immediately reusable: it lingers as a tombstone, stamped with a safexid, until BTPageIsRecyclable confirms no snapshot could still hold a reference — the Lanin & Shasha “drain technique.”

Each stage is its own WAL record, so an interrupted VACUUM leaves a consistent tree and the next VACUUM finishes the job. The deleted page is not recycled immediately. It is stamped with a safexid (the next-transaction counter) and left as a tombstone, because a concurrent search may have already read a downlink to it. BTPageIsRecyclable gates reuse on that xid becoming visible-to-everyone:

// BTPageIsRecyclable — src/include/access/nbtree.h (condensed)
opaque = BTPageGetOpaque(page);
if (P_ISDELETED(opaque))
{
FullTransactionId safexid = BTPageGetDeleteXid(page);
/* Recycle only once the deletion XID is visible to everyone, so no
* in-progress scan could still follow a downlink to this page. */
return GlobalVisCheckRemovableFullXid(heaprel, safexid);
}
return false;

This deferral is the “drain technique.” VACUUM tracks newly deleted pages in a BTPendingFSM array and, at the end of the scan (_bt_pendingfsm_finalize), places any that have already become safe into the free-space map; the rest are left for a future VACUUM (the metapage’s btm_last_cleanup_num_delpages carries the count forward).

When deletion cannot free enough space, _bt_dedup_pass folds runs of equal non-pivot tuples into a single posting-list tuple — one key value plus an array of heap TIDs — recognized by the BT_IS_POSTING status bit repurposing the tuple’s t_tid:

// nbtree.h — posting-list tuple recognition (condensed)
#define BT_IS_POSTING 0x2000
static inline bool
BTreeTupleIsPosting(IndexTuple itup)
{
if ((itup->t_info & INDEX_ALT_TID_MASK) == 0)
return false;
/* presence of BT_IS_POSTING in offset number indicates posting tuple */
return (ItemPointerGetOffsetNumberNoCheck(&itup->t_tid) & BT_IS_POSTING) != 0;
}

Deduplication is always lazy (only at the point a split looms) and only for indexes whose opclass guarantees btm_allequalimage (equal datums are byte-identical, so collapsing them loses nothing). In unique indexes it serves a different purpose: not space saving (there are no real duplicates long term) but buying time for garbage collection to run before a version-churn split. When an incoming tuple’s heap TID falls inside an existing posting list’s range, _bt_insertonpg performs a posting-list split in passing, folded into the same atomic action as the insert.

Symbols grouped by sub-system. Position-hint table follows.

  • bthandler — returns the IndexAmRoutine filled with nbtree’s capability flags (amcanorder, amcanbackward, amcanunique, amcanparallel, ampredlocks, amcaninclude, …) and callback pointers. The dispatch contract itself is owned by postgres-index-am.md.
  • btinsert — AM insert callback; forms the IndexTuple, sets its heap TID, delegates to _bt_doinsert.
  • btgettuple / btgetbitmap — AM scan callbacks (the read path; ordered scan vs. bitmap).
  • btbuildempty — initializes a metapage via _bt_initmetapage.
  • _bt_search — root-to-leaf descent, latch-coupling down, calling _bt_moveright at each level. Returns the descent stack used later by splits to re-find parents.
  • _bt_moveright — the L&Y recovery primitive: step right while the search key is at/past the page’s high key, or the page is P_IGNORE. Finishes incomplete splits when forupdate.
  • _bt_compare — 3-way key comparator; encodes “first internal data item is −∞”, “truncated attribute is −∞”, and the heap-TID tiebreaker.
  • _bt_binsrch / _bt_binsrch_insert — within-page binary search (navigation vs. insertion position, the latter with scantid).
  • _bt_first / _bt_next / _bt_readpage / _bt_steppage — the ordered-scan state machine; _bt_first positions via _bt_search, then scans leaves along the sibling links, remembering the right-link at scan time so a concurrent split cannot make it re-scan moved items.
  • _bt_lock_and_validate_left — backward-scan move-left, handling the harder case of a left sibling that itself split (the README’s 4-step move-left algorithm).
  • _bt_doinsert — insertion driver: build scankey, descend write-locked, unique-check, find location, insert.
  • _bt_search_insert_bt_search wrapper with the rightmost-leaf fastpath cache (RelationGetTargetBlock).
  • _bt_check_unique — uniqueness enforcement under SnapshotDirty; returns the conflicting xid to wait on.
  • _bt_findinsertloc / _bt_stepright — find the exact in-page offset, stepping right if the value belongs on a sibling.
  • _bt_insertonpg — the insert/split fork: if the item fits, add it (logging one WAL record, clearing a child’s incomplete-split flag if this is the parent insert); else call _bt_split then _bt_insert_parent.
  • _bt_split — the atomic split action: temp left page, new high key (with _bt_truncate), right page allocation, right-sibling left-link fixup, BTP_INCOMPLETE_SPLIT flag, single WAL record.
  • _bt_insert_parent — post the downlink to the parent (or _bt_newlevel for a root split); re-finds the parent with _bt_getstackbuf.
  • _bt_finish_split / _bt_getstackbuf — lazy repair of an incomplete split, re-finding the parent by matching the child block number.
  • _bt_newlevel — build a new root one level up; update the metapage.
  • _bt_delete_or_dedup_one_page / _bt_simpledel_pass — pre-split space reclamation (LP_DEAD physical deletion, simple deletion).
  • _bt_findsplitloc — choose the split point; equalize free space, bias for suffix truncation, special-case duplicates and rightmost pages.
  • _bt_strategy / _bt_bestsplitloc — strategy selection (SPLIT_DEFAULT, SPLIT_MANY_DUPLICATES, SPLIT_SINGLE_VALUE) and best-point search.

Pages, deletion, VACUUM (nbtpage.c, nbtree.c, nbtdedup.c)

Section titled “Pages, deletion, VACUUM (nbtpage.c, nbtree.c, nbtdedup.c)”
  • _bt_getroot / _bt_gettrueroot — fetch the (fast) root, following the metapage and recovering if the cached root pointer is stale.
  • _bt_initmetapage / _bt_getmeta / _bt_upgrademetapage — metapage I/O.
  • btbulkdeletebtvacuumscanbtvacuumpage — the VACUUM scan in physical order; cleanup-lock interlock; cycle-ID backtracking.
  • _bt_pagedel — page-deletion driver.
  • _bt_mark_page_halfdead / _bt_unlink_halfdead_page — the two deletion stages.
  • _bt_lock_subtree_parent — re-find the parent of a to-be-deleted subtree.
  • BTPageIsRecyclable / _bt_pendingfsm_init / _bt_pendingfsm_finalize — deferred recycling (the drain technique) and FSM placement.
  • _bt_delitems_vacuum / _bt_delitems_delete / _bt_delitems_update — WAL-logged item removal/update on a page.
  • _bt_dedup_pass / _bt_bottomupdel_pass — deduplication and bottom-up deletion.

Position hints (as of 2026-06-05, REL_18 273fe94)

Section titled “Position hints (as of 2026-06-05, REL_18 273fe94)”
SymbolFileLine
bthandlernbtree.c115
btinsertnbtree.c202
btgettuplenbtree.c226
btbulkdeletenbtree.c1068
btvacuumscannbtree.c1186
btvacuumpagenbtree.c1361
_bt_searchnbtsearch.c107
_bt_moverightnbtsearch.c246
_bt_binsrchnbtsearch.c348
_bt_comparenbtsearch.c693
_bt_firstnbtsearch.c887
_bt_steppagenbtsearch.c2169
_bt_doinsertnbtinsert.c102
_bt_search_insertnbtinsert.c317
_bt_check_uniquenbtinsert.c408
_bt_findinsertlocnbtinsert.c815
_bt_steprightnbtinsert.c1027
_bt_insertonpgnbtinsert.c1105
_bt_splitnbtinsert.c1467
_bt_insert_parentnbtinsert.c2099
_bt_finish_splitnbtinsert.c2241
_bt_getstackbufnbtinsert.c2319
_bt_newlevelnbtinsert.c2444
_bt_delete_or_dedup_one_pagenbtinsert.c2683
_bt_simpledel_passnbtinsert.c2812
_bt_findsplitlocnbtsplitloc.c129
_bt_getrootnbtpage.c344
_bt_pagedelnbtpage.c1802
_bt_mark_page_halfdeadnbtpage.c2088
_bt_unlink_halfdead_pagenbtpage.c2314
_bt_lock_subtree_parentnbtpage.c2813
_bt_pendingfsm_finalizenbtpage.c2996
_bt_dedup_passnbtdedup.c58
_bt_bottomupdel_passnbtdedup.c307
BTPageOpaqueDatanbtree.h63
BTPageIsRecyclablenbtree.h292
P_HIKEY / P_FIRSTKEYnbtree.h368
  • The right-link (btpo_next), left-link (btpo_prev), level, flags, and vacuum cycle ID live in BTPageOpaqueData in the page special area. Read from nbtree.h on 2026-06-05. The left-link is PostgreSQL’s addition over L&Y (which require only a right-link); confirmed by the README’s “Differences to the Lehman & Yao algorithm” section.

  • A search holds at most one page latch at a time and recovers from a concurrent split by following btpo_next. Verified in _bt_search (drops the parent lock via _bt_relandgetbuf before acquiring the child) and _bt_moveright (steps right while _bt_compare(key, P_HIKEY) >= cmpval). The README confirms PostgreSQL still takes a read latch on the page being examined — L&Y assume unshared in-memory copies, but PostgreSQL shares buffers, so it must latch to keep the bytes stable while reading.

  • A page split is two independent atomic WAL actions; the first flags the left page BTP_INCOMPLETE_SPLIT, the second posts the parent downlink and clears it. Verified in _bt_split (sets the flag, single critical section / xl_btree_split record) and _bt_insertonpg/_bt_insert_parent (the separate parent insert that clears the flag). The README “WAL Considerations” section states a crash between the two leaves a searchable tree whose missing downlink is created on-the-fly by the next inserter via _bt_finish_split.

  • Split locks couple strictly left-to-right and the left half keeps the original block number. Verified in _bt_split: the old right sibling (oopaque->btpo_next) is locked after the splitting page, and PageRestoreTempPage(leftpage, origpage) writes the left half back onto the original block. The README explicitly calls this deadlock-free by lock ordering.

  • The parent downlink is re-found by matching the child’s block number, not the separator key. Verified in _bt_insert_parent (passes bknum to _bt_getstackbuf). The README notes this is a deliberate simplification over L&Y/Lanin-Shasha that avoids coupling locks across levels during the ascent.

  • VACUUM scans in physical block order, takes a cleanup lock on every leaf, and backtracks on a concurrent split via the cycle-ID test. Verified in btvacuumscan (read-stream over physical blocks), btvacuumpage (_bt_upgradelockbufcleanup on every leaf; the backtrack_to assignment guarded by btpo_cycleid == vstate->cycleid && opaque->btpo_next < scanblkno). BTP_SPLIT_END bounds the backtracking.

  • Empty-page deletion is a two-stage half-dead → deleted protocol, and a deleted page is deferred into the FSM until its safexid is visible-to- everyone. Verified in _bt_pagedel (drives both stages), _bt_mark_page_halfdead (BTP_HALF_DEAD), _bt_unlink_halfdead_page (BTP_DELETED + safexid), and BTPageIsRecyclable (GlobalVisCheckRemovableFullXid). The README names this the Lanin & Shasha “drain technique.”

  • Deduplication is lazy, gated on btm_allequalimage, and produces posting-list tuples flagged BT_IS_POSTING. Verified in _bt_dedup_pass and the BTreeTupleIsPosting inline in nbtree.h. The on-disk version that enables it is BTREE_VERSION 4; v3 indexes must REINDEX to use it (per the nbtree.h version comment).

  • Heap TID is a tiebreaker key column as of BTREE_VERSION 4, making every entry unique on its level. Verified in the nbtree.h version comment and the _bt_compare heap-TID/scantid path. This is what licenses the strict- inequality form of the L&Y subtree invariant (Ki < v <= Ki+1).

  1. The exact bound on backtracking work in btvacuumpage. BTP_SPLIT_END stops VACUUM chasing right-links past a split group, but the worst-case number of backtrack hops under sustained concurrent splitting during a single VACUUM is not obvious from the code. Investigation path: trace the interaction between _bt_split’s BTP_SPLIT_END-setting (only when the right sibling has a different cycle ID) and the btvacuumpage backtrack loop under a synthetic split-heavy workload.

  2. How often newly deleted pages actually become recyclable within the same VACUUM. The README says PG14+ checks recycle-safety at the end of the scan via _bt_pendingfsm_finalize, but the fraction that clears GlobalVisCheckRemovableFullXid immediately vs. waits for a future VACUUM is workload-dependent and unmeasured here. Investigation path: instrument _bt_pendingfsm_finalize against a long-running-snapshot workload.

  3. Whether the fastpath rightmost-leaf cache (RelationGetTargetBlock) interacts pathologically with bottom-up deletion on unique indexes. Both target the rightmost / churning leaf; the code paths are separate but the workloads overlap. Investigation path: review whether a fastpath insert can defeat a bottom-up deletion opportunity on the same page.

Beyond PostgreSQL — Comparative Designs & Research Frontiers

Section titled “Beyond PostgreSQL — Comparative Designs & Research Frontiers”
  • CUBRID’s B+-tree. CUBRID also uses a B+-tree with sibling links, but its concurrency model leans more on the page-level latches and the lock manager than on a self-repairing L&Y right-link recovery; a side-by-side of CUBRID’s index-page latch protocol against _bt_moveright would sharpen what the L&Y “move right to recover” actually buys in lock-traffic terms. (See the CUBRID index analysis in the cubrid tree.)

  • Lehman & Yao 1981 vs. the PostgreSQL implementation. PostgreSQL deviates from the paper in named ways (read latches because buffers are shared; a left-link for backward scans; re-finding parents by child block number rather than separator key; lazy parent-downlink creation). A focused note mapping each deviation to its README justification would be the natural companion to this doc. Paper: Lehman & Yao 1981 (dbms-papers/btree.md).

  • Lanin & Shasha 1986 deletion vs. PostgreSQL’s simplification. PostgreSQL moves keyspace right on deletion (Lanin-Shasha prefer left) because it has left-links anyway, and adopts their “drain technique” and “fast root” ideas. Comparing the symmetric Lanin-Shasha algorithm against _bt_pagedel’s half-dead protocol is a research-grade exercise in deletion correctness.

  • Blink-tree latching vs. lock-free / optimistic B-trees (OLFIT, Bw-tree). Modern in-memory engines (Microsoft’s Bw-tree, OLFIT) push past latched B-link trees toward latch-free or optimistic-version designs. Positioning PostgreSQL’s latch-coupled L&Y tree against these would frame what a disk-oriented, buffer-shared engine gives up for them.

  • Suffix truncation / prefix B-trees. PostgreSQL implements only the “simple prefix B-tree” form of Bayer & Unterauer 1977 (whole-attribute truncation). An opclass that manufactures a shorter separator for variable-length types (e.g., text) is an unrealized extension the README itself sketches.

In-tree READMEs and source files (REL_18_STABLE, commit 273fe94)

Section titled “In-tree READMEs and source files (REL_18_STABLE, commit 273fe94)”
  • src/backend/access/nbtree/README — the design document: L&Y basics, PostgreSQL’s differences, deletion during VACUUM, page deletion, the drain technique, fast root, WAL considerations, scans during recovery, suffix truncation, deduplication, posting-list splits, data representation.
  • src/include/access/nbtree.hBTPageOpaqueData, BTMetaPageData, flag bits, P_HIKEY/P_FIRSTKEY, pivot/posting tuple format, BTPageIsRecyclable, BTStackData, version/fillfactor constants.
  • src/backend/access/nbtree/nbtree.cbthandler, AM callbacks, the VACUUM scan (btbulkdelete/btvacuumscan/btvacuumpage).
  • src/backend/access/nbtree/nbtinsert.c_bt_doinsert, _bt_check_unique, _bt_insertonpg, _bt_split, _bt_insert_parent, _bt_finish_split.
  • src/backend/access/nbtree/nbtsearch.c_bt_search, _bt_moveright, _bt_compare, _bt_binsrch, the scan state machine.
  • src/backend/access/nbtree/nbtpage.c — metapage, _bt_pagedel, the half-dead/unlink stages, deferred-FSM recycling.
  • src/backend/access/nbtree/nbtsplitloc.c_bt_findsplitloc and split strategies.
  • src/backend/access/nbtree/nbtdedup.c_bt_dedup_pass, _bt_bottomupdel_pass.
  • Lehman, P. & Yao, S. B. (1981). “Efficient Locking for Concurrent Operations on B-Trees.” ACM TODS 6(4):650-670. The operative algorithm; text captured in knowledge/research/dbms-papers/btree.md.
  • Lanin, V. & Shasha, D. (1986). “A Symmetric Concurrent B-Tree Algorithm.” Proc. 1986 Fall Joint Computer Conf., 380-389. Deletion logic lineage (drain technique, fast root).
  • Comer, D. (1979). “The Ubiquitous B-Tree.” ACM Computing Surveys 11(2). B-tree background (knowledge/research/dbms-papers/btree.md).
  • Bayer, R. & Unterauer, K. (1977). “Prefix B-Trees.” ACM TODS 2(1):11-26. Suffix-truncation basis (cited in the nbtree README).
  • Database System Concepts (Silberschatz, Korth, Sudarshan, 7e), ch. 14 “Indexing” — the single-threaded B+-tree foundation (knowledge/research/dbms-general/).
  • Database Internals (Petrov 2019), ch. 2-6 — B-tree concurrency, page layout, and WAL framing.

Sibling docs (cross-references — mechanism owned there, not duplicated here)

Section titled “Sibling docs (cross-references — mechanism owned there, not duplicated here)”
  • postgres-index-am.md — the IndexAmRoutine dispatch contract that bthandler fills in.
  • postgres-buffer-manager.md — buffer pins and content-lock (LWLock) latching that nbtree’s page locks are built on.
  • postgres-lock-manager.md — the SQL-level heavyweight lock table and (via postgres-ssi-predicate-locking.md) predicate locks that CheckForSerializableConflictIn / PredicateLockPageSplit feed.
  • postgres-architecture-overview.md — Axis 4 (pluggable access methods), where nbtree sits as the default index AM.