Skip to content

PostgreSQL Hash Index — Linear Hashing, Buckets, and Overflow Pages

Contents:

A hash index answers exactly one question — “which rows have this key?” — and it answers nothing else. Unlike a B-tree it cannot serve a range scan, an inequality, an ORDER BY, or a prefix match, because hashing deliberately destroys the ordering of keys: two adjacent values land in arbitrarily distant buckets. What it buys in exchange is the textbook promise of O(1) equality lookup. Database System Concepts (Silberschatz 7e, ch. 14 “Indexing”, captured in research/dbms-general/) frames the ideal: a hash function h(K) maps a key to one of B buckets, and a lookup computes h(K), goes straight to that bucket, and scans only the handful of entries there — no root-to-leaf descent, no log_F N page reads. The cost model is a single bucket access plus an overflow chain walk when a bucket holds more entries than fit in one page.

The hard part of any on-disk hash table is not the lookup; it is growth. A static hash table sized for N rows degrades catastrophically once it holds 10N rows — every bucket becomes a long overflow chain and the O(1) promise collapses to O(chain length). The naive fix, rehash everything into a table twice the size, is fine in memory but ruinous on disk: it rewrites the entire index in one stop-the-world operation, doubling the file and blocking every concurrent reader and writer for the duration. The discipline that solves this is dynamic hashing, and the specific variant PostgreSQL adopts is linear hashing, due to Witold Litwin (1980) and popularized for UNIX by the operative reference at the very top of src/backend/access/hash/README: Margo Seltzer & Ozan Yigit, “A New Hashing Package for UNIX”, USENIX Winter 1991 (catalogued in postgres-paper-bibliography.md; the same lineage feeds PostgreSQL’s in-memory dynahash.c).

Linear hashing’s insight is to grow the table one bucket at a time, and to split buckets in a fixed round-robin order that is independent of which bucket actually overflowed. Maintain two masks: a lowmask covering the current “level” of 2^L buckets and a highmask = 2*lowmask + 1 covering the next level. To address a key, first compute b = h(K) & highmask; if b exceeds the highest bucket that currently exists, fall back to b = h(K) & lowmask. A split pointer (PostgreSQL’s hashm_maxbucket plays this role) marks how far the round-robin expansion has progressed. When the table is judged too full, the bucket at the split pointer is split — its entries are re-hashed under the wider highmask and partitioned between the old bucket and a brand-new high-numbered bucket — and the pointer advances by one. Crucially the bucket that splits is not the one that overflowed; overflow pages absorb local hotspots, while the global split pointer sweeps evenly so that, averaged over a full level, the whole table doubles smoothly without ever rewriting it wholesale.

Two structural consequences organize the rest of this document:

  1. Buckets need overflow pages. Because a split only happens when the global load factor crosses a threshold — not when a particular bucket fills — any individual bucket may temporarily receive more tuples than fit on one page. Those spill onto overflow pages chained off the primary page. Overflow is the pressure-relief valve that lets the split schedule stay round-robin rather than reactive.

  2. The bucket-to-block mapping must be computable, not stored. A naive implementation would keep a directory array mapping bucket number to disk block. Linear hashing avoids the directory: because buckets are allocated in disciplined power-of-two splitpoint groups, the physical block of a bucket can be computed from the bucket number plus a small spares[] array — which is exactly what PostgreSQL’s BUCKET_TO_BLKNO macro does.

The companion theme is concurrency and crash safety. The textbook hash table is single-threaded and in-memory; a production index must let many backends read and insert concurrently while splits and garbage collection run, and must survive a crash mid-split. PostgreSQL’s answer leans entirely on the buffer manager’s three primitives — content locks, pins, and cleanup locks (see postgres-buffer-manager.md) — rather than the heavyweight lock manager, and decomposes every multi-page mutation into single-page atomic WAL actions, exactly the discipline the B-tree uses (postgres-nbtree.md) but adapted to the bucket-chain shape.

Engineers building a disk hash index converge on a recurring set of conventions. Naming them first lets the PostgreSQL symbols read as choices within a shared design space.

Directory hashing vs. directoryless linear hashing

Section titled “Directory hashing vs. directoryless linear hashing”

There are two dominant families of dynamic on-disk hashing:

  • Extendible hashing (Fagin et al. 1979) keeps an explicit directory: an array of 2^d pointers indexed by the top d bits of the hash. When a bucket overflows, only that bucket splits, and the directory either doubles (if d must grow) or simply re-points two slots. Lookups are always exactly two I/Os (directory + bucket), but the directory itself can become large and is a shared mutable structure.
  • Linear hashing (Litwin 1980; Seltzer-Yigit 1991) keeps no directory. Bucket addresses are computed arithmetically from the hash, the two masks, and the split pointer. Growth is round-robin rather than on-demand. This trades the occasional extra overflow-chain hop for the elimination of a centralized directory — a good trade when the directory would otherwise be a contention and memory hotspot.

PostgreSQL chose linear hashing without a directory, which is why the metapage carries highmask, lowmask, maxbucket, and the spares[] array instead of a pointer table.

flowchart TD
  K["key K"] --> H["h = hash(K)<br/>32-bit hashcode"]
  H --> HM["b = h &amp; highmask"]
  HM --> CMP{"b &gt; maxbucket?"}
  CMP -->|yes| LM["b = h &amp; lowmask<br/>bucket not yet split"]
  CMP -->|no| USE["use bucket b"]
  LM --> USE
  USE --> BLK["BUCKET_TO_BLKNO(metap, b)<br/>compute physical block"]
  BLK --> PBP["primary bucket page"]
  PBP --> OVF["overflow page chain<br/>(doubly linked)"]

The bucket chain: primary page plus overflow

Section titled “The bucket chain: primary page plus overflow”

Almost every disk hash index represents a bucket as a linked list of pages: one page that is permanently the bucket’s home, plus a chain of overflow pages added as the bucket grows. PostgreSQL uses a doubly-linked chain (forward hasho_nextblkno, backward hasho_prevblkno) so that garbage collection can splice a freed overflow page out of the middle of the chain by fixing its two neighbors. Entries within a page are kept sorted by hash code so a page can be binary-searched, even though no ordering holds across pages of a bucket.

To avoid a directory while still letting buckets be allocated incrementally, linear-hashing implementations allocate bucket pages in power-of-two groups. PostgreSQL calls each group a splitpoint. Splitpoint s contributes 2^(s-1) new buckets (so the table doubles per group), and to avoid allocating a huge slab at once, groups >= 10 are broken into four equal phases. A per-splitpoint array (spares[]) records how many overflow pages were allocated before each splitpoint’s bucket pages, which is the one piece of state needed to turn a bucket number into a block number arithmetically.

Page-level locking instead of the lock manager

Section titled “Page-level locking instead of the lock manager”

A B-tree and a hash index face the same concurrency problem and reach for the same toolbox: short-lived buffer content locks for individual page reads/writes, long-lived buffer pins to prevent a page from being reorganized out from under an in-flight cursor, and cleanup locks (an exclusive content lock held while observing that no one else has the page pinned) to authorize a wholesale page or bucket reorganization. None of this touches the heavyweight lock manager (postgres-lock-manager.md). The hash AM’s specific convention: a cleanup lock on a primary bucket page = the right to reorganize the entire bucket, and a scan holds a pin on the primary bucket page for its entire lifetime.

Decompose multi-page mutations into atomic WAL actions

Section titled “Decompose multi-page mutations into atomic WAL actions”

A split touches the metapage, the old bucket’s pages, and the new bucket’s pages; an overflow free touches the doomed page, its two neighbors, a bitmap page, and the metapage. No single WAL record can cover an unbounded number of pages (XLR_MAX_BLOCK_ID caps it at 32). The shared discipline is to break each logical operation into a sequence of single-record atomic steps, each of which leaves the index in a valid, searchable state, and to make the recovery rule “any intermediate state is legal” — so a crash between steps is repaired lazily by the next insert, split, or VACUUM rather than requiring an undo.

PostgreSQL’s hash AM is src/backend/access/hash/. The handler hashhandler() fills an IndexAmRoutine (the AM dispatch contract lives in postgres-index-am.md); the interesting machinery is in five files: hashpage.c (metapage, bucket allocation, splits), hashinsert.c (insertion), hashsearch.c (scans), hashovfl.c (overflow page lifecycle and squeeze), and hashutil.c (the addressing arithmetic). We walk the design in the order a tuple experiences it: addressing, then search, then insert, then split, then overflow and garbage collection.

There are four physical page kinds, distinguished by hasho_flag bits in the HashPageOpaqueData special space at the end of every page:

// HashPageOpaqueData — src/include/access/hash.h
typedef struct HashPageOpaqueData
{
BlockNumber hasho_prevblkno; /* prev page in chain, OR maxbucket-as-of-split */
BlockNumber hasho_nextblkno; /* next page in bucket chain */
Bucket hasho_bucket; /* bucket number this page belongs to */
uint16 hasho_flag; /* LH_* page type + transient flag bits */
uint16 hasho_page_id; /* 0xFF80, identifies hash pages to pg_filedump */
} HashPageOpaqueData;
#define LH_OVERFLOW_PAGE (1 << 0)
#define LH_BUCKET_PAGE (1 << 1)
#define LH_BITMAP_PAGE (1 << 2)
#define LH_META_PAGE (1 << 3)
#define LH_BUCKET_BEING_POPULATED (1 << 4)
#define LH_BUCKET_BEING_SPLIT (1 << 5)
#define LH_BUCKET_NEEDS_SPLIT_CLEANUP (1 << 6)
#define LH_PAGE_HAS_DEAD_TUPLES (1 << 7)

The low four bits are the page type (block 0 is always LH_META_PAGE); the high four are transient bucket-state flags used during split and cleanup. Note the dual meaning of hasho_prevblkno: on an overflow page it is the real previous block, but on a primary bucket page (where there is no previous page) it instead stores the value of hashm_maxbucket as of the bucket’s last split. That overload is the linchpin of the metapage cache (below) — it costs nothing because a primary page’s “previous block” is always invalid anyway.

Block 0 is the metapage. It stores the table’s live geometry:

// HashMetaPageData (excerpt) — src/include/access/hash.h
typedef struct HashMetaPageData
{
double hashm_ntuples; /* tuples stored, drives split decision */
uint16 hashm_ffactor; /* target fill factor (tuples per bucket) */
uint32 hashm_maxbucket; /* ID of highest live bucket (= split pointer) */
uint32 hashm_highmask; /* mask into the whole current table */
uint32 hashm_lowmask; /* mask into the lower half */
uint32 hashm_ovflpoint; /* current splitpoint for overflow allocation */
uint32 hashm_firstfree; /* lowest-numbered free overflow bit */
uint32 hashm_nmaps; /* number of bitmap pages */
uint32 hashm_spares[HASH_MAX_SPLITPOINTS]; /* ovfl pages before each splitpoint */
BlockNumber hashm_mapp[HASH_MAX_BITMAPS]; /* blocks of the bitmap pages */
} HashMetaPageData;

The address of a key’s bucket is computed, never looked up in a directory. Two functions do the arithmetic. _hash_hashkey2bucket applies the linear hashing masks:

// _hash_hashkey2bucket — src/backend/access/hash/hashutil.c
Bucket
_hash_hashkey2bucket(uint32 hashkey, uint32 maxbucket,
uint32 highmask, uint32 lowmask)
{
Bucket bucket;
bucket = hashkey & highmask;
if (bucket > maxbucket)
bucket = bucket & lowmask; /* this bucket not yet split off */
return bucket;
}

Then BUCKET_TO_BLKNO turns a bucket number into a physical block by adding the overflow pages that precede that bucket’s splitpoint group — the spares[] entry is what makes this directoryless:

// BUCKET_TO_BLKNO — src/include/access/hash.h
#define BUCKET_TO_BLKNO(metap,B) \
((BlockNumber) ((B) + ((B) ? (metap)->hashm_spares[_hash_spareindex((B)+1)-1] : 0)) + 1)

The + 1 accounts for the metapage at block 0; buckets 0 and 1 always live at blocks 1 and 2. _hash_spareindex maps a bucket count to its global splitpoint phase, and _hash_get_totalbuckets inverts it — together they implement the “groups < 10 are single-phase, groups >= 10 are four-phase” allocation schedule described in the README:

// _hash_spareindex — src/backend/access/hash/hashutil.c
uint32
_hash_spareindex(uint32 num_bucket)
{
uint32 splitpoint_group = pg_ceil_log2_32(num_bucket);
if (splitpoint_group < HASH_SPLITPOINT_GROUPS_WITH_ONE_PHASE)
return splitpoint_group;
/* ... groups >= 10 are split into 4 equal phases ... */
return splitpoint_phases;
}

Both scanning and inserting must locate a bucket, which needs maxbucket/highmask/lowmask. Locking the metapage on every operation would make block 0 a brutal hotspot, so each backend keeps a copy of the metapage in its relcache (rd_amcache) via _hash_getcachedmetap. The danger is staleness: a concurrent split could have moved the target key into a new bucket since the cache was filled. The validation trick uses the overloaded hasho_prevblkno: after locking the candidate primary bucket page, compare its stored maxbucket-as-of-last-split against the cached maxbucket. If the page’s value is larger, the bucket has split since our cache was made, so we drop the lock, refresh, and retry:

// _hash_getbucketbuf_from_hashkey — src/backend/access/hash/hashpage.c
for (;;)
{
bucket = _hash_hashkey2bucket(hashkey, metap->hashm_maxbucket,
metap->hashm_highmask, metap->hashm_lowmask);
blkno = BUCKET_TO_BLKNO(metap, bucket);
buf = _hash_getbuf(rel, blkno, access, LH_BUCKET_PAGE);
opaque = HashPageGetOpaque(BufferGetPage(buf));
/* If this bucket hasn't been split since our cache, we're done. */
if (opaque->hasho_prevblkno <= metap->hashm_maxbucket)
break;
/* Stale: drop lock, force-refresh the cached metapage, and retry. */
_hash_relbuf(rel, buf);
metap = _hash_getcachedmetap(rel, &metabuf, true);
}

The README justifies why retries are cheap: any given bucket can split only a few dozen times ever (the total bucket count is bounded by 2^32), whereas accesses are unbounded, so the steady state almost never retries.

A hash scan supports only single-column equality (HTEqualStrategyNumber) and rejects whole-index scans outright. _hash_first computes the hash key, finds the primary bucket page, and pins it for the whole scan. The crucial performance and concurrency move: a hash scan slurps all matching tuples from a page into backend-local storage (so->currPos.items[]) in one shot, then releases the page content lock and processes heap TIDs while holding no index lock — only the bucket-page pin. This is why concurrent inserts can proceed on the same page without the scan having to re-find its position.

flowchart TD
  START["_hash_first(scan, dir)"] --> HK["hashkey = hash(scankey)"]
  HK --> GBB["_hash_getbucketbuf_from_hashkey<br/>pin + lock primary bucket page"]
  GBB --> POP{"bucket being<br/>populated by split?"}
  POP -->|yes| OLD["also pin old (split) bucket<br/>set hashso_buc_populated"]
  POP -->|no| RP["_hash_readpage"]
  OLD --> RP
  RP --> BIN["_hash_binsearch(page, hashkey)<br/>locate first match"]
  BIN --> LOAD["_hash_load_qualified_items<br/>copy matches to currPos.items[]"]
  LOAD --> REL["release content lock<br/>(retain pin on primary bucket)"]
  REL --> NEXT{"more items<br/>in currPos?"}
  NEXT -->|yes| RET["return xs_heaptid"]
  NEXT -->|no, fwd| STEP["_hash_next: step to<br/>hasho_nextblkno overflow page"]
  STEP --> RP

When a scan starts on a bucket that is being populated by a split, it must also scan the old bucket it was split from, while skipping tuples already copied (marked INDEX_MOVED_BY_SPLIT). _hash_load_qualified_items enforces both that skip and the dead-tuple skip:

// _hash_load_qualified_items (forward) — src/backend/access/hash/hashsearch.c
if ((so->hashso_buc_populated && !so->hashso_buc_split &&
(itup->t_info & INDEX_MOVED_BY_SPLIT_MASK)) ||
(scan->ignore_killed_tuples &&
(ItemIdIsDead(PageGetItemId(page, offnum)))))
{
offnum = OffsetNumberNext(offnum); /* skip moved-by-split / dead */
continue;
}
if (so->hashso_sk_hash == _hash_get_indextuple_hashkey(itup) &&
_hash_checkqual(scan, itup))
{
_hash_saveitem(so, itemIndex, offnum, itup); /* qualified match */
itemIndex++;
}
else
break; /* page is hash-sorted, so first miss ends the run */

Because entries are hash-sorted within a page, the binary search jumps to the first possible match and the loop stops at the first non-match — no full page scan.

_hash_doinsert extracts the precomputed hash key from the index tuple, locates and write-locks the primary bucket page, then walks the bucket chain looking for a page with room. Three notable behaviors: (1) if the bucket is being split, it tries to finish the split first, because that may free space and avoid a needless overflow page; (2) if a full page has dead (LP_DEAD-hinted) tuples, it vacuums that one page before resorting to overflow; (3) only after placing the tuple does it lock the metapage to bump the tuple count and decide whether a split is now due.

// _hash_doinsert (chain walk) — src/backend/access/hash/hashinsert.c
while (PageGetFreeSpace(page) < itemsz)
{
BlockNumber nextblkno;
/* try reclaiming dead tuples on this page before overflowing */
if (H_HAS_DEAD_TUPLES(pageopaque) && IsBufferCleanupOK(buf))
{
_hash_vacuum_one_page(rel, heapRel, metabuf, buf);
if (PageGetFreeSpace(page) >= itemsz)
break;
}
nextblkno = pageopaque->hasho_nextblkno;
if (BlockNumberIsValid(nextblkno))
{
/* step to existing overflow page (drop lock; keep pin on primary) */
if (buf != bucket_buf)
_hash_relbuf(rel, buf);
else
LockBuffer(buf, BUFFER_LOCK_UNLOCK);
buf = _hash_getbuf(rel, nextblkno, HASH_WRITE, LH_OVERFLOW_PAGE);
}
else
{
/* end of chain: allocate and link a fresh overflow page */
LockBuffer(buf, BUFFER_LOCK_UNLOCK);
buf = _hash_addovflpage(rel, metabuf, buf, (buf == bucket_buf));
}
page = BufferGetPage(buf);
pageopaque = HashPageGetOpaque(page);
}

The actual placement keeps the page hash-sorted via _hash_pgaddtup (binary search for the insertion point, unless the caller asserts append order). The count bump and split decision are made under the metapage exclusive lock, inside a critical section, and the whole thing is one WAL record covering the target page and the metapage:

// _hash_doinsert (commit + split decision) — src/backend/access/hash/hashinsert.c
LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
START_CRIT_SECTION();
itup_off = _hash_pgaddtup(rel, buf, itemsz, itup, sorted);
MarkBufferDirty(buf);
metap = HashPageGetMeta(metapage);
metap->hashm_ntuples += 1;
do_expand = metap->hashm_ntuples >
(double) metap->hashm_ffactor * (metap->hashm_maxbucket + 1);
MarkBufferDirty(metabuf);
/* ... XLOG_HASH_INSERT registers buf (block 0) and metabuf (block 1) ... */
END_CRIT_SECTION();
LockBuffer(metabuf, BUFFER_LOCK_UNLOCK);
_hash_relbuf(rel, buf);
if (do_expand)
_hash_expandtable(rel, metabuf); /* may split one bucket */

The split test — ntuples > ffactor * (maxbucket + 1) — is the global load factor, kept deliberately in sync between _hash_doinsert and _hash_expandtable. It triggers on the whole table being overfull, not on the current bucket, which is the round-robin discipline of linear hashing.

_hash_expandtable is the heart of growth. It locks the metapage, rechecks that a split is still wanted, then picks the bucket to split using the linear hashing rule: the new bucket is maxbucket + 1, and the old bucket it splits from is new_bucket & lowmask. It takes a conditional cleanup lock on the old bucket — conditional because it must not block while holding the metapage lock; if it can’t get the lock it simply gives up, leaving the index overfull but correct, and the next inserter retries.

// _hash_expandtable (bucket selection) — src/backend/access/hash/hashpage.c
new_bucket = metap->hashm_maxbucket + 1;
old_bucket = (new_bucket & metap->hashm_lowmask);
start_oblkno = BUCKET_TO_BLKNO(metap, old_bucket);
buf_oblkno = _hash_getbuf_with_condlock_cleanup(rel, start_oblkno, LH_BUCKET_PAGE);
if (!buf_oblkno)
goto fail; /* couldn't get cleanup lock; abandon split, stay overfull */

If the old bucket still has a pending split (LH_BUCKET_BEING_SPLIT) or leftover split-cleanup work (LH_BUCKET_NEEDS_SPLIT_CLEANUP), it finishes that first and restarts — a bucket can never be in two splits at once. Once clear, it may need to allocate a whole new splitpoint’s worth of bucket pages (_hash_alloc_buckets, which just extends the file’s logical EOF by writing a single zero page at the end of the range, relying on filesystem holes for the rest). Then, inside a critical section, it updates the metapage masks and flags both buckets as in-progress, all under one XLOG_HASH_SPLIT_ALLOCATE_PAGE record:

// _hash_expandtable (mask + flag update) — src/backend/access/hash/hashpage.c
START_CRIT_SECTION();
metap->hashm_maxbucket = new_bucket;
if (new_bucket > metap->hashm_highmask)
{
/* a full doubling completed: widen the masks one bit */
metap->hashm_lowmask = metap->hashm_highmask;
metap->hashm_highmask = new_bucket | metap->hashm_lowmask;
}
/* old bucket: mark being-split AND stamp maxbucket into hasho_prevblkno */
oopaque->hasho_flag |= LH_BUCKET_BEING_SPLIT;
oopaque->hasho_prevblkno = maxbucket;
/* new bucket: fresh primary page, marked being-populated */
nopaque->hasho_flag = LH_BUCKET_PAGE | LH_BUCKET_BEING_POPULATED;

The mask-widening only fires at a power-of-two boundary; between boundaries the table grows one bucket per split while the masks stay fixed — that is linear hashing’s smooth incremental doubling. _hash_splitbucket then walks the old bucket’s chain, and for each tuple recomputes its bucket under the new masks. Tuples destined for the new bucket are copied (not moved in place), stamped INDEX_MOVED_BY_SPLIT_MASK, and added to the new bucket, allocating overflow pages there as needed:

// _hash_splitbucket (partition loop) — src/backend/access/hash/hashpage.c
bucket = _hash_hashkey2bucket(_hash_get_indextuple_hashkey(itup),
maxbucket, highmask, lowmask);
if (bucket == nbucket)
{
new_itup = CopyIndexTuple(itup);
new_itup->t_info |= INDEX_MOVED_BY_SPLIT_MASK; /* scans skip in old bucket */
/* buffer up; flush to new page (and chain overflow) when full */
itups[nitups++] = new_itup;
}
else
Assert(bucket == obucket); /* stays put */

Tuples that stay in the old bucket are left in place for now — they are removed lazily later by the split-cleanup step. When the chain is exhausted, the routine re-locks both primary pages (old first, to respect the lower-bucket-first lock order), clears the in-progress flags, and sets LH_BUCKET_NEEDS_SPLIT_CLEANUP on the old bucket under an XLOG_HASH_SPLIT_COMPLETE record:

// _hash_splitbucket (completion) — src/backend/access/hash/hashpage.c
oopaque->hasho_flag &= ~LH_BUCKET_BEING_SPLIT;
nopaque->hasho_flag &= ~LH_BUCKET_BEING_POPULATED;
oopaque->hasho_flag |= LH_BUCKET_NEEDS_SPLIT_CLEANUP; /* old copies still present */

The two-phase “copy now under moved-by-split, delete later under needs-cleanup” design is what makes a split survivable across a crash and invisible to concurrent scans: a scan that began before the split sees the old bucket’s originals; a scan that began during it sees the new bucket plus the old bucket’s not-yet-moved tuples; and a crash mid-split leaves both flags set, so the reader algorithm transparently scans both halves until the next insert finishes the job.

flowchart TD
  INS["insert pushes ntuples ><br/>ffactor * (maxbucket+1)"] --> EXP["_hash_expandtable"]
  EXP --> META["lock metapage<br/>recheck split needed"]
  META --> PICK["new = maxbucket+1<br/>old = new &amp; lowmask"]
  PICK --> CL{"cond. cleanup lock<br/>on old bucket?"}
  CL -->|no| GIVEUP["give up<br/>stay overfull, retry later"]
  CL -->|yes| PEND{"old has pending<br/>split / cleanup?"}
  PEND -->|yes| FIN["finish it, restart_expand"]
  PEND -->|no| ALLOC["_hash_alloc_buckets if<br/>new splitpoint group"]
  ALLOC --> FLAG["crit section: widen masks,<br/>flag old=BEING_SPLIT new=BEING_POPULATED<br/>WAL: SPLIT_ALLOCATE_PAGE"]
  FLAG --> PART["_hash_splitbucket:<br/>copy nbucket tuples,<br/>mark MOVED_BY_SPLIT<br/>WAL: SPLIT_PAGE per full page"]
  PART --> DONE["clear flags,<br/>set NEEDS_SPLIT_CLEANUP<br/>WAL: SPLIT_COMPLETE"]
  DONE --> CLEAN["hashbucketcleanup:<br/>delete moved originals from old"]

Overflow pages: allocation, freeing, and the free-space bitmap

Section titled “Overflow pages: allocation, freeing, and the free-space bitmap”

Overflow pages are tracked by bitmap pages — dedicated pages whose payload is a giant bit array, one bit per overflow page (0 = free, 1 = in use). _hash_addovflpage first searches the bitmaps starting at hashm_firstfree for a recyclable page; only if none is free does it extend the index. Allocation and chaining are a single XLOG_HASH_ADD_OVFL_PAGE record so a crash cannot leak a half-linked page:

// _hash_addovflpage (recycle vs. extend) — src/backend/access/hash/hashovfl.c
for (; bit <= last_inpage; j++, bit += BITS_PER_MAP)
{
if (freep[j] != ALL_SET) /* a free bit in this bitmap word */
{
bit += _hash_firstfreebit(freep[j]);
blkno = bitno_to_blkno(metap, bit + (i << BMPG_SHIFT(metap)));
ovflbuf = _hash_getinitbuf(rel, blkno); /* recycle this page */
goto found;
}
}
/* ... fell through: extend the relation for a new overflow page ... */

Freeing is the inverse and is atomic with moving the page’s tuples elsewhere_hash_freeovflpage splices the doomed page out of the doubly-linked chain (fixing both neighbors), clears its bitmap bit, lowers hashm_firstfree if appropriate, and adds the relocated tuples to the write page, all under one XLOG_HASH_SQUEEZE_PAGE record. The README is explicit that combining the move and the delink into one action is what prevents a standby from briefly showing a tuple twice:

// _hash_freeovflpage (delink) — src/backend/access/hash/hashovfl.c
if (BufferIsValid(prevbuf))
prevopaque->hasho_nextblkno = nextblkno; /* fore sibling skips us */
if (BufferIsValid(nextbuf))
nextopaque->hasho_prevblkno = prevblkno; /* aft sibling skips us */
CLRBIT(freep, bitmapbit); /* mark page free in bitmap */
if (ovflbitno < metap->hashm_firstfree)
metap->hashm_firstfree = ovflbitno;

VACUUM (hashbulkdeletehashbucketcleanup) removes dead and moved-by-split tuples page by page under cleanup locks chained down the bucket, then _hash_squeezebucket compacts the bucket: it works a read pointer backward from the chain’s tail and a write pointer forward from the primary page, moving tuples onto earlier pages and freeing emptied overflow pages, until the two pointers meet. This reclaims overflow pages into the bitmap free pool — the only mechanism (short of REINDEX) by which a hash index gives space back, since the bucket count itself never shrinks.

Bucket-level locking and the scan/VACUUM interlock

Section titled “Bucket-level locking and the scan/VACUUM interlock”

The locking contract is small but strict. To avoid deadlock between any two operations that must touch two buckets, the lower-numbered bucket is always locked first, and the metapage is locked only after all bucket locks. A scan keeps a pin (not a lock) on its primary bucket page for its whole life, which blocks VACUUM and split (both of which need a cleanup lock, i.e. the sole pin) from reorganizing that bucket while the scan is parked in it. The one residual hazard — a scan that starts after VACUUM released a bucket’s cleanup lock but then overtakes the still-running cleanup, and wrongly kills a TID that VACUUM is about to reuse — is closed by lock chaining: cleanup locks the next page in the chain before releasing the current one, so a scan can never leapfrog the cleanup frontier.

The hash AM lives entirely under src/backend/access/hash/. Symbols are grouped by call-flow below; the position-hint table at the end pairs each with its (file, line) as observed on 2026-06-05.

  • hashhandler — fills the IndexAmRoutine advertised to the index AM layer (postgres-index-am.md). Declares the AM as supporting only equality (amsearcharray = false, one strategy), not unique, not ordered.
  • hashbuild — bulk-build entry point; uses a spool when the index is large enough to sort by hash code first (hashsort.c, out of scope here).
  • hashinsert — per-tuple public entry; computes the index tuple and calls _hash_doinsert.
  • hashgettuple — scan entry; drives _hash_first / _hash_next and manages LP_DEAD kill-item hinting.
  • hashbulkdelete / hashbucketcleanup — VACUUM’s per-bucket cleanup driver, removing dead and moved-by-split tuples and clearing LH_BUCKET_NEEDS_SPLIT_CLEANUP, then squeezing.
  • _hash_datum2hashkey — call the opclass hash proc to get the 32-bit hash code for a key (the cached, non-cross-type fast path).
  • _hash_hashkey2bucket — apply highmask/lowmask/maxbucket to map a hash code to a bucket number.
  • _hash_spareindex — map a bucket count to its global splitpoint phase (single-phase for groups < 10, four-phase for groups >= 10).
  • _hash_get_totalbuckets — inverse: total buckets allocated through a given splitpoint phase.
  • _hash_get_oldblock_from_newbucket / _hash_get_newblock_from_oldbucket — translate between a split’s two bucket numbers, used by scans and split completion.
  • BUCKET_TO_BLKNO (macro, hash.h) — bucket number to physical block.

Metapage, allocation, and split (hashpage.c)

Section titled “Metapage, allocation, and split (hashpage.c)”
  • _hash_init / _hash_init_metabuffer — create the metapage, the initial N bucket pages, and the first bitmap page at CREATE INDEX.
  • _hash_getbuf / _hash_getnewbuf / _hash_getinitbuf / _hash_getbuf_with_condlock_cleanup — typed buffer accessors that run _hash_checkpage against the allowed page-type mask.
  • _hash_getcachedmetap — return (and lazily refresh) the relcache copy of the metapage in rd_amcache.
  • _hash_getbucketbuf_from_hashkey — the cache-validated lookup loop that yields a locked primary bucket page, retrying on stale-cache detection.
  • _hash_expandtable — the split driver: pick the bucket, take cleanup locks, widen masks, allocate, and flag in-progress.
  • _hash_alloc_buckets — extend the file’s logical EOF for a new splitpoint group by writing one zero page at the range end.
  • _hash_splitbucket — re-partition the old bucket’s tuples between old and new, marking moved tuples INDEX_MOVED_BY_SPLIT.
  • _hash_finish_split — complete a split interrupted by a crash or a failed cleanup-lock acquisition, using a TID hash table to skip already-moved tuples.
  • _hash_doinsert — locate bucket, walk the chain for free space (adding overflow if needed), place the tuple, bump the metapage count, and trigger a split if overfull.
  • _hash_pgaddtup / _hash_pgaddmultitup — add one / many tuples to a page preserving hash-code order (binary search for the slot).
  • _hash_vacuum_one_page — opportunistically delete LP_DEAD tuples from a single full page before resorting to an overflow page.
  • _hash_first — start a scan: hash the key, pin the primary bucket page, handle the being-populated case, position via binary search.
  • _hash_readpage — load all qualifying tuples on a page into so->currPos.items[], then release the content lock (keep the pin).
  • _hash_load_qualified_items — the per-page match loop, skipping moved-by-split and dead tuples.
  • _hash_next / _hash_readnext / _hash_readprev — advance to the next/previous page in the chain, crossing into the split partner bucket when a split was in progress at scan start.
  • _hash_addovflpage — recycle a free overflow page from the bitmap, or extend the index; chain it to the bucket tail atomically.
  • _hash_freeovflpage — delink an overflow page, clear its bitmap bit, and move its tuples to a write page, all in one WAL action.
  • _hash_squeezebucket — compact a bucket by moving tuples toward the front and freeing emptied overflow pages.
  • _hash_initbitmapbuffer — initialize a bitmap page with all bits set (“in use”).
  • bitno_to_blkno / _hash_ovflblkno_to_bitno — convert between an overflow page’s bitmap bit number and its physical block number.

Position hints (as of 2026-06-05, REL_18 273fe94)

Section titled “Position hints (as of 2026-06-05, REL_18 273fe94)”
SymbolFileLine
hashhandlersrc/backend/access/hash/hash.c58
hashbuildsrc/backend/access/hash/hash.c122
hashinsertsrc/backend/access/hash/hash.c258
hashgettuplesrc/backend/access/hash/hash.c290
hashbulkdeletesrc/backend/access/hash/hash.c465
hashbucketcleanupsrc/backend/access/hash/hash.c690
_hash_datum2hashkeysrc/backend/access/hash/hashutil.c82
_hash_hashkey2bucketsrc/backend/access/hash/hashutil.c125
_hash_spareindexsrc/backend/access/hash/hashutil.c142
_hash_get_totalbucketssrc/backend/access/hash/hashutil.c174
_hash_get_oldblock_from_newbucketsrc/backend/access/hash/hashutil.c422
_hash_get_newblock_from_oldbucketsrc/backend/access/hash/hashutil.c461
_hash_initsrc/backend/access/hash/hashpage.c327
_hash_init_metabuffersrc/backend/access/hash/hashpage.c498
_hash_getbufsrc/backend/access/hash/hashpage.c70
_hash_getbuf_with_condlock_cleanupsrc/backend/access/hash/hashpage.c96
_hash_getnewbufsrc/backend/access/hash/hashpage.c198
_hash_expandtablesrc/backend/access/hash/hashpage.c614
_hash_alloc_bucketssrc/backend/access/hash/hashpage.c992
_hash_splitbucketsrc/backend/access/hash/hashpage.c1073
_hash_finish_splitsrc/backend/access/hash/hashpage.c1356
_hash_getcachedmetapsrc/backend/access/hash/hashpage.c1501
_hash_getbucketbuf_from_hashkeysrc/backend/access/hash/hashpage.c1559
_hash_doinsertsrc/backend/access/hash/hashinsert.c38
_hash_pgaddtupsrc/backend/access/hash/hashinsert.c274
_hash_pgaddmultitupsrc/backend/access/hash/hashinsert.c331
_hash_vacuum_one_pagesrc/backend/access/hash/hashinsert.c370
_hash_nextsrc/backend/access/hash/hashsearch.c48
_hash_readnextsrc/backend/access/hash/hashsearch.c131
_hash_readprevsrc/backend/access/hash/hashsearch.c197
_hash_firstsrc/backend/access/hash/hashsearch.c288
_hash_readpagesrc/backend/access/hash/hashsearch.c448
_hash_load_qualified_itemssrc/backend/access/hash/hashsearch.c604
bitno_to_blknosrc/backend/access/hash/hashovfl.c35
_hash_ovflblkno_to_bitnosrc/backend/access/hash/hashovfl.c62
_hash_addovflpagesrc/backend/access/hash/hashovfl.c112
_hash_freeovflpagesrc/backend/access/hash/hashovfl.c490
_hash_initbitmapbuffersrc/backend/access/hash/hashovfl.c777
_hash_squeezebucketsrc/backend/access/hash/hashovfl.c842
HashMetaPageDatasrc/include/access/hash.h244
HashPageOpaqueDatasrc/include/access/hash.h77
BUCKET_TO_BLKNOsrc/include/access/hash.h39

All claims below were checked against the REL_18 tree at /data/hgryoo/references/postgres (commit 273fe94).

  • Page types and flags. The four physical page types (LH_OVERFLOW_PAGE, LH_BUCKET_PAGE, LH_BITMAP_PAGE, LH_META_PAGE) and the four transient flags (LH_BUCKET_BEING_POPULATED, LH_BUCKET_BEING_SPLIT, LH_BUCKET_NEEDS_SPLIT_CLEANUP, LH_PAGE_HAS_DEAD_TUPLES) are defined in hash.h lines 54–61, exactly as quoted. HASHO_PAGE_ID is 0xFF80 (line 101). Verified.
  • hasho_prevblkno overload. The header comment (lines 67–72) confirms that on a primary bucket page hasho_prevblkno stores “the hashm_maxbucket value as of the last time the bucket was last split, or else as of the time the bucket was created”, used for cache validation. _hash_initbuf sets it to max_bucket (line 176) and _hash_expandtable updates it to maxbucket on split (oopaque->hasho_prevblkno = maxbucket). Verified.
  • Metapage cache validation. _hash_getbucketbuf_from_hashkey breaks the loop when opaque->hasho_prevblkno <= metap->hashm_maxbucket and otherwise force-refreshes via _hash_getcachedmetap(rel, &metabuf, true). Verified against the quoted excerpt.
  • Linear hashing mask logic. _hash_hashkey2bucket computes bucket = hashkey & highmask; if (bucket > maxbucket) bucket &= lowmask; verbatim. The mask-widening on a doubling boundary in _hash_expandtable (lowmask = highmask; highmask = new_bucket | lowmask;) is present. Verified.
  • Split selection. new_bucket = maxbucket + 1; old_bucket = new_bucket & lowmask; confirmed in _hash_expandtable. The conditional cleanup lock (_hash_getbuf_with_condlock_cleanup) and the “give up on failure” path (goto fail) are as described. Verified.
  • moved-by-split marking. _hash_splitbucket copies qualifying tuples with CopyIndexTuple and ORs in INDEX_MOVED_BY_SPLIT_MASK (= INDEX_AM_RESERVED_BIT, hash.h line 293), and _hash_load_qualified_items skips such tuples when hashso_buc_populated && !hashso_buc_split. Verified.
  • Split decision formula. Both _hash_doinsert and _hash_expandtable use ntuples > ffactor * (maxbucket + 1), with explicit “keep in sync” comments at both sites. Verified.
  • WAL op codes. XLOG_HASH_INIT_META_PAGE(0x00), INIT_BITMAP_PAGE, INSERT, ADD_OVFL_PAGE, SPLIT_ALLOCATE_PAGE, SPLIT_PAGE, SPLIT_COMPLETE, MOVE_PAGE_CONTENTS, SQUEEZE_PAGE, DELETE, SPLIT_CLEANUP, UPDATE_META_PAGE, VACUUM_ONE_PAGE are all defined in hash_xlog.h (lines 27–45). Each XLogInsert(RM_HASH_ID, ...) call cited matches its operation. Verified.
  • Overflow free-space bitmap. _hash_addovflpage searches bitmaps from hashm_firstfree, recycles a page if a freep[j] != ALL_SET word is found (else extends), and _hash_freeovflpage does CLRBIT(freep, bitmapbit) and lowers hashm_firstfree. The atomic move+delink under XLOG_HASH_SQUEEZE_PAGE is confirmed. Verified.
  • Scan locking. _hash_first keeps so->hashso_bucket_buf pinned; _hash_readpage releases the content lock but retains the primary-bucket pin (LockBuffer(... BUFFER_LOCK_UNLOCK) when currPos.buf == hashso_bucket_buf). The whole-page slurp into currPos.items[] is present. Verified.
  • Whole-index scan rejection. _hash_first raises ERRCODE_FEATURE_NOT_SUPPORTED “hash indexes do not support whole-index scans” when scan->numberOfKeys < 1. Verified.
  • One caveat. The doc says VACUUM “gives space back to the bitmap free pool but never to the OS”; the README (lines 31–34) confirms a hash index is never shrunk except by REINDEX, and overflow pages are recycled but never returned to the filesystem. Verified.
  • WAL and crash recovery. Hash indexes have been fully WAL-logged since PostgreSQL 10; prior to that they were not crash-safe. The RM_HASH_ID resource manager (hash_xlog.c) replays each of the op codes above, and the README’s “Lock definitions” and “Pseudocode” sections describe the recovery invariant that any intermediate split / squeeze state is a legal index. Verified against hash_xlog.h and the START_CRIT_SECTION / XLogInsert call sites quoted above.

Beyond PostgreSQL — Comparative Designs & Research Frontiers

Section titled “Beyond PostgreSQL — Comparative Designs & Research Frontiers”
  • Extendible hashing (Fagin et al. 1979) vs. PostgreSQL’s directoryless linear hashing. The dominant alternative keeps a 2^d directory indexed by the top d hash bits, guaranteeing exactly two I/Os per probe and splitting only the bucket that overflowed. PostgreSQL deliberately rejected the directory — BUCKET_TO_BLKNO computes the block from spares[] instead — trading the occasional extra overflow-chain hop for the elimination of a shared, mutable, memory-resident directory that would itself be a contention hotspot. A focused note mapping each hashm_* metapage field to its extendible-hashing analogue (directory vs. masks, local-depth vs. hasho_prevblkno-as-maxbucket) would sharpen the trade. Paper: Fagin, Nievergelt, Pippenger & Strong, “Extendible Hashing”, ACM TODS 4(3), 1979.

  • Litwin’s original linear hashing vs. Seltzer-Yigit’s UNIX package. The README credits Seltzer & Yigit 1991 (dbms-papers/ bibliography) as the direct lineage, which is itself an engineering refinement of Litwin’s 1980 linear hashing. The PostgreSQL spares[]/splitpoint-phase scheme (groups < 10 single-phase, groups >= 10 four-phase) is a disk-allocation optimization absent from the textbook algorithm — it avoids extending the file by a full power-of-two slab at once. Comparing _hash_spareindex against Litwin’s flat group numbering is a self-contained exercise in how a paper algorithm hardens into production storage code. Paper: Litwin, “Linear Hashing: A New Tool for File and Table Addressing”, VLDB 1980.

  • Hash vs. B-tree as the default equality index. Until WAL logging arrived in PG 10, hash indexes were not crash-safe and the documentation actively steered users to B-tree even for equality. Even now a B-tree on a single column serves equality in O(log_F N) with full crash safety, ordering, and uniqueness, so the hash AM wins only on very large keys (where storing the 4-byte hash code beats storing the full key) and high-cardinality equality workloads. A benchmark note pitting postgres-nbtree.md’s _bt_search against _hash_first on a wide-text equality column would quantify the crossover. (See postgres-nbtree.md.)

  • In-memory dynamic hashing — dynahash.c shares the lineage. PostgreSQL’s own in-memory hash tables (src/backend/utils/hash/dynahash.c, behind the buffer mapping table, the relcache, the lock manager’s LOCK hash) descend from the same Seltzer-Yigit package, but expand by directory segments rather than disk splitpoints and never persist. Contrasting the on-disk hashm_spares[] addressing with dynahash’s segmented directory shows two forks of one algorithm specialized for disk vs. RAM.

  • Modern concurrent / latch-free hash indexes. Research engines push past page-content-lock linear hashing toward latch-free split-ordered lists (Shalev-Shavit 2006) and cache-conscious extendible hashing (e.g. level hashing, CCEH for persistent memory). Positioning PostgreSQL’s cleanup-lock-per-bucket, pin-for-scan-lifetime protocol against these would frame what a disk-oriented, buffer-shared engine gives up for crash safety and replica consistency. The hash AM’s “one atomic WAL action per page mutation, any intermediate state is searchable” discipline is precisely the property latch-free in-memory designs do not need and cannot easily provide.

  • CUBRID has no hash index AM. CUBRID’s persistent indexing is B+-tree only; hashing appears in its query executor (hash join, hash aggregation) and in in-memory structures, not as a secondary index access method. A note on why a DBMS might ship a B-tree-only index catalog — uniformity of the optimizer’s cost model, one recovery protocol to maintain — is the natural companion to this file. (See the CUBRID code-analysis tree.)

In-tree READMEs and source files (REL_18_STABLE, commit 273fe94)

Section titled “In-tree READMEs and source files (REL_18_STABLE, commit 273fe94)”
  • src/backend/access/hash/README — the design document: linear-hashing basics and the Seltzer-Yigit lineage, the splitpoint/spares[] addressing scheme, lock definitions (page content locks, pins, cleanup locks), the metapage-cache validation via hasho_prevblkno, incremental split protocol, overflow-page free-space bitmaps, the squeeze/cleanup VACUUM path, and the WAL crash-safety invariants.
  • src/include/access/hash.hHashPageOpaqueData, HashMetaPageData, the LH_* page-type and transient flag bits, BUCKET_TO_BLKNO, INDEX_MOVED_BY_SPLIT_MASK, HASHO_PAGE_ID, splitpoint/bitmap constants.
  • src/include/access/hash_xlog.h — the XLOG_HASH_* op codes and per-record payload structs replayed by RM_HASH_ID.
  • src/backend/access/hash/hash.chashhandler, the AM callbacks (hashinsert, hashgettuple, hashbulkdelete), and hashbucketcleanup.
  • src/backend/access/hash/hashpage.c — metapage init, typed buffer accessors, _hash_getcachedmetap, _hash_getbucketbuf_from_hashkey, _hash_expandtable, _hash_alloc_buckets, _hash_splitbucket, _hash_finish_split.
  • src/backend/access/hash/hashinsert.c_hash_doinsert, _hash_pgaddtup, _hash_vacuum_one_page.
  • src/backend/access/hash/hashsearch.c_hash_first, _hash_readpage, _hash_load_qualified_items, _hash_next/_hash_readnext/_hash_readprev.
  • src/backend/access/hash/hashovfl.c_hash_addovflpage, _hash_freeovflpage, _hash_squeezebucket, the bitmap helpers.
  • src/backend/access/hash/hashutil.c_hash_hashkey2bucket, _hash_spareindex, _hash_get_totalbuckets, the old/new bucket translators.
  • Seltzer, M. & Yigit, O. (1991). “A New Hashing Package for UNIX.” Proc. Winter USENIX Conf. The operative reference cited at the top of the hash README; source of the linear-hashing core and the spares[] addressing. Catalogued in knowledge/research/dbms-papers/ and the postgres-paper-bibliography.md plan.
  • Litwin, W. (1980). “Linear Hashing: A New Tool for File and Table Addressing.” Proc. VLDB. The original linear-hashing algorithm that Seltzer-Yigit refines.
  • Fagin, R., Nievergelt, J., Pippenger, N. & Strong, H. R. (1979). “Extendible Hashing — A Fast Access Method for Dynamic Files.” ACM TODS 4(3):315-344. The directory-based alternative contrasted in the “Common DBMS Design” and “Beyond PostgreSQL” sections.
  • Database System Concepts (Silberschatz, Korth, Sudarshan, 7e), ch. 14 “Indexing” — the static-hashing and dynamic-hashing foundation (knowledge/research/dbms-general/).
  • Database Internals (Petrov 2019), ch. 2-6 — page layout, overflow chains, and WAL framing for on-disk hash structures.

Sibling docs (cross-references — mechanism owned there, not duplicated here)

Section titled “Sibling docs (cross-references — mechanism owned there, not duplicated here)”
  • postgres-index-am.md — the IndexAmRoutine dispatch contract that hashhandler fills in (build/insert/scan/vacuum callback surface).
  • postgres-nbtree.md — the B-tree AM; the default equality index this AM is measured against, and the source of the shared “decompose into atomic WAL actions” concurrency discipline.
  • postgres-buffer-manager.md — buffer pins, content locks (LWLocks), and cleanup locks that the bucket-level locking protocol is built on.
  • postgres-lock-manager.md — the SQL-level heavyweight lock table, which the hash AM deliberately does not use for page-level concurrency.
  • postgres-wal-records-rmgr.md — the RM_HASH_ID resource manager and the rmgr WAL-replay framing for the XLOG_HASH_* records.
  • postgres-architecture-overview.md — Axis 4 (pluggable access methods), where the hash AM sits beside nbtree, GiST, GIN, SP-GiST, and BRIN.