PostgreSQL Dynamic Hash Tables (dynahash) — Larson Linear Hashing, Partitioned Shared Hashes, and the simplehash Template
Contents:
- Theoretical Background
- Common DBMS Design
- PostgreSQL’s Approach
- Source Walkthrough
- Source verification (as of 2026-06-05)
- Beyond PostgreSQL — Comparative Designs & Research Frontiers
- Sources
Theoretical Background
Section titled “Theoretical Background”A hash table is the canonical associative data structure: it maps keys to
values in expected O(1) time by reducing the key to a small integer (the
hash value) and using that integer to index a bucket array. Database
System Concepts (Silberschatz 7e, ch. 14 “Indexing”, and ch. 24’s
discussion of in-memory structures) frames the two design axes that every
hash table must settle: collision resolution — what happens when two
distinct keys land in the same bucket — and resizing — what happens when
the table fills up and the average chain length (the load factor) climbs
past the point where lookups stop being O(1).
On collision resolution, the textbook names the two families that organize this entire document:
- Separate chaining. Each bucket holds the head of a linked list of entries that hashed to it. A collision appends to the list. Lookups walk the chain comparing keys. The entries live wherever the allocator put them; the bucket array holds only pointers. The decisive property is that an entry’s address never changes for the entry’s lifetime — inserting or deleting other entries cannot move it — so callers may hold a raw pointer into the table across other operations.
- Open addressing. All entries live inside the bucket array itself; on collision the insert probes a deterministic sequence of alternate buckets (linear, quadratic, or double-hashing) until it finds an empty slot. There are no per-entry allocations and no pointer chasing, so the whole structure is dense and CPU-cache-friendly. The price is that entries get moved (during probing, deletion compaction, and resizing), so a caller cannot hold a stable pointer, and deletions are awkward (classically requiring tombstones to keep probe chains intact).
On resizing, the naive answer — allocate a new array of double the size and
re-hash every element into it — incurs a O(N) stop-the-world pause every
time the table doubles. For a long-lived table whose size is hard to predict
this is acceptable amortized, but the pause itself is undesirable, and for a
hash table living in a fixed shared-memory region that cannot be
reallocated at a stable address, doubling the bucket array in place is simply
impossible. The classic answer to smooth growth is Per-Åke Larson’s
“Dynamic Hash Tables” (CACM 31(4), April 1988, pp. 446-457) — the paper
named verbatim in the historical comment at the top of dynahash.c. Larson’s
scheme (a refinement of Litwin’s linear hashing) grows the table one
bucket at a time: it keeps a pointer to the “next bucket to split,” and on
each expansion step it splits exactly that one bucket, rehashing only the
handful of entries that lived in it into two buckets. The hash-to-bucket
mapping uses two masks — a low_mask covering the lower half of the table
and a high_mask covering the whole — so that a key is hashed with the wider
mask only if the wider-masked bucket has already been split into existence.
The table thus grows continuously, never pausing to rehash everything, and
the maximum work per insert stays bounded.
Two further requirements come not from textbook hashing but from the systems context PostgreSQL operates in:
- Shared memory and discovery. A multi-process server needs hash tables that several backend processes share. Such a table must be carved from a fixed shared-memory arena at server start, addressed identically in every process (so it cannot move or expand its top-level directory), and findable by name by a backend that did not create it.
- Concurrency without a global lock. A single lock over a hot shared table (the buffer-lookup table, the lock table) serializes every backend that touches it. The data structure must support being carved into independently-lockable partitions, so that operations on different partitions proceed in parallel.
These two requirements push hard toward separate chaining (stable pointers
survive a neighbor’s bucket split; chains in different partitions never
interfere) and toward Larson’s directory-of-segments layout (a fixed
directory that never moves, with growth confined to allocating new segments).
That is precisely the design dynahash implements. The complementary case —
a local, single-threaded, allocation- and cache-sensitive hot path that
needs neither shared memory nor partitioning nor stable pointers — is served
by a different structure entirely (simplehash.h), and the contrast between
the two is the throughline of this document.
Common DBMS Design
Section titled “Common DBMS Design”Every database engine accumulates a small zoo of in-memory hash tables — buffer-pool lookup (page identifier to frame), lock tables (lock tag to lock object), catalog/relation caches, plan and prepared-statement caches, the per-query grouping and hash-join tables — and converges on a recurring set of conventions for building them. Naming those conventions first lets the PostgreSQL symbols in the next section read as choices within a shared design space rather than inventions.
Two hash tables, not one
Section titled “Two hash tables, not one”Mature engines almost always ship two distinct hash-table facilities, because the requirements pull in opposite directions:
- A general-purpose, shareable table — separate-chaining, with stable entry addresses, the ability to live in shared memory, and partitioned locking. This is what the buffer pool and lock manager need: many processes, long-lived entries that other code points at, concurrency.
- A specialized, local, fast table — open-addressing, cache-dense, type-specialized at compile time, no indirection. This is what a hash aggregate or hash join needs inside one executor node in one backend: millions of probes per second, no sharing, no stable-pointer requirement, throw the whole thing away at end of query.
dynahash is the first; simplehash is the second. (CUBRID makes a similar
split between its general mht_* memory hash tables and specialized inline
structures.)
Directory + segments for smooth growth
Section titled “Directory + segments for smooth growth”The shareable table almost universally avoids a monolithic bucket array. The standard layout is a two-level one: a fixed-size directory of pointers, each pointing to a segment (a contiguous run of bucket headers). Growth allocates a new segment and (if needed) a bigger directory; the existing segments never move. This is what makes Larson-style one-bucket-at-a-time expansion compatible with shared memory: in a shared table the directory is sized once up front (it must stay at a fixed address) and only segment allocation happens at runtime.
flowchart LR
subgraph dir["Directory (dsize entries)"]
d0["dir[0]"]
d1["dir[1]"]
d2["dir[2]"]
dn["dir[...]"]
end
subgraph seg0["Segment 0 (ssize bucket headers)"]
b0["bucket 0 -> HASHELEMENT -> HASHELEMENT -> NULL"]
b1["bucket 1 -> NULL"]
bk["bucket ssize-1 -> HASHELEMENT -> NULL"]
end
subgraph seg1["Segment 1"]
c0["bucket ssize -> HASHELEMENT -> NULL"]
ck["..."]
end
d0 --> seg0
d1 --> seg1
d2 --> dn
A bucket number is decomposed into (segment_num, segment_ndx) by shifting
and masking with the segment size, exactly as a virtual address decomposes
into page and offset.
Entry-embedded keys and cached hash values
Section titled “Entry-embedded keys and cached hash values”The general table stores each entry as a small fixed header (a next-pointer
for the chain, plus the cached hash value so that a chain walk can reject
non-matching entries with an integer compare before paying for a full key
comparison) immediately followed by the caller’s key-and-data payload. The
key sits at the front of the payload so a single memcmp/strcmp over
keysize bytes settles equality. Caching the hash value in the header is the
standard trick that makes both chain walks and resizing cheap (resizing
re-buckets by cached hash, never re-hashing keys).
Partitioned locking via low-order hash bits
Section titled “Partitioned locking via low-order hash bits”To shard a hot shared table, the engine reserves the low-order bits of the hash value as a partition number and gives each partition its own lock. Because a good hash mixes all bits, the low-order bits distribute entries evenly across partitions; and because the bucket-mapping function is built so that buckets with the same low-order bits stay within one partition, a whole partition is independently lockable. Operations compute the partition, acquire only that partition’s lock, and never contend with operations in other partitions. The cost is that the table can no longer split buckets on-the-fly (a split would move entries across partition boundaries mid-flight), so partitioned tables are pre-sized and never expand their bucket count.
Sequential scan with split-inhibition
Section titled “Sequential scan with split-inhibition”Iterating a chained hash table means walking the directory bucket by bucket, each bucket’s chain in turn. The hazard is a concurrent expansion: if a bucket splits while the scan is partway through it, the scan can miss entries or visit them twice. The standard defense is to register active scans and inhibit bucket splits while any scan is open (deferring the split to a later insert), which is cheap because scans are rare relative to lookups.
PostgreSQL’s Approach
Section titled “PostgreSQL’s Approach”PostgreSQL’s dynahash.c is a direct, lightly-modernized descendant of
Larson’s 1988 algorithm (the file’s historical comment credits the CACM paper
and the 1990 Berkeley additions for “multiple table interface” and the
“ctl structure for shared memory”). The same code compiles to two very
different runtime shapes selected by flags at hash_create time: a
backend-local table (its own memory context, palloc allocator that throws on
OOM, free to expand) and a shared-memory table (header and directory
preallocated in the shared arena, a NULL-returning allocator, fixed
directory). The file’s own header comment lays out dynahash’s niche versus
simplehash precisely:
// dynahash.c — src/backend/utils/hash/dynahash.c * Compared to simplehash, dynahash has the following benefits: * * - It supports partitioning, which is useful for shared memory access using * locks. * - Shared memory hashes are allocated in a fixed size area at startup and * are discoverable by name from other processes. * - Because entries don't need to be moved in the case of hash conflicts, * dynahash has better performance for large entries. * - Guarantees stable pointers to entries.Entry layout: header then key then data
Section titled “Entry layout: header then key then data”Every entry is a HASHELEMENT header — a chain link and the cached 32-bit
hash value — with the caller’s payload glued on right after it at a
MAXALIGN boundary. The key must sit at the start of that payload. Two
macros convert between the header pointer and the key/payload pointer the
caller actually sees:
// HASHELEMENT — src/include/utils/hsearch.htypedef struct HASHELEMENT{ struct HASHELEMENT *link; /* link to next entry in same bucket */ uint32 hashvalue; /* hash function result for this entry */} HASHELEMENT;
// ELEMENTKEY / ELEMENT_FROM_KEY — src/backend/utils/hash/dynahash.c#define ELEMENTKEY(helem) (((char *)(helem)) + MAXALIGN(sizeof(HASHELEMENT)))#define ELEMENT_FROM_KEY(key) \ ((HASHELEMENT *) (((char *) (key)) - MAXALIGN(sizeof(HASHELEMENT))))This is what licenses dynahash’s “stable pointers” guarantee: hash_search
returns ELEMENTKEY(currBucket), a pointer to the caller’s data that remains
valid until that entry is removed, regardless of what inserts, deletes, or
bucket splits happen to other entries. The buffer manager relies on this when
it holds a BufferDesc-adjacent pointer; the lock manager relies on it when a
PROCLOCK references a LOCK.
The directory/segment/bucket geometry
Section titled “The directory/segment/bucket geometry”The control header HASHHDR (which lives in shared memory for a shared table,
with each backend keeping a thin local HTAB that points at it) holds the
sizing state. The directory dir is an array of HASHSEGMENTs; each segment
is an array of ssize HASHBUCKET chain-heads:
// HASHHDR (excerpt) — src/backend/utils/hash/dynahash.cstruct HASHHDR{ FreeListData freeList[NUM_FREELISTS];
/* These fields can change, but not in a partitioned table */ long dsize; /* directory size */ long nsegs; /* number of allocated segments (<= dsize) */ uint32 max_bucket; /* ID of maximum bucket in use */ uint32 high_mask; /* mask to modulo into entire table */ uint32 low_mask; /* mask to modulo into lower half of table */
/* These fields are fixed at hashtable creation */ Size keysize; /* hash key length in bytes */ Size entrysize; /* total user element size in bytes */ long num_partitions; /* # partitions (must be power of 2), or 0 */ long max_dsize; /* 'dsize' limit if directory is fixed size */ long ssize; /* segment size --- must be power of 2 */ int sshift; /* segment shift = log2(ssize) */ int nelem_alloc; /* number of entries to allocate at once */};Mapping a hash value to a bucket is the Larson two-mask trick: mask with
high_mask (the whole table); if the result exceeds the highest bucket that
actually exists yet, fold it back down with low_mask (the lower half). This
is exactly what lets the table exist at a non-power-of-two number of buckets
mid-expansion:
// calc_bucket — src/backend/utils/hash/dynahash.cstatic inline uint32calc_bucket(HASHHDR *hctl, uint32 hash_val){ uint32 bucket;
bucket = hash_val & hctl->high_mask; if (bucket > hctl->max_bucket) bucket = bucket & hctl->low_mask;
return bucket;}The bucket number then splits into a directory index and an in-segment index
by shifting by sshift (= log2 of segment size) and masking — the MOD(x,y)
macro is (x) & ((y)-1), valid because ssize is a power of two:
// hash_initial_lookup — src/backend/utils/hash/dynahash.cbucket = calc_bucket(hctl, hashvalue);
segment_num = bucket >> hashp->sshift;segment_ndx = MOD(bucket, hashp->ssize);
segp = hashp->dir[segment_num];
if (segp == NULL) hash_corrupted(hashp);
*bucketptr = &segp[segment_ndx];Local vs shared: one code path, two shapes
Section titled “Local vs shared: one code path, two shapes”hash_create branches on HASH_SHARED_MEM. For a shared table the HTAB
header is allocated in TopMemoryContext but the HASHHDR and directory come
from the caller-provided shared region (info->hctl), and HASH_ATTACH lets
a second backend bind to an already-initialized table by copying a few local
constants and returning immediately. For a local table everything — header,
directory, segments, entries — lives in a private AllocSet context created
for the table, which is what makes hash_destroy a one-liner
(MemoryContextDelete):
// hash_create (excerpt) — src/backend/utils/hash/dynahash.cif (flags & HASH_SHARED_MEM){ hashp->hctl = info->hctl; hashp->dir = (HASHSEGMENT *) (((char *) info->hctl) + sizeof(HASHHDR)); hashp->hcxt = NULL; hashp->isshared = true;
/* hash table already exists, we're just attaching to it */ if (flags & HASH_ATTACH) { /* make local copies of some heavily-used values */ hctl = hashp->hctl; hashp->keysize = hctl->keysize; hashp->ssize = hctl->ssize; hashp->sshift = hctl->sshift; return hashp; }}The flags also select the hash/compare/copy functions: HASH_STRINGS (C
strings, string_hash/string_compare/strlcpy), HASH_BLOBS (binary,
tag_hash or the uint32_hash fast path for 4-byte keys, memcmp/memcpy),
or HASH_FUNCTION (caller-supplied). HASH_ELEM (key/entry sizes) is now
mandatory, asserted at the top of hash_create.
The search/enter/expand flow
Section titled “The search/enter/expand flow”hash_search_with_hash_value is where the whole structure comes together: an
enter operation may first trigger a one-bucket expansion, then it always does
the directory lookup and chain walk, then forks on the action. The expansion
check is the only place expand_table is called, and it is short-circuited
for partitioned, frozen, or being-scanned tables — the three states in which a
bucket split would be unsafe.
flowchart TD
start["hash_search_with_hash_value: key, hash, action"]
enter{"action is ENTER<br/>or ENTER_NULL?"}
expandok{"nentries gt max_bucket<br/>AND not partitioned<br/>AND not frozen<br/>AND no active seq scan?"}
expand["expand_table: split one bucket,<br/>rehome its chain"]
lookup["hash_initial_lookup:<br/>calc_bucket then segment then chain head"]
walk["walk chain: compare cached hashvalue,<br/>then match on key bytes"]
found{"matching entry found?"}
act{"which action?"}
rfind["FIND: return ELEMENTKEY or NULL"]
rremove["REMOVE: unlink, push onto freelist,<br/>decrement nentries, return dangling ptr"]
renterhit["ENTER hit: return existing ELEMENTKEY"]
rentermiss["ENTER miss: get_hash_entry,<br/>link in, copy key, caller fills data"]
start --> enter
enter -- yes --> expandok
enter -- no --> lookup
expandok -- yes --> expand --> lookup
expandok -- no --> lookup
lookup --> walk --> found
found --> act
act -- FIND --> rfind
act -- REMOVE --> rremove
act -- ENTER/found --> renterhit
act -- ENTER/not found --> rentermiss
Partitioned shared tables
Section titled “Partitioned shared tables”A shared table flagged HASH_PARTITION trades on-the-fly expansion for
concurrency. The mechanism is the freeList[NUM_FREELISTS] array: instead of
one global (nentries, freeList) pair under one lock, the counts and free
chains are sharded into 32 slices, each with its own spinlock. The partition
of a hash value is its low-order bits, and the file-level comment explains why
that aligns with calc_bucket:
// dynahash.c partitioning comment — src/backend/utils/hash/dynahash.c * To use a hash table in partitioned mode, the HASH_PARTITION flag must be * given to hash_create. This prevents any attempt to split buckets on-the-fly. * Therefore, each hash bucket chain operates independently, and no fields * of the hash header change after init except nentries and freeList. ... * We expect callers to use the low-order bits of a lookup key's hash value * as a partition number --- this will work because of the way calc_bucket() * maps hash values to bucket numbers.
// FreeListData / FREELIST_IDX — src/backend/utils/hash/dynahash.ctypedef struct{ slock_t mutex; /* spinlock for this freelist */ long nentries; /* number of entries in associated buckets */ HASHELEMENT *freeList; /* chain of free elements */} FreeListData;
#define IS_PARTITIONED(hctl) ((hctl)->num_partitions != 0)#define FREELIST_IDX(hctl, hashcode) \ (IS_PARTITIONED(hctl) ? (hashcode) % NUM_FREELISTS : 0)The two canonical clients are the buffer-lookup table and the lock tables,
both created via ShmemInitHash with HASH_PARTITION:
// buf_table.c — src/backend/storage/buffer/buf_table.cinfo.num_partitions = NUM_BUFFER_PARTITIONS;...SharedBufHash = ShmemInitHash("Shared Buffer Lookup Table", size, size, &info, HASH_ELEM | HASH_BLOBS | HASH_PARTITION);
// lock.c — src/backend/storage/lmgr/lock.cLockMethodLockHash = ShmemInitHash("LOCK hash", init_table_size, max_table_size, &info, HASH_ELEM | HASH_BLOBS | HASH_PARTITION);Both compute the partition from the hash value (get_hash_value then take the
low bits) before acquiring the matching partition LWLock, then call
hash_search_with_hash_value so the already-computed hash is not recomputed.
The partition-lock count is a separate constant from NUM_FREELISTS; the
two need not be equal because each freelist’s coverage “might be more or less
than one partition” (hence each freelist carries its own spinlock rather than
relying on the caller’s partition lock).
Robin-hood open addressing in simplehash
Section titled “Robin-hood open addressing in simplehash”For the local, cache-bound hot paths, simplehash.h is a wholly separate
implementation that the preprocessor instantiates per element type. The
caller #defines SH_PREFIX, SH_ELEMENT_TYPE, SH_KEY, SH_HASH_KEY,
SH_EQUAL, etc., includes the header, and gets type-safe foo_hash,
foo_create, foo_insert, foo_lookup functions with no indirect calls. The
table is a single flat array of elements (open addressing), grown by doubling,
using a robin-hood variant of linear probing to bound clustering. The
header comment states the trade precisely:
// simplehash.h design notes — src/include/lib/simplehash.h * Compared to dynahash, simplehash has the following benefits: * - Due to the "templated" code generation has known structure sizes and no * indirect function calls ... * - Open addressing has better CPU cache behavior than dynahash's chained * hashtables. ... * of "robin hood" hashing is employed. Robin hood hashing optimizes * chaining lengths by moving elements close to their optimal bucket * ("rich" elements), out of the way if a to-be-inserted element is further * away from its optimal position (i.e. it's "poor").The contrast that matters for a reader choosing between them: simplehash has no stable pointers (robin-hood shifting and grow-time rehashing move entries), no shared-memory mode, and no partitioning — but for small entries probed millions of times in one backend it is materially faster. The two tables are complementary, not competitors.
Source Walkthrough
Section titled “Source Walkthrough”The code divides cleanly into create/destroy, search (the public CRUD), the
sequential-scan machinery, and the private growth/allocation utilities. The
two clients (buffer table, lock table) sit on top via ShmemInitHash, and
simplehash.h is an entirely parallel implementation in the include tree.
Creation and sizing (dynahash.c)
Section titled “Creation and sizing (dynahash.c)”hash_create is the single entry point. It asserts HASH_ELEM, picks the
hash/compare/copy functions from the flags, allocates the HTAB (and, for a
local table, a private memory context), fills HASHHDR defaults via
hdefault, then calls init_htab to build the directory and initial
segments. For a shared table — or a small local one — it preallocates entries
into the freelists so the first inserts don’t have to allocate.
hash_create— the constructor; flag dispatch, function selection, shared vs local context, partition setup, preallocation loop.hdefault— zeroHASHHDRand setDEF_DIRSIZE(256),DEF_SEGSIZE(256),NO_MAX_DSIZE,num_partitions = 0.choose_nelem_alloc— pick how many entries to allocate per chunk so the request rounds to a power of two (>= 32 entries).init_htab— derivemax_bucket/high_mask/low_maskfromnext_pow2_int(nelem), ensurenbuckets >= num_partitions, allocate the directory and the initial segments; initialize the per-freelist spinlocks for a partitioned table.hash_estimate_size/hash_select_dirsize/hash_get_shared_size— the shared-memory sizing helpersShmemInitHashcalls to reserve a fixed region (directory cannot grow once placed).hash_destroy— for a local table only:MemoryContextDelete(hashp->hcxt)frees everything; asserts the allocator is the default.
Search: find / enter / remove / update (dynahash.c)
Section titled “Search: find / enter / remove / update (dynahash.c)”hash_search computes the hash and delegates to
hash_search_with_hash_value, which is the workhorse. On HASH_ENTER it
first checks whether to expand (only for a non-partitioned, non-frozen table
with no active scan), then does the initial bucket lookup, walks the chain
comparing cached hash then key, and forks on the action.
hash_search— thin wrapper that hashes the key, then callshash_search_with_hash_value.hash_search_with_hash_value— the CRUD core; the expansion trigger,hash_initial_lookup, the chain walk (hashvalueequality gate thenmatch), and theHASH_FIND/HASH_REMOVE/HASH_ENTER/HASH_ENTER_NULLswitch. On remove it unlinks and pushes the element onto the freelist; on enter-not-found it pulls from the freelist viaget_hash_entryand copies the key.get_hash_value— exported hash-only helper so partitioned callers can compute the partition before locking.calc_bucket/hash_initial_lookup— hash to bucket to(segment, index)to chain-head pointer.hash_update_hash_key— re-key an existing entry in place: find it by its saved hash, find the destination chain, fail on duplicate, and relink only if the bucket changed (never touches the freelist, so it cannot OOM).get_hash_entry— pop a free element fromfreeList[freelist_idx]; if empty,element_alloca new chunk; if that fails in a partitioned table, borrow from other freelists before declaring OOM.hash_get_num_entries— sumnentriesacross freelists (caller must hold all partition locks).
The remove path is the clearest illustration of the chained design’s stable-pointer property and the partitioned-freelist bookkeeping:
// hash_search_with_hash_value, HASH_REMOVE — src/backend/utils/hash/dynahash.ccase HASH_REMOVE: if (currBucket != NULL) { if (IS_PARTITIONED(hctl)) SpinLockAcquire(&(hctl->freeList[freelist_idx].mutex));
Assert(hctl->freeList[freelist_idx].nentries > 0); hctl->freeList[freelist_idx].nentries--;
/* remove record from hash bucket's chain. */ *prevBucketPtr = currBucket->link;
/* add the record to the appropriate freelist. */ currBucket->link = hctl->freeList[freelist_idx].freeList; hctl->freeList[freelist_idx].freeList = currBucket;
if (IS_PARTITIONED(hctl)) SpinLockRelease(&hctl->freeList[freelist_idx].mutex);
return ELEMENTKEY(currBucket); } return NULL;Note the explicit warning in the comments: the returned pointer after
HASH_REMOVE is a dangling pointer — the element is back on the freelist and
will be reused on the next insert. And the HASH_ENTER path ends with a
pointed comment that the caller must fill the data field without any code that
could throw, since a thrown error would leave a half-initialized entry linked
into the table.
Expansion: one bucket at a time (dynahash.c)
Section titled “Expansion: one bucket at a time (dynahash.c)”expand_table is the Larson split step, only ever reached for a
non-partitioned table. It creates exactly one new bucket (max_bucket + 1),
allocating a new segment (and growing the directory via dir_realloc) if the
new bucket crosses into fresh territory, readjusts the masks if it crossed a
power of two, and rehomes the entries of the one old bucket that maps to the
new one:
// expand_table (excerpt) — src/backend/utils/hash/dynahash.cnew_bucket = hctl->max_bucket + 1;...hctl->max_bucket++;old_bucket = (new_bucket & hctl->low_mask);
if ((uint32) new_bucket > hctl->high_mask){ hctl->low_mask = hctl->high_mask; hctl->high_mask = (uint32) new_bucket | hctl->low_mask;}...for (currElement = *oldlink; currElement != NULL; currElement = nextElement){ nextElement = currElement->link; if ((long) calc_bucket(hctl, currElement->hashvalue) == old_bucket) { *oldlink = currElement; oldlink = &currElement->link; } else { *newlink = currElement; newlink = &currElement->link; }}*oldlink = NULL;*newlink = NULL;The expansion is triggered inside hash_search_with_hash_value on enter,
when the live entry count in freeList[0] exceeds max_bucket (load factor
- and the table is eligible:
// hash_search_with_hash_value, expansion trigger — src/backend/utils/hash/dynahash.cif (action == HASH_ENTER || action == HASH_ENTER_NULL){ if (hctl->freeList[0].nentries > (long) hctl->max_bucket && !IS_PARTITIONED(hctl) && !hashp->frozen && !has_seq_scans(hashp)) (void) expand_table(hashp);}The three suppressors are each meaningful: partitioned tables must never split
(IS_PARTITIONED), frozen tables forbid the bucket-count change (frozen),
and a table with an open hash_seq_search must not split mid-scan
(has_seq_scans). A failure to expand is non-fatal — the table just runs at a
higher fill factor.
expand_table— the single-bucket split; mask readjustment; rehoming loop.dir_realloc— double the directory (only whenmax_dsize == NO_MAX_DSIZE, i.e. a local table); copy old pointers, zero the rest,pfreethe old.seg_alloc— allocate and zero one segment ofssizebucket heads.element_alloc— allocatenelementries in one block, thread them ontofreeList[freelist_idx]; returns false (no throw) for a fixed-size table.
Sequential scan (dynahash.c)
Section titled “Sequential scan (dynahash.c)”hash_seq_init records the scan and (unless the table is frozen) registers it
in the per-backend active-scan list so inserts will suppress splits.
hash_seq_search walks bucket by bucket: it follows the current chain, and
when a chain ends it advances curBucket, decomposing each bucket into
(segment_num, segment_ndx) and skipping empty buckets:
// hash_seq_search (excerpt) — src/backend/utils/hash/dynahash.csegment_num = curBucket >> hashp->sshift;segment_ndx = MOD(curBucket, ssize);segp = hashp->dir[segment_num];
while ((curElem = segp[segment_ndx]) == NULL){ /* empty bucket, advance to next */ if (++curBucket > max_bucket) { status->curBucket = curBucket; hash_seq_term(status); return NULL; /* search is done */ } if (++segment_ndx >= ssize) { segment_num++; segment_ndx = 0; segp = hashp->dir[segment_num]; }}hash_seq_init/hash_seq_init_with_hash_value— start a full scan, or a scan restricted to the one bucket holding a given hash value.hash_seq_search— return entries one at a time; the deleting-current element is explicitly allowed, deleting others mid-scan is undefined.hash_seq_term— end-of-scan cleanup; deregister the scan.register_seq_scan/deregister_seq_scan/has_seq_scans— the fixed-size (MAX_SEQ_SCANS= 100) per-backend stack of open scans thatexpand_table’s trigger consults.hash_freeze— lock a local table against future inserts so scans need no registration/termination discipline; forbidden on shared tables.
simplehash template (simplehash.h)
Section titled “simplehash template (simplehash.h)”The whole file is macro machinery: SH_MAKE_NAME stamps the user prefix onto
every type and function. The runtime structure is one flat element array plus
sizing scalars:
// SH_TYPE — src/include/lib/simplehash.htypedef struct SH_TYPE{ uint64 size; /* size of data/bucket array */ uint32 members; /* how many elements have valid contents */ uint32 sizemask; /* mask for bucket and size calculations */ uint32 grow_threshold; /* boundary after which to grow hashtable */ SH_ELEMENT_TYPE *data; /* hash buckets */#ifndef SH_RAW_ALLOCATOR MemoryContext ctx;#endif void *private_data;} SH_TYPE;Sizing rounds the requested element count up by the fill factor (default 0.9)
to the next power of two, so sizemask = size - 1 and the optimal bucket is
just hash & sizemask:
// SH_UPDATE_PARAMETERS / SH_INITIAL_BUCKET — src/include/lib/simplehash.htb->size = size;tb->sizemask = (uint32) (size - 1);if (tb->size == SH_MAX_SIZE) tb->grow_threshold = ((double) tb->size) * SH_MAX_FILLFACTOR;else tb->grow_threshold = ((double) tb->size) * SH_FILLFACTOR;...static inline uint32SH_INITIAL_BUCKET(SH_TYPE * tb, uint32 hash){ return hash & tb->sizemask;}The robin-hood insert is the heart of the file. On a collision it compares the
displacement-from-optimal of the to-be-inserted entry against the colliding
entry’s. If the newcomer is “poorer” (farther from its optimal bucket), it
evicts the “richer” resident by shifting the run of entries forward into the
next empty slot, then drops in. It also force-grows the table if a single
insert would have to probe (SH_GROW_MAX_DIB, 25) or shift
(SH_GROW_MAX_MOVE, 150) too far:
// SH_INSERT_HASH_INTERNAL (excerpt) — src/include/lib/simplehash.hcurhash = SH_ENTRY_HASH(tb, entry);curoptimal = SH_INITIAL_BUCKET(tb, curhash);curdist = SH_DISTANCE_FROM_OPTIMAL(tb, curoptimal, curelem);
if (insertdist > curdist){ /* find next empty bucket, then shift the run forward by one */ ... moveelem = emptyelem; while (moveelem != curelem) { moveelem = SH_PREV(tb, moveelem, startelem); moveentry = &data[moveelem]; memcpy(lastentry, moveentry, sizeof(SH_ELEMENT_TYPE)); lastentry = moveentry; } tb->members++; entry->SH_KEY = key; entry->status = SH_STATUS_IN_USE; *found = false; return entry;}The lookup is plain linear probing that stops at the first SH_STATUS_EMPTY
slot — correct precisely because robin-hood deletion never leaves
tombstones (it back-shifts instead), so an empty slot truly means “not
present”:
// SH_LOOKUP_HASH_INTERNAL — src/include/lib/simplehash.hconst uint32 startelem = SH_INITIAL_BUCKET(tb, hash);uint32 curelem = startelem;while (true){ SH_ELEMENT_TYPE *entry = &tb->data[curelem]; if (entry->status == SH_STATUS_EMPTY) return NULL; if (SH_COMPARE_KEYS(tb, hash, key, entry)) return entry; curelem = SH_NEXT(tb, curelem, startelem);}SH_CREATE/SH_DESTROY/SH_RESET— lifecycle; create rounds size up bySH_FILLFACTOR.SH_INSERT/SH_INSERT_HASH/SH_INSERT_HASH_INTERNAL— robin-hood insert; grows on threshold or onSH_GROW_MAX_DIB/SH_GROW_MAX_MOVE.SH_LOOKUP/SH_LOOKUP_HASH/SH_LOOKUP_HASH_INTERNAL— linear probe to first empty.SH_DELETE/SH_DELETE_ITEM— delete with back-shift compaction (no tombstones).SH_GROW— double and re-place every entry, starting from a non-wrapped bucket so moves never conflict.SH_START_ITERATE/SH_ITERATE— iteration over the flat array, visiting a random-but-stable start bucket so an entry inserted during iteration is not double-visited.SH_STAT— diagnostics (max/avg chain length, collisions).
Position hints (as of 2026-06-05, REL_18 273fe94)
Section titled “Position hints (as of 2026-06-05, REL_18 273fe94)”| Symbol | File | Line |
|---|---|---|
HASHELEMENT | hsearch.h | 51 |
HASHCTL | hsearch.h | 65 |
HASHACTION | hsearch.h | 111 |
HASH_SEQ_STATUS | hsearch.h | 120 |
struct HASHHDR | dynahash.c | 168 |
FreeListData | dynahash.c | 153 |
struct HTAB | dynahash.c | 219 |
ELEMENTKEY / ELEMENT_FROM_KEY | dynahash.c | 244 |
FREELIST_IDX | dynahash.c | 212 |
hash_create | dynahash.c | 352 |
hdefault | dynahash.c | 630 |
choose_nelem_alloc | dynahash.c | 657 |
init_htab | dynahash.c | 690 |
hash_estimate_size | dynahash.c | 784 |
hash_select_dirsize | dynahash.c | 831 |
hash_get_shared_size | dynahash.c | 855 |
hash_destroy | dynahash.c | 866 |
get_hash_value | dynahash.c | 912 |
calc_bucket | dynahash.c | 919 |
hash_search | dynahash.c | 956 |
hash_search_with_hash_value | dynahash.c | 969 |
hash_update_hash_key | dynahash.c | 1146 |
get_hash_entry | dynahash.c | 1259 |
hash_get_num_entries | dynahash.c | 1344 |
hash_seq_init | dynahash.c | 1388 |
hash_seq_search | dynahash.c | 1423 |
hash_seq_term | dynahash.c | 1517 |
hash_freeze | dynahash.c | 1537 |
expand_table | dynahash.c | 1554 |
dir_realloc | dynahash.c | 1651 |
seg_alloc | dynahash.c | 1690 |
element_alloc | dynahash.c | 1709 |
hash_initial_lookup | dynahash.c | 1759 |
my_log2 | dynahash.c | 1797 |
SH_TYPE | simplehash.h | 145 |
SH_STATUS | simplehash.h | 175 |
SH_COMPUTE_SIZE | simplehash.h | 310 |
SH_UPDATE_PARAMETERS | simplehash.h | 336 |
SH_INITIAL_BUCKET | simplehash.h | 356 |
SH_DISTANCE_FROM_OPTIMAL | simplehash.h | 385 |
SH_CREATE | simplehash.h | 441 |
SH_GROW | simplehash.h | 493 |
SH_INSERT_HASH_INTERNAL | simplehash.h | 608 |
SH_LOOKUP_HASH_INTERNAL | simplehash.h | 799 |
SH_DELETE | simplehash.h | 856 |
SH_DELETE_ITEM | simplehash.h | 928 |
SH_ITERATE | simplehash.h | 1045 |
Source verification (as of 2026-06-05)
Section titled “Source verification (as of 2026-06-05)”Verified facts
Section titled “Verified facts”-
dynahash is Larson’s 1988 dynamic hashing, grown one bucket at a time. The historical comment in
dynahash.ccredits “Dynamic hashing, after CACM April 1988 pp 446-457, by Per-Ake Larson.” Verified thatexpand_tableadds exactly one bucket per call (new_bucket = hctl->max_bucket + 1; ... hctl->max_bucket++;) and rehomes only the single old bucket(new_bucket & low_mask), splitting its chain in two. The two-maskcalc_bucket(high_maskthen fold tolow_mask) is the linear-hashing address calculation. -
One code base serves both local and shared tables, selected by flags. Verified in
hash_create:HASH_SHARED_MEMpointshashp->hctl/hashp->dirat the caller’s shared region and setsisshared = true; the non-shared path creates a privateAllocSetContext.HASH_ATTACHreturns early after copying local constants, letting a second backend bind to an existing table. -
Entries have stable addresses; the key is embedded after a HASHELEMENT header. Verified in
hsearch.h(HASHELEMENT=link+hashvalue) and theELEMENTKEY/ELEMENT_FROM_KEYmacros indynahash.c(key atMAXALIGN(sizeof(HASHELEMENT))past the header).hash_searchreturnsELEMENTKEY(currBucket); removal pushes the same element onto a freelist rather than freeing it, so the address is reused but never relocated while live. This is the “Guarantees stable pointers to entries” bullet in the file header. -
Partitioned mode shards
(nentries, freeList)into NUM_FREELISTS=32 spinlock-guarded slices and forbids on-the-fly splits. Verified:FreeListDatacarries a per-sliceslock_t mutex;FREELIST_IDXmaps a hashcode to a slice;expand_tableasserts!IS_PARTITIONED(hctl); and thehash_search_with_hash_valueexpansion trigger is gated on!IS_PARTITIONED(hctl).get_hash_entryborrows from other freelists under their spinlocks before declaring OOM, satisfying callers that assume a delete guarantees a subsequent insert can succeed. -
The buffer manager and lock manager are the canonical partitioned clients. Verified
buf_table.ccreatesSharedBufHashwithHASH_ELEM | HASH_BLOBS | HASH_PARTITIONandnum_partitions = NUM_BUFFER_PARTITIONS;lock.ccreatesLockMethodLockHashandLockMethodProcLockHashwithHASH_PARTITION. Both callget_hash_valueto derive the partition before locking, thenhash_search_with_hash_value. -
Expansion is triggered at load factor 1 and suppressed by partitioning, freezing, or an active sequential scan. Verified the trigger condition
hctl->freeList[0].nentries > (long) hctl->max_bucket && !IS_PARTITIONED && !hashp->frozen && !has_seq_scans(hashp). Failure to expand is explicitly non-fatal per the in-code comment. -
hash_seq_searchwalks the directory bucket-by-bucket and registers the scan to inhibit splits. Verifiedhash_seq_initcallsregister_seq_scan(unless frozen);hash_seq_searchdecomposescurBucketinto(segment_num, segment_ndx)viasshift/MOD, skips empty buckets, and follows->linkchains.MAX_SEQ_SCANSis 100. -
simplehash is a separate macro-templated open-addressing robin-hood hash with no stable pointers, no shared memory, and no partitioning. Verified
simplehash.h’sSH_TYPEis a single flatSH_ELEMENT_TYPE *dataarray;SH_INSERT_HASH_INTERNALperforms robin-hood shifting (insertdist > curdistevicts the richer resident bymemcpy-shifting the run);SH_LOOKUP_HASH_INTERNALstops at the firstSH_STATUS_EMPTY(valid becauseSH_DELETEback-shifts rather than leaving tombstones). DefaultSH_FILLFACTOR0.9,SH_MAX_FILLFACTOR0.98, force-grow atSH_GROW_MAX_DIB25 /SH_GROW_MAX_MOVE150. -
HASH_ELEM is now mandatory and the key-class flag is exactly one of STRINGS/BLOBS/FUNCTION. Verified the
Assert(flags & HASH_ELEM)at the top ofhash_create, theAssert(info->keysize > 8)guard against mistaking an integer key for a string, and theuint32_hashfast path for 4-byte BLOB keys.
Open questions
Section titled “Open questions”-
How often the freelist-borrowing path in
get_hash_entryactually fires under real partitioned workloads. The borrowing loop exists for correctness (a delete must guarantee a later insert can find an element even across partitions), but whether it is a measurable hot path under buffer- or lock-table churn is not evident from the code. Investigation path: counter instrumentation on theborrow_from_idxloop under a contended pgbench run. -
The practical cost of the load-factor-1 expansion threshold versus simplehash’s 0.9. dynahash splits when live entries exceed
max_bucket(load factor ~1), tolerating longer chains; simplehash grows at 0.9 to keep probe runs short. Whether dynahash’s higher target load factor materially lengthens chain walks for the catalog/relcache local tables is unmeasured here. Investigation path:hash_stats(HASH_STATISTICS build) on a busy relcache. -
Whether any current shared table would benefit from a NUM_FREELISTS other than 32. The freelist count is a compile-time constant decoupled from the partition-lock count; the comment notes a freelist’s coverage “might be more or less than one partition.” Whether 32 is optimal for the NUM_BUFFER_PARTITIONS / lock-partition counts in REL_18 is not established in-tree. Investigation path: cache-line-contention profiling of the freelist spinlocks under high buffer-eviction load.
Beyond PostgreSQL — Comparative Designs & Research Frontiers
Section titled “Beyond PostgreSQL — Comparative Designs & Research Frontiers”-
Larson 1988 / Litwin linear hashing vs. extendible hashing. dynahash follows Larson’s linear-hashing lineage (split the next bucket in round- robin order, two-mask addressing) rather than Fagin’s extendible hashing (a directory indexed by the top
dbits, doubling the directory on overflow). Extendible hashing bounds the worst-case to two disk accesses and is the classic choice for on-disk hash indexes; linear hashing wins for in-memory tables because it never doubles the directory in one step. A note mappingcalc_bucket’s two-mask scheme onto Litwin’sh_i/h_{i+1}split functions would sharpen the lineage. (Database System Concepts ch. 14 covers both; Database Internals ch. 6 covers extendible hashing for storage engines.) -
Chaining + stable pointers vs. open addressing. The dynahash/simplehash split inside one codebase is a clean natural experiment in the chaining-vs- open-addressing trade. dynahash keeps stable pointers and partitionability at the cost of pointer chasing and per-entry headers; simplehash keeps cache density and inlined comparisons at the cost of moving entries. A comparative microbenchmark (large entries, pointer-held-across-ops workload vs. small entries, probe-heavy workload) would quantify the crossover the file headers assert qualitatively.
-
Robin-hood hashing vs. Swiss tables / hopscotch. simplehash’s robin-hood variant bounds probe-run variance by displacement-balancing. Modern alternatives — Google’s SwissTable (SIMD metadata bytes, abseil
flat_hash_map) and hopscotch hashing — push open-addressing locality further. Positioning simplehash against a SwissTable-style metadata-probe design would frame what PostgreSQL gives up by staying with per-slot status bytes and scalar probing. -
Partitioned dynahash vs. lock-free / sharded concurrent hash maps. dynahash partitions by low-order hash bits with an external LWLock per partition. Lock-free hash maps (split-ordered lists, Cliff Click’s non-blocking hash map) and finer-grained sharded designs eliminate the per-partition lock entirely. The CMU/EPFL “scalable lock manager” work (captured in
dbms-papers/scalable-lock-manager.md) is the database-specific frontier here: it rethinks exactly the lock-table-over-partitioned-hash structure thatLockMethodLockHashembodies. Comparing dynahash’s coarse partition locks against that design is a research-grade exercise in shared-table concurrency. -
CUBRID’s memory hash tables (
mht_*). CUBRID uses a general chained memory-hash facility for its catalog and session structures, much as dynahash serves PostgreSQL’s local caches. A side-by-side of CUBRID’smht_create/mht_get/mht_putrehash strategy against dynahash’s one-bucket-at-a-time expansion would highlight whether CUBRID pays the stop-the-world rehash that Larson’s scheme is designed to avoid. (See the CUBRID code-analysis tree.)
Sources
Section titled “Sources”In-tree source files (REL_18_STABLE, commit 273fe94)
Section titled “In-tree source files (REL_18_STABLE, commit 273fe94)”src/backend/utils/hash/dynahash.c— the implementation:hash_create,hash_search_with_hash_value,expand_table,get_hash_entry,hash_seq_search, the directory/segment/bucket geometry, partitioned freelists, and the Larson historical comment.src/include/utils/hsearch.h— the public surface:HASHELEMENT,HASHCTL, theHASH_*flag bits,HASHACTION,HASH_SEQ_STATUS, and the function prototypes.src/include/lib/simplehash.h— the macro-templated open-addressing robin-hood hash:SH_TYPE,SH_INSERT_HASH_INTERNAL,SH_LOOKUP_HASH_INTERNAL,SH_GROW, and the design-notes header comment.src/backend/storage/buffer/buf_table.c—SharedBufHash, the canonical partitionedHASH_BLOBSclient.src/backend/storage/lmgr/lock.c—LockMethodLockHash/LockMethodProcLockHash, partitioned lock tables built on dynahash.
Papers and textbook chapters
Section titled “Papers and textbook chapters”- Larson, P.-Å. (1988). “Dynamic Hash Tables.” Communications of the ACM
31(4):446-457. The expansion algorithm dynahash implements; cited by name in
the
dynahash.chistorical comment. - Litwin, W. (1980). “Linear Hashing: A New Tool for File and Table Addressing.” Proc. VLDB. The linear-hashing scheme Larson refines.
- Fagin, R., Nievergelt, J., Pippenger, N. & Strong, H. R. (1979). “Extendible Hashing — A Fast Access Method for Dynamic Files.” ACM TODS 4(3). The directory-doubling alternative dynahash does not use.
- Database System Concepts (Silberschatz, Korth, Sudarshan, 7e), ch. 14
“Indexing” — static vs. dynamic hashing, linear and extendible hashing
(
knowledge/research/dbms-general/database-system-concepts.md). - Database Internals (Petrov 2019), ch. 6 — extendible hashing and in-memory
table structures (
knowledge/research/dbms-general/database-internals.md). - “A Scalable Lock Manager for Multicores” — the partitioned-lock-table
concurrency frontier (
knowledge/research/dbms-papers/scalable-lock-manager.md).
Sibling docs (cross-references — mechanism owned there, not duplicated here)
Section titled “Sibling docs (cross-references — mechanism owned there, not duplicated here)”postgres-shared-memory-ipc.md—ShmemInitHash, the shared-memory arena, andShmemInitStructthat allocate the fixed region a shared dynahash lives in.postgres-buffer-manager.md—SharedBufHashand the buffer-partition LWLocks that ride on dynahash’sHASH_PARTITIONmode.postgres-lock-manager.md—LockMethodLockHash/LockMethodProcLockHash, the partitioned lock tables and their partition LWLocks.postgres-memory-contexts.md— theAllocSetcontexts a local dynahash allocates from and destroys with.postgres-architecture-overview.md— Axis “base infrastructure” placement of the dynahash/simplehash facilities.