Skip to content

PostgreSQL Dynamic Hash Tables (dynahash) — Larson Linear Hashing, Partitioned Shared Hashes, and the simplehash Template

Contents:

A hash table is the canonical associative data structure: it maps keys to values in expected O(1) time by reducing the key to a small integer (the hash value) and using that integer to index a bucket array. Database System Concepts (Silberschatz 7e, ch. 14 “Indexing”, and ch. 24’s discussion of in-memory structures) frames the two design axes that every hash table must settle: collision resolution — what happens when two distinct keys land in the same bucket — and resizing — what happens when the table fills up and the average chain length (the load factor) climbs past the point where lookups stop being O(1).

On collision resolution, the textbook names the two families that organize this entire document:

  1. Separate chaining. Each bucket holds the head of a linked list of entries that hashed to it. A collision appends to the list. Lookups walk the chain comparing keys. The entries live wherever the allocator put them; the bucket array holds only pointers. The decisive property is that an entry’s address never changes for the entry’s lifetime — inserting or deleting other entries cannot move it — so callers may hold a raw pointer into the table across other operations.
  2. Open addressing. All entries live inside the bucket array itself; on collision the insert probes a deterministic sequence of alternate buckets (linear, quadratic, or double-hashing) until it finds an empty slot. There are no per-entry allocations and no pointer chasing, so the whole structure is dense and CPU-cache-friendly. The price is that entries get moved (during probing, deletion compaction, and resizing), so a caller cannot hold a stable pointer, and deletions are awkward (classically requiring tombstones to keep probe chains intact).

On resizing, the naive answer — allocate a new array of double the size and re-hash every element into it — incurs a O(N) stop-the-world pause every time the table doubles. For a long-lived table whose size is hard to predict this is acceptable amortized, but the pause itself is undesirable, and for a hash table living in a fixed shared-memory region that cannot be reallocated at a stable address, doubling the bucket array in place is simply impossible. The classic answer to smooth growth is Per-Åke Larson’s “Dynamic Hash Tables” (CACM 31(4), April 1988, pp. 446-457) — the paper named verbatim in the historical comment at the top of dynahash.c. Larson’s scheme (a refinement of Litwin’s linear hashing) grows the table one bucket at a time: it keeps a pointer to the “next bucket to split,” and on each expansion step it splits exactly that one bucket, rehashing only the handful of entries that lived in it into two buckets. The hash-to-bucket mapping uses two masks — a low_mask covering the lower half of the table and a high_mask covering the whole — so that a key is hashed with the wider mask only if the wider-masked bucket has already been split into existence. The table thus grows continuously, never pausing to rehash everything, and the maximum work per insert stays bounded.

Two further requirements come not from textbook hashing but from the systems context PostgreSQL operates in:

  1. Shared memory and discovery. A multi-process server needs hash tables that several backend processes share. Such a table must be carved from a fixed shared-memory arena at server start, addressed identically in every process (so it cannot move or expand its top-level directory), and findable by name by a backend that did not create it.
  2. Concurrency without a global lock. A single lock over a hot shared table (the buffer-lookup table, the lock table) serializes every backend that touches it. The data structure must support being carved into independently-lockable partitions, so that operations on different partitions proceed in parallel.

These two requirements push hard toward separate chaining (stable pointers survive a neighbor’s bucket split; chains in different partitions never interfere) and toward Larson’s directory-of-segments layout (a fixed directory that never moves, with growth confined to allocating new segments). That is precisely the design dynahash implements. The complementary case — a local, single-threaded, allocation- and cache-sensitive hot path that needs neither shared memory nor partitioning nor stable pointers — is served by a different structure entirely (simplehash.h), and the contrast between the two is the throughline of this document.

Every database engine accumulates a small zoo of in-memory hash tables — buffer-pool lookup (page identifier to frame), lock tables (lock tag to lock object), catalog/relation caches, plan and prepared-statement caches, the per-query grouping and hash-join tables — and converges on a recurring set of conventions for building them. Naming those conventions first lets the PostgreSQL symbols in the next section read as choices within a shared design space rather than inventions.

Mature engines almost always ship two distinct hash-table facilities, because the requirements pull in opposite directions:

  • A general-purpose, shareable table — separate-chaining, with stable entry addresses, the ability to live in shared memory, and partitioned locking. This is what the buffer pool and lock manager need: many processes, long-lived entries that other code points at, concurrency.
  • A specialized, local, fast table — open-addressing, cache-dense, type-specialized at compile time, no indirection. This is what a hash aggregate or hash join needs inside one executor node in one backend: millions of probes per second, no sharing, no stable-pointer requirement, throw the whole thing away at end of query.

dynahash is the first; simplehash is the second. (CUBRID makes a similar split between its general mht_* memory hash tables and specialized inline structures.)

The shareable table almost universally avoids a monolithic bucket array. The standard layout is a two-level one: a fixed-size directory of pointers, each pointing to a segment (a contiguous run of bucket headers). Growth allocates a new segment and (if needed) a bigger directory; the existing segments never move. This is what makes Larson-style one-bucket-at-a-time expansion compatible with shared memory: in a shared table the directory is sized once up front (it must stay at a fixed address) and only segment allocation happens at runtime.

flowchart LR
  subgraph dir["Directory (dsize entries)"]
    d0["dir[0]"]
    d1["dir[1]"]
    d2["dir[2]"]
    dn["dir[...]"]
  end
  subgraph seg0["Segment 0 (ssize bucket headers)"]
    b0["bucket 0 -> HASHELEMENT -> HASHELEMENT -> NULL"]
    b1["bucket 1 -> NULL"]
    bk["bucket ssize-1 -> HASHELEMENT -> NULL"]
  end
  subgraph seg1["Segment 1"]
    c0["bucket ssize -> HASHELEMENT -> NULL"]
    ck["..."]
  end
  d0 --> seg0
  d1 --> seg1
  d2 --> dn

A bucket number is decomposed into (segment_num, segment_ndx) by shifting and masking with the segment size, exactly as a virtual address decomposes into page and offset.

Entry-embedded keys and cached hash values

Section titled “Entry-embedded keys and cached hash values”

The general table stores each entry as a small fixed header (a next-pointer for the chain, plus the cached hash value so that a chain walk can reject non-matching entries with an integer compare before paying for a full key comparison) immediately followed by the caller’s key-and-data payload. The key sits at the front of the payload so a single memcmp/strcmp over keysize bytes settles equality. Caching the hash value in the header is the standard trick that makes both chain walks and resizing cheap (resizing re-buckets by cached hash, never re-hashing keys).

Partitioned locking via low-order hash bits

Section titled “Partitioned locking via low-order hash bits”

To shard a hot shared table, the engine reserves the low-order bits of the hash value as a partition number and gives each partition its own lock. Because a good hash mixes all bits, the low-order bits distribute entries evenly across partitions; and because the bucket-mapping function is built so that buckets with the same low-order bits stay within one partition, a whole partition is independently lockable. Operations compute the partition, acquire only that partition’s lock, and never contend with operations in other partitions. The cost is that the table can no longer split buckets on-the-fly (a split would move entries across partition boundaries mid-flight), so partitioned tables are pre-sized and never expand their bucket count.

Iterating a chained hash table means walking the directory bucket by bucket, each bucket’s chain in turn. The hazard is a concurrent expansion: if a bucket splits while the scan is partway through it, the scan can miss entries or visit them twice. The standard defense is to register active scans and inhibit bucket splits while any scan is open (deferring the split to a later insert), which is cheap because scans are rare relative to lookups.

PostgreSQL’s dynahash.c is a direct, lightly-modernized descendant of Larson’s 1988 algorithm (the file’s historical comment credits the CACM paper and the 1990 Berkeley additions for “multiple table interface” and the “ctl structure for shared memory”). The same code compiles to two very different runtime shapes selected by flags at hash_create time: a backend-local table (its own memory context, palloc allocator that throws on OOM, free to expand) and a shared-memory table (header and directory preallocated in the shared arena, a NULL-returning allocator, fixed directory). The file’s own header comment lays out dynahash’s niche versus simplehash precisely:

// dynahash.c — src/backend/utils/hash/dynahash.c
* Compared to simplehash, dynahash has the following benefits:
*
* - It supports partitioning, which is useful for shared memory access using
* locks.
* - Shared memory hashes are allocated in a fixed size area at startup and
* are discoverable by name from other processes.
* - Because entries don't need to be moved in the case of hash conflicts,
* dynahash has better performance for large entries.
* - Guarantees stable pointers to entries.

Every entry is a HASHELEMENT header — a chain link and the cached 32-bit hash value — with the caller’s payload glued on right after it at a MAXALIGN boundary. The key must sit at the start of that payload. Two macros convert between the header pointer and the key/payload pointer the caller actually sees:

// HASHELEMENT — src/include/utils/hsearch.h
typedef struct HASHELEMENT
{
struct HASHELEMENT *link; /* link to next entry in same bucket */
uint32 hashvalue; /* hash function result for this entry */
} HASHELEMENT;
// ELEMENTKEY / ELEMENT_FROM_KEY — src/backend/utils/hash/dynahash.c
#define ELEMENTKEY(helem) (((char *)(helem)) + MAXALIGN(sizeof(HASHELEMENT)))
#define ELEMENT_FROM_KEY(key) \
((HASHELEMENT *) (((char *) (key)) - MAXALIGN(sizeof(HASHELEMENT))))

This is what licenses dynahash’s “stable pointers” guarantee: hash_search returns ELEMENTKEY(currBucket), a pointer to the caller’s data that remains valid until that entry is removed, regardless of what inserts, deletes, or bucket splits happen to other entries. The buffer manager relies on this when it holds a BufferDesc-adjacent pointer; the lock manager relies on it when a PROCLOCK references a LOCK.

The control header HASHHDR (which lives in shared memory for a shared table, with each backend keeping a thin local HTAB that points at it) holds the sizing state. The directory dir is an array of HASHSEGMENTs; each segment is an array of ssize HASHBUCKET chain-heads:

// HASHHDR (excerpt) — src/backend/utils/hash/dynahash.c
struct HASHHDR
{
FreeListData freeList[NUM_FREELISTS];
/* These fields can change, but not in a partitioned table */
long dsize; /* directory size */
long nsegs; /* number of allocated segments (<= dsize) */
uint32 max_bucket; /* ID of maximum bucket in use */
uint32 high_mask; /* mask to modulo into entire table */
uint32 low_mask; /* mask to modulo into lower half of table */
/* These fields are fixed at hashtable creation */
Size keysize; /* hash key length in bytes */
Size entrysize; /* total user element size in bytes */
long num_partitions; /* # partitions (must be power of 2), or 0 */
long max_dsize; /* 'dsize' limit if directory is fixed size */
long ssize; /* segment size --- must be power of 2 */
int sshift; /* segment shift = log2(ssize) */
int nelem_alloc; /* number of entries to allocate at once */
};

Mapping a hash value to a bucket is the Larson two-mask trick: mask with high_mask (the whole table); if the result exceeds the highest bucket that actually exists yet, fold it back down with low_mask (the lower half). This is exactly what lets the table exist at a non-power-of-two number of buckets mid-expansion:

// calc_bucket — src/backend/utils/hash/dynahash.c
static inline uint32
calc_bucket(HASHHDR *hctl, uint32 hash_val)
{
uint32 bucket;
bucket = hash_val & hctl->high_mask;
if (bucket > hctl->max_bucket)
bucket = bucket & hctl->low_mask;
return bucket;
}

The bucket number then splits into a directory index and an in-segment index by shifting by sshift (= log2 of segment size) and masking — the MOD(x,y) macro is (x) & ((y)-1), valid because ssize is a power of two:

// hash_initial_lookup — src/backend/utils/hash/dynahash.c
bucket = calc_bucket(hctl, hashvalue);
segment_num = bucket >> hashp->sshift;
segment_ndx = MOD(bucket, hashp->ssize);
segp = hashp->dir[segment_num];
if (segp == NULL)
hash_corrupted(hashp);
*bucketptr = &segp[segment_ndx];

Local vs shared: one code path, two shapes

Section titled “Local vs shared: one code path, two shapes”

hash_create branches on HASH_SHARED_MEM. For a shared table the HTAB header is allocated in TopMemoryContext but the HASHHDR and directory come from the caller-provided shared region (info->hctl), and HASH_ATTACH lets a second backend bind to an already-initialized table by copying a few local constants and returning immediately. For a local table everything — header, directory, segments, entries — lives in a private AllocSet context created for the table, which is what makes hash_destroy a one-liner (MemoryContextDelete):

// hash_create (excerpt) — src/backend/utils/hash/dynahash.c
if (flags & HASH_SHARED_MEM)
{
hashp->hctl = info->hctl;
hashp->dir = (HASHSEGMENT *) (((char *) info->hctl) + sizeof(HASHHDR));
hashp->hcxt = NULL;
hashp->isshared = true;
/* hash table already exists, we're just attaching to it */
if (flags & HASH_ATTACH)
{
/* make local copies of some heavily-used values */
hctl = hashp->hctl;
hashp->keysize = hctl->keysize;
hashp->ssize = hctl->ssize;
hashp->sshift = hctl->sshift;
return hashp;
}
}

The flags also select the hash/compare/copy functions: HASH_STRINGS (C strings, string_hash/string_compare/strlcpy), HASH_BLOBS (binary, tag_hash or the uint32_hash fast path for 4-byte keys, memcmp/memcpy), or HASH_FUNCTION (caller-supplied). HASH_ELEM (key/entry sizes) is now mandatory, asserted at the top of hash_create.

hash_search_with_hash_value is where the whole structure comes together: an enter operation may first trigger a one-bucket expansion, then it always does the directory lookup and chain walk, then forks on the action. The expansion check is the only place expand_table is called, and it is short-circuited for partitioned, frozen, or being-scanned tables — the three states in which a bucket split would be unsafe.

flowchart TD
  start["hash_search_with_hash_value: key, hash, action"]
  enter{"action is ENTER<br/>or ENTER_NULL?"}
  expandok{"nentries gt max_bucket<br/>AND not partitioned<br/>AND not frozen<br/>AND no active seq scan?"}
  expand["expand_table: split one bucket,<br/>rehome its chain"]
  lookup["hash_initial_lookup:<br/>calc_bucket then segment then chain head"]
  walk["walk chain: compare cached hashvalue,<br/>then match on key bytes"]
  found{"matching entry found?"}
  act{"which action?"}
  rfind["FIND: return ELEMENTKEY or NULL"]
  rremove["REMOVE: unlink, push onto freelist,<br/>decrement nentries, return dangling ptr"]
  renterhit["ENTER hit: return existing ELEMENTKEY"]
  rentermiss["ENTER miss: get_hash_entry,<br/>link in, copy key, caller fills data"]

  start --> enter
  enter -- yes --> expandok
  enter -- no --> lookup
  expandok -- yes --> expand --> lookup
  expandok -- no --> lookup
  lookup --> walk --> found
  found --> act
  act -- FIND --> rfind
  act -- REMOVE --> rremove
  act -- ENTER/found --> renterhit
  act -- ENTER/not found --> rentermiss

A shared table flagged HASH_PARTITION trades on-the-fly expansion for concurrency. The mechanism is the freeList[NUM_FREELISTS] array: instead of one global (nentries, freeList) pair under one lock, the counts and free chains are sharded into 32 slices, each with its own spinlock. The partition of a hash value is its low-order bits, and the file-level comment explains why that aligns with calc_bucket:

// dynahash.c partitioning comment — src/backend/utils/hash/dynahash.c
* To use a hash table in partitioned mode, the HASH_PARTITION flag must be
* given to hash_create. This prevents any attempt to split buckets on-the-fly.
* Therefore, each hash bucket chain operates independently, and no fields
* of the hash header change after init except nentries and freeList.
...
* We expect callers to use the low-order bits of a lookup key's hash value
* as a partition number --- this will work because of the way calc_bucket()
* maps hash values to bucket numbers.
// FreeListData / FREELIST_IDX — src/backend/utils/hash/dynahash.c
typedef struct
{
slock_t mutex; /* spinlock for this freelist */
long nentries; /* number of entries in associated buckets */
HASHELEMENT *freeList; /* chain of free elements */
} FreeListData;
#define IS_PARTITIONED(hctl) ((hctl)->num_partitions != 0)
#define FREELIST_IDX(hctl, hashcode) \
(IS_PARTITIONED(hctl) ? (hashcode) % NUM_FREELISTS : 0)

The two canonical clients are the buffer-lookup table and the lock tables, both created via ShmemInitHash with HASH_PARTITION:

// buf_table.c — src/backend/storage/buffer/buf_table.c
info.num_partitions = NUM_BUFFER_PARTITIONS;
...
SharedBufHash = ShmemInitHash("Shared Buffer Lookup Table",
size, size, &info,
HASH_ELEM | HASH_BLOBS | HASH_PARTITION);
// lock.c — src/backend/storage/lmgr/lock.c
LockMethodLockHash = ShmemInitHash("LOCK hash",
init_table_size, max_table_size, &info,
HASH_ELEM | HASH_BLOBS | HASH_PARTITION);

Both compute the partition from the hash value (get_hash_value then take the low bits) before acquiring the matching partition LWLock, then call hash_search_with_hash_value so the already-computed hash is not recomputed. The partition-lock count is a separate constant from NUM_FREELISTS; the two need not be equal because each freelist’s coverage “might be more or less than one partition” (hence each freelist carries its own spinlock rather than relying on the caller’s partition lock).

For the local, cache-bound hot paths, simplehash.h is a wholly separate implementation that the preprocessor instantiates per element type. The caller #defines SH_PREFIX, SH_ELEMENT_TYPE, SH_KEY, SH_HASH_KEY, SH_EQUAL, etc., includes the header, and gets type-safe foo_hash, foo_create, foo_insert, foo_lookup functions with no indirect calls. The table is a single flat array of elements (open addressing), grown by doubling, using a robin-hood variant of linear probing to bound clustering. The header comment states the trade precisely:

// simplehash.h design notes — src/include/lib/simplehash.h
* Compared to dynahash, simplehash has the following benefits:
* - Due to the "templated" code generation has known structure sizes and no
* indirect function calls ...
* - Open addressing has better CPU cache behavior than dynahash's chained
* hashtables.
...
* of "robin hood" hashing is employed. Robin hood hashing optimizes
* chaining lengths by moving elements close to their optimal bucket
* ("rich" elements), out of the way if a to-be-inserted element is further
* away from its optimal position (i.e. it's "poor").

The contrast that matters for a reader choosing between them: simplehash has no stable pointers (robin-hood shifting and grow-time rehashing move entries), no shared-memory mode, and no partitioning — but for small entries probed millions of times in one backend it is materially faster. The two tables are complementary, not competitors.

The code divides cleanly into create/destroy, search (the public CRUD), the sequential-scan machinery, and the private growth/allocation utilities. The two clients (buffer table, lock table) sit on top via ShmemInitHash, and simplehash.h is an entirely parallel implementation in the include tree.

hash_create is the single entry point. It asserts HASH_ELEM, picks the hash/compare/copy functions from the flags, allocates the HTAB (and, for a local table, a private memory context), fills HASHHDR defaults via hdefault, then calls init_htab to build the directory and initial segments. For a shared table — or a small local one — it preallocates entries into the freelists so the first inserts don’t have to allocate.

  • hash_create — the constructor; flag dispatch, function selection, shared vs local context, partition setup, preallocation loop.
  • hdefault — zero HASHHDR and set DEF_DIRSIZE (256), DEF_SEGSIZE (256), NO_MAX_DSIZE, num_partitions = 0.
  • choose_nelem_alloc — pick how many entries to allocate per chunk so the request rounds to a power of two (>= 32 entries).
  • init_htab — derive max_bucket/high_mask/low_mask from next_pow2_int(nelem), ensure nbuckets >= num_partitions, allocate the directory and the initial segments; initialize the per-freelist spinlocks for a partitioned table.
  • hash_estimate_size / hash_select_dirsize / hash_get_shared_size — the shared-memory sizing helpers ShmemInitHash calls to reserve a fixed region (directory cannot grow once placed).
  • hash_destroy — for a local table only: MemoryContextDelete(hashp->hcxt) frees everything; asserts the allocator is the default.

Search: find / enter / remove / update (dynahash.c)

Section titled “Search: find / enter / remove / update (dynahash.c)”

hash_search computes the hash and delegates to hash_search_with_hash_value, which is the workhorse. On HASH_ENTER it first checks whether to expand (only for a non-partitioned, non-frozen table with no active scan), then does the initial bucket lookup, walks the chain comparing cached hash then key, and forks on the action.

  • hash_search — thin wrapper that hashes the key, then calls hash_search_with_hash_value.
  • hash_search_with_hash_value — the CRUD core; the expansion trigger, hash_initial_lookup, the chain walk (hashvalue equality gate then match), and the HASH_FIND/HASH_REMOVE/HASH_ENTER/HASH_ENTER_NULL switch. On remove it unlinks and pushes the element onto the freelist; on enter-not-found it pulls from the freelist via get_hash_entry and copies the key.
  • get_hash_value — exported hash-only helper so partitioned callers can compute the partition before locking.
  • calc_bucket / hash_initial_lookup — hash to bucket to (segment, index) to chain-head pointer.
  • hash_update_hash_key — re-key an existing entry in place: find it by its saved hash, find the destination chain, fail on duplicate, and relink only if the bucket changed (never touches the freelist, so it cannot OOM).
  • get_hash_entry — pop a free element from freeList[freelist_idx]; if empty, element_alloc a new chunk; if that fails in a partitioned table, borrow from other freelists before declaring OOM.
  • hash_get_num_entries — sum nentries across freelists (caller must hold all partition locks).

The remove path is the clearest illustration of the chained design’s stable-pointer property and the partitioned-freelist bookkeeping:

// hash_search_with_hash_value, HASH_REMOVE — src/backend/utils/hash/dynahash.c
case HASH_REMOVE:
if (currBucket != NULL)
{
if (IS_PARTITIONED(hctl))
SpinLockAcquire(&(hctl->freeList[freelist_idx].mutex));
Assert(hctl->freeList[freelist_idx].nentries > 0);
hctl->freeList[freelist_idx].nentries--;
/* remove record from hash bucket's chain. */
*prevBucketPtr = currBucket->link;
/* add the record to the appropriate freelist. */
currBucket->link = hctl->freeList[freelist_idx].freeList;
hctl->freeList[freelist_idx].freeList = currBucket;
if (IS_PARTITIONED(hctl))
SpinLockRelease(&hctl->freeList[freelist_idx].mutex);
return ELEMENTKEY(currBucket);
}
return NULL;

Note the explicit warning in the comments: the returned pointer after HASH_REMOVE is a dangling pointer — the element is back on the freelist and will be reused on the next insert. And the HASH_ENTER path ends with a pointed comment that the caller must fill the data field without any code that could throw, since a thrown error would leave a half-initialized entry linked into the table.

Expansion: one bucket at a time (dynahash.c)

Section titled “Expansion: one bucket at a time (dynahash.c)”

expand_table is the Larson split step, only ever reached for a non-partitioned table. It creates exactly one new bucket (max_bucket + 1), allocating a new segment (and growing the directory via dir_realloc) if the new bucket crosses into fresh territory, readjusts the masks if it crossed a power of two, and rehomes the entries of the one old bucket that maps to the new one:

// expand_table (excerpt) — src/backend/utils/hash/dynahash.c
new_bucket = hctl->max_bucket + 1;
...
hctl->max_bucket++;
old_bucket = (new_bucket & hctl->low_mask);
if ((uint32) new_bucket > hctl->high_mask)
{
hctl->low_mask = hctl->high_mask;
hctl->high_mask = (uint32) new_bucket | hctl->low_mask;
}
...
for (currElement = *oldlink; currElement != NULL; currElement = nextElement)
{
nextElement = currElement->link;
if ((long) calc_bucket(hctl, currElement->hashvalue) == old_bucket)
{
*oldlink = currElement;
oldlink = &currElement->link;
}
else
{
*newlink = currElement;
newlink = &currElement->link;
}
}
*oldlink = NULL;
*newlink = NULL;

The expansion is triggered inside hash_search_with_hash_value on enter, when the live entry count in freeList[0] exceeds max_bucket (load factor

  1. and the table is eligible:
// hash_search_with_hash_value, expansion trigger — src/backend/utils/hash/dynahash.c
if (action == HASH_ENTER || action == HASH_ENTER_NULL)
{
if (hctl->freeList[0].nentries > (long) hctl->max_bucket &&
!IS_PARTITIONED(hctl) && !hashp->frozen &&
!has_seq_scans(hashp))
(void) expand_table(hashp);
}

The three suppressors are each meaningful: partitioned tables must never split (IS_PARTITIONED), frozen tables forbid the bucket-count change (frozen), and a table with an open hash_seq_search must not split mid-scan (has_seq_scans). A failure to expand is non-fatal — the table just runs at a higher fill factor.

  • expand_table — the single-bucket split; mask readjustment; rehoming loop.
  • dir_realloc — double the directory (only when max_dsize == NO_MAX_DSIZE, i.e. a local table); copy old pointers, zero the rest, pfree the old.
  • seg_alloc — allocate and zero one segment of ssize bucket heads.
  • element_alloc — allocate nelem entries in one block, thread them onto freeList[freelist_idx]; returns false (no throw) for a fixed-size table.

hash_seq_init records the scan and (unless the table is frozen) registers it in the per-backend active-scan list so inserts will suppress splits. hash_seq_search walks bucket by bucket: it follows the current chain, and when a chain ends it advances curBucket, decomposing each bucket into (segment_num, segment_ndx) and skipping empty buckets:

// hash_seq_search (excerpt) — src/backend/utils/hash/dynahash.c
segment_num = curBucket >> hashp->sshift;
segment_ndx = MOD(curBucket, ssize);
segp = hashp->dir[segment_num];
while ((curElem = segp[segment_ndx]) == NULL)
{
/* empty bucket, advance to next */
if (++curBucket > max_bucket)
{
status->curBucket = curBucket;
hash_seq_term(status);
return NULL; /* search is done */
}
if (++segment_ndx >= ssize)
{
segment_num++;
segment_ndx = 0;
segp = hashp->dir[segment_num];
}
}
  • hash_seq_init / hash_seq_init_with_hash_value — start a full scan, or a scan restricted to the one bucket holding a given hash value.
  • hash_seq_search — return entries one at a time; the deleting-current element is explicitly allowed, deleting others mid-scan is undefined.
  • hash_seq_term — end-of-scan cleanup; deregister the scan.
  • register_seq_scan / deregister_seq_scan / has_seq_scans — the fixed-size (MAX_SEQ_SCANS = 100) per-backend stack of open scans that expand_table’s trigger consults.
  • hash_freeze — lock a local table against future inserts so scans need no registration/termination discipline; forbidden on shared tables.

The whole file is macro machinery: SH_MAKE_NAME stamps the user prefix onto every type and function. The runtime structure is one flat element array plus sizing scalars:

// SH_TYPE — src/include/lib/simplehash.h
typedef struct SH_TYPE
{
uint64 size; /* size of data/bucket array */
uint32 members; /* how many elements have valid contents */
uint32 sizemask; /* mask for bucket and size calculations */
uint32 grow_threshold; /* boundary after which to grow hashtable */
SH_ELEMENT_TYPE *data; /* hash buckets */
#ifndef SH_RAW_ALLOCATOR
MemoryContext ctx;
#endif
void *private_data;
} SH_TYPE;

Sizing rounds the requested element count up by the fill factor (default 0.9) to the next power of two, so sizemask = size - 1 and the optimal bucket is just hash & sizemask:

// SH_UPDATE_PARAMETERS / SH_INITIAL_BUCKET — src/include/lib/simplehash.h
tb->size = size;
tb->sizemask = (uint32) (size - 1);
if (tb->size == SH_MAX_SIZE)
tb->grow_threshold = ((double) tb->size) * SH_MAX_FILLFACTOR;
else
tb->grow_threshold = ((double) tb->size) * SH_FILLFACTOR;
...
static inline uint32
SH_INITIAL_BUCKET(SH_TYPE * tb, uint32 hash)
{
return hash & tb->sizemask;
}

The robin-hood insert is the heart of the file. On a collision it compares the displacement-from-optimal of the to-be-inserted entry against the colliding entry’s. If the newcomer is “poorer” (farther from its optimal bucket), it evicts the “richer” resident by shifting the run of entries forward into the next empty slot, then drops in. It also force-grows the table if a single insert would have to probe (SH_GROW_MAX_DIB, 25) or shift (SH_GROW_MAX_MOVE, 150) too far:

// SH_INSERT_HASH_INTERNAL (excerpt) — src/include/lib/simplehash.h
curhash = SH_ENTRY_HASH(tb, entry);
curoptimal = SH_INITIAL_BUCKET(tb, curhash);
curdist = SH_DISTANCE_FROM_OPTIMAL(tb, curoptimal, curelem);
if (insertdist > curdist)
{
/* find next empty bucket, then shift the run forward by one */
...
moveelem = emptyelem;
while (moveelem != curelem)
{
moveelem = SH_PREV(tb, moveelem, startelem);
moveentry = &data[moveelem];
memcpy(lastentry, moveentry, sizeof(SH_ELEMENT_TYPE));
lastentry = moveentry;
}
tb->members++;
entry->SH_KEY = key;
entry->status = SH_STATUS_IN_USE;
*found = false;
return entry;
}

The lookup is plain linear probing that stops at the first SH_STATUS_EMPTY slot — correct precisely because robin-hood deletion never leaves tombstones (it back-shifts instead), so an empty slot truly means “not present”:

// SH_LOOKUP_HASH_INTERNAL — src/include/lib/simplehash.h
const uint32 startelem = SH_INITIAL_BUCKET(tb, hash);
uint32 curelem = startelem;
while (true)
{
SH_ELEMENT_TYPE *entry = &tb->data[curelem];
if (entry->status == SH_STATUS_EMPTY)
return NULL;
if (SH_COMPARE_KEYS(tb, hash, key, entry))
return entry;
curelem = SH_NEXT(tb, curelem, startelem);
}
  • SH_CREATE / SH_DESTROY / SH_RESET — lifecycle; create rounds size up by SH_FILLFACTOR.
  • SH_INSERT / SH_INSERT_HASH / SH_INSERT_HASH_INTERNAL — robin-hood insert; grows on threshold or on SH_GROW_MAX_DIB/SH_GROW_MAX_MOVE.
  • SH_LOOKUP / SH_LOOKUP_HASH / SH_LOOKUP_HASH_INTERNAL — linear probe to first empty.
  • SH_DELETE / SH_DELETE_ITEM — delete with back-shift compaction (no tombstones).
  • SH_GROW — double and re-place every entry, starting from a non-wrapped bucket so moves never conflict.
  • SH_START_ITERATE / SH_ITERATE — iteration over the flat array, visiting a random-but-stable start bucket so an entry inserted during iteration is not double-visited.
  • SH_STAT — diagnostics (max/avg chain length, collisions).

Position hints (as of 2026-06-05, REL_18 273fe94)

Section titled “Position hints (as of 2026-06-05, REL_18 273fe94)”
SymbolFileLine
HASHELEMENThsearch.h51
HASHCTLhsearch.h65
HASHACTIONhsearch.h111
HASH_SEQ_STATUShsearch.h120
struct HASHHDRdynahash.c168
FreeListDatadynahash.c153
struct HTABdynahash.c219
ELEMENTKEY / ELEMENT_FROM_KEYdynahash.c244
FREELIST_IDXdynahash.c212
hash_createdynahash.c352
hdefaultdynahash.c630
choose_nelem_allocdynahash.c657
init_htabdynahash.c690
hash_estimate_sizedynahash.c784
hash_select_dirsizedynahash.c831
hash_get_shared_sizedynahash.c855
hash_destroydynahash.c866
get_hash_valuedynahash.c912
calc_bucketdynahash.c919
hash_searchdynahash.c956
hash_search_with_hash_valuedynahash.c969
hash_update_hash_keydynahash.c1146
get_hash_entrydynahash.c1259
hash_get_num_entriesdynahash.c1344
hash_seq_initdynahash.c1388
hash_seq_searchdynahash.c1423
hash_seq_termdynahash.c1517
hash_freezedynahash.c1537
expand_tabledynahash.c1554
dir_reallocdynahash.c1651
seg_allocdynahash.c1690
element_allocdynahash.c1709
hash_initial_lookupdynahash.c1759
my_log2dynahash.c1797
SH_TYPEsimplehash.h145
SH_STATUSsimplehash.h175
SH_COMPUTE_SIZEsimplehash.h310
SH_UPDATE_PARAMETERSsimplehash.h336
SH_INITIAL_BUCKETsimplehash.h356
SH_DISTANCE_FROM_OPTIMALsimplehash.h385
SH_CREATEsimplehash.h441
SH_GROWsimplehash.h493
SH_INSERT_HASH_INTERNALsimplehash.h608
SH_LOOKUP_HASH_INTERNALsimplehash.h799
SH_DELETEsimplehash.h856
SH_DELETE_ITEMsimplehash.h928
SH_ITERATEsimplehash.h1045
  • dynahash is Larson’s 1988 dynamic hashing, grown one bucket at a time. The historical comment in dynahash.c credits “Dynamic hashing, after CACM April 1988 pp 446-457, by Per-Ake Larson.” Verified that expand_table adds exactly one bucket per call (new_bucket = hctl->max_bucket + 1; ... hctl->max_bucket++;) and rehomes only the single old bucket (new_bucket & low_mask), splitting its chain in two. The two-mask calc_bucket (high_mask then fold to low_mask) is the linear-hashing address calculation.

  • One code base serves both local and shared tables, selected by flags. Verified in hash_create: HASH_SHARED_MEM points hashp->hctl/hashp->dir at the caller’s shared region and sets isshared = true; the non-shared path creates a private AllocSetContext. HASH_ATTACH returns early after copying local constants, letting a second backend bind to an existing table.

  • Entries have stable addresses; the key is embedded after a HASHELEMENT header. Verified in hsearch.h (HASHELEMENT = link + hashvalue) and the ELEMENTKEY/ELEMENT_FROM_KEY macros in dynahash.c (key at MAXALIGN(sizeof(HASHELEMENT)) past the header). hash_search returns ELEMENTKEY(currBucket); removal pushes the same element onto a freelist rather than freeing it, so the address is reused but never relocated while live. This is the “Guarantees stable pointers to entries” bullet in the file header.

  • Partitioned mode shards (nentries, freeList) into NUM_FREELISTS=32 spinlock-guarded slices and forbids on-the-fly splits. Verified: FreeListData carries a per-slice slock_t mutex; FREELIST_IDX maps a hashcode to a slice; expand_table asserts !IS_PARTITIONED(hctl); and the hash_search_with_hash_value expansion trigger is gated on !IS_PARTITIONED(hctl). get_hash_entry borrows from other freelists under their spinlocks before declaring OOM, satisfying callers that assume a delete guarantees a subsequent insert can succeed.

  • The buffer manager and lock manager are the canonical partitioned clients. Verified buf_table.c creates SharedBufHash with HASH_ELEM | HASH_BLOBS | HASH_PARTITION and num_partitions = NUM_BUFFER_PARTITIONS; lock.c creates LockMethodLockHash and LockMethodProcLockHash with HASH_PARTITION. Both call get_hash_value to derive the partition before locking, then hash_search_with_hash_value.

  • Expansion is triggered at load factor 1 and suppressed by partitioning, freezing, or an active sequential scan. Verified the trigger condition hctl->freeList[0].nentries > (long) hctl->max_bucket && !IS_PARTITIONED && !hashp->frozen && !has_seq_scans(hashp). Failure to expand is explicitly non-fatal per the in-code comment.

  • hash_seq_search walks the directory bucket-by-bucket and registers the scan to inhibit splits. Verified hash_seq_init calls register_seq_scan (unless frozen); hash_seq_search decomposes curBucket into (segment_num, segment_ndx) via sshift/MOD, skips empty buckets, and follows ->link chains. MAX_SEQ_SCANS is 100.

  • simplehash is a separate macro-templated open-addressing robin-hood hash with no stable pointers, no shared memory, and no partitioning. Verified simplehash.h’s SH_TYPE is a single flat SH_ELEMENT_TYPE *data array; SH_INSERT_HASH_INTERNAL performs robin-hood shifting (insertdist > curdist evicts the richer resident by memcpy-shifting the run); SH_LOOKUP_HASH_INTERNAL stops at the first SH_STATUS_EMPTY (valid because SH_DELETE back-shifts rather than leaving tombstones). Default SH_FILLFACTOR 0.9, SH_MAX_FILLFACTOR 0.98, force-grow at SH_GROW_MAX_DIB 25 / SH_GROW_MAX_MOVE 150.

  • HASH_ELEM is now mandatory and the key-class flag is exactly one of STRINGS/BLOBS/FUNCTION. Verified the Assert(flags & HASH_ELEM) at the top of hash_create, the Assert(info->keysize > 8) guard against mistaking an integer key for a string, and the uint32_hash fast path for 4-byte BLOB keys.

  1. How often the freelist-borrowing path in get_hash_entry actually fires under real partitioned workloads. The borrowing loop exists for correctness (a delete must guarantee a later insert can find an element even across partitions), but whether it is a measurable hot path under buffer- or lock-table churn is not evident from the code. Investigation path: counter instrumentation on the borrow_from_idx loop under a contended pgbench run.

  2. The practical cost of the load-factor-1 expansion threshold versus simplehash’s 0.9. dynahash splits when live entries exceed max_bucket (load factor ~1), tolerating longer chains; simplehash grows at 0.9 to keep probe runs short. Whether dynahash’s higher target load factor materially lengthens chain walks for the catalog/relcache local tables is unmeasured here. Investigation path: hash_stats (HASH_STATISTICS build) on a busy relcache.

  3. Whether any current shared table would benefit from a NUM_FREELISTS other than 32. The freelist count is a compile-time constant decoupled from the partition-lock count; the comment notes a freelist’s coverage “might be more or less than one partition.” Whether 32 is optimal for the NUM_BUFFER_PARTITIONS / lock-partition counts in REL_18 is not established in-tree. Investigation path: cache-line-contention profiling of the freelist spinlocks under high buffer-eviction load.

Beyond PostgreSQL — Comparative Designs & Research Frontiers

Section titled “Beyond PostgreSQL — Comparative Designs & Research Frontiers”
  • Larson 1988 / Litwin linear hashing vs. extendible hashing. dynahash follows Larson’s linear-hashing lineage (split the next bucket in round- robin order, two-mask addressing) rather than Fagin’s extendible hashing (a directory indexed by the top d bits, doubling the directory on overflow). Extendible hashing bounds the worst-case to two disk accesses and is the classic choice for on-disk hash indexes; linear hashing wins for in-memory tables because it never doubles the directory in one step. A note mapping calc_bucket’s two-mask scheme onto Litwin’s h_i/h_{i+1} split functions would sharpen the lineage. (Database System Concepts ch. 14 covers both; Database Internals ch. 6 covers extendible hashing for storage engines.)

  • Chaining + stable pointers vs. open addressing. The dynahash/simplehash split inside one codebase is a clean natural experiment in the chaining-vs- open-addressing trade. dynahash keeps stable pointers and partitionability at the cost of pointer chasing and per-entry headers; simplehash keeps cache density and inlined comparisons at the cost of moving entries. A comparative microbenchmark (large entries, pointer-held-across-ops workload vs. small entries, probe-heavy workload) would quantify the crossover the file headers assert qualitatively.

  • Robin-hood hashing vs. Swiss tables / hopscotch. simplehash’s robin-hood variant bounds probe-run variance by displacement-balancing. Modern alternatives — Google’s SwissTable (SIMD metadata bytes, abseil flat_hash_map) and hopscotch hashing — push open-addressing locality further. Positioning simplehash against a SwissTable-style metadata-probe design would frame what PostgreSQL gives up by staying with per-slot status bytes and scalar probing.

  • Partitioned dynahash vs. lock-free / sharded concurrent hash maps. dynahash partitions by low-order hash bits with an external LWLock per partition. Lock-free hash maps (split-ordered lists, Cliff Click’s non-blocking hash map) and finer-grained sharded designs eliminate the per-partition lock entirely. The CMU/EPFL “scalable lock manager” work (captured in dbms-papers/scalable-lock-manager.md) is the database-specific frontier here: it rethinks exactly the lock-table-over-partitioned-hash structure that LockMethodLockHash embodies. Comparing dynahash’s coarse partition locks against that design is a research-grade exercise in shared-table concurrency.

  • CUBRID’s memory hash tables (mht_*). CUBRID uses a general chained memory-hash facility for its catalog and session structures, much as dynahash serves PostgreSQL’s local caches. A side-by-side of CUBRID’s mht_create/mht_get/mht_put rehash strategy against dynahash’s one-bucket-at-a-time expansion would highlight whether CUBRID pays the stop-the-world rehash that Larson’s scheme is designed to avoid. (See the CUBRID code-analysis tree.)

In-tree source files (REL_18_STABLE, commit 273fe94)

Section titled “In-tree source files (REL_18_STABLE, commit 273fe94)”
  • src/backend/utils/hash/dynahash.c — the implementation: hash_create, hash_search_with_hash_value, expand_table, get_hash_entry, hash_seq_search, the directory/segment/bucket geometry, partitioned freelists, and the Larson historical comment.
  • src/include/utils/hsearch.h — the public surface: HASHELEMENT, HASHCTL, the HASH_* flag bits, HASHACTION, HASH_SEQ_STATUS, and the function prototypes.
  • src/include/lib/simplehash.h — the macro-templated open-addressing robin-hood hash: SH_TYPE, SH_INSERT_HASH_INTERNAL, SH_LOOKUP_HASH_INTERNAL, SH_GROW, and the design-notes header comment.
  • src/backend/storage/buffer/buf_table.cSharedBufHash, the canonical partitioned HASH_BLOBS client.
  • src/backend/storage/lmgr/lock.cLockMethodLockHash / LockMethodProcLockHash, partitioned lock tables built on dynahash.
  • Larson, P.-Å. (1988). “Dynamic Hash Tables.” Communications of the ACM 31(4):446-457. The expansion algorithm dynahash implements; cited by name in the dynahash.c historical comment.
  • Litwin, W. (1980). “Linear Hashing: A New Tool for File and Table Addressing.” Proc. VLDB. The linear-hashing scheme Larson refines.
  • Fagin, R., Nievergelt, J., Pippenger, N. & Strong, H. R. (1979). “Extendible Hashing — A Fast Access Method for Dynamic Files.” ACM TODS 4(3). The directory-doubling alternative dynahash does not use.
  • Database System Concepts (Silberschatz, Korth, Sudarshan, 7e), ch. 14 “Indexing” — static vs. dynamic hashing, linear and extendible hashing (knowledge/research/dbms-general/database-system-concepts.md).
  • Database Internals (Petrov 2019), ch. 6 — extendible hashing and in-memory table structures (knowledge/research/dbms-general/database-internals.md).
  • “A Scalable Lock Manager for Multicores” — the partitioned-lock-table concurrency frontier (knowledge/research/dbms-papers/scalable-lock-manager.md).

Sibling docs (cross-references — mechanism owned there, not duplicated here)

Section titled “Sibling docs (cross-references — mechanism owned there, not duplicated here)”
  • postgres-shared-memory-ipc.mdShmemInitHash, the shared-memory arena, and ShmemInitStruct that allocate the fixed region a shared dynahash lives in.
  • postgres-buffer-manager.mdSharedBufHash and the buffer-partition LWLocks that ride on dynahash’s HASH_PARTITION mode.
  • postgres-lock-manager.mdLockMethodLockHash / LockMethodProcLockHash, the partitioned lock tables and their partition LWLocks.
  • postgres-memory-contexts.md — the AllocSet contexts a local dynahash allocates from and destroys with.
  • postgres-architecture-overview.md — Axis “base infrastructure” placement of the dynahash/simplehash facilities.