
# CUBRID Lock-Free Hash Map — Legacy C, Modern C++, the Bridge, and the Consumers


A concurrent hash table is the most-used lock-free data structure in a DBMS. Lock managers, page-buffer maps, session caches, plan caches, statistics tables — every per-key resource that wants O(1) lookup under contention picks some form of concurrent hash. The textbook design space (Herlihy & Shavit, The Art of Multiprocessor Programming, ch. 13 §“Concurrent Hashing”, and Shalev & Shavit, JACM 53(3) 2006) has three axes:

  1. Open vs. closed addressing. Open addressing (linear / quadratic probing, Robin Hood) places entries directly in the bucket array — cache-friendly but resize-unfriendly. Closed addressing (chained hashing) places entries in per-bucket linked lists — easy to grow incrementally because chains absorb collisions.
  2. Static vs. resizable. A static table has a fixed bucket count. A resizable table grows when the load factor exceeds a threshold — but lock-free resize is hard, requiring either split-ordered bucket pointers (Shalev & Shavit) so that the chain itself does not need re-linking, or stop-the-world re-hash (which sacrifices the lock-free property during resize).
  3. What guards the chains? A coarse table mutex (PostgreSQL’s LockHashPartitionLock), per-bucket striping (Java’s pre-Java-8 ConcurrentHashMap), or fully lock-free chains via the Harris–Michael protocol (Harris 2001, Michael 2002).

CUBRID picks closed addressing + static buckets + Harris–Michael chains. The cost is no automatic resize — the bucket array is fixed at init time. The benefit is the implementation can be small and the chain operations are textbook.

The protocol uses three invariants on each chain node:

  1. The next pointer’s low bit is reserved as a delete mark.
  2. Insert is a single CAS: starting from the predecessor’s next, CAS from NULL to the new node. If the predecessor already has a next, the inserter walks past and tries again. Insertion of an already-present key is a no-op (or a “found” return for find-or-insert semantics).
  3. Delete is a two-CAS protocol:
    • Mark. CAS victim->next from next_ptr to mark(next_ptr) — sets the low bit. After this, the victim is logically deleted: any concurrent traverser sees the mark and treats the victim as gone, but the victim is still reachable from predecessor->next.
    • Unlink. CAS predecessor->next from victim to victim->next (the unmarked one) — physically removes the victim from the chain. If this CAS fails (someone else mutated the chain between mark and unlink), the unlinker either retries (rare) or restarts the search (cheaper).
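
The mark fits inside the pointer because chain nodes are at least word-aligned, so bit 0 of a real address is always zero. A minimal sketch of the convention, assuming only that alignment (CUBRID's actual helper is address_marker<T> in lockfree_address_marker.hpp; these free functions are illustrative):

```cpp
#include <cstdint>

// Illustrative mark-bit helpers; CUBRID's address_marker<T> plays this role.
template <class T>
T *set_mark (T *p)      // tag the pointer: the node is logically deleted
{
  return (T *) ((std::uintptr_t) p | 1);
}

template <class T>
T *strip_mark (T *p)    // recover the real address for traversal
{
  return (T *) ((std::uintptr_t) p & ~(std::uintptr_t) 1);
}

template <class T>
bool is_marked (T *p)   // has a deleter already completed step 1?
{
  return ((std::uintptr_t) p & 1) != 0;
}
```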

Two side conditions:

  • Restart on staleness. Every chain operation may need to restart from the bucket head if a CAS fails. This is the LF_LIST_BR_RESTARTED signal in the legacy code and the restart_search boolean in the modern code.
  • Reclamation. A logically-deleted but not-yet-physically-unlinked node must remain dereferenceable until every reader who could have reached it via predecessor->next has finished. This is what the transactional reclamation framework (lockfree::tran::*) provides.

The classical ABA scenario for chained hashing: a thread reads predecessor->next == A and plans to CAS-mark A->next. A sibling thread deletes A, re-allocates a new A' at the same address (e.g., via the freelist), and inserts A' somewhere — now predecessor->next points to A' (or further along the chain), not the A the first thread saw. The original thread's CAS-mark on the saved A pointer either succeeds and silently corrupts A'’s next (use-after-free plus corruption) or happens to fail because the chain shape changed — the desired outcome, but nothing guarantees it.

The transactional reclamation framework prevents this by ensuring that as long as the original thread holds an open descriptor on the table, no node it could have observed is reclaimed — including A. The CAS-mark either succeeds on the original A (which is still in the chain, even if logically deleted) or fails because the chain shape changed. There is no second-allocation aliasing.

CUBRID delegates the hash function to the per-table descriptor’s f_hash callback, which takes a key and a bucket count and returns the bucket index. This is per-consumer:

  • Lock manager uses an OID-shaped hash.
  • Page buffer (legacy LF_HASH_TABLE) uses a VPID hash: (pageid | ((unsigned int)volid) << 24) % htsize (lock_free.c:154) — see the sketch after this list.
  • Session table uses session-id modulo bucket count.
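
As a concrete illustration of the callback shape, the VPID hash above could be written roughly as follows. This is a sketch: the exact LF_ENTRY_HASH_FUNC signature and the VPID layout are assumptions to verify against lock_free.h and the storage headers.

```cpp
// Sketch of an f_hash callback in the VPID shape quoted above.
// Assumed convention: key pointer and bucket count in, bucket index out.
static unsigned int
vpid_hash_sketch (void *key, int htsize)
{
  const VPID *vpid = (const VPID *) key;   // { pageid, volid } layout assumed
  return (vpid->pageid | ((unsigned int) vpid->volid) << 24) % htsize;
}
```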

The bucket count itself is a one-shot decision per consumer at boot, picked from system parameters. There is no growth, no shrink, no rehash. The only resize-shaped operation is clear, which swaps in an empty back-buffer atomically — see §“Clear and back-buffer.”

## Lock managers everywhere use a chained hash


PostgreSQL’s lock.c partitions a single hash table into NUM_LOCK_PARTITIONS = 16 partitions, each guarded by an LWLock. InnoDB’s lock_sys_t uses a hash array on (space_id, page_no) pairs guarded by lock_sys->mutex (with recent versions adding sharded mutexes). MySQL’s open-table cache is a hash with per-table mutexes. The repeated pattern is partition by hash, lock per partition.

CUBRID’s approach is structurally different — instead of a coarse mutex per partition, every chain is fully lock-free, and the optional using_mutex is per entry rather than per partition. The trade-off:

  • Pro. Find / insert / erase contend only on the chain they touch, so throughput scales with the bucket count and degrades only with chain length; there is no per-partition bottleneck.
  • Pro. The locking pattern is simpler — a thread holds at most one entry’s mutex at a time.
  • Con. The transactional reclamation overhead is non-zero on every find (the start_tran/end_tran bracket).
  • Con. clear is O(buckets × chain_length) because every entry must be retired — the back-buffer trick spreads the cost but does not eliminate it.

A subtle CUBRID-specific design: some entries carry their own pthread_mutex_t (declared via lf_entry_descriptor::using_mutex = 1 and of_mutex field offset). When set, find and erase grab the entry’s mutex after finding it but before operating on the payload, allowing the caller to subsequently release the mutex with a separate unlock call. This is the pattern used by the lock manager’s resource entries — a thread that finds a lock resource needs to hold the resource entry’s mutex while mutating its waiter list.

When using_mutex = 0, the find/erase result is only valid inside the transaction bracket — the caller must not let the descriptor’s m_tranid go to INVALID_TRANID while it still holds a payload pointer.
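
A hedged sketch of the caller-side pattern when using_mutex = 1, modeled on the lock-manager description above; the bridge's exact find signature, the int tran_index parameter, and the waiter-list detail are assumptions, not code lifted from lock_manager.c:

```cpp
// Hypothetical caller: find () returns with the entry's mutex already held.
void wake_waiters (cubthread::lockfree_hashmap<OID, LK_RES> &res_map,
                   int tran_index, OID &oid)
{
  LK_RES *res = res_map.find (tran_index, oid);   // entry mutex held on success
  if (res == NULL)
    return;                                       // key absent, nothing locked
  // ... mutate the resource's waiter list under the entry mutex ...
  res_map.unlock (tran_index, res);               // the separate unlock call
}
```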

Insert-with-find-fallback (the find_or_insert API) is the common case — the caller wants the entry whether or not it already exists. The implementation pattern: speculatively claim a fresh entry from the freelist before searching, search the chain, if the key is found discard the speculative entry, otherwise CAS-link it into place.

CUBRID optimizes the discard with the descriptor::save_reclaimable mechanism described in cubrid-lockfree-transaction.md — the discarded entry is parked on the calling thread’s descriptor rather than retired, and the next call site reuses it without a freelist round-trip.
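
In outline, under hypothetical helper names (none of these are functions in the tree), the pattern just described reads:

```cpp
// Outline of find-or-insert with a speculative freelist claim (sketch only).
T *find_or_insert_outline (Key &key)
{
  T *spec = claim_speculative_entry ();     // freelist claim, or reuse a parked entry
  for (;;)
    {
      T *found = search_chain (key);
      if (found != NULL)
        {
          park_on_descriptor (spec);        // save_reclaimable: reused by the next call
          return found;
        }
      if (cas_link_at_chain_end (spec))     // the single insert CAS
        return spec;
      // CAS lost: a sibling mutated the chain; search again
    }
}
```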

| Concept | CUBRID name (legacy) | CUBRID name (modern) |
| --- | --- | --- |
| Chained hash table | LF_HASH_TABLE | lockfree::hashmap<K,T> |
| Bucket array | LF_HASH_TABLE::buckets | hashmap::m_buckets |
| Bucket count | LF_HASH_TABLE::hash_size | hashmap::m_size |
| Per-entry layout descriptor | LF_ENTRY_DESCRIPTOR | (same struct, lf_entry_descriptor) |
| Hash function | edesc->f_hash (key, htsize) | (same) |
| Key compare | edesc->f_key_cmp (k1, k2) | (same) |
| Key copy | edesc->f_key_copy (src, dst) | (same) |
| Init / uninit | edesc->f_init, edesc->f_uninit | (same) |
| Duplicate-key callback | edesc->f_duplicate | (same) |
| Per-entry mutex flag | edesc->using_mutex | (same) |
| Field offset: next | edesc->of_next | (same) |
| Field offset: key | edesc->of_key | (same) |
| Field offset: mutex | edesc->of_mutex | (same) |
| Field offset: del_tran_id | edesc->of_del_tran_id | (legacy only — modern uses freelist node) |
| Field offset: local_next | edesc->of_local_next | (legacy only — modern uses freelist node) |
| Behaviour flags | LF_LIST_BF_* | (same constants reused) |
| Restart signal | LF_LIST_BR_RESTARTED | (same constant reused) |
| Find primitive | lf_hash_find | hashmap::find |
| Find-or-insert | lf_hash_find_or_insert | hashmap::find_or_insert |
| Insert | lf_hash_insert | hashmap::insert |
| Insert-given | lf_hash_insert_given | hashmap::insert_given |
| Delete | lf_hash_delete | hashmap::erase |
| Delete-locked | lf_hash_delete_already_locked | hashmap::erase_locked |
| Clear | lf_hash_clear | hashmap::clear |
| Iterator | LF_HASH_TABLE_ITERATOR | hashmap::iterator |
| Back-buffer for clear | LF_HASH_TABLE::backbuffer | hashmap::m_backbuffer |
| C++ shim | lf_hash_table_cpp<K,T> | (same — used in bridge) |
| Wrapper deciding generation | (n/a) | cubthread::lockfree_hashmap<K,T> |

The hashmap exists as two complete implementations bridged by a single dispatch class. Both consume the same lf_entry_descriptor so a per-consumer entry layout works unmodified across the migration.

```c
// LF_HASH_TABLE — src/base/lock_free.h:299
struct lf_hash_table
{
  void **buckets;
  void **backbuffer;
  pthread_mutex_t backbuffer_mutex;
  unsigned int hash_size;
  LF_FREELIST *freelist;
  LF_ENTRY_DESCRIPTOR *entry_desc;
};
```

The procedural API: lf_hash_init, lf_hash_find, lf_hash_find_or_insert, lf_hash_insert, lf_hash_insert_given, lf_hash_delete, lf_hash_delete_already_locked, lf_hash_clear. All take an LF_TRAN_ENTRY * (the per-thread state in the legacy reclamation system) and an LF_HASH_TABLE *.

The C++ shim lf_hash_table_cpp<K,T> is a typed wrapper in the same header — it owns an lf_freelist and an lf_hash_table inline and forwards every method to the C functions with a reinterpret_cast<void**>(&t) for the entry pointer. This shim is what cubthread::lockfree_hashmap holds for the OLD path.
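
A plausible shape of one such forwarder — a sketch; the real shim methods live in lock_free.h and their exact signatures may differ:

```cpp
// Sketch of a typed forwarder in lf_hash_table_cpp<K, T>.
T *find (LF_TRAN_ENTRY *t_entry, K &key)
{
  T *entry = NULL;
  // the legacy C API traffics in void **; the shim erases and restores the type
  (void) lf_hash_find (t_entry, &m_hash, &key, reinterpret_cast<void **> (&entry));
  return entry;
}
```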

```cpp
// lockfree::hashmap<Key, T> — src/base/lockfree_hashmap.hpp:42
template <class Key, class T>
class hashmap
{
  public:
    void init (tran::system &transys, size_t hash_size,
               size_t freelist_block_size, size_t freelist_block_count,
               lf_entry_descriptor &edesc);
    void destroy ();

    T *find (tran::index, Key &);
    bool find_or_insert (tran::index, Key &, T *&);
    bool insert (tran::index, Key &, T *&);
    bool insert_given (tran::index, Key &, T *&);
    bool erase (tran::index, Key &);
    bool erase_locked (tran::index, Key &, T *&);
    void unlock (tran::index, T *&);
    void clear (tran::index); // NOT lock-free

    T *freelist_claim (tran::index);
    void freelist_retire (tran::index, T *&);
    void start_tran (tran::index);
    void end_tran (tran::index);
    // ...

  private:
    using address_type = address_marker<T>;

    struct freelist_node_data
    {
      T m_entry;
      lf_entry_descriptor *m_edesc;
      void on_reclaim ();
    };
    using freelist_type = freelist<freelist_node_data>;
    using free_node_type = typename freelist_type::free_node;

    freelist_type *m_freelist;
    T **m_buckets;
    size_t m_size;
    T **m_backbuffer;
    std::mutex m_backbuffer_mutex;
    lf_entry_descriptor *m_edesc;
    // statistics ...
};
```

Three structural differences from legacy:

  1. Templated on K, T, no void* casts in the public surface.
  2. No global tran-system reference. init () takes a tran::system & and builds a per-table tran::table (inside the freelist).
  3. Entries live as freelist<freelist_node_data>::free_node, not bare T. The freelist owns the node header (the tran::reclaimable_node base + m_owner pointer) and the payload T m_entry — see §“Entry layout”.

## Bridge — cubthread::lockfree_hashmap<K,T>

```cpp
// cubthread::lockfree_hashmap<Key,T> — src/thread/thread_lockfree_hash_map.hpp:35
template <class Key, class T>
class lockfree_hashmap
{
  public:
    void init (lf_tran_system &transys, int entry_idx, int hash_size,
               int freelist_block_size, int freelist_block_count,
               lf_entry_descriptor &edesc);
    void init_as_old (lf_tran_system &, int hash_size, int freelist_block_count,
                      int freelist_block_size, lf_entry_descriptor &, int entry_idx);
    void init_as_new (lockfree::tran::system &, size_t hash_size,
                      size_t freelist_block_size, size_t freelist_block_count,
                      lf_entry_descriptor &);
    // forwards: find, find_or_insert, insert, insert_given, erase, erase_locked,
    // unlock, clear, freelist_claim, freelist_retire, start_tran, end_tran

  private:
    enum type { OLD, NEW, UNKNOWN };

    lf_hash_table_cpp<Key, T> m_old_hash;
    lockfree::hashmap<Key, T> m_new_hash;
    type m_type;
    int m_entry_idx;
};
```

The dispatch is a single boolean check at init():

```cpp
void init (lf_tran_system &transys, int entry_idx, int hash_size,
           int freelist_block_size, int freelist_block_count,
           lf_entry_descriptor &edesc)
{
  if (prm_get_bool_value (PRM_ID_ENABLE_NEW_LFHASH))
    init_as_new (get_thread_entry_lftransys (), (size_t) hash_size,
                 (size_t) freelist_block_size, (size_t) freelist_block_count, edesc);
  else
    init_as_old (transys, hash_size, freelist_block_count,
                 freelist_block_size, edesc, entry_idx);
}
```

Every forwarded method picks the live implementation by reading m_type and dispatching the matching call. The dispatch is cheap — a single conditional branch — but it is on every call. There is no inlined fast path; the branch sits inside a macro (lockfree_hashmap_forward_func) that every public method expands.
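
In spirit, each expansion is a two-way branch per method. A sketch, assuming the forwarders take the calling thread's entry and translate it to a tran::index for the new path; get_tran_entry is a hypothetical helper standing in for the legacy-side plumbing:

```cpp
// Sketch of one expanded forwarder; lockfree_hashmap_forward_func generates these.
T *find (cubthread::entry &thread_e, Key &key)
{
  assert (m_type != UNKNOWN);
  if (m_type == OLD)
    {
      // legacy path: per-thread LF_TRAN_ENTRY selected by m_entry_idx
      return m_old_hash.find (get_tran_entry (thread_e, m_entry_idx), key);
    }
  // modern path: per-thread lockfree-transaction index
  return m_new_hash.find (thread_e.get_lf_tran_index (), key);
}
```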

Both paths consume the same descriptor:

```c
// lf_entry_descriptor — src/base/lock_free.h:62
struct lf_entry_descriptor
{
  unsigned int of_local_next;                   // legacy only
  unsigned int of_next;
  unsigned int of_del_tran_id;                  // legacy only
  unsigned int of_key;
  unsigned int of_mutex;
  int using_mutex;
  int max_alloc_cnt;                            // legacy only
  LF_ENTRY_ALLOC_FUNC f_alloc;                  // legacy only
  LF_ENTRY_FREE_FUNC f_free;                    // legacy only
  LF_ENTRY_INITIALIZE_FUNC f_init;
  LF_ENTRY_UNINITIALIZE_FUNC f_uninit;
  LF_ENTRY_KEY_COPY_FUNC f_key_copy;
  LF_ENTRY_KEY_COMPARE_FUNC f_key_cmp;
  LF_ENTRY_HASH_FUNC f_hash;
  LF_ENTRY_DUPLICATE_KEY_HANDLER f_duplicate;
};
```

Three categories of fields:

  • Always used. of_next, of_key, of_mutex, using_mutex, f_init, f_uninit, f_key_copy, f_key_cmp, f_hash, f_duplicate.
  • Legacy only. of_local_next (used by the legacy retire list and the legacy stack/freelist), of_del_tran_id (used by the legacy retire-id stamping), max_alloc_cnt, f_alloc, f_free (legacy freelist allocates with f_alloc/f_free callbacks; modern freelist uses new/delete on its own free_node type).
  • Modern only. None — the modern path simply ignores legacy-only fields. A descriptor populated for both paths silently uses the right subset.

This is the critical reason the two-generation co-existence works: a single in-tree descriptor table works on either side because the modern code overlooks the legacy fields and the legacy code overlooks the modern absences. Each consumer (lock manager, session table, etc.) defines its descriptor once and the bridge flips the implementation under it.
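
For illustration only, a both-paths descriptor might be wired like this. It is a sketch: MY_ENTRY, MY_KEY, and every my_* callback are placeholders, not names from the tree; real consumers such as lock_manager.c show the exact shape.

```cpp
#include <cstddef>   // offsetof

// Hypothetical entry type; field order mirrors the offsets recorded below.
struct MY_ENTRY
{
  MY_ENTRY *local_next;    // legacy retire list (ignored by the modern path)
  MY_ENTRY *next;          // chain link; the low bit doubles as the delete mark
  UINT64 del_tran_id;      // legacy retire stamp (ignored by the modern path)
  MY_KEY key;
  pthread_mutex_t mutex;   // only used when using_mutex == 1
  // ... payload fields ...
};

static lf_entry_descriptor my_entry_desc = {
  (unsigned int) offsetof (MY_ENTRY, local_next),   // of_local_next (legacy only)
  (unsigned int) offsetof (MY_ENTRY, next),         // of_next
  (unsigned int) offsetof (MY_ENTRY, del_tran_id),  // of_del_tran_id (legacy only)
  (unsigned int) offsetof (MY_ENTRY, key),          // of_key
  (unsigned int) offsetof (MY_ENTRY, mutex),        // of_mutex
  1,                   // using_mutex
  0,                   // max_alloc_cnt (legacy only; value illustrative)
  my_entry_alloc,      // f_alloc  (legacy only)
  my_entry_free,       // f_free   (legacy only)
  my_entry_init,       // f_init
  my_entry_uninit,     // f_uninit
  my_key_copy,         // f_key_copy
  my_key_cmp,          // f_key_cmp
  my_key_hash,         // f_hash
  NULL                 // f_duplicate (no duplicate-key handling)
};
```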

Where the legacy table stores entries as raw T allocated by f_alloc, the modern path wraps T inside a freelist node:

```cpp
// lockfree::hashmap<K,T>::freelist_node_data — src/base/lockfree_hashmap.hpp:84
struct freelist_node_data
{
  T m_entry;
  lf_entry_descriptor *m_edesc;

  freelist_node_data () = default;
  ~freelist_node_data () = default;

  void on_reclaim ()
  {
    if (m_edesc->f_uninit != NULL)
      (void) m_edesc->f_uninit (&m_entry);
  }
};
using freelist_type = freelist<freelist_node_data>;
using free_node_type = typename freelist_type::free_node;
```

The free_node template adds a tran::reclaimable_node base plus a freelist *m_owner back-pointer plus the payload (see cubrid-lockfree-freelist.md). The hashmap therefore stores pointers to T m_entry in the buckets, and converts back to the enclosing free_node when retiring:

```cpp
// hashmap<K,T>::to_free_node — src/base/lockfree_hashmap.hpp:552
free_node_type *to_free_node (T *p)
{
  // not nice, but necessary until we fully refactor lockfree hashmap
  const std::ptrdiff_t off = free_node_offset_of_data (free_node_type ());
  char *cp = (char *) p;
  cp -= off;
  return (free_node_type *) cp;
}
```

The free_node_offset_of_data constexpr helper computes the offset of m_entry within the free_node so the back-cast is correct regardless of compiler padding. The header comment acknowledges this is a workaround pending a deeper refactor — the “correct” design would have hashmap entries embed reclaimable_node directly rather than via the freelist’s generic free_node.

Both paths mutate T through byte offsets, not member access. The hashmap reads the next-pointer of an entry as:

```cpp
// hashmap<K,T>::get_nextp_ref — src/base/lockfree_hashmap.hpp:530
T *&get_nextp_ref (T *p)
{
  return * (T **) get_ref (p, m_edesc->of_next);
}

// hashmap<K,T>::get_keyp — src/base/lockfree_hashmap.hpp:516
void *get_keyp (T *p)
{
  return get_ptr (p, m_edesc->of_key);
}
```

This is the design that makes lf_entry_descriptor the single shared type. Both implementations pun a T* into a char*, add a byte offset, cast back to the field type, and access. The template parameter T does not appear in the access path — it is a documentation-only type bound. This is what lets a legacy LK_RES (lock-resource) struct be stored unchanged when the consumer migrates from lf_hash_* to lockfree::hashmap<OID, LK_RES>.
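
The accessors underneath are presumably plain byte arithmetic; a sketch of their likely shape (the real get_ref / get_ptr are private members of hashmap):

```cpp
// Sketch of the byte-offset helpers behind get_nextp_ref / get_keyp.
static void *get_ptr (void *p, unsigned int offset)
{
  return (void *) (((char *) p) + offset);   // entry base + field offset
}

static void *get_ref (void *p, unsigned int offset)
{
  return get_ptr (p, offset);                // caller casts to the field type, e.g. T **
}
```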

```mermaid
flowchart TB
  Caller["Consumer<br/>(lock_manager, session, ...)"]
  Bridge["cubthread::lockfree_hashmap&lt;K,T&gt;<br/>m_type ∈ {OLD, NEW, UNKNOWN}"]
  Caller --> Bridge

  subgraph Legacy["m_type = OLD"]
    Cpp["lf_hash_table_cpp&lt;K,T&gt;<br/>(template, in lock_free.h)"]
    LFH["lf_hash_table"]
    LFFL["lf_freelist"]
    LFTS["LF_TRAN_SYSTEM (per-consumer global)"]
    LFTE["LF_TRAN_ENTRY[]"]
    Cpp --> LFH
    Cpp --> LFFL
    LFFL --> LFTS
    LFTS --> LFTE
  end

  subgraph Modern["m_type = NEW"]
    HM["lockfree::hashmap&lt;K,T&gt;"]
    FL["lockfree::freelist&lt;freelist_node_data&gt;"]
    TT["lockfree::tran::table"]
    TS["lockfree::tran::system<br/>(server-wide)"]
    HM --> FL
    FL --> TT
    TT --> TS
  end

  Bridge --> Cpp
  Bridge --> HM

  PRM["PRM_ID_ENABLE_NEW_LFHASH<br/>(read at init time)"]
  PRM --> Bridge

  Edesc["lf_entry_descriptor<br/>(shared by both)"]
  Edesc --> Cpp
  Edesc --> HM
```

The hashmap defers to three private chain primitives:

  • list_find — read-only walk.
  • list_insert_internal — find-or-insert with optional given-entry, optional duplicate callback, optional behavior flags.
  • list_delete — Harris-Michael two-CAS unlink with optional pre-locked entry.

Each is wrapped in an outer loop (hash_insert_internal, hash_erase_internal) that retries on the LF_LIST_BR_RESTARTED flag. The behavior flags LF_LIST_BF_* and response flags LF_LIST_BR_* are taken verbatim from the legacy header — the modern path reuses these constants rather than reinventing.
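
The outer wrappers are thin retry loops. A sketch of the shape (the real hash_insert_internal's signature and flag plumbing may differ):

```cpp
// Sketch of the restart loop that wraps a chain primitive.
bool hash_insert_internal_sketch (tran::index tran_index, Key &key, T *&entry,
                                  int behavior_flags)
{
  bool inserted;
  int bflags;
  do
    {
      bflags = behavior_flags;
      T *&bucket_head = get_bucket (key);   // re-derefs the bucket on every restart
      inserted = list_insert_internal (tran_index, bucket_head, key, &bflags, entry);
    }
  while ((bflags & LF_LIST_BR_RESTARTED) != 0);
  return inserted;
}
```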

```cpp
// hashmap<K,T>::list_insert_internal (abridged) — src/base/lockfree_hashmap.hpp:899
bool list_insert_internal (tran::index tran_index, T *&list_head, Key &key,
                           int *behavior_flags, T *&entry)
{
  pthread_mutex_t *entry_mutex = NULL;
  T **curr_p; T *curr;
  tran::descriptor &tdes = get_tran_descriptor (tran_index);
  while (restart_search) {
    start_tran_force (tdes); // bracket open
    curr_p = &list_head;
    curr = address_type::atomic_strip_address_mark (*curr_p);
    while (curr_p != NULL) {
      if (curr != NULL) {
        if (m_edesc->f_key_cmp (&key, get_keyp (curr)) == 0) {
          // FOUND duplicate key
          if (m_edesc->using_mutex) {
            lock_entry_mutex (*curr, entry_mutex);
            end_tran_force (tdes); // mutex supersedes tran
            if (address_type::is_address_marked (get_nextp_ref (curr))) {
              // raced with delete — restart
              unlock_entry_mutex_force (entry_mutex);
              return false; // RESTARTED flag set
            }
          }
          if (m_edesc->f_duplicate != NULL) {
            // restart on duplicate (the only current behavior)
            return false;
          }
          if (LF_LIST_BF_IS_FLAG_SET (behavior_flags, LF_LIST_BF_INSERT_GIVEN))
            freelist_retire (tdes, entry); // give up the speculative entry
          if (!LF_LIST_BF_IS_FLAG_SET (behavior_flags, LF_LIST_BF_FIND_OR_INSERT)) {
            // straight insert: failure
            return false;
          }
          entry = curr; // hand back found entry
          return false; // not inserted, but found
        }
        // advance to next
        curr_p = &get_nextp_ref (curr);
        curr = address_type::strip_address_mark (*curr_p);
      } else {
        // END OF CHAIN — insert here
        if (entry == NULL) {
          entry = freelist_claim (tdes); // speculative claim
          if (m_edesc->f_key_copy (&key, get_keyp (entry)) != NO_ERROR)
            return false;
        }
        if (m_edesc->using_mutex) lock_entry_mutex (*entry, entry_mutex);
        if (!ATOMIC_CAS_ADDR (curr_p, (T *) NULL, entry)) {
          // somebody linked first — save speculative for next round and restart
          if (m_edesc->using_mutex) unlock_entry_mutex_force (entry_mutex);
          if (!LF_LIST_BF_IS_FLAG_SET (behavior_flags, LF_LIST_BF_INSERT_GIVEN))
            save_temporary (tdes, entry);
          LF_LIST_BR_SET_FLAG (behavior_flags, LF_LIST_BR_RESTARTED);
          end_tran_force (tdes);
          return false;
        }
        if (m_edesc->using_mutex) end_tran_force (tdes); // mutex supersedes tran
        return true; // INSERTED
      }
    }
  }
}
```

The insertion is a single CAS on the predecessor’s next-pointer. The dance around it is:

  • save_temporary on race-loss. If the CAS-link fails because a sibling thread linked first, the speculatively-claimed entry is parked on the descriptor (tdes.save_reclaimable) so the next iteration of the outer while reuses it. This is the freelist round-trip avoidance.
  • f_duplicate callback path. If a per-consumer “what to do on duplicate key” callback is registered, the insert calls it (e.g., to bump a session-id counter and request a new key) and then sets RESTARTED to retry from a different bucket.
  • Mutex supersedes tran. When the entry has a mutex, the insert grabs the mutex before CAS-linking and ends the transaction immediately after the link succeeds — the mutex is now sufficient protection for the payload, and ending the tran early frees the descriptor for other ops.

```cpp
// hashmap<K,T>::list_delete (abridged) — src/base/lockfree_hashmap.hpp:1111
bool list_delete (tran::index tran_index, T *&list_head, Key &key,
                  T *locked_entry, int *behavior_flags)
{
  while (restart_search) {
    start_tran_force (tdes);
    curr_p = &list_head;
    curr = address_type::atomic_strip_address_mark (*curr_p);
    while (curr != NULL) {
      if (m_edesc->f_key_cmp (&key, get_keyp (curr)) == 0) {
        if (locked_entry != NULL && locked_entry != curr) {
          // erase_locked semantics: the entry I held the lock on was
          // already retired; current entry is a *different* one with the
          // same key. Bail.
          return false;
        }
        next_p = &get_nextp_ref (curr);
        next = address_type::strip_address_mark (*next_p);
        // STEP 1: mark next pointer
        if (!ATOMIC_CAS_ADDR (next_p, next, address_type::set_adress_mark (next))) {
          // raced — restart
          if (LF_LIST_BR_RESTARTED) return false;
          break;
        }
        if (m_edesc->using_mutex) {
          if (LF_LIST_BF_IS_FLAG_SET (behavior_flags, LF_LIST_BF_LOCK_ON_DELETE))
            lock_entry_mutex (*curr, entry_mutex);
          else
            entry_mutex = get_pthread_mutexp (curr); // already locked
        }
        // STEP 2: unlink
        if (!ATOMIC_CAS_ADDR (curr_p, curr, next)) {
          // raced; remove our mark and restart
          if (m_edesc->using_mutex && LF_LIST_BF_IS_FLAG_SET (..., LF_LIST_BF_LOCK_ON_DELETE))
            unlock_entry_mutex_force (entry_mutex);
          ATOMIC_CAS_ADDR (next_p, address_type::set_adress_mark (next), next);
          if (LF_LIST_BR_RESTARTED) return false;
          break;
        }
        if (m_edesc->using_mutex) unlock_entry_mutex_force (entry_mutex);
        promote_tran_force (tdes); // bump global id (retire epoch)
        freelist_retire (tdes, curr); // append to retired ring
        end_tran_force (tdes);
        return true;
      }
      curr_p = &get_nextp_ref (curr);
      curr = address_type::strip_address_mark (*curr_p);
    }
  }
}
```

The two CASes are exactly the Harris-Michael protocol:

  • Mark. next_p from next to mark(next). Failure means a sibling thread is also trying to delete the same node (or has already done step 1).
  • Unlink. curr_p from curr to next (unmarked). Failure means the predecessor pointer changed under us — another insert or delete moved past curr. Recovery: undo the mark and restart.

The retire is delayed until after the unlink CAS succeeds. This is what makes m_retire_tranid strictly greater than any concurrent reader’s snapshot — by the time we call promote_tran_force (which bumps the global id and writes it to our descriptor), the chain no longer contains curr, so any future reader cannot see it via the bucket. Existing readers holding curr directly still complete safely because their snapshot’s tranid <= our pre-promotion tranid < our_retire_tranid, and reclamation only fires when min_active > retire_tranid.

```cpp
// hashmap<K,T>::find (abridged) — src/base/lockfree_hashmap.hpp:280
T *find (tran::index tran_index, Key &key)
{
  while (restart) {
    T *&list_head = get_bucket (key);
    entry = NULL;
    bflags = LF_LIST_BF_RETURN_ON_RESTART;
    list_find (tran_index, list_head, key, &bflags, entry);
    restart = (bflags & LF_LIST_BR_RESTARTED) != 0;
  }
  return entry;
}
```

find always sets LF_LIST_BF_RETURN_ON_RESTART so the chain walk returns to the bucket-array level on restart, recomputing the hash. This matters when the only inflight change to the chain is unrelated (a different key being inserted/deleted) — a clean restart from get_bucket re-derefs the bucket pointer and is robust to a clear having swapped the bucket array underneath.

hashmap::clear is the only non-lock-free public operation:

```cpp
// hashmap<K,T>::clear (abridged) — src/base/lockfree_hashmap.hpp:378
void clear (tran::index tran_index)
{
  tran::descriptor &tdes = get_tran_descriptor (tran_index);
  std::unique_lock<std::mutex> ulock (m_backbuffer_mutex);

  // STEP 1: atomically swap m_buckets <-> m_backbuffer
  T **old_buckets = ATOMIC_TAS_ADDR (&m_buckets, m_backbuffer);

  // STEP 2: reset what is now the new backbuffer (was the old m_buckets)
  for (size_t i = 0; i < m_size; ++i) {
    assert (m_backbuffer[i] == address_type::set_adress_mark (NULL));
    m_buckets[i] = NULL;
  }
  m_backbuffer = old_buckets;

  // STEP 3: retire every entry in old buckets, link by link
  for (size_t i = 0; i < m_size; ++i) {
    do {
      curr = address_type::strip_address_mark (old_buckets[i]);
    } while (!ATOMIC_CAS_ADDR (&old_buckets[i],
                               curr, address_type::set_adress_mark (NULL)));
    while (curr != NULL) {
      next_p = &get_nextp_ref (curr);
      do {
        next = address_type::strip_address_mark (*next_p);
      } while (!ATOMIC_CAS_ADDR (next_p, next, address_type::set_adress_mark (next)));
      if (m_edesc->using_mutex) {
        lock_entry_mutex (*curr, mutex_p); // wait for any in-flight reader
        unlock_entry_mutex_force (mutex_p);
      }
      freelist_retire (tdes, curr); // append to retired ring
      curr = next;
    }
  }
}
```

Three observations:

  • The bucket-array swap is atomic. Concurrent finds may briefly observe m_buckets pointing at the old or the new array; the invariant is that whatever they point at, the chain walk succeeds (modulo restart-on-CAS-fail) because every chain link is still valid.
  • The back-buffer is pre-marked with set_adress_mark (NULL) so that any thread looking at the back-buffer’s bucket head sees a marked-NULL chain head and treats it as empty. This is the back-buffer mutex’s invariant — under it, the back-buffer is always all-marked-NULL.
  • Per-entry mutex acquisition during retirement is the wait point for in-flight readers. A reader who found the entry before clear started will hold the entry’s mutex; clear serializes against that mutex, ensuring the reader is done before we retire.

Cost: O(buckets × chain_length) atomic operations + N pthread mutex lock/unlock cycles + descriptor retire-ring growth. On a heavily-loaded table this is the worst single operation in the hashmap surface; consumers who clear (sessions on logout, lock manager on shutdown, XASL cache on plan-cache reset) accept it as a rare event.

The iterator (hashmap::iterator) walks bucket-by-bucket, chain-by-chain, with a transaction held open per bucket. It auto-releases the transaction at end-of-bucket and reacquires at the next. The pattern is deliberately conservative — long iterations would otherwise keep the descriptor’s m_tranid low and prevent reclamation across the whole server.

```cpp
// hashmap<K,T>::iterator::iterate (abridged) — src/base/lockfree_hashmap.hpp:1343
T *iterate ()
{
  do {
    if (m_curr != NULL) {
      if (m_hashmap->m_edesc->using_mutex)
        m_hashmap->unlock_entry (*m_curr); // release per-entry mutex
      m_curr = address_type::strip_address_mark (m_hashmap->get_nextp_ref (m_curr));
    } else {
      if (m_bucket_index != INVALID_INDEX) m_tdes->end_tran ();
      m_tdes->start_tran ();
      m_bucket_index++;
      if (m_bucket_index >= m_hashmap->m_size) {
        m_tdes->end_tran ();
        return NULL; // end
      }
      m_curr = address_type::atomic_strip_address_mark (m_hashmap->m_buckets[m_bucket_index]);
    }
    if (m_curr != NULL && m_hashmap->m_edesc->using_mutex) {
      m_hashmap->lock_entry (*m_curr);
      if (address_type::is_address_marked (m_hashmap->get_nextp_ref (m_curr)))
        continue; // deleted in flight, skip
    }
  } while (m_curr == NULL);
  return m_curr;
}
```
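
Caller-side, the iterator is driven by repeated iterate () calls until NULL. A usage sketch; the constructor arguments and the entry type names are assumptions:

```cpp
// Hypothetical iteration over every live entry of a table.
lockfree::hashmap<MY_KEY, MY_ENTRY>::iterator it (tran_index, my_map);
for (MY_ENTRY *e = it.iterate (); e != NULL; e = it.iterate ())
  {
    // with using_mutex == 1, e's mutex is held while e is the current entry
    inspect_entry (e);   // hypothetical consumer work
  }
```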

Every public method is wrapped by a ct_stat_type::autotimer that increments a counter and accumulates a duration into one of m_stat_find, m_stat_insert, m_stat_erase, m_stat_unlock, m_stat_clear, m_stat_iterates, m_stat_claim, m_stat_retire. The autotimer is short-circuited unless m_active_stats == true, which is toggled per-table by activate_stats() / deactivate_stats(). The default is off — production servers do not pay the stats cost unless an operator explicitly enables it.

The legacy lf_hash_* family does the same Harris-Michael walk in C, with LF_TRAN_ENTRY * instead of tran::descriptor and the global LF_TRAN_SYSTEMs instead of per-table tables. The key procedural calls:

  • lf_hash_init builds the bucket array and the back-buffer, registers entry_desc with the freelist.
  • lf_hash_find calls lf_list_find which is the legacy counterpart of hashmap::list_find.
  • lf_hash_insert calls lf_hash_insert_internal → lf_list_insert_internal.
  • lf_hash_delete calls lf_hash_delete_internal → lf_list_delete.
  • lf_hash_clear walks the back-buffer the same way the modern clear does.

The same LF_LIST_BF_* flags drive both. The same back-buffer layout (backbuffer[i] == ADDR_WITH_MARK (NULL)) is invariant.

The legacy stack additionally exposes:

  • lf_io_list_find, lf_io_list_find_or_insert — an insert-only singly-linked list with no deletes and no freelist. Used by code paths that need a thread-safe append-only set (currently a small number of internal call sites in the legacy stack; the modern path has no equivalent).
  • lf_stack_push, lf_stack_pop — Treiber stack. Documented as ABA-vulnerable; not currently used by any in-engine consumer (kept for compile-link compatibility with potential external callers).

The legacy code predates C++17 in CUBRID. The modern implementation was introduced as part of the C++17 modernization with two stated goals:

  1. Type safety. No more void* on the public surface; the lf_entry_descriptor byte-offsets remain but are confined to the implementation.
  2. Per-table reclamation. The 11 global LF_TRAN_SYSTEM instances are gone. Each lockfree::hashmap builds its own tran::table from a server-wide tran::system; reclamation pressure on the lock manager’s hash does not throttle reclamation on the session table.

The migration is opt-in via system parameter, not a flag day. The bridge exists so each consumer can flip independently without rewriting call sites. The migration completion bar is zero call sites left on lf_hash_table_cpp directly — every consumer hits the new path through the bridge, and the PRM_ID_ENABLE_NEW_LFHASH toggle becomes the single decision for the whole server.

As of 2026-05-07 the toggle’s default is unverified in this doc (see Open Question 1). Empirically the bridge dispatch — the macro lockfree_hashmap_forward_func — is the most-executed code in the hashmap, hit more often than the chain walks themselves, because every public call goes through it.

| Consumer | Source | Entry struct | Generation toggle |
| --- | --- | --- | --- |
| Object-lock resource hash | src/transaction/lock_manager.c | LK_RES | enable_new_lfhash |
| Object-lock entry hash | src/transaction/lock_manager.c | LK_ENTRY | enable_new_lfhash |
| Session table | src/session/session.c | SESSION_STATE | enable_new_lfhash |
| XASL cache | src/query/xasl_cache.c | XASL_CACHE_ENTRY | enable_new_lfhash |
| Filter-predicate cache | src/query/filter_pred_cache.c | FILTER_PRED_CACHE_ENTRY | enable_new_lfhash |
| System catalog | src/storage/system_catalog.c | CATCLS_ENTRY | enable_new_lfhash |
| Slotted-page saving | src/storage/slotted_page.c | SPAGE_SAVE_ENTRY | enable_new_lfhash |
| HFID table | src/storage/heap_file.c | HFID_TABLE_ENTRY | enable_new_lfhash |
| Global unique stats | src/storage/btree.c | LOG_LSA / *unique_stats* | enable_new_lfhash |
| DWB slot table | src/storage/double_write_buffer.cpp | DWB_SLOT_HASH_ENTRY | enable_new_lfhash |
| Sort-list (legacy only) | src/query/sort.c | SORT_LIST_ENTRY | (legacy only) |

Each consumer defines its lf_entry_descriptor once (typically as a static const at file scope), constructs its cubthread::lockfree_hashmap<K,T> member, and calls init() once at boot.
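
The boot-time wiring, in sketch form; every MY_* name is a placeholder, and the init signature is the bridge's, quoted earlier:

```cpp
// Hypothetical consumer module wiring its table once at boot.
class my_consumer
{
  public:
    void boot (lf_tran_system &legacy_transys)
    {
      // one-shot sizing: no grow, shrink, or rehash after this point
      m_map.init (legacy_transys, MY_ENTRY_IDX,
                  MY_HASH_SIZE, MY_FREELIST_BLOCK_SIZE, MY_FREELIST_BLOCK_COUNT,
                  my_entry_desc);
    }

  private:
    cubthread::lockfree_hashmap<MY_KEY, MY_ENTRY> m_map;
};
```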

  • lockfree::hashmap<K,T> — class template, lockfree_hashmap.hpp:42. The .cpp (25 lines) is a stub.
  • hashmap::find — :280
  • hashmap::find_or_insert — :301
  • hashmap::insert — :308
  • hashmap::insert_given — :315
  • hashmap::erase — :325
  • hashmap::erase_locked — :345
  • hashmap::unlock — :360
  • hashmap::clear — :378
  • hashmap::freelist_claim — :656 (and :663 for the tran::descriptor & overload)
  • hashmap::freelist_retire — :579 (and :646 for the tran::descriptor & overload)
  • hashmap::list_find — :813
  • hashmap::list_insert_internal — :899
  • hashmap::list_delete — :1111
  • hashmap::hash_insert_internal — :1269
  • hashmap::hash_erase_internal — :1305
  • hashmap::iterator::iterate — :1343
  • LF_HASH_TABLE struct — lock_free.h:299
  • LF_ENTRY_DESCRIPTOR struct — lock_free.h:62
  • lf_hash_init — lock_free.c (defined later in the file)
  • lf_hash_find — lock_free.c
  • lf_hash_find_or_insert — lock_free.c
  • lf_hash_insert / lf_hash_insert_given — lock_free.c
  • lf_hash_delete / lf_hash_delete_already_locked — lock_free.c
  • lf_hash_clear — lock_free.c
  • lf_list_find / lf_list_insert_internal / lf_list_delete — internal chain walks (lock_free.c)
  • lf_io_list_find / lf_io_list_find_or_insert — insert-only list (lock_free.c:1077+)
  • lf_stack_push / lf_stack_pop — Treiber stack (lock_free.c:555+)
  • lf_freelist_init / lf_freelist_claim / lf_freelist_retire / lf_freelist_transport — lock_free.c:676+
  • lf_hash_table_cpp<K,T> — C++ shim, lock_free.h:367+
  • cubthread::lockfree_hashmap<K,T> — thread_lockfree_hash_map.hpp:35
  • init (toggles by parameter) — :129
  • init_as_old / init_as_new — :145, :155
  • All forwarded methods (find, insert, erase, …) — :171-:294, expanding lockfree_hashmap_forward_func macro
  • cubthread::get_thread_entry_lftransys — thread_lockfree_hash_map.cpp:27
| Symbol | File | Line |
| --- | --- | --- |
| lockfree::hashmap<K,T> (class template) | src/base/lockfree_hashmap.hpp | 42 |
| hashmap::init | src/base/lockfree_hashmap.hpp | 261 |
| hashmap::find | src/base/lockfree_hashmap.hpp | 280 |
| hashmap::find_or_insert | src/base/lockfree_hashmap.hpp | 301 |
| hashmap::erase | src/base/lockfree_hashmap.hpp | 325 |
| hashmap::erase_locked | src/base/lockfree_hashmap.hpp | 345 |
| hashmap::clear | src/base/lockfree_hashmap.hpp | 378 |
| hashmap::list_find | src/base/lockfree_hashmap.hpp | 813 |
| hashmap::list_insert_internal | src/base/lockfree_hashmap.hpp | 899 |
| hashmap::list_delete | src/base/lockfree_hashmap.hpp | 1111 |
| hashmap::hash_insert_internal | src/base/lockfree_hashmap.hpp | 1269 |
| hashmap::iterator::iterate | src/base/lockfree_hashmap.hpp | 1343 |
| hashmap::freelist_node_data::on_reclaim | src/base/lockfree_hashmap.hpp | 92 |
| hashmap::to_free_node | src/base/lockfree_hashmap.hpp | 552 |
| lf_hash_table (struct) | src/base/lock_free.h | 299 |
| lf_entry_descriptor (struct) | src/base/lock_free.h | 62 |
| lf_hash_table_cpp<K,T> (template) | src/base/lock_free.h | 367 |
| lf_freelist_alloc_block | src/base/lock_free.c | 621 |
| lf_freelist_claim | src/base/lock_free.c | 760 |
| lf_freelist_retire | src/base/lock_free.c | 873 |
| lf_freelist_transport | src/base/lock_free.c | 934 |
| lf_io_list_find | src/base/lock_free.c | 1077 |
| lf_io_list_find_or_insert | src/base/lock_free.c | 1131 |
| lf_stack_push / lf_stack_pop | src/base/lock_free.c | 554 / 584 |
| cubthread::lockfree_hashmap<K,T> | src/thread/thread_lockfree_hash_map.hpp | 35 |
| cubthread::lockfree_hashmap::init | src/thread/thread_lockfree_hash_map.hpp | 129 |
| cubthread::lockfree_hashmap::init_as_new | src/thread/thread_lockfree_hash_map.hpp | 155 |
| cubthread::lockfree_hashmap::init_as_old | src/thread/thread_lockfree_hash_map.hpp | 145 |
| cubthread::get_thread_entry_lftransys | src/thread/thread_lockfree_hash_map.cpp | 27 |
  • The bridge dispatches at init() only — there is no per-call toggle. cubthread::lockfree_hashmap::init reads prm_get_bool_value (PRM_ID_ENABLE_NEW_LFHASH) exactly once and writes m_type = OLD or NEW. Subsequent calls dispatch on m_type. Verified at thread_lockfree_hash_map.hpp:129-141.
  • Both implementations consume the same lf_entry_descriptor type. The struct is defined once in lock_free.h:62, used by lf_hash_* and by lockfree::hashmap<K,T>::init (which takes a lf_entry_descriptor & argument). Verified by reading both signatures.
  • Modern hashmap stores entries via freelist_node_data wrapping. Each entry is freelist<freelist_node_data>::free_node, carrying a tran::reclaimable_node base, a back-pointer to the freelist owner, and the payload T m_entry. Verified at lockfree_hashmap.hpp:84-100. The header acknowledges the to_free_node offset hack as workaround pending refactor.
  • Per-entry mutex is optional, driven by lf_entry_descriptor::using_mutex. Verified at lockfree_hashmap.hpp:777 (assert (m_edesc->using_mutex) in lock_entry_mutex) and at :362-371 where unlock branches on m_edesc->using_mutex.
  • f_duplicate callback currently always restarts. list_insert_internal at lockfree_hashmap.hpp:962-976: when f_duplicate is non-NULL, it is called for its side effect (e.g., bumping a counter), then LF_LIST_BR_RESTARTED is set and the search restarts with a new key. The #if 1 block masks an alternative no-restart path that is currently dead; the comment explains this is intentional pending a wider review.
  • address_marker<T>::set_adress_mark is the typo spelling used throughout. lockfree_address_marker.hpp:49. Both call sites and the header agree on the typo.
  • The clear’s back-buffer is pre-marked with set_adress_mark (NULL). Verified at lockfree_hashmap.hpp:272-275 (init) and :393 (clear’s invariant assert).
  • Statistics default off; toggled per-table by activate_stats()/deactivate_stats(). Verified at lockfree_hashmap.hpp:632-642. m_active_stats defaults to false (initialized in the default-constructor at :219).
  1. Default value of enable_new_lfhash. Not verified in this doc. Investigation: read prm_def[] in system_parameter.c and check the release-notes git history for the toggle. If the default is false, the production reality is the legacy path; the modern path is staged but not running.
  2. Are any lf_hash_table_cpp<K,T> direct call sites left? The bridge’s existence assumes every consumer uses cubthread::lockfree_hashmap. If a consumer still instantiates lf_hash_table_cpp directly, that consumer is pinned to the legacy path. Investigation: grep lf_hash_table_cpp< and confirm every match is inside cubthread::lockfree_hashmap::m_old_hash.
  3. Why does the modern path keep of_local_next in the descriptor? The field is legacy-only; the modern hashmap stores entries inside free_node and uses tran::reclaimable_node::m_retired_next for the retired ring. The descriptor does not need of_local_next. The field is retained, presumably, so a single descriptor object can drive both paths during the migration; once the migration completes, of_local_next, of_del_tran_id, f_alloc, f_free, max_alloc_cnt are all dead and can be purged.
  4. f_duplicate semantics — is the restart-only behavior the final design? The #if 1 / #else / #endif block at lockfree_hashmap.hpp:972-1005 keeps two alternatives in the tree. If the no-restart path is intended for a future feature (incrementing a counter without reissuing the key), the dead branch should be either revived or removed.
  5. clear while iterators are open. The clear acquires m_backbuffer_mutex and atomic-swaps the bucket array; an open iterator holding m_curr from the old bucket array continues walking into retired memory until reclamation fires. Reclamation is bounded by the iterator’s m_tdes bracket, so the entries are still alive — but the m_curr pointer no longer reflects the live table. Investigation: are clear-vs-iterator races documented? If not, the iterator class needs a “table generation” counter check.

## Beyond CUBRID — Comparative Designs & Research Frontiers

  • Split-ordered hashing (Shalev & Shavit 2003). A single recursive lock-free linked list with bucket pointers as shortcuts; supports lock-free resize. Adopting it would let the hashmap grow under contention rather than being fixed at init. Cost: more complex chain pointers and a different invariant on bucket-array semantics.
  • Folly’s ConcurrentHashMap. Hazard-pointer-based reclamation, simpler API, no per-entry mutex. A side-by-side micro-benchmark on the lock manager’s resource hash would tell us whether the per-entry mutex is a real cost or a real benefit.
  • Cuckoo hashing. Open-addressing alternative with O(1) worst-case lookup. Lock-free variants exist (Li et al. EuroSys 2014). Worth comparing on read-mostly workloads (XASL cache).
  • Eliminate the legacy path. The most concrete frontier is retiring lf_hash_* and the bridge. Steps: confirm enable_new_lfhash defaults to true on a recent release, burn-in for one release cycle, then delete lock_free.{h,c} and the m_old_hash field of the bridge.
  • Server-wide rather than per-consumer hash tables. Today every consumer instantiates its own cubthread::lockfree_hashmap. A future refactor could expose a server-wide registry indexed by consumer name, reducing boot-time setup boilerplate and centralizing tuning.
  • Replace f_hash callbacks with std::hash<K> traits. C++17 has the machinery; the only blocker is that lf_entry_descriptor is a C-shaped struct shared with legacy code. Once the legacy path is gone, the descriptor becomes a C++ class and f_hash becomes a template parameter.
  • Source files (in the CUBRID source tree at /data/hgryoo/references/cubrid/):
    • src/base/lock_free.h, src/base/lock_free.c (legacy procedural API + the lf_hash_table_cpp<K,T> template shim)
    • src/base/lockfree_hashmap.{hpp,cpp} (modern templated class)
    • src/base/lockfree_address_marker.hpp (pointer-mark helper)
    • src/base/lockfree_freelist.hpp (modern entry storage)
    • src/base/lockfree_transaction_descriptor.hpp (read / retire bracketing)
    • src/thread/thread_lockfree_hash_map.{hpp,cpp} (the bridge cubthread::lockfree_hashmap<K,T>)
  • Consumer source files (sampled):
    • src/transaction/lock_manager.c (object-lock resource & entry tables — primary canary for this analysis)
    • src/session/session.c
    • src/query/xasl_cache.c
    • src/query/filter_pred_cache.c
    • src/storage/system_catalog.c
    • src/storage/slotted_page.c
    • src/storage/double_write_buffer.cpp
  • Textbook chapters cited:
    • Herlihy & Shavit, The Art of Multiprocessor Programming, ch. 9 §“Lock-Free Linked Lists” (Harris-Michael), ch. 13 §“Concurrent Hashing” (open vs. closed addressing, striping vs. lock-free chains).
    • Williams, C++ Concurrency in Action, ch. 7 §“Designing lock-free concurrent data structures.”
  • Foundational papers:
    • Harris, T. L. “A Pragmatic Implementation of Non-Blocking Linked-Lists.” DISC 2001 — the mark-bit-on-next protocol used by both CUBRID generations.
    • Michael, M. M. “High Performance Dynamic Lock-Free Hash Tables and List-Based Sets.” SPAA 2002.
    • Shalev, O., Shavit, N. “Split-Ordered Lists: Lock-Free Extensible Hash Tables.” JACM 53(3), 2006.
    • Michael, M. M. “Hazard Pointers: Safe Memory Reclamation for Lock-Free Objects.” IEEE TPDS 15(8), 2004 — the alternative reclamation model CUBRID does not use.
  • Sibling docs in this repo:
    • cubrid-lockfree-overview.md — umbrella, two-generation map.
    • cubrid-lockfree-transaction.md — the reclamation framework the modern hashmap rides on.
    • cubrid-lockfree-freelist.md — the entry storage layer.
    • cubrid-lockfree-bitmap.md — the index allocator beneath tran::system.
    • cubrid-lock-manager.md — the consumer that exercises every code path of this hashmap.
    • cubrid-thread-worker-pool.md — defines cubthread::entry::get_lf_tran_index() used by the bridge forwarders.