CUBRID Lock-Free Hash Map — Legacy C, Modern C++, the Bridge, and the Consumers
Contents:
- Theoretical Background
- Common DBMS Design
- CUBRID’s Approach
- Source Walkthrough
- Source verification (as of 2026-05-07)
- Beyond CUBRID — Comparative Designs & Research Frontiers
- Sources
Theoretical Background
A concurrent hash table is the most-used lock-free data structure in a DBMS. Lock managers, page-buffer maps, session caches, plan caches, statistics tables — every per-key resource that wants O(1) lookup under contention picks some form of concurrent hash. The textbook design space (Herlihy & Shavit, The Art of Multiprocessor Programming, ch. 13 §“Concurrent Hashing”, and Shalev & Shavit, JACM 53(3) 2006) has three axes:
- Open vs. closed addressing. Open addressing (linear / quadratic probing, Robin Hood) places entries directly in the bucket array — cache-friendly but resize-unfriendly. Closed addressing (chained hashing) places entries in per-bucket linked lists — easy to grow incrementally because chains absorb collisions.
- Static vs. resizable. A static table has a fixed bucket count. A resizable table grows when the load factor exceeds a threshold — but lock-free resize is hard, requiring either split-ordered bucket pointers (Shalev & Shavit) so that the chain itself does not need re-linking, or stop-the-world re-hash (which sacrifices the lock-free property during resize).
- What guards the chains? A coarse table mutex (PostgreSQL’s `LockHashPartitionLock`), per-bucket striping (Java’s pre-Java-8 `ConcurrentHashMap`), or fully lock-free chains via the Harris–Michael protocol (Harris 2001, Michael 2002).
CUBRID picks closed addressing + static buckets + Harris–Michael
chains. The cost is no automatic resize — the bucket array is
fixed at init time. The benefit is the implementation can be
small and the chain operations are textbook.
Harris–Michael chain operations
The protocol rests on three rules for each chain node:
- The `next` pointer’s low bit is reserved as a delete mark.
- Insert is a single CAS: starting from the predecessor’s `next`, CAS from `NULL` to the new node. If the predecessor already has a `next`, the inserter walks past and tries again. Insertion of an already-present key is a no-op (or a “found” return for find-or-insert semantics).
- Delete is a two-CAS protocol (see the sketch after this list):
  - Mark. CAS `victim->next` from `next_ptr` to `mark(next_ptr)` — sets the low bit. After this, the victim is logically deleted: any concurrent traverser sees the mark and treats the victim as gone, but the victim is still reachable from `predecessor->next`.
  - Unlink. CAS `predecessor->next` from `victim` to `victim->next` (the unmarked one) — physically removes the victim from the chain. If this CAS fails (someone else mutated the chain between mark and unlink), the unlinker either retries (rare) or restarts the search (cheaper).
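A minimal self-contained sketch of the two-CAS delete, assuming a node type with an atomic next pointer whose low bit is the delete mark. The names (`node`, `mark`, `unmark`, `delete_node`) are illustrative, not CUBRID’s; the undo-the-mark recovery follows CUBRID’s variant shown later in `list_delete`.

```cpp
#include <atomic>
#include <cstdint>

struct node
{
  int key {0};
  std::atomic<node *> next {nullptr};
};

static node *mark (node *p)   { return (node *) ((std::uintptr_t) p | 1); }
static node *unmark (node *p) { return (node *) ((std::uintptr_t) p & ~(std::uintptr_t) 1); }

// Returns true if this thread deleted the victim. On any failed CAS the
// caller restarts the search from the bucket head.
bool delete_node (std::atomic<node *> &pred_next, node *victim)
{
  node *next = unmark (victim->next.load ());

  // STEP 1: mark — logical delete; traversers now treat the victim as gone
  if (!victim->next.compare_exchange_strong (next, mark (next)))
    {
      return false;     // already marked, or the chain moved: restart
    }

  // STEP 2: unlink — physical removal from the chain
  node *expected = victim;
  if (!pred_next.compare_exchange_strong (expected, next))
    {
      // predecessor changed under us: undo the mark (CUBRID's variant, as
      // list_delete below shows) and restart the search
      node *marked = mark (next);
      victim->next.compare_exchange_strong (marked, next);
      return false;
    }
  return true;          // unlinked; safe to retire, not yet safe to free
}
```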
Two side conditions:
- Restart on staleness. Every chain operation may need to restart from the bucket head if a CAS fails. This is the `LF_LIST_BR_RESTARTED` signal in the legacy code and the `restart_search` boolean in the modern code.
- Reclamation. A logically-deleted but not-yet-physically-unlinked node must remain dereferenceable until every reader who could have reached it via `predecessor->next` has finished. This is what the transactional reclamation framework (`lockfree::tran::*`) provides.
ABA hazard, recovery via reclamation
The classical ABA scenario for chained hashing: a thread reads `predecessor->next == A` and plans to CAS-mark `A->next`. A sibling thread deletes A, then re-allocates a new A' at the same address (e.g., via the freelist) and inserts A' somewhere — now `predecessor->next` points to A' (or further along the chain), not A. The original thread’s CAS-mark on the saved A pointer either succeeds and silently corrupts the `next` of A' (use-after-free plus corruption) or fails because the address changed (failure is the benign outcome; the silent success is the dangerous one).
The transactional reclamation framework prevents this by ensuring
that as long as the original thread holds an open descriptor on
the table, no node it could have observed is reclaimed — including
A. The CAS-mark either succeeds on the original A (which is
still in the chain, even if logically deleted) or fails because
the chain shape changed. There is no second-allocation aliasing.
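A hypothetical interleaving (threads T1/T2, address invented) that makes the hazard concrete, followed by why the descriptor bracket rules it out:

```cpp
//   T1: reads pred->next == A (A at address 0x1000), saves the pointer
//   T2: marks + unlinks A, retires it; the freelist recycles 0x1000 as A'
//   T2: inserts A' (same address, different key) after pred
//   T1: CAS-mark on 0x1000 succeeds — and corrupts the next pointer of A'
//
// With transactional reclamation, T1 opened a descriptor before reading
// pred->next, so A cannot be handed back to the freelist until T1 closes
// its bracket: T2's recycling step above cannot happen, and T1's CAS lands
// on the original A (or fails harmlessly).
```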
Hash function and bucket distribution
CUBRID delegates the hash function to the per-table descriptor’s
f_hash callback, which takes a key and a bucket count and
returns the bucket index. This is per-consumer:
- Lock manager uses an OID-shaped hash.
- Page buffer (legacy `LF_HASH_TABLE`) uses a VPID hash: `(pageid | ((unsigned int)volid) << 24) % htsize` (lock_free.c:154) — see the sketch after this list.
- Session table uses session-id modulo bucket count.
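A sketch of a per-consumer hash callback shaped like the descriptor’s `f_hash` contract (key and bucket count in, bucket index out). The formula is the one quoted above from lock_free.c; the struct and function names here are illustrative.

```cpp
struct vpid
{
  int pageid;
  short volid;
};

unsigned int vpid_hash (void *key, int htsize)
{
  const vpid *p = (const vpid *) key;
  return ((unsigned int) p->pageid | ((unsigned int) p->volid << 24))
         % (unsigned int) htsize;
}
```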
The bucket count itself is a one-shot decision per consumer at
boot, picked from system parameters. There is no growth, no shrink,
no rehash. The only resize-shaped operation is clear, which
swaps in an empty back-buffer atomically — see §“Clear and
back-buffer.”
Common DBMS Design
Lock managers everywhere use a chained hash
PostgreSQL’s lock.c partitions a single hash table into
NUM_LOCK_PARTITIONS = 16 partitions, each guarded by an
LWLock. InnoDB’s lock_sys_t uses a hash array on
(space_id, page_no) pairs guarded by lock_sys->mutex (with
recent versions adding sharded mutexes). MySQL’s open-table cache
is a hash with per-table mutexes. The repeated pattern is
partition by hash, lock per partition.
CUBRID’s approach is structurally different — instead of a coarse
mutex per partition, every chain is fully lock-free, and the
optional using_mutex is per entry rather than per partition.
The trade-off:
- Pro. Find / insert / erase scale with the bucket count, degrading only with chain length; there is no fixed partition count to bottleneck on.
- Pro. The locking discipline is simpler — a thread holds at most one entry’s mutex at a time.
- Con. The transactional reclamation overhead is non-zero on every find (the `start_tran`/`end_tran` bracket).
- Con. `clear` is O(buckets × chain_length) because every entry must be retired — the back-buffer trick spreads the cost but does not eliminate it.
Optional per-entry mutex
A subtle CUBRID-specific design: some entries carry their own
pthread_mutex_t (declared via `lf_entry_descriptor::using_mutex = 1` and the `of_mutex` field offset). When set, find and erase grab
the entry’s mutex after finding it but before operating on
the payload, allowing the caller to subsequently release the
mutex with a separate unlock call. This is the pattern used by
the lock manager’s resource entries — a thread that finds a lock
resource needs to hold the resource entry’s mutex while
mutating its waiter list.
When using_mutex = 0, the find/erase result is only valid
inside the transaction bracket — the caller must not let the
descriptor’s m_tranid go to INVALID_TRANID while it still
holds a payload pointer.
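A sketch of the two usage modes, following the method shapes quoted in this doc; `map`, `idx`, `key`, and `LK_RES` stand in for a real consumer’s names.

```cpp
// using_mutex = 1: find returns with the entry's mutex already held; the
// caller mutates the payload, then releases it with the separate unlock call.
LK_RES *res = map.find (idx, key);
if (res != NULL)
  {
    // ... mutate the resource's waiter list under the entry mutex ...
    map.unlock (idx, res);
  }

// using_mutex = 0: the returned pointer is valid only inside the internal
// transaction bracket; it must not be cached once the bracket closes.
```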
Find-or-insert speculation
Insert-with-find-fallback (the find_or_insert API) is the
common case — the caller wants the entry whether or not it
already exists. The implementation pattern: speculatively claim a
fresh entry from the freelist before searching, search the
chain, if the key is found discard the speculative entry,
otherwise CAS-link it into place.
CUBRID optimizes the discard with the descriptor::save_reclaimable
mechanism described in cubrid-lockfree-transaction.md — the
discarded entry is parked on the calling thread’s descriptor
rather than retired, and the next call site reuses it without a
freelist round-trip.
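A self-contained toy of the speculate-then-link shape described above, without marks or reclamation. All names are illustrative; the parked entry is a plain global here only to keep the toy short (in CUBRID it lives on the per-thread descriptor).

```cpp
#include <atomic>

struct entry
{
  int key {0};
  std::atomic<entry *> next {nullptr};
};

static entry *g_parked = nullptr;      // stand-in for save_reclaimable parking

static entry *claim (int key)          // stand-in for freelist_claim
{
  entry *e = (g_parked != nullptr) ? g_parked : new entry;
  g_parked = nullptr;
  e->key = key;                        // f_key_copy equivalent
  e->next.store (nullptr);
  return e;
}

entry *find_or_insert (std::atomic<entry *> &bucket, int key)
{
  entry *spec = claim (key);           // 1. speculative claim before the search
  for (;;)
    {
      std::atomic<entry *> *curr_p = &bucket;
      entry *curr = curr_p->load ();
      while (curr != nullptr)          // 2. walk the chain
        {
          if (curr->key == key)
            {
              g_parked = spec;         // 3a. found: park the speculative entry
              return curr;
            }
          curr_p = &curr->next;
          curr = curr_p->load ();
        }
      entry *expected = nullptr;
      if (curr_p->compare_exchange_strong (expected, spec))
        {
          return spec;                 // 3b. CAS-linked at the end of the chain
        }
      // lost the link race to a sibling insert: restart the walk
    }
}
```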
Theory ↔ CUBRID mapping
| Concept | CUBRID name (legacy) | CUBRID name (modern) |
|---|---|---|
| Chained hash table | LF_HASH_TABLE | lockfree::hashmap<K,T> |
| Bucket array | LF_HASH_TABLE::buckets | hashmap::m_buckets |
| Bucket count | LF_HASH_TABLE::hash_size | hashmap::m_size |
| Per-entry layout descriptor | LF_ENTRY_DESCRIPTOR | (same struct, lf_entry_descriptor) |
| Hash function | edesc->f_hash (key, htsize) | (same) |
| Key compare | edesc->f_key_cmp (k1, k2) | (same) |
| Key copy | edesc->f_key_copy (src, dst) | (same) |
| Init / uninit | edesc->f_init, edesc->f_uninit | (same) |
| Duplicate-key callback | edesc->f_duplicate | (same) |
| Per-entry mutex flag | edesc->using_mutex | (same) |
| Field offset: next | edesc->of_next | (same) |
| Field offset: key | edesc->of_key | (same) |
| Field offset: mutex | edesc->of_mutex | (same) |
| Field offset: del_tran_id | edesc->of_del_tran_id | (legacy only — modern uses freelist node) |
| Field offset: local_next | edesc->of_local_next | (legacy only — modern uses freelist node) |
| Behaviour flags | LF_LIST_BF_* | (same constants reused) |
| Restart signal | LF_LIST_BR_RESTARTED | (same constant reused) |
| Find primitive | lf_hash_find | hashmap::find |
| Find-or-insert | lf_hash_find_or_insert | hashmap::find_or_insert |
| Insert | lf_hash_insert | hashmap::insert |
| Insert-given | lf_hash_insert_given | hashmap::insert_given |
| Delete | lf_hash_delete | hashmap::erase |
| Delete-locked | lf_hash_delete_already_locked | hashmap::erase_locked |
| Clear | lf_hash_clear | hashmap::clear |
| Iterator | LF_HASH_TABLE_ITERATOR | hashmap::iterator |
| Back-buffer for clear | LF_HASH_TABLE::backbuffer | hashmap::m_backbuffer |
| C++ shim | lf_hash_table_cpp<K,T> | (same — used in bridge) |
| Wrapper deciding generation | (n/a) | cubthread::lockfree_hashmap<K,T> |
CUBRID’s Approach
Two generations, one descriptor
The hashmap exists as two complete implementations bridged by a
single dispatch class. Both consume the same
lf_entry_descriptor so a per-consumer entry layout works
unmodified across the migration.
Legacy C — lf_hash_* over LF_HASH_TABLE
```c
// LF_HASH_TABLE — src/base/lock_free.h:299
struct lf_hash_table
{
  void **buckets;
  void **backbuffer;
  pthread_mutex_t backbuffer_mutex;
  unsigned int hash_size;
  LF_FREELIST *freelist;
  LF_ENTRY_DESCRIPTOR *entry_desc;
};
```
The procedural API: `lf_hash_init`, `lf_hash_find`, `lf_hash_find_or_insert`, `lf_hash_insert`, `lf_hash_insert_given`, `lf_hash_delete`, `lf_hash_delete_already_locked`, `lf_hash_clear`. All take an `LF_TRAN_ENTRY *` (the per-thread state in the legacy reclamation system) and an `LF_HASH_TABLE *`.
The C++ shim lf_hash_table_cpp<K,T> is a typed wrapper in the
same header — it owns an lf_freelist and an lf_hash_table
inline and forwards every method to the C functions with a
reinterpret_cast<void**>(&t) for the entry pointer. This shim
is what cubthread::lockfree_hashmap holds for the OLD path.
Modern C++ — lockfree::hashmap<K,T>
Section titled “Modern C++ — lockfree::hashmap<K,T>”// lockfree::hashmap<Key, T> — src/base/lockfree_hashmap.hpp:42template <class Key, class T>class hashmap {public: void init (tran::system &transys, size_t hash_size, size_t freelist_block_size, size_t freelist_block_count, lf_entry_descriptor &edesc); void destroy ();
T *find (tran::index, Key &); bool find_or_insert (tran::index, Key &, T *&); bool insert (tran::index, Key &, T *&); bool insert_given (tran::index, Key &, T *&); bool erase (tran::index, Key &); bool erase_locked (tran::index, Key &, T *&);
void unlock (tran::index, T *&); void clear (tran::index); // NOT lock-free T *freelist_claim (tran::index); void freelist_retire (tran::index, T *&);
void start_tran (tran::index); void end_tran (tran::index);
// ...private: using address_type = address_marker<T>; struct freelist_node_data { T m_entry; lf_entry_descriptor *m_edesc; void on_reclaim (); }; using freelist_type = freelist<freelist_node_data>; using free_node_type = typename freelist_type::free_node;
freelist_type *m_freelist; T **m_buckets; size_t m_size; T **m_backbuffer; std::mutex m_backbuffer_mutex; lf_entry_descriptor *m_edesc; // statistics ...};Three structural differences from legacy:
- Templated on `K`, `T`; no `void *` casts in the public surface.
- No global tran-system reference. `init` takes a `tran::system &` and builds its own `tran::table` (inside the freelist).
- Entries live as `freelist<freelist_node_data>::free_node`, not bare `T`. The freelist owns the node header (the `tran::reclaimable_node` base + `m_owner` pointer) and the payload `T m_entry` — see §“Entry layout — modern”.
Bridge — cubthread::lockfree_hashmap<K,T>
```cpp
// cubthread::lockfree_hashmap<Key,T> — src/thread/thread_lockfree_hash_map.hpp:35
template <class Key, class T>
class lockfree_hashmap
{
  public:
    void init (lf_tran_system &transys, int entry_idx, int hash_size,
               int freelist_block_size, int freelist_block_count,
               lf_entry_descriptor &edesc);
    void init_as_old (lf_tran_system &, int hash_size, int freelist_block_count,
                      int freelist_block_size, lf_entry_descriptor &, int entry_idx);
    void init_as_new (lockfree::tran::system &, size_t hash_size,
                      size_t freelist_block_size, size_t freelist_block_count,
                      lf_entry_descriptor &);

    // forwards: find, find_or_insert, insert, insert_given, erase, erase_locked,
    // unlock, clear, freelist_claim, freelist_retire, start_tran, end_tran

  private:
    enum type { OLD, NEW, UNKNOWN };

    lf_hash_table_cpp<Key, T> m_old_hash;
    lockfree::hashmap<Key, T> m_new_hash;
    type m_type;
    int m_entry_idx;
};
```
The dispatch is a single boolean check at init():
```cpp
void init (lf_tran_system &transys, int entry_idx, int hash_size,
           int freelist_block_size, int freelist_block_count,
           lf_entry_descriptor &edesc)
{
  if (prm_get_bool_value (PRM_ID_ENABLE_NEW_LFHASH))
    {
      init_as_new (get_thread_entry_lftransys (), (size_t) hash_size,
                   (size_t) freelist_block_size, (size_t) freelist_block_count, edesc);
    }
  else
    {
      init_as_old (transys, hash_size, freelist_block_count, freelist_block_size,
                   edesc, entry_idx);
    }
}
```
Every forwarded method picks the live implementation by reading
m_type and dispatching the matching call. The dispatch is cheap —
a single conditional branch — but it is on every call. There is
no inlined fast path; the branch sits inside a macro
(lockfree_hashmap_forward_func) that every public method
expands.
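A minimal sketch of the forwarding shape (hypothetical macro name and body; the real `lockfree_hashmap_forward_func` lives in thread_lockfree_hash_map.hpp and may differ in detail).

```cpp
#define hashmap_forward(fn, ...)                  \
  do                                              \
    {                                             \
      if (m_type == OLD)                          \
        {                                         \
          return m_old_hash.fn (__VA_ARGS__);     \
        }                                         \
      return m_new_hash.fn (__VA_ARGS__);         \
    }                                             \
  while (0)

// e.g., a forwarder body: T *find (...) { hashmap_forward (find, tran, key); }
```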
The shared lf_entry_descriptor
Both paths consume the same descriptor:
```c
// lf_entry_descriptor — src/base/lock_free.h:62
struct lf_entry_descriptor
{
  unsigned int of_local_next;     // legacy only
  unsigned int of_next;
  unsigned int of_del_tran_id;    // legacy only
  unsigned int of_key;
  unsigned int of_mutex;
  int using_mutex;
  int max_alloc_cnt;              // legacy only

  LF_ENTRY_ALLOC_FUNC f_alloc;    // legacy only
  LF_ENTRY_FREE_FUNC f_free;      // legacy only
  LF_ENTRY_INITIALIZE_FUNC f_init;
  LF_ENTRY_UNINITIALIZE_FUNC f_uninit;
  LF_ENTRY_KEY_COPY_FUNC f_key_copy;
  LF_ENTRY_KEY_COMPARE_FUNC f_key_cmp;
  LF_ENTRY_HASH_FUNC f_hash;
  LF_ENTRY_DUPLICATE_KEY_HANDLER f_duplicate;
};
```
Three categories of fields:
- Always used. `of_next`, `of_key`, `of_mutex`, `using_mutex`, `f_init`, `f_uninit`, `f_key_copy`, `f_key_cmp`, `f_hash`, `f_duplicate`.
- Legacy only. `of_local_next` (used by the legacy retire list and the legacy stack/freelist), `of_del_tran_id` (used by the legacy retire-id stamping), `max_alloc_cnt`, `f_alloc`, `f_free` (the legacy freelist allocates with the `f_alloc`/`f_free` callbacks; the modern freelist uses `new`/`delete` on its own `free_node` type).
- Modern only. None — the modern path simply ignores the legacy-only fields. A descriptor populated for both paths silently uses the right subset.
This is the critical reason the two-generation co-existence works: a single in-tree descriptor table works on either side because the modern code overlooks the legacy fields and the legacy code overlooks the modern absences. Each consumer (lock manager, session table, etc.) defines its descriptor once and the bridge flips the implementation under it.
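A sketch of the pattern each consumer follows: compute the byte offsets with offsetof and fill the callback slots in declaration order. `my_entry` and the `my_*` callbacks are hypothetical (their prototypes would use the descriptor’s typedef’d signatures); UINT64 is the CUBRID typedef.

```cpp
#include <cstddef>     // offsetof
#include <pthread.h>

struct my_entry
{
  my_entry *local_next;     // of_local_next (legacy freelist / retire list)
  my_entry *next;           // of_next (chain pointer)
  UINT64 del_tran_id;       // of_del_tran_id (legacy retire stamping)
  int key;                  // of_key
  pthread_mutex_t mutex;    // of_mutex (meaningful only if using_mutex = 1)
  // ... payload ...
};

static lf_entry_descriptor my_entry_desc = {
  offsetof (my_entry, local_next),
  offsetof (my_entry, next),
  offsetof (my_entry, del_tran_id),
  offsetof (my_entry, key),
  offsetof (my_entry, mutex),
  1,                   // using_mutex
  0,                   // max_alloc_cnt (legacy only)
  my_entry_alloc,      // f_alloc  (legacy only)
  my_entry_free,       // f_free   (legacy only)
  my_entry_init,       // f_init
  my_entry_uninit,     // f_uninit
  my_key_copy,         // f_key_copy
  my_key_cmp,          // f_key_cmp
  my_key_hash,         // f_hash
  NULL                 // f_duplicate (no duplicate handling)
};
```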
Entry layout — modern
Where the legacy table stores entries as raw T allocated by
f_alloc, the modern path wraps T inside a freelist node:
```cpp
// lockfree::hashmap<K,T>::freelist_node_data — src/base/lockfree_hashmap.hpp:84
struct freelist_node_data
{
  T m_entry;
  lf_entry_descriptor *m_edesc;

  freelist_node_data () = default;
  ~freelist_node_data () = default;

  void on_reclaim ()
  {
    if (m_edesc->f_uninit != NULL)
      {
        (void) m_edesc->f_uninit (&m_entry);
      }
  }
};

using freelist_type = freelist<freelist_node_data>;
using free_node_type = typename freelist_type::free_node;
```
The free_node template adds a tran::reclaimable_node base
plus a freelist *m_owner back-pointer plus the payload (see
cubrid-lockfree-freelist.md). The hashmap therefore stores
pointers to T m_entry in the buckets, and converts back to
the enclosing free_node when retiring:
```cpp
// hashmap<K,T>::to_free_node — src/base/lockfree_hashmap.hpp:552
free_node_type *to_free_node (T *p)
{
  // not nice, but necessary until we fully refactor lockfree hashmap
  const std::ptrdiff_t off = free_node_offset_of_data (free_node_type ());
  char *cp = (char *) p;
  cp -= off;
  return (free_node_type *) cp;
}
```
The free_node_offset_of_data constexpr helper computes the
offset of m_entry within the free_node so the back-cast is
correct regardless of compiler padding. The header comment
acknowledges this is a workaround pending a deeper refactor — the
“correct” design would have hashmap entries embed
reclaimable_node directly rather than via the freelist’s
generic free_node.
Field-offset access — both generations
Both paths mutate T through byte offsets, not member access.
The hashmap reads the next-pointer of an entry as:
```cpp
// hashmap<K,T>::get_nextp_ref — src/base/lockfree_hashmap.hpp:530
T *&get_nextp_ref (T *p)
{
  return * (T **) get_ref (p, m_edesc->of_next);
}

// hashmap<K,T>::get_keyp — src/base/lockfree_hashmap.hpp:516
void *get_keyp (T *p)
{
  return get_ptr (p, m_edesc->of_key);
}
```
This is the design that makes lf_entry_descriptor the single
shared type. Both implementations pun a T* into a char*,
add a byte offset, cast back to the field type, and access. The
template parameter T does not appear in the access path — it is
a documentation-only type bound. This is what lets a legacy
LK_RES (lock-resource) struct be stored unchanged when the
consumer migrates from lf_hash_* to lockfree::hashmap<OID, LK_RES>.
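A sketch of the byte-offset primitive both accessors sit on; `get_ref` is the helper name quoted above, but this body is an assumption about its shape.

```cpp
template <class T>
void *get_ref (T *p, unsigned int offset)
{
  return (void *) (((char *) p) + offset);
}
```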
Component overview
```mermaid
flowchart TB
Caller["Consumer<br/>(lock_manager, session, ...)"]
Bridge["cubthread::lockfree_hashmap<K,T><br/>m_type ∈ {OLD, NEW, UNKNOWN}"]
Caller --> Bridge
subgraph Legacy["m_type = OLD"]
Cpp["lf_hash_table_cpp<K,T><br/>(template, in lock_free.h)"]
LFH["lf_hash_table"]
LFFL["lf_freelist"]
LFTS["LF_TRAN_SYSTEM (per-consumer global)"]
LFTE["LF_TRAN_ENTRY[]"]
Cpp --> LFH
Cpp --> LFFL
LFFL --> LFTS
LFTS --> LFTE
end
subgraph Modern["m_type = NEW"]
HM["lockfree::hashmap<K,T>"]
FL["lockfree::freelist<freelist_node_data>"]
TT["lockfree::tran::table"]
TS["lockfree::tran::system<br/>(server-wide)"]
HM --> FL
FL --> TT
TT --> TS
end
Bridge --> Cpp
Bridge --> HM
PRM["PRM_ID_ENABLE_NEW_LFHASH<br/>(read at init time)"]
PRM --> Bridge
Edesc["lf_entry_descriptor<br/>(shared by both)"]
Edesc --> Cpp
  Edesc --> HM
```
List operations — modern, in detail
The hashmap defers to three private chain primitives:
- `list_find` — read-only walk.
- `list_insert_internal` — find-or-insert with optional given entry, optional duplicate callback, optional behavior flags.
- `list_delete` — Harris–Michael two-CAS unlink with optional pre-locked entry.
Each is wrapped in an outer loop (hash_insert_internal,
hash_erase_internal) that retries on the LF_LIST_BR_RESTARTED
flag. The behavior flags LF_LIST_BF_* and response flags
LF_LIST_BR_* are taken verbatim from the legacy header — the
modern path reuses these constants rather than reinventing.
Insert path
```cpp
// hashmap<K,T>::list_insert_internal (abridged) — src/base/lockfree_hashmap.hpp:899
bool list_insert_internal (tran::index tran_index, T *&list_head, Key &key,
                           int *behavior_flags, T *&entry)
{
  pthread_mutex_t *entry_mutex = NULL;
  T **curr_p;
  T *curr;
  tran::descriptor &tdes = get_tran_descriptor (tran_index);

  while (restart_search)
    {
      start_tran_force (tdes);      // bracket open
      curr_p = &list_head;
      curr = address_type::atomic_strip_address_mark (*curr_p);

      while (curr_p != NULL)
        {
          if (curr != NULL)
            {
              if (m_edesc->f_key_cmp (&key, get_keyp (curr)) == 0)
                {
                  // FOUND duplicate key
                  if (m_edesc->using_mutex)
                    {
                      lock_entry_mutex (*curr, entry_mutex);
                      end_tran_force (tdes);    // mutex supersedes tran
                      if (address_type::is_address_marked (get_nextp_ref (curr)))
                        {
                          // raced with delete — restart
                          unlock_entry_mutex_force (entry_mutex);
                          return false;         // RESTARTED flag set
                        }
                    }
                  if (m_edesc->f_duplicate != NULL)
                    {
                      // restart on duplicate (the only current behavior)
                      return false;
                    }
                  if (LF_LIST_BF_IS_FLAG_SET (behavior_flags, LF_LIST_BF_INSERT_GIVEN))
                    freelist_retire (tdes, entry);    // give up the speculative entry
                  if (!LF_LIST_BF_IS_FLAG_SET (behavior_flags, LF_LIST_BF_FIND_OR_INSERT))
                    {
                      // straight insert: failure
                      return false;
                    }
                  entry = curr;     // hand back found entry
                  return false;     // not inserted, but found
                }
              // advance to next
              curr_p = &get_nextp_ref (curr);
              curr = address_type::strip_address_mark (*curr_p);
            }
          else
            {
              // END OF CHAIN — insert here
              if (entry == NULL)
                {
                  entry = freelist_claim (tdes);    // speculative claim
                  if (m_edesc->f_key_copy (&key, get_keyp (entry)) != NO_ERROR)
                    return false;
                }
              if (m_edesc->using_mutex)
                lock_entry_mutex (*entry, entry_mutex);

              if (!ATOMIC_CAS_ADDR (curr_p, (T *) NULL, entry))
                {
                  // somebody linked first — save speculative for next round and restart
                  if (m_edesc->using_mutex)
                    unlock_entry_mutex_force (entry_mutex);
                  if (!LF_LIST_BF_IS_FLAG_SET (behavior_flags, LF_LIST_BF_INSERT_GIVEN))
                    save_temporary (tdes, entry);
                  LF_LIST_BR_SET_FLAG (behavior_flags, LF_LIST_BR_RESTARTED);
                  end_tran_force (tdes);
                  return false;
                }
              if (m_edesc->using_mutex)
                end_tran_force (tdes);    // mutex supersedes tran
              return true;                // INSERTED
            }
        }
    }
}
```
The insertion is a single CAS on the predecessor’s next-pointer. The dance around it is:
- `save_temporary` on race-loss. If the CAS-link fails because a sibling thread linked first, the speculatively-claimed entry is parked on the descriptor (`tdes.save_reclaimable`) so the next iteration of the outer `while` reuses it. This is the freelist round-trip avoidance.
- `f_duplicate` callback path. If a per-consumer “what to do on duplicate key” callback is registered, the insert calls it (e.g., to bump a session-id counter and request a new key) and then sets `RESTARTED` to retry from a different bucket.
- Mutex supersedes tran. When the entry has a mutex, the insert grabs the mutex before CAS-linking and ends the transaction immediately after the link succeeds — the mutex is now sufficient protection for the payload, and ending the tran early frees the descriptor for other ops.
Delete path
```cpp
// hashmap<K,T>::list_delete (abridged) — src/base/lockfree_hashmap.hpp:1111
bool list_delete (tran::index tran_index, T *&list_head, Key &key,
                  T *locked_entry, int *behavior_flags)
{
  while (restart_search)
    {
      start_tran_force (tdes);
      curr_p = &list_head;
      curr = address_type::atomic_strip_address_mark (*curr_p);

      while (curr != NULL)
        {
          if (m_edesc->f_key_cmp (&key, get_keyp (curr)) == 0)
            {
              if (locked_entry != NULL && locked_entry != curr)
                {
                  // erase_locked semantics: the entry I held the lock on was
                  // already retired; current entry is a *different* one with
                  // the same key. Bail.
                  return false;
                }
              next_p = &get_nextp_ref (curr);
              next = address_type::strip_address_mark (*next_p);

              // STEP 1: mark next pointer
              if (!ATOMIC_CAS_ADDR (next_p, next, address_type::set_adress_mark (next)))
                {
                  // raced — restart
                  if (LF_LIST_BR_RESTARTED)
                    return false;
                  break;
                }

              if (m_edesc->using_mutex)
                {
                  if (LF_LIST_BF_IS_FLAG_SET (behavior_flags, LF_LIST_BF_LOCK_ON_DELETE))
                    lock_entry_mutex (*curr, entry_mutex);
                  else
                    entry_mutex = get_pthread_mutexp (curr);   // already locked
                }

              // STEP 2: unlink
              if (!ATOMIC_CAS_ADDR (curr_p, curr, next))
                {
                  // raced; remove our mark and restart
                  if (m_edesc->using_mutex
                      && LF_LIST_BF_IS_FLAG_SET (..., LF_LIST_BF_LOCK_ON_DELETE))
                    unlock_entry_mutex_force (entry_mutex);
                  ATOMIC_CAS_ADDR (next_p, address_type::set_adress_mark (next), next);
                  if (LF_LIST_BR_RESTARTED)
                    return false;
                  break;
                }
              if (m_edesc->using_mutex)
                unlock_entry_mutex_force (entry_mutex);

              promote_tran_force (tdes);     // bump global id (retire epoch)
              freelist_retire (tdes, curr);  // append to retired ring
              end_tran_force (tdes);
              return true;
            }
          curr_p = &get_nextp_ref (curr);
          curr = address_type::strip_address_mark (*curr_p);
        }
    }
}
```
The two CASes are exactly the Harris-Michael protocol:
- Mark.
next_pfromnexttomark(next). Failure means a sibling thread is also trying to delete the same node (or has already done step 1). - Unlink.
curr_pfromcurrtonext(unmarked). Failure means the predecessor pointer changed under us — another insert or delete moved pastcurr. Recovery: undo the mark and restart.
The retire is delayed until after the unlink CAS succeeds.
This is what makes m_retire_tranid strictly greater than any
concurrent reader’s snapshot — by the time we call
promote_tran_force (which bumps the global id and writes it to
our descriptor), the chain no longer contains curr, so any
future reader cannot see it via the bucket. Existing readers
holding curr directly still complete safely because their
snapshot’s tranid <= our pre-promotion tranid < our retire_tranid, and reclamation only fires when
min_active > retire_tranid.
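Hypothetical ids walking the ordering argument above (values invented for illustration):

```cpp
//   reader R:  start_tran at global id 41 -> snapshot tranid = 41
//   deleter D: unlink succeeds, promote_tran_force bumps the global id to 42
//              and the retired node is stamped retire_tranid = 42
//
// R's open bracket keeps min_active at 41, so the condition below stays
// false and the node outlives every reader that could still hold it.
#include <cstdint>

bool can_reclaim (std::uint64_t min_active_tranid, std::uint64_t retire_tranid)
{
  return min_active_tranid > retire_tranid;   // the doc's reclamation condition
}
```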
Find path
```cpp
// hashmap<K,T>::find (abridged) — src/base/lockfree_hashmap.hpp:280
T *find (tran::index tran_index, Key &key)
{
  while (restart)
    {
      T *&list_head = get_bucket (key);
      entry = NULL;
      bflags = LF_LIST_BF_RETURN_ON_RESTART;
      list_find (tran_index, list_head, key, &bflags, entry);
      restart = (bflags & LF_LIST_BR_RESTARTED) != 0;
    }
  return entry;
}
```
find always sets LF_LIST_BF_RETURN_ON_RESTART so the chain
walk returns to the bucket-array level on restart, recomputing
the hash. This matters when the only inflight change to the
chain is unrelated (a different key being inserted/deleted) — a
clean restart from get_bucket re-derefs the bucket pointer
and is robust to a clear having swapped the bucket array
underneath.
Clear and back-buffer
hashmap::clear is the only non-lock-free public operation:
```cpp
// hashmap<K,T>::clear (abridged) — src/base/lockfree_hashmap.hpp:378
void clear (tran::index tran_index)
{
  tran::descriptor &tdes = get_tran_descriptor (tran_index);
  std::unique_lock<std::mutex> ulock (m_backbuffer_mutex);

  // STEP 1: atomically swap m_buckets <-> m_backbuffer
  T **old_buckets = ATOMIC_TAS_ADDR (&m_buckets, m_backbuffer);

  // STEP 2: reset what is now the new backbuffer (was the old m_buckets)
  for (size_t i = 0; i < m_size; ++i)
    {
      assert (m_backbuffer[i] == address_type::set_adress_mark (NULL));
      m_buckets[i] = NULL;
    }
  m_backbuffer = old_buckets;

  // STEP 3: retire every entry in old buckets, link by link
  for (size_t i = 0; i < m_size; ++i)
    {
      do
        {
          curr = address_type::strip_address_mark (old_buckets[i]);
        }
      while (!ATOMIC_CAS_ADDR (&old_buckets[i], curr, address_type::set_adress_mark (NULL)));

      while (curr != NULL)
        {
          next_p = &get_nextp_ref (curr);
          do
            {
              next = address_type::strip_address_mark (*next_p);
            }
          while (!ATOMIC_CAS_ADDR (next_p, next, address_type::set_adress_mark (next)));

          if (m_edesc->using_mutex)
            {
              lock_entry_mutex (*curr, mutex_p);   // wait for any in-flight reader
              unlock_entry_mutex_force (mutex_p);
            }
          freelist_retire (tdes, curr);            // append to retired ring
          curr = next;
        }
    }
}
```
Three observations:
- The bucket-array swap is atomic. Concurrent finds may briefly observe `m_buckets` pointing at the old or the new array; the invariant is that whatever they point at, the chain walk succeeds (modulo restart-on-CAS-fail) because every chain link is still valid.
- The back-buffer is pre-marked with `set_adress_mark (NULL)` so that any thread looking at the back-buffer’s bucket head sees a marked-NULL chain head and treats it as empty. This is the back-buffer mutex’s invariant — under it, the back-buffer is always all-marked-NULL.
- Per-entry mutex acquisition during retirement is the wait point for in-flight readers. A reader who found the entry before clear started will hold the entry’s mutex; clear serializes against that mutex, ensuring the reader is done before we retire.
Cost: O(buckets × chain_length) atomic operations + N pthread mutex lock/unlock cycles + descriptor retire-ring growth. On a heavily-loaded table this is the worst single operation in the hashmap surface; consumers who clear (sessions on logout, lock manager on shutdown, XASL cache on plan-cache reset) accept it as a rare event.
Iterator
The iterator (hashmap::iterator) walks bucket-by-bucket,
chain-by-chain, with a transaction held open per bucket. It
auto-releases the transaction at end-of-bucket and reacquires at
the next. The pattern is deliberately conservative — long
iterations would otherwise keep the descriptor’s m_tranid low
and prevent reclamation across the whole server.
```cpp
// hashmap<K,T>::iterator::iterate (abridged) — src/base/lockfree_hashmap.hpp:1343
T *iterate ()
{
  do
    {
      if (m_curr != NULL)
        {
          if (m_hashmap->m_edesc->using_mutex)
            m_hashmap->unlock_entry (*m_curr);   // release per-entry mutex
          m_curr = address_type::strip_address_mark (m_hashmap->get_nextp_ref (m_curr));
        }
      else
        {
          if (m_bucket_index != INVALID_INDEX)
            m_tdes->end_tran ();
          m_tdes->start_tran ();
          m_bucket_index++;
          if (m_bucket_index >= m_hashmap->m_size)
            {
              m_tdes->end_tran ();
              return NULL;   // end
            }
          m_curr = address_type::atomic_strip_address_mark (m_hashmap->m_buckets[m_bucket_index]);
        }
      if (m_curr != NULL && m_hashmap->m_edesc->using_mutex)
        {
          m_hashmap->lock_entry (*m_curr);
          if (address_type::is_address_marked (m_hashmap->get_nextp_ref (m_curr)))
            continue;   // deleted in flight, skip
        }
    }
  while (m_curr == NULL);
  return m_curr;
}
```
Per-method statistics
Every public method is wrapped by ct_stat_type::autotimer
that increments a counter and accumulates a duration timer
into one of m_stat_find, m_stat_insert, m_stat_erase,
m_stat_unlock, m_stat_clear, m_stat_iterates,
m_stat_claim, m_stat_retire. The autotimer is short-circuited
unless m_active_stats == true, which is toggled by
activate_stats() / deactivate_stats() per-table. The default
is off — production servers do not pay the stats cost
unless an operator explicitly enables it.
Legacy implementation, in brief
The legacy lf_hash_* family does the same Harris-Michael walk
in C, with LF_TRAN_ENTRY * instead of tran::descriptor and
the global LF_TRAN_SYSTEMs instead of per-table tables. The
key procedural calls:
- `lf_hash_init` builds the bucket array and the back-buffer, registers `entry_desc` with the freelist.
- `lf_hash_find` calls `lf_list_find`, which is the legacy counterpart of `hashmap::list_find`.
- `lf_hash_insert` calls `lf_hash_insert_internal` → `lf_list_insert_internal`.
- `lf_hash_delete` calls `lf_hash_delete_internal` → `lf_list_delete`.
- `lf_hash_clear` walks the back-buffer the same way the modern `clear` does.
The same LF_LIST_BF_* flags drive both. The same back-buffer
layout (backbuffer[i] == ADDR_WITH_MARK (NULL)) is invariant.
The legacy stack additionally exposes:
- `lf_io_list_find`, `lf_io_list_find_or_insert` — an insert-only singly-linked list with no deletes and no freelist. Used by code paths that need a thread-safe append-only set (currently a small number of internal call sites in the legacy stack; the modern path has no equivalent).
- `lf_stack_push`, `lf_stack_pop` — Treiber stack. Documented as ABA-vulnerable; not currently used by any in-engine consumer (kept for compile-link compatibility with potential external callers).
Migration narrative
The legacy stack predates C++17 in CUBRID. The modern stack was introduced as part of the C++17 modernization with two stated goals:
- Type safety. No more `void *` on the public surface; the `lf_entry_descriptor` byte-offsets remain but are confined to the implementation.
- Per-table reclamation. No more 11 global `LF_TRAN_SYSTEM` instances. Each `lockfree::hashmap` builds its own `tran::table` from a server-wide `tran::system`; reclamation pressure on the lock manager’s hash does not throttle reclamation on the session table.
The migration is opt-in via system parameter, not a flag
day. The bridge exists so each consumer can flip independently
without rewriting call sites. The migration completion bar is
zero call sites left on lf_hash_table_cpp directly — every
consumer hits the new path through the bridge, and the
PRM_ID_ENABLE_NEW_LFHASH toggle becomes the single decision
for the whole server.
As of 2026-05-07 the toggle’s default is unverified in this doc (see Open Question 1). Empirically, the bridge dispatch — the macro `lockfree_hashmap_forward_func` — is the most frequently executed code in the hashmap, more so than the chain walks themselves, because every public call goes through it.
Consumers
| Consumer | Source | Entry struct | Generation toggle |
|---|---|---|---|
| Object-lock resource hash | src/transaction/lock_manager.c | LK_RES | enable_new_lfhash |
| Object-lock entry hash | src/transaction/lock_manager.c | LK_ENTRY | enable_new_lfhash |
| Session table | src/session/session.c | SESSION_STATE | enable_new_lfhash |
| XASL cache | src/query/xasl_cache.c | XASL_CACHE_ENTRY | enable_new_lfhash |
| Filter-predicate cache | src/query/filter_pred_cache.c | FILTER_PRED_CACHE_ENTRY | enable_new_lfhash |
| System catalog | src/storage/system_catalog.c | CATCLS_ENTRY | enable_new_lfhash |
| Slotted-page saving | src/storage/slotted_page.c | SPAGE_SAVE_ENTRY | enable_new_lfhash |
| HFID table | src/storage/heap_file.c | HFID_TABLE_ENTRY | enable_new_lfhash |
| Global unique stats | src/storage/btree.c | LOG_LSA / *unique_stats* | enable_new_lfhash |
| DWB slot table | src/storage/double_write_buffer.cpp | DWB_SLOT_HASH_ENTRY | enable_new_lfhash |
| Sort-list (legacy only) | src/query/sort.c | SORT_LIST_ENTRY | (legacy only) |
Each consumer defines its lf_entry_descriptor once (typically
as a static const at file scope), constructs its
cubthread::lockfree_hashmap<K,T> member, and calls init()
once at boot.
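A sketch of that boot-time wiring (class, member, and parameter values are hypothetical; each real consumer follows this shape with sizes drawn from system parameters and a descriptor like `my_entry_desc` sketched earlier).

```cpp
class my_consumer
{
  public:
    void boot (lf_tran_system &transys)
    {
      m_map.init (transys, THREAD_TS_MY_CONSUMER /* entry_idx, hypothetical */,
                  1024 /* hash_size */, 100 /* freelist_block_size */,
                  4 /* freelist_block_count */, my_entry_desc);
    }

  private:
    cubthread::lockfree_hashmap<my_key, my_entry> m_map;
};
```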
Source Walkthrough
Modern stack — public surface
- `lockfree::hashmap<K,T>` — class template, `lockfree_hashmap.hpp:42`. The `.cpp` (25 lines) is a stub.
- `hashmap::find` — :280
- `hashmap::find_or_insert` — :301
- `hashmap::insert` — :308
- `hashmap::insert_given` — :315
- `hashmap::erase` — :325
- `hashmap::erase_locked` — :345
- `hashmap::unlock` — :360
- `hashmap::clear` — :378
- `hashmap::freelist_claim` — :656 (and :663 for the `tran::descriptor &` overload)
- `hashmap::freelist_retire` — :579 (and :646 for the `tran::descriptor &` overload)
- `hashmap::list_find` — :813
- `hashmap::list_insert_internal` — :899
- `hashmap::list_delete` — :1111
- `hashmap::hash_insert_internal` — :1269
- `hashmap::hash_erase_internal` — :1305
- `hashmap::iterator::iterate` — :1343
Legacy stack — public surface
- `LF_HASH_TABLE` struct — `lock_free.h:299`
- `LF_ENTRY_DESCRIPTOR` struct — `lock_free.h:62`
- `lf_hash_init` — `lock_free.c` (defined later in the file)
- `lf_hash_find` — `lock_free.c`
- `lf_hash_find_or_insert` — `lock_free.c`
- `lf_hash_insert` / `lf_hash_insert_given` — `lock_free.c`
- `lf_hash_delete` / `lf_hash_delete_already_locked` — `lock_free.c`
- `lf_hash_clear` — `lock_free.c`
- `lf_list_find` / `lf_list_insert_internal` / `lf_list_delete` — internal chain walks (`lock_free.c`)
- `lf_io_list_find` / `lf_io_list_find_or_insert` — insert-only list (`lock_free.c:1077+`)
- `lf_stack_push` / `lf_stack_pop` — Treiber stack (`lock_free.c:555+`)
- `lf_freelist_init` / `lf_freelist_claim` / `lf_freelist_retire` / `lf_freelist_transport` — `lock_free.c:676+`
- `lf_hash_table_cpp<K,T>` — C++ shim, `lock_free.h:367+`
Bridge
- `cubthread::lockfree_hashmap<K,T>` — `thread_lockfree_hash_map.hpp:35`
- `init` (toggles by parameter) — :129
- `init_as_old` / `init_as_new` — :145, :155
- All forwarded methods (`find`, `insert`, `erase`, …) — :171-:294, expanding the `lockfree_hashmap_forward_func` macro
- `cubthread::get_thread_entry_lftransys` — `thread_lockfree_hash_map.cpp:27`
Position hints (as of 2026-05-07)
| Symbol | File | Line |
|---|---|---|
lockfree::hashmap<K,T> (class template) | src/base/lockfree_hashmap.hpp | 42 |
hashmap::init | src/base/lockfree_hashmap.hpp | 261 |
hashmap::find | src/base/lockfree_hashmap.hpp | 280 |
hashmap::find_or_insert | src/base/lockfree_hashmap.hpp | 301 |
hashmap::erase | src/base/lockfree_hashmap.hpp | 325 |
hashmap::erase_locked | src/base/lockfree_hashmap.hpp | 345 |
hashmap::clear | src/base/lockfree_hashmap.hpp | 378 |
hashmap::list_find | src/base/lockfree_hashmap.hpp | 813 |
hashmap::list_insert_internal | src/base/lockfree_hashmap.hpp | 899 |
hashmap::list_delete | src/base/lockfree_hashmap.hpp | 1111 |
hashmap::hash_insert_internal | src/base/lockfree_hashmap.hpp | 1269 |
hashmap::iterator::iterate | src/base/lockfree_hashmap.hpp | 1343 |
hashmap::freelist_node_data::on_reclaim | src/base/lockfree_hashmap.hpp | 92 |
hashmap::to_free_node | src/base/lockfree_hashmap.hpp | 552 |
lf_hash_table (struct) | src/base/lock_free.h | 299 |
lf_entry_descriptor (struct) | src/base/lock_free.h | 62 |
lf_hash_table_cpp<K,T> (template) | src/base/lock_free.h | 367 |
lf_freelist_alloc_block | src/base/lock_free.c | 621 |
lf_freelist_claim | src/base/lock_free.c | 760 |
lf_freelist_retire | src/base/lock_free.c | 873 |
lf_freelist_transport | src/base/lock_free.c | 934 |
lf_io_list_find | src/base/lock_free.c | 1077 |
lf_io_list_find_or_insert | src/base/lock_free.c | 1131 |
lf_stack_push / lf_stack_pop | src/base/lock_free.c | 554 / 584 |
cubthread::lockfree_hashmap<K,T> | src/thread/thread_lockfree_hash_map.hpp | 35 |
cubthread::lockfree_hashmap::init | src/thread/thread_lockfree_hash_map.hpp | 129 |
cubthread::lockfree_hashmap::init_as_new | src/thread/thread_lockfree_hash_map.hpp | 155 |
cubthread::lockfree_hashmap::init_as_old | src/thread/thread_lockfree_hash_map.hpp | 145 |
cubthread::get_thread_entry_lftransys | src/thread/thread_lockfree_hash_map.cpp | 27 |
Source verification (as of 2026-05-07)
Section titled “Source verification (as of 2026-05-07)”Verified facts
- The bridge dispatches at `init()` only — there is no per-call toggle. `cubthread::lockfree_hashmap::init` reads `prm_get_bool_value (PRM_ID_ENABLE_NEW_LFHASH)` exactly once and writes `m_type = OLD or NEW`. Subsequent calls dispatch on `m_type`. Verified at `thread_lockfree_hash_map.hpp:129-141`.
- Both implementations consume the same `lf_entry_descriptor` type. The struct is defined once in `lock_free.h:62`, used by `lf_hash_*` and by `lockfree::hashmap<K,T>::init` (which takes an `lf_entry_descriptor &` argument). Verified by reading both signatures.
- Modern hashmap stores entries via `freelist_node_data` wrapping. Each entry is a `freelist<freelist_node_data>::free_node`, carrying a `tran::reclaimable_node` base, a back-pointer to the freelist owner, and the payload `T m_entry`. Verified at `lockfree_hashmap.hpp:84-100`. The header acknowledges the `to_free_node` offset hack as a workaround pending refactor.
- Per-entry mutex is optional, driven by `lf_entry_descriptor::using_mutex`. Verified at `lockfree_hashmap.hpp:777` (`assert (m_edesc->using_mutex)` in `lock_entry_mutex`) and at :362-371 where `unlock` branches on `m_edesc->using_mutex`.
- `f_duplicate` callback currently always restarts. `list_insert_internal` at `lockfree_hashmap.hpp:962-976`: when `f_duplicate` is non-NULL, it is called for its side effect (e.g., bumping a counter), then `LF_LIST_BR_RESTARTED` is set and the search restarts with a new key. The `#if 1` block masks an alternative no-restart path that is currently dead; the comment explains this is intentional pending a wider review.
- `address_marker<T>::set_adress_mark` is the typo spelling used throughout. `lockfree_address_marker.hpp:49`. Both call sites and the header agree on the typo.
- The clear’s back-buffer is pre-marked with `set_adress_mark (NULL)`. Verified at `lockfree_hashmap.hpp:272-275` (init) and :393 (clear’s invariant assert).
- Statistics default off; toggled per-table by `activate_stats()` / `deactivate_stats()`. Verified at `lockfree_hashmap.hpp:632-642`. `m_active_stats` defaults to `false` (initialized in the default constructor at :219).
Open questions
- Default value of `enable_new_lfhash`. Not verified in this doc. Investigation: read `prm_def[]` in `system_parameter.c` and check the release-notes git history for the toggle. If the default is `false`, the production reality is the legacy path; the modern path is staged but not running.
- Are any `lf_hash_table_cpp<K,T>` direct call sites left? The bridge’s existence assumes every consumer uses `cubthread::lockfree_hashmap`. If a consumer still instantiates `lf_hash_table_cpp` directly, that consumer is pinned to the legacy path. Investigation: grep `lf_hash_table_cpp<` and confirm every match is inside `cubthread::lockfree_hashmap::m_old_hash`.
- Why does the modern path keep `of_local_next` in the descriptor? The field is legacy-only; the modern hashmap stores entries inside `free_node` and uses `tran::reclaimable_node::m_retired_next` for the retired ring. The descriptor does not need `of_local_next`. The field is retained, presumably, so a single descriptor object can drive both paths during the migration; once the migration completes, `of_local_next`, `of_del_tran_id`, `f_alloc`, `f_free`, `max_alloc_cnt` are all dead and can be purged.
- `f_duplicate` semantics — is the restart-only behavior the final design? The `#if 1 / #else / #endif` block at `lockfree_hashmap.hpp:972-1005` keeps two alternatives in the tree. If the no-restart path is intended for a future feature (incrementing a counter without reissuing the key), the dead branch should be either revived or removed.
- `clear` while iterators are open. The clear acquires `m_backbuffer_mutex` and atomic-swaps the bucket array; an open iterator holding `m_curr` from the old bucket array continues walking into retired memory until reclamation fires. Reclamation is bounded by the iterator’s `m_tdes` bracket, so the entries are still alive — but the `m_curr` pointer no longer reflects the live table. Investigation: are clear-vs-iterator races documented? If not, the iterator class needs a “table generation” counter check.
Beyond CUBRID — Comparative Designs & Research Frontiers
- Split-ordered hashing (Shalev & Shavit 2003). A single recursive lock-free linked list with bucket pointers as shortcuts; supports lock-free resize. Adopting it would let the hashmap grow under contention rather than being fixed at init. Cost: more complex chain pointers and a different invariant on bucket-array semantics.
- Folly’s `ConcurrentHashMap`. Hazard-pointer-based reclamation, simpler API, no per-entry mutex. A side-by-side micro-benchmark on the lock manager’s resource hash would tell us whether the per-entry mutex is a real cost or a real benefit.
- Cuckoo hashing. Open-addressing alternative with O(1) worst-case lookup. Lock-free variants exist (Li et al. EuroSys 2014). Worth comparing on read-mostly workloads (XASL cache).
- Eliminate the legacy path. The most concrete frontier is retiring `lf_hash_*` and the bridge. Steps: confirm `enable_new_lfhash` defaults to `true` on a recent release, burn in for one release cycle, then delete `lock_free.{h,c}` and the `m_old_hash` field of the bridge.
every consumer instantiates its own
cubthread::lockfree_hashmap. A future refactor could expose a server-wide registry indexed by consumer name, reducing boot-time setup boilerplate and centralizing tuning. - Replace
f_hashcallbacks withstd::hash<K>traits. C++17 has the machinery; the only blocker is thatlf_entry_descriptoris a C-shaped struct shared with legacy code. Once the legacy path is gone, the descriptor becomes a C++ class andf_hashbecomes a template parameter.
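A sketch of that proposed end state: hashing as a template parameter defaulting to `std::hash<Key>`, replacing the `f_hash` callback. This is a hypothetical future shape, not current CUBRID code.

```cpp
#include <cstddef>
#include <functional>

template <class Key, class T, class Hash = std::hash<Key>>
class hashmap
{
  public:
    size_t get_bucket_index (const Key &key) const
    {
      return Hash () (key) % m_size;    // was: m_edesc->f_hash (&key, m_size)
    }

  private:
    size_t m_size = 1024;
};
```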
Sources
- Source files (in the CUBRID source tree at `/data/hgryoo/references/cubrid/`):
  - `src/base/lock_free.h`, `src/base/lock_free.c` (legacy procedural API + the `lf_hash_table_cpp<K,T>` template shim)
  - `src/base/lockfree_hashmap.{hpp,cpp}` (modern templated class)
  - `src/base/lockfree_address_marker.hpp` (pointer-mark helper)
  - `src/base/lockfree_freelist.hpp` (modern entry storage)
  - `src/base/lockfree_transaction_descriptor.hpp` (read / retire bracketing)
  - `src/thread/thread_lockfree_hash_map.{hpp,cpp}` (the bridge `cubthread::lockfree_hashmap<K,T>`)
- Consumer source files (sampled):
  - `src/transaction/lock_manager.c` (object-lock resource & entry tables — primary canary for this analysis)
  - `src/session/session.c`
  - `src/query/xasl_cache.c`
  - `src/query/filter_pred_cache.c`
  - `src/storage/system_catalog.c`
  - `src/storage/slotted_page.c`
  - `src/storage/double_write_buffer.cpp`
- Textbook chapters cited:
- Herlihy & Shavit, The Art of Multiprocessor Programming, ch. 9 §“Lock-Free Linked Lists” (Harris-Michael), ch. 13 §“Concurrent Hashing” (open vs. closed addressing, striping vs. lock-free chains).
- Williams, C++ Concurrency in Action, ch. 7 §“Designing lock-free concurrent data structures.”
- Foundational papers:
- Harris, T. L. “A Pragmatic Implementation of Non-Blocking Linked-Lists.” DISC 2001 — the mark-bit-on-next protocol used by both CUBRID generations.
- Michael, M. M. “High Performance Dynamic Lock-Free Hash Tables and List-Based Sets.” SPAA 2002.
- Shalev, O., Shavit, N. “Split-Ordered Lists: Lock-Free Extensible Hash Tables.” JACM 53(3), 2006.
- Michael, M. M. “Hazard Pointers: Safe Memory Reclamation for Lock-Free Objects.” IEEE TPDS 15(8), 2004 — the alternative reclamation model CUBRID does not use.
- Sibling docs in this repo:
  - `cubrid-lockfree-overview.md` — umbrella, two-generation map.
  - `cubrid-lockfree-transaction.md` — the reclamation framework the modern hashmap rides on.
  - `cubrid-lockfree-freelist.md` — the entry storage layer.
  - `cubrid-lockfree-bitmap.md` — the index allocator beneath `tran::system`.
  - `cubrid-lock-manager.md` — the consumer that exercises every code path of this hashmap.
  - `cubrid-thread-worker-pool.md` — defines `cubthread::entry::get_lf_tran_index()` used by the bridge forwarders.