CUBRID Lock-Free Hash Map — Legacy C, Modern C++, the Bridge, and the Consumers
Contents:
- Theoretical Background
- Common DBMS Design
- CUBRID’s Approach
- Source Walkthrough
- Source verification (as of 2026-05-07)
- Beyond CUBRID — Comparative Designs & Research Frontiers
- Sources
Theoretical Background
A concurrent hash table is the most-used lock-free data structure in a DBMS. Lock managers, page-buffer maps, session caches, plan caches, statistics tables — every per-key resource that wants O(1) lookup under contention picks some form of concurrent hash. The textbook design space (Herlihy & Shavit, The Art of Multiprocessor Programming, ch. 13 §“Concurrent Hashing”, and Shalev & Shavit, JACM 53(3) 2006) has three axes:
- Open vs. closed addressing. Open addressing (linear / quadratic probing, Robin Hood) places entries directly in the bucket array — cache-friendly but resize-unfriendly. Closed addressing (chained hashing) places entries in per-bucket linked lists — easy to grow incrementally because chains absorb collisions.
- Static vs. resizable. A static table has a fixed bucket count. A resizable table grows when the load factor exceeds a threshold — but lock-free resize is hard, requiring either split-ordered bucket pointers (Shalev & Shavit) so that the chain itself does not need re-linking, or stop-the-world re-hash (which sacrifices the lock-free property during resize).
- What guards the chains? A coarse table mutex (PostgreSQL’s `LockHashPartitionLock`), per-bucket striping (Java’s pre-Java-8 `ConcurrentHashMap`), or fully lock-free chains via the Harris–Michael protocol (Harris 2001, Michael 2002).
CUBRID picks closed addressing + static buckets + Harris–Michael
chains. The cost is no automatic resize — the bucket array is
fixed at init time. The benefit is the implementation can be
small and the chain operations are textbook.
Harris–Michael chain operations
The protocol rests on three rules for each chain node:
- The `next` pointer’s low bit is reserved as a delete mark.
- Insert is a single CAS: starting from the predecessor’s `next`, CAS from `NULL` to the new node. If the predecessor already has a `next`, the inserter walks past and tries again. Insertion of an already-present key is a no-op (or a “found” return for find-or-insert semantics).
- Delete is a two-CAS protocol (see the sketch after this list):
  - Mark. CAS `victim->next` from `next_ptr` to `mark(next_ptr)` — sets the low bit. After this, the victim is logically deleted: any concurrent traverser sees the mark and treats the victim as gone, but the victim is still reachable from `predecessor->next`.
  - Unlink. CAS `predecessor->next` from `victim` to `victim->next` (the unmarked one) — physically removes the victim from the chain. If this CAS fails (someone else mutated the chain between mark and unlink), the unlinker either retries (rare) or restarts the search (cheaper).
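A minimal self-contained sketch of the two-CAS delete, assuming a node type with an atomic next pointer whose low bit is the delete mark. The names (`node`, `mark`, `unmark`, `delete_node`) are illustrative, not CUBRID’s; the undo-the-mark recovery follows CUBRID’s variant shown later in `list_delete`.

```cpp
#include <atomic>
#include <cstdint>

struct node
{
  int key {0};
  std::atomic<node *> next {nullptr};
};

static node *mark (node *p)   { return (node *) ((std::uintptr_t) p | 1); }
static node *unmark (node *p) { return (node *) ((std::uintptr_t) p & ~(std::uintptr_t) 1); }

// Returns true if this thread deleted the victim. On any failed CAS the
// caller restarts the search from the bucket head.
bool delete_node (std::atomic<node *> &pred_next, node *victim)
{
  node *next = unmark (victim->next.load ());

  // STEP 1: mark — logical delete; traversers now treat the victim as gone
  if (!victim->next.compare_exchange_strong (next, mark (next)))
    {
      return false;     // already marked, or the chain moved: restart
    }

  // STEP 2: unlink — physical removal from the chain
  node *expected = victim;
  if (!pred_next.compare_exchange_strong (expected, next))
    {
      // predecessor changed under us: undo the mark (CUBRID's variant, as
      // list_delete below shows) and restart the search
      node *marked = mark (next);
      victim->next.compare_exchange_strong (marked, next);
      return false;
    }
  return true;          // unlinked; safe to retire, not yet safe to free
}
```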
Two side conditions:
- Restart on staleness. Every chain operation may need to restart from the bucket head if a CAS fails. This is the `LF_LIST_BR_RESTARTED` signal in the legacy code and the `restart_search` boolean in the modern code.
- Reclamation. A logically-deleted but not-yet-physically-unlinked node must remain dereferenceable until every reader who could have reached it via `predecessor->next` has finished. This is what the transactional reclamation framework (`lockfree::tran::*`) provides.
ABA hazard, recovery via reclamation
The classical ABA scenario for chained hashing: a thread reads `predecessor->next == A` and plans to CAS-mark `A->next`. A sibling thread deletes A, then re-allocates a new A' at the same address (e.g., via the freelist) and inserts A' somewhere — now `predecessor->next` points to A' (or further along the chain), not A. The original thread’s CAS-mark on the saved A pointer either succeeds and silently corrupts the `next` of A' (use-after-free plus corruption) or fails because the address changed (failure is the benign outcome; the silent success is the dangerous one).
The transactional reclamation framework prevents this by ensuring
that as long as the original thread holds an open descriptor on
the table, no node it could have observed is reclaimed — including
A. The CAS-mark either succeeds on the original A (which is
still in the chain, even if logically deleted) or fails because
the chain shape changed. There is no second-allocation aliasing.
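A hypothetical interleaving (threads T1/T2, address invented) that makes the hazard concrete, followed by why the descriptor bracket rules it out:

```cpp
//   T1: reads pred->next == A (A at address 0x1000), saves the pointer
//   T2: marks + unlinks A, retires it; the freelist recycles 0x1000 as A'
//   T2: inserts A' (same address, different key) after pred
//   T1: CAS-mark on 0x1000 succeeds — and corrupts the next pointer of A'
//
// With transactional reclamation, T1 opened a descriptor before reading
// pred->next, so A cannot be handed back to the freelist until T1 closes
// its bracket: T2's recycling step above cannot happen, and T1's CAS lands
// on the original A (or fails harmlessly).
```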
Hash function and bucket distribution
CUBRID delegates the hash function to the per-table descriptor’s
f_hash callback, which takes a key and a bucket count and
returns the bucket index. This is per-consumer:
- Lock manager uses an OID-shaped hash.
- Page buffer (legacy `LF_HASH_TABLE`) uses a VPID hash: `(pageid | ((unsigned int)volid) << 24) % htsize` (lock_free.c:154) — see the sketch after this list.
- Session table uses session-id modulo bucket count.
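A sketch of a per-consumer hash callback shaped like the descriptor’s `f_hash` contract (key and bucket count in, bucket index out). The formula is the one quoted above from lock_free.c; the struct and function names here are illustrative.

```cpp
struct vpid
{
  int pageid;
  short volid;
};

unsigned int vpid_hash (void *key, int htsize)
{
  const vpid *p = (const vpid *) key;
  return ((unsigned int) p->pageid | ((unsigned int) p->volid << 24))
         % (unsigned int) htsize;
}
```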
The bucket count itself is a one-shot decision per consumer at
boot, picked from system parameters. There is no growth, no shrink,
no rehash. The only resize-shaped operation is clear, which
swaps in an empty back-buffer atomically — see §“Clear and
back-buffer.”
Common DBMS Design
Lock managers everywhere use a chained hash
PostgreSQL’s lock.c partitions a single hash table into
NUM_LOCK_PARTITIONS = 16 partitions, each guarded by an
LWLock. InnoDB’s lock_sys_t uses a hash array on
(space_id, page_no) pairs guarded by lock_sys->mutex (with
recent versions adding sharded mutexes). MySQL’s open-table cache
is a hash with per-table mutexes. The repeated pattern is
partition by hash, lock per partition.
CUBRID’s approach is structurally different — instead of a coarse
mutex per partition, every chain is fully lock-free, and the
optional using_mutex is per entry rather than per partition.
The trade-off:
- Pro. Find / insert / erase scale with the bucket count, degrading only with chain length; there is no fixed partition count to bottleneck on.
- Pro. The locking discipline is simpler — a thread holds at most one entry’s mutex at a time.
- Con. The transactional reclamation overhead is non-zero on every find (the `start_tran`/`end_tran` bracket).
- Con. `clear` is O(buckets × chain_length) because every entry must be retired — the back-buffer trick spreads the cost but does not eliminate it.
Optional per-entry mutex
A subtle CUBRID-specific design: some entries carry their own
pthread_mutex_t (declared via `lf_entry_descriptor::using_mutex = 1` and the `of_mutex` field offset). When set, find and erase grab
the entry’s mutex after finding it but before operating on
the payload, allowing the caller to subsequently release the
mutex with a separate unlock call. This is the pattern used by
the lock manager’s resource entries — a thread that finds a lock
resource needs to hold the resource entry’s mutex while
mutating its waiter list.
When using_mutex = 0, the find/erase result is only valid
inside the transaction bracket — the caller must not let the
descriptor’s m_tranid go to INVALID_TRANID while it still
holds a payload pointer.
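A sketch of the two usage modes, following the method shapes quoted in this doc; `map`, `idx`, `key`, and `LK_RES` stand in for a real consumer’s names.

```cpp
// using_mutex = 1: find returns with the entry's mutex already held; the
// caller mutates the payload, then releases it with the separate unlock call.
LK_RES *res = map.find (idx, key);
if (res != NULL)
  {
    // ... mutate the resource's waiter list under the entry mutex ...
    map.unlock (idx, res);
  }

// using_mutex = 0: the returned pointer is valid only inside the internal
// transaction bracket; it must not be cached once the bracket closes.
```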
Find-or-insert speculation
Insert-with-find-fallback (the find_or_insert API) is the
common case — the caller wants the entry whether or not it
already exists. The implementation pattern: speculatively claim a
fresh entry from the freelist before searching, search the
chain, if the key is found discard the speculative entry,
otherwise CAS-link it into place.
CUBRID optimizes the discard with the descriptor::save_reclaimable
mechanism described in cubrid-lockfree-transaction.md — the
discarded entry is parked on the calling thread’s descriptor
rather than retired, and the next call site reuses it without a
freelist round-trip.
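A self-contained toy of the speculate-then-link shape described above, without marks or reclamation. All names are illustrative; the parked entry is a plain global here only to keep the toy short (in CUBRID it lives on the per-thread descriptor).

```cpp
#include <atomic>

struct entry
{
  int key {0};
  std::atomic<entry *> next {nullptr};
};

static entry *g_parked = nullptr;      // stand-in for save_reclaimable parking

static entry *claim (int key)          // stand-in for freelist_claim
{
  entry *e = (g_parked != nullptr) ? g_parked : new entry;
  g_parked = nullptr;
  e->key = key;                        // f_key_copy equivalent
  e->next.store (nullptr);
  return e;
}

entry *find_or_insert (std::atomic<entry *> &bucket, int key)
{
  entry *spec = claim (key);           // 1. speculative claim before the search
  for (;;)
    {
      std::atomic<entry *> *curr_p = &bucket;
      entry *curr = curr_p->load ();
      while (curr != nullptr)          // 2. walk the chain
        {
          if (curr->key == key)
            {
              g_parked = spec;         // 3a. found: park the speculative entry
              return curr;
            }
          curr_p = &curr->next;
          curr = curr_p->load ();
        }
      entry *expected = nullptr;
      if (curr_p->compare_exchange_strong (expected, spec))
        {
          return spec;                 // 3b. CAS-linked at the end of the chain
        }
      // lost the link race to a sibling insert: restart the walk
    }
}
```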
Theory ↔ CUBRID mapping
| Concept | CUBRID name (legacy) | CUBRID name (modern) |
|---|---|---|
| Chained hash table | LF_HASH_TABLE | lockfree::hashmap<K,T> |
| Bucket array | LF_HASH_TABLE::buckets | hashmap::m_buckets |
| Bucket count | LF_HASH_TABLE::hash_size | hashmap::m_size |
| Per-entry layout descriptor | LF_ENTRY_DESCRIPTOR | (same struct, lf_entry_descriptor) |
| Hash function | edesc->f_hash (key, htsize) | (same) |
| Key compare | edesc->f_key_cmp (k1, k2) | (same) |
| Key copy | edesc->f_key_copy (src, dst) | (same) |
| Init / uninit | edesc->f_init, edesc->f_uninit | (same) |
| Duplicate-key callback | edesc->f_duplicate | (same) |
| Per-entry mutex flag | edesc->using_mutex | (same) |
| Field offset: next | edesc->of_next | (same) |
| Field offset: key | edesc->of_key | (same) |
| Field offset: mutex | edesc->of_mutex | (same) |
| Field offset: del_tran_id | edesc->of_del_tran_id | (legacy only — modern uses freelist node) |
| Field offset: local_next | edesc->of_local_next | (legacy only — modern uses freelist node) |
| Behaviour flags | LF_LIST_BF_* | (same constants reused) |
| Restart signal | LF_LIST_BR_RESTARTED | (same constant reused) |
| Find primitive | lf_hash_find | hashmap::find |
| Find-or-insert | lf_hash_find_or_insert | hashmap::find_or_insert |
| Insert | lf_hash_insert | hashmap::insert |
| Insert-given | lf_hash_insert_given | hashmap::insert_given |
| Delete | lf_hash_delete | hashmap::erase |
| Delete-locked | lf_hash_delete_already_locked | hashmap::erase_locked |
| Clear | lf_hash_clear | hashmap::clear |
| Iterator | LF_HASH_TABLE_ITERATOR | hashmap::iterator |
| Back-buffer for clear | LF_HASH_TABLE::backbuffer | hashmap::m_backbuffer |
| C++ shim | lf_hash_table_cpp<K,T> | (same — used in bridge) |
| Wrapper deciding generation | (n/a) | cubthread::lockfree_hashmap<K,T> |
CUBRID’s Approach
Two generations, one descriptor
The hashmap exists as two complete implementations bridged by a
single dispatch class. Both consume the same
lf_entry_descriptor so a per-consumer entry layout works
unmodified across the migration.
Legacy C — lf_hash_* over LF_HASH_TABLE
```c
// LF_HASH_TABLE — src/base/lock_free.h:299
struct lf_hash_table
{
  void **buckets;
  void **backbuffer;
  pthread_mutex_t backbuffer_mutex;
  unsigned int hash_size;
  LF_FREELIST *freelist;
  LF_ENTRY_DESCRIPTOR *entry_desc;
};
```
The procedural API: `lf_hash_init`, `lf_hash_find`, `lf_hash_find_or_insert`, `lf_hash_insert`, `lf_hash_insert_given`, `lf_hash_delete`, `lf_hash_delete_already_locked`, `lf_hash_clear`. All take an `LF_TRAN_ENTRY *` (the per-thread state in the legacy reclamation system) and an `LF_HASH_TABLE *`.
The C++ shim lf_hash_table_cpp<K,T> is a typed wrapper in the
same header — it owns an lf_freelist and an lf_hash_table
inline and forwards every method to the C functions with a
reinterpret_cast<void**>(&t) for the entry pointer. This shim
is what cubthread::lockfree_hashmap holds for the OLD path.
Modern C++ — lockfree::hashmap<K,T>
Section titled “Modern C++ — lockfree::hashmap<K,T>”// lockfree::hashmap<Key, T> — src/base/lockfree_hashmap.hpp:42template <class Key, class T>class hashmap {public: void init (tran::system &transys, size_t hash_size, size_t freelist_block_size, size_t freelist_block_count, lf_entry_descriptor &edesc); void destroy ();
T *find (tran::index, Key &); bool find_or_insert (tran::index, Key &, T *&); bool insert (tran::index, Key &, T *&); bool insert_given (tran::index, Key &, T *&); bool erase (tran::index, Key &); bool erase_locked (tran::index, Key &, T *&);
void unlock (tran::index, T *&); void clear (tran::index); // NOT lock-free T *freelist_claim (tran::index); void freelist_retire (tran::index, T *&);
void start_tran (tran::index); void end_tran (tran::index);
// ...private: using address_type = address_marker<T>; struct freelist_node_data { T m_entry; lf_entry_descriptor *m_edesc; void on_reclaim (); }; using freelist_type = freelist<freelist_node_data>; using free_node_type = typename freelist_type::free_node;
freelist_type *m_freelist; T **m_buckets; size_t m_size; T **m_backbuffer; std::mutex m_backbuffer_mutex; lf_entry_descriptor *m_edesc; // statistics ...};Three structural differences from legacy:
- Templated on `K`, `T`; no `void *` casts in the public surface.
- No global tran-system reference. `init` takes a `tran::system &` and builds its own `tran::table` (inside the freelist).
- Entries live as `freelist<freelist_node_data>::free_node`, not bare `T`. The freelist owns the node header (the `tran::reclaimable_node` base + `m_owner` pointer) and the payload `T m_entry` — see §“Entry layout — modern”.
Bridge — cubthread::lockfree_hashmap<K,T>
```cpp
// cubthread::lockfree_hashmap<Key,T> — src/thread/thread_lockfree_hash_map.hpp:35
template <class Key, class T>
class lockfree_hashmap
{
  public:
    void init (lf_tran_system &transys, int entry_idx, int hash_size,
               int freelist_block_size, int freelist_block_count,
               lf_entry_descriptor &edesc);
    void init_as_old (lf_tran_system &, int hash_size, int freelist_block_count,
                      int freelist_block_size, lf_entry_descriptor &, int entry_idx);
    void init_as_new (lockfree::tran::system &, size_t hash_size,
                      size_t freelist_block_size, size_t freelist_block_count,
                      lf_entry_descriptor &);

    // forwards: find, find_or_insert, insert, insert_given, erase, erase_locked,
    // unlock, clear, freelist_claim, freelist_retire, start_tran, end_tran

  private:
    enum type { OLD, NEW, UNKNOWN };

    lf_hash_table_cpp<Key, T> m_old_hash;
    lockfree::hashmap<Key, T> m_new_hash;
    type m_type;
    int m_entry_idx;
};
```
The dispatch is a single boolean check at init():
```cpp
void init (lf_tran_system &transys, int entry_idx, int hash_size,
           int freelist_block_size, int freelist_block_count,
           lf_entry_descriptor &edesc)
{
  if (prm_get_bool_value (PRM_ID_ENABLE_NEW_LFHASH))
    {
      init_as_new (get_thread_entry_lftransys (), (size_t) hash_size,
                   (size_t) freelist_block_size, (size_t) freelist_block_count, edesc);
    }
  else
    {
      init_as_old (transys, hash_size, freelist_block_count, freelist_block_size,
                   edesc, entry_idx);
    }
}
```
Every forwarded method picks the live implementation by reading
m_type and dispatching the matching call. The dispatch is cheap —
a single conditional branch — but it is on every call. There is
no inlined fast path; the branch sits inside a macro
(lockfree_hashmap_forward_func) that every public method
expands.
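A minimal sketch of the forwarding shape (hypothetical macro name and body; the real `lockfree_hashmap_forward_func` lives in thread_lockfree_hash_map.hpp and may differ in detail).

```cpp
#define hashmap_forward(fn, ...)                  \
  do                                              \
    {                                             \
      if (m_type == OLD)                          \
        {                                         \
          return m_old_hash.fn (__VA_ARGS__);     \
        }                                         \
      return m_new_hash.fn (__VA_ARGS__);         \
    }                                             \
  while (0)

// e.g., a forwarder body: T *find (...) { hashmap_forward (find, tran, key); }
```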
The shared lf_entry_descriptor
Both paths consume the same descriptor:
```c
// lf_entry_descriptor — src/base/lock_free.h:62
struct lf_entry_descriptor
{
  unsigned int of_local_next;     // legacy only
  unsigned int of_next;
  unsigned int of_del_tran_id;    // legacy only
  unsigned int of_key;
  unsigned int of_mutex;
  int using_mutex;
  int max_alloc_cnt;              // legacy only

  LF_ENTRY_ALLOC_FUNC f_alloc;    // legacy only
  LF_ENTRY_FREE_FUNC f_free;      // legacy only
  LF_ENTRY_INITIALIZE_FUNC f_init;
  LF_ENTRY_UNINITIALIZE_FUNC f_uninit;
  LF_ENTRY_KEY_COPY_FUNC f_key_copy;
  LF_ENTRY_KEY_COMPARE_FUNC f_key_cmp;
  LF_ENTRY_HASH_FUNC f_hash;
  LF_ENTRY_DUPLICATE_KEY_HANDLER f_duplicate;
};
```
Three categories of fields:
- Always used. `of_next`, `of_key`, `of_mutex`, `using_mutex`, `f_init`, `f_uninit`, `f_key_copy`, `f_key_cmp`, `f_hash`, `f_duplicate`.
- Legacy only. `of_local_next` (used by the legacy retire list and the legacy stack/freelist), `of_del_tran_id` (used by the legacy retire-id stamping), `max_alloc_cnt`, `f_alloc`, `f_free` (the legacy freelist allocates with the `f_alloc`/`f_free` callbacks; the modern freelist uses `new`/`delete` on its own `free_node` type).
- Modern only. None — the modern path simply ignores the legacy-only fields. A descriptor populated for both paths silently uses the right subset.
This is the critical reason the two-generation co-existence works: a single in-tree descriptor table works on either side because the modern code overlooks the legacy fields and the legacy code overlooks the modern absences. Each consumer (lock manager, session table, etc.) defines its descriptor once and the bridge flips the implementation under it.
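A sketch of the pattern each consumer follows: compute the byte offsets with offsetof and fill the callback slots in declaration order. `my_entry` and the `my_*` callbacks are hypothetical (their prototypes would use the descriptor’s typedef’d signatures); UINT64 is the CUBRID typedef.

```cpp
#include <cstddef>     // offsetof
#include <pthread.h>

struct my_entry
{
  my_entry *local_next;     // of_local_next (legacy freelist / retire list)
  my_entry *next;           // of_next (chain pointer)
  UINT64 del_tran_id;       // of_del_tran_id (legacy retire stamping)
  int key;                  // of_key
  pthread_mutex_t mutex;    // of_mutex (meaningful only if using_mutex = 1)
  // ... payload ...
};

static lf_entry_descriptor my_entry_desc = {
  offsetof (my_entry, local_next),
  offsetof (my_entry, next),
  offsetof (my_entry, del_tran_id),
  offsetof (my_entry, key),
  offsetof (my_entry, mutex),
  1,                   // using_mutex
  0,                   // max_alloc_cnt (legacy only)
  my_entry_alloc,      // f_alloc  (legacy only)
  my_entry_free,       // f_free   (legacy only)
  my_entry_init,       // f_init
  my_entry_uninit,     // f_uninit
  my_key_copy,         // f_key_copy
  my_key_cmp,          // f_key_cmp
  my_key_hash,         // f_hash
  NULL                 // f_duplicate (no duplicate handling)
};
```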
Entry layout — modern
Where the legacy table stores entries as raw T allocated by
f_alloc, the modern path wraps T inside a freelist node:
```cpp
// lockfree::hashmap<K,T>::freelist_node_data — src/base/lockfree_hashmap.hpp:84
struct freelist_node_data
{
  T m_entry;
  lf_entry_descriptor *m_edesc;

  freelist_node_data () = default;
  ~freelist_node_data () = default;

  void on_reclaim ()
  {
    if (m_edesc->f_uninit != NULL)
      {
        (void) m_edesc->f_uninit (&m_entry);
      }
  }
};

using freelist_type = freelist<freelist_node_data>;
using free_node_type = typename freelist_type::free_node;
```
The free_node template adds a tran::reclaimable_node base
plus a freelist *m_owner back-pointer plus the payload (see
cubrid-lockfree-freelist.md). The hashmap therefore stores
pointers to T m_entry in the buckets, and converts back to
the enclosing free_node when retiring:
```cpp
// hashmap<K,T>::to_free_node — src/base/lockfree_hashmap.hpp:552
free_node_type *to_free_node (T *p)
{
  // not nice, but necessary until we fully refactor lockfree hashmap
  const std::ptrdiff_t off = free_node_offset_of_data (free_node_type ());
  char *cp = (char *) p;
  cp -= off;
  return (free_node_type *) cp;
}
```
The free_node_offset_of_data constexpr helper computes the
offset of m_entry within the free_node so the back-cast is
correct regardless of compiler padding. The header comment
acknowledges this is a workaround pending a deeper refactor — the
“correct” design would have hashmap entries embed
reclaimable_node directly rather than via the freelist’s
generic free_node.
Field-offset access — both generations
Both paths mutate T through byte offsets, not member access.
The hashmap reads the next-pointer of an entry as:
```cpp
// hashmap<K,T>::get_nextp_ref — src/base/lockfree_hashmap.hpp:530
T *&get_nextp_ref (T *p)
{
  return * (T **) get_ref (p, m_edesc->of_next);
}

// hashmap<K,T>::get_keyp — src/base/lockfree_hashmap.hpp:516
void *get_keyp (T *p)
{
  return get_ptr (p, m_edesc->of_key);
}
```
This is the design that makes lf_entry_descriptor the single
shared type. Both implementations pun a T* into a char*,
add a byte offset, cast back to the field type, and access. The
template parameter T does not appear in the access path — it is
a documentation-only type bound. This is what lets a legacy
LK_RES (lock-resource) struct be stored unchanged when the
consumer migrates from lf_hash_* to lockfree::hashmap<OID, LK_RES>.
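A sketch of the byte-offset primitive both accessors sit on; `get_ref` is the helper name quoted above, but this body is an assumption about its shape.

```cpp
template <class T>
void *get_ref (T *p, unsigned int offset)
{
  return (void *) (((char *) p) + offset);
}
```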
Component overview
```mermaid
flowchart TB
Caller["Consumer<br/>(lock_manager, session, ...)"]
Bridge["cubthread::lockfree_hashmap<K,T><br/>m_type ∈ {OLD, NEW, UNKNOWN}"]
Caller --> Bridge
subgraph Legacy["m_type = OLD"]
Cpp["lf_hash_table_cpp<K,T><br/>(template, in lock_free.h)"]
LFH["lf_hash_table"]
LFFL["lf_freelist"]
LFTS["LF_TRAN_SYSTEM (per-consumer global)"]
LFTE["LF_TRAN_ENTRY[]"]
Cpp --> LFH
Cpp --> LFFL
LFFL --> LFTS
LFTS --> LFTE
end
subgraph Modern["m_type = NEW"]
HM["lockfree::hashmap<K,T>"]
FL["lockfree::freelist<freelist_node_data>"]
TT["lockfree::tran::table"]
TS["lockfree::tran::system<br/>(server-wide)"]
HM --> FL
FL --> TT
TT --> TS
end
Bridge --> Cpp
Bridge --> HM
PRM["PRM_ID_ENABLE_NEW_LFHASH<br/>(read at init time)"]
PRM --> Bridge
Edesc["lf_entry_descriptor<br/>(shared by both)"]
Edesc --> Cpp
  Edesc --> HM
```
List operations — modern, in detail
The hashmap defers to three private chain primitives:
- `list_find` — read-only walk.
- `list_insert_internal` — find-or-insert with optional given entry, optional duplicate callback, optional behavior flags.
- `list_delete` — Harris–Michael two-CAS unlink with optional pre-locked entry.
Each is wrapped in an outer loop (hash_insert_internal,
hash_erase_internal) that retries on the LF_LIST_BR_RESTARTED
flag. The behavior flags LF_LIST_BF_* and response flags
LF_LIST_BR_* are taken verbatim from the legacy header — the
modern path reuses these constants rather than reinventing.
Insert path
```cpp
// hashmap<K,T>::list_insert_internal (abridged) — src/base/lockfree_hashmap.hpp:899
bool list_insert_internal (tran::index tran_index, T *&list_head, Key &key,
                           int *behavior_flags, T *&entry)
{
  pthread_mutex_t *entry_mutex = NULL;
  T **curr_p;
  T *curr;
  tran::descriptor &tdes = get_tran_descriptor (tran_index);

  while (restart_search)
    {
      start_tran_force (tdes);      // bracket open
      curr_p = &list_head;
      curr = address_type::atomic_strip_address_mark (*curr_p);

      while (curr_p != NULL)
        {
          if (curr != NULL)
            {
              if (m_edesc->f_key_cmp (&key, get_keyp (curr)) == 0)
                {
                  // FOUND duplicate key
                  if (m_edesc->using_mutex)
                    {
                      lock_entry_mutex (*curr, entry_mutex);
                      end_tran_force (tdes);    // mutex supersedes tran
                      if (address_type::is_address_marked (get_nextp_ref (curr)))
                        {
                          // raced with delete — restart
                          unlock_entry_mutex_force (entry_mutex);
                          return false;         // RESTARTED flag set
                        }
                    }
                  if (m_edesc->f_duplicate != NULL)
                    {
                      // restart on duplicate (the only current behavior)
                      return false;
                    }
                  if (LF_LIST_BF_IS_FLAG_SET (behavior_flags, LF_LIST_BF_INSERT_GIVEN))
                    freelist_retire (tdes, entry);    // give up the speculative entry
                  if (!LF_LIST_BF_IS_FLAG_SET (behavior_flags, LF_LIST_BF_FIND_OR_INSERT))
                    {
                      // straight insert: failure
                      return false;
                    }
                  entry = curr;     // hand back found entry
                  return false;     // not inserted, but found
                }
              // advance to next
              curr_p = &get_nextp_ref (curr);
              curr = address_type::strip_address_mark (*curr_p);
            }
          else
            {
              // END OF CHAIN — insert here
              if (entry == NULL)
                {
                  entry = freelist_claim (tdes);    // speculative claim
                  if (m_edesc->f_key_copy (&key, get_keyp (entry)) != NO_ERROR)
                    return false;
                }
              if (m_edesc->using_mutex)
                lock_entry_mutex (*entry, entry_mutex);

              if (!ATOMIC_CAS_ADDR (curr_p, (T *) NULL, entry))
                {
                  // somebody linked first — save speculative for next round and restart
                  if (m_edesc->using_mutex)
                    unlock_entry_mutex_force (entry_mutex);
                  if (!LF_LIST_BF_IS_FLAG_SET (behavior_flags, LF_LIST_BF_INSERT_GIVEN))
                    save_temporary (tdes, entry);
                  LF_LIST_BR_SET_FLAG (behavior_flags, LF_LIST_BR_RESTARTED);
                  end_tran_force (tdes);
                  return false;
                }
              if (m_edesc->using_mutex)
                end_tran_force (tdes);    // mutex supersedes tran
              return true;                // INSERTED
            }
        }
    }
}
```
The insertion is a single CAS on the predecessor’s next-pointer. The dance around it is:
- `save_temporary` on race-loss. If the CAS-link fails because a sibling thread linked first, the speculatively-claimed entry is parked on the descriptor (`tdes.save_reclaimable`) so the next iteration of the outer `while` reuses it. This is the freelist round-trip avoidance.
- `f_duplicate` callback path. If a per-consumer “what to do on duplicate key” callback is registered, the insert calls it (e.g., to bump a session-id counter and request a new key) and then sets `RESTARTED` to retry from a different bucket.
- Mutex supersedes tran. When the entry has a mutex, the insert grabs the mutex before CAS-linking and ends the transaction immediately after the link succeeds — the mutex is now sufficient protection for the payload, and ending the tran early frees the descriptor for other ops.
Delete path
```cpp
// hashmap<K,T>::list_delete (abridged) — src/base/lockfree_hashmap.hpp:1111
bool list_delete (tran::index tran_index, T *&list_head, Key &key,
                  T *locked_entry, int *behavior_flags)
{
  while (restart_search)
    {
      start_tran_force (tdes);
      curr_p = &list_head;
      curr = address_type::atomic_strip_address_mark (*curr_p);

      while (curr != NULL)
        {
          if (m_edesc->f_key_cmp (&key, get_keyp (curr)) == 0)
            {
              if (locked_entry != NULL && locked_entry != curr)
                {
                  // erase_locked semantics: the entry I held the lock on was
                  // already retired; current entry is a *different* one with
                  // the same key. Bail.
                  return false;
                }
              next_p = &get_nextp_ref (curr);
              next = address_type::strip_address_mark (*next_p);

              // STEP 1: mark next pointer
              if (!ATOMIC_CAS_ADDR (next_p, next, address_type::set_adress_mark (next)))
                {
                  // raced — restart
                  if (LF_LIST_BR_RESTARTED)
                    return false;
                  break;
                }

              if (m_edesc->using_mutex)
                {
                  if (LF_LIST_BF_IS_FLAG_SET (behavior_flags, LF_LIST_BF_LOCK_ON_DELETE))
                    lock_entry_mutex (*curr, entry_mutex);
                  else
                    entry_mutex = get_pthread_mutexp (curr);   // already locked
                }

              // STEP 2: unlink
              if (!ATOMIC_CAS_ADDR (curr_p, curr, next))
                {
                  // raced; remove our mark and restart
                  if (m_edesc->using_mutex
                      && LF_LIST_BF_IS_FLAG_SET (..., LF_LIST_BF_LOCK_ON_DELETE))
                    unlock_entry_mutex_force (entry_mutex);
                  ATOMIC_CAS_ADDR (next_p, address_type::set_adress_mark (next), next);
                  if (LF_LIST_BR_RESTARTED)
                    return false;
                  break;
                }
              if (m_edesc->using_mutex)
                unlock_entry_mutex_force (entry_mutex);

              promote_tran_force (tdes);     // bump global id (retire epoch)
              freelist_retire (tdes, curr);  // append to retired ring
              end_tran_force (tdes);
              return true;
            }
          curr_p = &get_nextp_ref (curr);
          curr = address_type::strip_address_mark (*curr_p);
        }
    }
}
```
The two CASes are exactly the Harris-Michael protocol:
- Mark.
next_pfromnexttomark(next). Failure means a sibling thread is also trying to delete the same node (or has already done step 1). - Unlink.
curr_pfromcurrtonext(unmarked). Failure means the predecessor pointer changed under us — another insert or delete moved pastcurr. Recovery: undo the mark and restart.
The retire is delayed until after the unlink CAS succeeds.
This is what makes m_retire_tranid strictly greater than any
concurrent reader’s snapshot — by the time we call
promote_tran_force (which bumps the global id and writes it to
our descriptor), the chain no longer contains curr, so any
future reader cannot see it via the bucket. Existing readers
holding curr directly still complete safely because their
snapshot’s tranid <= our pre-promotion tranid < our retire_tranid, and reclamation only fires when
min_active > retire_tranid.
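Hypothetical ids walking the ordering argument above (values invented for illustration):

```cpp
//   reader R:  start_tran at global id 41 -> snapshot tranid = 41
//   deleter D: unlink succeeds, promote_tran_force bumps the global id to 42
//              and the retired node is stamped retire_tranid = 42
//
// R's open bracket keeps min_active at 41, so the condition below stays
// false and the node outlives every reader that could still hold it.
#include <cstdint>

bool can_reclaim (std::uint64_t min_active_tranid, std::uint64_t retire_tranid)
{
  return min_active_tranid > retire_tranid;   // the doc's reclamation condition
}
```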
Find path
```cpp
// hashmap<K,T>::find (abridged) — src/base/lockfree_hashmap.hpp:280
T *find (tran::index tran_index, Key &key)
{
  while (restart)
    {
      T *&list_head = get_bucket (key);
      entry = NULL;
      bflags = LF_LIST_BF_RETURN_ON_RESTART;
      list_find (tran_index, list_head, key, &bflags, entry);
      restart = (bflags & LF_LIST_BR_RESTARTED) != 0;
    }
  return entry;
}
```
find always sets LF_LIST_BF_RETURN_ON_RESTART so the chain
walk returns to the bucket-array level on restart, recomputing
the hash. This matters when the only inflight change to the
chain is unrelated (a different key being inserted/deleted) — a
clean restart from get_bucket re-derefs the bucket pointer
and is robust to a clear having swapped the bucket array
underneath.
Clear and back-buffer
hashmap::clear is the only non-lock-free public operation:
```cpp
// hashmap<K,T>::clear (abridged) — src/base/lockfree_hashmap.hpp:378
void clear (tran::index tran_index)
{
  tran::descriptor &tdes = get_tran_descriptor (tran_index);
  std::unique_lock<std::mutex> ulock (m_backbuffer_mutex);

  // STEP 1: atomically swap m_buckets <-> m_backbuffer
  T **old_buckets = ATOMIC_TAS_ADDR (&m_buckets, m_backbuffer);

  // STEP 2: reset what is now the new backbuffer (was the old m_buckets)
  for (size_t i = 0; i < m_size; ++i)
    {
      assert (m_backbuffer[i] == address_type::set_adress_mark (NULL));
      m_buckets[i] = NULL;
    }
  m_backbuffer = old_buckets;

  // STEP 3: retire every entry in old buckets, link by link
  for (size_t i = 0; i < m_size; ++i)
    {
      do
        {
          curr = address_type::strip_address_mark (old_buckets[i]);
        }
      while (!ATOMIC_CAS_ADDR (&old_buckets[i], curr, address_type::set_adress_mark (NULL)));

      while (curr != NULL)
        {
          next_p = &get_nextp_ref (curr);
          do
            {
              next = address_type::strip_address_mark (*next_p);
            }
          while (!ATOMIC_CAS_ADDR (next_p, next, address_type::set_adress_mark (next)));

          if (m_edesc->using_mutex)
            {
              lock_entry_mutex (*curr, mutex_p);   // wait for any in-flight reader
              unlock_entry_mutex_force (mutex_p);
            }
          freelist_retire (tdes, curr);            // append to retired ring
          curr = next;
        }
    }
}
```
Three observations:
- The bucket-array swap is atomic. Concurrent finds may briefly observe `m_buckets` pointing at the old or the new array; the invariant is that whatever they point at, the chain walk succeeds (modulo restart-on-CAS-fail) because every chain link is still valid.
- The back-buffer is pre-marked with `set_adress_mark (NULL)` so that any thread looking at the back-buffer’s bucket head sees a marked-NULL chain head and treats it as empty. This is the back-buffer mutex’s invariant — under it, the back-buffer is always all-marked-NULL.
- Per-entry mutex acquisition during retirement is the wait point for in-flight readers. A reader who found the entry before clear started will hold the entry’s mutex; clear serializes against that mutex, ensuring the reader is done before we retire.
Cost: O(buckets × chain_length) atomic operations + N pthread mutex lock/unlock cycles + descriptor retire-ring growth. On a heavily-loaded table this is the worst single operation in the hashmap surface; consumers who clear (sessions on logout, lock manager on shutdown, XASL cache on plan-cache reset) accept it as a rare event.
Iterator
The iterator (hashmap::iterator) walks bucket-by-bucket,
chain-by-chain, with a transaction held open per bucket. It
auto-releases the transaction at end-of-bucket and reacquires at
the next. The pattern is deliberately conservative — long
iterations would otherwise keep the descriptor’s m_tranid low
and prevent reclamation across the whole server.
```cpp
// hashmap<K,T>::iterator::iterate (abridged) — src/base/lockfree_hashmap.hpp:1343
T *iterate ()
{
  do
    {
      if (m_curr != NULL)
        {
          if (m_hashmap->m_edesc->using_mutex)
            m_hashmap->unlock_entry (*m_curr);   // release per-entry mutex
          m_curr = address_type::strip_address_mark (m_hashmap->get_nextp_ref (m_curr));
        }
      else
        {
          if (m_bucket_index != INVALID_INDEX)
            m_tdes->end_tran ();
          m_tdes->start_tran ();
          m_bucket_index++;
          if (m_bucket_index >= m_hashmap->m_size)
            {
              m_tdes->end_tran ();
              return NULL;   // end
            }
          m_curr = address_type::atomic_strip_address_mark (m_hashmap->m_buckets[m_bucket_index]);
        }
      if (m_curr != NULL && m_hashmap->m_edesc->using_mutex)
        {
          m_hashmap->lock_entry (*m_curr);
          if (address_type::is_address_marked (m_hashmap->get_nextp_ref (m_curr)))
            continue;   // deleted in flight, skip
        }
    }
  while (m_curr == NULL);
  return m_curr;
}
```
Per-method statistics
Every public method is wrapped by ct_stat_type::autotimer
that increments a counter and accumulates a duration timer
into one of m_stat_find, m_stat_insert, m_stat_erase,
m_stat_unlock, m_stat_clear, m_stat_iterates,
m_stat_claim, m_stat_retire. The autotimer is short-circuited
unless m_active_stats == true, which is toggled by
activate_stats() / deactivate_stats() per-table. The default
is off — production servers do not pay the stats cost
unless an operator explicitly enables it.
Legacy implementation, in brief
The legacy lf_hash_* family does the same Harris-Michael walk
in C, with LF_TRAN_ENTRY * instead of tran::descriptor and
the global LF_TRAN_SYSTEMs instead of per-table tables. The
key procedural calls:
- `lf_hash_init` builds the bucket array and the back-buffer, registers `entry_desc` with the freelist.
- `lf_hash_find` calls `lf_list_find`, which is the legacy counterpart of `hashmap::list_find`.
- `lf_hash_insert` calls `lf_hash_insert_internal` → `lf_list_insert_internal`.
- `lf_hash_delete` calls `lf_hash_delete_internal` → `lf_list_delete`.
- `lf_hash_clear` walks the back-buffer the same way the modern `clear` does.
The same LF_LIST_BF_* flags drive both. The same back-buffer
layout (backbuffer[i] == ADDR_WITH_MARK (NULL)) is invariant.
The legacy stack additionally exposes:
- `lf_io_list_find`, `lf_io_list_find_or_insert` — an insert-only singly-linked list with no deletes and no freelist. Used by code paths that need a thread-safe append-only set (currently a small number of internal call sites in the legacy stack; the modern path has no equivalent).
- `lf_stack_push`, `lf_stack_pop` — Treiber stack. Documented as ABA-vulnerable; not currently used by any in-engine consumer (kept for compile-link compatibility with potential external callers).
Migration narrative
The legacy stack predates C++17 in CUBRID. The modern stack was introduced as part of the C++17 modernization with two stated goals:
- Type safety. No more `void *` on the public surface; the `lf_entry_descriptor` byte-offsets remain but are confined to the implementation.
- Per-table reclamation. No more 11 global `LF_TRAN_SYSTEM` instances. Each `lockfree::hashmap` builds its own `tran::table` from a server-wide `tran::system`; reclamation pressure on the lock manager’s hash does not throttle reclamation on the session table.
The migration is opt-in via system parameter, not a flag
day. The bridge exists so each consumer can flip independently
without rewriting call sites. The migration completion bar is
zero call sites left on lf_hash_table_cpp directly — every
consumer hits the new path through the bridge, and the
PRM_ID_ENABLE_NEW_LFHASH toggle becomes the single decision
for the whole server.
As of 2026-05-07 the toggle’s default is unverified in this doc (see Open Question 1). Empirically, the bridge dispatch — the macro `lockfree_hashmap_forward_func` — is the most frequently executed code in the hashmap, more so than the chain walks themselves, because every public call goes through it.
Consumers
| Consumer | Source | Entry struct | Generation toggle |
|---|---|---|---|
| Object-lock resource hash | src/transaction/lock_manager.c | LK_RES | enable_new_lfhash |
| Object-lock entry hash | src/transaction/lock_manager.c | LK_ENTRY | enable_new_lfhash |
| Session table | src/session/session.c | SESSION_STATE | enable_new_lfhash |
| XASL cache | src/query/xasl_cache.c | XASL_CACHE_ENTRY | enable_new_lfhash |
| Filter-predicate cache | src/query/filter_pred_cache.c | FILTER_PRED_CACHE_ENTRY | enable_new_lfhash |
| System catalog | src/storage/system_catalog.c | CATCLS_ENTRY | enable_new_lfhash |
| Slotted-page saving | src/storage/slotted_page.c | SPAGE_SAVE_ENTRY | enable_new_lfhash |
| HFID table | src/storage/heap_file.c | HFID_TABLE_ENTRY | enable_new_lfhash |
| Global unique stats | src/storage/btree.c | LOG_LSA / *unique_stats* | enable_new_lfhash |
| DWB slot table | src/storage/double_write_buffer.cpp | DWB_SLOT_HASH_ENTRY | enable_new_lfhash |
| Sort-list (legacy only) | src/query/sort.c | SORT_LIST_ENTRY | (legacy only) |
Each consumer defines its lf_entry_descriptor once (typically
as a static const at file scope), constructs its
cubthread::lockfree_hashmap<K,T> member, and calls init()
once at boot.
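A sketch of that boot-time wiring (class, member, and parameter values are hypothetical; each real consumer follows this shape with sizes drawn from system parameters and a descriptor like `my_entry_desc` sketched earlier).

```cpp
class my_consumer
{
  public:
    void boot (lf_tran_system &transys)
    {
      m_map.init (transys, THREAD_TS_MY_CONSUMER /* entry_idx, hypothetical */,
                  1024 /* hash_size */, 100 /* freelist_block_size */,
                  4 /* freelist_block_count */, my_entry_desc);
    }

  private:
    cubthread::lockfree_hashmap<my_key, my_entry> m_map;
};
```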
Source Walkthrough
Modern stack — public surface
- `lockfree::hashmap<K,T>` — class template, `lockfree_hashmap.hpp:42`. The `.cpp` (25 lines) is a stub.
- `hashmap::find` — :280
- `hashmap::find_or_insert` — :301
- `hashmap::insert` — :308
- `hashmap::insert_given` — :315
- `hashmap::erase` — :325
- `hashmap::erase_locked` — :345
- `hashmap::unlock` — :360
- `hashmap::clear` — :378
- `hashmap::freelist_claim` — :656 (and :663 for the `tran::descriptor &` overload)
- `hashmap::freelist_retire` — :579 (and :646 for the `tran::descriptor &` overload)
- `hashmap::list_find` — :813
- `hashmap::list_insert_internal` — :899
- `hashmap::list_delete` — :1111
- `hashmap::hash_insert_internal` — :1269
- `hashmap::hash_erase_internal` — :1305
- `hashmap::iterator::iterate` — :1343
Legacy stack — public surface
- `LF_HASH_TABLE` struct — `lock_free.h:299`
- `LF_ENTRY_DESCRIPTOR` struct — `lock_free.h:62`
- `lf_hash_init` — `lock_free.c` (defined later in the file)
- `lf_hash_find` — `lock_free.c`
- `lf_hash_find_or_insert` — `lock_free.c`
- `lf_hash_insert` / `lf_hash_insert_given` — `lock_free.c`
- `lf_hash_delete` / `lf_hash_delete_already_locked` — `lock_free.c`
- `lf_hash_clear` — `lock_free.c`
- `lf_list_find` / `lf_list_insert_internal` / `lf_list_delete` — internal chain walks (`lock_free.c`)
- `lf_io_list_find` / `lf_io_list_find_or_insert` — insert-only list (`lock_free.c:1077+`)
- `lf_stack_push` / `lf_stack_pop` — Treiber stack (`lock_free.c:555+`)
- `lf_freelist_init` / `lf_freelist_claim` / `lf_freelist_retire` / `lf_freelist_transport` — `lock_free.c:676+`
- `lf_hash_table_cpp<K,T>` — C++ shim, `lock_free.h:367+`
Bridge
- `cubthread::lockfree_hashmap<K,T>` — `thread_lockfree_hash_map.hpp:35`
- `init` (toggles by parameter) — :129
- `init_as_old` / `init_as_new` — :145, :155
- All forwarded methods (`find`, `insert`, `erase`, …) — :171-:294, expanding the `lockfree_hashmap_forward_func` macro
- `cubthread::get_thread_entry_lftransys` — `thread_lockfree_hash_map.cpp:27`
Position hints (as of 2026-05-07)
| Symbol | File | Line |
|---|---|---|
lockfree::hashmap<K,T> (class template) | src/base/lockfree_hashmap.hpp | 42 |
hashmap::init | src/base/lockfree_hashmap.hpp | 261 |
hashmap::find | src/base/lockfree_hashmap.hpp | 280 |
hashmap::find_or_insert | src/base/lockfree_hashmap.hpp | 301 |
hashmap::erase | src/base/lockfree_hashmap.hpp | 325 |
hashmap::erase_locked | src/base/lockfree_hashmap.hpp | 345 |
hashmap::clear | src/base/lockfree_hashmap.hpp | 378 |
hashmap::list_find | src/base/lockfree_hashmap.hpp | 813 |
hashmap::list_insert_internal | src/base/lockfree_hashmap.hpp | 899 |
hashmap::list_delete | src/base/lockfree_hashmap.hpp | 1111 |
hashmap::hash_insert_internal | src/base/lockfree_hashmap.hpp | 1269 |
hashmap::iterator::iterate | src/base/lockfree_hashmap.hpp | 1343 |
hashmap::freelist_node_data::on_reclaim | src/base/lockfree_hashmap.hpp | 92 |
hashmap::to_free_node | src/base/lockfree_hashmap.hpp | 552 |
lf_hash_table (struct) | src/base/lock_free.h | 299 |
lf_entry_descriptor (struct) | src/base/lock_free.h | 62 |
lf_hash_table_cpp<K,T> (template) | src/base/lock_free.h | 367 |
lf_freelist_alloc_block | src/base/lock_free.c | 621 |
lf_freelist_claim | src/base/lock_free.c | 760 |
lf_freelist_retire | src/base/lock_free.c | 873 |
lf_freelist_transport | src/base/lock_free.c | 934 |
lf_io_list_find | src/base/lock_free.c | 1077 |
lf_io_list_find_or_insert | src/base/lock_free.c | 1131 |
lf_stack_push / lf_stack_pop | src/base/lock_free.c | 554 / 584 |
cubthread::lockfree_hashmap<K,T> | src/thread/thread_lockfree_hash_map.hpp | 35 |
cubthread::lockfree_hashmap::init | src/thread/thread_lockfree_hash_map.hpp | 129 |
cubthread::lockfree_hashmap::init_as_new | src/thread/thread_lockfree_hash_map.hpp | 155 |
cubthread::lockfree_hashmap::init_as_old | src/thread/thread_lockfree_hash_map.hpp | 145 |
cubthread::get_thread_entry_lftransys | src/thread/thread_lockfree_hash_map.cpp | 27 |
Source verification (as of 2026-05-07)
Section titled “Source verification (as of 2026-05-07)”Verified facts
- The bridge dispatches at `init()` only — there is no per-call toggle. `cubthread::lockfree_hashmap::init` reads `prm_get_bool_value (PRM_ID_ENABLE_NEW_LFHASH)` exactly once and writes `m_type = OLD or NEW`. Subsequent calls dispatch on `m_type`. Verified at `thread_lockfree_hash_map.hpp:129-141`.
- Both implementations consume the same `lf_entry_descriptor` type. The struct is defined once in `lock_free.h:62`, used by `lf_hash_*` and by `lockfree::hashmap<K,T>::init` (which takes an `lf_entry_descriptor &` argument). Verified by reading both signatures.
- Modern hashmap stores entries via `freelist_node_data` wrapping. Each entry is a `freelist<freelist_node_data>::free_node`, carrying a `tran::reclaimable_node` base, a back-pointer to the freelist owner, and the payload `T m_entry`. Verified at `lockfree_hashmap.hpp:84-100`. The header acknowledges the `to_free_node` offset hack as a workaround pending refactor.
- Per-entry mutex is optional, driven by `lf_entry_descriptor::using_mutex`. Verified at `lockfree_hashmap.hpp:777` (`assert (m_edesc->using_mutex)` in `lock_entry_mutex`) and at :362-371 where `unlock` branches on `m_edesc->using_mutex`.
- `f_duplicate` callback currently always restarts. `list_insert_internal` at `lockfree_hashmap.hpp:962-976`: when `f_duplicate` is non-NULL, it is called for its side effect (e.g., bumping a counter), then `LF_LIST_BR_RESTARTED` is set and the search restarts with a new key. The `#if 1` block masks an alternative no-restart path that is currently dead; the comment explains this is intentional pending a wider review.
- `address_marker<T>::set_adress_mark` is the typo spelling used throughout. `lockfree_address_marker.hpp:49`. Both call sites and the header agree on the typo.
- The clear’s back-buffer is pre-marked with `set_adress_mark (NULL)`. Verified at `lockfree_hashmap.hpp:272-275` (init) and :393 (clear’s invariant assert).
- Statistics default off; toggled per-table by `activate_stats()` / `deactivate_stats()`. Verified at `lockfree_hashmap.hpp:632-642`. `m_active_stats` defaults to `false` (initialized in the default constructor at :219).
Open questions
- Default value of `enable_new_lfhash`. Not verified in this doc. Investigation: read `prm_def[]` in `system_parameter.c` and check the release-notes git history for the toggle. If the default is `false`, the production reality is the legacy path; the modern path is staged but not running.
- Are any `lf_hash_table_cpp<K,T>` direct call sites left? The bridge’s existence assumes every consumer uses `cubthread::lockfree_hashmap`. If a consumer still instantiates `lf_hash_table_cpp` directly, that consumer is pinned to the legacy path. Investigation: grep `lf_hash_table_cpp<` and confirm every match is inside `cubthread::lockfree_hashmap::m_old_hash`.
- Why does the modern path keep `of_local_next` in the descriptor? The field is legacy-only; the modern hashmap stores entries inside `free_node` and uses `tran::reclaimable_node::m_retired_next` for the retired ring. The descriptor does not need `of_local_next`. The field is retained, presumably, so a single descriptor object can drive both paths during the migration; once the migration completes, `of_local_next`, `of_del_tran_id`, `f_alloc`, `f_free`, `max_alloc_cnt` are all dead and can be purged.
- `f_duplicate` semantics — is the restart-only behavior the final design? The `#if 1 / #else / #endif` block at `lockfree_hashmap.hpp:972-1005` keeps two alternatives in the tree. If the no-restart path is intended for a future feature (incrementing a counter without reissuing the key), the dead branch should be either revived or removed.
- `clear` while iterators are open. The clear acquires `m_backbuffer_mutex` and atomic-swaps the bucket array; an open iterator holding `m_curr` from the old bucket array continues walking into retired memory until reclamation fires. Reclamation is bounded by the iterator’s `m_tdes` bracket, so the entries are still alive — but the `m_curr` pointer no longer reflects the live table. Investigation: are clear-vs-iterator races documented? If not, the iterator class needs a “table generation” counter check.
Beyond CUBRID — Comparative Designs & Research Frontiers
- Split-ordered hashing (Shalev & Shavit 2003). A single recursive lock-free linked list with bucket pointers as shortcuts; supports lock-free resize. Adopting it would let the hashmap grow under contention rather than being fixed at init. Cost: more complex chain pointers and a different invariant on bucket-array semantics.
- Folly’s `ConcurrentHashMap`. Hazard-pointer-based reclamation, simpler API, no per-entry mutex. A side-by-side micro-benchmark on the lock manager’s resource hash would tell us whether the per-entry mutex is a real cost or a real benefit.
- Cuckoo hashing. Open-addressing alternative with O(1) worst-case lookup. Lock-free variants exist (Li et al. EuroSys 2014). Worth comparing on read-mostly workloads (XASL cache).
- Eliminate the legacy path. The most concrete frontier is retiring `lf_hash_*` and the bridge. Steps: confirm `enable_new_lfhash` defaults to `true` on a recent release, burn in for one release cycle, then delete `lock_free.{h,c}` and the `m_old_hash` field of the bridge.
every consumer instantiates its own
cubthread::lockfree_hashmap. A future refactor could expose a server-wide registry indexed by consumer name, reducing boot-time setup boilerplate and centralizing tuning. - Replace
f_hashcallbacks withstd::hash<K>traits. C++17 has the machinery; the only blocker is thatlf_entry_descriptoris a C-shaped struct shared with legacy code. Once the legacy path is gone, the descriptor becomes a C++ class andf_hashbecomes a template parameter.
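A sketch of that proposed end state: hashing as a template parameter defaulting to `std::hash<Key>`, replacing the `f_hash` callback. This is a hypothetical future shape, not current CUBRID code.

```cpp
#include <cstddef>
#include <functional>

template <class Key, class T, class Hash = std::hash<Key>>
class hashmap
{
  public:
    size_t get_bucket_index (const Key &key) const
    {
      return Hash () (key) % m_size;    // was: m_edesc->f_hash (&key, m_size)
    }

  private:
    size_t m_size = 1024;
};
```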
Sources
- Source files (in the CUBRID source tree at `/data/hgryoo/references/cubrid/`):
  - `src/base/lock_free.h`, `src/base/lock_free.c` (legacy procedural API + the `lf_hash_table_cpp<K,T>` template shim)
  - `src/base/lockfree_hashmap.{hpp,cpp}` (modern templated class)
  - `src/base/lockfree_address_marker.hpp` (pointer-mark helper)
  - `src/base/lockfree_freelist.hpp` (modern entry storage)
  - `src/base/lockfree_transaction_descriptor.hpp` (read / retire bracketing)
  - `src/thread/thread_lockfree_hash_map.{hpp,cpp}` (the bridge `cubthread::lockfree_hashmap<K,T>`)
- Consumer source files (sampled):
  - `src/transaction/lock_manager.c` (object-lock resource & entry tables — primary canary for this analysis)
  - `src/session/session.c`
  - `src/query/xasl_cache.c`
  - `src/query/filter_pred_cache.c`
  - `src/storage/system_catalog.c`
  - `src/storage/slotted_page.c`
  - `src/storage/double_write_buffer.cpp`
- Textbook chapters cited:
- Herlihy & Shavit, The Art of Multiprocessor Programming, ch. 9 §“Lock-Free Linked Lists” (Harris-Michael), ch. 13 §“Concurrent Hashing” (open vs. closed addressing, striping vs. lock-free chains).
- Williams, C++ Concurrency in Action, ch. 7 §“Designing lock-free concurrent data structures.”
- Foundational papers:
- Harris, T. L. “A Pragmatic Implementation of Non-Blocking Linked-Lists.” DISC 2001 — the mark-bit-on-next protocol used by both CUBRID generations.
- Michael, M. M. “High Performance Dynamic Lock-Free Hash Tables and List-Based Sets.” SPAA 2002.
- Shalev, O., Shavit, N. “Split-Ordered Lists: Lock-Free Extensible Hash Tables.” JACM 53(3), 2006.
- Michael, M. M. “Hazard Pointers: Safe Memory Reclamation for Lock-Free Objects.” IEEE TPDS 15(8), 2004 — the alternative reclamation model CUBRID does not use.
- Sibling docs in this repo:
  - `cubrid-lockfree-overview.md` — umbrella, two-generation map.
  - `cubrid-lockfree-transaction.md` — the reclamation framework the modern hashmap rides on.
  - `cubrid-lockfree-freelist.md` — the entry storage layer.
  - `cubrid-lockfree-bitmap.md` — the index allocator beneath `tran::system`.
  - `cubrid-lock-manager.md` — the consumer that exercises every code path of this hashmap.
  - `cubrid-thread-worker-pool.md` — defines `cubthread::entry::get_lf_tran_index()` used by the bridge forwarders.