CUBRID Lock-Free Primitives — Overview, Two Generations, and Reclamation Spine
Contents:
- Theoretical Background
- Common DBMS Design
- CUBRID’s Approach
- Source Walkthrough
- Source verification (as of 2026-05-07)
- Beyond CUBRID — Comparative Designs & Research Frontiers
- Sources
Theoretical Background
A lock-free data structure is one in which at least one thread is guaranteed to make progress in any bounded number of steps. The definition is operational, not aesthetic: blocking a thread on a mutex is forbidden because a swapped-out lock-holder freezes everyone else; busy-spinning a CAS-loop on a contended word is allowed because the thread that just swapped the word forward did make progress, even if its competitors did not. The Art of Multiprocessor Programming (Herlihy & Shavit, ch. 3 §“Progress Conditions”) names the hierarchy: wait-free (every thread finishes in a bounded number of steps), lock-free (at least one thread finishes), and obstruction-free (any thread that runs in isolation finishes). Most production lock-free code is lock-free, not wait-free, because making it wait-free typically costs more in cache traffic than it saves in latency.
The hardware primitive every lock-free design composes from is some
form of compare-and-swap. On x86 it is CMPXCHG (and the 16-byte
variant CMPXCHG16B); on ARM it is the load-link / store-conditional
pair (LDREX/STREX, or LSE atomics on ARMv8.1+); in C++11 these
are abstracted as std::atomic<T>::compare_exchange_weak/strong. CAS
gives a single-word atomic test-and-update; everything else — Treiber
stacks, Michael–Scott queues, Harris linked lists, split-ordered
hashes, RCU readers — is built on top by encoding the structure’s
invariants into the value of the word being CAS’d.
Three problems separate lock-free in theory from lock-free in production code, and every CUBRID lock-free file in this analysis exists to solve some combination of them.
1. The ABA problem
If a CAS reads value A, then a sibling thread changes it to B and
back to A before the CAS retries, the CAS will succeed as if no
change happened — even though the underlying object the pointer-
shaped A referred to was destroyed and rebuilt in between. The
canonical reproduction is a Treiber stack: thread 1 reads top → A,
plans to CAS top from A to A->next; thread 2 pops A, pops
B, frees both, allocates a new A' at the same address, pushes
A' so top → A' with A'->next pointing somewhere unrelated;
thread 1 wakes, CAS succeeds because top is A' which compares
equal to the saved A, and top becomes a dangling pointer to
free memory. The classical references are Treiber’s IBM Almaden
report (1986), and Michael’s “Hazard Pointers” paper (Michael 2004,
IEEE TPDS).
2. Safe memory reclamation
The ABA problem is a special case of a more general one: when can a
node retired from a lock-free structure be returned to the
allocator? Returning it too early gives ABA and use-after-free;
returning it too late is a memory leak. A real lock-free system
needs an epoch (some monotonic clock) and a way to determine
“no thread holds a reference to a retired node older than epoch
E.” There are four mainstream techniques:
| Technique | Mechanic | Cost on read path | Reclamation cost |
|---|---|---|---|
| Hazard pointers (Michael 2004) | Each thread publishes the few pointers it is currently dereferencing | O(1) atomic store per access | Reader scans every thread’s hazard slots before freeing |
| Epoch-Based Reclamation / EBR (Fraser 2004) | Each thread enters/exits an epoch on access; reclaimer waits until all threads cross the boundary | O(1) per critical section | Periodic global epoch advance |
| RCU (read-copy-update; McKenney 2001) | Readers are unblocked; writers wait for a “grace period” that all readers transit | Effectively zero | Grace-period detection cost |
| QSBR (quiescent-state-based) | Each thread reports a “quiescent state” periodically; no node is freed until every thread has reported one since retirement | Zero on hot path | One per-thread counter check at reclaim |
CUBRID picks an EBR-flavoured scheme — its transaction-based
reclamation described in §“CUBRID’s Approach” — and implements it
once, in the lockfree::tran namespace. Every other lock-free
structure (hashmap, freelist) plugs into it.
3. Memory ordering
std::atomic<T> operations take a memory-order argument
(memory_order_relaxed, acquire, release, acq_rel,
seq_cst). Picking the wrong one is a defect that may not surface
on x86 (which is close to sequentially consistent for word-sized
loads and stores) but reorders into bugs on weak-memory architectures
(ARM, POWER, RISC-V). The textbook account is C++ Concurrency in
Action (Williams, ch. 5). The two anchor patterns are:
- Acquire-release synchronization. A `release` store on word `W` pairs with an `acquire` load on the same `W`; the loader then observes every store sequenced-before the releaser's release. This is how a lock-free queue's consumer observes the producer's payload write.
- Read-modify-write (RMW) atomics. A successful `compare_exchange` is both an `acquire` and a `release` (when called with `memory_order_acq_rel` or stronger). It is the join point that publishes a writer's prior stores and acquires the predecessor's.
CUBRID’s lock-free code overwhelmingly uses default seq_cst ordering
on std::atomic<T>, which is conservative but correct on every
target architecture CUBRID supports. Tightening to acquire/release
is a known optimization opportunity rather than a current invariant.
4. Three structural patterns
Every lock-free building block in CUBRID is one of three textbook shapes:
- Lock-free linked list (Harris 2001 / Michael 2002). Insert is a single CAS that splices a new node in. Delete is a two-CAS protocol: first logically delete the node by setting a mark bit on its `next` pointer (so concurrent traversers know to skip it), then physically unlink it with a CAS on the predecessor's `next`. Reads detect the mark and either help unlink or treat the node as gone. CUBRID's hashmap chains are Harris-Michael lists; the mark bit is the low bit of the `next` pointer, encoded by `lockfree::address_marker`.
- Lock-free FIFO queue (Michael & Scott 1996). The two-cursor variant: a producer cursor and a consumer cursor, each on its own atomic. CUBRID's `lockfree::circular_queue` is a fixed-capacity bounded variant — closer in shape to LMAX Disruptor than to MS-queue — using cursor sequence numbers and per-slot block-flags.
- Lock-free split-ordered hash (Shalev & Shavit 2003). The hash table is a single recursive lock-free linked list with bucket pointers acting as shortcuts; resize requires only inserting new bucket pointers, not relinking the data. CUBRID does not use split-ordered hashing — its hash table is a fixed-bucket-count chained design where each bucket points at a Harris-Michael list. Resize is not lock-free; the only resize-shaped operation is `clear()`, which atomic-swaps the bucket array with a back-buffer.
These three patterns name the design space the next two sections fill in.
Common DBMS Design
Every modern DBMS pays for three lock-free responsibilities: inter-thread queues (the producer-consumer rendezvous that hands work between layers), shared associative tables (page buffers, lock tables, session caches keyed by some identifier), and per-thread free lists (so a transaction-private fast path does not bounce on a global allocator). The shapes converge.
Inter-thread queue
The standard role: a daemon thread fills a queue, worker threads drain it (or vice versa). Variants:
- Bounded MPMC ring buffer. Power-of-two capacity, two cursor atomics, sequence numbers per slot. Closest production exemplar: LMAX Disruptor. Used in InnoDB's redo log writer ring, in PostgreSQL's autovacuum worker queue, and in CUBRID's `lockfree::circular_queue` (used by vacuum, page-buffer victim hand-off, and CDC log-info forwarding).
- Unbounded Michael–Scott queue. Two atomic pointers `head` and `tail`, each node allocated on enqueue and freed on dequeue. Cleaner semantics but unbounded memory and mandatory per-op allocation.
- Per-CPU SPMC / MPSC queues. One queue per CPU/thread group; steal across when local is empty (work-stealing). CUBRID does not use this pattern in the lock-free namespace.
Shared associative table
Every DBMS has a lock table (one entry per resource being
locked), a page buffer hash (one entry per resident page,
keyed by (volid, pageid)), a session cache (one entry per
client connection), and a statement cache (XASL, query plan).
Every one of these wants concurrent insert / find / erase under
heavy contention.
The shared engineering pattern:
- Bucket array of fixed size, `hash(key) % size → bucket`.
- Each bucket is the head of a lock-free linked list.
- List nodes carry the entry payload plus a `next` pointer with the low bit reserved as a delete mark.
- Insert = walk to end of list, CAS bucket-or-tail next from `NULL` to new node.
- Delete = CAS-mark the victim's `next`, then CAS-unlink it from the predecessor.
- Reclamation is deferred to an epoch / hazard-pointer / RCU / transactional scheme.
CUBRID’s lockfree::hashmap is exactly this shape, with one
distinctive choice: per-entry pthread mutex, optional. When a
descriptor declares using_mutex = 1, the hashmap takes the entry
mutex during find/insert/delete and uses transactions only to
guard the bucket chain against concurrent retirements; payload
mutation is mutex-protected. When using_mutex = 0, the
transaction itself is the only protection — payload reads must
finish inside a transaction window.
Per-thread free list
Allocating a fresh list node for every insert and freeing on every
delete is a malloc bottleneck even for a great allocator. The
DBMS-typical answer is a per-data-structure free list of typed
nodes — when an entry retires from the bucket chain, it does not
go to free(), it goes to a free list keyed by the structure that
owns it. Subsequent inserts pop from the free list rather than
calling new. This is the role of lockfree::freelist<T>.
Two design wrinkles every implementation hits:
- Contention on a single available list — the most-recent-pop cache line bounces between cores. Standard cure: a back-buffer block held aside, swapped wholesale into the available list when the available list drains, so the slow path of allocating a new block is hidden behind a one-CAS swap. CUBRID's freelist does exactly this.
- ABA on the head pointer — pop reads `head`, computes `head->next`, CAS-swaps `head` to `head->next`; in between, a sibling pops the same head and a third party pushes it back. The saved `head->next` is now stale. CUBRID's freelist documents this hazard explicitly in `pop_from_available()` (lockfree_freelist.hpp) and relies on the transaction system rather than tagged pointers to bound the window.
Transactional reclamation, generalized
The fourth shared piece — and the one CUBRID extracts most cleanly into its own class — is deferred reclamation tied to thread activity. Even if every consumer of a lock-free structure ships its own free list, somebody has to decide when a retired node can be physically reused. The shared template:
- A per-data-structure `tranid` counter; every retirement and every “I am about to read” event takes a snapshot.
- A per-thread descriptor, one per data structure, holding the thread's last-fetched `tranid` and a list of “retired by me, not yet reclaimable” nodes.
- A periodic computation of `min(active_tranid)` across all threads.
- A node retired with `tranid = X` is reclaimable once `min_active_tranid > X`.
Compare this template to the Linux kernel’s RCU
(read-side critical section bracketing a per-CPU counter,
grace-period detection by a global counter chase) and it is the
same shape, restricted to a per-DS scope rather than a system-wide
one. CUBRID generalizes it as the lockfree::tran family and lets
each lock-free structure own its own tran::table.
Theory ↔ CUBRID mapping
| Concept | CUBRID name (legacy) | CUBRID name (modern) |
|---|---|---|
| Per-DS reclamation epoch | LF_TRAN_SYSTEM::global_transaction_id | lockfree::tran::table::m_global_tranid |
| Per-thread reader bracket | lf_tran_start / lf_tran_end | descriptor::start_tran / end_tran |
| Per-thread retired list | LF_TRAN_ENTRY::retired_list | descriptor::m_retired_head/tail |
| Min-active-epoch scan | lf_tran_compute_minimum_transaction_id | table::compute_min_active_tranid |
| Refresh interval | LF_TRAN_SYSTEM::mati_refresh_interval = 100 | MATI_REFRESH_INTERVAL = 100 |
| Per-thread index allocator | LF_TRAN_SYSTEM::lf_bitmap (LF_BITMAP) | tran::system::m_tran_idx_map (bitmap) |
| Pointer-mark bit | ADDR_HAS_MARK / ADDR_WITH_MARK macros | lockfree::address_marker<T> template |
| Bounded MPMC queue | n/a (introduced as C++ first) | lockfree::circular_queue<T> |
| Free-list of typed nodes | LF_FREELIST | lockfree::freelist<T> |
| Hash table | LF_HASH_TABLE / lf_hash_table_cpp<K,T> | lockfree::hashmap<K,T> |
| Hash dispatch wrapper | (was integrated into lf_hash_*) | cubthread::lockfree_hashmap<K,T> |
| Bitmap allocator | LF_BITMAP | lockfree::bitmap (same class, both names) |
CUBRID’s Approach
Two generations, one engine, one toggle
CUBRID's lock-free namespace exists as two complete and co-existing implementations:
- Legacy C — `src/base/lock_free.{h,c}`. The original family, shipped well before 2020. Procedural, pointer-table based, globally-instantiated `LF_TRAN_SYSTEM`s (one per consumer: `obj_lock_res_Ts`, `sessions_Ts`, `xcache_Ts`, etc.). The data structures are `LF_HASH_TABLE`, `LF_FREELIST`, plus the `lf_io_list_*` insert-only list, `lf_stack_*` Treiber stack, and the `lf_list_*` Harris-Michael primitives. The C++ shim `lf_hash_table_cpp<K,T>` (in the same header) wraps the C procedures behind a typed class.
- Modern C++ — `src/base/lockfree_*.{hpp,cpp}`. Re-architected during the C++17 conversion. Templated, namespace-scoped (`lockfree::*`), reclamation broken out as `lockfree::tran::*`, `address_marker<T>` replacing the macro family, `circular_queue<T>` introduced as a new primitive, `bitmap` ported with both legacy typedef (`LF_BITMAP`) and modern name. Each new hashmap owns its own `tran::table`; there are no globally-instantiated reclamation systems.
The two coexist because the migration is opt-in per call site.
The bridge lives in src/thread/thread_lockfree_hash_map.hpp
(cubthread::lockfree_hashmap<K,T>), which holds both an old
lf_hash_table_cpp<K,T> and a new lockfree::hashmap<K,T> and
dispatches on a m_type ∈ {OLD, NEW, UNKNOWN} field. The choice
is driven by the system parameter PRM_ID_ENABLE_NEW_LFHASH at
init() time:
```cpp
// cubthread::lockfree_hashmap::init — src/thread/thread_lockfree_hash_map.hpp
void init (lf_tran_system &transys, int entry_idx, int hash_size,
           int freelist_block_size, int freelist_block_count,
           lf_entry_descriptor &edesc)
{
  if (prm_get_bool_value (PRM_ID_ENABLE_NEW_LFHASH))
    init_as_new (get_thread_entry_lftransys (), ...);
  else
    init_as_old (transys, hash_size, ...);
}
```
Operationally, a CUBRID server today still defaults to the legacy
path unless `enable_new_lfhash` is set. The new path is the
preferred forward direction; the legacy path remains for risk
mitigation and for call sites that have not yet migrated.
The design split is intentional: every consumer of
cubthread::lockfree_hashmap (lock manager’s resource and entry
tables, session table, XASL cache, FP cache, slotted-page saving,
catalog, global-unique-stats, hfid table, DWB slot table) gets the
toggle for free; every other consumer (vacuum’s circular queue,
page buffer’s victim queues, CDC’s log-info queue) is C++-only and
sits on the modern stack unconditionally.
The reclamation spine
The modern stack is built around a single reclamation primitive that every lock-free data structure plugs into. Its three classes are:
```cpp
// lockfree::tran::system — src/base/lockfree_transaction_system.hpp
class system
{
  size_t m_max_tran_per_table;  // how many concurrent threads a table must accommodate
  bitmap m_tran_idx_map;        // index allocator (each thread gets one)

  index assign_index ();
  void free_index (index idx);
};
```
```cpp
// lockfree::tran::table — src/base/lockfree_transaction_table.hpp
class table
{
  system &m_sys;
  descriptor *m_all;            // one descriptor per assigned index
  std::atomic<id> m_global_tranid;
  std::atomic<id> m_min_active_tranid;
  static const id MATI_REFRESH_INTERVAL = 100;
  // ...
};
```
```cpp
// lockfree::tran::descriptor — src/base/lockfree_transaction_descriptor.hpp
class descriptor
{
  table *m_table;
  id m_tranid;                  // INVALID_TRANID when not in a tran
  reclaimable_node *m_retired_head, *m_retired_tail;

  void start_tran ();
  void start_tran_and_increment_id ();
  void end_tran ();
  void retire_node (reclaimable_node &);
  void reclaim_retired_list ();
  // ...
};
```
The protocol: a thread calls `descriptor::start_tran()` before
touching any node in the data structure, finishes its work, then
calls end_tran(). A retirement bumps the global counter and
appends the retired node to the calling thread’s per-descriptor
list. Periodically (every MATI_REFRESH_INTERVAL = 100
retirements) the table recomputes min(active_tranid) across
descriptors. Any retired node whose recorded tranid is strictly
less than the current min_active_tranid is reclaimable.
This mechanism is described exhaustively in
cubrid-lockfree-transaction.md. For the overview it is enough
to know that:
- Every lock-free data structure (`hashmap`, `freelist`; `circular_queue` does not need it because it is bounded with no pointer-shaped retirement) owns its own `tran::table` — there is no global reclamation graph.
- Every thread that may touch a lock-free structure is assigned a `tran::index` from the system's `bitmap`, fixed for that thread's lifetime.
- The descriptor ring is dimensioned by the maximum number of concurrent threads (`max_transaction_count`), not by the rate of retirements; the retirement cost is O(1) on the descriptor's list and O(N_threads) on the periodic min-scan.
Pointer marking — address_marker<T>
The Harris-Michael delete protocol needs a way to set a “this node
is logically deleted” flag without reserving extra space — the flag
must travel inside the pointer itself. CUBRID encodes the flag in
the low bit of the pointer (safe because T* is at minimum
2-byte aligned in the C++ ABI; in CUBRID’s case 8-byte aligned for
pointer-shaped lock-free nodes). The legacy macros
ADDR_HAS_MARK / ADDR_WITH_MARK / ADDR_STRIP_MARK in lock_free.c
do this with raw casts; the modern lockfree::address_marker<T>
template in lockfree_address_marker.hpp provides type-safe
set_adress_mark / strip_address_mark / atomic_strip_address_mark
static helpers and an instance form address_marker<T> that wraps
a std::atomic<T*>.
Both are byte-compatible — they encode the same bit in the same position — so a node moving between code paths during the migration sees a stable representation.
Per-thread index allocation — bitmap
Both generations need a way to assign a stable per-thread index
for descriptor / entry lookup. lockfree::bitmap is the
allocator: a chunked array of std::atomic<unsigned int> words,
each bit representing a slot. get_entry() rotates a start_idx
counter (round-robin in SERVER_MODE, last-allocated in client
mode) to find a chunk with an empty bit, then CAS-flips the bit.
free_entry() clears the bit. Two style modes (ONE_CHUNK for
“must always succeed” usage like the transaction system,
LIST_OF_CHUNKS with a usage threshold for usage-bounded
contexts) are encoded in chunking_style. Full details:
cubrid-lockfree-bitmap.md.
Hash map structure
Every modern CUBRID hash table built from lockfree::hashmap<K,T>
has six parts:
- A bucket array `T **m_buckets[m_size]`. Each entry is the head of a Harris-Michael chain.
- A back-buffer `T **m_backbuffer[m_size]` of marked-NULL pointers, used by the `clear()` operation: the live array is atomic-swapped with the back-buffer, the old array is then walked and every entry retired through the freelist.
- A per-table freelist of typed entries (`freelist<freelist_node_data>` where `freelist_node_data` carries the `T m_entry` plus the `lf_entry_descriptor *m_edesc` needed for `on_reclaim`).
- A per-table `tran::table`, owned by the freelist.
- An `lf_entry_descriptor` giving offsets and callbacks (`f_hash`, `f_key_cmp`, `f_key_copy`, `f_init`, `f_uninit`, `f_duplicate`, `using_mutex`, the `of_*` field offsets) — the sole shared type with the legacy code.
- A per-method statistics ring (`atomic_counter_timer_stat`) for find/insert/erase/unlock/clear/iter/claim/retire, activated by `activate_stats()`.
The key-pointer interaction with type T is by byte offset,
not by member access. The descriptor declares of_next, of_key,
of_mutex, of_del_tran_id (legacy only), and the hashmap reads
T*-shaped fields by (char*)p + offset casting. This is what
makes lf_entry_descriptor the bridge: a single descriptor can
drive both the legacy C path and the modern C++ path because both
walk the same offsets.
Full details: cubrid-lockfree-hashmap.md.
Bounded MPMC queue — circular_queue<T>
The third structural piece is a fixed-capacity multi-producer
multi-consumer queue. The implementation is a power-of-two slot
array with two cursor atomics (m_produce_cursor,
m_consume_cursor) and a per-slot block-flag word
(m_blocked_cursors[N]). Producers CAS-acquire a slot,
copy the payload in, then unblock the slot. Consumers detect that
the slot is unblocked, copy the payload out, then CAS-bump the
consume cursor. The implementation lives entirely in the header
(lockfree_circular_queue.hpp) — no .cpp file beyond a stub.
Full details: cubrid-lockfree-circular-queue.md.
The queue is unconditionally on the modern stack. It is used by:
- Page buffer (`src/storage/page_buffer.c`) — `waiter_threads_high_priority`, `waiter_threads_low_priority` for fixers waiting on a victim, `flushed_bcbs` for post-flush processing, `private_lrus_with_victims`, `big_private_lrus_with_victims`, `shared_lrus_with_victims` for victim hand-off across LRU partitions.
- Vacuum (`src/query/vacuum.c`) — `vacuum_Block_data_buffer` for log-block dispatch from the log-flush daemon to vacuum workers, `vacuum_Finished_job_queue` for completion-back notification.
- CDC (`src/transaction/log_manager.c`) — `cdc_Gl.loginfo_queue` for CDC log-info forwarding.
Per-data-structure free list — freelist<T>
The fourth piece is a typed free list. lockfree::freelist<T>
takes a tran::system & and constructs its own tran::table. It
maintains a single available-list head and a one-block back-buffer
that swaps in lazily when the available list drains. Items are
typed freelist<T>::free_node, which derives from
tran::reclaimable_node so retirement plugs into the reclamation
spine directly. The contract on T is that it expose an
on_reclaim() method which the freelist invokes when the node
crosses from “retired” to “reclaimed” — this is where typed
cleanup runs (e.g., calling the lf_entry_descriptor::f_uninit
callback from freelist_node_data::on_reclaim).
Full details: cubrid-lockfree-freelist.md.
Map of consumers
A non-exhaustive list of in-engine consumers, by which generation they are pinned to:
| Consumer | File | Generation |
|---|---|---|
| Object-lock resource table | src/transaction/lock_manager.c | toggle (was obj_lock_res_Ts) |
| Object-lock entry table | src/transaction/lock_manager.c | toggle (was obj_lock_ent_Ts) |
| Session table | src/session/session.c | toggle (was sessions_Ts) |
| XASL cache | src/query/xasl_cache.c | toggle (was xcache_Ts) |
| Filter-predicate cache | src/query/filter_pred_cache.c | toggle (was fpcache_Ts) |
| Slotted-page saving | src/storage/slotted_page.c | toggle (was spage_saving_Ts) |
| System catalog | src/storage/system_catalog.c | toggle (was catalog_Ts) |
| Global unique stats | src/storage/btree.c (via log_manager) | toggle (was global_unique_stats_Ts) |
| HFID table | src/storage/heap_file.c | toggle (was hfid_table_Ts) |
| DWB slot table | src/storage/double_write_buffer.cpp | toggle (was dwb_slots_Ts) |
| Free-sort-list | (legacy only) | legacy (free_sort_list_Ts) |
| Vacuum block-data buffer | src/query/vacuum.c | new only (circular_queue) |
| Vacuum finished-job queue | src/query/vacuum.c | new only (circular_queue) |
| Page-buffer victim/waiter queues | src/storage/page_buffer.c | new only (circular_queue) |
| CDC log-info queue | src/transaction/log_manager.c | new only (circular_queue) |
| Per-thread lf_tran_entry map | src/thread/thread_manager.cpp | both (legacy via lf_tran_request_entry, new via lockfree::tran::system) |
Component overview
```mermaid
flowchart LR
    subgraph LegacyC["Legacy C (src/base/lock_free.{h,c})"]
        LFTS["LF_TRAN_SYSTEM<br/>per consumer (10 globals)"]
        LFFL["LF_FREELIST"]
        LFHT["LF_HASH_TABLE"]
        LFTS --> LFFL --> LFHT
        LFLI["lf_list_*<br/>Harris-Michael"]
        LFHT --> LFLI
        LFCPP["lf_hash_table_cpp<K,T>"]
        LFCPP --> LFHT
        LFCPP --> LFFL
    end
    subgraph ModernCpp["Modern C++ (src/base/lockfree_*)"]
        TS["lockfree::tran::system<br/>(holds bitmap)"]
        TT["lockfree::tran::table"]
        TD["lockfree::tran::descriptor[]"]
        TS --> TT
        TT --> TD
        BM["lockfree::bitmap"]
        AM["lockfree::address_marker<T>"]
        FL["lockfree::freelist<T>"]
        HM["lockfree::hashmap<K,T>"]
        CQ["lockfree::circular_queue<T>"]
        TS --> BM
        FL --> TT
        HM --> FL
        HM --> AM
    end
    Bridge["cubthread::lockfree_hashmap<K,T><br/>m_type = OLD or NEW"]
    Bridge --> LFCPP
    Bridge --> HM
    PRM["PRM_ID_ENABLE_NEW_LFHASH"]
    PRM --> Bridge
```
Memory ordering posture
Every `std::atomic` access in the modern stack uses the default
`memory_order_seq_cst` — including the explicit
`compare_exchange_weak/strong` calls, whose default success and
failure orderings are both seq_cst. The legacy stack uses CUBRID's
`ATOMIC_INC_*`, `ATOMIC_CAS_ADDR`, `ATOMIC_TAS_*` macros (defined
in porting.h), which expand to GCC `__atomic_*` builtins with
`__ATOMIC_SEQ_CST`, plus an explicit `MEMORY_BARRIER()` macro
(`__atomic_thread_fence (__ATOMIC_SEQ_CST)`) on the `lf_tran_*`
hot paths. There are two consequences:
- Correctness on weak-memory architectures is conservative but correct. A POWER or ARM build will pay full sync overhead on every CAS, but no defect is masked.
- Tightening to `acquire`/`release` is an open optimization. No measurement campaign has been run; the cost is obscured by the much heavier transaction-table min-scan on hot paths (`lf_tran_compute_minimum_transaction_id`).
Posture against ABA
The transactional reclamation scheme does not solve ABA for
reads outside a transaction window. A thread that reads a pointer
without first calling descriptor::start_tran() may observe a
recycled node as if it were the original, with no detection. The
discipline therefore is:
- Every read on a hashmap chain must be inside a transaction. `hashmap::list_find`, `list_insert_internal`, `list_delete` all call `tdes.start_tran()` before any pointer dereference and `tdes.end_tran()` (or a mutex-acquire that supersedes it) before return.
- The freelist's `pop_from_available()` is the documented exception. The function carries a comment explicitly naming the ABA window:

  ```cpp
  // todo: this is a dangerous preemption point; if I am preempted
  // here, and thread 2 comes and does:
  // - second thread gets same rhead and successfully moves m_available_list to next
  // - third thread gets next and successfully moves m_available_list to next->next
  // - second thread retires rhead. m_available_list becomes rhead and its next
  //   becomes next->next
  // - I wake up, compare exchange m_available_list successfully because it is
  //   rhead again, but next will become the item third thread already claimed.
  ```

  The hazard is bounded by the back-buffer pattern (a retired node spends a long time in the back-buffer before circling back) but is not eliminated. This is a known weakness; the current mitigation is the back-buffer time, not a tagged-pointer redesign.
How readers should navigate the rest of these docs
Section titled “How readers should navigate the rest of these docs”| Want to know | Read |
|---|---|
| The reclamation protocol from scratch | cubrid-lockfree-transaction.md |
| Why two hash tables exist and how they share lf_entry_descriptor | cubrid-lockfree-hashmap.md §“Two generations” |
| The Harris-Michael insert/delete walk | cubrid-lockfree-hashmap.md §“List operations” |
| The MPMC ring buffer protocol | cubrid-lockfree-circular-queue.md |
| The bitmap allocator used for transaction indexes | cubrid-lockfree-bitmap.md |
| The freelist back-buffer trick | cubrid-lockfree-freelist.md |
Source Walkthrough
Section titled “Source Walkthrough”Modern C++ stack
- `lockfree::tran::system` (in `lockfree_transaction_system.{hpp,cpp}`) — index dispenser; owns `m_tran_idx_map` (a `bitmap`).
- `lockfree::tran::table` (in `lockfree_transaction_table.{hpp,cpp}`) — per-DS reclamation table; owns `m_global_tranid` and `m_min_active_tranid` plus an array of `descriptor`. Calls `compute_min_active_tranid()` every `MATI_REFRESH_INTERVAL` (=100) global-id increments.
- `lockfree::tran::descriptor` (in `lockfree_transaction_descriptor.{hpp,cpp}`) — per-thread per-DS state; one per `(table, tran::index)` pair.
- `lockfree::tran::reclaimable_node` (in `lockfree_transaction_reclaimable.hpp`) — the base class every retirable node derives from. Provides `m_retired_next`, `m_retire_tranid`, and a virtual `reclaim()` (default `delete this`).
- `lockfree::address_marker<T>` (in `lockfree_address_marker.hpp`) — typed wrapper for the low-bit pointer-mark trick.
- `lockfree::bitmap` (in `lockfree_bitmap.{hpp,cpp}`) — chunked atomic bitfield allocator. Two `chunking_style` modes.
- `lockfree::freelist<T>` (in `lockfree_freelist.hpp`, header-only template) — typed free list with back-buffer; constructs its own `tran::table` from a `tran::system &`.
- `lockfree::hashmap<K,T>` (in `lockfree_hashmap.hpp`, header-only template; `lockfree_hashmap.cpp` is a stub) — chained hashmap with Harris-Michael chains, optional per-entry mutex, back-buffer-driven `clear()`.
- `lockfree::circular_queue<T>` (in `lockfree_circular_queue.hpp`, header-only) — bounded MPMC ring buffer with cursor sequence numbers and per-slot block flags.
Legacy C stack
- `LF_TRAN_SYSTEM`, `LF_TRAN_ENTRY` — `lock_free.{h,c}`. 11 global systems instantiated via `lf_initialize_transaction_systems()` at boot.
- `LF_FREELIST` — `lock_free.{h,c}`. Holds a single `available` stack; entries are linked through `of_local_next`.
- `LF_HASH_TABLE`, `lf_hash_*` — `lock_free.{h,c}`. The legacy hashmap. Entry layout is described by `LF_ENTRY_DESCRIPTOR`.
- `lf_list_find` / `lf_list_insert_internal` / `lf_list_delete` — Harris-Michael chain walks called from `lf_hash_*`.
- `lf_io_list_find` / `lf_io_list_find_or_insert` — insert-only list (no transaction system, no freelist).
- `lf_stack_push` / `lf_stack_pop` — Treiber stack. Not currently used by any in-engine consumer; kept for historical reasons.
- `lf_hash_table_cpp<K,T>` — C++ shim over `lf_hash_*`; gives the legacy stack a typed surface that matches the modern hashmap's shape.
Bridge
- `cubthread::lockfree_hashmap<K,T>` (in `src/thread/thread_lockfree_hash_map.{hpp,cpp}`) — holds both `lf_hash_table_cpp<K,T>` and `lockfree::hashmap<K,T>`, dispatches via `m_type ∈ {OLD, NEW, UNKNOWN}` decided at `init()` time by `prm_get_bool_value (PRM_ID_ENABLE_NEW_LFHASH)`.
- `cubthread::get_thread_entry_lftransys()` — accessor for the per-thread-manager `lockfree::tran::system` shared by every new-style hashmap on a server.
Position hints (as of 2026-05-07)
| Symbol | File | Line |
|---|---|---|
| `lockfree::tran::system::system` | src/base/lockfree_transaction_system.cpp | 29 |
| `lockfree::tran::table::table` | src/base/lockfree_transaction_table.cpp | 36 |
| `lockfree::tran::table::compute_min_active_tranid` | src/base/lockfree_transaction_table.cpp | 90 |
| `lockfree::tran::table::get_new_global_tranid` | src/base/lockfree_transaction_table.cpp | 73 |
| `lockfree::tran::descriptor::retire_node` | src/base/lockfree_transaction_descriptor.cpp | 65 |
| `lockfree::tran::descriptor::reclaim_retired_list` | src/base/lockfree_transaction_descriptor.cpp | 133 |
| `lockfree::address_marker<T>` | src/base/lockfree_address_marker.hpp | 31 |
| `lockfree::bitmap::get_entry` (`lf_bitmap_get_entry`) | src/base/lockfree_bitmap.cpp | 138 |
| `lockfree::freelist<T>::claim` | src/base/lockfree_freelist.hpp | 291 |
| `lockfree::freelist<T>::pop_from_available` | src/base/lockfree_freelist.hpp | 322 |
| `lockfree::hashmap<K,T>::find` | src/base/lockfree_hashmap.hpp | 280 |
| `lockfree::hashmap<K,T>::list_insert_internal` | src/base/lockfree_hashmap.hpp | 899 |
| `lockfree::hashmap<K,T>::list_delete` | src/base/lockfree_hashmap.hpp | 1111 |
| `lockfree::hashmap<K,T>::clear` | src/base/lockfree_hashmap.hpp | 378 |
| `lockfree::circular_queue<T>::produce` | src/base/lockfree_circular_queue.hpp | 230 |
| `lockfree::circular_queue<T>::consume` | src/base/lockfree_circular_queue.hpp | 318 |
| `lf_tran_start` (legacy) | src/base/lock_free.c | 413 |
| `lf_tran_compute_minimum_transaction_id` (legacy) | src/base/lock_free.c | 379 |
| `cubthread::lockfree_hashmap::init` (bridge) | src/thread/thread_lockfree_hash_map.hpp | 129 |
Source verification (as of 2026-05-07)
Verified facts

- Two co-existing implementations live in the tree. `src/base/lock_free.{h,c}` (legacy) and the `src/base/lockfree_*.{hpp,cpp}` family (modern) are both present and compiled. The bridge in `src/thread/thread_lockfree_hash_map.hpp` dispatches at `init()` time via `prm_get_bool_value (PRM_ID_ENABLE_NEW_LFHASH)` (verified by reading the source on 2026-05-07).
- The minimum-active-tranid refresh interval is 100, hard-coded. Both `LF_TRAN_SYSTEM::mati_refresh_interval` (initialized to 100 in `lf_tran_system_init`) and the modern `lockfree::tran::table::MATI_REFRESH_INTERVAL` are 100. Not a runtime parameter; tuning requires a source change.
- The legacy stack instantiates 11 named global `LF_TRAN_SYSTEM`s at boot: `spage_saving_Ts`, `obj_lock_res_Ts`, `obj_lock_ent_Ts`, `catalog_Ts`, `sessions_Ts`, `free_sort_list_Ts`, `global_unique_stats_Ts`, `hfid_table_Ts`, `xcache_Ts`, `fpcache_Ts`, `dwb_slots_Ts`. Each is one reclamation domain, populated in `lf_initialize_transaction_systems`. The modern stack has no global `tran::system` — each consumer builds its own.
- The address-mark bit is the low bit of the pointer. Identical between the legacy `ADDR_*` macros and the modern `lockfree::address_marker<T>::MARK = 0x1`. Verified by direct comparison.
- The `pop_from_available` ABA window is documented in source. See the multi-line comment in `freelist<T>::pop_from_available` (`src/base/lockfree_freelist.hpp:336`); the back-buffer time is the current mitigation, not a tagged pointer.
- Statistics on the modern hashmap are opt-in. `m_active_stats` defaults to `false`; `activate_stats()` / `deactivate_stats()` toggle it. The legacy path has its own static `lf_inserts`, `lf_deletes`, etc. counters in `lock_free.c`, unconditional but only incremented under `UNITTEST_LF`.
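The low-bit mark trick shared by the legacy `ADDR_*` macros and `lockfree::address_marker<T>` exploits pointer alignment: heap nodes are at least 2-byte aligned, so bit 0 of every node pointer is always zero and can flag "logically deleted". A minimal illustrative template (not the CUBRID class, though it uses the same `MARK = 0x1` constant the verified fact names):

```cpp
#include <cstdint>

template <typename T>
struct marked_ptr
{
  static constexpr std::uintptr_t MARK = 0x1;   // the spare alignment bit

  static T *set_mark (T *p)
  {
    return reinterpret_cast<T *> (reinterpret_cast<std::uintptr_t> (p) | MARK);
  }
  static T *strip_mark (T *p)
  {
    return reinterpret_cast<T *> (reinterpret_cast<std::uintptr_t> (p) & ~MARK);
  }
  static bool is_marked (const T *p)
  {
    return (reinterpret_cast<std::uintptr_t> (p) & MARK) != 0;
  }
};
```

In the real structures the marked value lives inside an atomic next-pointer, so a single CAS can simultaneously test the delete flag and swing the link, which is what makes the Harris-Michael chain walks possible.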
Open questions
- What fraction of in-engine consumers run under the modern path on a default-configured production server? The default for `enable_new_lfhash` is unverified in this doc; check `prm_def[]` in `system_parameter.c` and the release-notes history. If the default is `false`, the legacy path is still the production reality and the modern path is more "loaded gun" than "active engine."
- Are any `seq_cst` accesses on the modern hot path measurably costly on ARM/POWER? No campaign run. Worth a perf test on `lockfree::hashmap` find/insert/erase under contention, with a profile-guided tightening to `acquire`/`release` where safe.
- Why is `free_sort_list_Ts` legacy-only? Migration not completed intentionally, or a candidate that simply nobody has touched yet? Check git history for the most recent commit on `query_executor.c`'s use of `free_sort_list_Ts`.
- The freelist's `pop_from_available` ABA window — has it ever been hit in production? The hazard is documented but not bounded; a stress test that intentionally retires and re-acquires the same head with deliberate preemption (e.g., `SIGSTOP` on a worker) would tell us whether the back-buffer distance is empirically sufficient.
Beyond CUBRID — Comparative Designs & Research Frontiers
- Hazard pointers vs. transactional reclamation. Michael's hazard-pointer design (Michael 2004, IEEE TPDS) reclaims via per-thread published pointers, not per-transaction brackets; the cost is a per-access `release` store rather than a tran bracket. A side-by-side cost comparison on CUBRID's hashmap workload would tell us whether the `MATI_REFRESH_INTERVAL = 100` amortization is empirically cheaper than per-access HP publication.
- Userspace RCU (liburcu, Mathieu Desnoyers). Lock-free reads with bracketed `rcu_read_lock`/`rcu_read_unlock`, and writers defer reclamation to a grace period. CUBRID's transactional reclamation is structurally a per-DS userspace RCU; the question is whether using liburcu (already a system library on most Linux distros) would simplify the engine's `lockfree::tran` family without losing per-DS scoping.
- Folly's `ConcurrentHashMap` (Facebook open source). A production-grade C++ lock-free hash table with hazard-pointer reclamation and split-ordered semantics. Comparing it to `lockfree::hashmap` would tell us whether the `clear()`-via-back-buffer trick we use is a real cost or a cheap luxury.
- LMAX Disruptor (Thompson et al., 2011). The canonical bounded MPMC ring; CUBRID's `lockfree::circular_queue` is the same shape minus the batched-claim and waiting-strategy generalizations. Adopting batched claims would let DWB and page-buffer victim queues drain in chunks rather than one CAS per slot.
- MERGE: a single `lockfree::tran::system` server-wide. Today every modern data structure builds its own `tran::table` from a server-wide `tran::system` (returned by `cubthread::get_thread_entry_lftransys()`). If a future refactor merged the legacy 11-system zoo into the same server-wide system, the legacy/modern distinction in consumer code would collapse and the `cubthread::lockfree_hashmap` bridge could retire.
- Flat combining (Hendler et al., SPAA 2010). Replaces a CAS-loop hot spot with a per-thread publication record that a designated combiner thread sweeps. The technique pays off on rare-update / many-reader structures; the lock manager's wait-for graph is a possible target.
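The "same shape" claim about `lockfree::circular_queue` and the Disruptor can be made concrete. The sketch below follows Vyukov's well-known bounded MPMC queue recipe, which matches the cursor-plus-per-slot-sequence description above; `mpmc_ring` is a hypothetical name and this is an illustration, not CUBRID's code.

```cpp
#include <atomic>
#include <cstddef>
#include <vector>

template <typename T>
class mpmc_ring
{
  struct slot
  {
    std::atomic<std::size_t> seq;    // hand-off sequence number
    T item;
  };

  std::vector<slot> m_slots;
  const std::size_t m_mask;          // capacity must be a power of two
  std::atomic<std::size_t> m_produce{0};
  std::atomic<std::size_t> m_consume{0};

public:
  explicit mpmc_ring (std::size_t capacity_pow2)
    : m_slots (capacity_pow2), m_mask (capacity_pow2 - 1)
  {
    for (std::size_t i = 0; i < capacity_pow2; i++)
      m_slots[i].seq.store (i);      // slot i is free for cursor i
  }

  bool produce (const T &v)
  {
    std::size_t pos = m_produce.load ();
    for (;;)
      {
        slot &s = m_slots[pos & m_mask];
        std::size_t seq = s.seq.load (std::memory_order_acquire);
        if (seq == pos)              // slot is free for this cursor value
          {
            if (m_produce.compare_exchange_weak (pos, pos + 1))
              {
                s.item = v;
                s.seq.store (pos + 1, std::memory_order_release);
                return true;
              }                      // CAS failure refreshed pos; retry
          }
        else if (seq < pos)          // previous lap not consumed yet: full
          return false;
        else
          pos = m_produce.load ();   // another producer moved the cursor
      }
  }

  bool consume (T &out)
  {
    std::size_t pos = m_consume.load ();
    for (;;)
      {
        slot &s = m_slots[pos & m_mask];
        std::size_t seq = s.seq.load (std::memory_order_acquire);
        if (seq == pos + 1)          // slot holds data for this cursor value
          {
            if (m_consume.compare_exchange_weak (pos, pos + 1))
              {
                out = s.item;
                // free the slot for the producer one lap ahead
                s.seq.store (pos + m_mask + 1, std::memory_order_release);
                return true;
              }
          }
        else if (seq < pos + 1)      // nothing published yet: empty
          return false;
        else
          pos = m_consume.load ();   // another consumer moved the cursor
      }
  }
};
```

Batched claims in the Disruptor style would amount to a `fetch_add (n)` on the cursor followed by publishing the n slots in sequence, which is the generalization the bullet above suggests for DWB and victim-queue draining.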
Sources
- Source files (in the CUBRID source tree at `/data/hgryoo/references/cubrid/`):
  - `src/base/lock_free.h`, `src/base/lock_free.c`
  - `src/base/lockfree_transaction_def.hpp`
  - `src/base/lockfree_transaction_system.{hpp,cpp}`
  - `src/base/lockfree_transaction_table.{hpp,cpp}`
  - `src/base/lockfree_transaction_descriptor.{hpp,cpp}`
  - `src/base/lockfree_transaction_reclaimable.hpp`
  - `src/base/lockfree_address_marker.hpp`
  - `src/base/lockfree_bitmap.{hpp,cpp}`
  - `src/base/lockfree_freelist.hpp`
  - `src/base/lockfree_hashmap.{hpp,cpp}`
  - `src/base/lockfree_circular_queue.hpp`
  - `src/thread/thread_lockfree_hash_map.{hpp,cpp}`
- Textbook chapters cited:
- Herlihy & Shavit, The Art of Multiprocessor Programming, ch. 3 §“Progress Conditions”, ch. 9–10 (lock-free linked lists), ch. 13 (concurrent queues, incl. §“Bounded Lock-Free Queues”).
- Williams, C++ Concurrency in Action, ch. 5 (memory model), ch. 7 (designing lock-free data structures).
- Foundational papers:
- Michael, M. M. “Hazard Pointers: Safe Memory Reclamation for Lock-Free Objects.” IEEE TPDS 15(8), 2004.
- Harris, T. L. “A Pragmatic Implementation of Non-Blocking Linked-Lists.” DISC 2001.
- Michael, M. M., Scott, M. L. “Simple, Fast, and Practical Non-Blocking and Blocking Concurrent Queue Algorithms.” PODC 1996.
- Shalev, O., Shavit, N. “Split-Ordered Lists: Lock-Free Extensible Hash Tables.” JACM 53(3), 2006.
- Treiber, R. K. “Systems Programming: Coping with Parallelism.” IBM Almaden RJ 5118, 1986.
- Fraser, K. “Practical Lock-Freedom.” Univ. of Cambridge PhD thesis (epoch-based reclamation), 2004.
- Sibling docs in this repo:
  - `cubrid-lockfree-transaction.md` — the reclamation framework in depth.
  - `cubrid-lockfree-hashmap.md` — both hash table generations, with the migration narrative.
  - `cubrid-lockfree-circular-queue.md` — the bounded MPMC ring.
  - `cubrid-lockfree-bitmap.md` — the chunked atomic bitmap allocator.
  - `cubrid-lockfree-freelist.md` — the typed free list with back-buffer.
- Adjacent code-analysis docs:
  - `cubrid-thread-worker-pool.md` — names how `cubthread::entry::tran_entries[]` and `get_lf_tran_index()` plug into the lock-free framework.
  - `cubrid-mvcc.md`, `cubrid-lock-manager.md` — primary consumers of `lf_hash_*` (legacy) / `cubthread::lockfree_hashmap` (bridged).