CUBRID Lock-Free Transactional Reclamation — System, Table, Descriptor, and Address Marker
Contents:
- Theoretical Background
- Common DBMS Design
- CUBRID’s Approach
- Source Walkthrough
- Source verification (as of 2026-05-07)
- Beyond CUBRID — Comparative Designs & Research Frontiers
- Sources
Theoretical Background
The hardest problem in a lock-free data structure is not insert and
not delete — it is safe memory reclamation (SMR). Once a node has
been unlinked from the structure, when can the allocator have it back?
A naïve free() immediately after unlink hands a use-after-free to
any concurrent reader still dereferencing the unlinked node. Waiting
“long enough” before freeing is unbounded and a memory leak. The
correct primitive is some scheme that proves no reachable reader
holds the node anymore. This is the safe-memory-reclamation problem,
formalized first by Maged Michael (Michael 2004, IEEE TPDS) and
since elaborated by McKenney, Fraser, Brown, and others.
The four mainstream techniques fall on a cost / latency / generality plane:
| Technique | Reader cost | Reclamation latency | Per-DS scope? |
|---|---|---|---|
| Hazard pointers (HP) | O(1) atomic store per access | Bounded (worst case = K hazards × N threads scan) | Yes |
| Epoch-based reclamation (EBR / Fraser 2004) | O(1) — register epoch on entry | One epoch (~ tens of microseconds) | Yes |
| RCU (Read-Copy-Update; McKenney 2001+) | Effectively zero — rcu_read_lock is a barrier() | One grace period | System-wide (Linux) |
| Quiescent-state-based (QSBR) | Zero — readers report quiescence implicitly | Until every thread quiesces once | Yes |
EBR (and its userspace cousin liburcu) bracket every read-side
critical section with enter_epoch / exit_epoch; reclamation
walks every retired node and frees those whose retirement epoch is
strictly less than the minimum currently-active epoch. The basic
invariant: if every thread is either idle or active in an epoch
strictly greater than E, then a node N retired at epoch E is
unreachable from any reader and reclaimable. The scheme rides on three properties:
- The epoch counter is monotonic.
- A reader publishes its active epoch before dereferencing any shared node, so a retiring thread’s min-scan can never miss a reader whose snapshot predates the retirement.
- The minimum-active-epoch scan can be done lazily — readers do not block on it.
Two real-world tradeoffs shape the design:
- Granularity. EBR is naturally per-data-structure; you do not want a vacuum thread’s hashmap epoch to gate a query plan cache’s epoch. The scope is the data structure, not the system.
- The min-active-epoch scan is O(N_threads). Doing it on every retirement is expensive; doing it never is unbounded latency. The standard compromise is every K retirements, where K is chosen so the scan cost amortizes against retirement rate. K is the refresh interval.
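The enter/retire/min-scan cycle above can be condensed into a toy, single-file sketch. All names here are invented for illustration (`REFRESH` plays the role of K; CUBRID's interval is 100), and the "threads" are driven manually so the reclamation decision is easy to trace:

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// Toy EBR sketch (hypothetical names, not CUBRID's code).
namespace ebr_sketch
{
  constexpr std::uint64_t IDLE = std::numeric_limits<std::uint64_t>::max ();
  constexpr std::size_t MAX_THREADS = 4;
  constexpr std::uint64_t REFRESH = 4;   // stand-in for K; CUBRID uses 100

  std::atomic<std::uint64_t> g_epoch {0};
  std::atomic<std::uint64_t> g_active[MAX_THREADS] = { {IDLE}, {IDLE}, {IDLE}, {IDLE} };
  std::atomic<std::uint64_t> g_min {0};

  struct retired_node { std::uint64_t epoch; bool freed; };

  void enter (std::size_t tid) { g_active[tid].store (g_epoch.load ()); }  // publish snapshot
  void leave (std::size_t tid) { g_active[tid].store (IDLE); }             // back to idle sentinel

  // bump the global epoch; every REFRESH bumps, rescan the minimum active epoch
  std::uint64_t retire_epoch ()
  {
    std::uint64_t e = ++g_epoch;
    if (e % REFRESH == 0)
      {
        std::uint64_t m = IDLE;
        for (std::size_t i = 0; i < MAX_THREADS; i++)
          {
            m = std::min (m, g_active[i].load ());
          }
        g_min.store (m);
      }
    return e;
  }

  // free every node whose retirement epoch is below the last computed minimum
  std::size_t reclaim (std::vector<retired_node> &list)
  {
    std::size_t n = 0;
    for (retired_node &r : list)
      {
        if (!r.freed && r.epoch < g_min.load ())
          {
            r.freed = true;
            n++;
          }
      }
    return n;
  }
}
```

While the reader is inside its bracket nothing is freed; once it leaves and the next refresh boundary passes, every retiree becomes reclaimable. That is the same ordering argument CUBRID makes with tranids.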
CUBRID’s transactional reclamation is exactly EBR with a database-
flavoured vocabulary. The “epoch” is a “transaction id,” the
“per-DS data structure” is a “transaction table,” the “per-thread
state” is a “transaction descriptor,” and the “retired node” is a
“reclaimable node.” The refresh interval is named
MATI_REFRESH_INTERVAL (Minimum Active Tran Id), set to 100.
The scheme does not solve ABA in general. It guarantees that a
node retired during transaction Tr is not reclaimed until every
descriptor either is idle or has a tranid > Tr. Inside that
window, a reader that started before the retirement still sees the
node at its original address, and any CAS by the reader on a
pointer to that node still operates on a live object. ABA at the
pointer-value level — where a node is retired, the bytes are
later reclaimed and reused for a different node, then a third
party puts the same address back into the structure — is prevented
for any thread holding a transaction descriptor open; readers
that touch lock-free structures without bracketing themselves with
start_tran / end_tran are outside the protection.
Common DBMS Design
Every DBMS that ships lock-free internal structures pays for some form of SMR. The implementations differ but the conceptual template is shared:
Per-DS reclamation table
A reclamation primitive is per data structure, not per process. The lock manager’s resource hash, the page buffer’s BCB hash, the session table — each owns its own table because mixing them gates one’s reclamation behind another’s reader latency.
PostgreSQL’s XactLockTable and LockHashPartitionLock use
heavyweight LWLocks; reclamation is implicit (no lock-free
structures). InnoDB’s lock_sys_t is mutex-protected. Redis uses
a single-threaded model and has no concurrent reclamation
problem. The DBMSes that do run lock-free internals (Citus,
Yugabyte, FoundationDB, CockroachDB) tend to use the language
runtime’s GC (Go, Java) and avoid the SMR problem at the cost of
per-allocation overhead.
CUBRID is the C++ outlier: lock-free internals + manual memory
management + custom EBR. The trade buys throughput at the cost of
the entire lockfree::tran::* machinery existing and being
correct.
Per-thread descriptor
Every reader must publish “I am inside a critical section that uses node X.” The publication must be cheap — it is on every read hot path. The publication must be observable by retiring threads in bounded time. Real implementations use:
- A per-thread “local epoch” integer, atomic-stored on entry, atomic-cleared on exit. Reclaimers loop and read every thread’s entry. EBR / liburcu / CUBRID’s modern stack.
- Hazard-pointer arrays of K slots per thread. Each slot is the target pointer the thread is currently dereferencing. Readers store, then test for unlink, then dereference. CUBRID does not use this.
- Per-CPU sequence counters with periodic call-out to a reclaimer thread. CUBRID does not use this.
Min-scan amortization
Computing the global minimum across N threads is the only non-O(1) operation in the read/write path. Every implementation amortizes:
- liburcu computes `gp_seq` lazily on call-back (`call_rcu`).
- Java’s G1 collector tracks GC roots concurrently and sweeps in the background.
- CUBRID computes `min_active_tranid` every `MATI_REFRESH_INTERVAL` (= 100) global-id increments, in the same thread that calls `get_new_global_tranid()`. The cost is paid by the retiring thread, not by readers.
Retirement-list ownership
Two designs:
- Per-data-structure global retired list. All retiring threads append to a shared list. Pros: simpler bookkeeping. Cons: contention on the shared list, plus the reclaimer must walk every entry on every scan.
- Per-thread retired list. Each thread carries its own ordered list of “things I retired.” Reclaim walks only the prefix whose `tranid < min_active`. CUBRID picks this — every `lockfree::tran::descriptor` has its own `m_retired_head`, `m_retired_tail` ring.
The per-thread list relies on retirement order being the same as
tranid order (which it is, because start_tran_and_increment_id
fetches a fresh global id under CAS), so the prefix-walk is a
strictly increasing scan that stops at the first node whose
tranid >= min_active.
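Because retirement order equals tranid order, the per-thread list never needs sorting; the prefix walk can be sketched in a few lines (illustrative stand-in, not CUBRID's code):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Illustrative prefix-walk over an id-ordered retired list (not CUBRID's code).
struct retired_node
{
  std::uint64_t tranid;
  retired_node *next;
};

// free the strictly-increasing prefix whose tranid < min_active; stop at the
// first survivor — everything after it is newer and necessarily still protected
std::size_t reclaim_prefix (retired_node *&head, std::uint64_t min_active)
{
  std::size_t freed = 0;
  while (head != nullptr && head->tranid < min_active)
    {
      retired_node *victim = head;
      head = head->next;
      delete victim;
      freed++;
    }
  return freed;
}
```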
Theory ↔ CUBRID mapping
| Concept | CUBRID name |
|---|---|
| EBR epoch counter | table::m_global_tranid (monotonic atomic) |
| Per-thread active epoch | descriptor::m_tranid (set by start_tran) |
| Idle sentinel | INVALID_TRANID = numeric_limits<id>::max() |
| Per-thread retired list | descriptor::m_retired_head/tail (linked through m_retired_next) |
| Min-active-epoch | table::m_min_active_tranid (lazily computed atomic) |
| Refresh interval | MATI_REFRESH_INTERVAL = 100 |
| Per-thread state base class | tran::reclaimable_node |
| Custom destructor hook | reclaimable_node::reclaim() (virtual, default delete this) |
| Per-thread index | tran::index (typedef size_t); allocated by tran::system::assign_index |
CUBRID’s Approach
Three classes, one purpose
The modern reclamation stack is exactly three classes plus one base class for retirable nodes:
```cpp
// lockfree::tran::system — src/base/lockfree_transaction_system.hpp
// index dispenser shared by every table in the system.
class system
{
  public:
    system () = delete;
    system (size_t max_tran_count);

    index assign_index ();                        // give a fresh per-thread index
    void free_index (index idx);                  // return one
    size_t get_max_transaction_count () const;

  private:
    size_t m_max_tran_per_table;
    bitmap m_tran_idx_map;                        // ONE_CHUNK style, full-usage threshold
};
```
```cpp
// lockfree::tran::table — src/base/lockfree_transaction_table.hpp
// one per lock-free data structure; owns the global tranid and the descriptor array.
class table
{
  public:
    table (system &sys);

    descriptor &get_descriptor (const index &tran_index);
    void start_tran (const index &);
    void end_tran (const index &);

    id get_current_global_tranid () const;
    id get_new_global_tranid ();                  // atomic-bump and return
    id get_min_active_tranid () const;

  private:
    static const id MATI_REFRESH_INTERVAL = 100;
    void compute_min_active_tranid ();

    system &m_sys;
    descriptor *m_all;                            // size = sys.get_max_transaction_count()
    std::atomic<id> m_global_tranid;
    std::atomic<id> m_min_active_tranid;
};
```
```cpp
// lockfree::tran::descriptor — src/base/lockfree_transaction_descriptor.hpp
// per-thread per-table state.
class descriptor
{
  public:
    void start_tran ();                    // snapshot current global id (no bump)
    void start_tran_and_increment_id ();   // bump global, take the new id
    void end_tran ();

    void retire_node (reclaimable_node &); // append to retired ring
    void reclaim_retired_list ();          // walk prefix whose tranid < min_active
    bool is_tran_started () const;
    // ...

  private:
    table *m_table;
    id m_tranid;                           // INVALID_TRANID when idle
    id m_last_reclaim_minid;
    reclaimable_node *m_retired_head;
    reclaimable_node *m_retired_tail;
    bool m_did_incr;
    reclaimable_node *m_saved_node;        // local stash for find-and-insert ping-pong
    size_t m_retire_count, m_reclaim_count;
};
```

The relationship is one-to-many, one-way:
- One `tran::system` may underpin many `tran::table` instances. On the CUBRID server there is exactly one server-wide `tran::system` (returned by `cubthread::get_thread_entry_lftransys()`), parameterized by the maximum thread count.
- Each `tran::table` owns its `descriptor[]` array. The array is sized at construction by `m_sys.get_max_transaction_count()` and every slot is initialized — this is not a sparse map; it is a pre-allocated dense array indexed by `tran::index`.
- A `tran::index` is fixed for a thread’s lifetime — assigned at thread setup, valid across every `table` in the same `system`.
The protocol — read
Reader-side critical section:
- Locate descriptor. `tran::descriptor &tdes = table.get_descriptor (my_index)`.
- Bracket open. `tdes.start_tran()` snapshots the current global id. The descriptor’s `m_tranid` becomes the snapshot.
- Touch the lock-free structure. Read pointers, CAS values, walk chains. Any pointer you read is guaranteed not to be reclaimed before you call `end_tran()`, because every retiring thread that comes after you sees `min_active_tranid <= m_tranid` until you `end_tran()`.
- Bracket close. `tdes.end_tran()` resets `m_tranid = INVALID_TRANID`.
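In C++, an open/close bracket like this is naturally expressed as a scope guard. A minimal stub, with invented names (`descriptor_stub`, `tran_guard`); CUBRID's call sites bracket manually:

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Stub of the read-side bracket as RAII (illustrative; not CUBRID's classes).
constexpr std::uint64_t INVALID_TRANID = std::numeric_limits<std::uint64_t>::max ();

struct descriptor_stub
{
  std::uint64_t m_tranid = INVALID_TRANID;
  const std::uint64_t *m_global = nullptr;        // points at the table's global id

  void start_tran () { m_tranid = *m_global; }    // snapshot, no bump
  void end_tran ()   { m_tranid = INVALID_TRANID; }
  bool is_tran_started () const { return m_tranid != INVALID_TRANID; }
};

// open the bracket in the constructor, close it on every exit path
struct tran_guard
{
  descriptor_stub &m_tdes;
  explicit tran_guard (descriptor_stub &tdes) : m_tdes (tdes) { m_tdes.start_tran (); }
  ~tran_guard () { m_tdes.end_tran (); }
};
```

The guard guarantees `end_tran()` runs on early return or exception, a property the manual bracketing in CUBRID must enforce by discipline.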
Cost per access: one atomic load (the global id snapshot) plus
one atomic store (the descriptor write). Both are seq-cst by
default; the descriptor write is unsynchronized with other
descriptors — only the descriptor’s owning thread writes to it.
The retiring thread’s compute_min_active_tranid() reads every
descriptor’s m_tranid with a plain get_transaction_id() load;
an aligned word-sized load is atomic on x86 (and x86 loads carry
acquire semantics), so the scan observes each descriptor’s
published value without tearing.
The choice to keep start_tran / end_tran purely on the calling
thread’s descriptor (no atomic on a shared cache line) is what
makes the read path cheap — the reader pays only for its own
cache line.
The protocol — retire
Retire-side:
- Locate descriptor. Same as read.
- Bracket open with bump. `tdes.start_tran_and_increment_id()` atomically increments `m_table->m_global_tranid` and assigns the new id to `m_tranid`. The bump is what makes the retiring epoch strictly greater than any concurrent reader’s snapshot.
- Reclaim eligible nodes first. `reclaim_retired_list()` walks the prefix of the `m_retired_head` ring whose `m_retire_tranid < min_active_tranid` and calls each node’s `reclaim()` (virtual, default `delete this`). This is amortization — every retire pays the cost of reclaiming the nodes that became eligible since last time.
- Append the new retiree. `retire_node()` sets the retiree’s `m_retire_tranid` to `m_tranid` and links it at the tail of the retired ring.
- Bracket close. `end_tran()`.
The retiree’s m_retire_tranid is the current global id, not
the next one. Why is this safe? Because every reader who already
snapshotted a tranid <= my_tranid will keep m_tranid ≤
my_tranid open until they end their tran, so the
min_active_tranid cannot exceed my_tranid until every such
reader is done. Therefore min_active > my_tranid ⇔ no reader
holds a snapshot ≤ my_tranid ⇔ no reader can dereference the
retiree.
The min-active-tranid scan
Computed lazily on every 100th `get_new_global_tranid()`:
```cpp
// lockfree::tran::table::get_new_global_tranid — src/base/lockfree_transaction_table.cpp:73
id
table::get_new_global_tranid ()
{
  id ret = ++m_global_tranid;
  if (ret % MATI_REFRESH_INTERVAL == 0)
    {
      compute_min_active_tranid ();
    }
  return ret;
}
```
```cpp
// lockfree::tran::table::compute_min_active_tranid — src/base/lockfree_transaction_table.cpp:90
void
table::compute_min_active_tranid ()
{
  // note: all transactions are actually claimed from boot. this code is optimized
  //       for this case. if we ever change how transactions are requested, this
  //       must be updated too
  id minvalue = INVALID_TRANID;   // nothing is bigger than INVALID_TRANID
  for (size_t it = 0; it < m_sys.get_max_transaction_count (); it++)
    {
      id tranid = m_all[it].get_transaction_id ();
      if (minvalue > tranid)
        {
          minvalue = tranid;
        }
    }
  m_min_active_tranid.store (minvalue);
}
```

Two observations:
- The scan walks every descriptor, not just the currently-occupied ones. The header comment notes this is because “all transactions are actually claimed from boot” — thread indexes are assigned at startup and held for life. If a future refactor introduces dynamic on-demand index assignment, this scan must be rewritten to skip vacant slots (otherwise `INVALID_TRANID` from a vacant slot — which is `numeric_limits<id>::max()` — never participates in the minimum, but a vacant slot adds a useless cache miss).
- `INVALID_TRANID` is `numeric_limits<id>::max()`, on purpose. An idle descriptor stores the maximum and so cannot lower the minimum. The retire-side test `tranid < min_active` treats an all-idle world as “every retiree is reclaimable” (because `min_active == INVALID_TRANID` and every retiree’s `m_retire_tranid` is some real id `<= m_global_tranid < INVALID_TRANID`).
The retire ring on each descriptor
Each descriptor maintains a singly-linked ordered ring of
retired nodes via m_retired_head / m_retired_tail, with each
node carrying m_retired_next and m_retire_tranid. Insertions
go to m_retired_tail; reclamation walks from m_retired_head
forward.
```cpp
// lockfree::tran::descriptor::retire_node — src/base/lockfree_transaction_descriptor.cpp:65
void
descriptor::retire_node (reclaimable_node &node)
{
  bool should_end = !is_tran_started ();
  start_tran_and_increment_id ();       // bump global id

  reclaim_retired_list ();              // amortized cleanup

  node.m_retire_tranid = m_tranid;
  node.m_retired_next = NULL;
  // add to tail to keep delete ids ordered
  if (m_retired_tail == NULL)
    {
      assert (m_retired_head == NULL);
      m_retired_head = m_retired_tail = &node;
    }
  else
    {
      m_retired_tail->m_retired_next = &node;
      m_retired_tail = &node;
    }
  ++m_retire_count;

  if (should_end)
    {
      end_tran ();                      // restore previous bracketing
    }
}
```

The retire path does not lock. It is single-writer (the
descriptor’s owning thread is the only writer of its retired
ring) and the descriptor’s m_retired_* pointers are not racing
with anyone else. The min-scan reads each descriptor’s
m_tranid (a single word) but never the ring; the ring is
private to the owning thread.
The fast path of reclaim_retired_list() is a quick load of
m_min_active_tranid and an early-return when nothing has
changed:
```cpp
// lockfree::tran::descriptor::reclaim_retired_list — src/base/lockfree_transaction_descriptor.cpp:133
void
descriptor::reclaim_retired_list ()
{
  id min_tran_id = m_table->get_min_active_tranid ();
  if (min_tran_id <= m_last_reclaim_minid)
    {
      // nothing changed
      return;
    }

  while (m_retired_head != NULL && m_retired_head->m_retire_tranid < min_tran_id)
    {
      reclaim_retired_head ();
    }
  if (m_retired_head == NULL)
    {
      m_retired_tail = NULL;
    }

  m_last_reclaim_minid = min_tran_id;
}
```

This early-return is what makes retirement amortized-O(1) — only
the call that crosses a MATI_REFRESH_INTERVAL boundary
pays the scan cost on its descriptor.
Reclaim is a virtual method on the node
The retired node owns its own destructor. The default is
delete this; the freelist overrides it to push the node back
onto the freelist’s available list:
```cpp
// lockfree::tran::reclaimable_node — src/base/lockfree_transaction_reclaimable.hpp:46
class reclaimable_node
{
  public:
    reclaimable_node ()
      : m_retired_next (NULL)
      , m_retire_tranid (0)
    {
    }
    virtual ~reclaimable_node () = default;

    virtual void reclaim ()
    {
      delete this;              // override to recycle instead of delete
    }

  protected:
    reclaimable_node *m_retired_next;

  private:
    friend descriptor;
    id m_retire_tranid;
};
```
```cpp
// lockfree::freelist<T>::free_node::reclaim — src/base/lockfree_freelist.hpp:523
void
free_node::reclaim () final override
{
  m_t.on_reclaim ();            // payload's on_reclaim hook
  m_retired_next = NULL;
  --m_owner->m_retired_count;
  ++m_owner->m_available_count;
  m_owner->push_to_list (*this, *this, m_owner->m_available_list);
}
```

The contract is: a node’s storage is owned by something
(the freelist, the heap, an embedded slot); reclaim() is the
hook that returns it. delete this is the default for code that
allocates with new and has no enclosing free list. Hashmap
entries go through the freelist’s overridden reclaim, so they
never call delete on a node — they recycle it.
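The override pattern can be shown with a toy recycling pool. The names here (`reclaimable_stub`, `pooled_node`) are invented; the real override is `freelist<T>::free_node::reclaim`:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy version of the reclaim() hook (illustrative, not CUBRID's classes).
struct reclaimable_stub
{
  virtual ~reclaimable_stub () = default;
  virtual void reclaim () { delete this; }   // default: heap-owned, destroy
};

// a pooled node returns itself to its owner instead of freeing its storage
struct pooled_node : reclaimable_stub
{
  std::vector<pooled_node *> *m_pool;
  explicit pooled_node (std::vector<pooled_node *> *pool) : m_pool (pool) {}
  void reclaim () override { m_pool->push_back (this); }   // recycle, never delete
};
```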
The save / pull-saved-reclaimable mechanism
descriptor exposes one auxiliary primitive:
```cpp
void save_reclaimable (reclaimable_node *&node);
reclaimable_node *pull_saved_reclaimable ();
```

Use case: a thread is doing a find_or_insert on the hashmap.
It speculatively claims a fresh node from the freelist
(freelist_claim), tries to CAS-link it into the chain, then
finds another thread already inserted the same key. The
speculatively-claimed node is now stale but must not be
reclaimed yet — the thread does not want to retire it
(retirement bumps the global id and incurs amortization cost),
nor does it want to push it back onto the freelist’s available
list (race with the next call site). It instead saves it on
its own descriptor:
```cpp
// freelist_claim — src/base/lockfree_hashmap.hpp:663
T *
hashmap<Key, T>::freelist_claim (tran::descriptor &tdes)
{
  T *claimed = NULL;
  free_node_type *fn = reinterpret_cast<free_node_type *> (tdes.pull_saved_reclaimable ());
  // ... if fn != NULL, reuse the saved node before going to the freelist
  if (fn == NULL)
    {
      fn = m_freelist->claim (tdes);    // initialize via f_init
    }
  // ...
}
```

The next time the same thread calls freelist_claim, the saved
node is pulled out and reused before the freelist’s CAS-pop. This
trick is one of the reasons find_or_insert is cheap on contended
keys — the speculative claim is not wasted.
Caveat: save_reclaimable asserts only one node may be saved at
a time per descriptor (assert (m_saved_node == NULL)). If a
thread saves a node and then forgets to pull it before the next
save_reclaimable, the second save asserts in debug and silently
overwrites in release. Hashmap call sites are written so the
saved-node lifetime cannot escape a single hash operation.
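The single-slot contract is small enough to sketch on its own. `saved_slot` is a stub name; the real methods live on `tran::descriptor`:

```cpp
#include <cassert>
#include <cstddef>

// Single-slot save/pull stash, mirroring the descriptor's contract (stub).
struct saved_slot
{
  void *m_saved_node = nullptr;

  void save (void *node)
  {
    assert (m_saved_node == nullptr);   // at most one saved node per descriptor
    m_saved_node = node;
  }

  void *pull ()
  {
    void *node = m_saved_node;          // may be nullptr: nothing stashed
    m_saved_node = nullptr;
    return node;
  }
};
```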
Per-thread index allocation — cycle of life
A thread is born, calls system::assign_index() to get a unique
index, and uses it until it dies (or returns the index via
free_index). On the CUBRID server every thread gets its index
at thread-pool dispatch time and keeps it for the lifetime of the
worker; the index is stored in cubthread::entry::m_lf_tran_index
and exposed via entry::get_lf_tran_index().
```cpp
// lockfree::tran::system::assign_index — src/base/lockfree_transaction_system.cpp:39
index
system::assign_index ()
{
  int ret = m_tran_idx_map.get_entry ();
  if (ret < 0)
    {
      assert (false);
      return INVALID_INDEX;
    }
  return static_cast<index> (ret);
}
```

The bitmap is ONE_CHUNK style with FULL_USAGE_RATIO (1.0) —
“all slots usable” — so assign_index only fails when every slot
is taken, which under the boot-time-claim model means a
miscount of max_tran_count.
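A 64-slot toy version shows the assign/free/full contract. The real `bitmap` class is chunked and lock-free; `index_dispenser` here is an invented single-word stand-in:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Toy index dispenser over a 64-bit mask (illustrative stand-in for the bitmap).
struct index_dispenser
{
  std::uint64_t m_used = 0;
  std::size_t m_capacity;

  explicit index_dispenser (std::size_t capacity) : m_capacity (capacity) {}

  int assign_index ()                     // first clear bit, or -1 when full
  {
    for (std::size_t i = 0; i < m_capacity; i++)
      {
        if (((m_used >> i) & 1u) == 0)
          {
            m_used |= (std::uint64_t{1} << i);
            return static_cast<int> (i);
          }
      }
    return -1;
  }

  void free_index (std::size_t idx)       // return a slot for reuse
  {
    m_used &= ~(std::uint64_t{1} << idx);
  }
};
```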
Lifecycle diagram
```mermaid
sequenceDiagram
    participant Reader as Reader thread
    participant Retirer as Retirer thread
    participant Tdes_R as Reader's descriptor
    participant Tdes_W as Retirer's descriptor
    participant Tab as tran::table
    Note over Tab: m_global_tranid = G<br/>m_min_active_tranid = G_min
    Reader->>Tdes_R: start_tran()
    Tdes_R->>Tab: load m_global_tranid → G
    Tdes_R->>Tdes_R: m_tranid = G
    Reader->>Reader: dereference node N
    Retirer->>Tdes_W: retire_node(N)
    Tdes_W->>Tab: ++m_global_tranid → G+1
    Tdes_W->>Tdes_W: m_tranid = G+1
    Note over Tdes_W: every 100th increment:<br/>compute_min_active_tranid()<br/>scans all descriptors
    Tdes_W->>Tdes_W: N.m_retire_tranid = G+1
    Tdes_W->>Tdes_W: append N to retired ring
    Tdes_W->>Tdes_W: m_retire_count++
    Note over Reader: ... still using N ...
    Reader->>Tdes_R: end_tran()
    Tdes_R->>Tdes_R: m_tranid = INVALID_TRANID
    Note over Tab: next refresh:<br/>min_active climbs past G+1
    Retirer->>Tdes_W: reclaim_retired_list()
    Tdes_W->>Tdes_W: walk head while m_retire_tranid < min_active
    Tdes_W->>Tdes_W: N.reclaim() (delete or recycle)
```
Address marker — the pointer-mark companion
The reclamation framework solves when to reclaim. The
Harris-Michael delete protocol needs a way to mark a node as
logically deleted but not yet physically unlinked without
allocating extra memory. The standard technique is to put the
mark in the low bit of the next pointer: real pointers are
≥ 2-byte aligned, so the low bit is otherwise always 0. CUBRID
encodes this with lockfree::address_marker<T>:
```cpp
// lockfree::address_marker<T> — src/base/lockfree_address_marker.hpp:31
template <class T>
class address_marker
{
  public:
    bool is_marked () const;
    T *get_address () const;              // strip mark
    T *get_address_no_strip () const;

    static T *set_adress_mark (T *addr);
    static T *strip_address_mark (T *addr);
    static bool is_address_marked (T *addr);
    static T *atomic_strip_address_mark (T *addr);

  private:
    using convert_type = std::uint64_t;
    static const convert_type MARK = 0x1;
    // ...
    std::atomic<T *> m_addr;
};
```

The implementation is dataless: every static helper casts the
pointer to uint64_t, OR/AND-NOTs the mark bit, casts back. The
instance form address_marker<T> wraps a std::atomic<T*> so
callers can store-marked pointers without the cast. The hashmap
uses both: marked next-pointers in entries (via static helpers
set_adress_mark and strip_address_mark) and a marked back-
buffer head (via instance address_marker<T>).
Note the typo in set_adress_mark (single d in adress) — it
matches in every call site, so the typo is now load-bearing.
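The bit arithmetic is easy to re-derive in isolation. These are illustrative free functions (with the source's spelling kept, typo and all); CUBRID's versions are static members of `address_marker<T>` and cast through `uint64_t` rather than `uintptr_t`:

```cpp
#include <cassert>
#include <cstdint>

// Low-bit pointer mark, re-derived (illustrative free functions, MARK = 0x1).
constexpr std::uintptr_t MARK = 0x1;

template <class T>
T *set_adress_mark (T *addr)              // the source's spelling, typo included
{
  return reinterpret_cast<T *> (reinterpret_cast<std::uintptr_t> (addr) | MARK);
}

template <class T>
T *strip_address_mark (T *addr)
{
  return reinterpret_cast<T *> (reinterpret_cast<std::uintptr_t> (addr) & ~MARK);
}

template <class T>
bool is_address_marked (T *addr)
{
  return (reinterpret_cast<std::uintptr_t> (addr) & MARK) != 0;
}
```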
Co-existence with the legacy LF_TRAN_*
The legacy stack (src/base/lock_free.{h,c}) implements the same
EBR shape with different names:
| Modern | Legacy |
|---|---|
| tran::system::assign_index | lf_tran_request_entry |
| tran::system::free_index | lf_tran_return_entry |
| tran::table::m_global_tranid | LF_TRAN_SYSTEM::global_transaction_id |
| tran::table::m_min_active_tranid | LF_TRAN_SYSTEM::min_active_transaction_id |
| MATI_REFRESH_INTERVAL = 100 | LF_TRAN_SYSTEM::mati_refresh_interval = 100 |
| descriptor::start_tran | lf_tran_start (entry, false) |
| descriptor::start_tran_and_increment_id | lf_tran_start (entry, true) |
| descriptor::end_tran | lf_tran_end (entry) |
| descriptor::retire_node | (inlined into lf_freelist_retire) |
| descriptor::m_retired_* | LF_TRAN_ENTRY::retired_list (singly-linked list) |
| compute_min_active_tranid | lf_tran_compute_minimum_transaction_id |
| reclaimable_node | (no base class; void* node walked via LF_ENTRY_DESCRIPTOR::of_local_next) |
Two structural differences:
- Legacy uses 11 globally-named `LF_TRAN_SYSTEM` instances (`spage_saving_Ts`, `obj_lock_res_Ts`, …, `dwb_slots_Ts`). Modern has no globals — every `tran::table` is constructed from a `tran::system &` reference.
- Legacy has explicit `MEMORY_BARRIER()` macros in the `lf_tran_start_with_mb` / `lf_tran_end_with_mb` helpers; modern relies on the seq_cst defaults of `std::atomic`. The intent is the same — publish the descriptor write before any node dereference, observe the global id increment after — but the modern code expresses it through the type system rather than explicit fences.
Both schemes satisfy the same EBR invariant. Code that runs under
the bridge cubthread::lockfree_hashmap works whichever side is
active.
Source Walkthrough
lockfree_transaction_def.hpp
Just two type aliases:
```cpp
namespace lockfree
{
  namespace tran
  {
    using index = size_t;       // per-thread descriptor index
    using id = std::uint64_t;   // monotonic global tranid
  }
}
```

lockfree_transaction_system.{hpp,cpp}
The index dispenser. system::system(size_t) initializes an
internal bitmap of size max_tran_count in ONE_CHUNK
FULL_USAGE_RATIO mode; assign_index() calls
bitmap::get_entry() and free_index() calls
bitmap::free_entry(). There is no per-table state in the
system class itself — it is a thin shell over the bitmap.
lockfree_transaction_table.{hpp,cpp}
The reclamation table. table(system &) constructor allocates
new descriptor[m_sys.get_max_transaction_count()] and walks
each descriptor to set its back-pointer
(m_all[i].set_table(*this)). The destructor delete[] m_all
relies on every descriptor’s destructor having drained its
retired ring (descriptor destructor asserts
!is_tran_started() and walks the ring calling reclaim_retired_head).
get_new_global_tranid() is the hot path on the retire side:
++m_global_tranid then conditional compute_min_active_tranid()
on the modular boundary.
lockfree_transaction_descriptor.{hpp,cpp}
The per-thread state. The methods are deliberately fine-grained:
- `start_tran()` snapshots without bumping. Used by readers.
- `start_tran_and_increment_id()` bumps the global. Used by retirers and by code that needs a fresh epoch (e.g., `hashmap::list_delete` after the unlink CAS, to ensure the retiree’s epoch is strictly greater than every concurrent reader’s snapshot).
- `end_tran()` resets to `INVALID_TRANID`.
- `retire_node()` bumps then appends.
- `reclaim_retired_list()` walks the eligible prefix.
- `save_reclaimable` / `pull_saved_reclaimable` for the speculative-claim ping-pong.
lockfree_transaction_reclaimable.hpp
The base class for retirable nodes. The interesting field is
m_retired_next, which is protected and “may be repurposed by
derived classes.” The freelist’s free_node exploits this — when
a node is on the freelist’s available list, m_retired_next is
the freelist link; when it is on a descriptor’s retired ring,
m_retired_next is the retire-ring link. The two roles never
co-exist (a node is on exactly one list at a time) so the
overload is safe.
m_retire_tranid is private, only the friend descriptor
writes it.
lockfree_address_marker.hpp
Templated, header-only. The static helpers are stateless casts;
the instance form holds a std::atomic<T*> so callers can do
am.is_marked() without re-loading. The
atomic_strip_address_mark(T*) static helper boxes a temporary
address_marker<T> and immediately strips, which is more
defensive than the inline-strip macros in the legacy code.
Position hints (as of 2026-05-07)
| Symbol | File | Line |
|---|---|---|
| tran::system::system | src/base/lockfree_transaction_system.cpp | 29 |
| tran::system::assign_index | src/base/lockfree_transaction_system.cpp | 39 |
| tran::system::free_index | src/base/lockfree_transaction_system.cpp | 51 |
| tran::table::table | src/base/lockfree_transaction_table.cpp | 36 |
| tran::table::get_descriptor | src/base/lockfree_transaction_table.cpp | 54 |
| tran::table::get_new_global_tranid | src/base/lockfree_transaction_table.cpp | 73 |
| tran::table::compute_min_active_tranid | src/base/lockfree_transaction_table.cpp | 90 |
| tran::table::get_min_active_tranid | src/base/lockfree_transaction_table.cpp | 107 |
| tran::descriptor::descriptor | src/base/lockfree_transaction_descriptor.cpp | 32 |
| tran::descriptor::~descriptor | src/base/lockfree_transaction_descriptor.cpp | 45 |
| tran::descriptor::retire_node | src/base/lockfree_transaction_descriptor.cpp | 65 |
| tran::descriptor::start_tran | src/base/lockfree_transaction_descriptor.cpp | 94 |
| tran::descriptor::start_tran_and_increment_id | src/base/lockfree_transaction_descriptor.cpp | 103 |
| tran::descriptor::end_tran | src/base/lockfree_transaction_descriptor.cpp | 119 |
| tran::descriptor::reclaim_retired_list | src/base/lockfree_transaction_descriptor.cpp | 133 |
| tran::descriptor::reclaim_retired_head | src/base/lockfree_transaction_descriptor.cpp | 154 |
| tran::descriptor::save_reclaimable | src/base/lockfree_transaction_descriptor.cpp | 170 |
| tran::descriptor::pull_saved_reclaimable | src/base/lockfree_transaction_descriptor.cpp | 178 |
| tran::reclaimable_node (class) | src/base/lockfree_transaction_reclaimable.hpp | 46 |
| address_marker<T> (class) | src/base/lockfree_address_marker.hpp | 31 |
| lf_tran_start (legacy) | src/base/lock_free.c | 413 |
| lf_tran_end (legacy) | src/base/lock_free.c | 446 |
| lf_tran_compute_minimum_transaction_id (legacy) | src/base/lock_free.c | 379 |
| lf_tran_request_entry (legacy) | src/base/lock_free.c | 281 |
Source verification (as of 2026-05-07)
Verified facts
- `MATI_REFRESH_INTERVAL` is exactly 100, identical between legacy and modern. Legacy: `LF_TRAN_SYSTEM::mati_refresh_interval = 100` set in `lf_tran_system_init` (lock_free.c:236). Modern: `lockfree::tran::table::MATI_REFRESH_INTERVAL = 100` declared in lockfree_transaction_table.hpp:78. Not runtime-tunable.
- `INVALID_TRANID == numeric_limits<id>::max()` declared in lockfree_transaction_descriptor.hpp:49 and used as the idle sentinel. Legacy uses `LF_NULL_TRANSACTION_ID == ULONG_MAX` (lock_free.h:119); both encode the same maximum-value pattern.
- Min-scan walks every descriptor unconditionally. `compute_min_active_tranid` in lockfree_transaction_table.cpp:90 iterates `0 .. m_sys.get_max_transaction_count()` with no early termination. The header comment notes this is intentional given the boot-time-claim assumption.
- `reclaim()` is virtual, default `delete this`. Declared in lockfree_transaction_reclaimable.hpp:57; the freelist’s `free_node::reclaim` overrides with `final` qualifier (lockfree_freelist.hpp:523).
- The descriptor destructor walks the retired ring. `~descriptor()` at lockfree_transaction_descriptor.cpp:45 loops `while (m_retired_head != NULL)` and calls `reclaim_retired_head()`. If a thread dies mid-retirement (which should not happen in CUBRID’s lifetime model), the destructor still drains.
- Saved reclaimable is a single slot. `descriptor::save_reclaimable` at lockfree_transaction_descriptor.cpp:170 asserts `m_saved_node == NULL`. Two saves without an intervening pull trigger the assert in debug builds.
- `address_marker<T>::MARK == 0x1` declared at lockfree_address_marker.hpp:36. Same bit-position as the legacy `ADDR_HAS_MARK / ADDR_WITH_MARK` macros at lock_free.c:72-74.
Open questions
- Should `compute_min_active_tranid` skip vacant descriptor slots? The header comment acknowledges the all-claimed-from-boot assumption. If a future change introduces dynamic index assignment, the scan must change. Investigation path: search for `assign_index` call sites and confirm they all run during boot.
- Is the typo `set_adress_mark` (one d) intentional, or should it be cleaned up? Every call site uses the typo spelling, so renaming is a sweep across the codebase. Low value; flagged for future cleanup if a wider lockfree refactor happens.
- What is the cost of the seq_cst default on weak-memory targets? No measurement campaign exists. The `lf_tran_start_with_mb` macro in legacy code suggests the intent is full-barrier semantics, but the modern code does not measure whether `acquire`/`release` would suffice on the read path.
- Are there any structures that retire nodes outside a `tran::table` and rely on `delete this` directly? The default `reclaim()` is `delete this`; if a code path retires a `reclaimable_node` whose storage is not heap-owned (e.g., embedded in an array slot), the default reclaim is a double free. Investigation path: grep for `tran::reclaimable_node` derived classes and verify each override.
- Does the descriptor’s `m_did_incr` flag still serve a purpose? The flag is used in `start_tran_and_increment_id()` to ensure only one bump per transaction even on multiple calls. With the freelist’s `claim()` calling `start_tran()` (no bump) and the hashmap’s `list_delete` calling `promote_tran_force()` (bump), the flag is the sentinel between “I am observing” and “I am retiring.” Worth a focused review under stress testing.
Beyond CUBRID — Comparative Designs & Research Frontiers
- Userspace RCU (liburcu) integration. liburcu is a production-grade EBR with grace-period detection. Its API (`rcu_read_lock`, `rcu_read_unlock`, `synchronize_rcu`, `call_rcu`) maps almost directly onto CUBRID’s `start_tran` / `end_tran` / `retire_node`. Adopting liburcu would let CUBRID drop several hundred lines of `lockfree::tran` code at the cost of a system dependency.
- DEBRA (Brown, PODC 2015) and IBR (Wen et al., PPoPP 2018). Faster EBR variants that reduce the min-scan cost. Worth comparing against CUBRID’s full-scan-every-100 cost on high-thread-count machines.
- NBR+ (Singh et al., PPoPP 2021). Neutralization-based reclamation avoids per-access epoch publication entirely by using OS signals to detect quiescence. Aggressive but promising for read-heavy workloads (CUBRID’s hashmap finds vastly outnumber retires).
- Hyaline (Nikolaev & Ravindran, PPoPP 2020). Constant-time reclamation via per-batch reference counting. Tighter latency bound than EBR — every retiree is reclaimable in O(1) amortized — at the cost of per-access reference-counting machinery.
- Replacing the per-thread retire ring with a global shard-per-CPU ring. Currently each descriptor’s retired ring is private; if the per-thread index pool grows large, the total retired-node memory is O(threads × pending-reclaims). A shard-per-CPU pool with thread-affinity dispatch would bound this regardless of thread count.
Sources
- Source files (in the CUBRID source tree at `/data/hgryoo/references/cubrid/`):
  - `src/base/lockfree_transaction_def.hpp`
  - `src/base/lockfree_transaction_system.{hpp,cpp}`
  - `src/base/lockfree_transaction_table.{hpp,cpp}`
  - `src/base/lockfree_transaction_descriptor.{hpp,cpp}`
  - `src/base/lockfree_transaction_reclaimable.hpp`
  - `src/base/lockfree_address_marker.hpp`
  - `src/base/lock_free.h`, `src/base/lock_free.c` (legacy counterparts)
- Textbook chapters cited:
  - Herlihy & Shavit, The Art of Multiprocessor Programming, ch. 9 §“Lock-Free Linked Lists” (Harris–Michael delete with pointer-mark bit), ch. 10 §“Concurrent Hashing” (split-ordered chained tables), ch. 7 §“Spin Locks and Contention” (CAS hazards).
  - Williams, C++ Concurrency in Action, ch. 5 §“Memory ordering for atomic operations”, ch. 7 §“Designing lock-free concurrent data structures.”
- Foundational papers:
  - Michael, M. M. “Hazard Pointers: Safe Memory Reclamation for Lock-Free Objects.” IEEE TPDS 15(6), 2004.
  - Fraser, K. “Practical Lock-Freedom.” Univ. of Cambridge Tech Report UCAM-CL-TR-579 (2004) — origin of the epoch-based reclamation framing CUBRID’s scheme matches.
  - McKenney, P. E., Slingwine, J. “Read-Copy Update: Using Execution History to Solve Concurrency Problems.” PDCS 1998.
  - Brown, T. A. “Reclaiming Memory for Lock-Free Data Structures.” PODC 2015.
- Sibling docs in this repo:
  - `cubrid-lockfree-overview.md` — the umbrella; navigate here for the two-generation map.
  - `cubrid-lockfree-hashmap.md` — the largest consumer of the reclamation framework.
  - `cubrid-lockfree-freelist.md` — the second consumer.
  - `cubrid-lockfree-bitmap.md` — used internally by `tran::system` for index allocation.