
CUBRID Lock-Free Transactional Reclamation — System, Table, Descriptor, and Address Marker


The hardest problem in a lock-free data structure is not insert and not delete — it is safe memory reclamation (SMR). Once a node has been unlinked from the structure, when can the allocator have it back? A naïve free() immediately after unlink hands a use-after-free to any concurrent reader still dereferencing the unlinked node. Waiting “long enough” before freeing is unbounded and a memory leak. The correct primitive is some scheme that proves no reachable reader holds the node anymore. This is the safe-memory-reclamation problem, formalized first by Maged Michael (Michael 2004, IEEE TPDS) and since elaborated by McKenney, Fraser, Brown, and others.

The four mainstream techniques fall on a cost / latency / generality plane:

| Technique | Reader cost | Reclamation latency | Per-DS scope? |
| --- | --- | --- | --- |
| Hazard pointers (HP) | O(1) atomic store per access | Bounded (worst case = K hazards × N threads scan) | Yes |
| Epoch-based reclamation (EBR / Fraser 2004) | O(1) — register epoch on entry | One epoch (~ tens of microseconds) | Yes |
| RCU (Read-Copy-Update; McKenney 2001+) | Effectively zero — rcu_read_lock is a barrier() | One grace period | System-wide (Linux) |
| Quiescent-state-based (QSBR) | Zero — readers report quiescence implicitly | Until every thread quiesces once | Yes |

EBR (and its userspace cousin liburcu) bracket every read-side critical section with enter_epoch / exit_epoch; reclamation walks every retired node and frees those whose retirement epoch is strictly less than the minimum currently-entered epoch. The basic invariant: if no thread has entered an epoch ≤ E since node N was retired with epoch E, then N is unreachable from any reader and reclaimable. The scheme rides on three properties:

  1. The epoch counter is monotonic.
  2. A reader publishes its active epoch before dereferencing any node, so a writer that retires a node afterwards cannot compute a minimum active epoch that skips past that reader’s snapshot — a reader never holds a stale snapshot the retirement has already bumped beyond.
  3. The minimum-active-epoch scan can be done lazily — readers do not block on it.
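These three properties fit in a few dozen lines. Below is a minimal, single-threaded sketch of the EBR core — all names (`ebr_sketch`, `enter`, `leave`, `retire`, `reclaim`) are hypothetical illustration, not CUBRID's API:

```cpp
// Minimal EBR core sketch: readers publish an epoch in a per-thread slot,
// retirers tag nodes with a bumped epoch, and reclamation frees the prefix
// of retired nodes whose epoch is below the minimum published epoch.
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

namespace ebr_sketch
{
  using epoch_t = std::uint64_t;
  constexpr epoch_t IDLE = std::numeric_limits<epoch_t>::max (); // idle sentinel
  constexpr std::size_t MAX_THREADS = 8;

  std::atomic<epoch_t> g_epoch {1};            // monotonic global epoch
  std::atomic<epoch_t> g_slots[MAX_THREADS];   // per-thread published epoch

  struct retired
  {
    epoch_t epoch;   // epoch at retirement
    int *payload;    // heap-owned payload for the sketch
  };

  void init ()
  {
    for (auto &s : g_slots) s.store (IDLE);
  }

  void enter (std::size_t tid) { g_slots[tid].store (g_epoch.load ()); } // bracket open
  void leave (std::size_t tid) { g_slots[tid].store (IDLE); }            // bracket close

  epoch_t min_active ()   // O(N_threads) scan; idle slots cannot lower the minimum
  {
    epoch_t m = IDLE;
    for (auto &s : g_slots) m = std::min (m, s.load ());
    return m;
  }

  void retire (std::vector<retired> &list, int *p)
  {
    list.push_back ({ ++g_epoch, p });   // bump: retire epoch > any live snapshot
  }

  std::size_t reclaim (std::vector<retired> &list)   // free the eligible prefix
  {
    epoch_t m = min_active ();
    std::size_t freed = 0;
    while (!list.empty () && list.front ().epoch < m)
      {
        delete list.front ().payload;
        list.erase (list.begin ());
        ++freed;
      }
    return freed;
  }
}
```

A node retired while a reader's bracket is open survives until the bracket closes; a real implementation amortizes the `min_active` scan rather than running it on every call.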

Two real-world tradeoffs shape the design:

  • Granularity. EBR is naturally per-data-structure; you do not want a vacuum thread’s hashmap epoch to gate a query plan cache’s epoch. The scope is the data structure, not the system.
  • The min-active-epoch scan is O(N_threads). Doing it on every retirement is expensive; doing it never is unbounded latency. The standard compromise is every K retirements, where K is chosen so the scan cost amortizes against retirement rate. K is the refresh interval.

CUBRID’s transactional reclamation is exactly EBR with a database-flavoured vocabulary. The “epoch” is a “transaction id,” the “per-DS data structure” is a “transaction table,” the “per-thread state” is a “transaction descriptor,” and the “retired node” is a “reclaimable node.” The refresh interval is named MATI_REFRESH_INTERVAL (Minimum Active Tran Id), set to 100.

The scheme does not solve ABA in general. It guarantees that a node retired during transaction Tr is not reclaimed until every descriptor either is idle or has a tranid > Tr. Inside that window, a reader that started before the retirement still sees the node at its original address, and any CAS by the reader on a pointer to that node still operates on a live object. ABA at the pointer-value level — where a node is retired, the bytes are later reclaimed and reused for a different node, then a third party puts the same address back into the structure — is prevented for any thread holding a transaction descriptor open; readers that touch lock-free structures without bracketing themselves with start_tran / end_tran are outside the protection.

Every DBMS that ships lock-free internal structures pays for some form of SMR. The implementations differ but the conceptual template is shared:

A reclamation primitive is per data structure, not per process. The lock manager’s resource hash, the page buffer’s BCB hash, the session table — each owns its own table because mixing them gates one’s reclamation behind another’s reader latency.

PostgreSQL’s XactLockTable and LockHashPartitionLock use heavyweight LWLocks; reclamation is implicit (no lock-free structures). InnoDB’s lock_sys_t is mutex-protected. Redis is single-threaded and has no concurrent-reclamation problem. DBMSes that run lock-free internals on a managed runtime (e.g., CockroachDB in Go, JVM-based engines) lean on the language’s garbage collector and sidestep the SMR problem at the cost of per-allocation overhead.

CUBRID is the C++ outlier: lock-free internals + manual memory management + custom EBR. The trade buys throughput at the cost of the entire lockfree::tran::* machinery existing and being correct.

Every reader must publish “I am inside a critical section that uses node X.” The publication must be cheap — it is on every read hot path. The publication must be observable by retiring threads in bounded time. Real implementations use:

  • A per-thread local epoch integer, atomic-stored on entry, atomic-cleared on exit. Reclaimers loop and read every thread’s entry. EBR / liburcu / CUBRID’s modern stack.
  • Hazard-pointer arrays of K slots per thread. Each slot is the target pointer the thread is currently dereferencing. Readers store, then test for unlink, then dereference. CUBRID does not use this.
  • Per-CPU sequence counters with periodic call-out to a reclaimer thread. CUBRID does not use this.

Computing the global minimum across N threads is the only non-O(1) operation in the read/write path. Every implementation amortizes:

  • liburcu computes gp_seq lazily on call-back (call_rcu).
  • Java’s G1 collector tracks GC roots concurrently and sweeps in the background.
  • CUBRID computes min_active_tranid every MATI_REFRESH_INTERVAL (= 100) global-id increments, in the same thread that calls get_new_global_tranid(). The cost is paid by the retiring thread, not by readers.

Two designs:

  • Per-data-structure global retired list. All retiring threads append to a shared list. Pros: simpler bookkeeping. Cons: contention on the shared list, plus the reclaimer must walk every entry on every scan.
  • Per-thread retired list. Each thread carries its own ordered list of “things I retired.” Reclaim walks only the prefix whose tranid < min_active. CUBRID picks this — every lockfree::tran::descriptor has its own m_retired_head, m_retired_tail ring.

The per-thread list relies on retirement order being the same as tranid order (which it is, because start_tran_and_increment_id fetches a fresh global id under CAS), so the prefix-walk is a strictly increasing scan that stops at the first node whose tranid >= min_active.
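A minimal sketch of that per-thread ordered ring, using simplified `ring_node` / `retired_ring` types rather than CUBRID's classes:

```cpp
// Sketch of the per-thread retired ring. Tail insertion keeps tranids
// strictly increasing, so reclamation is a prefix walk that stops at the
// first node whose tranid >= min_active.
#include <cstdint>

struct ring_node
{
  std::uint64_t tranid;
  ring_node *next;
};

struct retired_ring
{
  ring_node *head = nullptr;
  ring_node *tail = nullptr;

  void append (ring_node *n)   // tail insert preserves id order
  {
    n->next = nullptr;
    if (tail == nullptr)
      head = tail = n;
    else
      {
        tail->next = n;
        tail = n;
      }
  }

  int reclaim_prefix (std::uint64_t min_active)   // frees nodes with tranid < min_active
  {
    int freed = 0;
    while (head != nullptr && head->tranid < min_active)
      {
        ring_node *n = head;
        head = head->next;
        delete n;
        ++freed;
      }
    if (head == nullptr)
      tail = nullptr;   // ring fully drained
    return freed;
  }
};
```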

| Concept | CUBRID name |
| --- | --- |
| EBR epoch counter | table::m_global_tranid (monotonic atomic) |
| Per-thread active epoch | descriptor::m_tranid (set by start_tran) |
| Idle sentinel | INVALID_TRANID = numeric_limits<id>::max() |
| Per-thread retired list | descriptor::m_retired_head/tail (linked through m_retired_next) |
| Min-active-epoch | table::m_min_active_tranid (lazily computed atomic) |
| Refresh interval | MATI_REFRESH_INTERVAL = 100 |
| Per-thread state base class | tran::reclaimable_node |
| Custom destructor hook | reclaimable_node::reclaim() (virtual, default delete this) |
| Per-thread index | tran::index (typedef size_t); allocated by tran::system::assign_index |

The modern reclamation stack is exactly three classes plus one base class for retirable nodes:

```cpp
// lockfree::tran::system — src/base/lockfree_transaction_system.hpp
// index dispenser shared by every table in the system.
class system
{
  public:
    system () = delete;
    system (size_t max_tran_count);

    index assign_index ();                     // give a fresh per-thread index
    void free_index (index idx);               // return one
    size_t get_max_transaction_count () const;

  private:
    size_t m_max_tran_per_table;
    bitmap m_tran_idx_map;                     // ONE_CHUNK style, full-usage threshold
};

// lockfree::tran::table — src/base/lockfree_transaction_table.hpp
// one per lock-free data structure; owns the global tranid and the descriptor array.
class table
{
  public:
    table (system &sys);

    descriptor &get_descriptor (const index &tran_index);
    void start_tran (const index &);
    void end_tran (const index &);
    id get_current_global_tranid () const;
    id get_new_global_tranid ();               // atomic-bump and return
    id get_min_active_tranid () const;

  private:
    static const id MATI_REFRESH_INTERVAL = 100;

    void compute_min_active_tranid ();

    system &m_sys;
    descriptor *m_all;                         // size = sys.get_max_transaction_count()
    std::atomic<id> m_global_tranid;
    std::atomic<id> m_min_active_tranid;
};

// lockfree::tran::descriptor — src/base/lockfree_transaction_descriptor.hpp
// per-thread per-table state.
class descriptor
{
  public:
    void start_tran ();                        // snapshot current global id (no bump)
    void start_tran_and_increment_id ();       // bump global, take the new id
    void end_tran ();
    void retire_node (reclaimable_node &);     // append to retired ring
    void reclaim_retired_list ();              // walk prefix whose tranid < min_active
    bool is_tran_started () const;
    // ...

  private:
    table *m_table;
    id m_tranid;                               // INVALID_TRANID when idle
    id m_last_reclaim_minid;
    reclaimable_node *m_retired_head;
    reclaimable_node *m_retired_tail;
    bool m_did_incr;
    reclaimable_node *m_saved_node;            // local stash for find-and-insert ping-pong
    size_t m_retire_count, m_reclaim_count;
};
```

The relationship is one-to-many one-way:

  • One tran::system may underpin many tran::table instances. On the CUBRID server there is exactly one server-wide tran::system (returned by cubthread::get_thread_entry_lftransys()), parameterized by the maximum thread count.
  • Each tran::table owns its descriptor[] array. The array is sized at construction by m_sys.get_max_transaction_count() and every slot is initialized — this is not a sparse map; it is a pre-allocated dense array indexed by tran::index.
  • A tran::index is fixed for a thread’s lifetime — assigned at thread setup, valid across every table in the same system.

Reader-side critical section:

  1. Locate descriptor. tran::descriptor &tdes = table.get_descriptor (my_index).
  2. Bracket open. tdes.start_tran() snapshots the current global id. The descriptor’s m_tranid becomes the snapshot.
  3. Touch lock-free structure. Read pointers, CAS values, walk chains. Any pointer you read is guaranteed not to be reclaimed before you call end_tran() because every retiring thread that comes after you sees min_active_tranid <= m_tranid until you end_tran().
  4. Bracket close. tdes.end_tran() resets m_tranid = INVALID_TRANID.

Cost per access: one atomic load (the global id snapshot) plus one atomic store (the descriptor write). Both use the seq_cst default; the descriptor word is written only by its owning thread, so the store never contends with another descriptor. The retiring thread’s compute_min_active_tranid() reads every descriptor’s m_tranid via a plain get_transaction_id() load; on x86 an aligned word-sized load is a single atomic MOV, and the lazily computed minimum tolerates a slightly stale value.

The choice to keep start_tran / end_tran purely on the calling thread’s descriptor (no atomic on a shared cache line) is what makes the read path cheap — the reader pays only for its own cache line.

Retire-side:

  1. Locate descriptor. Same as read.
  2. Bracket open with bump. tdes.start_tran_and_increment_id() atomically increments m_table->m_global_tranid and assigns the new id to m_tranid. The bump is what makes the retiring epoch strictly greater than any concurrent reader’s snapshot.
  3. Reclaim eligible nodes first. reclaim_retired_list() walks the prefix of m_retired_head ring whose m_retire_tranid < min_active_tranid and calls each node’s reclaim() (virtual, default delete this). This is amortization — every retire pays the cost of reclaiming the nodes that became eligible since last time.
  4. Append the new retiree. retire_node() sets the retiree’s m_retire_tranid to m_tranid and links it at the tail of the retired ring.
  5. Bracket close. end_tran().

The retiree’s m_retire_tranid is the current global id, not the next one. Why is this safe? Every reader that already snapshotted a tranid ≤ my_tranid keeps its m_tranid ≤ my_tranid until it ends its tran, so min_active_tranid cannot exceed my_tranid until every such reader is done. Therefore min_active > my_tranid ⇔ no reader holds a snapshot ≤ my_tranid ⇔ no reader can dereference the retiree.

The minimum active tranid is computed lazily, on every 100th get_new_global_tranid():

```cpp
// lockfree::tran::table::get_new_global_tranid — src/base/lockfree_transaction_table.cpp:73
id
table::get_new_global_tranid ()
{
  id ret = ++m_global_tranid;
  if (ret % MATI_REFRESH_INTERVAL == 0)
    {
      compute_min_active_tranid ();
    }
  return ret;
}

// lockfree::tran::table::compute_min_active_tranid — src/base/lockfree_transaction_table.cpp:90
void
table::compute_min_active_tranid ()
{
  // note: all transactions are actually claimed from boot. this code is optimized
  // for this case. if we ever change how transactions are requested, this
  // must be updated too
  id minvalue = INVALID_TRANID;   // nothing is bigger than INVALID_TRANID
  for (size_t it = 0; it < m_sys.get_max_transaction_count (); it++)
    {
      id tranid = m_all[it].get_transaction_id ();
      if (minvalue > tranid)
        {
          minvalue = tranid;
        }
    }
  m_min_active_tranid.store (minvalue);
}
```

Two observations:

  • The scan walks every descriptor, not just the currently-occupied ones. The header comment notes this is because “all transactions are actually claimed from boot” — thread indexes are assigned at startup and held for life. If a future refactor introduces dynamic on-demand index assignment, this scan should be rewritten to skip vacant slots: a vacant slot’s INVALID_TRANID (which is numeric_limits<id>::max()) never lowers the minimum, so correctness holds, but every vacant slot adds a useless cache miss.
  • INVALID_TRANID is numeric_limits<id>::max(), on purpose. An idle descriptor stores the maximum and so cannot lower the minimum. The retire-side test tranid < min_active treats an all-idle world as “every retiree is reclaimable” (because min_active == INVALID_TRANID and every retiree’s m_retire_tranid is some real id <= m_global_tranid < INVALID_TRANID).

Each descriptor maintains a singly-linked ordered ring of retired nodes via m_retired_head / m_retired_tail, with each node carrying m_retired_next and m_retire_tranid. Insertions go to m_retired_tail; reclamation walks from m_retired_head forward.

```cpp
// lockfree::tran::descriptor::retire_node — src/base/lockfree_transaction_descriptor.cpp:65
void
descriptor::retire_node (reclaimable_node &node)
{
  bool should_end = !is_tran_started ();

  start_tran_and_increment_id ();   // bump global id
  reclaim_retired_list ();          // amortized cleanup

  node.m_retire_tranid = m_tranid;
  node.m_retired_next = NULL;

  // add to tail to keep delete ids ordered
  if (m_retired_tail == NULL)
    {
      assert (m_retired_head == NULL);
      m_retired_head = m_retired_tail = &node;
    }
  else
    {
      m_retired_tail->m_retired_next = &node;
      m_retired_tail = &node;
    }
  ++m_retire_count;

  if (should_end)
    {
      end_tran ();                  // restore previous bracketing
    }
}
```

The retire path does not lock. It is single-writer (the descriptor’s owning thread is the only writer of its retired ring) and the descriptor’s m_retired_* pointers are not racing with anyone else. The min-scan reads each descriptor’s m_tranid (a single word) but never the ring; the ring is private to the owning thread.

The fast path of reclaim_retired_list() is a quick load of m_min_active_tranid and an early-return when nothing has changed:

```cpp
// lockfree::tran::descriptor::reclaim_retired_list — src/base/lockfree_transaction_descriptor.cpp:133
void
descriptor::reclaim_retired_list ()
{
  id min_tran_id = m_table->get_min_active_tranid ();
  if (min_tran_id <= m_last_reclaim_minid)
    {
      // nothing changed
      return;
    }

  while (m_retired_head != NULL && m_retired_head->m_retire_tranid < min_tran_id)
    {
      reclaim_retired_head ();
    }
  if (m_retired_head == NULL)
    {
      m_retired_tail = NULL;
    }
  m_last_reclaim_minid = min_tran_id;
}
```

This early-return is what makes retirement amortized-O(1) — only the call that crosses a MATI_REFRESH_INTERVAL boundary pays the scan cost on its descriptor.
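The gate itself is tiny. A sketch with a hypothetical reclaim_gate type standing in for the descriptor (and a walk counter as instrumentation):

```cpp
// Sketch of the reclaim fast path: the prefix walk runs only when min_active
// has advanced past the value cached on the previous call, so most
// retirements pay a single load and compare.
#include <cstdint>

struct reclaim_gate
{
  std::uint64_t m_last_reclaim_minid = 0;
  int m_walks = 0;   // instrumentation: how many prefix walks actually ran

  void maybe_reclaim (std::uint64_t min_active)
  {
    if (min_active <= m_last_reclaim_minid)
      {
        return;   // nothing changed since last time
      }
    ++m_walks;    // ... walk the retired prefix here ...
    m_last_reclaim_minid = min_active;
  }
};
```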

The retired node owns its own destructor. The default is delete this; the freelist overrides it to push the node back onto the freelist’s available list:

```cpp
// lockfree::tran::reclaimable_node — src/base/lockfree_transaction_reclaimable.hpp:46
class reclaimable_node
{
  public:
    reclaimable_node () : m_retired_next (NULL), m_retire_tranid (0) {}
    virtual ~reclaimable_node () = default;

    virtual void reclaim ()
    {
      delete this;   // override to recycle instead of delete
    }

  protected:
    reclaimable_node *m_retired_next;

  private:
    friend descriptor;
    id m_retire_tranid;
};

// lockfree::freelist<T>::free_node::reclaim — src/base/lockfree_freelist.hpp:523
void free_node::reclaim () final override
{
  m_t.on_reclaim ();   // payload's on_reclaim hook
  m_retired_next = NULL;
  --m_owner->m_retired_count;
  ++m_owner->m_available_count;
  m_owner->push_to_list (*this, *this, m_owner->m_available_list);
}
```

The contract is: a node’s storage is owned by something (the freelist, the heap, an embedded slot); reclaim() is the hook that returns it. delete this is the default for code that allocates with new and has no enclosing free list. Hashmap entries go through the freelist’s overridden reclaim, so they never call delete on a node — they recycle it.
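The ownership contract can be illustrated with a stripped-down pair of classes (hypothetical names; the real free_node override also maintains retired/available counters):

```cpp
// The base class defaults to delete this for heap-owned nodes; a pool-owned
// node overrides reclaim() to recycle itself instead of freeing its storage.
#include <vector>

struct reclaimable
{
  virtual ~reclaimable () = default;
  virtual void reclaim () { delete this; }   // heap-owned default
};

struct pooled_node;

struct pool
{
  std::vector<pooled_node *> available;      // recycled nodes land here
};

struct pooled_node : reclaimable
{
  pool *m_owner;

  explicit pooled_node (pool *p) : m_owner (p) {}

  void reclaim () override                   // recycle; storage stays alive
  {
    m_owner->available.push_back (this);
  }
};
```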

The save / pull-saved-reclaimable mechanism


descriptor exposes one auxiliary primitive:

```cpp
void save_reclaimable (reclaimable_node *&node);
reclaimable_node *pull_saved_reclaimable ();
```

Use case: a thread is doing a find_or_insert on the hashmap. It speculatively claims a fresh node from the freelist (freelist_claim), tries to CAS-link it into the chain, then finds another thread already inserted the same key. The speculatively-claimed node is now stale but must not be reclaimed yet — the thread does not want to retire it (retirement bumps the global id and incurs amortization cost), nor does it want to push it back onto the freelist’s available list (race with the next call site). It instead saves it on its own descriptor:

```cpp
// freelist_claim — src/base/lockfree_hashmap.hpp:663
T *
hashmap<Key, T>::freelist_claim (tran::descriptor &tdes)
{
  T *claimed = NULL;
  free_node_type *fn = reinterpret_cast<free_node_type *> (tdes.pull_saved_reclaimable ());

  // ... if fn != NULL, reuse the saved node before going to the freelist
  if (fn == NULL)
    {
      fn = m_freelist->claim (tdes);
      // initialize via f_init
    }
  // ...
}
```

The next time the same thread calls freelist_claim, the saved node is pulled out and reused before the freelist’s CAS-pop. This trick is one of the reasons find_or_insert is cheap on contended keys — the speculative claim is not wasted.

Caveat: save_reclaimable asserts only one node may be saved at a time per descriptor (assert (m_saved_node == NULL)). If a thread saves a node and then forgets to pull it before the next save_reclaimable, the second save asserts in debug and silently overwrites in release. Hashmap call sites are written so the saved-node lifetime cannot escape a single hash operation.
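The single-slot discipline, reduced to a hypothetical node_stash (the real slot lives on the descriptor and holds reclaimable_node pointers):

```cpp
// One-slot stash: save asserts the slot is empty (debug builds), pull
// returns the saved node and clears the slot.
#include <cassert>

struct node_stash
{
  void *m_saved = nullptr;

  void save (void *n)
  {
    assert (m_saved == nullptr);   // only one node may be stashed at a time
    m_saved = n;
  }

  void *pull ()
  {
    void *n = m_saved;
    m_saved = nullptr;
    return n;
  }
};
```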

Per-thread index allocation — cycle of life


A thread is born, calls system::assign_index() to get a unique index, and uses it until it dies (or returns the index via free_index). On the CUBRID server every thread gets its index at thread-pool dispatch time and keeps it for the lifetime of the worker; the index is stored in cubthread::entry::m_lf_tran_index and exposed via entry::get_lf_tran_index().

```cpp
// lockfree::tran::system::assign_index — src/base/lockfree_transaction_system.cpp:39
index
system::assign_index ()
{
  int ret = m_tran_idx_map.get_entry ();
  if (ret < 0)
    {
      assert (false);
      return INVALID_INDEX;
    }
  return static_cast<index> (ret);
}
```

The bitmap is ONE_CHUNK style with FULL_USAGE_RATIO (1.0) — “all slots usable” — so assign_index only fails when every slot is taken, which under the boot-time-claim model means a miscount of max_tran_count.
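The assign/free contract can be captured with a toy dispenser — a linear scan over a std::vector<bool>, far simpler than CUBRID's chunked bitmap but with the same failure mode (assignment fails only when every slot is taken):

```cpp
// Toy index dispenser: assign scans for a clear slot and marks it used;
// free_index clears it so the slot can be handed out again.
#include <cstddef>
#include <vector>

struct index_dispenser
{
  std::vector<bool> used;
  static constexpr std::size_t INVALID_INDEX = static_cast<std::size_t> (-1);

  explicit index_dispenser (std::size_t max_count) : used (max_count, false) {}

  std::size_t assign ()   // fails only when every slot is taken
  {
    for (std::size_t i = 0; i < used.size (); ++i)
      {
        if (!used[i])
          {
            used[i] = true;
            return i;
          }
      }
    return INVALID_INDEX;
  }

  void free_index (std::size_t i) { used[i] = false; }
};
```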

```mermaid
sequenceDiagram
  participant Reader as Reader thread
  participant Retirer as Retirer thread
  participant Tdes_R as Reader's descriptor
  participant Tdes_W as Retirer's descriptor
  participant Tab as tran::table

  Note over Tab: m_global_tranid = G<br/>m_min_active_tranid = G_min
  Reader->>Tdes_R: start_tran()
  Tdes_R->>Tab: load m_global_tranid → G
  Tdes_R->>Tdes_R: m_tranid = G
  Reader->>Reader: dereference node N
  Retirer->>Tdes_W: retire_node(N)
  Tdes_W->>Tab: ++m_global_tranid → G+1
  Tdes_W->>Tdes_W: m_tranid = G+1
  Note over Tdes_W: every 100th increment:<br/>compute_min_active_tranid()<br/>scans all descriptors
  Tdes_W->>Tdes_W: N.m_retire_tranid = G+1
  Tdes_W->>Tdes_W: append N to retired ring
  Tdes_W->>Tdes_W: m_retire_count++
  Note over Reader: ... still using N ...
  Reader->>Tdes_R: end_tran()
  Tdes_R->>Tdes_R: m_tranid = INVALID_TRANID
  Note over Tab: next refresh:<br/>min_active climbs past G+1
  Retirer->>Tdes_W: reclaim_retired_list()
  Tdes_W->>Tdes_W: walk head while m_retire_tranid < min_active
  Tdes_W->>Tdes_W: N.reclaim() (delete or recycle)
```

Address marker — the pointer-mark companion


The reclamation framework solves when to reclaim. The Harris-Michael delete protocol needs a way to mark a node as logically deleted but not yet physically unlinked without allocating extra memory. The standard technique is to put the mark in the low bit of the next pointer: real pointers are ≥ 2-byte aligned, so the low bit is otherwise always 0. CUBRID encodes this with lockfree::address_marker<T>:

```cpp
// lockfree::address_marker<T> — src/base/lockfree_address_marker.hpp:31
template <class T>
class address_marker
{
  public:
    bool is_marked () const;
    T *get_address () const;                 // strip mark
    T *get_address_no_strip () const;

    static T *set_adress_mark (T *addr);
    static T *strip_address_mark (T *addr);
    static bool is_address_marked (T *addr);
    static T *atomic_strip_address_mark (T *addr);

  private:
    using convert_type = std::uint64_t;
    static const convert_type MARK = 0x1;
    // ...
    std::atomic<T *> m_addr;
};
```

The implementation is dataless: every static helper casts the pointer to uint64_t, ORs or AND-NOTs the mark bit, and casts back. The instance form address_marker<T> wraps a std::atomic<T*> so callers can store marked pointers without the cast. The hashmap uses both: marked next-pointers in entries (via the static helpers set_adress_mark and strip_address_mark) and a marked back-buffer head (via an instance address_marker<T>).

Note the typo in set_adress_mark (single d in adress) — it matches in every call site, so the typo is now load-bearing.
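The underlying bit trick is easy to demonstrate in isolation — generic helpers below, not CUBRID's exact signatures (note the uintptr_t round-trip, which is the portable form of the cast):

```cpp
// Low-bit pointer marking: objects are at least 2-byte aligned, so bit 0 of
// a valid pointer is always clear and can carry the logical-delete mark.
#include <cstdint>

constexpr std::uintptr_t MARK = 0x1;

template <class T>
T *set_mark (T *p)
{
  return reinterpret_cast<T *> (reinterpret_cast<std::uintptr_t> (p) | MARK);
}

template <class T>
T *strip_mark (T *p)
{
  return reinterpret_cast<T *> (reinterpret_cast<std::uintptr_t> (p) & ~MARK);
}

template <class T>
bool is_marked (T *p)
{
  return (reinterpret_cast<std::uintptr_t> (p) & MARK) != 0;
}
```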

The legacy stack (src/base/lock_free.{h,c}) implements the same EBR shape with different names:

| Modern | Legacy |
| --- | --- |
| tran::system::assign_index | lf_tran_request_entry |
| tran::system::free_index | lf_tran_return_entry |
| tran::table::m_global_tranid | LF_TRAN_SYSTEM::global_transaction_id |
| tran::table::m_min_active_tranid | LF_TRAN_SYSTEM::min_active_transaction_id |
| MATI_REFRESH_INTERVAL = 100 | LF_TRAN_SYSTEM::mati_refresh_interval = 100 |
| descriptor::start_tran | lf_tran_start (entry, false) |
| descriptor::start_tran_and_increment_id | lf_tran_start (entry, true) |
| descriptor::end_tran | lf_tran_end (entry) |
| descriptor::retire_node | (inlined into lf_freelist_retire) |
| descriptor::m_retired_* | LF_TRAN_ENTRY::retired_list (singly-linked list) |
| compute_min_active_tranid | lf_tran_compute_minimum_transaction_id |
| reclaimable_node | (no base class; void* nodes walked via LF_ENTRY_DESCRIPTOR::of_local_next) |

Two structural differences:

  1. Legacy uses 11 globally-named LF_TRAN_SYSTEM instances (spage_saving_Ts, obj_lock_res_Ts, …, dwb_slots_Ts). Modern has no globals — every tran::table is constructed from a tran::system & reference.
  2. Legacy has explicit MEMORY_BARRIER() macros in the lf_tran_start_with_mb / lf_tran_end_with_mb helpers; modern relies on the seq_cst defaults of std::atomic. The intent is the same — publish the descriptor write before any node dereference, observe the global id increment after — but the modern code expresses it through the type system rather than explicit fences.

Both schemes satisfy the same EBR invariant. Code that runs under the bridge cubthread::lockfree_hashmap works whichever side is active.

Just two type aliases:

```cpp
namespace lockfree
{
  namespace tran
  {
    using index = size_t;        // per-thread descriptor index
    using id = std::uint64_t;    // monotonic global tranid
  }
}
```

The index dispenser. system::system(size_t) initializes an internal bitmap of size max_tran_count in ONE_CHUNK FULL_USAGE_RATIO mode; assign_index() calls bitmap::get_entry() and free_index() calls bitmap::free_entry(). There is no per-table state in the system class itself — it is a thin shell over the bitmap.

The reclamation table. table(system &) constructor allocates new descriptor[m_sys.get_max_transaction_count()] and walks each descriptor to set its back-pointer (m_all[i].set_table(*this)). The destructor delete[] m_all relies on every descriptor’s destructor having drained its retired ring (descriptor destructor asserts !is_tran_started() and walks the ring calling reclaim_retired_head).

get_new_global_tranid() is the hot path on the retire side: ++m_global_tranid then conditional compute_min_active_tranid() on the modular boundary.

The per-thread state. The methods are deliberately fine-grained:

  • start_tran() snapshots without bumping. Used by readers.
  • start_tran_and_increment_id() bumps the global. Used by retirers and by code that needs a fresh epoch (e.g., hashmap::list_delete after the unlink CAS, to ensure the retiree’s epoch is strictly greater than every concurrent reader’s snapshot).
  • end_tran() resets to INVALID_TRANID.
  • retire_node() bumps then appends.
  • reclaim_retired_list() walks the eligible prefix.
  • save_reclaimable / pull_saved_reclaimable for the speculative-claim ping-pong.

The base class for retirable nodes. The interesting field is m_retired_next, which is protected and “may be repurposed by derived classes.” The freelist’s free_node exploits this — when a node is on the freelist’s available list, m_retired_next is the freelist link; when it is on a descriptor’s retired ring, m_retired_next is the retire-ring link. The two roles never co-exist (a node is on exactly one list at a time) so the overload is safe.

m_retire_tranid is private, only the friend descriptor writes it.

Templated, header-only. The static helpers are stateless casts; the instance form holds a std::atomic<T*> so callers can call is_marked () on the stored value without re-loading. The atomic_strip_address_mark (T*) static helper boxes a temporary address_marker<T> and immediately strips it, which is more defensive than the inline strip macros in the legacy code.

| Symbol | File | Line |
| --- | --- | --- |
| tran::system::system | src/base/lockfree_transaction_system.cpp | 29 |
| tran::system::assign_index | src/base/lockfree_transaction_system.cpp | 39 |
| tran::system::free_index | src/base/lockfree_transaction_system.cpp | 51 |
| tran::table::table | src/base/lockfree_transaction_table.cpp | 36 |
| tran::table::get_descriptor | src/base/lockfree_transaction_table.cpp | 54 |
| tran::table::get_new_global_tranid | src/base/lockfree_transaction_table.cpp | 73 |
| tran::table::compute_min_active_tranid | src/base/lockfree_transaction_table.cpp | 90 |
| tran::table::get_min_active_tranid | src/base/lockfree_transaction_table.cpp | 107 |
| tran::descriptor::descriptor | src/base/lockfree_transaction_descriptor.cpp | 32 |
| tran::descriptor::~descriptor | src/base/lockfree_transaction_descriptor.cpp | 45 |
| tran::descriptor::retire_node | src/base/lockfree_transaction_descriptor.cpp | 65 |
| tran::descriptor::start_tran | src/base/lockfree_transaction_descriptor.cpp | 94 |
| tran::descriptor::start_tran_and_increment_id | src/base/lockfree_transaction_descriptor.cpp | 103 |
| tran::descriptor::end_tran | src/base/lockfree_transaction_descriptor.cpp | 119 |
| tran::descriptor::reclaim_retired_list | src/base/lockfree_transaction_descriptor.cpp | 133 |
| tran::descriptor::reclaim_retired_head | src/base/lockfree_transaction_descriptor.cpp | 154 |
| tran::descriptor::save_reclaimable | src/base/lockfree_transaction_descriptor.cpp | 170 |
| tran::descriptor::pull_saved_reclaimable | src/base/lockfree_transaction_descriptor.cpp | 178 |
| tran::reclaimable_node (class) | src/base/lockfree_transaction_reclaimable.hpp | 46 |
| address_marker<T> (class) | src/base/lockfree_address_marker.hpp | 31 |
| lf_tran_start (legacy) | src/base/lock_free.c | 413 |
| lf_tran_end (legacy) | src/base/lock_free.c | 446 |
| lf_tran_compute_minimum_transaction_id (legacy) | src/base/lock_free.c | 379 |
| lf_tran_request_entry (legacy) | src/base/lock_free.c | 281 |
  • MATI_REFRESH_INTERVAL is exactly 100, identical between legacy and modern. Legacy: LF_TRAN_SYSTEM::mati_refresh_interval = 100 set in lf_tran_system_init (lock_free.c:236). Modern: lockfree::tran::table::MATI_REFRESH_INTERVAL = 100 declared in lockfree_transaction_table.hpp:78. Not runtime-tunable.
  • INVALID_TRANID == numeric_limits<id>::max() declared in lockfree_transaction_descriptor.hpp:49 and used as the idle sentinel. Legacy uses LF_NULL_TRANSACTION_ID == ULONG_MAX (lock_free.h:119); both encode the same maximum-value pattern.
  • Min-scan walks every descriptor unconditionally. compute_min_active_tranid in lockfree_transaction_table.cpp:90 iterates 0 .. m_sys. get_max_transaction_count() with no early termination. The header comment notes this is intentional given the boot-time-claim assumption.
  • reclaim() is virtual, default delete this. Declared in lockfree_transaction_reclaimable.hpp:57, the freelist’s free_node::reclaim overrides with final qualifier (lockfree_freelist.hpp:523).
  • The descriptor destructor walks the retired ring. ~descriptor() at lockfree_transaction_descriptor.cpp:45 loops while (m_retired_head != NULL) and calls reclaim_retired_head(). If a thread dies mid-retirement (which should not happen in CUBRID’s lifetime model), the destructor still drains.
  • Saved reclaimable is a single slot. descriptor::save_reclaimable at lockfree_transaction_descriptor.cpp:170 asserts m_saved_node == NULL. Two saves without an intervening pull trigger the assert in debug builds.
  • address_marker<T>::MARK == 0x1 declared at lockfree_address_marker.hpp:36. Same bit-position as the legacy ADDR_HAS_MARK / ADDR_WITH_MARK macros at lock_free.c:72-74.
  1. Should compute_min_active_tranid skip vacant descriptor slots? The header comment acknowledges the all-claimed-from- boot assumption. If a future change introduces dynamic index assignment, the scan must change. Investigation path: search for assign_index call sites and confirm they all run during boot.
  2. Is the typo set_adress_mark (one d) intentional, or should it be cleaned up? Every call site uses the typo spelling, so renaming is a sweep across the codebase. Low value; flagged for future cleanup if a wider lockfree refactor happens.
  3. What is the cost of the seq_cst default on weak-memory targets? No measurement campaign exists. The lf_tran_start_with_mb macro in legacy code suggests the intent is full barrier semantics, but the modern code does not measure whether acquire/release would suffice on the read path.
  4. Are there any structures that retire nodes outside a tran::table and rely on delete this directly? The default reclaim() is delete this. If a code path retires a reclaimable_node whose storage is not heap-owned (e.g., embedded in an array slot), the default reclaim is a double-free. Investigation path: grep for tran::reclaimable_node derived classes and verify each override.
  5. Does the descriptor’s m_did_incr flag still serve a purpose? The flag is used in start_tran_and_increment_id() to ensure only one bump per transaction even on multiple calls. With the freelist’s claim() calling start_tran() (no bump) and the hashmap’s list_delete calling promote_tran_force() (bump), the flag is the sentinel between “I am observing” and “I am retiring.” Worth a focused review under stress test.

Beyond CUBRID — Comparative Designs & Research Frontiers

  • Userspace RCU (liburcu) integration. liburcu is a production-grade EBR with grace-period detection. Its API (rcu_read_lock, rcu_read_unlock, synchronize_rcu, call_rcu) maps almost directly onto CUBRID’s start_tran/end_tran/retire_node. Adopting liburcu would let CUBRID drop several hundred lines of lockfree::tran code at the cost of a system dependency.
  • DEBRA / IBR (Brown, ICDCS 2015). A faster EBR variant that reduces the min-scan cost by partitioning descriptors. Worth comparing against CUBRID’s full-scan-every-100 cost on high-thread-count machines.
  • NBR+ (Singh et al., PPoPP 2021). Neutralization-based reclamation, avoids per-access epoch publication entirely by using OS signals to detect quiescence. Aggressive but promising for read-heavy workloads (CUBRID’s hashmap finds vastly outnumber retires).
  • Hyaline (Nikolaev & Ravindran, PPoPP 2020). Constant- time reclamation with hazard-eras. Tighter latency bound than EBR — every retiree is reclaimable in O(1) amortized — at the cost of per-access reference-counting machinery.
  • Replacing the per-thread retire ring with a global shard-per-CPU ring. Currently each descriptor’s retired ring is private; if the per-thread index pool grows large the total retired-node memory is O(threads × pending-reclaims). A shard-per-CPU pool with thread-affinity dispatch would bound this regardless of thread count.
  • Source files (in the CUBRID source tree at /data/hgryoo/references/cubrid/):
    • src/base/lockfree_transaction_def.hpp
    • src/base/lockfree_transaction_system.{hpp,cpp}
    • src/base/lockfree_transaction_table.{hpp,cpp}
    • src/base/lockfree_transaction_descriptor.{hpp,cpp}
    • src/base/lockfree_transaction_reclaimable.hpp
    • src/base/lockfree_address_marker.hpp
    • src/base/lock_free.h, src/base/lock_free.c (legacy counterparts)
  • Textbook chapters cited:
    • Herlihy & Shavit, The Art of Multiprocessor Programming, ch. 9 §“Lock-Free Linked Lists” (Harris-Michael delete with pointer-mark bit), ch. 10 §“Concurrent Hashing” (split-ordered chained tables), ch. 7 §“Spin Locks and Contention” (CAS hazards).
    • Williams, C++ Concurrency in Action, ch. 5 §“Memory ordering for atomic operations”, ch. 7 §“Designing lock-free concurrent data structures.”
  • Foundational papers:
    • Michael, M. M. “Hazard Pointers: Safe Memory Reclamation for Lock-Free Objects.” IEEE TPDS 15(8), 2004.
    • Fraser, K. “Practical Lock-Freedom.” Univ. of Cambridge Tech Report UCAM-CL-TR-579 (2004) — origin of the epoch-based reclamation framing CUBRID’s scheme matches.
    • McKenney, P. E., Slingwine, J. “Read-Copy Update: Using Execution History to Solve Concurrency Problems.” PDCS 1998.
    • Brown, T. A. “Reclaiming Memory for Lock-Free Data Structures.” PODC 2015.
  • Sibling docs in this repo:
    • cubrid-lockfree-overview.md — the umbrella; navigate here for the two-generation map.
    • cubrid-lockfree-hashmap.md — the largest consumer of the reclamation framework.
    • cubrid-lockfree-freelist.md — the second consumer.
    • cubrid-lockfree-bitmap.md — used internally by tran::system for index allocation.