
CUBRID Lock-Free Freelist — Typed Node Pool with Back-Buffer Block Allocator


A freelist is the simplest possible memory pool: a singly linked list of free objects. Allocate = pop the head; release = push to the head. The push/pop CAS protocol of Treiber’s stack applies directly. The textbook coverage is in Herlihy & Shavit, The Art of Multiprocessor Programming, ch. 11 §“Concurrent Stacks and Elimination” — the same shape, applied to typed reusable objects rather than abstract elements.
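A minimal sketch of that shape, with hypothetical names, deliberately ignoring ABA and growth (the two problems the rest of this page is about):

// Sketch: a bare Treiber-stack freelist (hypothetical; no ABA protection, no growth)
#include <atomic>

template <class T>
struct naive_freelist
{
  struct node { node *next; T data; };

  std::atomic<node *> m_head { nullptr };

  node *claim ()                                      // pop the head
  {
    node *top = m_head.load ();
    while (top != nullptr
           && !m_head.compare_exchange_weak (top, top->next))
      ;                                               // a failed CAS refreshes top
    return top;                                       // nullptr = pool exhausted
  }

  void release (node &n)                              // push to the head
  {
    n.next = m_head.load ();
    while (!m_head.compare_exchange_weak (n.next, &n))
      ;                                               // a failed CAS refreshes n.next
  }
};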

The interesting design tensions are:

  1. What goes on the head? A single global head bounces between cores on every alloc/release. The standard mitigations are elimination (pair an alloc with a release without touching the head) and partitioning (per-CPU freelists with cross-shard stealing).
  2. When does the freelist grow? A fixed-size pool is simplest but exhausts under load. A growable pool needs a block allocator and a strategy for when to call it. Allocating a block on demand at exhaustion stalls the calling thread; pre-allocating everything at init wastes memory.
  3. How is ABA prevented? The Treiber stack’s classic ABA problem — pop reads head, computes head->next, CAS-bumps head to head->next — is hit head-on by every freelist that recycles addresses. The mitigations are tagged pointers (extra bits in the head word), hazard pointers (per-thread published references), or transactional reclamation (defer reuse until no reader could be holding the popped node).

CUBRID’s lockfree::freelist<T> picks: a single available list head, a one-block back-buffer for lazy growth, and the tran::table / tran::descriptor framework for ABA-bounded reuse. The back-buffer is the design’s distinctive feature.

A naive claim() that finds the available list empty must either fail, spin, or allocate a new block. Allocation is expensive (a malloc plus N typed-init calls). Concurrent allocators also race — three threads that simultaneously discover an empty list each call alloc_block, producing three blocks where one would suffice.

The back-buffer pattern: keep one already-built block held aside. When the available list drains, atomic-swap the back-buffer into the available list and asynchronously build a new back-buffer. The race is now bounded — at most one thread gets to swap the back-buffer; the others see the swap completed and try the available list again, which by then has the just-swapped block.

CUBRID’s freelist runs the back-buffer pre-allocation eagerly: on every successful swap, the freelist immediately allocates the next back-buffer. So at steady state the freelist holds two blocks of memory: one in m_available_list, one in m_backbuffer_*.

on_reclaim — typed cleanup before recycle


A freelist that recycles bare bytes (just void* from malloc/free) imposes no contract on the allocator. A typed freelist must let the type clean itself between uses — close file handles, release sub-allocations, scrub state. The shape the freelist enforces:

  • The payload type T must declare an on_reclaim() method (compile-time check via SFINAE on the template instantiation).
  • The freelist’s reclaim() calls m_t.on_reclaim() before pushing the node back to the available list.

The lockfree::hashmap’s freelist_node_data is the canonical example: its on_reclaim calls the per-table lf_entry_descriptor::f_uninit callback, so a hashmap entry’s custom destructor (releasing a sub-mutex, freeing a payload buffer, etc.) runs at every recycle.
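A hypothetical payload that satisfies the contract (illustrative only; the names below are not CUBRID’s):

// Sketch: a payload type with the on_reclaim () hook the freelist requires (hypothetical)
#include <unistd.h>   // ::close
#include <vector>

struct connection_slot
{
  int fd = -1;                    // resource owned while the node is claimed
  std::vector<char> buffer;       // sub-allocation owned while the node is claimed

  void on_reclaim ()              // called by free_node::reclaim () before the node is recycled
  {
    if (fd >= 0)
      {
        ::close (fd);             // release the handle
        fd = -1;
      }
    buffer.clear ();              // scrub state so the next claimer starts clean
  }
};
// usage: lockfree::freelist<connection_slot> pool (transys, /* block_size = */ 64);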

Production DBMSs generally pool the allocations behind their concurrent structures — a malloc on every insert and a free on every delete is too slow. The shapes:

  • PostgreSQL uses MemoryContext arenas keyed by lifetime (top-level / per-transaction / per-tuple). Not lock-free; the context is per-backend.
  • InnoDB uses mem_heap_t arenas + a private freelist for hash entries (hash_table_t::heap). Mutex-protected.
  • MySQL has slab allocators for fixed-size objects.
  • CUBRID uses one freelist per lock-free hashmap (and per any other structure that wants pooled typed nodes), riding on tran::table reclamation.

The CUBRID approach is per-data-structure, not per-thread or per-transaction. The trade: each hashmap has its own pool with its own waste rate, but the pools never bleed cross-structure (retiring a hashmap clears only its own freelist).

The pattern of allocating K objects at once and pushing them as a block is standard; keeping one spare block on the side is less common. Java’s ConcurrentLinkedDeque does not; Folly’s SmallVector does (capacity grows by doubling, holding the new buffer aside until the old is drained). CUBRID’s freelist sits on the back-buffer side because the typical workload is bursty retirement (vacuum sweep, hashmap clear) followed by bursty claiming (replay, repopulation) — the back-buffer absorbs the asymmetry.

| Concept | CUBRID name |
| --- | --- |
| Typed node pool | lockfree::freelist<T> |
| Pool node | freelist<T>::free_node (derives from tran::reclaimable_node) |
| Available stack head | m_available_list (std::atomic<free_node *>) |
| Back-buffer head/tail | m_backbuffer_head, m_backbuffer_tail (std::atomic<free_node *>) |
| Block size (init param) | m_block_size (≥ 2 enforced) |
| On-reclaim hook | T::on_reclaim () (caller-provided) |
| Per-pool reclamation | m_trantable (each freelist owns its own tran::table) |
| Allocate from available | pop_from_available () |
| Allocate via swap | swap_backbuffer () |
| Forced allocation | force_alloc_block () (bypass back-buffer) |
| Statistics: alloc count | m_alloc_count |
| Statistics: available count | m_available_count |
| Statistics: back-buffer count | m_bb_count |
| Statistics: forced allocations | m_forced_alloc_count |
| Statistics: retired count | m_retired_count |

Two classes — freelist<T> and freelist<T>::free_node

// lockfree::freelist<T> — src/base/lockfree_freelist.hpp:42
template <class T>
class freelist {
public:
class free_node;
freelist () = delete;
freelist (tran::system &transys, size_t block_size, size_t initial_block_count = 1);
~freelist ();
free_node *claim (tran::descriptor &);
free_node *claim (tran::index);
void retire (tran::descriptor &, free_node &);
void retire (tran::index, free_node &);
// statistics:
size_t get_alloc_count () const;
size_t get_available_count () const;
size_t get_backbuffer_count () const;
size_t get_forced_allocation_count () const;
size_t get_retired_count () const;
size_t get_claimed_count () const; // = alloc - (available + backbuffer + retired)
tran::system &get_transaction_system ();
tran::table &get_transaction_table ();
private:
tran::system &m_transys;
tran::table *m_trantable; // owned: deleted in destructor
size_t m_block_size;
std::atomic<free_node *> m_available_list;
std::atomic<free_node *> m_backbuffer_head;
std::atomic<free_node *> m_backbuffer_tail;
std::atomic<size_t> m_available_count;
std::atomic<size_t> m_alloc_count;
std::atomic<size_t> m_bb_count;
std::atomic<size_t> m_forced_alloc_count;
std::atomic<size_t> m_retired_count;
void swap_backbuffer ();
void alloc_backbuffer ();
void force_alloc_block ();
void alloc_list (free_node *&head, free_node *&tail);
void dealloc_list (free_node *head);
free_node *pop_from_available ();
void push_to_list (free_node &head, free_node &tail, std::atomic<free_node *> &dest);
void clear_free_nodes ();
void final_sanity_checks () const;
void check_my_pointer (free_node *node);
};
// lockfree::freelist<T>::free_node — src/base/lockfree_freelist.hpp:104
template <class T>
class freelist<T>::free_node : public tran::reclaimable_node {
public:
free_node ();
~free_node () = default;
T &get_data ();
void reclaim () final override; // recycles instead of delete
private:
friend freelist;
void set_owner (freelist &m_freelist);
void set_freelist_next (free_node *next);
void reset_freelist_next ();
free_node *get_freelist_next ();
freelist *m_owner;
T m_t; // payload
};

The free_node derives from tran::reclaimable_node, so a retired node lives on the descriptor’s retired ring through the inherited m_retired_next field. When reclamation eventually fires, the inherited reclaim() virtual is the recycle path that pushes back to m_available_list.

The clever reuse of m_retired_next:

  • While the node is on the descriptor’s retired ring, m_retired_next is the retire-ring link.
  • While the node is on the freelist’s available list, m_retired_next is the freelist link.
  • A node is on exactly one of those at a time, so the field is reused.

free_node::set_freelist_next and get_freelist_next are private accessors that present m_retired_next as a freelist link with the correct return type (free_node *, not the base reclaimable_node *).
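The shape of those accessors, as a sketch over the inherited link (assuming m_retired_next is the reclaimable_node * member inherited from tran::reclaimable_node):

// Sketch: presenting the inherited retire link as a typed freelist link
void set_freelist_next (free_node *next)
{
  m_retired_next = next;                              // base-class pointer accepts the derived type
}
void reset_freelist_next ()
{
  m_retired_next = NULL;
}
free_node *get_freelist_next ()
{
  return static_cast<free_node *> (m_retired_next);   // cast back down; only free_nodes are linked here
}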

// freelist<T>::freelist — src/base/lockfree_freelist.hpp:138
freelist (tran::system &transys, size_t block_size,
size_t initial_block_count = 1)
: m_transys (transys),
m_trantable (new tran::table (transys)), // OWN tran table
m_block_size (block_size),
/* counters initialized to 0 */
{
if (initial_block_count <= 1) {
m_block_size /= 2;
initial_block_count = 2;
}
alloc_backbuffer ();
for (size_t i = 0; i < initial_block_count; i++)
swap_backbuffer (); // pre-fill available list
}

Three things happen at init:

  1. The freelist creates its own tran::table from the shared tran::system. Every freelist is a distinct reclamation domain — retiring nodes in freelist A does not bump the global tranid in freelist B.
  2. The “block size at least 2 with initial blocks at least 2” rule. If the caller asks for initial_block_count = 1, the constructor halves the block size and doubles the block count. This guarantees there is always at least one block in the back-buffer and one block in the available list at steady state.
  3. The init eagerly fills the available list to initial_block_count blocks. The back-buffer is one extra block beyond what was requested. After init, the freelist contains initial_block_count + 1 total blocks of memory, initial_block_count of which are immediately claimable.
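In a caller’s hands, construction and the post-init counters look roughly like this (a usage sketch; the payload type and the tran::system constructor argument are assumptions):

// Usage sketch: building a pool and reading the init-time counters (hypothetical payload/arguments)
tran::system transys { 64 };                       // assumed: max descriptor count
lockfree::freelist<my_entry> pool (transys, /* block_size = */ 128, /* initial_block_count = */ 2);
// After the constructor returns:
//   pool.get_available_count ()  == 2 * 128   (initial_block_count blocks, immediately claimable)
//   pool.get_backbuffer_count () == 128       (the one spare block held aside)
//   pool.get_alloc_count ()      == 3 * 128   (initial_block_count + 1 blocks in total)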
// freelist<T>::claim (descriptor flavor) — src/base/lockfree_freelist.hpp:291
free_node *claim (tran::descriptor &tdes)
{
tdes.start_tran (); // bracket
tdes.reclaim_retired_list (); // amortized cleanup
free_node *node;
size_t count = 0;
for (node = pop_from_available (); node == NULL && count < 100;
node = pop_from_available (), ++count) {
// back-buffer allocator was preempted for a long time — force the swap
swap_backbuffer ();
}
while (node == NULL) {
// back-buffer still failed; force-allocate directly into available
force_alloc_block ();
node = pop_from_available ();
}
m_available_count--;
check_my_pointer (node);
return node;
}

The escalation ladder:

  1. Fast path. Pop from m_available_list. Most calls succeed here.
  2. Empty available list, pop returns NULL. Up to 100 retry cycles, each calling swap_backbuffer() to move the back-buffer’s block into the available list. If the back-buffer is also empty (because the back-buffer allocator was preempted between swap and rebuild), swap_backbuffer is a no-op and the next iteration tries pop again.
  3. After 100 empty-available + empty-back-buffer cycles, force-allocate. force_alloc_block does alloc_list + push_to_list (m_available_list) directly — bypasses the back-buffer protocol entirely. This is the “back-buffer allocator deeply preempted” escape hatch.

The tdes.start_tran() at the top is the EBR bracket — every read of m_available_list, every CAS-pop, happens inside the descriptor’s transaction window. The bracket is not ended inside claim; the contract is that the transaction will remain started (the header comment says as much) — the caller is responsible for ending it, typically by inserting the claimed node into a structure and ending the tran inside the insert.
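The caller-side shape is therefore roughly as follows (a sketch; end_tran () as the name of the descriptor’s closing call, and the helper functions, are assumptions):

// Caller sketch: claim leaves the EBR bracket open; the caller closes it after publishing the node
auto *node = pool.claim (tdes);          // tdes: the caller's tran::descriptor
init_payload (node->get_data ());        // hypothetical payload setup
publish (*node);                         // e.g. CAS the node into a lock-free structure
tdes.end_tran ();                        // assumed name: end the transaction once the node is reachable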

pop_from_available — the documented ABA window

// freelist<T>::pop_from_available — src/base/lockfree_freelist.hpp:322
free_node *pop_from_available ()
{
free_node *rhead = NULL, *rhead_copy = NULL, *next;
do {
rhead = m_available_list;
if (rhead == NULL) return NULL;
next = rhead->get_freelist_next ();
rhead_copy = rhead;
// todo: this is a dangerous preemption point; if I am preempted here, and
// thread 2 comes and does:
// - second thread gets same rhead and successfully moves m_available_list
// to next
// - third thread gets next and successfully moves m_available_list to
// next->next
// - second thread retires rhead. m_available_list becomes rhead and its
// next becomes next->next
// - I wake up, compare exchange m_available_list successfully because it
// is rhead again, but next will become the item third thread already
// claimed.
} while (!m_available_list.compare_exchange_weak (rhead_copy, next));
rhead->reset_freelist_next ();
return rhead;
}

The comment names the classical ABA scenario verbatim. The CAS on m_available_list succeeds when the head is rhead again — but if rhead was popped, retired, and re-pushed in between, our saved next pointer is stale and we corrupt the list.

The mitigation is time, not algorithm. A retired node must travel:

  1. Through the descriptor’s retired ring,
  2. Wait for min_active_tranid > retire_tranid (which means no reader’s transaction window predates the retirement),
  3. Get reclaimed via reclaim(),
  4. Be pushed back to m_available_list.

The back-buffer pattern adds further slack: a retiree travels descriptor retired ring → reclaim() → the freelist’s available list, while a freshly built block sits in the back-buffer until swap_backbuffer moves it in. The total delay before an address can reappear at the head is empirically long enough that the ABA window in pop_from_available is not hit in practice, but it is not eliminated by the algorithm.

The todo flag in the comment is the design’s open issue. A tagged-pointer redesign (lower 16 bits as a generation tag in the head pointer) would close the window at the cost of constraining free_node * alignment. CUBRID has not adopted this; the back-buffer time is the production mitigation.
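For reference, the tagged-pointer variant would look roughly like this (a sketch, not CUBRID code). It widens the head into a pointer-plus-generation pair so that a recycled address no longer compares equal to the stale snapshot:

// Sketch: a generation-tagged head that closes the pop-side ABA window (illustrative only)
#include <atomic>
#include <cstdint>

struct node { node *next; };

struct tagged_head
{
  node *ptr;
  std::uintptr_t tag;                // bumped on every successful pop
};

std::atomic<tagged_head> g_head { {nullptr, 0} };   // lock-free only where a 16-byte CAS exists

node *pop_tagged ()
{
  tagged_head cur = g_head.load ();
  tagged_head next;
  do
    {
      if (cur.ptr == nullptr)
        return nullptr;
      next.ptr = cur.ptr->next;      // may be stale if cur.ptr was recycled meanwhile ...
      next.tag = cur.tag + 1;        // ... but then the head's tag has moved on and the CAS fails
    }
  while (!g_head.compare_exchange_weak (cur, next));
  return cur.ptr;
}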

swap_backbuffer — moving the spare block in

// freelist<T>::swap_backbuffer — src/base/lockfree_freelist.hpp:169
void swap_backbuffer ()
{
free_node *bb_head = m_backbuffer_head;
if (bb_head == NULL) return; // somebody already swapped
free_node *bb_head_copy = bb_head;
if (!m_backbuffer_head.compare_exchange_strong (bb_head_copy, NULL))
return; // somebody else is swapping
free_node *bb_tail = m_backbuffer_tail.exchange (NULL);
assert (bb_tail != NULL);
m_available_count += m_bb_count.exchange (0);
push_to_list (*bb_head, *bb_tail, m_available_list);
alloc_backbuffer (); // start building next back-buffer
}

The protocol:

  1. Claim the back-buffer. CAS m_backbuffer_head from the current pointer to NULL. Failure means someone else is already swapping; the loser returns immediately.
  2. Detach the tail. Atomic exchange m_backbuffer_tail to NULL. The pre-CAS head and the exchanged tail bracket the block to splice.
  3. Splice into available. push_to_list is the standard CAS-loop push — set the block tail’s next to the current available head, then CAS the available head to the block head, retrying on failure.
  4. Start building the next back-buffer. alloc_backbuffer new-s a fresh block and CAS-installs it in the now-empty m_backbuffer_head/m_backbuffer_tail.

The alloc_backbuffer is what hides the malloc/init cost from the next claimer: by the time swap_backbuffer returns, the hot path (pop_from_available) has already been refilled, and the next swap’s destination block is being built in parallel.
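push_to_list itself is not excerpted above; its shape is the standard multi-node Treiber push, roughly as follows (a sketch consistent with step 3 of the protocol; the real implementation is at lockfree_freelist.hpp:368):

// Sketch: the CAS-loop push that splices a [head, tail] chain onto an atomic stack head
void push_to_list (free_node &head, free_node &tail, std::atomic<free_node *> &dest)
{
  free_node *old_top = dest.load ();
  do
    {
      tail.set_freelist_next (old_top);                  // the chain's tail points at the current top
    }
  while (!dest.compare_exchange_weak (old_top, &head));  // a failed CAS refreshes old_top; retry
}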

// freelist<T>::force_alloc_block — src/base/lockfree_freelist.hpp:213
void force_alloc_block ()
{
free_node *new_head = NULL, *new_tail = NULL;
alloc_list (new_head, new_tail);
m_available_count += m_block_size;
++m_forced_alloc_count;
push_to_list (*new_head, *new_tail, m_available_list);
}

Bypasses the back-buffer entirely; allocates and pushes straight to available. Triggered by claim’s 100-iteration escape from the back-buffer-empty loop. Tracked by m_forced_alloc_count so monitoring can flag pools that hit this path.

retire — the public retirement primitive

// freelist<T>::retire — src/base/lockfree_freelist.hpp:358
void retire (tran::descriptor &tdes, free_node &node)
{
assert (node.get_freelist_next () == NULL);
++m_retired_count;
check_my_pointer (&node);
tdes.retire_node (node); // hand to descriptor's retired ring
}

The freelist defers to tdes.retire_node(), which:

  • Bumps the global tranid (via start_tran_and_increment_id).
  • Amortized-reclaims the descriptor’s existing retired prefix.
  • Stamps the node’s m_retire_tranid and links it to the descriptor’s retired ring.

The freelist’s m_retired_count is a global counter (atomic inc on retire, atomic dec on reclaim) — useful for observability but not for correctness.

// freelist<T>::free_node::reclaim — src/base/lockfree_freelist.hpp:523
void free_node::reclaim ()   // declared 'final override' in the class (hpp:112)
{
m_t.on_reclaim (); // payload's cleanup hook
m_retired_next = NULL; // detach from retire ring
--m_owner->m_retired_count;
++m_owner->m_available_count;
m_owner->push_to_list (*this, *this, m_owner->m_available_list);
}

Three steps:

  1. Payload cleanup. m_t.on_reclaim(). For the hashmap’s freelist_node_data, this calls the per-table lf_entry_descriptor::f_uninit. For other consumers it is whatever destructor-like routine the type defines.
  2. Counters. Decrement retired, increment available.
  3. Push. Standard CAS-loop push to m_available_list. The node is now reclaimable through pop_from_available.

The final override qualifier is deliberate — no further override is allowed. This is the only place a node transitions from “retired” to “claimable.”

clear_free_nodes — destructor cleanup, not thread-safe

// freelist<T>::clear_free_nodes — src/base/lockfree_freelist.hpp:267
void clear_free_nodes () // not thread safe!
{
final_sanity_checks ();
dealloc_list (m_backbuffer_head.load ());
m_backbuffer_head = NULL;
m_backbuffer_tail = NULL;
dealloc_list (m_available_list.load ());
m_available_list = NULL;
m_available_count = m_bb_count = m_alloc_count = 0;
}

Used only by the destructor; assumes no other thread is touching the freelist. The header comment makes this explicit. Reaches into both lists and deletes each free_node directly, then zeros the counters. The final_sanity_checks (only enabled in debug) verifies the available + back-buffer counts equal m_alloc_count and walks both lists to confirm.

The freelist tracks m_alloc_count (total ever allocated), m_available_count (currently claimable), m_bb_count (currently in back-buffer), and m_retired_count (currently waiting for reclamation). The claimed count — currently held by a caller — is derived:

size_t get_claimed_count () const {
size_t alloc = m_alloc_count;
size_t unused = m_available_count + m_bb_count + m_retired_count;
return (alloc > unused) ? (alloc - unused) : 0;
}

The counters that feed the unused aggregate are each loaded with seq_cst, but the arithmetic as a whole is not atomic, so a fast burst of retires and claims can make the difference transiently negative (hence the clamp to 0). For monitoring, the value is good enough.

flowchart LR
  Caller["Caller (e.g., hashmap insert)"]
  Caller --> Claim["claim(tdes)"]
  Claim --> Pop["pop_from_available()"]
  Pop -- empty --> Swap["swap_backbuffer()"]
  Swap --> BB["m_backbuffer_head/tail"]
  Swap -.alloc next.-> AllocBB["alloc_backbuffer()"]
  Swap --> Avail["m_available_list"]
  Pop --> Avail
  Swap -- still empty --> Force["force_alloc_block()"]
  Force --> Avail

  Caller --> Retire["retire(tdes, node)"]
  Retire --> Tdes["tran::descriptor"]
  Tdes --> RetRing["m_retired_head/tail ring"]
  RetRing -.tranid bump + min-scan.-> Reclaim["free_node::reclaim()"]
  Reclaim --> OnReclaim["T::on_reclaim()"]
  Reclaim --> Avail

  Avail --> Pop

| Counter | Meaning | Updated when |
| --- | --- | --- |
| m_alloc_count | Total free_nodes ever created | alloc_list (block alloc) |
| m_available_count | Currently in m_available_list | pop_from_available (-1), reclaim (+1), swap (+block_size) |
| m_bb_count | Currently in back-buffer | alloc_backbuffer (+block_size), swap (zeroed via exchange) |
| m_forced_alloc_count | How many times the back-buffer escape hatch fired | force_alloc_block |
| m_retired_count | Currently in some descriptor’s retired ring | retire (+1), reclaim (-1) |

A healthy freelist has small m_forced_alloc_count (ideally 0 under steady state), m_retired_count bounded by reclamation latency, and m_available_count + m_bb_count close to one or two block sizes.
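Those rules of thumb translate directly into a monitoring probe over the public getters (a sketch; the output format is illustrative):

// Sketch: dumping the freelist's statistics getters for monitoring
#include <cstdio>

template <class T>
void report_freelist (const lockfree::freelist<T> &fl)
{
  std::printf ("alloc=%zu available=%zu backbuffer=%zu retired=%zu claimed=%zu forced=%zu\n",
               fl.get_alloc_count (), fl.get_available_count (), fl.get_backbuffer_count (),
               fl.get_retired_count (), fl.get_claimed_count (), fl.get_forced_allocation_count ());
  // red flags: forced > 0 under steady load, or retired growing without bound
}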

The legacy lf_freelist (lock_free.h:231+) does the same job with a different shape:

struct lf_freelist {
void *available; // single available stack head (Treiber)
int block_size;
int alloc_cnt;
int available_cnt;
int retired_cnt;
LF_ENTRY_DESCRIPTOR *entry_desc; // takes f_alloc/f_free callbacks
LF_TRAN_SYSTEM *tran_system; // shared global, not owned
};

Differences from the modern design:

  • No back-buffer. The legacy freelist allocates blocks on-demand inside lf_freelist_claim (when lf_stack_pop (&freelist->available, edesc) returns NULL) via lf_freelist_alloc_block. The header comment acknowledges multiple threads racing to alloc a block at once, calling it “acceptable given low enough block_size” (lock_free.c:840-842).
  • f_alloc/f_free callbacks instead of new/delete. Each block is built by calling the descriptor’s f_alloc per-entry; reclamation calls f_free. Modern freelist owns its own typed allocations.
  • max_alloc_cnt cap. The legacy freelist tracks alloc_cnt and refuses to grow past entry_desc->max_alloc_cnt (default 2,147,483,647). When the cap is hit and a retire happens, the legacy lf_freelist_transport f_frees the retiree instead of recycling. The modern freelist has no cap; once allocated, blocks live for the freelist’s lifetime.
  • Shared LF_TRAN_SYSTEM. A single legacy LF_TRAN_SYSTEM underlies many freelists in the same domain. Modern freelist owns its tran::table; reclamation pressure is per-pool.
  • lf_freelist_transport is the legacy reclamation walker. Walks the per-thread retired list, partitioning into “free this” (caps exceeded) and “transport to available” (still under cap). Modern code calls descriptor::reclaim_retired_list and the per-node reclaim() virtual.

The migration path: when a hashmap consumer flips from the legacy to the modern hashmap (via enable_new_lfhash), it trades the global LF_TRAN_SYSTEM + lf_freelist for a per-table tran::table + lockfree::freelist. The behavior is observably the same (entries claimed, retired, recycled); the bookkeeping is simpler.

Every atomic in the freelist uses seq_cst defaults. The hot operations are:

  • m_available_list.compare_exchange_weak (pop and push) — seq_cst.
  • m_backbuffer_head.compare_exchange_strong (claim back-buffer for swap) — seq_cst.
  • m_backbuffer_tail.exchange (detach tail during swap) — seq_cst.
  • Counter ++ / -- / += — seq_cst RMW.

On x86 the seq_cst overhead is small (a couple of additional fences); on ARM it is non-trivial. No measurement campaign exists for tightening to acquire/release.
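If such a campaign were run, the pop loop is the obvious first candidate; a hedged sketch of what explicit orderings could look like there (not a recommendation, and not CUBRID’s current code):

// Sketch: explicit acquire orderings on the pop CAS (CUBRID currently relies on seq_cst defaults)
free_node *pop_from_available_relaxed ()
{
  free_node *top = m_available_list.load (std::memory_order_acquire);
  free_node *next;
  do
    {
      if (top == NULL)
        return NULL;
      next = top->get_freelist_next ();
    }
  while (!m_available_list.compare_exchange_weak (top, next,
                                                  std::memory_order_acquire,   // success: observe the pushed node's links
                                                  std::memory_order_acquire)); // failure: top is re-read with acquire
  top->reset_freelist_next ();
  return top;
}
// the matching push would need memory_order_release on its compare_exchange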

  • lockfree::freelist<T> (class template) — lockfree_freelist.hpp:42
  • freelist::freelist (tran::system &, size_t, size_t):138
  • freelist::~freelist:259
  • freelist::claim (tran::descriptor &):291
  • freelist::claim (tran::index):284
  • freelist::retire (tran::descriptor &, free_node &):358
  • freelist::retire (tran::index, free_node &):351
  • freelist::get_alloc_count / get_available_count / ...:382+
  • freelist::get_transaction_system / get_transaction_table:433 / :440
  • freelist<T>::free_node:104
  • free_node::get_data:534
  • free_node::reclaim:523 (the final override)
  • freelist::swap_backbuffer:169
  • freelist::alloc_backbuffer:194
  • freelist::force_alloc_block:213
  • freelist::alloc_list:227
  • freelist::dealloc_list:247
  • freelist::pop_from_available:322
  • freelist::push_to_list:368
  • freelist::clear_free_nodes:267
  • freelist::final_sanity_checks:447
  • freelist::check_my_pointer:476

| Symbol | File | Line |
| --- | --- | --- |
| lockfree::freelist<T> (class template) | src/base/lockfree_freelist.hpp | 42 |
| freelist<T>::free_node (class) | src/base/lockfree_freelist.hpp | 104 |
| freelist::freelist (constructor) | src/base/lockfree_freelist.hpp | 138 |
| freelist::swap_backbuffer | src/base/lockfree_freelist.hpp | 169 |
| freelist::alloc_backbuffer | src/base/lockfree_freelist.hpp | 194 |
| freelist::force_alloc_block | src/base/lockfree_freelist.hpp | 213 |
| freelist::alloc_list | src/base/lockfree_freelist.hpp | 227 |
| freelist::~freelist | src/base/lockfree_freelist.hpp | 259 |
| freelist::clear_free_nodes | src/base/lockfree_freelist.hpp | 267 |
| freelist::claim (descriptor) | src/base/lockfree_freelist.hpp | 291 |
| freelist::pop_from_available | src/base/lockfree_freelist.hpp | 322 |
| freelist::retire (descriptor) | src/base/lockfree_freelist.hpp | 358 |
| freelist::push_to_list | src/base/lockfree_freelist.hpp | 368 |
| freelist::get_claimed_count | src/base/lockfree_freelist.hpp | 418 |
| free_node::reclaim | src/base/lockfree_freelist.hpp | 523 |
| lf_freelist_alloc_block (legacy) | src/base/lock_free.c | 621 |
| lf_freelist_claim (legacy) | src/base/lock_free.c | 760 |
| lf_freelist_retire (legacy) | src/base/lock_free.c | 873 |
| lf_freelist_transport (legacy) | src/base/lock_free.c | 934 |
  • The freelist owns its tran::table. Constructor calls m_trantable = new tran::table (transys) at lockfree_freelist.hpp:141; destructor delete m_trantable at :261.
  • Block size is at least 2. Constructor enforces assert (block_size > 1) at :152 and the initial_block_count <= 1 branch at :154-158 halves block size and doubles count to ensure ≥ 2 blocks at init.
  • pop_from_available’s ABA window is documented in the source. Multi-line comment at :336-342 describing the exact race scenario. No tagged-pointer mitigation in the current implementation.
  • force_alloc_block is the escape hatch after 100 back-buffer-empty cycles. claim at :298 loops up to 100 swap attempts before falling through to the force_alloc_block while-loop at :306.
  • free_node::reclaim is final override. Declared at lockfree_freelist.hpp:112. No further overrides allowed.
  • Statistics are seq_cst atomics. All counters (m_alloc_count, m_available_count, m_bb_count, m_forced_alloc_count, m_retired_count) declared as std::atomic<size_t> at :82-86.
  • claim does not end the transaction. The header comment at :53 says “transaction will remain started!” Caller is responsible for ending it.
  • The free node’s m_retired_next is reused as the freelist link. set_freelist_next / get_freelist_next at :502-518 cast through the inherited tran::reclaimable_node::m_retired_next.
  1. Has the pop_from_available ABA window ever been hit in production? The window is documented in a source comment, but there is no test evidence either way. A targeted stress test (a deliberate SIGSTOP/SIGCONT on a worker thread between the read of next and the CAS) would tell us empirically. If it has not been hit in N production-machine-years, the back-buffer delay is the de-facto mitigation; if it has, the tagged-pointer redesign moves up the priority list.
  2. Why the magic number 100 in the swap-loop limit? The count < 100 cap before falling through to force_alloc_block is not motivated anywhere in the source. Perhaps tuned empirically; perhaps a placeholder. Worth a focused review under stress.
  3. Cap-on-allocation parity with legacy. Legacy lf_freelist honours entry_desc->max_alloc_cnt. Modern freelist<T> has no cap; once a block is allocated, it lives for the freelist’s lifetime. For long-running servers with bursty load, this is a memory-watermark concern. Investigation: does any consumer rely on the legacy cap to bound memory? If yes, the modern freelist needs a cap-and-shrink path.
  4. alloc_backbuffer’s use of compare_exchange_strong on a NULL expectation. At :204, the function does m_backbuffer_tail.compare_exchange_strong (dummy_null, new_bb_tail) — succeeds only if the tail is currently NULL (which is true after swap_backbuffer exchanged it to NULL). On the first init call the tail is also NULL. But what if a second alloc_backbuffer raced (e.g., from two near-simultaneous swaps)? The CAS would fail, and the second alloc_backbuffer would push_to_list to the already-installed back-buffer head. Behavior under that race is plausible-but-not-obviously-correct; worth a focused review.
  5. final_sanity_checks invariant under load. The debug-only check at :447-472 asserts m_available_count + m_bb_count == m_alloc_count. This ignores m_retired_count — i.e., it assumes no retires are in flight at destruction time. The destructor runs after every consumer has stopped using the freelist (via clear_free_nodes), so this is a reasonable precondition; but if a future change calls final_sanity_checks mid-life, the assertion is wrong.

Beyond CUBRID — Comparative Designs & Research Frontiers

  • Tagged pointers to close the pop_from_available ABA window. Stuff a 16-bit generation counter into the upper bits of free_node * (or use std::atomic<__int128>). Every push bumps the tag; every pop checks tag equality. Eliminates the documented ABA hazard at the cost of pointer-arithmetic discipline.
  • Hazard-pointer-based freelist (Michael 2004). Per-thread hazard pointer slot for the popped node; reclaim only when no thread has the address as a hazard. Cheaper read-side publication than EBR for hot workloads.
  • Per-CPU sharded freelist with stealing. Replace the single m_available_list with one per CPU (sched_getcpu()). Allocate from local; on local-empty, steal from a neighbor. Removes the cache-line bouncing on the head pointer at the cost of more memory and more complex sanity-check accounting.
  • Linked block lists rather than per-node lists. The freelist’s available list could be a list of blocks of nodes rather than individual nodes. Each block holds K nodes; pop gives one node from the head block (move to next when empty). Cuts CAS frequency by K. Requires the block to be the unit of cache-line concern, not the node.
  • Cap-and-shrink. Track high-water mark of in-use count; when the freelist is overprovisioned (in-use much smaller than allocated), shrink during a clear or a maintenance daemon tick. Would close the legacy parity gap (open question 3).
  • Source files (in the CUBRID source tree at /data/hgryoo/references/cubrid/):
    • src/base/lockfree_freelist.hpp (header-only template, the entire implementation)
    • src/base/lockfree_transaction_descriptor.hpp (the descriptor freelist::claim/retire rides on)
    • src/base/lockfree_transaction_table.hpp (the table freelist constructs internally)
    • src/base/lockfree_transaction_reclaimable.hpp (the base class free_node derives from)
    • src/base/lockfree_hashmap.hpp (primary consumer, via freelist<freelist_node_data>)
    • src/base/lock_free.h, src/base/lock_free.c (legacy lf_freelist counterpart)
  • Textbook chapters cited:
    • Herlihy & Shavit, The Art of Multiprocessor Programming, ch. 11 §“Concurrent Stacks and Elimination” — the Treiber-stack pattern at the core of pop_from_available and push_to_list.
  • Foundational papers:
    • Treiber, R. K. “Systems Programming: Coping with Parallelism.” IBM Almaden RJ 5118, 1986 — the classical Treiber stack and its ABA exposure.
    • Michael, M. M. “Hazard Pointers: Safe Memory Reclamation for Lock-Free Objects.” IEEE TPDS 15(8), 2004 — the alternative reclamation model the freelist does not use.
  • Sibling docs in this repo:
    • cubrid-lockfree-overview.md — umbrella, two-generation map.
    • cubrid-lockfree-transaction.md — the reclamation framework the freelist plugs into.
    • cubrid-lockfree-hashmap.md — primary consumer; explains freelist_node_data::on_reclaim.