CUBRID Lock-Free Freelist — Typed Node Pool with Back-Buffer Block Allocator
Contents:
- Theoretical Background
- Common DBMS Design
- CUBRID’s Approach
- Source Walkthrough
- Source verification (as of 2026-05-07)
- Beyond CUBRID — Comparative Designs & Research Frontiers
- Sources
Theoretical Background
A freelist is the simplest possible memory pool: a singly linked list of free objects. Allocate = pop the head; release = push to the head. The Treiber-stack protocol (one CAS to pop, one CAS to push) applies directly. The textbook coverage is in Herlihy & Shavit, The Art of Multiprocessor Programming, ch. 11 §“Concurrent Stacks and Elimination” — the same shape, applied to typed reusable objects rather than abstract elements.
The interesting design tensions are:
- What goes on the head? A single global head bounces between cores on every alloc/release. The standard mitigations are elimination (pair an alloc with a release without touching the head) and partitioning (per-CPU freelists with cross-shard stealing).
- When does the freelist grow? A fixed-size pool is simplest but exhausts under load. A growable pool needs a block allocator and a strategy for when to call it. Allocating a block on demand at exhaustion stalls the calling thread; pre-allocating once at init wastes memory.
- How is ABA prevented? The Treiber stack’s classic ABA problem — pop reads head, computes head->next, CAS-bumps head to head->next — is hit head-on by every freelist that recycles addresses. The mitigations are tagged pointers (extra bits in the head word), hazard pointers (per-thread published references), or transactional reclamation (defer reuse until no reader could be holding the popped node).
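To make the hazard concrete, here is a minimal Treiber-style pop in C++ (a generic sketch, not CUBRID code); the window between reading next and the CAS is exactly where ABA bites:

```cpp
#include <atomic>

struct node
{
  node *next;
};

std::atomic<node *> head;

// Generic Treiber-stack pop (sketch). If *old_head is freed and a node is
// later re-pushed at the same address, the CAS succeeds with a stale
// 'next': the classic ABA corruption.
node *pop ()
{
  node *old_head, *next;
  do
    {
      old_head = head.load ();
      if (old_head == nullptr)
        {
          return nullptr;
        }
      next = old_head->next;   // <-- preemption here opens the ABA window
    }
  while (!head.compare_exchange_weak (old_head, next));
  return old_head;
}
```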
CUBRID’s lockfree::freelist<T> picks: a single available list
head, a one-block back-buffer for lazy growth, and the
tran::table / tran::descriptor framework for ABA-bounded
reuse. The back-buffer is the design’s distinctive feature.
Back-buffer as growth amortization
A naive claim() that finds the available list empty must
either fail, spin, or allocate a new block. Allocation is
expensive (a malloc + N typed-init calls). Concurrent allocators
also race — three threads simultaneously discovering an empty
list each call alloc_block, getting three blocks where one
would suffice.
The back-buffer pattern: keep one already-built block held aside. When the available list drains, atomic-swap the back-buffer into the available list and asynchronously build a new back-buffer. The race is now bounded — at most one thread gets to swap the back-buffer; the others see the swap completed and try the available list again, which by then has the just-swapped block.
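A minimal sketch of the bounded race (generic, written for this doc rather than taken from the CUBRID code shown later): the exchange hands the whole spare block to exactly one winner.

```cpp
#include <atomic>

struct node { node *next; };

std::atomic<node *> available_head {nullptr};
std::atomic<node *> backbuffer_head {nullptr};  // pre-built spare block

// Sketch: the exchange gives the spare block to exactly one winner; losers
// see nullptr and go back to popping the available list.
bool try_swap_backbuffer ()
{
  node *block = backbuffer_head.exchange (nullptr);
  if (block == nullptr)
    {
      return false;                 // someone else already swapped
    }
  node *tail = block;
  while (tail->next != nullptr)     // sketch walks to the tail; CUBRID keeps
    {                               // an explicit tail pointer instead
      tail = tail->next;
    }
  node *old_head = available_head.load ();
  do
    {
      tail->next = old_head;        // standard CAS-loop chain splice
    }
  while (!available_head.compare_exchange_weak (old_head, block));
  // a real implementation now rebuilds the spare block for the next drain
  return true;
}
```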
CUBRID’s freelist runs the back-buffer pre-allocation eagerly:
on every successful swap, the freelist immediately allocates
the next back-buffer. So at steady state the freelist holds
two blocks of memory: one in m_available_list, one in
m_backbuffer_*.
on_reclaim — typed cleanup before recycle
A freelist that recycles bare bytes (just void* from
malloc/free) imposes no contract on the allocator. A typed
freelist must let the type clean itself between uses — close
file handles, release sub-allocations, scrub state. The shape
the freelist enforces:
- The payload type T must declare an on_reclaim() method (compile-time check via SFINAE on the template instantiation).
- The freelist’s reclaim() calls m_t.on_reclaim() before pushing the node back to the available list.
The lockfree::hashmap’s freelist_node_data is the canonical
example: its on_reclaim calls the per-table
lf_entry_descriptor::f_uninit callback, so a hashmap entry’s
custom destructor (releasing a sub-mutex, freeing a payload
buffer, etc.) runs at every recycle.
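For illustration, a hypothetical payload satisfying this contract (cached_handle and its members are invented for this sketch, not from the CUBRID tree):

```cpp
#include <cstdio>

// Hypothetical payload type for freelist<T>; only the on_reclaim() method
// is required by the contract.
struct cached_handle
{
  std::FILE *m_file = nullptr;

  // called by free_node::reclaim() before the node is pushed back to the
  // available list; scrub state so the next claimer gets a clean object
  void on_reclaim ()
  {
    if (m_file != nullptr)
      {
        std::fclose (m_file);
        m_file = nullptr;
      }
  }
};

// usage: lockfree::freelist<cached_handle> pool (transys, /*block_size*/ 64);
```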
Common DBMS Design
Per-data-structure free lists
Every DBMS uses pooled allocators for its lock-free entries — malloc on every insert and free on every delete is too slow. The shapes:
- PostgreSQL uses MemoryContext arenas keyed by lifetime (top-level / per-transaction / per-tuple). Not lock-free; the context is per-backend.
- InnoDB uses mem_heap_t arenas + a private freelist for hash entries (hash_table_t::heap). Mutex-protected.
- MySQL has slab allocators for fixed-size objects.
- CUBRID uses one freelist per lock-free hashmap (and per any other structure that wants pooled typed nodes), riding on tran::table reclamation.
The CUBRID approach is per-data-structure, not per-thread or per-transaction. The trade: each hashmap has its own pool with its own waste rate, but the pools never bleed cross-structure (retiring a hashmap clears only its own freelist).
Block allocator + back-buffer
The pattern of “allocate K objects at once, push as a block” is
standard. The variant of “keep one block on the side” is less
universal. Java’s ConcurrentLinkedDeque does not; Folly’s
SmallVector does (capacity grows by doubling, holding the
new buffer aside until the old is drained). CUBRID’s freelist
is on the back-buffer side because the typical workload is
bursty retirement (vacuum sweep, hashmap clear) followed by
bursty claim (replay, repopulate) — the back-buffer absorbs
the asymmetry.
Theory ↔ CUBRID mapping
| Concept | CUBRID name |
|---|---|
| Typed node pool | lockfree::freelist<T> |
| Pool node | freelist<T>::free_node (derives from tran::reclaimable_node) |
| Available stack head | m_available_list (std::atomic<free_node *>) |
| Back-buffer head/tail | m_backbuffer_head, m_backbuffer_tail (std::atomic<free_node *>) |
| Block size (init param) | m_block_size (≥ 2 enforced) |
| On-reclaim hook | T::on_reclaim () (caller-provided) |
| Per-pool reclamation | m_trantable (each freelist owns its own tran::table) |
| Allocate from available | pop_from_available () |
| Allocate via swap | swap_backbuffer () |
| Forced allocation | force_alloc_block () (bypass back-buffer) |
| Statistics: alloc count | m_alloc_count |
| Statistics: available count | m_available_count |
| Statistics: back-buffer count | m_bb_count |
| Statistics: forced allocations | m_forced_alloc_count |
| Statistics: retired count | m_retired_count |
CUBRID’s Approach
Two classes — freelist<T> and freelist<T>::free_node
```cpp
// lockfree::freelist<T> — src/base/lockfree_freelist.hpp:42
template <class T>
class freelist
{
  public:
    class free_node;

    freelist () = delete;
    freelist (tran::system &transys, size_t block_size, size_t initial_block_count = 1);
    ~freelist ();

    free_node *claim (tran::descriptor &);
    free_node *claim (tran::index);

    void retire (tran::descriptor &, free_node &);
    void retire (tran::index, free_node &);

    // statistics:
    size_t get_alloc_count () const;
    size_t get_available_count () const;
    size_t get_backbuffer_count () const;
    size_t get_forced_allocation_count () const;
    size_t get_retired_count () const;
    size_t get_claimed_count () const;  // = alloc - (available + backbuffer + retired)

    tran::system &get_transaction_system ();
    tran::table &get_transaction_table ();

  private:
    tran::system &m_transys;
    tran::table *m_trantable;           // owned: deleted in destructor
    size_t m_block_size;
    std::atomic<free_node *> m_available_list;
    std::atomic<free_node *> m_backbuffer_head;
    std::atomic<free_node *> m_backbuffer_tail;
    std::atomic<size_t> m_available_count;
    std::atomic<size_t> m_alloc_count;
    std::atomic<size_t> m_bb_count;
    std::atomic<size_t> m_forced_alloc_count;
    std::atomic<size_t> m_retired_count;

    void swap_backbuffer ();
    void alloc_backbuffer ();
    void force_alloc_block ();
    void alloc_list (free_node *&head, free_node *&tail);
    void dealloc_list (free_node *head);
    free_node *pop_from_available ();
    void push_to_list (free_node &head, free_node &tail, std::atomic<free_node *> &dest);
    void clear_free_nodes ();
    void final_sanity_checks () const;
    void check_my_pointer (free_node *node);
};
```

```cpp
// lockfree::freelist<T>::free_node — src/base/lockfree_freelist.hpp:104
template <class T>
class freelist<T>::free_node : public tran::reclaimable_node
{
  public:
    free_node ();
    ~free_node () = default;

    T &get_data ();
    void reclaim () final override;     // recycles instead of delete

  private:
    friend freelist;

    void set_owner (freelist &m_freelist);
    void set_freelist_next (free_node *next);
    void reset_freelist_next ();
    free_node *get_freelist_next ();

    freelist *m_owner;
    T m_t;                              // payload
};
```

The free_node derives from tran::reclaimable_node, so a
retired node lives on the descriptor’s retired ring through the
inherited m_retired_next field. When reclamation eventually
fires, the inherited reclaim() virtual is the recycle path
that pushes back to m_available_list.
The clever overload of m_retired_next:
- While the node is on the descriptor’s retired ring, m_retired_next is the retire-ring link.
- While the node is on the freelist’s available list, m_retired_next is the freelist link.
- A node is on exactly one of those at a time, so the field is reused.
free_node::set_freelist_next and get_freelist_next are
private accessors that present m_retired_next as a freelist
link with the correct return type (free_node *, not the base
reclaimable_node *).
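In spirit, the accessor pair looks like this (a sketch of the shape at :502-518, assuming the base class exposes m_retired_next to the derived class):

```cpp
// Sketch of the accessor pair: the base class's retire-ring link doubles
// as the freelist link; the accessors just narrow the type.
void set_freelist_next (free_node *next)
{
  m_retired_next = next;                             // reuse the base field
}

free_node *get_freelist_next ()
{
  return static_cast<free_node *> (m_retired_next);  // narrow to free_node *
}
```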
Construction — eager pre-allocation
```cpp
// freelist<T>::freelist — src/base/lockfree_freelist.hpp:138
freelist (tran::system &transys, size_t block_size, size_t initial_block_count = 1)
  : m_transys (transys)
  , m_trantable (new tran::table (transys))   // OWN tran table
  , m_block_size (block_size)
  /* counters initialized to 0 */
{
  if (initial_block_count <= 1)
    {
      m_block_size /= 2;
      initial_block_count = 2;
    }

  alloc_backbuffer ();
  for (size_t i = 0; i < initial_block_count; i++)
    {
      swap_backbuffer ();   // pre-fill available list
    }
}
```

Three things happen at init:
- The freelist creates its own tran::table from the shared tran::system. Every freelist is a distinct reclamation domain — retiring nodes in freelist A does not bump the global tranid in freelist B.
- The “block size at least 2 with initial blocks at least 2” rule. If the caller asks for initial_block_count = 1, the constructor halves the block size and doubles the block count. This guarantees there is always at least one block in the back-buffer and one block in the available list at steady state.
- The init eagerly fills the available list to initial_block_count blocks. The back-buffer is one extra block beyond what was requested. After init, the freelist contains initial_block_count + 1 total blocks of memory, initial_block_count of which are immediately claimable.
claim — the hot allocate path
```cpp
// freelist<T>::claim (descriptor flavor) — src/base/lockfree_freelist.hpp:291
free_node *claim (tran::descriptor &tdes)
{
  tdes.start_tran ();             // bracket
  tdes.reclaim_retired_list ();   // amortized cleanup

  free_node *node;
  size_t count = 0;
  for (node = pop_from_available (); node == NULL && count < 100;
       node = pop_from_available (), ++count)
    {
      // back-buffer allocator was preempted for a long time — force the swap
      swap_backbuffer ();
    }
  while (node == NULL)
    {
      // back-buffer still failed; force-allocate directly into available
      force_alloc_block ();
      node = pop_from_available ();
    }

  m_available_count--;
  check_my_pointer (node);
  return node;
}
```

The escalation ladder:
- Fast path. Pop from m_available_list. Most calls succeed here.
- Empty available list, pop returns NULL. Up to 100 retry cycles, each calling swap_backbuffer() to move the back-buffer’s block into the available list. If the back-buffer is also empty (because the back-buffer allocator was preempted between swap and rebuild), swap_backbuffer is a no-op and the next iteration tries pop again.
- After 100 empty-available + empty-back-buffer cycles, force-allocate. force_alloc_block does alloc_list + push_to_list (m_available_list) directly — bypasses the back-buffer protocol entirely. This is the “back-buffer allocator deeply preempted” escape hatch.
The tdes.start_tran() at the top is the EBR bracket — every
read of m_available_list, every CAS-pop, happens inside the
descriptor’s transaction window. The bracket is not ended
inside claim; the contract is “transaction will remain
started” (the header comment notes this) — the caller is
responsible for ending the transaction (typically by inserting
the claimed node into a structure and ending the tran inside
the insert).
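A sketch of the calling convention, assuming the hypothetical cached_handle payload from above and an end_tran () on the descriptor (the exact ending API lives with the consumer, e.g., the hashmap insert):

```cpp
// Hypothetical caller honoring the claim contract: claim() leaves the
// transaction started; the caller ends it after publishing the node.
void insert_one (lockfree::freelist<cached_handle> &pool, tran::descriptor &tdes)
{
  auto *node = pool.claim (tdes);       // EBR bracket is now open
  node->get_data ().m_file = nullptr;   // (re)initialize the payload
  // ... publish node into a lock-free structure ...
  tdes.end_tran ();                     // assumed name; close the bracket
}
```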
pop_from_available — the documented ABA window
```cpp
// freelist<T>::pop_from_available — src/base/lockfree_freelist.hpp:322
free_node *pop_from_available ()
{
  free_node *rhead = NULL, *rhead_copy = NULL, *next;
  do
    {
      rhead = m_available_list;
      if (rhead == NULL)
        {
          return NULL;
        }
      next = rhead->get_freelist_next ();
      rhead_copy = rhead;
      // todo: this is a dangerous preemption point; if I am preempted here, and
      // thread 2 comes and does:
      // - second thread gets same rhead and successfully moves m_available_list
      //   to next
      // - third thread gets next and successfully moves m_available_list to
      //   next->next
      // - second thread retires rhead. m_available_list becomes rhead and its
      //   next becomes next->next
      // - I wake up, compare exchange m_available_list successfully because it
      //   is rhead again, but next will become the item third thread already
      //   claimed.
    }
  while (!m_available_list.compare_exchange_weak (rhead_copy, next));
  rhead->reset_freelist_next ();
  return rhead;
}
```

The comment names the classical ABA scenario verbatim. The CAS
on m_available_list succeeds when the head is rhead again
— but if rhead was popped, retired, and re-pushed in
between, our saved next pointer is stale and we corrupt the
list.
The mitigation is time, not algorithm. A retired node must travel:
- Through the descriptor’s retired ring,
- Wait for min_active_tranid > retire_tranid (which means no reader’s transaction window predates the retirement),
- Get reclaimed via reclaim(),
- Be pushed back to m_available_list.
The back-buffer pattern lengthens this travel: most retirees go
to the descriptor’s retired ring → reclaim() → freelist’s
available, but if there is a back-buffer block in flight, the
back-buffer’s block waits until swap_backbuffer moves it into
available. The total time is empirically long enough that the
ABA window in pop_from_available is not hit in practice, but
it is not eliminated by algorithm.
The todo flag in the comment is the design’s open issue. A
tagged-pointer redesign (lower 16 bits as a generation tag in
the head pointer) would close the window at the cost of
constraining free_node * alignment. CUBRID has not adopted
this; the back-buffer time is the production mitigation.
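For concreteness, here is one shape such a redesign could take (a sketch, not CUBRID code; it uses a double-word CAS on a pointer+tag pair rather than bit-packing, trading bit discipline for a 16-byte atomic):

```cpp
#include <atomic>
#include <cstdint>

struct node { node *next; };

// Pointer + generation tag, compare-exchanged as one unit. On x86-64 this
// needs the 16-byte CAS (cmpxchg16b) to be lock-free.
struct tagged_head
{
  node *ptr;
  std::uintptr_t tag;
};

std::atomic<tagged_head> head;

node *pop ()
{
  tagged_head old_h = head.load ();
  for (;;)
    {
      if (old_h.ptr == nullptr)
        {
          return nullptr;
        }
      tagged_head new_h = { old_h.ptr->next, old_h.tag + 1 };
      // if old_h.ptr was recycled and re-pushed meanwhile, the tag differs,
      // so this CAS fails instead of installing a stale next pointer
      if (head.compare_exchange_weak (old_h, new_h))
        {
          return old_h.ptr;
        }
      // on failure, old_h was reloaded; loop and retry
    }
}
// push CASes the same tagged word, also bumping the tag
```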
swap_backbuffer — moving the spare block in
```cpp
// freelist<T>::swap_backbuffer — src/base/lockfree_freelist.hpp:169
void swap_backbuffer ()
{
  free_node *bb_head = m_backbuffer_head;
  if (bb_head == NULL)
    {
      return;   // somebody already swapped
    }

  free_node *bb_head_copy = bb_head;
  if (!m_backbuffer_head.compare_exchange_strong (bb_head_copy, NULL))
    {
      return;   // somebody else is swapping
    }

  free_node *bb_tail = m_backbuffer_tail.exchange (NULL);
  assert (bb_tail != NULL);

  m_available_count += m_bb_count.exchange (0);
  push_to_list (*bb_head, *bb_tail, m_available_list);

  alloc_backbuffer ();   // start building next back-buffer
}
```

The protocol:
- Claim the back-buffer. CAS m_backbuffer_head from the current pointer to NULL. Failure means someone else is already swapping; the loser returns immediately.
- Detach the tail. Atomic exchange m_backbuffer_tail to NULL. The pre-CAS head and the exchanged tail bracket the block to splice.
- Splice into available. push_to_list is the standard CAS-loop push (sketched below) — set tail->next = available_head, CAS available_head from available_head to block_head.
- Start building the next back-buffer. alloc_backbuffer new-s a fresh block and CAS-installs it in the now-empty m_backbuffer_head / m_backbuffer_tail.
The alloc_backbuffer is what hides the malloc/init cost from
the next claimer: by the time swap_backbuffer returns, the
hot path (pop_from_available) has already been refilled, and
the next swap’s destination block is being built in
parallel.
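push_to_list itself is the standard CAS-loop chain splice; in spirit it looks like this (a sketch of the shape; the real code is at :368):

```cpp
// Sketch of push_to_list's shape: splice a [head, tail] chain onto an
// atomic stack head with the usual CAS loop.
void push_to_list (free_node &head, free_node &tail, std::atomic<free_node *> &dest)
{
  free_node *old_head = dest.load ();
  do
    {
      tail.set_freelist_next (old_head);   // chain's tail points at old head
    }
  while (!dest.compare_exchange_weak (old_head, &head));
}
```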
force_alloc_block — the escape hatch
```cpp
// freelist<T>::force_alloc_block — src/base/lockfree_freelist.hpp:213
void force_alloc_block ()
{
  free_node *new_head = NULL, *new_tail = NULL;
  alloc_list (new_head, new_tail);
  m_available_count += m_block_size;
  ++m_forced_alloc_count;
  push_to_list (*new_head, *new_tail, m_available_list);
}
```

Bypasses the back-buffer entirely; allocates and pushes
straight to available. Triggered by claim’s 100-iteration
escape from the back-buffer-empty loop. Tracked by
m_forced_alloc_count so monitoring can flag pools that hit
this path.
retire — the public retirement primitive
```cpp
// freelist<T>::retire — src/base/lockfree_freelist.hpp:358
void retire (tran::descriptor &tdes, free_node &node)
{
  assert (node.get_freelist_next () == NULL);
  ++m_retired_count;
  check_my_pointer (&node);
  tdes.retire_node (node);   // hand to descriptor's retired ring
}
```

The freelist defers to tdes.retire_node(), which:
- Bumps the global tranid (via start_tran_and_increment_id).
- Amortized-reclaims the descriptor’s existing retired prefix.
- Stamps the node’s m_retire_tranid and links it to the descriptor’s retired ring.
The freelist’s m_retired_count is a global counter (atomic
inc on retire, atomic dec on reclaim) — useful for
observability but not for correctness.
free_node::reclaim — the recycle hook
```cpp
// freelist<T>::free_node::reclaim — src/base/lockfree_freelist.hpp:523
void reclaim () final override
{
  m_t.on_reclaim ();       // payload's cleanup hook

  m_retired_next = NULL;   // detach from retire ring
  --m_owner->m_retired_count;
  ++m_owner->m_available_count;
  m_owner->push_to_list (*this, *this, m_owner->m_available_list);
}
```

Three steps:
- Payload cleanup. m_t.on_reclaim(). For the hashmap’s freelist_node_data, this calls the per-table lf_entry_descriptor::f_uninit. For other consumers it is whatever destructor-like routine the type defines.
- Counters. Decrement retired, increment available.
- Push. Standard CAS-loop push to m_available_list. The node is now claimable through pop_from_available.
The final override qualifier is deliberate — no further
override is allowed. This is the only place a node transitions
from “retired” to “claimable.”
clear_free_nodes — destructor cleanup, not thread-safe
```cpp
// freelist<T>::clear_free_nodes — src/base/lockfree_freelist.hpp:267
void clear_free_nodes ()   // not thread safe!
{
  final_sanity_checks ();

  dealloc_list (m_backbuffer_head.load ());
  m_backbuffer_head = NULL;
  m_backbuffer_tail = NULL;

  dealloc_list (m_available_list.load ());
  m_available_list = NULL;

  m_available_count = m_bb_count = m_alloc_count = 0;
}
```

Used only by the destructor; assumes no other thread is touching
the freelist. The header comment makes this explicit. Reaches
into both lists and deletes each free_node directly, then
zeros the counters. The final_sanity_checks (only enabled in
debug) verifies the available + back-buffer counts equal
m_alloc_count and walks both lists to confirm.
get_claimed_count — derived statistic
The freelist tracks m_alloc_count (total ever allocated),
m_available_count (currently claimable),
m_bb_count (currently in back-buffer), and
m_retired_count (currently waiting for reclamation). The
claimed count — currently held by a caller — is derived:
```cpp
size_t get_claimed_count () const
{
  size_t alloc = m_alloc_count;
  size_t unused = m_available_count + m_bb_count + m_retired_count;
  return (alloc > unused) ? (alloc - unused) : 0;
}
```

The counters feeding the unused aggregate are loaded with seq_cst; the
math is not transactional, so a fast burst of retires and
claims can produce a transient negative (clamped to 0). For
monitoring, the value is good enough.
Component overview
```mermaid
flowchart LR
  Caller["Caller (e.g., hashmap insert)"]
  Caller --> Claim["claim(tdes)"]
  Claim --> Pop["pop_from_available()"]
  Pop -- empty --> Swap["swap_backbuffer()"]
  Swap --> BB["m_backbuffer_head/tail"]
  Swap -.alloc next.-> AllocBB["alloc_backbuffer()"]
  Swap --> Avail["m_available_list"]
  Pop --> Avail
  Swap -- still empty --> Force["force_alloc_block()"]
  Force --> Avail
  Caller --> Retire["retire(tdes, node)"]
  Retire --> Tdes["tran::descriptor"]
  Tdes --> RetRing["m_retired_head/tail ring"]
  RetRing -.tranid bump + min-scan.-> Reclaim["free_node::reclaim()"]
  Reclaim --> OnReclaim["T::on_reclaim()"]
  Reclaim --> Avail
  Avail --> Pop
```
Statistics summary
| Counter | Meaning | Updated when |
|---|---|---|
| m_alloc_count | Total free_nodes ever created | alloc_list (block alloc) |
| m_available_count | Currently in m_available_list | pop_from_available (-1), reclaim (+1), swap (+block_size) |
| m_bb_count | Currently in back-buffer | alloc_backbuffer (+block_size), swap (zeroes via exchange) |
| m_forced_alloc_count | How many times the back-buffer escape hatch fired | force_alloc_block |
| m_retired_count | Currently in some descriptor’s retired ring | retire (+1), reclaim (-1) |
A healthy freelist has small m_forced_alloc_count (ideally 0
under steady state), m_retired_count bounded by reclamation
latency, and m_available_count + m_bb_count close to one or
two block sizes.
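As a rough illustration, a monitoring probe over the public getters might encode those health conditions (the thresholds here are invented, not CUBRID’s):

```cpp
// Sketch of a health probe over the public statistics getters; the slack
// factor is illustrative only.
template <class T>
bool freelist_looks_healthy (const lockfree::freelist<T> &fl, size_t block_size)
{
  return fl.get_forced_allocation_count () == 0   // escape hatch never fired
         && fl.get_available_count () + fl.get_backbuffer_count ()
            <= 3 * block_size;                    // at most a couple of blocks idle
}
```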
Comparison with legacy LF_FREELIST
The legacy lf_freelist (lock_free.h:231+) does the same
job with a different shape:
```c
struct lf_freelist
{
  void *available;                  // single available stack head (Treiber)

  int block_size;
  int alloc_cnt;
  int available_cnt;
  int retired_cnt;

  LF_ENTRY_DESCRIPTOR *entry_desc;  // takes f_alloc/f_free callbacks
  LF_TRAN_SYSTEM *tran_system;      // shared global, not owned
};
```

Differences from the modern design:
- No back-buffer. The legacy freelist allocates blocks on-demand inside lf_freelist_claim (when lf_stack_pop (&freelist->available, edesc) returns NULL) via lf_freelist_alloc_block. The header comment acknowledges multiple threads racing to alloc a block at once, calling it “acceptable given low enough block_size” (lock_free.c:840-842).
- f_alloc / f_free callbacks instead of new / delete. Each block is built by calling the descriptor’s f_alloc per-entry; reclamation calls f_free. Modern freelist owns its own typed allocations.
- max_alloc_cnt cap. The legacy freelist tracks alloc_cnt and refuses to grow past entry_desc->max_alloc_cnt (default 2,147,483,647). When the cap is hit and a retire happens, the legacy lf_freelist_transport f_frees the retiree instead of recycling. The modern freelist has no cap; once allocated, blocks live for the freelist’s lifetime.
- Shared LF_TRAN_SYSTEM. A single legacy LF_TRAN_SYSTEM underlies many freelists in the same domain. Modern freelist owns its tran::table; reclamation pressure is per-pool.
- lf_freelist_transport is the legacy reclamation walker. Walks the per-thread retired list, partitioning into “free this” (cap exceeded) and “transport to available” (still under cap). Modern code calls descriptor::reclaim_retired_list and the per-node reclaim() virtual.
The migration path: when a hashmap consumer flips from the
legacy to the modern hashmap (via enable_new_lfhash), it
trades the global LF_TRAN_SYSTEM + lf_freelist for a
per-table tran::table + lockfree::freelist. The behavior
is observably the same (entries claimed, retired, recycled);
the bookkeeping is simpler.
Memory ordering
Every atomic in the freelist uses seq_cst defaults. The
hot operations are:
- m_available_list.compare_exchange_weak (pop and push) — seq_cst.
- m_backbuffer_head.compare_exchange_strong (claim back-buffer for swap) — seq_cst.
- m_backbuffer_tail.exchange (detach tail during swap) — seq_cst.
- Counter ++ / -- / += — seq_cst RMW.
On x86 the seq_cst overhead is small (a couple of additional
fences); on ARM it is non-trivial. No measurement campaign
exists for tightening to acquire/release.
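For reference, the acquire/release shape such a tightening would likely take on the Treiber pair (a sketch under that assumption; CUBRID has not made or measured this change):

```cpp
#include <atomic>

struct node { node *next; };

// Sketch only. Typical Treiber orderings: acquire on pop (to see the popped
// node's contents), release on push (to publish them); pure statistics
// counters could be relaxed.
node *pop_acquire (std::atomic<node *> &head)
{
  node *h = head.load (std::memory_order_acquire);
  while (h != nullptr
         && !head.compare_exchange_weak (h, h->next,
                                         std::memory_order_acquire,
                                         std::memory_order_acquire))
    ;   // on failure the CAS reloads h
  return h;
}

void push_release (std::atomic<node *> &head, node *n)
{
  node *h = head.load (std::memory_order_relaxed);
  do
    {
      n->next = h;
    }
  while (!head.compare_exchange_weak (h, n,
                                      std::memory_order_release,
                                      std::memory_order_relaxed));
}
```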
Source Walkthrough
Public surface
- lockfree::freelist<T> (class template) — lockfree_freelist.hpp:42
- freelist::freelist (tran::system &, size_t, size_t) — :138
- freelist::~freelist — :259
- freelist::claim (tran::descriptor &) — :291
- freelist::claim (tran::index) — :284
- freelist::retire (tran::descriptor &, free_node &) — :358
- freelist::retire (tran::index, free_node &) — :351
- freelist::get_alloc_count / get_available_count / ... — :382+
- freelist::get_transaction_system / get_transaction_table — :433 / :440
- freelist<T>::free_node — :104
- free_node::get_data — :534
- free_node::reclaim — :523 (the final override)
Private internals
- freelist::swap_backbuffer — :169
- freelist::alloc_backbuffer — :194
- freelist::force_alloc_block — :213
- freelist::alloc_list — :227
- freelist::dealloc_list — :247
- freelist::pop_from_available — :322
- freelist::push_to_list — :368
- freelist::clear_free_nodes — :267
- freelist::final_sanity_checks — :447
- freelist::check_my_pointer — :476
Position hints (as of 2026-05-07)
| Symbol | File | Line |
|---|---|---|
| lockfree::freelist<T> (class template) | src/base/lockfree_freelist.hpp | 42 |
| freelist<T>::free_node (class) | src/base/lockfree_freelist.hpp | 104 |
| freelist::freelist (constructor) | src/base/lockfree_freelist.hpp | 138 |
| freelist::swap_backbuffer | src/base/lockfree_freelist.hpp | 169 |
| freelist::alloc_backbuffer | src/base/lockfree_freelist.hpp | 194 |
| freelist::force_alloc_block | src/base/lockfree_freelist.hpp | 213 |
| freelist::alloc_list | src/base/lockfree_freelist.hpp | 227 |
| freelist::~freelist | src/base/lockfree_freelist.hpp | 259 |
| freelist::clear_free_nodes | src/base/lockfree_freelist.hpp | 267 |
| freelist::claim (descriptor) | src/base/lockfree_freelist.hpp | 291 |
| freelist::pop_from_available | src/base/lockfree_freelist.hpp | 322 |
| freelist::retire (descriptor) | src/base/lockfree_freelist.hpp | 358 |
| freelist::push_to_list | src/base/lockfree_freelist.hpp | 368 |
| freelist::get_claimed_count | src/base/lockfree_freelist.hpp | 418 |
| free_node::reclaim | src/base/lockfree_freelist.hpp | 523 |
| lf_freelist_alloc_block (legacy) | src/base/lock_free.c | 621 |
| lf_freelist_claim (legacy) | src/base/lock_free.c | 760 |
| lf_freelist_retire (legacy) | src/base/lock_free.c | 873 |
| lf_freelist_transport (legacy) | src/base/lock_free.c | 934 |
Source verification (as of 2026-05-07)
Verified facts
- The freelist owns its tran::table. Constructor calls m_trantable = new tran::table (transys) at lockfree_freelist.hpp:141; destructor delete m_trantable at :261.
- Block size is at least 2. Constructor enforces assert (block_size > 1) at :152 and the initial_block_count <= 1 branch at :154-158 halves block size and doubles count to ensure ≥ 2 blocks at init.
- pop_from_available’s ABA window is documented in the source. Multi-line comment at :336-342 describing the exact race scenario. No tagged-pointer mitigation in the current implementation.
- force_alloc_block is the escape hatch after 100 back-buffer-empty cycles. claim at :298 loops up to 100 swap attempts before falling through to the force_alloc_block while-loop at :306.
- free_node::reclaim is final override. Declared at lockfree_freelist.hpp:112. No further overrides allowed.
- Statistics are seq_cst atomics. All counters (m_alloc_count, m_available_count, m_bb_count, m_forced_alloc_count, m_retired_count) declared as std::atomic<size_t> at :82-86.
- claim does not end the transaction. The header comment at :53 says “transaction will remain started!” Caller is responsible for ending it.
- The free node’s m_retired_next is reused as the freelist link. set_freelist_next / get_freelist_next at :502-518 cast through the inherited tran::reclaimable_node::m_retired_next.
Open questions
- Has the pop_from_available ABA window ever been hit in production? The comment is documented; no test evidence either way. A targeted stress test (deliberate SIGSTOP / SIGCONT on a worker thread between read of next and CAS) would tell us empirically. If it has not hit in N production-machine-years, the back-buffer time is the de-facto mitigation; if it has, the tagged-pointer redesign moves up the priority list.
- Why the magic number 100 in the swap-loop limit? The count < 100 cap before falling through to force_alloc_block is unmotivated by the source. Perhaps tuned empirically; perhaps a placeholder. Worth a focused review under stress.
- Cap-on-allocation parity with legacy. Legacy lf_freelist honours entry_desc->max_alloc_cnt. Modern freelist<T> has no cap; once a block is allocated, it lives for the freelist’s lifetime. For long-running servers with bursty load, this is a memory-watermark concern. Investigation: does any consumer rely on the legacy cap to bound memory? If yes, the modern freelist needs a cap-and-shrink path.
- alloc_backbuffer’s use of compare_exchange_strong on a NULL expectation. At :204, the function does m_backbuffer_tail.compare_exchange_strong (dummy_null, new_bb_tail) — succeeds only if the tail is currently NULL (which is true after swap_backbuffer exchanged it to NULL). On the first init call the tail is also NULL. But what if a second alloc_backbuffer raced (e.g., from two near-simultaneous swaps)? The CAS would fail, and the second alloc_backbuffer would push_to_list to the already-installed back-buffer head. Behavior under that race is plausible-but-not-obviously-correct; worth a focused review.
- final_sanity_checks invariant under load. The debug-only check at :447-472 asserts m_available_count + m_bb_count == m_alloc_count. This ignores m_retired_count — i.e., it assumes no retires are in flight at destruction time. The destructor runs after every consumer has stopped using the freelist (via clear_free_nodes), so this is a reasonable precondition; but if a future change calls final_sanity_checks mid-life, the assertion is wrong.
Beyond CUBRID — Comparative Designs & Research Frontiers
- Tagged pointers to close the pop_from_available ABA window. Stuff a 16-bit generation counter into the upper bits of free_node * (or use std::atomic<__int128>). Every push bumps the tag; every pop checks tag equality. Eliminates the documented ABA hazard at the cost of pointer-arithmetic discipline.
- Hazard-pointer-based freelist (Michael 2004). Per-thread hazard pointer slot for the popped node; reclaim only when no thread has the address as a hazard. Cheaper read-side publication than EBR for hot workloads.
- Per-CPU sharded freelist with stealing. Replace the single m_available_list with one per CPU (sched_getcpu()). Allocate from local; on local-empty, steal from a neighbor. Removes the cache-line bouncing on the head pointer at the cost of more memory and more complex sanity-check accounting. A sketch of this shape follows the list.
- Linked block lists rather than per-node lists. The freelist’s available list could be a list of blocks of nodes rather than individual nodes. Each block holds K nodes; pop gives one node from the head block (move to next when empty). Cuts CAS frequency by K. Requires the block to be the unit of cache-line concern, not the node.
- Cap-and-shrink. Track the high-water mark of the in-use count; when the freelist is overprovisioned (in-use much smaller than allocated), shrink during a clear or a maintenance daemon tick. Would close the legacy parity gap (open question 3).
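A minimal sketch of the sharded shape (illustrative only; sched_getcpu is Linux-specific, and SHARDS, the stealing order, and the fallback are all invented here):

```cpp
#include <atomic>
#include <sched.h>   // sched_getcpu (Linux)

struct node { node *next; };

constexpr int SHARDS = 64;

// One head per CPU, padded to a cache line so the heads don't false-share.
struct alignas (64) shard
{
  std::atomic<node *> head {nullptr};
};

shard g_shards[SHARDS];

node *pop_one (std::atomic<node *> &head)
{
  node *h = head.load ();
  while (h != nullptr && !head.compare_exchange_weak (h, h->next))
    ;
  return h;
}

node *claim_sharded ()
{
  int cpu = sched_getcpu ();
  int home = (cpu < 0) ? 0 : cpu % SHARDS;
  if (node *n = pop_one (g_shards[home].head))
    {
      return n;   // fast path: local shard
    }
  for (int i = 1; i < SHARDS; i++)   // local empty: steal from neighbors
    {
      if (node *n = pop_one (g_shards[(home + i) % SHARDS].head))
        {
          return n;
        }
    }
  return nullptr;   // all shards empty: caller falls back to block alloc
}
```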
Sources
- Source files (in the CUBRID source tree at /data/hgryoo/references/cubrid/):
  - src/base/lockfree_freelist.hpp (header-only template, the entire implementation)
  - src/base/lockfree_transaction_descriptor.hpp (the descriptor freelist::claim / retire rides on)
  - src/base/lockfree_transaction_table.hpp (the table freelist constructs internally)
  - src/base/lockfree_transaction_reclaimable.hpp (the base class free_node derives from)
  - src/base/lockfree_hashmap.hpp (primary consumer, via freelist<freelist_node_data>)
  - src/base/lock_free.h, src/base/lock_free.c (legacy lf_freelist counterpart)
- Textbook chapters cited:
  - Herlihy & Shavit, The Art of Multiprocessor Programming, ch. 11 §“Concurrent Stacks and Elimination” — the Treiber-stack pattern at the core of pop_from_available and push_to_list.
- Foundational papers:
  - Treiber, R. K. “Systems Programming: Coping with Parallelism.” IBM Almaden RJ 5118, 1986 — the classical Treiber stack and its ABA exposure.
  - Michael, M. M. “Hazard Pointers: Safe Memory Reclamation for Lock-Free Objects.” IEEE TPDS 15(8), 2004 — the alternative reclamation model the freelist does not use.
- Sibling docs in this repo:
  - cubrid-lockfree-overview.md — umbrella, two-generation map.
  - cubrid-lockfree-transaction.md — the reclamation framework the freelist plugs into.
  - cubrid-lockfree-hashmap.md — primary consumer; explains freelist_node_data::on_reclaim.