CUBRID Lock-Free Circular Queue — Bounded MPMC Ring with Per-Slot Block Flags

A bounded multi-producer multi-consumer (MPMC) queue is the canonical inter-thread hand-off primitive in a high-throughput system. Producers push payloads from one set of threads; consumers pop them from another. The throughput goal is to amortize the per-item cost down to a small number of atomic operations per element. The textbook reference is Herlihy & Shavit, The Art of Multiprocessor Programming, ch. 13 §“Concurrent Queues.” Three implementation families recur:

  1. Linked-list MPMC (Michael & Scott 1996). Two atomic pointers (head, tail); each enqueue allocates a node and CAS-links it; each dequeue CAS-advances head. Unbounded; one allocation per item. Production exemplar: Java’s ConcurrentLinkedQueue.
  2. Bounded ring buffer (Lamport 1983 SPSC; many MPMC variants). Power-of-two slot array + two cursor atomics + per-slot synchronization. No per-item allocation. Production exemplar: LMAX Disruptor (Thompson et al. 2011).
  3. Bounded fetch-and-add ring (Yang & Mellor-Crummey 2016). Producers and consumers each fetch-and-add their cursor; the slot’s “sequence number” tells whether the slot is ready for produce or consume. The cleanest production design today (Vyukov, Adamson). Used by Boost.Lockfree’s spsc_queue / queue and many high-performance C++ libraries.

CUBRID’s lockfree::circular_queue<T> is closest to family (2) but with a CAS-driven cursor (not fetch-and-add) and a per-slot block flag instead of a sequence number. The choice matters:

  • CAS-cursor lets a producer detect a full queue cheaply (compare consume cursor + capacity to produce cursor) but exposes a producer-vs-producer race for the same slot.
  • Block flag resolves the race: a producer that lost the cursor CAS still incremented the cursor, but only the winner of the per-slot block-flag CAS gets to write into that slot. The block flag also doubles as a “produce complete” signal for consumers.

The design has a known weakness, called out in the source itself: a producer preempted between cursor-bump and block-release stalls every consumer at that slot until it wakes. This is not strictly lock-free under the textbook progress condition (a producer can be blocked while consumers are forced to wait). The source acknowledges this and notes a separation-of-concerns redesign as future work.

Two uint64_t atomic cursors:

  • m_produce_cursor — strictly monotone; reads-and-bumps via compare_exchange_strong.
  • m_consume_cursor — strictly monotone; reads-and-bumps via compare_exchange_strong.

The capacity is rounded up to the next power of two; the slot index for a cursor is cursor & (capacity - 1).

The size invariant: produce_cursor - consume_cursor is the current item count. The queue is full when consume_cursor + capacity <= produce_cursor and empty when produce_cursor <= consume_cursor.
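
To make the cursor arithmetic concrete, here is a minimal standalone sketch (the constants and helper names are illustrative, not taken from the source):

// Cursor arithmetic sketch: slot index, empty test, full test.
#include <cassert>
#include <cstddef>
#include <cstdint>

using cursor_type = std::uint64_t;

constexpr std::size_t capacity = 64;                 // already a power of two
constexpr std::size_t index_mask = capacity - 1;

inline std::size_t slot_index (cursor_type cursor) { return cursor & index_mask; }
inline bool queue_empty (cursor_type pc, cursor_type cc) { return pc <= cc; }
inline bool queue_full (cursor_type pc, cursor_type cc) { return cc + capacity <= pc; }

int main ()
{
  cursor_type cc = 100, pc = 164;                           // 64 items in flight
  assert (pc - cc == capacity);                             // item count equals capacity
  assert (queue_full (pc, cc) && !queue_empty (pc, cc));    // producers must back off
  assert (slot_index (pc) == slot_index (pc + capacity));   // same slot, next generation
}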

A parallel m_blocked_cursors[capacity] array of atomic uint64_t. Each slot’s flag word encodes:

  • The expected cursor value in the low bits — the cursor value the slot is ready to accept next (initially the slot’s index, then bumped by +capacity after each produce).
  • A BLOCK_FLAG in the highest bit (0x8000_0000_0000_0000). When set, “produce in progress; consumer must not read this slot.”

The combination encodes three states implicitly:

| Slot state | Flag word |
| --- | --- |
| Ready for produce at cursor c | c (no BLOCK_FLAG) |
| Produce in progress at cursor c | c \| BLOCK_FLAG |
| Produced; ready for consume | c + capacity (after unblock) |

The consumer reads the flag word, sees that BLOCK_FLAG is clear, then reads the slot. Because the flag word stores the full expected cursor value (slot index = cursor & (capacity - 1), generation = cursor / capacity), a producer that tries to re-claim the same slot index for the next generation succeeds only after the unblock has bumped the flag word to that generation's expected value; the previous generation's value no longer matches.
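
A small sketch of the flag-word encoding (illustrative only; the actual implementation routes these operations through the flag<T> helper from base_flag.hpp):

// Flag-word encoding sketch: high bit = BLOCK_FLAG, low bits = expected cursor.
#include <cassert>
#include <cstdint>

using cursor_type = std::uint64_t;
constexpr cursor_type BLOCK_FLAG = (cursor_type) 1 << 63;

inline cursor_type set_block (cursor_type cursor) { return cursor | BLOCK_FLAG; }
inline cursor_type clear_block (cursor_type word) { return word & ~BLOCK_FLAG; }
inline bool is_blocked_word (cursor_type word) { return (word & BLOCK_FLAG) != 0; }

int main ()
{
  constexpr cursor_type capacity = 8;
  cursor_type word = 3;                       // slot 3: ready for produce at cursor 3
  word = set_block (word);                    // produce in progress
  assert (is_blocked_word (word));
  word = clear_block (word) + capacity;       // unblock: produced; next expected cursor is 11
  assert (!is_blocked_word (word) && word == 11);
}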

Every DBMS that runs background work threads ships some MPMC queue. The shapes converge.

Vacuum dispatch (PostgreSQL, MySQL InnoDB purge, CUBRID)

The pattern: log-flush daemon publishes “log block N is now durable” entries; vacuum workers pop and process them. The queue’s role is to decouple the latency of log-flush (synchronous) from the latency of vacuum (eventually consistent).

  • PostgreSQL uses no MPMC queue; the autovacuum launcher scans pg_class periodically and dispatches via shared-memory state. The decoupling is by polling, not by queue.
  • InnoDB purge uses a single dispatch thread popping from a ring buffer (trx_purge_t::view); multiple worker threads pick from a partitioned set.
  • CUBRID vacuum uses two lockfree::circular_queue<T> instances: vacuum_Block_data_buffer for “log block ready for vacuum” and vacuum_Finished_job_queue for “block N finished, can advance vacuum’s safe LSN.”

When a thread fixes a page that is not buffered, it must evict a victim. The decision can be done inline (the fixer walks the LRU and evicts) or pipelined (a daemon scans the LRU, publishes ready-to-evict victims, fixers pop).

  • MySQL InnoDB does inline eviction with mutex-protected buf_LRU lists.
  • PostgreSQL does inline eviction with the BufFreelist spinlock.
  • CUBRID uses the pipelined model with multiple lockfree::circular_queue<int> queues (private_lrus_with_victims, big_private_lrus_with_victims, shared_lrus_with_victims) — page-buffer utility daemons publish victim LRU partition indexes; fixer threads pop and use them.

CUBRID’s CDC subsystem forwards log-info entries from the log-manager thread to a CDC publisher thread via cdc_Gl.loginfo_queue (a lockfree::circular_queue<CDC_LOGINFO_ENTRY *>). The shape matches the producer-consumer streaming pattern that PostgreSQL implements via SLRU shared memory and that MySQL implements via Log_event allocation under the binlog mutex.

| Concept | CUBRID name |
| --- | --- |
| Bounded MPMC queue | lockfree::circular_queue<T> |
| Power-of-two capacity | m_capacity = next_pow2 (size) |
| Slot index mask | m_index_mask = m_capacity - 1 |
| Producer cursor | m_produce_cursor (std::atomic<uint64_t>) |
| Consumer cursor | m_consume_cursor (std::atomic<uint64_t>) |
| Per-slot block flag | m_blocked_cursors[i] (std::atomic<uint64_t>) |
| Block bit | BLOCK_FLAG = 1ULL << 63 |
| Empty test | produce_cursor <= consume_cursor |
| Full test | consume_cursor + capacity <= produce_cursor |
| Slot index from cursor | cursor & m_index_mask |
| Produce-in-progress slot signal | block flag set on m_blocked_cursors[i] |
| Generation bump on unblock | flag word ← cursor + capacity |
| Force-produce wait | std::this_thread::yield() |
| Optional execution-history trace | local_history, local_event (under DEBUG_LFCQ) |
// lockfree::circular_queue<T> — src/base/lockfree_circular_queue.hpp:46
template <class T>
class circular_queue
{
  public:
    using cursor_type = std::uint64_t;
    using atomic_cursor_type = std::atomic<cursor_type>;
    using data_type = T;
    using atomic_flag_type = atomic_cursor_type;

    circular_queue (std::size_t size);
    ~circular_queue ();

    bool is_empty () const;                  // cheap snapshot — may be stale
    bool is_full () const;
    bool is_half_full ();
    std::size_t size ();

    bool consume (T &element);               // non-blocking; false on empty
    bool produce (const T &element);         // non-blocking; false on full
    void force_produce (const T &element);   // yield-loop until success
    std::uint64_t get_consumer_cursor ();

  private:
    static const cursor_type BLOCK_FLAG;     // (cursor_type) 1 << 63

    data_type *m_data;                       // m_capacity slots
    atomic_flag_type *m_blocked_cursors;     // m_capacity flag words
    atomic_cursor_type m_produce_cursor;     // monotone
    atomic_cursor_type m_consume_cursor;     // monotone
    std::size_t m_capacity;                  // power of two
    std::size_t m_index_mask;                // m_capacity - 1
};

The constructor rounds the requested size up to the next power of two:

circular_queue<T>::circular_queue (std::size_t size)
  : m_produce_cursor (0), m_consume_cursor (0)
{
  m_capacity = next_pow2 (size);
  m_index_mask = m_capacity - 1;
  assert ((m_capacity & m_index_mask) == 0);

  m_data = new data_type[m_capacity];
  init_blocked_cursors ();
}

void init_blocked_cursors ()
{
  m_blocked_cursors = new atomic_flag_type[m_capacity];
  for (cursor_type c = 0; c < m_capacity; c++)
    m_blocked_cursors[c] = c;   // initial expected value = slot index
}

The initial flag word for slot i is i — meaning “ready for produce at cursor i, no block flag.” After the first produce-unblock cycle, the flag becomes i + capacity; after the next, i + 2*capacity; and so on. This is how the slot tracks its generation without needing a separate generation counter.
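
As a worked example — assuming a hypothetical capacity of 8 and slot index 5 — the expected flag values advance like this (sketch, not source code):

// Flag-word progression for one slot across generations.
#include <cstdint>
#include <cstdio>

int main ()
{
  const std::uint64_t capacity = 8, slot = 5;
  for (std::uint64_t gen = 0; gen < 3; gen++)
    {
      std::uint64_t expected = slot + gen * capacity;   // value a producer must CAS against
      std::printf ("gen %llu: produce at cursor %llu, unblock writes %llu\n",
                   (unsigned long long) gen,
                   (unsigned long long) expected,
                   (unsigned long long) (expected + capacity));
    }
  // gen 0: produce at cursor 5, unblock writes 13
  // gen 1: produce at cursor 13, unblock writes 21
  // gen 2: produce at cursor 21, unblock writes 29
}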

// circular_queue<T>::produce — src/base/lockfree_circular_queue.hpp:230
bool produce (const T &element)
{
  cursor_type pc, cc;
  bool did_block;

  while (true)
    {
      pc = load_cursor (m_produce_cursor);
      cc = load_cursor (m_consume_cursor);

      if (test_full_cursors (pc, cc))
        return false;                                 // QUEUE FULL

      did_block = block (pc);                         // try to claim slot pc

      // bump the produce cursor regardless of block success
      test_and_increment_cursor (m_produce_cursor, pc);

      if (did_block)
        {
          store_data (pc, element);
          unblock (pc);                               // mark slot ready-to-consume
          return true;
        }
      // somebody else won the slot; loop
    }
}

The four sub-operations:

  1. Fullness check. The queue is full when produce_cursor + 1 > consume_cursor + capacity, equivalently consume_cursor + capacity <= produce_cursor. Cheap; uses snapshot loads. False positives on transient races are tolerated — the next loop iteration re-tests.
  2. Block-flag CAS.
    bool block (cursor_type cursor)
    {
      cursor_type block_val = flag<cursor_type> (cursor).set (BLOCK_FLAG).get_flags ();
      return m_blocked_cursors[cursor & m_index_mask].compare_exchange_strong (cursor, block_val);
    }
    The expected value is cursor (no block flag, generation matches); the new value is cursor | BLOCK_FLAG. The CAS succeeds only if (a) the slot’s generation is exactly cursor, and (b) no other producer has set the block bit.
  3. Cursor bump. compare_exchange_strong from pc to pc + 1. The bump is best-effort: it succeeds for exactly one of the racing producers. Whichever producer wins the bump-CAS pushes the cursor forward; the others ignore the result — their did_block decided whether they own the slot, and the cursor will catch up when their loop iteration re-loads it.
  4. Store + unblock.
    void unblock (cursor_type cursor)
    {
      atomic_flag_type &ref = m_blocked_cursors[cursor & m_index_mask];
      flag<cursor_type> v = ref.load ();
      assert (v.is_set (BLOCK_FLAG));

      cursor_type next_gen = v.clear (BLOCK_FLAG).get_flags () + m_capacity;
      ref.store (next_gen);
    }
    The unblock writes cursor + capacity (the next generation’s expected value) into the flag word. This is what tells a consumer that the slot is full, and tells the next-generation producer at the same slot index what to expect.
// circular_queue<T>::consume — src/base/lockfree_circular_queue.hpp:318
bool consume (T &element)
{
  cursor_type cc, pc;

  while (true)
    {
      cc = load_cursor (m_consume_cursor);
      pc = load_cursor (m_produce_cursor);

      if (test_empty_cursors (pc, cc))
        return false;                                 // QUEUE EMPTY

      if (is_blocked (cc))
        return false;                                 // first item not yet produced

      element = load_data (cc);                       // tentative copy

      if (test_and_increment_cursor (m_consume_cursor, cc))
        break;                                        // CONSUME COMPLETE

      // raced — another consumer took the slot; loop
    }
  return true;
}

Three sub-operations:

  1. Empty + block check. Empty if pc <= cc. Even if non-empty, if the slot at cc is still blocked (producer hasn’t unblocked yet), treat the queue as empty for this call — return false so the caller can try later. This is the “preempted producer” mitigation: a consumer does not wait for a stalled producer; it bails and lets the caller re-poll.
  2. Tentative copy. Read the slot’s payload into the caller’s output argument. If the cursor-CAS fails (someone else consumed first), the caller’s element may have been partly overwritten — the caller is documented not to trust element on false return.
  3. Cursor CAS. Strong CAS from cc to cc + 1. The thread that wins is the consumer of this slot.

The producer-and-consumer protocol is therefore asymmetric: producers fight for slots via the block flag, consumers fight for cursor values via the cursor CAS.
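
Putting the two halves together, the following is a compact, self-contained sketch of the protocol as described above — not the CUBRID implementation. The class name, the use of std::vector, the requirement that the capacity already be a power of two, and the default (sequentially consistent) memory ordering are all simplifying assumptions:

// Minimal MPMC ring sketch using a CAS-driven cursor and a per-slot block flag.
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <vector>

template <class T>
class mini_ring
{
  public:
    using cursor_type = std::uint64_t;
    static constexpr cursor_type BLOCK_FLAG = (cursor_type) 1 << 63;

    explicit mini_ring (std::size_t capacity)
      : m_capacity (capacity), m_mask (capacity - 1),
        m_data (capacity), m_flags (capacity),
        m_produce (0), m_consume (0)
    {
      for (cursor_type c = 0; c < capacity; c++)
        m_flags[c].store (c);                    // slot c first expects cursor c
    }

    bool produce (const T &item)
    {
      for (;;)
        {
          cursor_type pc = m_produce.load ();
          cursor_type cc = m_consume.load ();
          if (cc + m_capacity <= pc)
            return false;                        // full

          cursor_type expected = pc;             // block-flag CAS: claim slot pc
          bool owned = m_flags[pc & m_mask].compare_exchange_strong (expected, pc | BLOCK_FLAG);

          cursor_type bump = pc;                 // best-effort cursor bump; result ignored
          m_produce.compare_exchange_strong (bump, pc + 1);

          if (owned)
            {
              m_data[pc & m_mask] = item;                    // store payload
              m_flags[pc & m_mask].store (pc + m_capacity);  // unblock: next generation's expected value
              return true;
            }
          // lost the slot to a sibling producer; retry with fresh cursors
        }
    }

    bool consume (T &out)
    {
      for (;;)
        {
          cursor_type cc = m_consume.load ();
          cursor_type pc = m_produce.load ();
          if (pc <= cc)
            return false;                        // empty

          if (m_flags[cc & m_mask].load () & BLOCK_FLAG)
            return false;                        // head slot still being produced

          out = m_data[cc & m_mask];             // tentative copy
          if (m_consume.compare_exchange_strong (cc, cc + 1))
            return true;                         // this thread consumed the slot
          // another consumer advanced the cursor first; retry
        }
    }

  private:
    std::size_t m_capacity;
    std::size_t m_mask;
    std::vector<T> m_data;
    std::vector<std::atomic<cursor_type>> m_flags;
    std::atomic<cursor_type> m_produce;
    std::atomic<cursor_type> m_consume;
};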

stateDiagram-v2
    [*] --> Initial : alloc, slot i flag = i
    Initial --> ProduceInProgress : block flag CAS i -> i\nset BLOCK_FLAG
    ProduceInProgress --> Produced : unblock\nflag = i + capacity
    Produced --> Consumed : consume cursor CAS\nadvance to i+1
    Consumed --> Initial : next generation\nslot reused at cursor i+capacity
    note right of ProduceInProgress
      Consumer sees BLOCK_FLAG\nand returns "empty"
    end note
    note right of Consumed
      Slot becomes target for cursor i+capacity
    end note
void force_produce (const T &element)
{
  while (!produce (element))
    std::this_thread::yield ();
}

The simplest possible blocking-producer wrapper. No exponential backoff, no condition variable — just yield and retry. This is acceptable because the queues that use force_produce are designed to drain rapidly (vacuum’s finished-job queue); callers that can tolerate dropping an element when the queue is full use the non-blocking produce instead.
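
A typical caller pairs force_produce on the producer side with a polling consume loop on the consumer side. The sketch below uses only the public API declared earlier; the include path, element count, and the consumer’s yield-based back-off are illustrative:

// Producer/consumer usage sketch against lockfree::circular_queue<T>.
#include "lockfree_circular_queue.hpp"   // lives in src/base/; adjust to the build's include paths
#include <thread>

int main ()
{
  lockfree::circular_queue<int> queue (1024);       // capacity rounded up to a power of two

  std::thread producer ([&queue] ()
  {
    for (int i = 0; i < 100000; i++)
      queue.force_produce (i);                      // yield-loop until a slot frees up
  });

  std::thread consumer ([&queue] ()
  {
    int item;
    int seen = 0;
    while (seen < 100000)
      {
        if (queue.consume (item))                   // non-blocking poll
          seen++;
        else
          std::this_thread::yield ();               // daemon-looper style: back off and re-poll
      }
  });

  producer.join ();
  consumer.join ();
}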

Documented limitation — preempted producer

The produce source comment names the worst-case scenario:

// NOTE: I have an implementation issue I don't know how to fix
// without a significant overhead. When the queue is very stressed
// (many threads produce/consume very often), the system may preempt
// a thread while it has a slot blocked and may keep it preempted
// for a very long time (long enough to produce and consume an entire
// generation). I'd like to avoid any blocking of any kind.
//
// One possible direction is to separate data storage (and slot index)
// from cursor. When one wants to produce an element, it first reserves
// a slot in storage, adds his data, and then saves slot index/cursor in
// what is now m_blocked_cursors using CAS operation. if this succeeds,
// produced data is immediately available for consumption. any
// preemption would not block the queue (just delay when the produce
// is happening).

The comment is the design’s open issue. The current production behavior: a preempted producer holding a slot’s BLOCK_FLAG does not stall other producers — the produce cursor keeps advancing and subsequent slots keep filling — but consumption cannot move past the blocked slot once the consume cursor reaches it. Consumers do not spin on the stalled slot; they hit the “first item is blocked, return empty” branch and return to their callers, which retry later. The stall therefore delays consumption rather than wedging the queue, and it resolves as soon as the preempted producer wakes and unblocks the slot.

The public predicates do not synchronize the two cursors — they snapshot each independently and apply the test:

bool is_empty () const { return test_empty_cursors (m_produce_cursor, m_consume_cursor); }
bool is_full () const { return test_full_cursors (m_produce_cursor, m_consume_cursor); }

std::size_t size ()
{
  cursor_type cc = load_cursor (m_consume_cursor);
  cursor_type pc = load_cursor (m_produce_cursor);
  if (pc <= cc)
    return 0;
  return pc - cc;
}

Callers are expected to use these for early-out hints, not for hard correctness; the header comments make this explicit. A produce call issued right after is_full() returned false may still return false if a sibling producer beat it to the last slot; the caller’s loop must handle that.
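
A caller-side sketch of that contract — is_full() as an early-out hint plus a bounded retry around produce (the wrapper name and retry budget are hypothetical):

// Hint-plus-retry pattern around the non-blocking produce.
#include "lockfree_circular_queue.hpp"

template <class T>
bool try_submit (lockfree::circular_queue<T> &queue, const T &item, int max_attempts)
{
  for (int attempt = 0; attempt < max_attempts; attempt++)
    {
      if (queue.is_full ())
        return false;            // early-out hint only; may already be stale either way
      if (queue.produce (item))
        return true;             // claimed a slot
      // produce lost a race with a sibling producer (or the queue filled up); retry
    }
  return false;
}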

Optional execution-history trace — DEBUG_LFCQ

The header carries a parallel produce_debug / consume_debug / force_produce_debug family conditionally compiled under DEBUG_LFCQ. Each instruments the protocol with a thread-local local_history ring buffer, recording every cursor load, block CAS attempt, store, and unblock as a local_event. The intent is post-mortem debugging of stuck or lost-item scenarios on stress benches; in production builds the trace is dead code.

| Owner | Type parameter | Field | File |
| --- | --- | --- | --- |
| Vacuum block-data buffer | vacuum_data_entry | vacuum_Block_data_buffer | src/query/vacuum.c:467 |
| Vacuum finished-job queue | VACUUM_LOG_BLOCKID | vacuum_Finished_job_queue | src/query/vacuum.c:475 |
| CDC log-info queue | CDC_LOGINFO_ENTRY * | cdc_Gl.loginfo_queue | src/transaction/log_manager.c:14534 |
| Page-buffer high-priority waiter queue | THREAD_ENTRY * | pgbuf_Pool.waiter_threads_high_priority | src/storage/page_buffer.c:741 |
| Page-buffer low-priority waiter queue | THREAD_ENTRY * | pgbuf_Pool.waiter_threads_low_priority | src/storage/page_buffer.c:742 |
| Page-buffer post-flush queue | PGBUF_BCB * | pgbuf_Pool.flushed_bcbs | src/storage/page_buffer.c:813 |
| Page-buffer private LRU victim hint | int (LRU index) | pgbuf_Pool.private_lrus_with_victims | src/storage/page_buffer.c:815 |
| Page-buffer big-private LRU victim hint | int | pgbuf_Pool.big_private_lrus_with_victims | src/storage/page_buffer.c:816 |
| Page-buffer shared LRU victim hint | int | pgbuf_Pool.shared_lrus_with_victims | src/storage/page_buffer.c:817 |

Most queues sit on force_produce for their producer side (blocking accept) and consume for their consumer side (non-blocking poll, retry from the daemon’s looper).

The queue is not on the reclamation spine. It does not allocate per-item dynamic memory (the slots are pre-allocated in m_data); it does not retire pointer-shaped nodes; it does not have an ABA hazard on the cursor (the cursor is a 64-bit counter, never wraps in any realistic uptime). The tran::system / tran::table / tran::descriptor machinery is unnecessary here.

The trade-off: the queue’s payload T must be copyable by value. In practice CUBRID’s queues store either trivially copyable types (int, THREAD_ENTRY *, PGBUF_BCB *, VACUUM_LOG_BLOCKID) or small POD structs. Larger payloads should be heap-allocated and the pointer enqueued — the CDC_LOGINFO_ENTRY * queue is the example.
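
A sketch of that pointer-enqueue pattern (the struct and function names are hypothetical; CUBRID’s CDC queue applies the same idea with CDC_LOGINFO_ENTRY *):

// Pointer-payload pattern: only the pointer is copied through the slot.
#include "lockfree_circular_queue.hpp"

struct big_entry
{
  char payload[4096];
};

void publish (lockfree::circular_queue<big_entry *> &queue)
{
  big_entry *entry = new big_entry ();   // heap-allocate the large payload
  // ... fill *entry ...
  queue.force_produce (entry);           // the slot stores only the pointer
}

void drain (lockfree::circular_queue<big_entry *> &queue)
{
  big_entry *entry = nullptr;
  while (queue.consume (entry))
    {
      // ... process *entry ...
      delete entry;                      // ownership passes to the consumer on a successful consume
    }
}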

  • lockfree::circular_queue<T> — class template, lockfree_circular_queue.hpp:46
  • circular_queue::circular_queue (size):208
  • circular_queue::~circular_queue:222
  • circular_queue::is_empty / is_full / is_half_full / size:373 / :381 / :387 / :394
  • circular_queue::consume:318
  • circular_queue::produce:230
  • circular_queue::force_produce:301
  • circular_queue::get_consumer_cursor:311
  • next_pow2:409
  • test_empty_cursors / test_full_cursors:419 / :425
  • load_cursor:432
  • test_and_increment_cursor:439 (the cursor-bump CAS; comments note “weak vs strong made no perf difference”)
  • get_cursor_index:447
  • store_data / load_data:454 / :461
  • is_blocked / block / unblock:467 / :476 / :485
  • init_blocked_cursors:498
  • produce_debug:511
  • force_produce_debug:563
  • consume_debug:572
  • local_history / local_event / local_action enum — :118+
| Symbol | File | Line |
| --- | --- | --- |
| lockfree::circular_queue<T> (class template) | src/base/lockfree_circular_queue.hpp | 46 |
| BLOCK_FLAG ((cursor_type) 1 << 63) | src/base/lockfree_circular_queue.hpp | 80 / 203 |
| circular_queue::circular_queue | src/base/lockfree_circular_queue.hpp | 208 |
| circular_queue::produce | src/base/lockfree_circular_queue.hpp | 230 |
| circular_queue::force_produce | src/base/lockfree_circular_queue.hpp | 301 |
| circular_queue::consume | src/base/lockfree_circular_queue.hpp | 318 |
| circular_queue::is_empty | src/base/lockfree_circular_queue.hpp | 373 |
| circular_queue::is_full | src/base/lockfree_circular_queue.hpp | 381 |
| circular_queue::size | src/base/lockfree_circular_queue.hpp | 394 |
| circular_queue::next_pow2 | src/base/lockfree_circular_queue.hpp | 409 |
| circular_queue::block | src/base/lockfree_circular_queue.hpp | 476 |
| circular_queue::unblock | src/base/lockfree_circular_queue.hpp | 485 |
| circular_queue::init_blocked_cursors | src/base/lockfree_circular_queue.hpp | 498 |
| circular_queue::test_and_increment_cursor | src/base/lockfree_circular_queue.hpp | 439 |
| vacuum_Block_data_buffer | src/query/vacuum.c | 467 |
| vacuum_Finished_job_queue | src/query/vacuum.c | 475 |
| cdc_Gl.loginfo_queue field | src/transaction/log_impl.h | 877 |
| pgbuf_Pool.waiter_threads_high_priority | src/storage/page_buffer.c | 741 |
| pgbuf_Pool.flushed_bcbs | src/storage/page_buffer.c | 813 |
  • Capacity is rounded up to the next power of two. next_pow2 at lockfree_circular_queue.hpp:409; the constructor asserts (m_capacity & m_index_mask) == 0.
  • BLOCK_FLAG is the high bit of uint64_t. Defined as ((cursor_type) 1) << ((sizeof (cursor_type) * CHAR_BIT) - 1) at :203. On a 64-bit cursor, that’s 0x8000_0000_0000_0000.
  • The cursor-bump CAS is compare_exchange_strong. The comment at :443 notes “I tested, no performance difference from using one or the other,” indicating the choice is defensive rather than essential.
  • Per-slot block flags are initialized to the slot index. init_blocked_cursors at :498-507. The first generation produces against flag = i, sets BLOCK_FLAG, then unblocks with flag = i + capacity.
  • consume does not wait on a blocked slot. When is_blocked (cc) is true, consume returns false immediately at :349. This is the documented “first item not yet produced” early-out.
  • force_produce is a yield-loop, no backoff. :301-307.
  • The queue does not use the tran::* reclamation framework. No tran::reclaimable_node, no descriptor bracketing, no retire ring. Slots are pre-allocated in m_data.
  • 9 active in-engine consumers. Listed in the §“Consumers” table; verified via grep -rn 'lockfree::circular_queue'.
  1. Has the preempted-producer scenario been seen in production? The source comment names it as a known weakness. Investigation: search the issue tracker for stuck-vacuum or stuck-page-buffer reports under stress; if none, the back-of-queue reuse pattern empirically dominates the worst case.
  2. Should force_produce add backoff? std::this_thread::yield on a tight full queue is surprisingly cheap on modern Linux (sched_yield is close to a no-op if no other thread is runnable on the CPU), but exponential backoff would still be cheaper under sustained pressure — see the sketch after this list. No measurement campaign exists.
  3. Why is the consumer’s tentative-copy pattern preferred over a swap-then-CAS? The current consumer copies the payload before the cursor CAS; if the CAS fails, the tentative copy is wasted. A swap-pattern (CAS first, copy second) would require sentinels to mark “consumed but slot still occupied,” which has its own complexity. The current design’s documented contract — caller must not trust element on false return — is a deliberate simplification.
  4. DEBUG_LFCQ activation pattern. The trace is per-thread (local_history) and the comment says the caller should keep its history “in thread context to be inspected on demand.” Has anyone exercised the trace recently? The dual produce/consume code paths are a maintenance hazard if the production paths drift from the debug paths without a sync.
  5. 64-bit cursor wrap. At 1 billion enqueues per second (well beyond any current load), a uint64_t cursor wraps in ~580 years. Practically infinite. The cursor-arithmetic tests (pc <= cc, cc + capacity <= pc) would misbehave at wrap. No mitigation; correctly considered out-of-scope.
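
For question 2, a hedged sketch of what a backoff-enabled force_produce wrapper could look like (not in the source; the thresholds and sleep quanta are illustrative only):

// Spin / yield / sleep backoff wrapper around the non-blocking produce.
#include <algorithm>
#include <chrono>
#include <thread>

template <class Queue, class T>
void force_produce_with_backoff (Queue &queue, const T &element)
{
  unsigned attempt = 0;
  while (!queue.produce (element))
    {
      if (attempt < 16)
        {
          // brief busy retry; cheapest when a consumer is about to free a slot
        }
      else if (attempt < 64)
        {
          std::this_thread::yield ();                       // cede the time slice
        }
      else
        {
          unsigned shift = std::min (attempt - 64u, 10u);   // cap the sleep at ~1 ms
          std::this_thread::sleep_for (std::chrono::microseconds (1u << shift));
        }
      attempt++;
    }
}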

Beyond CUBRID — Comparative Designs & Research Frontiers

  • Vyukov MPMC queue. The cleanest production MPMC ring today: per-slot sequence number (no separate block flag), fetch-and-add cursors, no producer-stall hazard. A direct comparison on CUBRID’s vacuum and page-buffer paths would tell us whether the conversion buys throughput in proportion to the migration cost.
  • LMAX Disruptor. Adds batched claim (a producer can claim N consecutive slots in one CAS) and waiting strategy policies (BlockingWait, BusySpin, Yielding, LiteBlocking). CUBRID’s force_produce is hard-coded to yielding; making it pluggable would let DWB use blocking and page-buffer victim hand-off use busy-spin.
  • Folly’s ProducerConsumerQueue (SPSC, lock-free) and MPMCQueue. Production-grade with extensive measurement history. CUBRID’s queue is in spirit similar; the API differences are minor (tryEnqueue, tryDequeue with && perfect-forwarded payloads).
  • Eliminate the producer-stall hazard. The source comment’s proposed redesign — separate slot reservation from cursor publication — is exactly Vyukov’s algorithm. Adopting it would let the queue be wait-free on the producer side.
  • std::atomic<T> with explicit memory ordering. Today every cursor and flag access is seq_cst. The producer’s store-payload-then-unblock is naturally an acquire/release pair; relaxing to that pair might reduce barrier overhead on weak-memory architectures.
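
A minimal sketch of that acquire/release pairing (illustrative only — the CUBRID source keeps the sequentially consistent defaults, and the standalone globals here stand in for a single slot):

// Release on unblock, acquire on the consumer's flag check.
#include <atomic>
#include <cstdint>

std::atomic<std::uint64_t> slot_flag;   // per-slot flag word
int slot_payload;                       // per-slot data (non-atomic)

void producer_unblock (std::uint64_t cursor, std::uint64_t capacity, int value)
{
  slot_payload = value;                                              // plain store of the payload
  slot_flag.store (cursor + capacity, std::memory_order_release);    // publish: payload happens-before this store
}

bool consumer_read (std::uint64_t cursor, std::uint64_t capacity, int &out)
{
  if (slot_flag.load (std::memory_order_acquire) != cursor + capacity)
    return false;                       // still blocked or not yet produced
  out = slot_payload;                   // safe: acquire pairs with the producer's release
  return true;
}
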
  • Source files (in the CUBRID source tree at /data/hgryoo/references/cubrid/):
    • src/base/lockfree_circular_queue.hpp (the entire implementation; no .cpp companion)
    • src/base/base_flag.hpp (the flag<T> helper used to encode the block bit)
  • Consumer source files:
    • src/query/vacuum.c
    • src/transaction/log_manager.c, src/transaction/log_impl.h
    • src/storage/page_buffer.c
  • Textbook chapters cited:
    • Herlihy & Shavit, The Art of Multiprocessor Programming, ch. 13 §“Concurrent Queues” — bounded vs. unbounded, MS queue, fetch-and-add ring.
  • Foundational papers:
  • Sibling docs in this repo:
    • cubrid-lockfree-overview.md — umbrella.
    • cubrid-vacuum.md (if curated) — primary consumer for vacuum_Block_data_buffer and vacuum_Finished_job_queue.
    • cubrid-page-buffer-manager.md — consumer for the waiter and victim queues.