CUBRID Thread and Worker Pool — Workers, Daemons, Lock-Free Primitives, and Critical Sections
Contents:
- Theoretical Background
- Common DBMS Design
- CUBRID’s Approach
- Source Walkthrough
- Cross-check Notes
- Open Questions
- Sources
Theoretical Background
Every database engine is, at its lowest layer, a thread system: a way of turning a stream of incoming requests and a set of recurring background duties into actual operating-system threads, and a way of giving each of those threads enough state to do its job without paying for that state on every call. The CUBRID server is no different. Before the lock manager, before MVCC, before recovery, there is a thread pool, a per-thread context, a set of long-running background daemons, and a small set of synchronization primitives those higher layers reuse.
Two textbooks frame the design space.
Database Internals (Petrov, Ch. 14 “Concurrency Control” and the storage-layer chapters) names the basic split: the engine has worker threads, which exist to run user-driven units of work (connections, queries, scans, page reads), and daemon threads, which exist to do recurring engine-driven work (flushing dirty pages, trimming WAL, detecting deadlocks, vacuuming dead versions). The lifetimes are different — a worker is born to do one task and is recycled to do another, while a daemon is born once and runs until shutdown. The synchronisation requirements are also different — a worker is usually woken by a request arriving on a queue, while a daemon is usually woken by a clock or a wakeup hint from another thread.
The Art of Multiprocessor Programming (Herlihy & Shavit) names the
algorithmic problem the thread system has to solve as soon as the
worker count grows past a handful: lock-free data structures for
the queues, free-lists, and hash tables those workers and daemons
share. A naive design with a single mutex around the task queue or
the lock table serializes every thread on the critical line of code
covered by that mutex, and the engine stops scaling. The book gives
the canonical patterns — the Treiber stack (CAS-based push/pop), the
Michael–Scott queue (lock-free FIFO), and lock-free hash tables built
from lock-free linked lists (Maged Michael’s list-based sets, the
Shalev–Shavit split-ordered list). Every modern engine that pretends
to scale picks some subset of these and uses them everywhere a thread
might wait.
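To make the CAS-retry pattern concrete, here is a minimal Treiber-style stack — illustrative only, not CUBRID code; a production version (like the lock-free structures described later in this document) also needs a safe memory-reclamation scheme (epochs or hazard pointers) before the delete in pop is safe under concurrent access.

```cpp
// Minimal Treiber stack: the CAS-retry pattern named above (illustrative sketch).
#include <atomic>

template <typename T>
class treiber_stack
{
  public:
    void push (T value)
    {
      node *n = new node { std::move (value), m_top.load (std::memory_order_relaxed) };
      // retry until the CAS installs n as the new top; on failure n->next is
      // refreshed with the current top and we try again
      while (!m_top.compare_exchange_weak (n->next, n,
                                           std::memory_order_release,
                                           std::memory_order_relaxed))
        ;
    }

    bool pop (T &out)
    {
      node *old_top = m_top.load (std::memory_order_acquire);
      while (old_top != nullptr
             && !m_top.compare_exchange_weak (old_top, old_top->next,
                                              std::memory_order_acquire,
                                              std::memory_order_acquire))
        ;   // old_top refreshed; retry
      if (old_top == nullptr)
        {
          return false;               // stack was empty
        }
      out = std::move (old_top->value);
      delete old_top;                 // unsafe without a reclamation scheme; see note above
      return true;
    }

  private:
    struct node
    {
      T value;
      node *next;
    };
    std::atomic<node *> m_top { nullptr };
};
```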
These two framings give the four implementation choices that shape every real engine and frame the rest of this document:
-
Thread-per-connection or thread-pool? The classical approach (PostgreSQL, the original C MySQL) is one OS process or thread per connection, no pool — the OS scheduler is the work queue. The pooled approach (MySQL’s thread-pool plugin, Oracle’s MTS, SQL Server’s SOS scheduler, CUBRID) decouples connections from threads: a connection arrives at a queue, and a worker picks it up. The trade-off is overhead vs isolation: pools amortise context creation and bound concurrency, but require their own scheduler and their own per-thread context to fake what thread-per-connection got from the OS for free.
-
One queue or many? A single global task queue is simple and correct but contended. The standard refinement is partitioned queues: split the workers into groups (call them cores, partitions, shards), each with its own task queue, and pick a queue per task by hash, round-robin, or NUMA proximity. The cost is uneven load (one shard may starve while another piles up); the gain is the lock on each shard’s queue is rarely contended. CUBRID picks partitioned queues with a round-robin dispatch.
-
Pool the threads or pool the entries? When a worker finishes its task, two things can be retired: the OS thread itself, and the per-thread context (the bag of state the engine code reads and writes via
thread_p). Some engines pool both (always-alive workers, never destroy contexts). Some pool only the contexts and let the OS thread come and go on demand. CUBRID does the latter by default — the context (the cubthread::entry) is dispatched from a fixed-size pool, and the OS thread is created when a task is assigned to a worker and exits after an idle timeout when no new task arrives. A tunable (pool_threads, plus a perf-test override) keeps threads alive forever when needed.
-
What synchronization primitive guards the slow paths? Every thread system has at least three layers of sync primitives. At the bottom, a pthread_mutex_t protects single critical lines (a queue head, a free list). One layer up, a reader/writer primitive lets many readers in while excluding writers (the classic textbook RW-lock). At the top, lock-free structures replace locks entirely on the hot path. CUBRID has all three: pthread_mutex_t everywhere, a custom SYNC_CRITICAL_SECTION RW primitive (the csect) for slow heavyweight gates like the transaction table, and lockfree::hashmap plus lockfree::circular_queue (in the base module) for the hot data structures.
After these four choices are named, every CUBRID-specific structure in this document either implements one of them or makes the chosen implementation faster.
Common DBMS Design
Every DBMS that supports concurrent connections ships some form of thread system, and the shapes converge on a small handful of patterns. CUBRID’s specific choices in the next section are best read as one set of dials within this shared design space.
Thread-per-connection (PostgreSQL)
PostgreSQL’s core uses a process-per-connection model, not a thread
pool. The postmaster forks a backend per connection; the backend
holds a MyProc and a PGPROC shared-memory slot until the
connection closes. There are background processes (autovacuum
launcher + workers, wal-writer, bgwriter, checkpointer, archiver,
replication walsender/walreceiver, logical replication apply
workers) but they are full processes, registered and started by the
postmaster, not threads of a pool. The result is excellent
isolation (a corrupted backend cannot corrupt others) at a cost in
per-connection memory and IPC overhead. Connection pooling is
delegated to external middleware (PgBouncer, pgpool).
Pooled threads (MySQL, MariaDB, Oracle MTS, SQL Server)
Commercial and high-fanout DBMSes pool threads for the obvious reason that the OS cannot economically dispatch tens of thousands of connections each as a thread.
- MySQL Enterprise / MariaDB thread pool assigns connections to thread groups, each with its own queue and its own waiter and listener threads. The number of groups is roughly the CPU count; each group runs one or two active threads.
- Oracle Multi-Threaded Server (MTS / shared-server architecture) multiplexes user processes onto a smaller pool of server processes via dispatchers. Dedicated-server mode bypasses the pool.
- SQL Server SOS (SQL OS) is its own user-space scheduler with workers, schedulers (one per logical CPU), tasks, and task queues. Workers are cooperatively scheduled on top of OS threads, with an optional fiber ("lightweight pooling") mode.
CUBRID sits squarely in this camp: a CUBRID server is a single multi-threaded process, with a thread-entry pool, a partitioned worker pool, and a daemon set.
Daemon set (everyone)
Independent of the worker model, every engine ships a small set of named long-running background threads. The names converge:
- Page-flush / bgwriter / lazy writer — picks dirty pages and writes them so the buffer pool always has clean victims.
- Log-flush / wal-writer — fsyncs WAL up to the latest committed LSN to enforce the WAL rule.
- Checkpoint — periodically flushes the buffer pool and writes a checkpoint log record.
- Archive / log-mover — ships old WAL out of the active range.
- Vacuum / purge / clean-up — reclaims dead MVCC versions (PostgreSQL autovacuum, InnoDB purge, CUBRID vacuum).
- Deadlock detector — periodically walks the wait-for graph.
- Stats collector — periodically samples activity counters.
Every CUBRID daemon listed in §“Daemon consumers” matches one of these.
Custom RW-lock and tracker
When a single shared resource is read constantly and written
rarely (the transaction table, the schema cache, the disk
allocation map), a pthread_rwlock_t is the textbook answer. In
practice, every serious engine ships its own RW primitive that
adds: writer queue (so writers don’t starve under reader load),
promotion (a reader upgrading to a writer without releasing first),
demotion (a writer downgrading to reader), per-section identity
(so deadlock-cycle detection can name the section), and a
tracker that, in debug builds, records which sections each
thread holds and complains on inconsistent re-entry. This is
exactly CUBRID’s SYNC_CRITICAL_SECTION plus its
critical_section_tracker.
Lock-free shared structures
The two consistently lock-free structures in modern engines are:
- A lock-free hash map for the lock table, the page table, the schema cache, the lock-free allocator buckets. The standard shape is a hash table built from lock-free linked lists — Maged Michael’s list-based sets and the Shalev–Shavit split-ordered list — which is what CUBRID’s lockfree::hashmap implements.
- A lock-free circular buffer / MPSC queue for handoff between workers and a single consumer (a flusher, a vacuumer). CUBRID’s lockfree::circular_queue (in src/base/) is used inside the page buffer for direct victim handoff.
CUBRID exposes the lock-free hash map through a thread-aware
wrapper (cubthread::lockfree_hashmap) that ties hash-map
transactions to per-thread descriptors carried inside
cubthread::entry.
CUBRID’s Approach
CUBRID compiles three different binaries from the same src/thread/
sources via preprocessor guards: SERVER_MODE (the cub_server
process), SA_MODE (standalone, where client and server are linked
into the same binary), and CS_MODE (client-side library). The
thread machinery is fully active only in SERVER_MODE. In
SA_MODE, cubthread::manager is constructed but worker pools and
daemons are not — instead, push_task executes the task synchronously
on the calling thread. This is the first detail worth carrying into
the rest of the document: every worker-pool path you see is also a
synchronous path in standalone, which simply skips the pool dispatch.
```cpp
// manager::push_task — thread/thread_manager.cpp
void
manager::push_task (worker_pool *worker_pool_arg, entry_task *exec_p)
{
  if (worker_pool_arg == NULL)
    {
      // execute on this thread
      exec_p->execute (get_entry ());
      exec_p->retire ();
    }
  else
    {
#if defined (SERVER_MODE)
      check_not_single_thread ();
      worker_pool_arg->execute (exec_p);
#else
      // ... fall back to immediate execute ...
#endif
    }
}
```

The rest of this section walks the architecture top-down: per-thread
context first (cubthread::entry), then the manager that owns the
context pool, then the worker pool template parameterised by it, then
the daemon class that uses the same context pool but a different
loop pattern, then the lock-free hashmap that those threads call
into, and finally the critical-section primitive layered on top of
pthread_mutex_t for slow shared resources.
cubthread::entry — the per-thread context
cubthread::entry is the single most important struct in the
server. Every engine routine that needs to know “who is calling me
and what state are they in” takes a THREAD_ENTRY *thread_p
(typedef alias for cubthread::entry *) as the first parameter.
The entry carries everything that, in a thread-per-connection
engine, would be implicit in the OS thread’s TLS:
- Identity — index (slot in the manager’s array, 1-based), type (worker, daemon, vacuum master, recovery, …), m_id (std::thread::id), client_id (which session is this thread serving), tran_index (which transaction descriptor — TDES — is bound), private_lru_index (which page-buffer private LRU list this thread owns when private quotas are enabled).
- Status & wait reason — an FSM with TS_DEAD, TS_FREE, TS_RUN, TS_WAIT, TS_CHECK, plus a resume_status enumerating the reason a TS_WAIT was entered (THREAD_LOCK_SUSPENDED, THREAD_PGBUF_SUSPENDED, THREAD_LOGWR_SUSPENDED, …). The reason is the protocol with whoever is going to wake the thread up: when the lock manager grants a lock, it passes THREAD_LOCK_RESUMED to thread_wakeup so the waiter can verify it was woken for the right reason and not by a spurious signal.
- Synchronization — a per-entry pthread_mutex_t th_entry_lock plus a pthread_cond_t wakeup_cond. Every blocking primitive in the server (latch, lock, csect, log writer) ultimately calls thread_suspend_wakeup_and_unlock_entry which does pthread_cond_wait on this pair. There is no futex-style fast path; every wait goes through the entry’s condition variable.
- Allocator caches — private_heap_id, an HL_HEAPID into the thread-private heap; log_zip_undo / log_zip_redo / log_data_ptr, the per-thread compression buffers reused across log appends; tran_entries[THREAD_TS_*], an array of lock-free transaction descriptors, one per shared lock-free table (sessions, catalog, lock-res, lock-ent, xcache, fpcache, hfid table, dwb-slots, …) used to coordinate with the lock-free hashmap’s epoch / hazard-pointer scheme.
- Trackers — m_alloc_tracker, m_pgbuf_tracker, m_csect_tracker. Active only in debug builds with ENABLE_TRACKERS = !NDEBUG && SERVER_MODE. They record every malloc, every page fix, every csect enter so the engine can, on transaction end, assert that the thread released everything it grabbed.
- Per-subsystem state — vacuum_worker (a pointer to vacuum worker info if this entry is bound to a vacuum thread), xasl_unpack_info_ptr (the unpacked XASL for the currently executing query), lockwait / lockwait_msecs / lockwait_state (the “I am waiting on object X for at most N ms” tuple read by the deadlock detector), event_stats (slow query timing), and a thicket of flags — interrupted, shutdown (atomic), check_interrupt, no_logging, is_cdc_daemon, trigger_involved.
```cpp
// cubthread::entry — thread/thread_entry.hpp
class entry
{
  public:
    enum class status { TS_DEAD, TS_FREE, TS_RUN, TS_WAIT, TS_CHECK };

    int index;                          // thread entry index
    thread_type type;                   // worker, daemon, vacuum, recovery, …
    int client_id;                      // client whose request this thread runs
    int tran_index;                     // bound transaction descriptor
    int private_lru_index;
    pthread_mutex_t th_entry_lock;
    pthread_cond_t wakeup_cond;
    status m_status;
    thread_resume_suspend_status resume_status;

    HL_HEAPID private_heap_id;          // private allocator cache
    css_conn_entry *conn_entry;         // network connection
    xasl_unpack_info *xasl_unpack_info_ptr;
    vacuum_worker *vacuum_worker;
    lf_tran_entry *tran_entries[THREAD_TS_COUNT];   // lock-free txn descriptors
    pgbuf_holder_anchor *m_holder_anchor;           // page-fix list head

    std::atomic_bool shutdown;
    bool interrupted;
    bool check_interrupt;
    bool wait_for_latch_promote;
    // ... condensed ...

  private:
    cuberr::context m_error;
    cubbase::alloc_tracker &m_alloc_tracker;
    cubbase::pgbuf_tracker &m_pgbuf_tracker;
    cubsync::critical_section_tracker &m_csect_tracker;
    log_system_tdes *m_systdes;
    lockfree::tran::index m_lf_tran_index;
};
```

The “private members will gradually replace public” comment in the
header is genuine: the project is mid-refactor away from a C-era
THREAD_ENTRY struct of public fields, and the public fields above
are intentional liabilities. New code that needs to read or write
entry state should prefer accessor methods (get_error_context,
get_lf_tran_index, get_pgbuf_tracker, …) over direct field
access.
The constructor allocates one private heap, two pthread mutexes, one condition variable, and the three trackers — these are the costs the manager wants to amortise across many tasks, which is the whole reason entries are pooled.
Suspend/wake protocol
Every blocking primitive in the server reduces to two C-style functions on the entry’s mutex+condvar pair. Reading them once makes the rest of the engine readable:
```cpp
// thread_suspend_wakeup_and_unlock_entry — thread/thread_entry.cpp
void
thread_suspend_wakeup_and_unlock_entry (cubthread::entry *thread_p,
                                        thread_resume_suspend_status suspended_reason)
{
  // entry's th_entry_lock must be held by caller
  thread_p->m_status = cubthread::entry::status::TS_WAIT;
  thread_p->resume_status = suspended_reason;
  // ... slow-query timing prelude ...

  pthread_cond_wait (&thread_p->wakeup_cond, &thread_p->th_entry_lock);

  thread_p->m_status = old_status;
  pthread_mutex_unlock (&thread_p->th_entry_lock);
}

// thread_wakeup_internal — thread/thread_entry.cpp
static void
thread_wakeup_internal (cubthread::entry *thread_p, thread_resume_suspend_status resume_reason,
                        bool had_mutex)
{
  if (!had_mutex)
    {
      thread_lock_entry (thread_p);
    }
  pthread_cond_signal (&thread_p->wakeup_cond);
  thread_p->resume_status = resume_reason;
  if (!had_mutex)
    {
      thread_unlock_entry (thread_p);
    }
}
```

The contract: a thread that wants to sleep on its own entry first
locks th_entry_lock, sets resume_status to a suspend-reason, and
calls cond_wait. A waker — typically holding another mutex (the
lock-table mutex, the page-buffer mutex) — calls thread_wakeup
with the matching resume-reason; the woken thread compares
resume_status against the suspend-reason it expected and either
proceeds or treats the wake as spurious. The _already_had_mutex
variant skips the lock since the caller still holds it (typical when
both the suspend and the wake are coordinated under a higher-level
mutex).
This is also how the timeout variants work. thread_suspend_with_other_mutex
takes an external mutex and lets the caller wait on the entry’s
condvar but under that external mutex — the lock manager uses this so
“hold the lock-table mutex, suspend on my entry, release lock-table
mutex” is atomic across the suspend.
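As a schematic illustration (not a verbatim CUBRID call site), a waiter/waker pair built on the functions above might look like the sketch below; the function names wait_for_grant / grant_to and the surrounding locking discipline are placeholders, while the suspend/wake calls and the resume-reason check follow the contract just described.

```cpp
// Schematic waiter/waker pair using the documented suspend/wake API.
// "wait_for_grant" and "grant_to" are illustrative names, not CUBRID symbols.

// Waiter side (e.g. a worker that must block until a lock is granted):
void
wait_for_grant (cubthread::entry *thread_p)
{
  thread_lock_entry (thread_p);                           // take th_entry_lock
  // publish "I am waiting" state under the entry lock, then sleep
  thread_suspend_wakeup_and_unlock_entry (thread_p, THREAD_LOCK_SUSPENDED);

  // woken up: verify we were woken for the expected reason
  if (thread_p->resume_status != THREAD_LOCK_RESUMED)
    {
      // spurious or interrupt wakeup — caller decides whether to retry or abort
    }
}

// Waker side (e.g. the lock manager granting a lock while holding its own mutex):
void
grant_to (cubthread::entry *waiter_p)
{
  thread_wakeup (waiter_p, THREAD_LOCK_RESUMED);          // signal waiter's wakeup_cond
}
```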
cubthread::manager — orchestrator
The manager is a singleton (Manager static, accessed via
cubthread::get_manager ()) that owns the entry pool, the worker
pool list, the daemon list, and the lock-free transaction system.
It is allocated by cubthread::initialize, set up in
initialize_thread_entries, and torn down by cubthread::finalize.
```cpp
// cubthread::manager — thread/thread_manager.hpp
class manager
{
    // entries
    entry *m_all_entries;
    entry_dispatcher *m_entry_dispatcher;     // resource_shared_pool<entry>
    std::size_t m_max_threads;
    std::size_t m_available_entries_count;

    // pools and daemons
    std::vector<worker_pool *> m_worker_pools;
    std::vector<daemon *> m_daemons;
    std::vector<daemon *> m_daemons_without_entries;

    entry_manager m_entry_manager;
    daemon_entry_manager m_daemon_entry_manager;

    // lock-free transaction system
    lockfree::tran::system *m_lf_tran_sys;

    // ...
    template <typename Res, typename ... CtArgs>
    Res *create_worker_pool (std::size_t pool_size, std::size_t core_count, CtArgs &&... args);
    daemon *create_daemon (const looper &, entry_task *, const char *name, entry_manager * = NULL);
    daemon *create_daemon_without_entry (const looper &, task_without_context *, const char *);
    entry *claim_entry (void);
    void retire_entry (entry &);
};
```

The size of the entry pool is not a knob — it is computed from
the count of registered consumers using a count_registry global:
```cpp
// manager::set_max_thread_count_from_config — thread/thread_manager.cpp
void
manager::set_max_thread_count_from_config (void)
{
  m_max_threads = cubbase::count_registry<connection>::total ()
                  + cubbase::count_registry<worker_pool>::total ()
                  + cubbase::count_registry<daemon>::total ()
                  + 1 /* PAD */;
}
```

count_registry<T>::total() aggregates registrations made by the
REGISTER_CONNECTION, REGISTER_WORKERPOOL, and REGISTER_DAEMON
macros that each subsystem invokes at file scope. The macros do not
register the quantity of threads; they register the count of
named callers. So adding a new daemon or pool requires a
REGISTER_* macro near the file’s top so the manager can size the
entry pool, otherwise create_worker_pool will return NULL when
“reserve N entries” exceeds m_available_entries_count.
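For example (an illustrative file-scope registration — my_subsys is a hypothetical name, not an existing CUBRID subsystem):

```cpp
// Illustrative file-scope registration; "my_subsys" is hypothetical.
// Without it, set_max_thread_count_from_config () does not count the daemon,
// and create_daemon () can fail because no entry was reserved for it.
REGISTER_DAEMON (my_subsys);
```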
```cpp
// macros — thread/thread_manager.hpp
#define REGISTER_CONNECTION(name, getter) static cubthread::manager::connection_registry_t _gl_reg_conn_##name (#name, getter)
#define REGISTER_WORKERPOOL(name, getter) static cubthread::manager::workerpool_registry_t _gl_reg_wp_##name (#name, getter)
#define REGISTER_DAEMON(name) static cubthread::manager::daemon_registry_t _gl_reg_daemon_##name (#name, 1)
```

Once entries exist, they are dispatched to OS threads on demand.
claim_entry pops an entry from resource_shared_pool and stores
it in a thread-local pointer tl_Entry_p; retire_entry puts it
back. Every server thread starts by calling claim_entry
(typically inside a worker pool’s init_run or a daemon’s
loop_with_context), gets back its entry &, and then runs.
```cpp
// manager::claim_entry / retire_entry — thread/thread_manager.cpp
entry *
manager::claim_entry (void)
{
  tl_Entry_p = m_entry_dispatcher->claim ();
  return tl_Entry_p;
}

void
manager::retire_entry (entry &entry_p)
{
  assert (tl_Entry_p == &entry_p);
  tl_Entry_p = NULL;
  m_entry_dispatcher->retire (entry_p);
}
```

The entry_manager (and its derivatives daemon_entry_manager,
system_worker_entry_manager) is the policy object the worker pool
calls into on each create_context / retire_context / recycle_context / stop_execution. It is the customisation hook by which subsystems
attach their per-thread state to a generic worker. For example,
vacuum’s vacuum_Worker_entry_manager overrides on_create to
attach a vacuum_worker * to the entry; pl/sp’s
m_monitor_helper_daemon uses a custom entry manager to wire up the
JNI bridge’s class loaders.
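As an illustration of that hook, a subsystem-specific entry manager could look roughly like the sketch below; the class name, the exact hook signature, and the attached state are assumptions — compare with the real entry_manager declaration and the vacuum_Worker_entry_manager mentioned above before copying.

```cpp
// Sketch of a subsystem-specific entry manager (hook name and signature assumed).
class my_subsys_entry_manager : public cubthread::entry_manager
{
  protected:
    // called when a worker claims its entry; attach per-thread subsystem state here
    void on_create (cubthread::entry &context) override
    {
      // e.g. vacuum attaches its per-worker descriptor roughly like:
      // context.vacuum_worker = allocate_my_subsys_worker_state ();
    }
};
```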
cubthread::worker_pool — partitioned pool with per-core queues
The worker pool is the engine’s only way of running an
arbitrary-but-bounded number of entry_task instances in parallel
without spawning a thread per task. It is a template
worker_pool_impl<bool Stats> so that pools that do not need
statistics (typical) and pools that do (the global per-transaction
worker pool) compile to different code with no runtime overhead.
The base class worker_pool is the runtime polymorphic interface
the manager and push_task see.
Pool → Core → Worker → Task
The pool is partitioned into cores (typically 1 for low-volume pools, several for high-volume pools), and each core owns a fixed slice of the pool’s workers and one task queue.
```
            worker_pool
            /    |    \
         core   core   core        (m_cores)
         / |      |     | \
    worker worker ... worker        (core::m_workers)
                                    (core::m_available_workers)
                                    (core::m_task_queue)
```

execute_on_core picks a core by hash (round-robin by default,
custom by argument) and calls core_impl::execute_task. The hot
path is short: take m_workers_mutex, pop an idle worker if any,
and assign the task to that worker. Otherwise enqueue.
```cpp
// core_impl::execute_task — thread/thread_worker_pool_impl.hpp
void
worker_pool_impl<Stats>::core_impl::execute_task (task_type *task_p, bool is_temp)
{
  if (!m_parent_pool->is_running ())
    {
      task_p->retire ();
      return;
    }

  wrapped_task task_ref (task_p);
  std::unique_lock<std::mutex> ulock (m_workers_mutex);

  if (!m_available_workers.empty ())
    {
      worker_impl *refp = static_cast<worker_impl *> (m_available_workers.back ());
      m_available_workers.pop_back ();
      ulock.unlock ();
      refp->assign_task (std::move (task_ref));       // wake or spawn
    }
  else if (is_temp)
    {
      ulock.unlock ();
      execute_task_as_temp (std::move (task_ref));    // spawn an ad-hoc temp worker
    }
  else
    {
      m_task_queue.push (std::move (task_ref));       // enqueue
    }
}
```

A worker, after finishing its current task, calls
get_task_or_become_available — a single critical section that
either pops a queued task or registers the worker as available:
```cpp
// core_impl::get_task_or_become_available — thread/thread_worker_pool_impl.hpp
std::optional<wrapped_task>
worker_pool_impl<Stats>::core_impl::get_task_or_become_available (worker &w)
{
  std::unique_lock<std::mutex> ulock (m_workers_mutex);
  if (!m_task_queue.empty ())
    {
      wrapped_task qt = std::move (m_task_queue.front ());
      m_task_queue.pop ();
      return std::optional<wrapped_task> (std::in_place, std::move (qt));
    }
  m_available_workers.push_back (&w);
  return std::nullopt;
}
```

So a worker is exactly in one of two states: holding a task and
running it, or in m_available_workers waiting to be assigned one.
The mutex guards both lists and serializes the assign vs available
race. There is no separate idle list; m_available_workers is the
idle list.
Worker thread lifecycle
A worker is a persistent C++ object inside the pool; what comes and
goes is the OS thread. assign_task either notifies the existing
thread (if the worker has one — m_has_thread) or spawns one fresh:
```cpp
// worker_impl::assign_task — thread/thread_worker_pool_impl.hpp
void
worker_pool_impl<Stats>::core_impl::worker_impl::assign_task (wrapped_task &&task_ref)
{
  std::unique_lock<std::mutex> ulock (m_task_mutex);
  m_wrapped_task.emplace (std::move (task_ref));

  if (m_is_temp)
    {
      m_has_thread = true;
      start_thread ();
      return;
    }
  if (m_has_thread)
    {
      ulock.unlock ();
      m_task_cv.notify_one ();
    }
  else
    {
      m_has_thread = true;
      ulock.unlock ();
      start_thread ();      // std::thread (&worker_impl::run, this).detach ()
    }
}
```

The detached thread runs worker_impl::run, which is:
```cpp
// worker_impl::run — thread/thread_worker_pool_impl.hpp
void
worker_pool_impl<Stats>::core_impl::worker_impl::run (void)
{
  os::resources::cpu::clearaffinity ();
  pthread_setname_np (pthread_self (), m_parent_core->get_parent_pool ()->get_name ().c_str ());
  init_run ();        // claim entry (context)

  if (m_is_temp)
    {
      execute_current_task ();
      finish_run ();
      return;
    }

  // loop: execute current task, then try to grab another
  if (!m_wrapped_task.has_value ())
    {
      if (get_new_task ())
        {
          /* got one */
        }
    }
  if (m_wrapped_task.has_value ())
    {
      do
        {
          execute_current_task ();
        }
      while (get_new_task ());
    }
}
```

init_run calls m_parent_core->get_entry_manager().create_context()
which under the hood claims an entry from the manager’s dispatcher and
runs the subsystem-specific on_create hook. finish_run is the
mirror — retire_context releases the entry back. Between them, the
thread loops on tasks; each task gets the same entry * as context,
so a long-running pool effectively keeps its claimed entries pinned.
get_new_task is the part of the worker that decides whether to
keep the OS thread alive:
```cpp
// worker_impl::get_new_task — thread/thread_worker_pool_impl.hpp
bool
worker_pool_impl<Stats>::core_impl::worker_impl::get_new_task (void)
{
  std::unique_lock<std::mutex> ulock (m_task_mutex, std::defer_lock);

  if (!m_stop)
    {
      // 1. queued task? pop it.
      auto qt = static_cast<core_impl *> (m_parent_core)->get_task_or_become_available (*this);
      if (qt.has_value ())
        {
          m_wrapped_task.emplace (std::move (*qt));
          return true;
        }

      // 2. no queued task; we just got added to m_available_workers.
      //    wait on m_task_cv until either a task arrives or idle_timeout fires.
      ulock.lock ();
      if (!m_wrapped_task.has_value () && !m_stop)
        {
          condvar_wait (m_task_cv, ulock, m_parent_core->get_parent_pool ()->get_idle_timeout (),
                        [this] () -> bool { return m_wrapped_task.has_value () || m_stop; });
        }
    }
  // ...
  if (!m_wrapped_task.has_value ())
    {
      // timed out; thread will exit. next assign_task spawns fresh.
      m_has_thread = false;
      finish_run ();      // retire context now
      return false;
    }
  // got a task, keep going
  return true;
}
```

The idle timeout default is 5 seconds (thread_create_worker_pool’s
default in the manager), or infinity when pool_threads=true is
passed (or perf-test mode forces it via
wp_set_force_thread_always_alive ()). When a worker’s thread times
out, the worker stays in the pool but with m_has_thread = false;
the next assign_task starts a new OS thread. This is the
“low-load: retire threads, high-load: keep them alive” knob.
Task wrapper, statistics
wrapped_task<Stats> wraps the task pointer with optional timing.
On Stats=true it records cubperf::time_point at construction and
the worker tracks per-stage timing (start_thread, create_context,
execute_task, retire_task, found_in_queue, wakeup_with_task,
recycle_context, retire_context). On Stats=false the wrapper is
essentially a task* and if constexpr (Stats) discards the
counters at compile time.
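The mechanism is the ordinary if constexpr idiom; a stand-alone sketch of the same compile-time switch (not the actual wrapped_task code) is:

```cpp
// Minimal stand-alone illustration of a Stats compile-time switch
// (not the real wrapped_task — just the idiom it relies on).
#include <chrono>

template <bool Stats>
struct timed_call
{
  template <typename Fn>
  void operator() (Fn &&fn)
  {
    if constexpr (Stats)
      {
        auto start = std::chrono::steady_clock::now ();
        fn ();
        m_last_duration = std::chrono::steady_clock::now () - start;    // kept only when Stats
      }
    else
      {
        fn ();      // timing code does not even exist in this instantiation
      }
  }

  // member kept in both instantiations for brevity; the real code
  // conditionally sizes its counters instead
  std::chrono::steady_clock::duration m_last_duration {};
};
```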
Task-cap (rate limiting)
Some pools must not be flooded — if the page-flush daemon enqueues
faster than workers can flush, the queue grows without bound and
backpressure breaks. cubthread::worker_pool_task_capper
(thread_worker_pool_taskcap.hpp) wraps a pool with a fixed token
budget. try_task returns false if budget exhausted; push_task
blocks on a condition variable until end_task (called from the
capped_task::execute epilogue) signals a free slot. This is how
recovery’s parallel redo and the page-flush dispatcher avoid
unbounded buildup.
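The underlying idea is a counting budget behind a mutex and condition variable; a generic sketch of that handshake (not the worker_pool_task_capper source) is:

```cpp
// Generic token-budget capper illustrating the push/end handshake described above.
#include <condition_variable>
#include <mutex>

class task_budget
{
  public:
    explicit task_budget (std::size_t max_tasks) : m_available (max_tasks) {}

    // non-blocking: claim a slot if one is free (like try_task)
    bool try_claim ()
    {
      std::lock_guard<std::mutex> lg (m_mutex);
      if (m_available == 0)
        {
          return false;
        }
      --m_available;
      return true;
    }

    // blocking: wait until a slot frees up, then claim it (like push_task)
    void claim ()
    {
      std::unique_lock<std::mutex> ul (m_mutex);
      m_cv.wait (ul, [this] { return m_available > 0; });
      --m_available;
    }

    // called from the task's epilogue, like end_task in the capper
    void release ()
    {
      {
        std::lock_guard<std::mutex> lg (m_mutex);
        ++m_available;
      }
      m_cv.notify_one ();
    }

  private:
    std::mutex m_mutex;
    std::condition_variable m_cv;
    std::size_t m_available;
};
```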
Pool consumers
Eight pools live in the codebase, found by grepping
thread_create_worker_pool and thread_create_stats_worker_pool:
| Pool name | Subsystem | Created in |
|---|---|---|
| transaction | Connection / query workers (the user-facing pool) | src/connection/server_support.c |
| vacuum | Vacuum workers (per-block log replay) | src/query/vacuum.c |
| parallel-query | Intra-query parallel scan/sort | src/query/parallel/px_worker_manager_global.cpp |
| loaddb | Bulk loader workers | src/loaddb/load_worker_manager.cpp |
| online-index | Online index builder | src/storage/btree_load.c |
| backup-read | Parallel-read backup | src/transaction/log_page_buffer.c |
| recovery-redo | Parallel WAL redo | src/transaction/log_recovery_redo_parallel.cpp |
| (per-method) | Method/SP temp workers | implicit via execute_on_core (..., is_temp=true) |
The transaction pool is the one that runs SELECT and DML — when you
ask “who actually executes my SQL”, the answer is “a worker thread of
the transaction pool”.
flowchart LR
subgraph WP["worker_pool"]
direction TB
Q1["core[0].queue"]
Q2["core[1].queue"]
Q3["core[2].queue"]
A1["available[0]"]
A2["available[1]"]
A3["available[2]"]
end
Push["push_task"] -->|round-robin| RR{"get_next_core"}
RR -->|hash%N| Q1 & Q2 & Q3
Q1 --> Wk1["worker_impl"]
Q2 --> Wk2["worker_impl"]
Q3 --> Wk3["worker_impl"]
Wk1 --> Ex1["execute_current_task"]
Wk2 --> Ex2["execute_current_task"]
Wk3 --> Ex3["execute_current_task"]
Ex1 -->|next loop| Get1["get_task_or_become_available"]
Get1 -->|queued? yes| Wk1
Get1 -->|queued? no| A1
A1 -.idle_timeout.-> Exit1["thread exit, m_has_thread=false"]
cubthread::daemon — single-thread looper
A daemon is one OS thread, one task, one looper. daemon does not
share a pool — each daemon has its own thread, its own waiter, its
own context. The thread is started in the daemon constructor via
std::thread (daemon::loop_with_context, this, ...) and joined in
stop_execution (called from ~daemon).
```cpp
// daemon::loop_with_context — thread/thread_daemon.cpp
void
daemon::loop_with_context (daemon *daemon_arg, entry_manager *entry_manager_arg,
                           entry_task *exec_arg, const char *name)
{
  pthread_setname_np (pthread_self (), name[0] ? name : "unnamed-daemon");

  entry &context = entry_manager_arg->create_context ();   // claim entry, run on_daemon_create
  daemon_arg->register_stat_start ();

  while (!daemon_arg->m_looper.is_stopped ())
    {
      exec_arg->execute (context);              // do the periodic work
      daemon_arg->register_stat_execute ();

      daemon_arg->pause ();                     // sleep per looper policy
      daemon_arg->register_stat_pause ();
    }

  entry_manager_arg->stop_execution (context);
  entry_manager_arg->retire_context (context);
  exec_arg->retire ();
}
```

The looper controls how the daemon waits between iterations
(see next subsection). The waiter is the per-daemon condvar/mutex.
wakeup calls m_waiter.wakeup () so that an external event can
shorten the next wait — the page-flush daemon is woken as soon as
the dirty page count crosses a threshold; the deadlock detector is
woken when a thread has been on a lock wait queue for longer than
the configured threshold.
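Putting the pieces together, registering a new daemon would look roughly like the sketch below; the task class, the looper construction from a fixed period, and the shutdown handling are illustrative assumptions — only create_daemon (looper, entry_task *, name) and execute (entry &) come from the interfaces shown in this document.

```cpp
// Schematic daemon registration (illustrative names; check thread_looper.hpp
// for the real looper constructors before copying).
class my_periodic_task : public cubthread::entry_task
{
  public:
    void execute (cubthread::entry &thread_ref) override
    {
      // recurring engine-driven work goes here, with thread_ref as the context
    }
};

void
start_my_daemon (cubthread::manager &thread_mgr)
{
  // assumption: a looper constructed from a fixed duration selects FIXED_WAITS
  cubthread::looper fixed_period { std::chrono::seconds (1) };
  cubthread::daemon *d = thread_mgr.create_daemon (fixed_period, new my_periodic_task (), "my-daemon");
  // keep d somewhere so it can be destroy_daemon()'d at shutdown
  (void) d;
}
```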
Daemon FSM
Section titled “Daemon FSM”stateDiagram-v2
[*] --> CreateContext
CreateContext --> ExecuteTask: entry_manager.create_context()
ExecuteTask --> Pause: exec->execute(ctx)
Pause --> CheckStopped: looper.put_to_sleep(waiter)
CheckStopped --> ExecuteTask: !looper.is_stopped()
CheckStopped --> RetireContext: looper.is_stopped()
RetireContext --> [*]: entry_manager.retire_context(ctx); exec->retire()
note right of Pause
Wait pattern decided by looper:
- INF_WAITS: cond_wait until wakeup
- FIXED_WAITS: cond_timedwait for fixed period
- INCREASING_WAITS: increasing periods, reset on wakeup
- CUSTOM_WAITS: user-provided period_function
end note
Daemon consumers
Found by grepping create_daemon:
| Daemon name | Subsystem | Looper policy | Created in |
|---|---|---|---|
log-checkpoint | Log manager | Fixed period (checkpoint_interval) | log_manager.c |
log-rm-archive | Log manager | Fixed period | log_manager.c |
log-clock | Log manager | Fixed (1s) | log_manager.c |
ha-delay-check | HA replication | Fixed period | log_manager.c |
log-flush | Log manager | INF (woken by commits) | log_manager.c |
cdc-loginfo-producer | CDC | INF | log_manager.c |
deadlock-detect | Lock manager | Fixed (deadlock_detection_interval) | lock_manager.c |
dwb-flush-block | Double-write buffer | INF (woken on full block) | double_write_buffer.cpp |
dwb-file-sync | Double-write buffer | INF | double_write_buffer.cpp |
pgbuf-maintain | Page buffer | Fixed | page_buffer.c |
pgbuf-page-flush | Page buffer | INF (woken on dirty threshold) | page_buffer.c |
pgbuf-page-post-flush | Page buffer | INF | page_buffer.c |
pgbuf-flush-control | Page buffer | Fixed | page_buffer.c |
session-control | Session | Fixed | session.c |
vacuum-master | Vacuum | Fixed (master scans WAL) | vacuum.c |
pl-monitor | PL/Java server | Fixed | pl_sr.cpp |
That’s the complete daemon set. Cross-reference with
cubrid-vacuum.md (vacuum master + workers),
cubrid-recovery-manager.md (recovery-redo pool, no daemon),
cubrid-page-buffer-manager.md (four pgbuf daemons + dwb daemons).
graph TD
subgraph "Worker pools (consumers)"
TXN["transaction (server_support)"]
VWP["vacuum (vacuum.c)"]
PXQ["parallel-query (parallel/)"]
LDDB["loaddb"]
OIB["online-index (btree_load)"]
BR["backup-read"]
RR["recovery-redo"]
end
subgraph "Daemons (consumers)"
LF["log-flush"]
LCK["log-checkpoint"]
LRA["log-rm-archive"]
DD["deadlock-detect"]
PGM["pgbuf-maintain"]
PGF["pgbuf-page-flush"]
PGPF["pgbuf-page-post-flush"]
PGFC["pgbuf-flush-control"]
DWB1["dwb-flush-block"]
DWB2["dwb-file-sync"]
SC["session-control"]
VM["vacuum-master"]
PLM["pl-monitor"]
HAD["ha-delay-check"]
CDC["cdc-loginfo-producer"]
LCK1["log-clock"]
end
TM["cubthread::manager"]
TM ---|tracks| TXN
TM ---|tracks| VWP
TM ---|tracks| PXQ
TM ---|tracks| LDDB
TM ---|tracks| OIB
TM ---|tracks| BR
TM ---|tracks| RR
TM ---|tracks| LF
TM ---|tracks| LCK
TM ---|tracks| DD
TM ---|tracks| PGF
TM ---|tracks| DWB1
TM ---|tracks| VM
cubthread::looper — wait pattern
The looper is small but central — it is the policy by which a daemon’s “do work, then sleep” loop spaces out iterations. Four patterns:
```cpp
// looper::wait_type — thread/thread_looper.hpp
enum wait_type
{
  INF_WAITS,            // sleep indefinitely until wakeup
  FIXED_WAITS,          // sleep a fixed duration, repeat
  INCREASING_WAITS,     // sleep increasing durations on timeout, reset on wakeup
  CUSTOM_WAITS,         // arbitrary period_function
};
```

put_to_sleep is the one method daemons call:
```cpp
// looper::put_to_sleep — thread/thread_looper.cpp
void
looper::put_to_sleep (waiter &waiter_arg)
{
  if (is_stopped ())
    {
      return;
    }

  bool is_timed_wait = true;
  delta_time period = delta_time (0);
  m_setup_period (is_timed_wait, period);     // policy bound at construction

  if (is_timed_wait)
    {
      // subtract task execution time from the desired period so the daemon
      // ticks at a consistent wall-clock rate, not "execute + period"
      delta_time wait_time = delta_time (0);
      delta_time exec = std::chrono::system_clock::now () - m_start_execution_time;
      if (period > exec)
        {
          wait_time = period - exec;
        }
      m_was_woken_up = waiter_arg.wait_for (wait_time);
    }
  else
    {
      waiter_arg.wait_inf ();
      m_was_woken_up = true;
    }
  m_start_execution_time = std::chrono::system_clock::now ();
}
```

The “subtract execution time” detail matters: a daemon configured with a 1-second period that takes 800ms to do its job will sleep 200ms, not 1000ms, so wall-clock periodicity is preserved.
INCREASING_WAITS is the back-off policy — used for daemons whose
work runs quickly when there’s nothing to do but should not poll
expensively. The vacuum master uses this: scan WAL, find no work,
sleep increasing intervals (e.g. 10ms → 100ms → 1s); on the first
external wakeup hint, reset the index back to 0.
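A generic sketch of the same back-off idea (not the looper source) — climb a ladder of periods on every idle timeout, reset when a wakeup hint arrives:

```cpp
// Generic illustration of the INCREASING_WAITS back-off policy.
#include <array>
#include <chrono>

class backoff_periods
{
  public:
    std::chrono::milliseconds next_period ()
    {
      auto p = m_ladder[m_index];
      if (m_index + 1 < m_ladder.size ())
        {
          ++m_index;          // nothing to do last time: wait longer next time
        }
      return p;
    }

    void reset ()             // called when an external wakeup hint arrives
    {
      m_index = 0;
    }

  private:
    std::array<std::chrono::milliseconds, 3> m_ladder
    {
      std::chrono::milliseconds (10),
      std::chrono::milliseconds (100),
      std::chrono::milliseconds (1000)
    };
    std::size_t m_index = 0;
};
```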
cubthread::lockfree_hashmap — shared hash map facade
CUBRID has two lock-free hash map implementations, kept side by
side: an older lf_hash_table_cpp and a newer
lockfree::hashmap (the split-ordered one). The thread layer
provides a single template wrapper that picks one based on
PRM_ID_ENABLE_NEW_LFHASH:
```cpp
// cubthread::lockfree_hashmap — thread/thread_lockfree_hash_map.hpp
template <class Key, class T>
class lockfree_hashmap
{
    enum type { OLD, NEW, UNKNOWN };

    lf_hash_table_cpp<Key, T> m_old_hash;
    lockfree::hashmap<Key, T> m_new_hash;
    type m_type;
    int m_entry_idx;        // index into entry::tran_entries[]
    // ...
};

#define lockfree_hashmap_forward_func(f_, tp_, ...) \
  is_old_type () \
    ? m_old_hash.f_ (get_tran_entry (tp_), __VA_ARGS__) \
    : m_new_hash.f_ ((tp_)->get_lf_tran_index (), __VA_ARGS__)
```

Every operation forwards through entry::tran_entries[m_entry_idx]
(old hash) or entry::get_lf_tran_index() (new hash) — the
thread’s per-table descriptor used by the hash-map’s epoch / hazard
scheme. The set of m_entry_idx slots is fixed in the
THREAD_TS_* enum — every consumer of the lock-free hashmap must
pre-allocate one in cubthread::entry. Today’s slots:
THREAD_TS_SPAGE_SAVING, THREAD_TS_OBJ_LOCK_RES,
THREAD_TS_OBJ_LOCK_ENT, THREAD_TS_CATALOG, THREAD_TS_SESSIONS,
THREAD_TS_FREE_SORT_LIST, THREAD_TS_GLOBAL_UNIQUE_STATS,
THREAD_TS_HFID_TABLE, THREAD_TS_XCACHE, THREAD_TS_FPCACHE,
THREAD_TS_DWB_SLOTS. To add a new lock-free hashmap, you add a
slot to the enum, request the entry slot in
entry::request_lock_free_transactions, and create the hashmap.
There is no extension point that avoids touching cubthread::entry.
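The three steps, sketched; the enum value, the map's key/value types, and the init() parameter list are illustrative assumptions — compare against an existing consumer (xcache, fpcache, …) before copying.

```cpp
// Sketch of wiring a new lock-free hashmap (hypothetical names throughout).

// 1. add a slot to the THREAD_TS_* enum in thread_entry.hpp:
//      THREAD_TS_MY_NEW_TABLE,        // before THREAD_TS_COUNT

// 2. make entry::request_lock_free_transactions also reserve a descriptor
//    for that slot, so every entry carries tran_entries[THREAD_TS_MY_NEW_TABLE].

// 3. create the map, telling it which per-entry slot it forwards through:
cubthread::lockfree_hashmap<my_key, my_value> my_table;
// my_table.init (/* transaction system, */ THREAD_TS_MY_NEW_TABLE, /* sizes, edesc, … */);
```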
SYNC_CRITICAL_SECTION (csect) — heavyweight RW gate
The csect is the engine’s coarsest synchronization primitive, reserved for shared resources where (a) reads dominate writes by orders of magnitude and (b) the resource is touched on critical control paths where a wait must be observable.
```cpp
// SYNC_CRITICAL_SECTION — thread/critical_section.h
typedef struct sync_critical_section
{
  const char *name;
  int cs_index;                           // identity, into csect_Names[]
  pthread_mutex_t lock;                   // monitor lock
  int rwlock;                             // >0 = # readers, <0 = writer, 0 = none
  unsigned int waiting_readers;
  unsigned int waiting_writers;
  pthread_cond_t readers_ok;              // wakeup readers
  THREAD_ENTRY *waiting_writers_queue;    // FIFO of waiting writers
  THREAD_ENTRY *waiting_promoters_queue;  // FIFO of demoted-then-promoting
  thread_id_t owner;                      // current writer
  int tran_index;                         // for diag/dump
  SYNC_STATS *stats;
} SYNC_CRITICAL_SECTION;
```

The csect index space is fixed and small (CSECT_LAST ≈ 18 in the
current source):
```cpp
// csect/CSECT_* — thread/critical_section.h
enum
{
  CSECT_WFG = 0,                        // wait-for-graph
  CSECT_LOG,                            // log manager
  CSECT_LOCATOR_SR_CLASSNAME_TABLE,
  CSECT_QPROC_QUERY_TABLE,
  CSECT_QPROC_LIST_CACHE,
  CSECT_DISK_CHECK,
  CSECT_CNV_FMT_LEXER,
  CSECT_HEAP_CHNGUESS,
  CSECT_TRAN_TABLE,
  CSECT_CT_OID_TABLE,
  CSECT_HA_SERVER_STATE,
  CSECT_COMPACTDB_ONE_INSTANCE,
  CSECT_ACL,
  CSECT_PARTITION_CACHE,
  CSECT_EVENT_LOG_FILE,
  CSECT_TRACE_LOG_FILE,
  CSECT_LOG_ARCHIVE,
  CSECT_ACCESS_STATUS,
  CSECT_LAST
};
```

Each csect has a name, a pthread_mutex_t guarding the rwlock state,
a single pthread_cond_t readers_ok for waking readers, and two
intrusive FIFOs (waiting_writers_queue, waiting_promoters_queue)
of THREAD_ENTRY *. The interesting feature is promotion and
demotion: a thread that entered as reader can call
csect_promote to upgrade in place to writer (without releasing
first); a writer can csect_demote to step down to reader. This is
what the lock manager and the schema manager need when they
“observed something during a read scan that requires a write”.
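Schematically, that read-scan-then-upgrade pattern looks like the sketch below; only the function names and the csect index come from this section, while the timeout argument and error handling are assumptions.

```cpp
// Schematic reader → writer promotion on a csect (illustrative, error handling elided;
// the third argument is assumed to be a wait timeout, -1 meaning "wait indefinitely").
void
scan_then_modify (THREAD_ENTRY *thread_p)
{
  csect_enter_as_reader (thread_p, CSECT_TRAN_TABLE, -1);

  // ... read scan; we observe something that requires a write ...

  // upgrade in place: become the writer without releasing the section first
  csect_promote (thread_p, CSECT_TRAN_TABLE, -1);

  // ... modify under writer ownership ...

  // optionally step back down to reader and keep scanning, then leave
  csect_demote (thread_p, CSECT_TRAN_TABLE, -1);
  csect_exit (thread_p, CSECT_TRAN_TABLE);
}
```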
Each csect also has a stats pointer — total enters, total waits,
total re-enters, total elapsed wait, max elapsed wait — visible via
csect_dump_statistics and the csect_start_scan SHOW handler.
Critical-section tracker
Section titled “Critical-section tracker”In debug builds (ENABLE_TRACKERS = !NDEBUG && SERVER_MODE), each
thread carries a cubsync::critical_section_tracker that records
how many times the thread has entered each csect, whether it is
currently the writer or a (possibly demoted) reader, and a
hard-coded interdependency rule:
```cpp
// critical_section_tracker::check_csect_interdependencies — thread/critical_section_tracker.cpp
void
critical_section_tracker::check_csect_interdependencies (int cs_index)
{
  if (cs_index == CSECT_LOCATOR_SR_CLASSNAME_TABLE)
    {
      cstrack_assert (m_cstrack_array[CSECT_CT_OID_TABLE].m_enter_count == 0);
    }
}
```

The intent: never acquire LOCATOR_SR_CLASSNAME_TABLE while
already holding CT_OID_TABLE. This is hard-coded ordering — the
project knows from past deadlock investigations that one specific
two-section cycle is forbidden, and the tracker enforces it. New
csects can be added without touching the tracker; new ordering
rules must be added by hand here.
The tracker also enforces re-enter counts: max 8 re-enters per
section (MAX_REENTERS), reader re-enter only when the thread is
already the writer or in a demoted state, writer re-enter only
when the thread is already the writer. Anything else trips
cstrack_assert (false). On task end, entry::end_resource_tracks
checks that all csects are released, the alloc tracker is balanced,
and the page-buffer fix list is empty — three independent
“release-everything-you-grabbed” assertions.
How a SQL request becomes a worker task
Putting the pieces together: a CSQL request arrives at the broker,
which forwards it to cub_server over the connection protocol.
The server-side connection-state machine in
src/connection/server_support.c calls thread_create_stats_worker_pool
with name "transaction" at startup; for each incoming request, it
constructs an entry_task and calls thread_get_manager()->push_task.
The pool dispatches to a core by round-robin, the core hands the
task to an available worker (or queues), the worker’s OS thread
(spawned on demand) calls worker_impl::run, which calls
init_run (claim entry, on_create), then execute_current_task
(which calls task::execute (entry &) — the actual SQL path),
then recycle_context, then get_new_task (next request or
become_available). The thread keeps running until its idle_timeout
expires, then exits; the next request spawns it fresh.
sequenceDiagram
participant CAS as broker (CAS)
participant SS as server_support<br/>(connection thread)
participant WP as transaction worker_pool
participant W as worker_impl
participant T as entry_task<br/>(query executor)
CAS->>SS: request bytes
SS->>WP: push_task (entry_task)
WP->>WP: round-robin core
WP->>W: assign_task (notify or spawn)
W->>W: init_run -> claim_entry, on_create
W->>T: execute (entry &)
T-->>W: returned
W->>W: retire_current_task; recycle_context
W->>WP: get_task_or_become_available
alt queued task waiting
WP-->>W: return queued task
else no task
WP-->>W: register as available, wait on m_task_cv
end
A vacuum block job, a parallel-query exchange, a recovery-redo
chunk follow the same shape — they differ only in which pool’s
push_task is called and which entry_task subclass is pushed.
That uniformity is the whole point of the worker-pool template.
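For completeness, the minimal shape every consumer shares — define an entry_task, push it at a pool — is sketched below; the task class and pool variable are illustrative, and only execute (entry &) and manager::push_task come from the interfaces documented here.

```cpp
// Schematic "push one unit of work" flow shared by all pool consumers above
// (illustrative names; the pool is assumed to have been created via the manager).
class my_unit_of_work : public cubthread::entry_task
{
  public:
    void execute (cubthread::entry &thread_ref) override
    {
      // do the work, with thread_ref as the THREAD_ENTRY context
    }
};

void
submit_one (cubthread::worker_pool *my_pool)
{
  // in SERVER_MODE this dispatches to a core; in SA_MODE it runs synchronously
  cubthread::get_manager ()->push_task (my_pool, new my_unit_of_work ());
}
```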
Source Walkthrough
Symbols grouped by subsystem.
cubthread::entry (per-thread context)
- cubthread::entry — the class.
- cubthread::entry::status — TS_DEAD/FREE/RUN/WAIT/CHECK enum.
- thread_resume_suspend_status — the protocol enum (THREAD_LOCK_SUSPENDED, THREAD_PGBUF_SUSPENDED, …).
- cubthread::entry::request_lock_free_transactions — wires tran_entries[THREAD_TS_*] to the lock-free transaction systems.
- cubthread::entry::register_id / get_id / unregister_id — store / read / clear the OS thread id.
- cubthread::entry::lock / unlock — pthread_mutex around th_entry_lock.
- cubthread::entry::end_resource_tracks / push_resource_tracks / pop_resource_tracks — debug-only alloc/pgbuf/csect tracker hooks.
- cubthread::entry::claim_system_worker / retire_system_worker — for daemons that need a system TDES.
- cubthread::entry::assign_lf_tran_index / pull_lf_tran_index / get_lf_tran_index — new-style lock-free hashmap descriptor.
- THREAD_TS_* enum values — slot identities for per-table lock-free transaction descriptors.
- thread_suspend_wakeup_and_unlock_entry, thread_suspend_timeout_wakeup_and_unlock_entry, thread_wakeup, thread_wakeup_already_had_mutex, thread_check_suspend_reason_and_wakeup, thread_suspend_with_other_mutex — the C-callable suspend/wake API.
- thread_type_to_string, thread_status_to_string, thread_resume_status_to_string — pretty-printers.
cubthread::manager (orchestrator)
- cubthread::manager — the singleton class.
- cubthread::initialize / cubthread::finalize — lifecycle.
- cubthread::initialize_thread_entries — entry pool sizing and lock-free transaction system bring-up.
- cubthread::get_manager, get_max_thread_count, get_entry, is_single_thread, check_not_single_thread — global accessors.
- cubthread::manager::alloc_entries — allocate entry[m_max_threads] and the dispatcher.
- cubthread::manager::init_entries — per-entry init, optional lock-free wiring.
- cubthread::manager::init_lockfree_system — construct lockfree::tran::system.
- cubthread::manager::create_worker_pool<Res> — template (see hpp).
- cubthread::manager::destroy_worker_pool — pool-stop + entry release.
- cubthread::manager::push_task / cubthread::manager::push_task_on_core — dispatch.
- cubthread::manager::create_daemon / cubthread::manager::create_daemon_without_entry / cubthread::manager::destroy_daemon / cubthread::manager::destroy_daemon_without_entry.
- cubthread::manager::set_max_thread_count_from_config — reads count_registry<connection>::total() + count_registry<worker_pool>::total() + count_registry<daemon>::total() + 1.
- cubthread::manager::claim_entry / retire_entry — thread-local entry plumbing through tl_Entry_p.
- cubthread::manager::find_by_tid — entry lookup.
- cubthread::manager::map_entries — iteration helper used by SHOW handlers.
- REGISTER_CONNECTION / REGISTER_WORKERPOOL / REGISTER_DAEMON macros (the count_registry API).
- cubthread::is_logging_configured — runtime gate for _er_log_debug.
cubthread::worker_pool (worker pool template)
- cubthread::worker_pool — abstract base.
- cubthread::worker_pool::core — abstract base for partitions.
- cubthread::worker_pool::core::worker — abstract base for individual workers.
- cubthread::worker_pool_impl<bool Stats> — the template impl; worker_pool_type (no stats) and stats_worker_pool_type (stats) are the two used aliases.
- cubthread::worker_pool_impl::initialize, cubthread::worker_pool_impl::execute, cubthread::worker_pool_impl::execute_on_core, cubthread::worker_pool_impl::warmup, cubthread::worker_pool_impl::stop_execution, cubthread::worker_pool_impl::is_running, cubthread::worker_pool_impl::get_worker_count, cubthread::worker_pool_impl::get_core_count — worker-pool API.
- cubthread::worker_pool_impl::map_running_contexts / map_cores — debug iteration over running threads / cores.
- cubthread::worker_pool_impl::core_impl — concrete partition with m_workers, m_available_workers, m_task_queue, m_temp_workers.
- cubthread::worker_pool_impl::core_impl::execute_task — assign-or-enqueue.
- cubthread::worker_pool_impl::core_impl::get_task_or_become_available — worker reentry.
- cubthread::worker_pool_impl::core_impl::execute_task_as_temp — ad-hoc method/SP worker spawned outside the pool.
- cubthread::worker_pool_impl::core_impl::worker_impl — concrete worker; owns m_context_p, m_wrapped_task, m_task_cv, m_task_mutex, m_stop, m_has_thread, m_is_temp.
- cubthread::worker_pool_impl::core_impl::worker_impl::run — thread main.
- cubthread::worker_pool_impl::core_impl::worker_impl::init_run / finish_run — context create / retire.
- cubthread::worker_pool_impl::core_impl::worker_impl::assign_task / start_thread / get_new_task / execute_current_task / retire_current_task / stop_execution — worker control flow.
- cubthread::worker_pool_impl::wrapped_task — task pointer + optional timing.
- cubthread::worker_pool_impl::stats — per-worker counter definitions (start_thread, create_context, execute_task, retire_task, found_in_queue, wakeup_with_task, recycle_context, retire_context).
- cubthread::worker_pool_task_capper — bounded-queue wrapper.
- cubthread::system_core_count, cubthread::wp_handle_system_error, cubthread::wp_set_force_thread_always_alive, cubthread::wp_is_thread_always_alive_forced — utilities.
cubthread::daemon and cubthread::looper
- cubthread::daemon — single-thread looper class.
- cubthread::daemon::loop_with_context, cubthread::daemon::loop_without_context — thread main.
- cubthread::daemon::wakeup, cubthread::daemon::stop_execution, cubthread::daemon::pause, cubthread::daemon::was_woken_up, cubthread::daemon::reset_looper, cubthread::daemon::is_running — control.
- cubthread::daemon::get_stats, cubthread::daemon::get_stats_value_count, cubthread::daemon::get_stat_name — stats facade.
- cubthread::looper — wait pattern.
- cubthread::looper::wait_type — INF / FIXED / INCREASING / CUSTOM.
- cubthread::looper::put_to_sleep — the called method.
- cubthread::looper::reset — used by INCREASING_WAITS on wakeup.
- cubthread::looper::stop, cubthread::looper::is_stopped — shutdown gate.
- cubthread::looper::setup_fixed_waits, cubthread::looper::setup_infinite_wait, cubthread::looper::setup_increasing_waits — internal policy function bound at construction.
- cubthread::waiter — the per-daemon mutex+condvar used by looper.
- cubthread::condvar_wait — overload that handles wait_duration<D> (infinite flag).
cubthread::task and entry-task adapters
- cubthread::task<Context> — abstract base for any task with execute(Context &) + retire().
- cubthread::callable_task<Context> — std::function-backed task.
- cubthread::task<void> / task_without_context / callable_task<void> — context-less specialisation.
- cubthread::entry_task = task<entry> — the standard adapter.
- cubthread::entry_callable_task = callable_task<entry>.
- cubthread::entry_manager — create_context / retire_context / recycle_context / stop_execution.
- cubthread::daemon_entry_manager — daemon-specific specialisation with on_daemon_create / on_daemon_retire.
- cubthread::system_worker_entry_manager — pre-baked manager that sets entry::type and a system-tdes context.
Lock-free hash map
- cubthread::lockfree_hashmap<Key, T> — facade over old/new impls.
- cubthread::lockfree_hashmap::iterator.
- cubthread::lockfree_hashmap::init, init_as_old, init_as_new, destroy.
- cubthread::lockfree_hashmap::find, find_or_insert, insert, insert_given, erase, erase_locked, unlock, clear, freelist_claim, freelist_retire, start_tran, end_tran.
- cubthread::lockfree_hashmap::is_old_type, cubthread::lockfree_hashmap::get_tran_entry — internal forwarding glue. Macros lockfree_hashmap_forward_func / lockfree_hashmap_forward_func_noarg route to old or new.
- cubthread::get_thread_entry_lftransys — accessor for the new-style lock-free transaction system.
- THREAD_TS_* slots in cubthread::entry::tran_entries — per-table descriptor identities (in thread_entry.hpp).
Critical sections
- SYNC_CRITICAL_SECTION (struct) — RW gate state.
- SYNC_RWLOCK, SYNC_RMUTEX (struct) — sibling primitives.
- SYNC_STATS, SYNC_PRIMITIVE_TYPE, SYNC_TYPE_CSECT/RWLOCK/RMUTEX/MUTEX — common stat plumbing.
- csect_initialize_static_critical_sections, csect_finalize_static_critical_sections — bring-up.
- csect_initialize_critical_section, csect_finalize_critical_section — single-csect lifecycle.
- csect_enter, csect_enter_as_reader, csect_enter_critical_section, csect_enter_critical_section_as_reader, csect_demote, csect_promote, csect_exit, csect_exit_critical_section — primary API.
- csect_check_own — assertion helper.
- csect_dump_statistics, csect_start_scan, csect_name_at — diagnostics.
- rwlock_initialize, rwlock_finalize, rwlock_read_lock, rwlock_read_unlock, rwlock_write_lock, rwlock_write_unlock — non-reentrant RW lock variant.
- rmutex_initialize, rmutex_finalize, rmutex_lock, rmutex_unlock — recursive mutex variant.
- sync_initialize_sync_stats, sync_finalize_sync_stats, sync_dump_statistics — common stats.
- cubsync::critical_section_tracker — per-thread debug tracker: start, stop, clear_all, on_enter_as_reader, on_enter_as_writer, on_promote, on_demote, on_exit, check_csect_interdependencies — the tracker API used in debug builds.
Position hints (as of this revision)
| Symbol | File | Line |
|---|---|---|
cubthread::manager (class) | src/thread/thread_manager.hpp | 111 |
cubthread::manager::push_task | src/thread/thread_manager.cpp | 157 |
cubthread::manager::create_daemon | src/thread/thread_manager.cpp | 126 |
cubthread::manager::create_worker_pool (template) | src/thread/thread_manager.hpp | 367 |
cubthread::manager::set_max_thread_count_from_config | src/thread/thread_manager.cpp | 266 |
cubthread::manager::claim_entry / retire_entry | src/thread/thread_manager.cpp | 234 / 242 |
cubthread::initialize | src/thread/thread_manager.cpp | 315 |
cubthread::initialize_thread_entries | src/thread/thread_manager.cpp | 378 |
REGISTER_CONNECTION/WORKERPOOL/DAEMON macros | src/thread/thread_manager.hpp | 496–498 |
cubthread::entry (class) | src/thread/thread_entry.hpp | 195 |
cubthread::entry::status enum | src/thread/thread_entry.hpp | 202 |
THREAD_TS_* enum | src/thread/thread_entry.hpp | 81 |
thread_resume_suspend_status enum | src/thread/thread_entry.hpp | 139 |
thread_type enum | src/thread/thread_entry.hpp | 124 |
cubthread::entry ctor | src/thread/thread_entry.cpp | 78 |
cubthread::entry::request_lock_free_transactions | src/thread/thread_entry.cpp | 220 |
thread_suspend_wakeup_and_unlock_entry | src/thread/thread_entry.cpp | 497 |
thread_wakeup_internal | src/thread/thread_entry.cpp | 600 |
thread_suspend_with_other_mutex | src/thread/thread_entry.cpp | 688 |
cubthread::worker_pool (base) | src/thread/thread_worker_pool.hpp | 54 |
cubthread::worker_pool::core | src/thread/thread_worker_pool.hpp | 123 |
cubthread::worker_pool::core::worker | src/thread/thread_worker_pool.hpp | 178 |
cubthread::worker_pool_impl | src/thread/thread_worker_pool_impl.hpp | 105 |
worker_pool_impl::core_impl | src/thread/thread_worker_pool_impl.hpp | 263 |
core_impl::worker_impl | src/thread/thread_worker_pool_impl.hpp | 339 |
core_impl::execute_task | src/thread/thread_worker_pool_impl.hpp | 896 |
core_impl::get_task_or_become_available | src/thread/thread_worker_pool_impl.hpp | 1012 |
core_impl::execute_task_as_temp | src/thread/thread_worker_pool_impl.hpp | 1194 |
worker_impl::run | src/thread/thread_worker_pool_impl.hpp | 1387 |
worker_impl::init_run / finish_run | src/thread/thread_worker_pool_impl.hpp | 1430 / 1457 |
worker_impl::assign_task (with task) | src/thread/thread_worker_pool_impl.hpp | 1267 |
worker_impl::get_new_task | src/thread/thread_worker_pool_impl.hpp | 1521 |
worker_impl::stop_execution | src/thread/thread_worker_pool_impl.hpp | 1326 |
worker_pool_impl::stop_execution | src/thread/thread_worker_pool_impl.hpp | 594 |
worker_pool_task_capper | src/thread/thread_worker_pool_taskcap.hpp | 30 |
cubthread::daemon | src/thread/thread_daemon.hpp | 87 |
daemon::loop_with_context | src/thread/thread_daemon.cpp | 209 |
daemon::loop_without_context | src/thread/thread_daemon.cpp | 248 |
daemon::stop_execution | src/thread/thread_daemon.cpp | 90 |
cubthread::looper | src/thread/thread_looper.hpp | 81 |
looper::wait_type enum | src/thread/thread_looper.hpp | 132 |
looper::put_to_sleep | src/thread/thread_looper.cpp | 119 |
looper::setup_increasing_waits | src/thread/thread_looper.cpp | 208 |
cubthread::lockfree_hashmap | src/thread/thread_lockfree_hash_map.hpp | 35 |
lockfree_hashmap_forward_func macro | src/thread/thread_lockfree_hash_map.hpp | 162 |
SYNC_CRITICAL_SECTION struct | src/thread/critical_section.h | 110 |
CSECT_* enum | src/thread/critical_section.h | 57 |
csect_Names[] | src/thread/critical_section.c | 76 |
csect_initialize_static_critical_sections | src/thread/critical_section.c | 243 |
csect_enter | src/thread/critical_section.c | 674 |
csect_enter_critical_section | src/thread/critical_section.c | 474 |
csect_enter_as_reader | src/thread/critical_section.c | 891 |
cubsync::critical_section_tracker | src/thread/critical_section_tracker.hpp | 32 |
critical_section_tracker::on_enter_as_reader | src/thread/critical_section_tracker.cpp | 60 |
critical_section_tracker::check_csect_interdependencies | src/thread/critical_section_tracker.cpp | 188 |
Pool/daemon consumer call sites
| Consumer | File | Line |
|---|---|---|
transaction worker pool | src/connection/server_support.c | 581 |
vacuum worker pool | src/query/vacuum.c | 1342 |
vacuum-master daemon | src/query/vacuum.c | 1352 |
parallel-query worker pool | src/query/parallel/px_worker_manager_global.cpp | 69 |
loaddb worker pool | src/loaddb/load_worker_manager.cpp | 124 |
online-index worker pool | src/storage/btree_load.c | 5315 |
backup-read worker pool | src/transaction/log_page_buffer.c | 7546 |
recovery-redo worker pool | src/transaction/log_recovery_redo_parallel.cpp | 662 |
log-checkpoint daemon | src/transaction/log_manager.c | 10415 |
log-rm-archive daemon | src/transaction/log_manager.c | 10440 |
log-clock daemon | src/transaction/log_manager.c | 10457 |
ha-delay-check daemon | src/transaction/log_manager.c | 10482 |
log-flush daemon | src/transaction/log_manager.c | 10500 |
cdc-loginfo-producer daemon | src/transaction/log_manager.c | 14046 |
deadlock-detect daemon | src/transaction/lock_manager.c | 5820 |
dwb-flush-block daemon | src/storage/double_write_buffer.cpp | 4078 |
dwb-file-sync daemon | src/storage/double_write_buffer.cpp | 4092 |
pgbuf-maintain daemon | src/storage/page_buffer.c | 16538 |
pgbuf-page-flush daemon | src/storage/page_buffer.c | 16556 |
pgbuf-page-post-flush daemon | src/storage/page_buffer.c | 16580 |
pgbuf-flush-control daemon | src/storage/page_buffer.c | 16604 |
session-control daemon | src/session/session.c | 588 |
pl-monitor daemon | src/sp/pl_sr.cpp | 266 |
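
Every consumer in this table follows the same two-step shape: a file-scope `REGISTER_*` macro so the manager's count registry sizes the entry pool for the new thread(s), then a `create_worker_pool` or `create_daemon` call at module init. Below is a minimal sketch of the daemon half, assuming `create_daemon` takes a looper, an `entry_task` pointer, and a name in that order; the registration-macro argument style, the 60-second period, and `sweep_sessions` are illustrative rather than copied from `session.c`.

```cpp
// Sketch under the assumptions stated above; verify the exact macro and
// create_daemon/destroy_daemon signatures in src/thread/thread_manager.hpp.
#include "thread_daemon.hpp"
#include "thread_entry_task.hpp"
#include "thread_looper.hpp"
#include "thread_manager.hpp"

#include <chrono>

// File scope: reserve one daemon entry in the count registry. Without this,
// create_daemon below finds no spare thread entry and returns NULL.
REGISTER_DAEMON (session_control);

static cubthread::daemon *session_Control_daemon = NULL;

// The recurring duty; runs once per looper tick on the daemon's own thread.
static void
sweep_sessions (cubthread::entry &thread_ref)
{
  // placeholder body: the real duty would scan and expire idle sessions
  (void) thread_ref;
}

void
session_control_daemon_init ()
{
  cubthread::looper loop (std::chrono::seconds (60));   // fixed 60 s period
  session_Control_daemon =
    cubthread::get_manager ()->create_daemon (loop,
                                              new cubthread::entry_callable_task (sweep_sessions),
                                              "session_control");
}

void
session_control_daemon_destroy ()
{
  cubthread::get_manager ()->destroy_daemon (session_Control_daemon);
}
```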
Cross-check Notes
This document is the foundation for several others; it should be read in concert with them.

- `cubrid-vacuum.md` (master + workers). Vacuum’s master is one of the daemons listed above; vacuum’s workers are one of the worker pools. The master is a `cubthread::daemon` with a fixed looper that, on each tick, scans the WAL forward and dispatches per-block jobs by calling `thread_get_manager()->push_task` on the `vacuum` pool (a minimal sketch of that dispatch shape follows this list). Every vacuum worker thread carries a `vacuum_worker *` on its `cubthread::entry` (the `vacuum_Worker_entry_manager` overrides `on_create` to attach it). Look there for the per-block job logic; everything in this document is the substrate.
- `cubrid-recovery-manager.md` (parallel redo workers). Recovery’s parallel redo uses the `recovery-redo` worker pool registered in `log_recovery_redo_parallel.cpp`. The worker pool is constructed inside the recovery flow rather than at server startup and is destroyed at the end of recovery — this is the one short-lived pool the manager hosts. Workers there pin a `worker_pool_task_capper` for backpressure on the redo job stream.
- `cubrid-page-buffer-manager.md` (flush daemon, post-flush daemon, flush-control, maintain). Four daemons live in the page-buffer module, all created in `pgbuf_initialize` with fixed-period or infinite loopers. Their wakeup hints come from the page-buffer hot path: when the dirty count crosses `pgbuf_flush_threshold`, the buffer manager calls `pgbuf_Page_flush_daemon->wakeup()`. Look there for the flush-control feedback math; everything else (the daemon mechanism) is in this document.
- `cubrid-lock-manager.md` (deadlock detector). The `deadlock-detect` daemon is in `src/transaction/lock_manager.c` with a fixed looper period of `prm_get_integer_value (PRM_ID_DEADLOCK_DETECTION_INTERVAL_SECS)`. Its task scans the wait-for graph maintained in the lock table, which is itself a `cubthread::lockfree_hashmap` over `LK_RES`/`LK_ENT` slots (`THREAD_TS_OBJ_LOCK_RES`/`THREAD_TS_OBJ_LOCK_ENT`). The csect involved is `CSECT_WFG`.
- `cubrid-log-manager.md` (log flush daemon, archive removal, CDC). Six daemons live in the log module: `log-flush`, `log-checkpoint`, `log-rm-archive`, `log-clock`, and `ha-delay-check`, plus `cdc-loginfo-producer`. All are registered together with `REGISTER_DAEMON(log_*)` macros.
- `cubrid-2pc.md` and `cubrid-ha-replication.md`. These cross-reference daemons that live outside `src/thread/` proper (the master process’s heartbeat threads in `src/executables/master_heartbeat.c` use a different pattern — see `cubrid-heartbeat.md` for that). This document is strictly about the in-server thread system.
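
The master-daemon-feeds-worker-pool composition in the vacuum item above is the canonical way daemons and pools interact, so here is a minimal sketch of the dispatch shape. `push_task`, `entry_task`, and `cubthread::get_manager` come from the source index above; `block_job_task`, `vacuum_blockid`, `g_vacuum_pool`, `scan_wal_for_ready_blocks`, and `vacuum_process_block` are hypothetical stand-ins, not the real vacuum names.

```cpp
// Sketch only (see the hedging note above). The manager API names are taken
// from src/thread/thread_manager.hpp and src/thread/thread_entry_task.hpp;
// everything vacuum-specific here is an invented placeholder.
#include "thread_entry_task.hpp"   // cubthread::entry_task
#include "thread_manager.hpp"      // cubthread::get_manager, push_task, entry_workpool

#include <vector>

using vacuum_blockid = long long;                               // placeholder payload type
extern cubthread::entry_workpool *g_vacuum_pool;                // the "vacuum" pool from the table above
extern std::vector<vacuum_blockid> scan_wal_for_ready_blocks ();
extern void vacuum_process_block (cubthread::entry &thread_ref, vacuum_blockid blockid);

// One queued job: executed later by whichever pooled worker claims it.
class block_job_task : public cubthread::entry_task
{
  public:
    explicit block_job_task (vacuum_blockid blockid) : m_blockid (blockid) {}

    void execute (cubthread::entry &thread_ref) override
    {
      // thread_ref is the claiming worker's per-thread context (error info, heap, trackers)
      vacuum_process_block (thread_ref, m_blockid);
    }

  private:
    vacuum_blockid m_blockid;
};

// One tick of the master daemon: scan the WAL forward, push one task per ready block.
static void
master_tick ()
{
  for (vacuum_blockid id : scan_wal_for_ready_blocks ())
    {
      cubthread::get_manager ()->push_task (g_vacuum_pool, new block_job_task (id));
    }
}
```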
A few drift points worth flagging:
- The `entry` class header explicitly says it is mid-refactor away from public fields. The TODO in `src/thread/thread_entry.hpp` (“make member variable private, remove content that does not belong here, migrate here thread entry related functionality from thread.c/h”) will not be finished in one sweep — expect new code to mix accessor calls and direct field access for some time.
- The lock-free hashmap façade has two live implementations (`OLD`, `NEW`) selected by `PRM_ID_ENABLE_NEW_LFHASH`. The intent is clearly to retire the old `lf_hash_table_cpp` eventually; until then, the `lockfree_hashmap_forward_func` macro hides which one is in use behind every call. New code that bypasses the macro sees only the OLD path. (A generic sketch of this forwarding shape follows this list.)
- The csect set is fixed-size (`CSECT_LAST` ≈ 18). New csects must be appended to the enum and to `csect_Names[]`; a new `cstrack_entry` slot then fits automatically into the per-thread tracker. The interdependency rule in `check_csect_interdependencies` is hard-coded against `CSECT_LOCATOR_SR_CLASSNAME_TABLE` ↔ `CSECT_CT_OID_TABLE`; any new partial ordering needs an entry there.
- Count-registry-driven sizing means a daemon or pool added without its `REGISTER_*` macro at file scope silently fails at `create_*` time (returns `NULL` because the entry pool is too small). The error is recoverable but easy to miss; the registration is the contract (see the creation sketch after the consumer table above).
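
To make the forwarding-macro point concrete, here is a generic sketch of the shape. It is not the real `lockfree_hashmap_forward_func`; every name in it (`dual_hashmap`, `dual_forward`, `m_old`, `m_new`) is invented, and standard containers stand in for the two lock-free backends.

```cpp
// Generic illustration only; the CUBRID facade lives in
// src/thread/thread_lockfree_hash_map.hpp and forwards to real lock-free tables.
#include <map>
#include <string>
#include <unordered_map>

template <typename Key, typename Val>
class dual_hashmap
{
  public:
    explicit dual_hashmap (bool use_new) : m_use_new (use_new) {}

    // One macro body routes every public method to the backend chosen at
    // construction. Callers that bypass the macro and touch a member directly
    // pin themselves to one implementation, which is the drift risk flagged above.
#define dual_forward(func, ...) \
  do { \
    if (m_use_new) { m_new.func (__VA_ARGS__); } \
    else           { m_old.func (__VA_ARGS__); } \
  } while (0)

    void insert (const Key &k, const Val &v) { dual_forward (emplace, k, v); }
    void erase (const Key &k)                { dual_forward (erase, k); }

#undef dual_forward

  private:
    bool m_use_new;                       // analogue of PRM_ID_ENABLE_NEW_LFHASH
    std::map<Key, Val> m_old;             // stand-in for the legacy table
    std::unordered_map<Key, Val> m_new;   // stand-in for the NEW table
};

int main ()
{
  dual_hashmap<int, std::string> map (true);   // pick the NEW backend
  map.insert (1, "one");
  map.erase (1);
  return 0;
}
```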
Open Questions
- Why is the round-robin counter incremented modulo `m_max_workers` rather than `m_cores.size()`? The comment in `get_round_robin_core_hash` says “preserve assignments proportional to core size”, but the modulus is the worker count, not the core count. The arithmetic works out for evenly-sized cores; for unevenly-sized cores (worker_count not divisible by core_count), dispatch is biased toward the first (remainder) cores. (A small worked example follows this list.)
- What sets `pool_threads=true` in production? The flag bypasses the idle-timeout/exit path entirely. It is wired into `loaddb` and `backup-read` explicitly; the `transaction` pool uses the default. The perf-test override (`wp_set_force_thread_always_alive`) is enabled by `PRM_ID_PERF_TEST_MODE`. There may be additional implicit enablers I have not traced.
- Can a worker context survive across very different tasks? `recycle_context` is called between tasks, so the entry’s trackers and TDES are reset, but `private_heap_id` persists. If a task allocates into the private heap and frees only on retire, a long-lived recycled entry could accumulate fragmentation. In practice the trackers assert this in debug builds; release-mode behaviour is silent.
- Does the csect tracker have any production impact? `ENABLE_TRACKERS` is `!NDEBUG && SERVER_MODE`, so the release path is statically dead and the interdependency check is debug-only. The only production protection against `CSECT_LOCATOR_SR_CLASSNAME_TABLE` ↔ `CSECT_CT_OID_TABLE` inversion is therefore the codebase’s coding discipline, not a runtime assertion.
- The intermediate `m_temp_workers` list and `register_free_temp_list` are designed for method/SP execution that needs a thread but should not consume a pool slot. Is there a hard cap on how many temp workers a single pool can spawn under load? On flooding, this could spawn unbounded OS threads. The taskcap wrapper helps, but only for queued tasks, not for `is_temp=true` paths.
- The “subtract execution time from period” detail in `looper::put_to_sleep` assumes `m_start_execution_time` is monotonic. It uses `std::chrono::system_clock`, which on Linux is wall-clock time and can step backwards on NTP adjustment. A negative `period - execution_time` becomes `delta_time(0)`, so the daemon does not sleep — fine. But a forward jump may make the daemon skip a tick. This is probably not material at second granularity, but worth noting.
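
A worked example of the bias from the first question, under one assumption: the slot produced by the modulo is mapped to a core by walking cumulative core sizes, which is one plausible reading of “preserve assignments proportional to core size” (check the real `get_round_robin_core_hash` before treating this as its behaviour). With 10 workers over 4 cores the core sizes are 3, 3, 2, 2, so the first two cores receive 3/10 of the tasks each and the last two only 2/10 each.

```cpp
// Illustrative arithmetic only; no CUBRID code is used here.
#include <cstdio>

int main ()
{
  const int core_count = 4;
  const int worker_count = 10;       // not divisible by core_count
  int hits[core_count] = { };

  for (int counter = 0; counter < 1000; counter++)
    {
      int slot = counter % worker_count;   // modulo m_max_workers, as in the question

      // assumed mapping: walk cumulative core sizes (3, 3, 2, 2) until the slot fits
      int cumulative = 0;
      for (int core = 0; core < core_count; core++)
        {
          int core_size = worker_count / core_count + (core < worker_count % core_count ? 1 : 0);
          cumulative += core_size;
          if (slot < cumulative)
            {
              hits[core]++;
              break;
            }
        }
    }

  // prints 300, 300, 200, 200: the first (remainder) cores get proportionally more tasks
  for (int core = 0; core < core_count; core++)
    {
      std::printf ("core %d: %d tasks\n", core, hits[core]);
    }
  return 0;
}
```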
Sources
- `src/thread/thread_manager.cpp`, `src/thread/thread_manager.hpp` — the manager singleton and its allocation, registration, and lifecycle.
- `src/thread/thread_entry.cpp`, `src/thread/thread_entry.hpp` — `cubthread::entry` and the C-callable suspend/wake API.
- `src/thread/thread_worker_pool.hpp` — the abstract base classes (`worker_pool`, `core`, `worker`).
- `src/thread/thread_worker_pool_impl.hpp`, `src/thread/thread_worker_pool_impl.cpp` — the templated `worker_pool_impl<bool Stats>` and its core/worker implementations.
- `src/thread/thread_worker_pool_taskcap.hpp`, `src/thread/thread_worker_pool_taskcap.cpp` — `worker_pool_task_capper`, the bounded-queue wrapper.
- `src/thread/thread_daemon.hpp`, `src/thread/thread_daemon.cpp` — single-thread looper.
- `src/thread/thread_looper.hpp`, `src/thread/thread_looper.cpp` — wait policy.
- `src/thread/thread_waiter.hpp`, `src/thread/thread_waiter.cpp` — `cubthread::waiter` (the per-daemon mutex+condvar).
- `src/thread/thread_task.hpp`, `src/thread/thread_entry_task.hpp`, `src/thread/thread_entry_task.cpp` — `task<Context>`, `entry_task`, `entry_manager` and friends.
- `src/thread/thread_lockfree_hash_map.hpp`, `src/thread/thread_lockfree_hash_map.cpp` — `cubthread::lockfree_hashmap` facade.
- `src/thread/critical_section.h`, `src/thread/critical_section.c` — `SYNC_CRITICAL_SECTION` and the csect API.
- `src/thread/critical_section_tracker.hpp`, `src/thread/critical_section_tracker.cpp` — per-thread debug tracker.
grep -nE 'create_worker_pool|thread_create_worker_pool|thread_create_stats_worker_pool|create_daemon':src/connection/server_support.c(transaction pool)src/query/vacuum.c(vacuum pool + master daemon)src/query/parallel/px_worker_manager_global.cpp(parallel-query pool)src/loaddb/load_worker_manager.cpp(loaddb pool)src/storage/btree_load.c(online-index pool)src/transaction/log_page_buffer.c(backup-read pool)src/transaction/log_recovery_redo_parallel.cpp(recovery-redo pool)src/transaction/log_manager.c(six log/HA/CDC daemons)src/transaction/lock_manager.c(deadlock-detect daemon)src/storage/double_write_buffer.cpp(two DWB daemons)src/storage/page_buffer.c(four pgbuf daemons)src/session/session.c(session-control daemon)src/sp/pl_sr.cpp(pl-monitor daemon)
- Theory: Petrov, Database Internals, Ch. 14 “Concurrency
Control”; Herlihy & Shavit, The Art of Multiprocessor
Programming (Treiber stack, Michael–Scott queue,
split-ordered hashmap); Maged M. Michael, High Performance
Dynamic Lock-Free Hash Tables and List-Based Sets (SPAA 2002);
the Java
ConcurrentHashMap lineage for the new lock-free hashmap shape.