CUBRID Thread Manager NG — Connection/Worker Pool Redesign for High-Concurrency (CBRD-26177)
Contents:
- Theoretical Background
- Common DBMS Design
- Motivation (CBRD-26152 + CBRD-26177)
- CUBRID’s Approach
- Source Walkthrough
- Cross-check Notes
- Open Questions
- Sources
This document is the next-generation counterpart to
cubrid-thread-worker-pool.md. The sibling doc covers the legacy
baseline — one polling thread per accepted connection plus a
max_clients-sized cubthread::worker_pool, with all dispatch through
a per-pool mutex. The redesign tracked here, delivered under EPIC
CBRD-26177 for the guava release, replaces the front half with a
small bounded set of epoll-driven connection workers, adds a
coordinator that balances connections across them and dynamically
scales their count, bounds per-tick I/O via send/recv budgets, and
rotates context allocation through per-worker freelists so the hot
path no longer touches new/delete. The task-worker pool below is
retained but resized via two new tunables (task_group,
task_worker) to avoid contention at high concurrency.
Theoretical Background
The connection front-end of a database server has to multiplex many TCP sockets onto a finite number of CPU cores. Three architectures have dominated the literature, each with a distinct mapping between sockets, threads, and event-loop iterations.
Thread-per-connection (one-thread-per-client). Each accepted
socket gets a dedicated kernel thread that calls read()/write()
directly. The model is simple — no event-loop bookkeeping, no
demultiplexing — and is the design Stevens describes in UNIX Network
Programming, Vol. 1 (3rd ed., §16.5 “TCP Concurrent Server,
One Child per Client”) as the canonical Unix server. It scales until
the kernel’s thread-switch overhead dominates: at C10K and beyond,
the working set of stacks blows the L1/L2 caches, the scheduler’s
runqueue grows linearly with idle threads, and any shared mutex
between threads serializes the entire server. Database Internals
(Petrov, 2019, §5.3 “Concurrent Execution”) summarises the lesson:
“If you want to scale to tens of thousands of concurrent connections,
having one thread per connection becomes impractical.”
Reactor (event-driven). A small fixed pool of event-loop
threads each blocks on a multiplexer (select/poll/epoll/kqueue)
and dispatches ready sockets synchronously. The reference work is
Pai, Druschel, and Zwaenepoel, “Flash: An Efficient and Portable
Web Server” (USENIX 1999), which demonstrated that a single
asymmetric event loop using non-blocking I/O could match or beat
threaded servers at an order-of-magnitude lower memory cost. The
crucial mechanical refinement is edge-triggered epoll: with
EPOLLET, the kernel reports a readiness transition exactly once,
and the user-space loop is responsible for draining the socket until
EAGAIN. Edge-triggering eliminates wake-up storms but forces the
loop to bound how much it drains per fd — otherwise a single fat
connection can starve the others. This is the head-of-line blocking
problem inside an event-loop worker.
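The drain-until-EAGAIN contract can be sketched in a few lines. This is a minimal illustration of the edge-triggered obligation, not CUBRID's receiver; `drain_until_eagain` is a hypothetical name:

```cpp
#include <cassert>
#include <cerrno>
#include <sys/socket.h>
#include <unistd.h>

// With EPOLLET the kernel reports the readiness transition once, so the
// reader must keep calling recv() until EAGAIN or the edge is lost.
// Returns the number of bytes drained, or -1 on a real error.
ssize_t drain_until_eagain (int fd, char *buf, size_t buflen)
{
  ssize_t total = 0;
  for (;;)
    {
      ssize_t n = recv (fd, buf, buflen, 0);
      if (n > 0)
        {
          total += n;           // keep draining; more may be buffered
          continue;
        }
      if (n == 0)
        return total;           // peer closed the connection
      if (errno == EAGAIN || errno == EWOULDBLOCK)
        return total;           // fully drained: safe to wait for the next edge
      return -1;                // genuine error
    }
}
```

Note that nothing here bounds `total`; that unbounded drain is exactly the head-of-line problem the budgets in CBRD-26392 address.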
Proactor (asynchronous I/O). The kernel signals
completion, not readiness — Windows IOCP, Linux io_uring,
POSIX AIO. Conceptually superior for write-heavy workloads but
operationally heavier and not yet the default for database
front-ends. CUBRID’s redesign deliberately chose reactor + edge
trigger; proactor is out of scope.
Admission control via budgets. Welsh, Culler, and Brewer’s
SEDA (“SEDA: An Architecture for Well-Conditioned, Scalable
Internet Services,” SOSP 2001) framed the front-end as a sequence of
stages connected by bounded queues, with each stage applying its
own admission policy. The empirical observation is that latency
under saturation degrades far less when each stage caps the work it
will absorb in a single tick. CUBRID’s recv_budget_per_connection
and send_budget_per_connection (CBRD-26392) are the SEDA admission
gate applied to a single epoll tick: a fat reader that would happily
drain a megabyte must instead yield after 16 KB, register itself in
an “exhausted” list, and let the worker round-robin back to it on
the next iteration.
Pool sizing — Little’s law. Given an arrival rate λ
(requests/sec) and an average per-request service time S (sec), the
average number of in-flight requests is L = λ · S. A pool with
fewer than L workers will queue indefinitely; a pool with
significantly more workers wastes CPU on context switching and
blocks on internal critical sections. Database Internals (§5.3)
notes that real systems usually pick a small multiple of physical
cores and tune empirically, because S varies with the workload.
CBRD-26424 (score-based assignment) and CBRD-26636 (Worker count
sweep) implement exactly this empirical loop: measure throughput at
several task_worker sizes, pick the local maximum.
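As a worked example of L = λ · S: a server absorbing 60,000 requests/sec at 0.25 ms average service time keeps about 15 requests in flight, so a pool far smaller than 15 queues indefinitely and a pool of hundreds mostly context-switches. The helper below is a hypothetical illustration, not CUBRID code:

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Little's law: average in-flight requests L = lambda * S.
// Rounding up gives the smallest pool that does not queue indefinitely.
std::size_t min_workers (double arrivals_per_sec, double service_time_sec)
{
  return static_cast<std::size_t> (std::ceil (arrivals_per_sec * service_time_sec));
}
```

Because S shifts with the workload, this is only a floor; the CBRD-26636 sweep measures around it rather than trusting the formula.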
Atomic-free monitoring. Naïve performance counters use
std::atomic<uint64_t>::fetch_add per event. Under high load the
cache-line of the counter pings between cores; at hundreds of
thousands of events per second per worker the contention itself
becomes the bottleneck the counter was meant to measure. The
established workaround is thread-local accumulation with lazy
aggregation: each worker increments a private counter and the
monitor reader sums them. CBRD-26191 demonstrates the gain on YCSB
(workload-a: 58 K → 60 K ops; workload-b: 70 K → 73 K ops) by
removing only the atomic instructions on the hot path. Connection
worker statistics in this redesign follow the same rule —
statistics::metrics<> is a plain uint64_t[] per worker, summed
by the coordinator on a 1-second timer.
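A minimal sketch of the pattern (types and names illustrative, not the actual statistics::metrics<>): each worker writes only its own plain, cache-line-aligned counter, and the monitor reader sums them lazily, tolerating slightly stale values:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// One slot per worker, padded to a cache line so two workers never
// write the same line (no atomics, no fetch_add on the hot path).
struct alignas (64) worker_counter
{
  uint64_t events = 0;   // written only by the owning worker thread
};

// Lazy aggregation: the reader walks all slots. The sum may be a few
// events behind each writer; for monitoring that is acceptable.
uint64_t lazy_sum (const std::vector<worker_counter> &counters)
{
  uint64_t total = 0;
  for (const auto &c : counters)
    total += c.events;
  return total;
}
```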
Common DBMS Design
The shared design space for connection front-ends has narrowed since the C10K era. Almost every modern engine sits at one of four points on the threads × event loop matrix.
PostgreSQL — process per connection. postmaster forks a
postgres backend process per accepted connection. The model gives
strong isolation (a crashing backend can be restarted without
killing peers) at the cost of high per-connection memory (≥10 MB).
The PostgreSQL community has consistently rejected proposals to
replace the model in core; instead, the project recommends external
poolers such as PgBouncer for high-concurrency workloads. There is
no equivalent of CUBRID’s “one CPU-pinned event loop per N
connections” inside PostgreSQL itself.
MySQL — thread-per-connection by default; thread pool plugin
optional. The default Connection_handler_manager runs
one-thread-per-connection, giving each TCP session a dedicated
pthread. The Enterprise Thread Pool plugin replaces this with a
fixed number of thread groups (typically equal to core count) plus
a small admission queue per group. The plugin exists exactly because
the unbounded thread-per-connection model collapses past a few
hundred concurrent sessions on the same workloads CUBRID measured in
CBRD-26152. CUBRID’s redesign moves into this same architectural
neighbourhood — bounded connection workers, group-based task
dispatch, admission via budgets — without making it a plugin.
Oracle — dedicated server vs. shared server (DRCP). The default mode is dedicated-server (process per session). Shared-server mode multiplexes many sessions onto a small pool of server processes via a dispatcher that owns the listening socket and passes requests through queues. Database Resident Connection Pooling (DRCP) generalises this so multiple application servers share the same backend pool. CUBRID’s coordinator has the same arbitration role as the Oracle dispatcher, but with finer per-worker statistics and an auto-scaling rule.
SQL Server — SOS scheduler (cooperative). SQL Server’s SOS scheduler runs a fixed number of worker threads (≈ logical core count) and switches them cooperatively at well-defined yield points inside the engine. Connections are attached to schedulers rather than owning a thread of their own. The CUBRID redesign is closer to this model than to PostgreSQL’s: connection workers are CPU-pinned, fixed in count within a min/max range, and process many sessions per loop iteration.
Where legacy CUBRID sat. Before CBRD-26177 the server ran a
polling thread per connection (each css_master_thread-spawned
session looped on its own socket) plus a cubthread::worker_pool of
size max_clients — see cubrid-thread-worker-pool.md for the
detailed walkthrough. With max_clients set to 2000 the engine
genuinely held ≥4000 threads at full saturation. Each polling thread
contended for the worker-pool’s per-core mutex on every job
dispatch; CBRD-26152 measured the result on YCSB-a as monotonically
decreasing throughput as concurrency rose, with the CPU spending the
extra cycles parked waiting on the mutex rather than running user code.
Where the redesign sits. With CBRD-26177 the front becomes a
small set (min_connection_worker … max_connection_worker,
defaults 4 … cores/2) of epoll-driven cubconn::connection::worker
threads each pinned to a core; the back stays a cubthread::worker_pool
sized by task_group × task_worker (renamed from
thread_core_count × the old worker count). A single
cubconn::connection::coordinator thread, also pinned, brokers
new-client placement, rebalancing, and auto-scaling. The hot path
(connection worker → task push → task worker pop) no longer takes a
shared mutex except briefly for css_conn_entry::cmutex /
rmutex, both of which are per-connection.
Motivation (CBRD-26152 + CBRD-26177)
CBRD-26152 — “[Survey] 동시성 증가에 따른 CPU idle 증가 원인 조사” (“Survey of why CPU idle rises when concurrency increases”) — is the empirical study that motivated the redesign. Yechan Hong ran YCSB workload-b (read 95%, update 5%) with the client/CAS cap at 2000 and swept thread counts from 200 to 1000. The unexpected finding was quoted directly in the ticket:
“스레드의 개수가 200개에서 1000개로 증가하였지만, 오히려 iowait가 아닌 CPU idle이 증가하고 있다.” (As the thread count increased from 200 to 1000, CPU idle — not iowait — increased.)
If the bottleneck were disk, more threads would have shown up as iowait. CPU idle rising under load instead pointed at internal synchronization: threads arriving at the worker-pool dispatch mutex faster than the holder could release it, then the kernel parking them, leaving cores genuinely idle.
CBRD-26177 names two structural causes:
“각 connection 스레드들이 모두 따로 polling하고 cub_server는 이론 상 max_clients × 2 이상의 thread를 가지게 되므로 자원 및 관리 관점에서 비효율적이다.” (Each connection thread polls independently, and cub_server theoretically holds at least max_clients × 2 threads, which is inefficient from both a resource and management perspective.)
“동시성이 점차 높아질수록 각각이 core의 mutex를 잡고 job을 할당 받으려고 하므로 이 contention은 CPU가 idle에 있게 하는 주요 병목 지점이 된다.” (As concurrency rises, each thread contends for a core’s mutex to be assigned a job; this contention is the main bottleneck that keeps the CPU idle.)
The resulting goals were:
- Replace per-connection polling with a small bounded set of epoll-driven connection workers — eliminate excessive poll() calls (Acceptance Criterion 1 of CBRD-26177).
- Make throughput monotonic in concurrency — additional clients should not degrade the rate (Acceptance Criterion 2).
- Add admission-style backpressure inside each worker (CBRD-26392) so a single fat connection cannot starve its peers.
- Add load-aware placement and dynamic resizing (CBRD-26406, CBRD-26407, CBRD-26424) so the engine self-tunes between idle and saturated regimes.
- Strip atomics off the monitoring hot path (CBRD-26191).
CBRD-26177 also issued a hard directive that shaped every subsequent ticket and shapes this document:
“connection worker는 매우 동시성이 높은 hot-path이므로 perfmon 계열의 모니터링 코드를 추가해서는 안된다. 심각한 성능 저하를 일으킬 수 있다.” (The connection worker is a very high-concurrency hot path, so perfmon-class monitoring code must not be added. It can cause serious performance degradation.)
This is the single most important constraint to keep in mind when
reading the source: anything that smells like a global atomic
counter or a perfmon_inc_stat() call on the worker tick is a
regression.
CUBRID’s Approach
The redesign is best understood as three figures, mirroring the diagram pages of the EPIC: the AS-IS baseline, the TO-BE state after CBRD-26212/26255, and the post-CBRD-26407 state after the coordinator is added.
Architecture diagrams
AS-IS (legacy). Each accepted client got a dedicated polling
thread. Each polling thread, on every iteration, would push a
task into the shared cubthread::worker_pool of size
max_clients. The push acquired a per-core mutex; with hundreds
of polling threads the mutex was contended on every dispatch.
```mermaid
flowchart LR
    subgraph "Front (legacy) — N == active clients"
        p1["polling thread 1<br/>poll(fd1)"]
        p2["polling thread 2<br/>poll(fd2)"]
        pN["polling thread N<br/>poll(fdN)"]
    end
    subgraph "Back (legacy) — task workers (size = max_clients)"
        direction TB
        M["per-core mutex<br/>(shared dispatch)"]
        W1["worker 1"]
        W2["worker 2"]
        WK["worker K"]
    end
    p1 --> M
    p2 --> M
    pN --> M
    M --> W1
    M --> W2
    M --> WK
```
TO-BE (CBRD-26212 + CBRD-26255). A small bounded set of
connection_worker threads each runs an epoll_wait loop with
edge-triggered I/O over many client sockets. Each connection
worker is CPU-pinned. When a complete request arrives, the
connection worker calls css_push_server_task into the back-end
task pool. The number of connection workers is controlled by
min_connection_worker/max_connection_worker; the task pool is
sized by task_group × task_worker.
```mermaid
flowchart LR
    subgraph "Front (TO-BE) — bounded epoll workers"
        cw1["connection_worker 0<br/>epoll_wait()"]
        cw2["connection_worker 1<br/>epoll_wait()"]
        cwM["connection_worker M-1<br/>epoll_wait()"]
    end
    subgraph "Back — task workers (task_group × task_worker)"
        direction TB
        G0["group 0<br/>workers"]
        G1["group 1<br/>workers"]
        GG["group g-1<br/>workers"]
    end
    client1 -.fd.- cw1
    client2 -.fd.- cw1
    client3 -.fd.- cw2
    clientK -.fd.- cwM
    cw1 -- "css_push_server_task(idx)" --> G0
    cw2 --> G1
    cwM --> GG
```
Post-CBRD-26407 (coordinator + freelist). A single
coordinator thread, pinned to core 0, owns placement
(new-client → worker), rebalancing (move existing connections
between workers when load skews), and auto-scaling
(hibernate/awaken workers within min..max). Workers send
statistics to it on a slow timer; the coordinator broadcasts
control messages back. Inside each worker, contexts are claimed
from a per-pool freelist instead of new/delete-allocated each
time.
```mermaid
flowchart LR
    C["coordinator<br/>(pinned, core 0)"]
    subgraph "Connection workers (current = 4..max)"
        cw0["worker 0"]
        cw1["worker 1"]
        cwN["worker N-1"]
    end
    FL["pool::freelist<br/>(context cache)"]
    TP["task worker pool<br/>(task_group × task_worker)"]
    CTRL["controller socket<br/>(/tmp/cub_server_∗_coordinator.sock)"]
    C -- "NEW_CLIENT / HANDOFF / HIBERNATE / AWAKEN" --> cw0
    C -- "..." --> cw1
    C -- "..." --> cwN
    cw0 -- "STATISTICS / RETURN_TO_POOL / HANDOFF_REPLY" --> C
    cw0 -- "claim_context / retire_context" --> FL
    cw1 --> FL
    cwN --> FL
    cw0 -- "css_push_server_task" --> TP
    cw1 --> TP
    cwN --> TP
    CTRL -.SHOW_STATS / SCALE_UP / SCALE_DOWN / CLIENT_MOVE.- C
```
Connection worker (CBRD-26212)
The connection worker is implemented as cubconn::connection::worker
in connection_worker.{cpp,hpp}. It owns:
- a Linux epoll instance (cubsocket::epoll m_events);
- two file descriptors registered into that epoll: an eventfd (m_eventfd) for inter-thread wakeups and a timerfd (m_timerfd) for periodic work (hibernation check, statistics push, HA close-all);
- two per-worker message queues (IMMEDIATE, LAZY) implemented with tbb::concurrent_queue<message> and an atomic size counter;
- the live set of context * it owns (m_context), and a deferred removal queue (m_removed_context);
- two budget knobs (m_recv_budget, m_send_budget) and an exhausted-context map (m_exhausted);
- an atomic-free statistics::metrics<statistics::worker> m_stats for self-reporting to the coordinator.
The constructor wires the epoll, registers the eventfd/timerfd, installs three timer handlers, and spawns the worker thread:
```cpp
// worker::worker — src/connection/connection_worker.cpp
m_recv_budget = static_cast<size_t> (prm_get_integer_value (PRM_ID_CSS_RECV_BUDGET_PER_CONNECTION));
m_send_budget = static_cast<size_t> (prm_get_integer_value (PRM_ID_CSS_SEND_BUDGET_PER_CONNECTION));
m_exhausted.reserve (128);

m_eventfd = eventfd (0, EFD_NONBLOCK | EFD_CLOEXEC);
m_timerfd = timerfd_create (CLOCK_MONOTONIC, TFD_NONBLOCK | TFD_CLOEXEC);
// ... eventfd_register both into m_events ...

eventfd_addtimer (timer_type::HIBERNATE, timer_latency::MEDIUM_LATENCY, &worker::hibernate_check);
eventfd_addtimer (timer_type::STATISTICS, timer_latency::MEDIUM_LATENCY, &worker::statistics_metrics_to_coordinator);
eventfd_addtimer (timer_type::HA, timer_latency::HIGH_LATENCY, &worker::ha_close_all_connections);

m_thread = std::thread (&worker::attach, this);
```

worker::attach is the thread entry point; it calls
initialize → run → finalize. initialize pins the thread to its
assigned core via os::resources::cpu::setaffinity (m_core),
claims a cubthread::entry, and sets the thread name to
"connections" (a name that, as we shall see, leaks into the
task pool in CBRD-26617).
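The pinning step itself is ordinary Linux affinity plumbing. A sketch of what os::resources::cpu::setaffinity presumably reduces to (an assumption, Linux-only; the function name here is hypothetical):

```cpp
#include <cassert>
#include <pthread.h>
#include <sched.h>

// Bind the calling thread to one logical core so the epoll loop keeps
// its cache working set and never migrates mid-tick.
bool pin_self_to_core (int core)
{
  cpu_set_t set;
  CPU_ZERO (&set);
  CPU_SET (core, &set);
  return pthread_setaffinity_np (pthread_self (), sizeof set, &set) == 0;
}
```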
The main loop is the textbook reactor:
```cpp
// worker::run — src/connection/connection_worker.cpp
while (!m_stop)
  {
    nfds = m_events.wait (events.data (), events.size (),
                          m_exhausted.empty () ? TIMEOUT_INFINITE : TIMEOUT_NOWAIT);
    // ...
    for (i = 0; i < nfds; i++)
      {
        ctx = reinterpret_cast<context *> (events[i].data.ptr);
        if ((events[i].events & (EPOLLHUP | EPOLLRDHUP | EPOLLERR)) && ...)
          {
            this->handle_hangup_or_error (ctx, events[i].events & EPOLLERR);
            continue;
          }
        if (events[i].events & EPOLLIN)
          {
            if (ctx->m_conn->fd == m_eventfd)
              {
                eventfds[0] = true;
                continue;
              }
            if (ctx->m_conn->fd == m_timerfd)
              {
                eventfds[1] = true;
                continue;
              }
            status = this->handle_reception (ctx, false);
            // ...
          }
        if (events[i].events & EPOLLOUT)
          {
            status = this->handle_transmission (ctx, false);
          }
      }

    if (m_exhausted.size () > 0)
      {
        handle_exhausted ();
      }
    if (eventfds[0] || eventfds[1])
      {
        eventfd_handler (eventfds);
      }
  }
```

Note the timeout switch: when there are exhausted contexts to
re-drive (see Send/recv budgets below) the loop polls with
TIMEOUT_NOWAIT so it can immediately revisit them, otherwise it
blocks indefinitely on epoll_wait. The eventfd is the single
inter-thread doorbell — any outside producer (the coordinator,
another connection worker handing off, a task worker returning a
buffer) writes 1 into m_eventfd and the worker drains its
in-process queue once the loop wakes.
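The doorbell idiom itself can be reduced to two calls (illustrative names, not CUBRID's wrappers). A useful property of eventfd is that it coalesces: many rings between wakeups cost the worker a single read:

```cpp
#include <cassert>
#include <cstdint>
#include <sys/eventfd.h>
#include <unistd.h>

// Producer side: any thread may ring; writes add to the kernel counter.
void ring_doorbell (int efd)
{
  uint64_t one = 1;
  ssize_t n = write (efd, &one, sizeof one);
  (void) n;   // a full counter is effectively "already rung"
}

// Consumer side: one read clears the counter and reports how many rings
// arrived since the last read (0 if nobody rang, with EFD_NONBLOCK).
uint64_t clear_doorbell (int efd)
{
  uint64_t count = 0;
  if (read (efd, &count, sizeof count) != sizeof count)
    return 0;
  return count;
}
```

After `clear_doorbell` returns non-zero, the worker drains its whole in-process queue, so missing the exact ring count is harmless.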
The connection::context (connection_context.hpp) is the
per-client object the worker owns. It contains the
css_conn_entry *m_conn, a worker index, a unique 64-bit id, a
receive state machine (HEADER → DATA → ERROR), the receiver and
transmitter, and an inline statistics::metrics<statistics::context>.
A complete request (header + optional data) is parsed inside
worker::handle_reception → handle_packet → handle_header_packet
or handle_data_packet, and the task push into the back-end pool
happens at handle_command_header_packet (when the request has no
following data) or handle_data_packet (after the data arrives):
```cpp
// worker::push_task_into_worker_pool — src/connection/connection_worker.cpp
void
worker::push_task_into_worker_pool (context *ctx)
{
  /* push new task into worker pool */
  css_push_server_task (*ctx->m_conn);
}
```

That single call is the entire interface between the new front
and the legacy back. css_push_server_task (in
server_support.c) wraps the connection in a css_server_task
and routes it to the cubthread worker pool with
push_task_on_core (..., conn_ref.idx, conn_ref.in_method) — the
core hash being the connection index, exactly as in the legacy
design, so a long-running session keeps affinity for the same
back-end core.
Connection lifecycle (close path) is driven by
worker::handle_connection_close. It serialises against
ctx->m_conn->cmutex, drains any in-flight task workers via
net_server_active_workers, retries (re-enqueues a
SHUTDOWN_CLIENT on the LAZY queue) if back-end workers are
still active, and on success removes the fd from epoll, marks the
context m_removed = true, and pushes it into
m_removed_context. The actual context return to the pool is
deferred to purge_stale_contexts, which sends a single
RETURN_TO_POOL message to the coordinator with the batched
list — so the freelist is touched once per loop tick, not once
per closed connection.
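The batching idea (accumulate closed contexts during the tick, then return them in one message) can be sketched as follows; the type and names are hypothetical, standing in for m_removed_context plus the RETURN_TO_POOL send:

```cpp
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Closed contexts are recorded by id during the loop tick and handed
// back in one batch, so the coordinator/freelist round-trip happens
// once per tick rather than once per closed connection.
struct removal_batch
{
  std::vector<uint64_t> ids;   // contexts closed during this tick

  void defer (uint64_t ctx_id)
  {
    ids.push_back (ctx_id);
  }

  // End of tick: the real code would wrap the returned list in a single
  // RETURN_TO_POOL message to the coordinator.
  std::vector<uint64_t> flush ()
  {
    return std::exchange (ids, {});
  }
};
```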
Connection pool (CBRD-26255)
The pool (cubconn::connection::pool in
connection_pool.{cpp,hpp}) is the owner of workers, coordinator,
and context freelist. It exists for the lifetime of the server
and is held by cub_server as a single instance.
The freelist itself is a singly-linked stack of pool::freelist
nodes, each of which embeds the actual context as its first
member so that reinterpret_cast<freelist *> (ctx) recovers the
node. The trick replaces the legacy “new context per connection”
allocation pattern:
```cpp
// pool::freelist — src/connection/connection_pool.hpp
struct freelist
{
  /* THIS MUST BE THE FIRST */
  context m_context;
  freelist *m_next;

  freelist (std::size_t capacity) : m_context (capacity), m_next (nullptr) {}
  ~freelist () = default;
};
```

```cpp
// pool::claim_context / retire_context — src/connection/connection_pool.cpp
context *
pool::claim_context ()
{
  freelist *head;
  assert (m_mutex_holder == std::this_thread::get_id ());

  head = m_freelist.m_head;
  if (head)
    {
      m_freelist.m_head = m_freelist.m_head->m_next;
    }
  else
    {
      head = new freelist (32 * 1024);
    }
  m_freelist.m_claim++;

  return &head->m_context;
}

void
pool::retire_context (context *ctx)
{
  freelist *head;
  // ...
  head = reinterpret_cast<freelist *> (ctx);
  head->m_context.reset ();
  if (m_freelist.m_claim > m_freelist.m_max)
    {
      delete head;   /* over-cap: actually free */
    }
  else
    {
      head->m_next = m_freelist.m_head;
      m_freelist.m_head = head;
    }
  m_freelist.m_claim--;
}
```

The freelist is only manipulated by code holding pool::m_mutex.
The coordinator’s handle_message_queue_new_client (which calls
claim_context) and handle_message_queue_return_to_pool (which
calls retire_context) both run on the coordinator thread, and
the coordinator holds the pool lock for its entire lifetime
(see coordinator::initialize → m_parent->lock_resource ()).
This is the design choice that makes context allocation
single-threaded without ever needing per-context atomics.
pool::initialize is wired via pool::initialize_topology, which
maps the requested max_connection_workers onto an actual NUMA
core layout via os::resources::cpu::effective () and may
additionally serialise NIC RX/TX IRQ to those cores via
os::resources::net::map_nic_to_index (cores). CBRD-26255 also
provides this NIC-pinning, which is the source of the warning
log messages discussed in the ticket comments
(warning: NIC channel configuration failed) — they are
non-fatal, surfacing only when the binary lacks CAP_NET_ADMIN
or runs in a virtualised environment.
The shutdown sequence uses a thread_watcher (a bare condvar
plus int active) to count down workers as they exit, and
pool::finalize_workers waits up to css_get_shutdown_timeout()
for m_watcher->active == 1 (only the coordinator left), then
pool::finalize_coordinator waits for active == 0. Failure
to reach those states triggers _exit(0) after a 10 s
try-lock loop in try_to_lock_resource — a deliberate hard
exit because the alternative is to wait forever for a thread
holding state nothing else can clean up.
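A minimal sketch of a thread_watcher-style countdown, assuming the bare condvar-plus-int shape the text describes (this is not the actual CUBRID struct):

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>

// Workers call enter() at startup and leave() on exit; the finalizer
// waits with a deadline for the active count to fall to a target
// (1 = only the coordinator left, 0 = everything gone).
struct thread_watcher
{
  std::mutex m;
  std::condition_variable cv;
  int active = 0;

  void enter ()
  {
    std::lock_guard<std::mutex> g (m);
    ++active;
  }

  void leave ()
  {
    {
      std::lock_guard<std::mutex> g (m);
      --active;
    }
    cv.notify_all ();
  }

  // Returns true if active dropped to target before the timeout;
  // false means someone is stuck and the hard-exit path takes over.
  bool wait_until_active (int target, std::chrono::milliseconds timeout)
  {
    std::unique_lock<std::mutex> lk (m);
    return cv.wait_for (lk, timeout, [&] { return active <= target; });
  }
};
```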
Send/recv budgets (CBRD-26392)
The budget mechanism is the single most subtle part of the
design. Without it, edge-triggered epoll plus a draining reader
would let a single client with backlog monopolise its worker:
once EPOLLIN fires, the reader is contractually obliged to
drain until EAGAIN; if the peer keeps writing, that drain
loop never returns. CBRD-26392 caps the drain per epoll tick.
Quoting the ticket directly:
“하나의 connection worker는 여러 connection들을 관리한다. 이때 하나의 긴 송수신을 수행하게 되면 다른 송수신들이 계속 blocked되며 response가 지연되게 된다. 이때 한 번에 송수신할 수 있는 양을 제한하여 전체 지연을 안정화한다.” (One connection worker manages many connections. If a single long send or receive runs, the other I/Os remain blocked and their response is delayed. Bound the amount that can be sent or received at once to stabilise the overall latency.)
Defaults: 16 KB receive, 32 KB send (see system_parameter.c).
Both can be set as low as 0 (no limit) or as high as 1 GB.
The implementation lives partly in receiver::drain /
transmitter::fill (their second argument is a size_t limit = 0
budget) and partly in
worker::handle_reception / worker::handle_transmission /
worker::handle_exhausted_add_context /
worker::handle_exhausted (connection_worker.cpp).
```cpp
// worker::handle_reception — src/connection/connection_worker.cpp
io_status = ctx->m_recv.m_receiver.drain (ctx->m_conn->fd, m_recv_budget);
if (io_status == result::PeerReset || io_status == result::Error)
  {
    /* close */
  }

assert (io_status == result::Pending || io_status == result::BudgetExhausted);

if (!in_exhausted && io_status == result::BudgetExhausted)
  {
    handle_exhausted_add_context (ctx, EPOLLIN);
  }

// worker::handle_transmission — src/connection/connection_worker.cpp
status = ctx->m_send.m_transmitter.fill (ctx->m_conn->fd, m_send_budget);
// ...
else if (!in_exhausted && status == result::BudgetExhausted)
  {
    handle_exhausted_add_context (ctx, EPOLLOUT);
  }
```

When a context exhausts its budget, it lands in
m_exhausted keyed by context id. The main loop notices the
non-empty exhausted map and switches epoll_wait to
TIMEOUT_NOWAIT, then re-drives those contexts via
handle_exhausted after serving the current epoll batch. The
prepared flag in exhausted_context is the deferral guard:
the first time a context is added it is marked !prepared and
skipped; only on the second visit does the worker re-drain it.
This ensures every other ready fd in the current epoll batch gets
serviced before the budget-exceeded context is revisited.
The flow control finite-state machine for one fd:
```mermaid
stateDiagram-v2
    [*] --> Idle
    Idle --> Reading : EPOLLIN \n handle_reception
    Reading --> Idle : drain Pending \n EAGAIN
    Reading --> Exhausted : drain BudgetExhausted \n add to m_exhausted, EPOLLIN
    Exhausted --> Reading : revisit on next loop \n prepared flag
    Idle --> Writing : EPOLLOUT \n handle_transmission
    Writing --> Idle : fill Ok
    Writing --> Exhausted : fill BudgetExhausted \n add to m_exhausted, EPOLLOUT
    Reading --> Closing : ClosedConnection or PeerReset
    Writing --> Closing : ClosedConnection or PeerReset
    Closing --> [*] : handle_connection_close
```
Note that result::BudgetExhausted is a distinct enum value from
result::Pending — the difference being that Pending means “the
kernel has no more bytes for me right now” (back-off naturally
until next epoll edge) while BudgetExhausted means “I have more
bytes available but I’m yielding voluntarily” (must come back
this loop or the next).
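The distinction can be made concrete with a budgeted drain over a non-blocking socket. This is a sketch, not receiver::drain; the enum here carries only the subset of outcomes relevant to the budget logic:

```cpp
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <sys/socket.h>
#include <unistd.h>

// Illustrative subset of the real result enum.
enum class result { Pending, BudgetExhausted, Error };

// Drain up to `budget` bytes (0 = unlimited). Pending means the kernel
// ran out of bytes (wait for the next edge); BudgetExhausted means we
// stopped voluntarily and must be revisited this loop or the next.
result drain_with_budget (int fd, char *buf, size_t buflen,
                          size_t budget, size_t &drained)
{
  drained = 0;
  while (budget == 0 || drained < budget)
    {
      size_t want = buflen;
      if (budget != 0 && budget - drained < want)
        want = budget - drained;
      ssize_t n = recv (fd, buf, want, 0);
      if (n > 0)
        {
          drained += static_cast<size_t> (n);
          continue;
        }
      if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK))
        return result::Pending;
      return result::Error;   // 0 (peer closed) or a real error
    }
  return result::BudgetExhausted;
}
```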
Auto scaling (CBRD-26406)
CBRD-26406 wires the mechanism for connection-rebalancing and
worker-count scaling; the policy lives in CBRD-26424
(score-based selection, below). The mechanism is simple in
shape: workers report statistics on a 1-second timer, the
coordinator’s 5-second REBALANCING timer compares per-worker
scores and asks the heaviest worker to hand off one of its
connections to the lightest, the coordinator’s 60-second
SCALING timer drives the auto-scaling state machine.
The scaling_status enum has only two states:
- STABLE — current count is “good enough”, no measurement in progress.
- TRIAL — sweep through count candidate sizes recording their throughput score, then pick the best.
At each SCALING tick:
```cpp
// coordinator::statistics_scaling — src/connection/coordinator.cpp
if (m_scaling_statistics.status == scaling_status::STABLE)
  {
    this->scale_trial ();
    return true;
  }

assert (m_scaling_statistics.status == scaling_status::TRIAL);

bytes_inout = 0;
for (i = 0; i < m_max_worker; i++)
  {
    bytes_inout += m_statistics[i].m_sum.get (statistics::context::BYTES_IN_TOTAL);
    bytes_inout += m_statistics[i].m_sum.get (statistics::context::BYTES_OUT_TOTAL);
  }
m_scaling_statistics.history.push_back ({ m_current_worker,
                                          VAL_TO_SCORE (50, 1000, bytes_inout) + m_task_statistics.completed.first * 2 });
m_scaling_statistics.count--;

if (m_scaling_statistics.count == 0)
  {
    selected = this->scale_selection ();   /* pick max-score scale */
    if (selected < m_current_worker)
      this->scale_down ();
    else if (selected > m_current_worker)
      this->scale_up ();
    /* else stable */
  }
else
  {
    if (m_scaling_statistics.direction == scaling_direction::DOWN)
      this->scale_down ();
    else
      this->scale_up ();
  }
```

scale_trial clears the history, alternates the trial direction
relative to the previous one (so consecutive trials don’t drift
uni-directionally), and sets count to the
auto_scaling_window_size parameter — the hyper-parameter that
trades trial length for sensitivity. The default of 4 means each
trial collects 4 samples (one per SCALING tick = 60 s) before
deciding.
Sliding-window mechanism:
```mermaid
sequenceDiagram
    participant T as SCALING timer (60s)
    participant C as coordinator
    participant H as history (window_size = 4)
    Note over C: status = STABLE
    T->>C: tick
    C->>C: scale_trial()
    Note over C: direction = DOWN (or UP)<br/>count = 4<br/>status = TRIAL
    loop count = 4
        T->>C: tick
        C->>H: push_back({ current_worker, score })
        C->>C: scale_down() or scale_up()
    end
    T->>C: tick
    C->>H: push_back({ current_worker, score })
    C->>C: selected = scale_selection()
    alt selected != current
        C->>C: scale_down() or scale_up() to reach selected
    end
    Note over C: status = STABLE again
```
scale_selection picks any sample within 95% of the maximum
score, then chooses uniformly among them — a small deliberate
randomisation to avoid getting stuck at a flat local maximum
(see CBRD-26424 commentary on the dual local maxima observed in
small-machine measurements).
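The selection rule can be sketched as follows. The shape is inferred from the description above, not copied from coordinator.cpp, and the `seed` parameter stands in for whatever randomness the real code uses:

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// history: (worker count, throughput score) samples from one trial.
// Keep every candidate within 95% of the best score, then pick one
// uniformly; near-ties are treated as equivalent so the choice does
// not lock onto a flat local maximum.
int select_scale (const std::vector<std::pair<int, double>> &history,
                  unsigned seed)
{
  double best = 0.0;
  for (const auto &h : history)
    if (h.second > best)
      best = h.second;

  std::vector<int> near_best;
  for (const auto &h : history)
    if (h.second >= best * 0.95)
      near_best.push_back (h.first);

  return near_best[seed % near_best.size ()];
}
```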
scale_up flips the next-in-line hibernating worker out of
HIBERNATING by sending an AWAKEN lazy message to it and
incrementing m_current_worker. scale_down does the reverse
in two phases: scale_down itself migrates every connection of
the draining worker via transfer_connection and parks the
coordinator status as DRAINING; scale_down_finish is the
actual hibernation, called from
handle_message_queue_statistics only once the draining worker
reports an empty context list. This two-phase shutdown is
necessary because worker shutdown is asynchronous and the
coordinator must not allow a worker to be re-targeted by
statistics_find_score_extremes while it is still serving
connections.
Coordinator + context freelist (CBRD-26407)
The coordinator (cubconn::connection::coordinator in
coordinator.{cpp,hpp}) is structurally the same shape as a
worker — pinned thread, epoll instance, eventfd + timerfd,
single-producer-single-consumer (TBB) queue — but it owns three
distinct timers and an external Unix-domain control socket.
```cpp
// coordinator::coordinator — src/connection/coordinator.cpp
m_controller.open ("/tmp/cub_server_" + std::to_string (getpid ()) + "_coordinator.sock",
                   SOCK_NONBLOCK | SOCK_CLOEXEC);
m_ctrlfd = m_controller.get_fd ();
m_eventfd = eventfd (0, EFD_NONBLOCK | EFD_CLOEXEC);
m_timerfd = timerfd_create (CLOCK_MONOTONIC, TFD_NONBLOCK | TFD_CLOEXEC);

eventfd_register (m_eventfd);
eventfd_register (m_timerfd);
eventfd_register (m_ctrlfd);

eventfd_addtimer (timer_type::STATISTICS, timer_latency::LOW_LATENCY, &coordinator::statistics_update);
eventfd_addtimer (timer_type::REBALANCING, timer_latency::MEDIUM_LATENCY, &coordinator::statistics_rebalancing);
eventfd_addtimer (timer_type::SCALING, timer_latency::HIGH_LATENCY, &coordinator::statistics_scaling);
```

The three timer latencies are 1 second / 5 seconds / 60 seconds
respectively (see the timer_latency enum in coordinator.hpp).
The control socket exposes administrative commands:
- SHOW_STATS — print per-worker EWMA throughput and queue depth (statistics_print) to stdout.
- SCALE_UP / SCALE_DOWN — force one step of the auto-scaling state machine.
- CLIENT_MOVE — manually transfer one connection by id from worker from to worker to.
This is an out-of-band debugging interface; nothing in the data
path uses it. Sending a control_recv struct via
SOCK_DGRAM/SOCK_NONBLOCK triggers a reply with a single
OK/NOK byte. The directive from CBRD-26177 (“no perfmon on
the hot path”) means there is no SHOW server-side equivalent
through the standard server channel — the controller is
intentionally a side door, not a performance counter.
coordinator::handle_message_queue_new_client is where the
placement policy lands. Note that it calls the same EWMA-driven
score-extremes function used by rebalancing:
```cpp
// coordinator::handle_message_queue_new_client — src/connection/coordinator.cpp
std::tie (worker, std::ignore) = statistics_find_score_extremes ();

m_statistics[worker].m_contexts.emplace (id, std::pair</*EWMA*/, /*prev*/>{ });
m_statistics[worker].m_client_num++;

request.type = connection::worker::message_type::NEW_CLIENT;
request.ctx = m_parent->claim_context ();
request.ctx->m_worker = worker;
request.ctx->m_id = id++;
request.conn = item.conn;
workers[worker]->enqueue (queue_type::IMMEDIATE, std::move (request));
workers[worker]->notify ();
this->statistics_update_score (worker);
```

— so every new client is immediately routed to the worker with the lowest current score, and that score is updated on the spot to bias the next placement.
The context migration protocol (used by both rebalancing and scale-down) is a four-step handshake between coordinator and two workers:
```mermaid
sequenceDiagram
    participant C as coordinator
    participant Wf as worker[from]
    participant Wt as worker[to]
    C->>C: m_migrating.insert(id)
    C->>Wf: HANDOFF_CLIENT(id, worker_ptr=Wt, worker_index=to)
    Wf->>Wf: locate ctx, remove from epoll/m_context
    Wf->>Wt: TAKEOVER_CLIENT(ctx)
    Wt->>Wt: register ctx in epoll (EPOLLIN | maybe EPOLLOUT)
    Wf->>C: HANDOFF_REPLY(transferred=true, id, from, to)
    C->>C: m_migrating.erase(id)<br/>fix m_statistics[from/to]
```
m_migrating prevents a connection from being targeted twice in
flight. If the worker discovers the context is already gone (the
client closed concurrently with the migration), the reply carries
transferred=false and the coordinator reverts the projected
stats. This is the single concurrency invariant the design
relies on: a context is only ever owned by exactly one
connection worker at a time, with ownership transferred via
explicit message. No locks are required around the context
itself — only the conn entry’s cmutex, briefly, for adapter
field updates.
The context freelist (described above under CBRD-26255) was finalised in this same ticket. The CBRD-26407 description states the goal directly:
“context는 생성마다 Physical Memory와 Virtual Memory를 할당받고 이를 mapping하므로 이 과정을 생략하도록 한다.” (Each context creation allocates physical and virtual memory and maps the two, so skip this process.)
By pre-warming the freelist with max_connections * 1.1
preallocated freelist nodes (each 32 KB capacity), the runtime
hot path is a pointer swap, not a mmap/page-fault sequence.
Score-based connection assignment (CBRD-26424)
The coordinator’s score function combines three signals into a single comparable scalar per worker:
```cpp
// coordinator::statistics_update_score — src/connection/coordinator.cpp
m_statistics[worker].m_score =
    1 * static_cast<double> (m_statistics[worker].m_client_num) / 1
    + EVAL_WORKER (EWMA (MQ_COMPLETED), EWMA (BLOCKED_RMUTEX))
    + EVAL_CONTEXT (EWMA (BYTES_IN_TOTAL) + EWMA (BYTES_OUT_TOTAL),
                    EWMA (RECV_BUDGET_HIT) + EWMA (SEND_BUDGET_HIT));
```

with the weight macros

```cpp
#define VAL_TO_SCORE(w, m, s) ((w) * static_cast<double> (s) / (m))
#define EVAL_WORKER(mq, rmutex) (VAL_TO_SCORE (25, 3.5, (mq)) + VAL_TO_SCORE (500, 1, (rmutex)))
#define EVAL_CONTEXT(bytes, bgt) (VAL_TO_SCORE (50, 1000, (bytes)) + VAL_TO_SCORE (10, 1, (bgt)))
```

Concretely the weights mean: bytes-of-traffic count for 50 ×
1/1000 (≈ 1 unit per kilobyte); rmutex blocked microseconds count
for 500 × 1 (≈ 500 units per microsecond blocked); MQ completions
count for 25 × 1/3.5 (≈ 7 units per completion). Budget-hit
events (i.e., contexts that hit the recv/send budget cap) are
weighted at 10 — because a high budget-hit count means the worker
is repeatedly running into its admission cap and would benefit
from an extra peer to share load. CBRD-26424’s commentary
explains the dual local maxima visible in measured throughput
curves: small machines exhibit a non-monotonic relationship
between worker count and throughput because of NUMA / RX-TX /
HT-sibling interactions, and a naïve hill-climber gets stuck.
The randomised top-5% selection in scale_selection is the
escape hatch.
EWMA aggregation uses α = 0.06 (EWMA_ALPHA):
```cpp
// coordinator::statistics_EWMA — src/connection/coordinator.hpp
acc = acc * (1 - alpha) + (current - prev) * (alpha / (time_delta * 1e-6));
prev = current;
```

The division by time_delta * 1e-6 normalises to microseconds, so
the EWMA is a smoothed rate (events per microsecond) rather
than a raw delta. With α = 0.06 and a 1 s sampling interval the
effective half-life is roughly 11 samples (≈ 11 s); aged samples
contribute less than 1 % after roughly 75 seconds ((0.94)^75 ≈ 0.0097).
Atomic-free stats (CBRD-26191)
The statistics::metrics<T, VT = uint64_t> template
(connection_statistics.hpp) is a fixed-size VT[STATS_COUNT]
with add / sub / get / set / reset operations. There is
no std::atomic anywhere — every increment is a plain memory
write, because every increment is performed by exactly one
thread (the worker that owns the metric). Aggregation across
workers happens once per second, when the worker copies its
metric block into a coordinator::message::statistics payload
and the coordinator does a per-worker EWMA update inside its own
single-threaded handler:
```cpp
// worker::statistics_metrics_to_coordinator — src/connection/connection_worker.cpp
message.type = coordinator::message_type::STATISTICS;
message.statistics.cpu_time_ns = get_time_ns (CLOCK_THREAD_CPUTIME_ID);
message.statistics.time_ns = get_time_ns (CLOCK_MONOTONIC);
message.statistics.worker.first = m_index;
message.statistics.worker.second = m_stats; /* copy */
message.statistics.contexts.reserve (m_context.size ());
for (context *ctx : m_context)
  message.statistics.contexts.emplace_back (ctx->m_id, ctx->m_stats); /* copy */
m_coordinator->enqueue (std::move (message));
```

The bulk copy is cheap because m_stats is a fixed array (≈ 88
bytes) and the per-context array is at most a few hundred entries
of 56 bytes each. The snapshot travels by value through the
single-producer-single-consumer queue, so the coordinator only ever
reads the copy, never a cache line that the worker is concurrently
writing to. Crucially,
this design exists to uphold the CBRD-26177 directive
(“no perfmon on the hot path”): the worker never increments a
shared counter, never spins on a lock, never executes a memory
barrier in the dispatch loop.
CBRD-26191 measured the wider goal — strip atomics from server-wide monitoring — on YCSB:
| workload | before | after | gain |
|---|---|---|---|
| workloada | 58 464.28 | 60 646.59 | +3.7% |
| workloadb | 70 009.99 | 72 976.31 | +4.2% |
| update | 44 158.66 | 45 128.96 | +2.2% |
| mix | 9 440.82 | 10 115.33 | +7.1% |
The connection-side metrics design follows the same template at the new layer.
TCP keepalive tunables
CBRD-26177 promised three new per-socket keepalive parameters:
tcp_keepalive_idle (start probing after N seconds idle),
tcp_keepalive_interval (interval between probes),
tcp_keepalive_count (consecutive failures = dead). The defaults
are 300 s / 300 s / 3, capped at one year (expressed in seconds). They
are registered in system_parameter.c alongside the existing
tcp_keepalive boolean and are intended to be applied by the
socket-setup helper (tcp.c::css_sockopt) which already calls
setsockopt (SOL_SOCKET, SO_KEEPALIVE, ...) when tcp_keepalive
is set; the three new knobs feed TCP_KEEPIDLE, TCP_KEEPINTVL,
TCP_KEEPCNT respectively for fine-grained tuning of dead-peer
detection. The CUBRIDMAN-333 manual update covers the
documentation rollout.
Task worker rework — task_group and task_worker
The back-end pool is still cubthread::worker_pool in
thread_worker_pool_impl.{hpp,cpp}. Its sizing is now controlled
by two parameters that replace the legacy
thread_core_count/thread_worker_count pair:
- task_group (renamed from thread_core_count) — number of cores in the worker pool. Each “core” in CUBRID terminology is a sub-pool with its own queue, owned by one worker_pool::core.
- task_worker — total number of worker threads across all groups. Default at server startup: css_get_max_connections () (i.e., effectively the legacy max_clients), normalised down if it exceeds the system core count.
The auto-tuning code clamps task_group ≤ system core count and
task_group ≤ task_worker (system_parameter.c boot
sysprm tuning block):
```c
/* sysprm_tune_client_parameters — src/base/system_parameter.c */
task_worker_prm = GET_PRM (PRM_ID_TASK_WORKER);
if (PRM_GET_INT (task_worker_prm->value) < 0)
  {
    /* the value of task worker is default. */
    sprintf (newval, "%d", task_worker);	/* css_get_max_connections() */
    (void) prm_set (task_worker_prm, newval, false);
  }

task_group_prm = GET_PRM (PRM_ID_TASK_GROUP);
if (PRM_GET_INT (task_group_prm->value) > system_cpu_count)
  {
    sprintf (newval, "%d", system_cpu_count);
    (void) prm_set (task_group_prm, newval, false);
  }
if (PRM_GET_INT (task_group_prm->value) > PRM_GET_INT (task_worker_prm->value))
  {
    sprintf (newval, "%d", PRM_GET_INT (task_worker_prm->value));
    (void) prm_set (task_group_prm, newval, false);
  }
```

The semantic shift is that task_worker is now interpreted as the
total worker budget and task_group controls partitioning. The
legacy thread_core_count was loosely “number of cores” with no
policy; the new naming makes the intent explicit, and the
coordinator’s task-completion EWMA (m_task_statistics.completed)
uses css_get_task_stats from server_support.c to read the pool’s
running totals into the score.
CBRD-26636 (“[Performance experiment] Performance trend by worker count”) found
that task_worker ≈ 4–6 × cores consistently outperformed
task_worker = max_clients on read-heavy YCSB workloads, but at
the cost of a deadlock risk when task_worker < max_clients and
many workers wait on a long lock. That risk motivates CBRD-26662
(see Cross-check Notes).
Source Walkthrough
Symbols are grouped by subsystem. CBRD-* annotations attribute each symbol to its driving ticket where one is identifiable.
epoll wrapper (CBRD-26212)
- cubsocket::epoll (class, src/base/epoll.hpp) — RAII wrapper over epoll_create1 / epoll_ctl / epoll_wait. Constructor opens an EPOLL_CLOEXEC instance; destructor closes it.
- cubsocket::epoll::wait — thin shim over epoll_wait.
- cubsocket::epoll::add_descriptor — EPOLL_CTL_ADD with optional void *ptr payload (used to thread context pointers through events[i].data.ptr).
- cubsocket::epoll::modify_descriptor — EPOLL_CTL_MOD, used to add/remove EPOLLOUT when the transmitter queues pending data.
- cubsocket::epoll::remove_descriptor — EPOLL_CTL_DEL.
- cubsocket::nonblocking (parent class, nonblocking.hpp) — defines the result enum (Ok, Pending, BudgetExhausted, PeerReset, Error, ClosedConnection, Skewed, Aborted) that every receiver/transmitter/worker call returns.
connection::worker (CBRD-26212 / 26392 / 26406 / 26407 / 26617)
- cubconn::connection::worker — class definition in connection_worker.hpp. Members include m_parent (pool), m_coordinator, m_watcher, the per-thread state (m_thread, m_core, m_status, m_stop, m_entry), the context set (m_context, m_removed_context), the epoll (m_events), the eventfd/timerfd (m_eventfd, m_timerfd), the timer table (m_timer_handler), the dual-priority message queues (m_queue[IMMEDIATE/LAZY], m_queue_size[]), the budget knobs and exhausted map (m_recv_budget, m_send_budget, m_exhausted), and the worker-side metrics (m_stats).
- worker::worker — constructor; reads system parameters, installs three timers, spawns the thread.
- worker::attach — thread entry; calls initialize → run → finalize.
- worker::initialize — sets affinity, claims thread entry, sets pthread name "connections" (the name leak CBRD-26617 caught).
- worker::run — main reactor loop.
- worker::finalize — drain still-open contexts, retire thread entry, signal watcher.
- worker::enqueue / worker::notify / worker::enqueue_and_notify — outside-thread interface.
- worker::push_task_into_worker_pool — single-line bridge to css_push_server_task (the back-end pool).
- worker::handle_reception / worker::handle_transmission — per-fd I/O drivers; honour m_recv_budget / m_send_budget and emit BudgetExhausted. (CBRD-26392)
- worker::handle_exhausted_add_context / worker::handle_exhausted — exhausted-fd revisitation queue. (CBRD-26392)
- worker::handle_message_queue_new_client — bind a fresh context to a fd; register in epoll with EPOLLET|EPOLLIN|EPOLLRDHUP.
- worker::handle_message_queue_handoff_client / worker::handle_message_queue_takeover_client — the two halves of the migration handshake. (CBRD-26406 / CBRD-26407)
- worker::handle_message_queue_send_packet / worker::handle_message_queue_release_packet — task workers shipping bytes back to a connection use these messages instead of writing the socket directly. Sending may add EPOLLOUT to the fd if the transmitter buffers data.
- worker::handle_message_queue_shutdown_client — close-connection request from outside; calls handle_connection_close.
- worker::handle_message_queue_hibernate / worker::handle_message_queue_awaken — auto-scaling state transitions.
- worker::handle_connection_close — six-step close protocol with retry-via-LAZY-queue when back-end task workers still hold the conn.
- worker::statistics_metrics_to_coordinator — every MEDIUM tick (1 s default), copy m_stats plus per-context metrics into a coordinator::message::STATISTICS. (CBRD-26191)
- worker::hibernate_check — every MEDIUM tick, if status is HIBERNATING and m_context.empty (), stop the timer.
- worker::ha_close_all_connections — every HIGH tick, if css_ha_server_state () == HA_SERVER_STATE_TO_BE_STANDBY, forcibly close all idle connections — the HA mode-change path that interacts with CBRD-26523.
connection::pool (CBRD-26255 / 26407)
- cubconn::connection::pool::freelist — the singly-linked context cache node.
- pool::initialize / pool::finalize — top-level bring-up / tear-down; called by the executable wire-up.
- pool::initialize_topology — interrogates os::resources::cpu::effective () and (where capable) os::resources::net::map_nic_to_index ().
- pool::initialize_freelist — pre-allocate max_connections * 1.1 freelist nodes.
- pool::initialize_workers — create max_connection_workers pinned workers and pre-warm by sending each a START message on both queues.
- pool::initialize_coordinator / pool::start_coordinator / pool::finalize_coordinator — coordinator lifecycle.
- pool::dispatch — accept hand-off; called by master_connector once a TCP connection has completed the CUBRID handshake. Sends a NEW_CLIENT to the coordinator.
- pool::claim_context / pool::retire_context — freelist API; require m_mutex held by the calling thread.
- pool::lock_resource / pool::release_resource / pool::try_to_lock_resource — the pool-wide mutex used by the coordinator for the duration of its lifetime.
connection::coordinator (CBRD-26406 / 26407 / 26424)
- cubconn::connection::coordinator — class definition in coordinator.hpp. Members include m_parent, m_watcher, the controller (Unix-domain socket m_controller, m_ctrlfd), the message queue (m_queue, m_queue_size), the worker count tracking (m_max_worker, m_min_worker, m_current_worker), the migration-in-flight set (m_migrating), the scaling bookkeeping (m_scaling, m_scaling_statistics), and per-worker statistics (m_statistics).
- coordinator::coordinator — opens the controller socket, registers fds into epoll, installs three timers, spawns thread.
- coordinator::run — main reactor loop.
- coordinator::initialize — pin to core 0 (or the first effective core), claim thread entry, set name "coordinator", take the pool lock for life.
- coordinator::handle_message_queue_new_client — placement: pick min-score worker, allocate context, forward NEW_CLIENT. (CBRD-26424)
- coordinator::handle_message_queue_return_to_pool — bulk return from a worker's m_removed_context; clears per-context stats and calls pool::retire_context.
- coordinator::handle_message_queue_handoff_reply — finalise migration; revert stats on transferred=false.
- coordinator::handle_message_queue_statistics — per-worker stats arrival; runs EWMA update via statistics_update_connection, then statistics_update_score; if the reporting worker is the current draining_worker and reports empty contexts, calls scale_down_finish. (CBRD-26424)
- coordinator::handle_message_queue_shutdown — flip m_stop true.
- coordinator::transfer_connection — guarded by m_migrating; sends HANDOFF_CLIENT to the source worker.
- coordinator::scale_up — AWAKEN next worker, bump m_current_worker. (CBRD-26406)
- coordinator::scale_down / coordinator::scale_down_finish — drain target worker's connections, then HIBERNATE. (CBRD-26406)
- coordinator::scale_trial / coordinator::scale_selection / coordinator::statistics_scaling — the auto-scaling state machine. (CBRD-26406 / CBRD-26424)
- coordinator::statistics_rebalancing — every MEDIUM tick (5 s), find score extremes, transfer one context if the gap exceeds 20 % of the high score. (CBRD-26424)
- coordinator::statistics_EWMA — α = 0.06, microsecond-normalised, used for both worker and context metrics.
- coordinator::statistics_find_score_extremes — linear scan over m_statistics[0..m_current_worker) returning (min_index, max_index).
- coordinator::statistics_update_score — applies the EVAL_WORKER + EVAL_CONTEXT + client_num formula.
- coordinator::statistics_print — controller-driven console dump of per-worker score, EWMA, byte counts.
- coordinator::handle_controller / coordinator::handle_controller_request — dispatch the four control-socket commands.
connection::context, controller, statistics
- cubconn::connection::context — per-client state (worker index, id, ignore guard, recv state machine, receiver, transmitter, blocker shared_ptr, per-context metrics). 32 KB inline send/recv buffer.
- cubconn::connection::context::reset — reset for reuse via the freelist.
- cubconn::thread_watcher — mutex + cv + int active, used for ordered shutdown.
- cubconn::message_blocker — single-shot mutex + cv + bool done, used for blocking enqueue_and_notify callers.
- cubconn::connection::controller<RX,TX> — templated Unix-domain datagram socket wrapper (controller.hpp).
- cubconn::statistics::context / cubconn::statistics::worker — enums of metric keys (connection_statistics.hpp).
- cubconn::statistics::metrics<T,VT> — fixed-size array of counters; supports +=, - (returns metrics<T,double>), * (scaling), add, sub, get, set, reset, copy_from. No atomics. (CBRD-26191)
task worker pool changes
- cubthread::worker_pool (thread_worker_pool.hpp) — unchanged abstract interface.
- cubthread::worker_pool::core — now sized by task_group.
- cubthread::worker_pool::execute / execute_on_core — entry points called from css_push_server_task.
- cubthread::worker_pool_task_capper (thread_worker_pool_taskcap.{hpp,cpp}) — the legacy admission-cap wrapper retained for HA daemons; m_tasks_available = m_max_tasks = worker_pool->get_worker_count ().
- css_push_server_task (server_support.c) — the hot-path handoff; partitions by static_cast<size_t> (conn_ref.idx) so a connection always lands on the same task-pool core.
- css_get_task_stats (server_support.c) — fills stats[3] = { requested, started, completed } from the pool's internal counters; consumed by coordinator::statistics_update_task.
system parameters
- PRM_ID_TCP_KEEPALIVE_IDLE / PRM_ID_TCP_KEEPALIVE_INTERVAL / PRM_ID_TCP_KEEPALIVE_COUNT — keepalive tunables.
- PRM_ID_TASK_GROUP (renamed from thread_core_count).
- PRM_ID_TASK_WORKER.
- PRM_ID_CSS_MAX_CONNECTION_WORKER / PRM_ID_CSS_MIN_CONNECTION_WORKER.
- PRM_ID_CSS_AUTO_SCALING_WINDOW_SIZE.
- PRM_ID_CSS_RECV_BUDGET_PER_CONNECTION / PRM_ID_CSS_SEND_BUDGET_PER_CONNECTION.
Position hints (as of 2026-04-30)
| Symbol | File | Line |
|---|---|---|
cubsocket::epoll (class) | src/base/epoll.hpp | 42 |
cubsocket::epoll::epoll | src/base/epoll.cpp | 37 |
cubsocket::epoll::wait | src/base/epoll.cpp | 54 |
cubsocket::epoll::add_descriptor | src/base/epoll.cpp | 59 |
cubsocket::epoll::modify_descriptor | src/base/epoll.cpp | 80 |
cubsocket::epoll::remove_descriptor | src/base/epoll.cpp | 101 |
cubconn::connection::worker (class) | src/connection/connection_worker.hpp | 52 |
worker::message_type (enum) | src/connection/connection_worker.hpp | 106 |
worker::worker | src/connection/connection_worker.cpp | 75 |
worker::attach | src/connection/connection_worker.cpp | 2107 |
worker::initialize | src/connection/connection_worker.cpp | 1943 |
worker::finalize | src/connection/connection_worker.cpp | 1975 |
worker::run | src/connection/connection_worker.cpp | 2007 |
worker::enqueue | src/connection/connection_worker.cpp | 160 |
worker::notify | src/connection/connection_worker.cpp | 182 |
worker::enqueue_and_notify | src/connection/connection_worker.cpp | 218 |
worker::push_task_into_worker_pool | src/connection/connection_worker.cpp | 288 |
worker::purge_stale_contexts | src/connection/connection_worker.cpp | 294 |
worker::handle_connection_close | src/connection/connection_worker.cpp | 386 |
worker::statistics_metrics_to_coordinator | src/connection/connection_worker.cpp | 562 |
worker::hibernate_check | src/connection/connection_worker.cpp | 584 |
worker::ha_close_all_connections | src/connection/connection_worker.cpp | 606 |
worker::handle_message_queue_new_client | src/connection/connection_worker.cpp | 1016 |
worker::handle_message_queue_handoff_client | src/connection/connection_worker.cpp | 1079 |
worker::handle_message_queue_takeover_client | src/connection/connection_worker.cpp | 1160 |
worker::handle_message_queue_shutdown_client | src/connection/connection_worker.cpp | 1227 |
worker::handle_message_queue | src/connection/connection_worker.cpp | 1356 |
worker::handle_reception | src/connection/connection_worker.cpp | 1694 |
worker::handle_transmission | src/connection/connection_worker.cpp | 1782 |
worker::handle_exhausted_add_context | src/connection/connection_worker.cpp | 1837 |
worker::handle_exhausted | src/connection/connection_worker.cpp | 1854 |
cubconn::connection::pool (class) | src/connection/connection_pool.hpp | 39 |
pool::freelist | src/connection/connection_pool.hpp | 42 |
pool::initialize | src/connection/connection_pool.cpp | 62 |
pool::finalize | src/connection/connection_pool.cpp | 89 |
pool::dispatch | src/connection/connection_pool.cpp | 109 |
pool::claim_context | src/connection/connection_pool.cpp | 140 |
pool::retire_context | src/connection/connection_pool.cpp | 160 |
pool::initialize_freelist | src/connection/connection_pool.cpp | 213 |
pool::initialize_topology | src/connection/connection_pool.cpp | 249 |
pool::initialize_workers | src/connection/connection_pool.cpp | 269 |
pool::finalize_workers | src/connection/connection_pool.cpp | 314 |
pool::initialize_coordinator | src/connection/connection_pool.cpp | 353 |
pool::start_coordinator | src/connection/connection_pool.cpp | 376 |
cubconn::connection::coordinator (class) | src/connection/coordinator.hpp | 41 |
coordinator::coordinator | src/connection/coordinator.cpp | 57 |
coordinator::initialize | src/connection/coordinator.cpp | 1192 |
coordinator::run | src/connection/coordinator.cpp | 1240 |
coordinator::transfer_connection | src/connection/coordinator.cpp | 237 |
coordinator::scale_up | src/connection/coordinator.cpp | 281 |
coordinator::scale_down | src/connection/coordinator.cpp | 348 |
coordinator::scale_down_finish | src/connection/coordinator.cpp | 317 |
coordinator::scale_trial | src/connection/coordinator.cpp | 378 |
coordinator::scale_selection | src/connection/coordinator.cpp | 415 |
coordinator::statistics_find_score_extremes | src/connection/coordinator.cpp | 460 |
coordinator::statistics_update_score | src/connection/coordinator.cpp | 482 |
coordinator::statistics_update_connection | src/connection/coordinator.cpp | 502 |
coordinator::statistics_update_task | src/connection/coordinator.cpp | 545 |
coordinator::statistics_rebalancing | src/connection/coordinator.cpp | 586 |
coordinator::statistics_scaling | src/connection/coordinator.cpp | 629 |
coordinator::handle_message_queue_new_client | src/connection/coordinator.cpp | 934 |
coordinator::handle_message_queue_return_to_pool | src/connection/coordinator.cpp | 970 |
coordinator::handle_message_queue_handoff_reply | src/connection/coordinator.cpp | 992 |
coordinator::handle_message_queue_statistics | src/connection/coordinator.cpp | 1032 |
coordinator::handle_controller_request | src/connection/coordinator.cpp | 1110 |
cubconn::connection::context | src/connection/connection_context.hpp | 141 |
cubconn::statistics::metrics | src/connection/connection_statistics.hpp | 111 |
cubconn::connection::controller (template) | src/connection/controller.hpp | 43 |
cubthread::worker_pool | src/thread/thread_worker_pool.hpp | 54 |
cubthread::worker_pool_task_capper | src/thread/thread_worker_pool_taskcap.hpp | 30 |
css_push_server_task | src/connection/server_support.c | 2354 |
css_get_task_stats | src/connection/server_support.c | 2647 |
REGISTER_CONNECTION (macro) | src/thread/thread_manager.hpp | 496 |
PRM_ID_TCP_KEEPALIVE_IDLE (param row) | src/base/system_parameter.c | 5161 |
PRM_ID_TASK_WORKER (param row) | src/base/system_parameter.c | 5197 |
PRM_ID_CSS_MAX_CONNECTION_WORKER (param row) | src/base/system_parameter.c | 5209 |
PRM_ID_CSS_AUTO_SCALING_WINDOW_SIZE (param row) | src/base/system_parameter.c | 5243 |
PRM_ID_CSS_RECV_BUDGET_PER_CONNECTION (param row) | src/base/system_parameter.c | 5259 |
PRM_ID_CSS_SEND_BUDGET_PER_CONNECTION (param row) | src/base/system_parameter.c | 5271 |
Cross-check Notes
Sibling doc — cubrid-thread-worker-pool.md. The legacy
doc describes (a) css_master_thread accept loop, (b) one
polling thread per accepted connection, (c) the
cubthread::worker_pool and its core::worker machinery,
(d) css_push_server_task as the dispatch point. Of those,
(c) and (d) are still live and current. (a) is unchanged at
the master-thread accept layer, but the handover point is
now pool::dispatch (forwarding NEW_CLIENT to the
coordinator) instead of “spawn a polling thread for this
fd”. (b) is replaced: any reference in the legacy doc to
“each connection has a thread” is no longer accurate.
Look-up symbols that moved domains:
- Polling/recv-loop logic in legacy was scattered across per-connection threads driven by css_internal_request_handler; it now lives in cubconn::connection::worker::handle_reception and friends.
- Connection-close protocol in legacy was a synchronous css_close_socket from the polling thread; it is now worker::handle_connection_close with retry-via-LAZY-queue and a separate freelist return phase.
- Stats in legacy were per-worker cubperf::stat_value arrays read with the worker pool's get_stats; on the connection side, those readings no longer exist as counters at all (CBRD-26177 directive). Use the coordinator's controller socket (SHOW_STATS) for diagnostics.
- Admission control in legacy was worker_pool_task_capper for HA daemons only; in NG, every connection worker enforces a per-tick byte budget. The capper class is still in tree but is not on the connection-worker path.
Sibling doc — cubrid-server-session.md. Server session
state lookup happens during request processing inside the
task worker (after css_push_server_task lands in the
back-end pool). The connection worker does not look up
sessions; it only parses the network protocol. The
session_p field on css_conn_entry is read on the task
side (see css_server_task::execute in server_support.c).
This is unchanged from the legacy doc and the redesign does
not move it.
Regressions tracked under the EPIC.
- CBRD-26586 — parallel query uses only one CPU after worker timeout. Root cause confirmed by Hong Yechan to be the interaction between thread_worker_timeout_seconds and affinity inheritance: when the connection worker creates a task worker (because the task pool let a thread expire), the new pthread inherits the connection worker's CPU affinity, pinning all back-end work to the connection worker's core. Fix: do not inherit affinity for newly-spawned task workers. Workaround until the fix lands: set thread_worker_timeout_seconds high so back-end threads are not recycled.
- CBRD-26617 — task worker thread name inherits "connections". Same mechanism (attribute inheritance from the spawning thread). Confusing in core dumps because the thread name is used to label core.<name>... files, so a task-worker crash produced core.connections.*. Fix: set the thread name when the task pool spawns a worker.
- CBRD-26544 — schema_type_str synonym enum coredump. Pre-existing on develop; surfaced under the new build because CCI's enum and its string array drifted out of sync. Fixed in the same merge window.
- CBRD-26523 — HA test cases cbrd_21506_02, cbrd_22705_02 fail. Diagnosed as a pre-existing HA timing bug (logwr/copylog interaction on tid:0 system commits) that the redesign exposes because the new connection structure speeds up state transitions. Not a redesign regression; rerouted to CBRD-26576 for the actual fix.
HA-shell test set after merge. CBRD-26255 comments
record a separate batch of HA shell-test failures
(bug_bts_5212, bug_bts_9047, cbrd_22207, cbrd_23854,
etc.) all attributed to timing changes — the redesign
genuinely is faster, and that exposes test scripts whose
sleeps and grep filters were calibrated to the legacy speed.
The fixes were a mix of test-script timing tweaks and one
genuine bug (-353 Resource temporarily unavailable under
ulimit -n constraint, fixed by raising the FD limit and
documenting the new minimum).
The CBRD-26177 “no perfmon” directive. Repeated here because it is the most likely thing to be broken by a future contributor:
“connection worker는 매우 동시성이 높은 hot-path이므로 perfmon 계열의 모니터링 코드를 추가해서는 안된다. 심각한 성능 저하를 일으킬 수 있다.” (The connection worker is a very high-concurrency hot path, so perfmon-style monitoring code must not be added to it; it can cause severe performance degradation.)
Practical implications when reading or editing the code:
- Do not add perfmon_inc_stat or any global atomic increment to worker::run, worker::handle_reception, worker::handle_transmission, worker::handle_packet, the message-queue handlers, or any of their callees.
- Do add metrics to statistics::metrics<> instances on the worker (they are private uint64_t[]); the coordinator already sums them.
- The controller socket (SHOW_STATS) is the supported read-out path; statistics_print is the renderer.
- Per-context counters belong on context::m_stats, and their aggregation via the coordinator's statistics_update_connection is already wired.
Pivot to CBRD-26662 — Logical-Wait-Aware Concurrency
Control. This redesign delivered “high throughput at high
concurrency” but exposed a follow-on weakness articulated in
CBRD-26636: when task_worker is sized aggressively low
(4–6 × cores) for throughput, lock waits on a few workers
can block the whole back-end. CBRD-26662 introduces a
slot abstraction — workers must hold a slot to be
“active”; a worker entering a logical wait (lock or
condition variable) returns its slot, freeing the slot for a
new worker — bounded above by high_concurrency, a
runtime-tunable. The plan is to retire task_group /
task_worker and replace them with high_concurrency. That
work is in progress; for now, treat task_worker and
task_group as the canonical knobs.
Open Questions
- Affinity-aware connection placement. The coordinator picks the minimum-score worker. When a connection is stateful (HA replication, CDC consumer, log-writer slave), is there value in pinning it to a fixed worker for the connection lifetime? The current transfer_connection will re-balance even long-lived sessions; the only opt-out is is_wait_required returning false for cdc_Gl.conn.fd in worker::is_wait_required. A first-class "affinity-pinned connection" flag would close the gap.
- HA replication's connection model. The connection worker honours HA_SERVER_STATE_TO_BE_STANDBY by force-closing non-active contexts (ha_close_all_connections). What happens during the opposite transition (standby → master), when a fresh batch of clients reconnects en masse and the coordinator has to allocate many contexts in a burst? The freelist is sized to max_connections * 1.1, so it should absorb the burst, but the coordinator is single-threaded on handle_message_queue_new_client. A concrete bound on the new-connection rate the coordinator can sustain has not been measured.
- Score-function weights. The macros EVAL_WORKER (25, 3.5, …) + (500, 1, …) and EVAL_CONTEXT (50, 1000, …) + (10, 1, …) are tuned constants. CBRD-26424 acknowledged this is empirical. What is the sensitivity surface? Could a runtime-tunable weight set obviate auto_scaling_window_size by letting operators bias the score toward latency or throughput?
- Verification gap from CBRD-26421. The task explicitly stated that connection-worker rebalancing and dynamic scaling are not covered by automated tests because the connection pool's internal state is not exposed through any user-visible interface. The controller socket is for debugging only. A read-only SHOW STATS SQL or DBA-RPC view would close the test gap.
- std::nothrow vs. STL exceptions (CBRD-26412). The ticket's resolution is essentially "we cannot guard exhaustively because STL throws and the codebase uses STL". Some hot-path allocations (pool::freelist (32 * 1024), m_context.reserve (256), m_exhausted.reserve (128)) still throw on OOM. What failure semantic should the operator expect — server crash, dropped connection, or graceful degradation? Today it is the first.
- Send/recv budget defaults. 16 KB / 32 KB are reasonable for OLTP but are likely small for bulk-load and CDC streaming. Is there a per-connection-class override path short of editing cubrid.conf?
Sources
Section titled “Sources”
Source paths
Section titled “Source paths”
- `src/connection/connection_worker.cpp` (≈ 58 KB)
- `src/connection/connection_worker.hpp` (≈ 10 KB)
- `src/connection/connection_pool.cpp` (≈ 10 KB)
- `src/connection/connection_pool.hpp` (≈ 3 KB)
- `src/connection/coordinator.cpp` (≈ 35 KB)
- `src/connection/coordinator.hpp` (≈ 10 KB)
- `src/connection/controller.hpp`
- `src/connection/connection_context.hpp`
- `src/connection/connection_statistics.hpp`
- `src/connection/connection_support.{cpp,hpp}`
- `src/connection/server_support.c` — `css_push_server_task` (line 2354), `css_get_task_stats` (line 2647)
- `src/connection/tcp.c` — `setsockopt SO_KEEPALIVE` (line 203)
- `src/base/epoll.{cpp,hpp}`
- `src/thread/thread_worker_pool.hpp` — abstract pool interface (line 54)
- `src/thread/thread_worker_pool_impl.{cpp,hpp}` — pool implementation
- `src/thread/thread_worker_pool_taskcap.{cpp,hpp}` — legacy admission cap
- `src/thread/thread_manager.hpp` — `REGISTER_CONNECTION` (line 496)
- `src/base/system_parameter.{c,h}` — param IDs and rows for `tcp_keepalive_*`, `task_group`, `task_worker`, `min/max_connection_worker`, `auto_scaling_window_size`, `recv/send_budget_per_connection`
- `src/executables/server.c` — `cubconn::connection::pool connections;` (line 557)
JIRA tickets
Section titled “JIRA tickets”- EPIC: http://jira.cubrid.org/browse/CBRD-26177
- Survey: http://jira.cubrid.org/browse/CBRD-26152
- POC: http://jira.cubrid.org/browse/CBRD-26212
- Pool redesign: http://jira.cubrid.org/browse/CBRD-26255
- Send/recv budgets: http://jira.cubrid.org/browse/CBRD-26392
- Rebalancing + auto scaling: http://jira.cubrid.org/browse/CBRD-26406
- Coordinator + freelist: http://jira.cubrid.org/browse/CBRD-26407
- Null-guard for `new`: http://jira.cubrid.org/browse/CBRD-26412
- Verification cases (postponed): http://jira.cubrid.org/browse/CBRD-26421
- Score formula: http://jira.cubrid.org/browse/CBRD-26424
- HA bugs: http://jira.cubrid.org/browse/CBRD-26523
- Synonym-enum coredump: http://jira.cubrid.org/browse/CBRD-26544
- Parallel-query CPU regression: http://jira.cubrid.org/browse/CBRD-26586
- Thread-name inheritance: http://jira.cubrid.org/browse/CBRD-26617
- Worker-count sweep: http://jira.cubrid.org/browse/CBRD-26636
- Atomic-free monitoring: http://jira.cubrid.org/browse/CBRD-26191
- Logical-Wait-Aware (follow-on EPIC): http://jira.cubrid.org/browse/CBRD-26662
- Manual update: http://jira.cubrid.org/browse/CUBRIDMAN-333
Textbook references
Section titled “Textbook references”- Silberschatz, Korth, Sudarshan. Database System Concepts, 6th ed. — Ch. 13 “Storage and File Structure” (buffer basics, framing of front-end vs back-end).
- Petrov, Alex. Database Internals (O’Reilly, 2019). §5.3 “Concurrent Execution” — pool sizing intuition, C10K framing.
- Stevens, W. Richard. UNIX Network Programming, Vol. 1, 3rd ed. — §16.5 “TCP Concurrent Server, One Child per Client” (the model the redesign moves away from).
- Pai, V., P. Druschel, W. Zwaenepoel. Flash: An Efficient and Portable Web Server. USENIX 1999. (event-driven asymmetric multi-process design — direct ancestor of the reactor pattern in this redesign).
- Welsh, M., D. Culler, E. Brewer. SEDA: An Architecture for Well-Conditioned, Scalable Internet Services. SOSP 2001. (admission control via bounded stage queues — the intellectual basis for `recv_budget_per_connection` / `send_budget_per_connection`).
- Linux kernel docs — `epoll(7)`, `eventfd(2)`, `timerfd_create(2)`. The `EPOLLET` (edge-triggered) semantics are mandatory background reading for anyone modifying `worker::run`.